This session is a tutorial that introduces the R language and some utilities that are commonly used to deal with survey data.

Getting help

Let’s say you want to calculate the mean of a series of numbers. You know that R provides a function called mean(), but you don’t know how to use it. The help() function will show you a page with a more or less detailed description of what a function does, and what arguments it accepts (i.e. what input it needs).

help(mean)
# or
?mean  # This is the shortest version

Sometimes you won’t find a direct link to the help page for a certain function. In such cases you can use the help.search() function.

help.search("mean")
# or
??mean  # This will return a list of help files containing the word "mean"

Finally, example(mean) runs the examples from the help page of a function, so you can see how the function is used in practice.

Objects

You can think of an object as a variable, to which you can assign values.

x <- 7

To see what an object contains you can simply “call” its name:

x
## [1] 7

Or use the “print” function:

print(x)
## [1] 7

Objects can contain character values

char <- "hello!"  # Note the quotation marks
char
## [1] "hello!"

Why do we need quotation marks? Because otherwise R will think that we are referring to another object:

char <- hello
## Error in eval(expr, envir, enclos): object 'hello' not found

Objects can contain multiple values at once, like with vectors

v <- c(1, 2, 10, 30, 10000)
v
## [1]     1     2    10    30 10000

or matrices

m <- matrix(c(1, 2, 10, 30), ncol = 2, nrow = 2)
m
##      [,1] [,2]
## [1,]    1   10
## [2,]    2   30

However, note that vectors and matrices cannot contain both numbers and character values at the same time. Because character values are more general (they are just “labels”), when you add a character to a numeric vector, all numbers will be regarded as characters as well:

v <- c(v, "f")
v  
## [1] "1"     "2"     "10"    "30"    "10000" "f"

You can tell that numbers are treated as characters because now they are surrounded by quotation marks.
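
A small sketch of how to check and undo this conversion. The object names mixed and digits are made up for illustration; class() and as.numeric() are base R functions.

```r
mixed <- c(1, 2, "f")        # one character value is enough...
class(mixed)                 # ...to make the whole vector "character"

digits <- c("1", "2", "10")  # numbers stored as characters
class(digits)                # "character"
as.numeric(digits) + 1       # converted back to numbers: 2 3 11
```

Note that as.numeric() can only convert labels that look like numbers; a value like "f" would become NA, with a warning.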

To select elements within objects, you need to use square brackets and specify which element you need. For instance, to see the first element of the vector v:

v[1]
## [1] "1"

According to the same logic you can select multiple elements

v[c(1, 3, 5)]
## [1] "1"     "10"    "10000"

For multidimensional objects, such as matrices, you need to specify the position of the element in every dimension: the first number refers to the row, the second to the column:

m[1, 2]  # Will pick the number in the first row and second column
## [1] 10

You can also select entire columns or rows

m[1, ]  # First row
## [1]  1 10
m[, 1]  # First column
## [1] 1 2
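
Note that matrix() fills the matrix column by column, as the output for m shows. The sketch below uses the byrow argument of matrix() to fill it row by row instead; the name m_by_row is made up for this example.

```r
m_by_row <- matrix(c(1, 2, 10, 30), nrow = 2, byrow = TRUE)
m_by_row[1, ]   # First row: 1 2
m_by_row[, 1]   # First column: 1 10
dim(m_by_row)   # Number of rows and columns: 2 2
```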

An important class of objects is the “data frame”. Data frames look like matrices, but rows are assumed to be observations and columns are assumed to be variables. This implies that different columns can contain different classes of objects:

dat <- data.frame(X1 = c(1, 2, 3, 10), 
                  X2 = c("a", "b", "c", "d"))
dat
##   X1 X2
## 1  1  a
## 2  2  b
## 3  3  c
## 4 10  d

Since columns are regarded as variables, we can see what variables are contained in a data frame by asking for the “names”:

names(dat)
## [1] "X1" "X2"

You can select variables in a data frame by using square brackets:

dat[, 1]  # Select the first column, AKA the first variable
## [1]  1  2  3 10

or by calling their name directly, using the dollar sign:

dat$X1
## [1]  1  2  3 10

Square brackets are very useful in data frames when you need to select some specific observations:

dat[1, ]  # Shows all the variables for the observation in the first row
##   X1 X2
## 1  1  a
dat[dat$X1 > 2, ]  # Shows all observations for which X1 is bigger than 2
##   X1 X2
## 3  3  c
## 4 10  d
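
Conditions can also be combined, and rows and columns can be selected at the same time. A minimal sketch, using a copy of the data frame above under the made-up name dat_demo:

```r
dat_demo <- data.frame(X1 = c(1, 2, 3, 10),
                       X2 = c("a", "b", "c", "d"),
                       stringsAsFactors = FALSE)

# Observations for which X1 is bigger than 1 AND X2 is not "d"
both <- dat_demo[dat_demo$X1 > 1 & dat_demo$X2 != "d", ]
both

# Values of X2 for the observations where X1 is bigger than 2
labels <- dat_demo[dat_demo$X1 > 2, "X2"]
labels
```

The & operator means “and”; use | for “or”.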

In the R console, you can see what objects are stored in the workspace (R’s memory) by typing:

ls()
## [1] "char" "dat"  "m"    "v"    "x"

or

objects()
## [1] "char" "dat"  "m"    "v"    "x"

However, RStudio makes your life easier by showing in the top-right panel the list of objects that are stored in your workspace, divided by categories (e.g. Data, Values), and with a brief description of their content.

Functions

Functions are a very important part of R: every operation that R performs is a function call. A function takes some arguments as input, performs a certain operation, and returns some output. Functions usually follow the form f(argument1, argument2, ...) and, of course, are typically stored in objects:

plusone <- function(x) {x + 1}
plusone(5)
## [1] 6

You can often see the code for a function by typing its name:

plusone
## function(x) {x + 1}

Even when they take only one argument, many functions can be applied to entire vectors or matrices: they simply take each entry as input, and return another vector or matrix where every number has passed through the function.

plusone(c(1, 3, 5, 99))
## [1]   2   4   6 100

Of course, functions can be way more complex than this, and return all different kinds of elaborations. However, the logic is always the same: you define what argument(s) a function accepts, you specify what the function does with those arguments, and the function will give you a result.
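
As an illustration of this logic, here is a sketch of a function with more than one argument, where two of the arguments have default values. The function rescale() is made up for this example and is not part of R.

```r
# Stretches a numeric vector so that it runs from "lower" to "upper"
rescale <- function(x, lower = 0, upper = 1) {
  (x - min(x)) / (max(x) - min(x)) * (upper - lower) + lower
}

rescale(c(2, 4, 6))               # Uses the defaults: 0.0 0.5 1.0
rescale(c(2, 4, 6), upper = 100)  # Overrides one default: 0 50 100
```

Arguments with a default value can be left out when you call the function; you only specify them when you want something different.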

Some useful functions

seq(from = 1, to = 5, by = 0.5)  # It returns a sequence of numbers from 1 to 5, with 0.5 increment
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
seq(from = 1, to = 2, length.out = 10)  # It returns a sequence of 10 numbers from 1 to 2
##  [1] 1.000000 1.111111 1.222222 1.333333 1.444444 1.555556 1.666667
##  [8] 1.777778 1.888889 2.000000
rep(1:3, 5)  # Repeats the sequence from 1 to 3 for five times
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
rep(1:3, each = 5)  # Repeats each number in the sequence from 1 to 3 for five times
##  [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
rep(c("a", "b", "c"), 5)  # Accepts character values as well
##  [1] "a" "b" "c" "a" "b" "c" "a" "b" "c" "a" "b" "c" "a" "b" "c"
sample(c(1:199), size = 2)  # It samples two numbers from the population "c(1:199)"
## [1]  94 158
sample(c(1:5), size = 6, replace = T)  # Samples with replacement
## [1] 1 4 5 1 5 5
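
Since sample() draws numbers at random, you will get different results every time you run it. If you need the same draw again (for instance, to make your analysis reproducible), you can fix the starting point of the random number generator with set.seed(). A small sketch, where the seed 123 is an arbitrary choice:

```r
set.seed(123)
draw1 <- sample(1:199, size = 2)

set.seed(123)                      # Same seed as before...
draw2 <- sample(1:199, size = 2)   # ...so the same two numbers are drawn

identical(draw1, draw2)            # TRUE
```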

Packages

While R can do many things, it cannot do all of them by itself. From the point of view of statistical analysis, base R is actually a relatively limited machine. However, a lot of things can be done, way more than with commercial software, by using “packages”. Generally speaking, a package is a set of functions, help files and data files that have been put together and are related to one another. R offers a lot of packages (right now the CRAN network has 8869 available packages!). Many of them are written by statisticians and let you apply cutting-edge techniques that in other contexts would require a lot of programming skills.

Even if you have just downloaded R, you already have a lot of packages installed on your computer. However, not all of them are loaded in your workspace: some packages take a lot of memory, so it is better to just “call” them when needed. Moreover, some packages use functions that have the same name as functions in other packages that do different things. In such a case R does not return an error: the package that was loaded last “masks” the function with the same name from the packages loaded earlier, and R prints a message to warn you when this happens. When you then call the function by name, you get the version from the most recently loaded package.
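
When two functions share the same name, you can tell R exactly which one you mean by writing the package name and two colons in front of the function. The sketch below imitates a name conflict by defining our own function called median(), which has the same name as the median() function in the stats package; the object names are made up for illustration.

```r
scores <- c(3, 8, 1, 9)

median <- function(x) "my own version"     # Same name as stats::median
masked_result <- median(scores)            # Calls the version defined last
original_result <- stats::median(scores)   # Explicitly calls the stats version

masked_result     # "my own version"
original_result   # 5.5

rm(median)        # Remove our version, so median() works as usual again
```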

You can see which packages are loaded in the workspace at the moment:

(.packages())
## [1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [7] "base"

You can also see all the packages that are installed on your computer, which you can load from the package library when you need them:

(.packages(all.available = T))

To load a package you use the library() or the require() function (they are essentially equivalent). For instance, the foreign package allows you to read and write data in formats that are used by other statistical software, including Stata and SPSS.

library(foreign)
# or
require(foreign)

If a package is not installed on your machine, you can install it using the install.packages() function:

install.packages("tidyverse", dependencies = T)  
# This command installs a set of packages that are very useful for working with data

Some packages are built on other packages, i.e. they will use functions that are taken from other user-developed packages. That’s why it is better (but not necessary) to set the option dependencies = TRUE. It tells R to search for all the packages that the one you want to install depends on, and install them on your computer as well.
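
If you share your code with others, it can be handy to install a package only when it is missing. The helper below is a sketch and not part of R: load_or_install() is a made-up name, while requireNamespace(), install.packages() and library() are the real functions it relies on.

```r
load_or_install <- function(pkg) {
  # requireNamespace() returns FALSE if the package is not installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, dependencies = TRUE)
  }
  # character.only = TRUE tells library() that pkg contains the name as text
  library(pkg, character.only = TRUE)
}

load_or_install("stats")   # "stats" always comes with R, so nothing is installed
```

In the same way, load_or_install("foreign") would install the foreign package first if it were missing, and then load it.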

Loading data

We can load data in R in two ways: either the data are on our computer, or they are online. In the first case, we have to tell R which folder the data are in. For instance, I want to load some data from my Dropbox:

setwd("~/Dropbox/Teaching/Survey_Heidelberg_2018/lab/day1")  

(Change the directory above to the one where you keep your data.) From now on, every time we ask R to read or write data (and other things too, like figures), R will use the directory that we have specified.

Then, we can load the data by putting them into a data.frame object (these data are just made up, you should try with your .dta file):

data <- read.dta("cupesse.dta")
head(data)  # This will show you the first few lines of the dataset
##     CCODE YIMODE YQ1            YQ2   YQ42
## 1 Austria Online  35 Strongly Agree   Male
## 2 Austria Online  24          Agree Female
## 3 Austria Online  24          Agree   Male
## 4 Austria Online  33          Agree Female
## 5 Austria Online  23          Agree Female
## 6 Austria Online  34 Strongly Agree Female

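We just loaded a subset of the CUPESSE data. The variables included are the five columns shown above: CCODE (country), YIMODE (interview mode), YQ1 (age), YQ2 (agreement with a survey statement) and YQ42 (Male or Female).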

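The function read.dta() is used to read data in Stata file format (.dta). Other functions include read.spss() for SPSS files (also in the foreign package), and read.csv() and read.table() for plain-text files (available in R without loading any package).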

All these functions require different arguments; you can look them up in case you need to use them.
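
As a quick illustration with a format other than Stata, the sketch below writes a small made-up data set to a temporary .csv file and reads it back with read.csv(). The data and the object names (toy, toy_back) are invented for the example; with your own data you would simply pass the name of your file.

```r
toy <- data.frame(id = 1:3,
                  answer = c("Agree", "Disagree", "Agree"))

tmp <- tempfile(fileext = ".csv")        # A temporary file, deleted when R closes
write.csv(toy, tmp, row.names = FALSE)   # Write the data to disk

toy_back <- read.csv(tmp, stringsAsFactors = FALSE)   # Read them back into R
head(toy_back)
```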

Another way to load data in R is getting them directly from the internet. We will not need it right now. However, if you are interested in accessing online databases directly from R, there is a pretty recent post on R-Bloggers offering a good review of the methods available.

Loops

Loops are control structures by which we tell the software to repeat a certain operation a number of times, or until something happens (e.g. until a vector is over, until it reaches a certain value, etc.). They are useful in some cases to recode variables (although there are more efficient ways to do it). In many cases, what you do with a loop can also be done with other specific functions (which might call a loop internally anyway), but using loops makes it somewhat easier to go back to your code and see what you did. Here we talk about 3 loop structures: repeat, while and for.

i <- 0          # This object works as an index; it tells the loop when to stop
repeat{if (i <= 25) {print(i); i <- i + 5} else break}
## [1] 0
## [1] 5
## [1] 10
## [1] 15
## [1] 20
## [1] 25

Without the “break” command above, the loop would have been infinite.

a <- 0
i <- 1
while(i <= 10){
    a <- c(a, i)
    i <- i + 1
}
print(a)
##  [1]  0  1  2  3  4  5  6  7  8  9 10

Example of a while loop: the “Fibonacci Sequence”. In the sequence, each number is the sum of the two previous numbers, given the first two numbers 0 and 1. Check it on Wikipedia. Below we calculate the first 20 numbers in the sequence:

i <- 1
im1 <- 1
im2 <- 0
fib.seq <- NULL
fib.seq[1] <- im2
fib.seq[2] <- im1
while(i <= 18){    # We have already set the first two numbers, so we need to repeat 18 times
    fib.seq[i + 2] <- im1 + im2
    temp <- im1
    im1 <- im1 + im2
    im2 <- temp
    i <- i + 1
}
fib.seq
##  [1]    0    1    1    2    3    5    8   13   21   34   55   89  144  233
## [15]  377  610  987 1597 2584 4181

Example of a for loop, which repeats an operation for each element of a vector:

random.vector <- c(2, 5, 13, 0.5, 8, 100)
for(i in random.vector){
    print(i^2)    # Note that inside a loop you need to specify "print" to show the value
}
## [1] 4
## [1] 25
## [1] 169
## [1] 0.25
## [1] 64
## [1] 10000
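
As mentioned, what you do with a loop can often be done with other functions. The sketch below computes the same squares in three ways: with a for loop that stores the results, with a direct operation on the whole vector, and with sapply(), which applies a function to each element. The names squares_loop, squares_vec and squares_apply are made up for the example.

```r
random.vector <- c(2, 5, 13, 0.5, 8, 100)

# 1. A for loop that stores each result in a vector prepared in advance
squares_loop <- numeric(length(random.vector))
for (k in seq_along(random.vector)) {
    squares_loop[k] <- random.vector[k]^2
}

# 2. The same operation applied to the whole vector at once
squares_vec <- random.vector^2

# 3. sapply() applies the function to each element and collects the results
squares_apply <- sapply(random.vector, function(z) z^2)

squares_vec
```

All three objects contain the same six numbers; the second version is the shortest and usually the fastest.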

Dplyr

dplyr is a package that makes your life much easier when you work with data, or at least with most operations you need to do with data, that is, summarizing and transforming them. We will see here a few functions that you may find useful in the future. All the things that you do with dplyr can also be done with basic R functions; however, the package makes them easier, less verbose and often faster.

Normally you would need to install dplyr using the command

install.packages("dplyr")

however dplyr came together with the tidyverse package that we installed before, so we don’t need to install it again. We still need to load it, though. We can load the entire tidyverse package so we’ll already have some other useful packages loaded.

library(tidyverse)

We will look at some things that you can do (and if you work with data, you will soon need to do) with the package.

Select variables with select()

The select() command selects a set of columns from your data. For instance, the command below creates a new data frame keeping only a subset of variables from our data object.

data_ctry <- select(data, CCODE, YQ1, YQ2)
head(data_ctry)
##     CCODE YQ1            YQ2
## 1 Austria  35 Strongly Agree
## 2 Austria  24          Agree
## 3 Austria  24          Agree
## 4 Austria  33          Agree
## 5 Austria  23          Agree
## 6 Austria  34 Strongly Agree

Note: many of the objects created by dplyr look like data frames, but in fact they are called “tibbles” (you can recognize them because they are printed with a header like “# A tibble: 11 x 2”). They are pretty much the same as normal data frames, but if you treat them like data frames using some basic R functions, R might complain. You can always turn tibbles into data frames manually:

data_ctry <- as.data.frame(data_ctry)

Select observations with filter()

The filter() command does pretty much what we did before with square brackets. For instance, the command below keeps only the respondents from Hungary, who are younger than 25, and either “agree” or “strongly agree” with the statement of YQ2.

data_ctry <- filter(data_ctry, 
                    CCODE == "Hungary", 
                    YQ1 < 25, 
                    YQ2 %in% c("Strongly Agree", "Agree"))
head(data_ctry)
##     CCODE YQ1            YQ2
## 1 Hungary  20 Strongly Agree
## 2 Hungary  19 Strongly Agree
## 3 Hungary  20 Strongly Agree
## 4 Hungary  24 Strongly Agree
## 5 Hungary  22 Strongly Agree
## 6 Hungary  24 Strongly Agree
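
For comparison, this is a sketch of the same kind of selection done with square brackets in base R. The data frame toy is made up for the example (it only borrows the variable names of our data), so it runs without the CUPESSE file.

```r
toy <- data.frame(CCODE = c("Hungary", "Hungary", "Austria", "Hungary"),
                  YQ1   = c(20, 30, 22, 24),
                  YQ2   = c("Agree", "Agree", "Strongly Agree", "Disagree"),
                  stringsAsFactors = FALSE)

# TRUE for the rows that satisfy all three conditions
keep <- toy$CCODE == "Hungary" &
        toy$YQ1 < 25 &
        toy$YQ2 %in% c("Strongly Agree", "Agree")

toy_hu <- toy[keep, ]
toy_hu   # Only the first row satisfies all the conditions
```

In filter(), the conditions separated by commas are combined in the same way as with the & operator here.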

Sort observations with arrange()

This command sorts observations based on a specific variable or set of variables:

data_ctry <- arrange(data_ctry, YQ1, YQ2)
data_ctry[1:15, ]
##      CCODE YQ1            YQ2
## 1  Hungary  18          Agree
## 2  Hungary  18          Agree
## 3  Hungary  18          Agree
## 4  Hungary  18 Strongly Agree
## 5  Hungary  18 Strongly Agree
## 6  Hungary  18 Strongly Agree
## 7  Hungary  19          Agree
## 8  Hungary  19          Agree
## 9  Hungary  19          Agree
## 10 Hungary  19 Strongly Agree
## 11 Hungary  20          Agree
## 12 Hungary  20          Agree
## 13 Hungary  20 Strongly Agree
## 14 Hungary  20 Strongly Agree
## 15 Hungary  20 Strongly Agree
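
In base R, the same kind of sorting can be done with the order() function inside square brackets. A sketch with a small made-up data frame called toy:

```r
toy <- data.frame(YQ1 = c(20, 18, 18),
                  YQ2 = c("Agree", "Strongly Agree", "Agree"),
                  stringsAsFactors = FALSE)

# order() returns the row positions sorted by YQ1 and, within ties, by YQ2
toy_sorted <- toy[order(toy$YQ1, toy$YQ2), ]
toy_sorted   # Original rows 3, 2, 1, in this order
```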

Create new variables with mutate()

This function is very useful to create new variables in the data set. For instance, in the command below we center age around the mean value, so respondents who are of average age (for the sample) will get a value of 0, respondents who are younger than the average will get a negative value, and respondents who are older than the average will get a positive value.

data_ctry <- mutate(data_ctry, age_c = YQ1 - mean(YQ1, na.rm = T))
data_ctry[1:10, ]
##      CCODE YQ1            YQ2     age_c
## 1  Hungary  18          Agree -2.636364
## 2  Hungary  18          Agree -2.636364
## 3  Hungary  18          Agree -2.636364
## 4  Hungary  18 Strongly Agree -2.636364
## 5  Hungary  18 Strongly Agree -2.636364
## 6  Hungary  18 Strongly Agree -2.636364
## 7  Hungary  19          Agree -1.636364
## 8  Hungary  19          Agree -1.636364
## 9  Hungary  19          Agree -1.636364
## 10 Hungary  19 Strongly Agree -1.636364

Note the option na.rm = T inside of the mean() function. This is necessary to tell R to ignore missing data. If we do not include this option, it is enough to have one single missing observation and the mean will be missing too.
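
A quick sketch of what happens with and without this option, using a made-up vector of ages with one missing value:

```r
ages <- c(18, 25, NA, 31)

mean(ages)                 # NA: one missing value is enough
mean(ages, na.rm = TRUE)   # 24.66667: the mean of 18, 25 and 31
```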

Summarize data with summarize()

Of course, we can also just create an object that contains the data in summarized form. For instance, we may want to keep the mean age into a separate tibble:

mean_age <- summarize(data_ctry, age_c_mean = mean(age_c, na.rm = T))
mean_age
##      age_c_mean
## 1 -3.232237e-16
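
The mean of a mean-centered variable should be exactly zero. The tiny number above (about -0.0000000000000003) is just rounding error: computers store decimal numbers with limited precision. The sketch below, with made-up numbers, shows how to check whether a value is zero “for all practical purposes” using all.equal().

```r
x <- c(18.2, 19.7, 20.1, 24.4)
x_c <- x - mean(x)      # Mean-centered version of x

mean(x_c)               # Zero, or a tiny number very close to zero
isTRUE(all.equal(mean(x_c), 0))   # TRUE: equal to zero up to rounding error
```

For this reason it is safer to compare decimal numbers with all.equal() than with ==.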

However, to understand the real value of the summarize() command, we need to learn about the most important function of dplyr.

Putting it all together: the pipe operator %>%

The pipe operator allows us to concatenate all the functions that we have seen. For instance, everything that we did until the summarize() command can be done in one single run:

data_ctry <- data %>% 
  select(CCODE, YQ1, YQ2) %>%
  filter(CCODE == "Hungary", YQ1 < 25, YQ2 %in% c("Strongly Agree", "Agree")) %>%
  arrange(YQ1, YQ2) %>%
  mutate(age_c = YQ1 - mean(YQ1, na.rm = T))
data_ctry[1:10, ]
##      CCODE YQ1            YQ2     age_c
## 1  Hungary  18          Agree -2.636364
## 2  Hungary  18          Agree -2.636364
## 3  Hungary  18          Agree -2.636364
## 4  Hungary  18 Strongly Agree -2.636364
## 5  Hungary  18 Strongly Agree -2.636364
## 6  Hungary  18 Strongly Agree -2.636364
## 7  Hungary  19          Agree -1.636364
## 8  Hungary  19          Agree -1.636364
## 9  Hungary  19          Agree -1.636364
## 10 Hungary  19 Strongly Agree -1.636364

We skipped the summarize() command because it would have created one object with one observation only. However, we will soon see a cool way to summarize the data.
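
To get an intuition of what a pipe does, here is a toy version written in base R. The operator %then% is made up for this example and is much simpler than the real %>% (for instance, it cannot pass extra arguments to the function on its right), but the idea is the same: take what is on the left and hand it over to the function on the right.

```r
# A toy pipe: passes the value on the left to the function on the right
`%then%` <- function(value, f) f(value)

squares <- c(1, 4, 9)

nested <- sum(sqrt(squares))               # Read from the inside out
piped  <- squares %then% sqrt %then% sum   # Read from left to right

nested   # 6
piped    # 6
```

The two lines do exactly the same thing, but the piped version lists the operations in the order in which they happen.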

Grouping observations with group_by()

Another very important feature of dplyr, perhaps the one that drove most people to using it, is the possibility to do operations by group. This might seem trivial, but those who learned to do data cleaning the hard way with basic R functions know how painful it can be to do some things that are conceptually straightforward, like getting group means.

For instance, if we want to extract the mean age in each country in the sample:

group_means <- data %>%
  select(CCODE, YQ1) %>%
  group_by(CCODE) %>%
  summarize(mean_age = mean(YQ1, na.rm = T))
group_means
## # A tibble: 11 x 2
##    CCODE          mean_age
##    <fctr>            <dbl>
##  1 Austria            26.6
##  2 Czech Republic     27.1
##  3 Denmark            26.4
##  4 Germany            27.3
##  5 Greece             28.7
##  6 Hungary            26.6
##  7 Italy              27.6
##  8 Spain              26.6
##  9 Switzerland        25.6
## 10 Turkey             25.0
## 11 United Kingdom     26.2

Of course the group_by() command can be combined with others, so we can easily create group-mean centered variables in one run:

data <- data %>% group_by(CCODE) %>% mutate(age_c = YQ1 - mean(YQ1))
head(data)
## # A tibble: 6 x 6
## # Groups: CCODE [1]
##   CCODE   YIMODE   YQ1 YQ2            YQ42   age_c
##   <fctr>  <fctr> <dbl> <fctr>         <fctr> <dbl>
## 1 Austria Online  35.0 Strongly Agree Male    8.44
## 2 Austria Online  24.0 Agree          Female -2.56
## 3 Austria Online  24.0 Agree          Male   -2.56
## 4 Austria Online  33.0 Agree          Female  6.44
## 5 Austria Online  23.0 Agree          Female -3.56
## 6 Austria Online  34.0 Strongly Agree Female  7.44
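
To give an idea of what the “hard way” looks like, below is a sketch of the same two operations with basic R functions: tapply() for the group means and ave() for the group-mean centering. The data frame toy and its values are made up, with two invented countries “A” and “B”.

```r
toy <- data.frame(CCODE = c("A", "A", "B", "B"),
                  YQ1   = c(20, 30, 25, 35),
                  stringsAsFactors = FALSE)

# Mean of YQ1 within each value of CCODE
group_means_base <- tapply(toy$YQ1, toy$CCODE, mean)
group_means_base   # A: 25, B: 30

# ave() returns the group mean for every row, so we can subtract it directly
toy$age_c <- toy$YQ1 - ave(toy$YQ1, toy$CCODE)
toy$age_c          # -5  5 -5  5
```

This works, but the code gets harder to read as soon as you have more variables or more groups, which is where group_by() pays off.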

Some graphics

A very cool thing about R is that it allows you to make very beautiful charts, and nice charts are an important ingredient for making friends with your audience. The tidyverse includes an essential package called ggplot2, which makes some of the nicest plots around. Besides the aesthetic value, ggplot2 also makes it much easier to produce complex visualizations than base R (you don’t wanna know how big of a pain it is to make plots with base R functions).

For instance, we may want to see a histogram of the age variable in our sample:

ggplot(data, aes(x = YQ1)) +
  geom_histogram(alpha = 0.2, col = "black") +
  theme_bw()

Moreover, we may want to see the same histogram just plotted for each country separately:

ggplot(data, aes(x = YQ1)) +
  geom_histogram(alpha = 0.2, col = "black") +
  facet_wrap(~CCODE) +
  theme_bw()

We can also make bivariate plots, for instance to see the distribution of age for each category of the YQ2 variable:

ggplot(data, aes(y = YQ1, x = YQ2)) +
  geom_jitter(alpha = 0.5) +
  geom_boxplot(fill = "black", alpha = 0.2) +
  coord_flip() +
  theme_bw()