This session is a tutorial that introduces the R language and some utilities that are commonly used to work with survey data.
Let’s say you want to calculate the mean of a series of numbers. You know that R provides a function called mean(), but you don’t know how to use it. The help() function will show you a page with a more or less detailed description of what a function does, and what arguments it accepts (i.e. what input it needs).
help(mean)
# or
?mean # This is the shortest version
Sometimes you won’t find the help page for a certain function directly, for example because you don’t know its exact name. In such cases you can use the help.search() function.
help.search("mean")
# or
??mean # This will return a list of help files containing the word "mean"
Finally, example(mean) runs the examples from the help page of a function, so you can see how the function is used.
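Two more base R helpers can be handy when you are exploring. The sketch below is only an illustration: apropos() lists the objects whose name matches a pattern, and formals() shows which arguments a function accepts. The choice of sd() as the example function is ours.

```r
# List the objects on the search path whose name contains "mean"
matches <- apropos("mean")
head(matches)

# Inspect the argument names of a function without opening its help page
arg_names <- names(formals(sd))
arg_names
```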
You can think of an object as a variable, to which you can assign values.
x <- 7
To see what an object contains you can simply “call” its name:
x
## [1] 7
Or use the “print” function:
print(x)
## [1] 7
Objects can contain character values
char <- "hello!" # Note the quotation marks
char
## [1] "hello!"
Why do we need quotation marks? Because otherwise R will think that we are referring to another object:
char <- hello
## Error in eval(expr, envir, enclos): object 'hello' not found
Objects can contain multiple values at once, like with vectors
v <- c(1, 2, 10, 30, 10000)
v
## [1] 1 2 10 30 10000
or matrices
m <- matrix(c(1, 2, 10, 30), ncol = 2, nrow = 2)
m
## [,1] [,2]
## [1,] 1 10
## [2,] 2 30
However, note that vectors and matrices cannot contain both numbers and character values at the same time. Because character values are more general (they are just “labels”), when you add a character to a numeric vector, all numbers will be converted to characters as well:
v <- c(v, "f")
v
## [1] "1" "2" "10" "30" "10000" "f"
You can tell that numbers are treated as characters because now they are surrounded by quotation marks.
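If you are unsure how R is treating a vector, you can ask for its class. The sketch below uses a made-up vector called v_mixed (not part of the session data) and also shows what happens when you try to convert it back to numbers.

```r
v_mixed <- c(1, 2, 10, "f")   # the character value forces everything to character
class(v_mixed)

# Converting back to numbers: "f" has no numeric meaning, so it becomes NA
# (R also emits a warning, silenced here to keep the output clean)
v_num <- suppressWarnings(as.numeric(v_mixed))
v_num
```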
To select elements within objects, you need to use square brackets and specify which element you need. For instance, to see the first element of the vector v:
v[1]
## [1] "1"
According to the same logic you can select multiple elements
v[c(1, 3, 5)]
## [1] "1" "10" "10000"
For multidimensional objects, such as matrices, you need to specify the position of the element in every dimension: the first number refers to the row, the second to the column:
m[1, 2] # Will pick the number in the first row and second column
## [1] 10
You can also select entire columns or rows
m[1, ] # First row
## [1] 1 10
m[, 1] # First column
## [1] 1 2
An important class of objects is the “data frame”. Data frames look like matrices in which rows are assumed to be observations and columns are assumed to be variables. Unlike in a matrix, different columns can contain different classes of values:
dat <- data.frame(X1 = c(1, 2, 3, 10),
X2 = c("a", "b", "c", "d"))
dat
## X1 X2
## 1 1 a
## 2 2 b
## 3 3 c
## 4 10 d
Since columns are regarded as variables, we can see what variables are contained in a data frame by asking for the “names”:
names(dat)
## [1] "X1" "X2"
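To check which class each column has, you can apply class() to every column, or use str() for a compact overview. This is a small sketch on a copy of the example data; the name dat_example is ours, and stringsAsFactors = FALSE is set explicitly so that the result is the same in old and new versions of R.

```r
dat_example <- data.frame(X1 = c(1, 2, 3, 10),
                          X2 = c("a", "b", "c", "d"),
                          stringsAsFactors = FALSE) # keep text as character

col_classes <- sapply(dat_example, class) # one class per column
col_classes

str(dat_example) # compact description of the whole data frame
```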
You can select variables in a data frame by using square brackets:
dat[, 1] # Select the first column, AKA the first variable
## [1] 1 2 3 10
or by calling their name directly, using the dollar sign:
dat$X1
## [1] 1 2 3 10
Square brackets are very useful in data frames when you need to select some specific observations:
dat[1, ] # Shows all the variables for the observation in the first row
## X1 X2
## 1 1 a
dat[dat$X1 > 2, ] # Shows all observations for which X1 is bigger than 2
## X1 X2
## 3 3 c
## 4 10 d
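It may help to see what the condition inside the brackets actually is: a vector with one TRUE or FALSE per row, which R then uses to decide which rows to keep. A minimal sketch, on a copy of the example data called dat_example (a name chosen here for illustration):

```r
dat_example <- data.frame(X1 = c(1, 2, 3, 10),
                          X2 = c("a", "b", "c", "d"))

keep <- dat_example$X1 > 2   # one TRUE/FALSE value per row
keep

dat_example[keep, ]          # rows where the condition is TRUE, all columns
dat_example[keep, "X2"]      # same rows, but only the column called X2
```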
In the R console, you can see what objects are stored in the workspace (R’s memory) by typing:
ls()
## [1] "char" "dat" "m" "v" "x"
or
objects()
## [1] "char" "dat" "m" "v" "x"
However, RStudio makes your life easier by showing in the top-right panel the list of objects that are stored in your workspace, divided by categories (e.g. Data, Values), and with a brief description of their content.
Functions are a very important part of R: all the operations that R performs are carried out by functions. A function takes some arguments as input, performs a certain operation, and returns some output. Functions usually follow the form f(argument1, argument2, ...) and, like everything else in R, can be stored in objects:
plusone <- function(x) {x + 1}
plusone(5)
## [1] 6
You can often see the code for a function by typing its name:
plusone
## function(x) {x + 1}
Even when they take one argument only, many functions can be applied to entire vectors or matrices. Arithmetic operations like the one in plusone() are vectorized: they take each entry as input, and return another vector or matrix where every number has passed through the function.
plusone(c(1, 3, 5, 99))
## [1] 2 4 6 100
Of course, functions can be far more complex than this, and return all kinds of results. However, the logic is always the same: you define what argument(s) a function accepts, you specify what the function does with those arguments, and the function will give you a result.
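As an illustration of a function with more than one argument, here is a sketch of a hypothetical helper called center() (it is not a built-in R function). Its second argument has a default value, so you only need to specify it when you want something different.

```r
# Hypothetical helper: subtracts the mean from every element of a numeric vector
center <- function(x, na.rm = TRUE) {
  x - mean(x, na.rm = na.rm)
}

centered <- center(c(1, 2, 3))                 # uses the default na.rm = TRUE
centered

with_na <- center(c(1, 2, NA), na.rm = FALSE)  # the mean is NA, so every result is NA
with_na
```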
seq() creates vectors with sequences:
seq(from = 1, to = 5, by = 0.5) # It returns a sequence of numbers from 1 to 5, with 0.5 increment
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
seq(from = 1, to = 2, length.out = 10) # It returns a sequence of 10 numbers from 1 to 2
## [1] 1.000000 1.111111 1.222222 1.333333 1.444444 1.555556 1.666667
## [8] 1.777778 1.888889 2.000000
rep() creates vectors where a value or a set of values are repeated for a number of times:
rep(1:3, 5) # Repeats the sequence from 1 to 3 for five times
## [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
rep(1:3, each = 5) # Repeats each number in the sequence from 1 to 3 for five times
## [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
rep(c("a", "b", "c"), 5) # Accepts character values as well
## [1] "a" "b" "c" "a" "b" "c" "a" "b" "c" "a" "b" "c" "a" "b" "c"
sample() makes random draws from a given set of values, and returns a vector of the required size:
sample(c(1:199), size = 2) # It samples two numbers from the population "c(1:199)"
## [1] 94 158
sample(c(1:5), size = 6, replace = T) # Samples with replacement
## [1] 1 4 5 1 5 5
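Because the draws are random, you will get different numbers every time you run sample(). If you need the same draws again (for instance to make an analysis reproducible), you can fix the state of the random number generator with set.seed(). In the sketch below the seed value 2018 is arbitrary.

```r
set.seed(2018)                     # arbitrary seed value
draw1 <- sample(1:199, size = 2)

set.seed(2018)                     # resetting the same seed reproduces the draw
draw2 <- sample(1:199, size = 2)

identical(draw1, draw2)            # the two draws are the same
```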
While R can do many things, it is not able to do all of them by itself. Base R is actually a relatively limited machine from the point of view of statistical analysis. However, a lot more can be done, far more than with commercial software, by using “packages”. Generally speaking, a package is a set of functions, help files and data files that have been put together and are somehow related to one another. R offers a lot of packages (at the time of writing, CRAN lists 8869 available packages!). Many of them are written by statisticians and allow you to apply cutting-edge techniques that in other contexts would require a lot of programming skills.
Even if you have just downloaded R, you have a lot of packages installed on your computer. However, not all of them are loaded in your workspace: some packages take a lot of memory, so it is better to just “call” them when needed. Moreover, some packages use functions that have the same name as functions in other packages that do different things. In such a case R does not return an error: the function from the package that was loaded most recently “masks” the other one, and R prints a message listing the masked functions when the package is loaded. Hence, the order in which you load packages can change which function is called.
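One way to be explicit about which package a function should come from is the `::` operator, which works regardless of what else is loaded. A well-known case is filter(), which exists both in the stats package and in dplyr. The sketch below sticks to packages that ship with R, so it runs without installing anything.

```r
# package::function calls the function from that specific package
sd_value <- stats::sd(c(1, 2, 3, 4))
sd_value

mean_value <- base::mean(c(1, 2, 3, 4))
mean_value
```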
You can see which packages are loaded in the workspace at the moment:
(.packages())
## [1] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
## [7] "base"
You can also see all the packages that are installed in your package library, and that you can load when you need them:
(.packages(all.available = T))
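To load a package you use the library() or the require() functions. They are essentially equivalent; the main difference is that library() stops with an error if the package is not installed, while require() returns FALSE with a warning. For instance, the foreign package allows you to read and write data in formats that are used by other statistical software, including Stata and SPSS.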
library(foreign)
# or
require(foreign)
If a package is not installed on your machine, you can install it using the install.packages() function:
install.packages("tidyverse", dependencies = TRUE)
# This command installs a set of packages that are very useful for working with data
Some packages are built on other packages, i.e. they use functions that are taken from other user-developed packages. That’s why it is better (but not necessary) to set the option dependencies = TRUE. It tells R to search for all the packages that the one you want to install depends on, and install them on your computer as well.
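If you share your scripts with others, it can be convenient to install a package only when it is missing. The sketch below defines a hypothetical helper, is_installed(), built on the base function requireNamespace(). The install line is left commented out so that nothing is downloaded by accident.

```r
# Hypothetical helper: TRUE if a package is installed, FALSE otherwise
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")   # the stats package ships with R

# Install only when needed (uncomment to use):
# if (!is_installed("tidyverse")) install.packages("tidyverse", dependencies = TRUE)
```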
We can load data in R in two ways: either the data are in our computer, or they are online. In the first case, we have to tell R in which folder the data are. For instance, I want to load some data from my Dropbox:
setwd("~/Dropbox/Teaching/Survey_Heidelberg_2018/lab/day1")
(Change the directory above to the one where you keep your data.) From now on, every time we ask R to read or write data (and other things too, like figures), R will use the directory that we have specified.
Then, we can load the data by putting them into a data.frame object (these data are just made up, you should try with your .dta file):
data <- read.dta("cupesse.dta")
head(data) # This will show you the first few lines of the dataset
## CCODE YIMODE YQ1 YQ2 YQ42
## 1 Austria Online 35 Strongly Agree Male
## 2 Austria Online 24 Agree Female
## 3 Austria Online 24 Agree Male
## 4 Austria Online 33 Agree Female
## 5 Austria Online 23 Agree Female
## 6 Austria Online 34 Strongly Agree Female
We just loaded a subset of the CUPESSE data. The variables included are:
CCODE: the country name
YIMODE: interview mode
YQ1: the respondent age
YQ2: to what degree the respondent agrees with the following statement: “Youth unemployment is a major problem in [your country].”
YQ42: sex of the respondent
The function read.dta() is used to read data in Stata file format (.dta). Other functions are:
read.spss() = reads data in the SPSS .sav format
read.csv() = reads data in .csv format (often produced by Excel)
read.table() = reads data in .dat format, or tab delimited data in .txt
All these functions require different arguments, you can look them up in case you need to use them.
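If you do not have a data file at hand, you can still try the reading functions by writing a small file first. The sketch below uses made-up values and a temporary file created with tempfile(); the object names toy and toy_back are ours.

```r
# A tiny made-up data set
toy <- data.frame(CCODE = c("Austria", "Hungary"),
                  YQ1 = c(35, 24))

path <- tempfile(fileext = ".csv")        # temporary file, deleted when R closes
write.csv(toy, path, row.names = FALSE)   # write the data without row numbers

toy_back <- read.csv(path, stringsAsFactors = FALSE) # read the file back in
toy_back
```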
Another way to load data in R is getting them directly from the internet. We will not need it right now; however, if you are interested in accessing online databases directly from R, there is a pretty recent post on R-Bloggers offering a good review of the methods available.
Loops are constructs by which we tell the software to repeat a certain operation a number of times, or until something happens (e.g. until the end of a vector is reached, until a value reaches a certain threshold, etc.). They are useful in some cases to recode variables (although there are more efficient ways to do it). In many cases, what you do with a loop can also be done with other specific functions (which might call a loop internally anyway), but using loops makes it somewhat easier to go back to your code and see what you did. Here we talk about three loop structures:
repeat simply repeats the same expression until it is told to stop. Conceptually, it resembles the rep() function, except that it repeats an expression rather than a value. You can use the command “break” to interrupt it:
i <- 0 # This object works as a counter; repeat needs it to decide when to stop
repeat{if (i <= 25) {print(i); i <- i + 5} else break}
## [1] 0
## [1] 5
## [1] 10
## [1] 15
## [1] 20
## [1] 25
Without the “break” command above, the loop would have been infinite.
A while loop keeps on repeating an expression “while a certain condition is true”, or inversely “until it is not true anymore”. For instance, the loop below concatenates a sequence of numbers after an arbitrary vector called a. Before the loop we set the index i to 1, and then we tell the loop to keep adding values to the vector a while i \(\leq 10\), then stop. We need to remember to update the value of i inside of the loop, otherwise it will keep on adding values forever:
a <- 0
i <- 1
while(i <= 10){
  a <- c(a, i)
  i <- i + 1
}
print(a)
## [1] 0 1 2 3 4 5 6 7 8 9 10
Example of a while loop: the “Fibonacci Sequence”. In the sequence, each number is the sum of the two previous numbers, given the first two numbers 0 and 1 (check it on Wikipedia). Below we calculate the first 20 numbers in the sequence:
i <- 1
im1 <- 1
im2 <- 0
fib.seq <- NULL
fib.seq[1] <- im2
fib.seq[2] <- im1
while(i <= 18){ # We have already set the first two numbers, so we need to repeat 18 times
  fib.seq[i + 2] <- im1 + im2
  temp <- im1
  im1 <- im1 + im2
  im2 <- temp
  i <- i + 1
}
fib.seq
## [1] 0 1 1 2 3 5 8 13 21 34 55 89 144 233
## [15] 377 610 987 1597 2584 4181
The for loop is the most common. We will use it later in the course. With for you basically loop through each item in a vector. Here we don’t have to define i as a variable beforehand: it is created directly by the loop, which assigns it the next item of the vector at every iteration:
random.vector <- c(2, 5, 13, 0.5, 8, 100)
for(i in random.vector){
  print(i^2) # Note that inside a loop you need to call print() explicitly to show the value
}
## [1] 4
## [1] 25
## [1] 169
## [1] 0.25
## [1] 64
## [1] 10000
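As mentioned above, what a loop does can often be done with other functions too. The sketch below computes the same squares in three ways: a loop that stores its results in a vector instead of printing them, a vectorized expression, and sapply(). The object names squares_loop, squares_vec and squares_sapply are chosen here for illustration.

```r
random.vector <- c(2, 5, 13, 0.5, 8, 100)

# 1. Loop version: create an empty result vector first, then fill it in
squares_loop <- numeric(length(random.vector))
for (k in seq_along(random.vector)) {   # k runs over the positions 1, 2, ..., 6
  squares_loop[k] <- random.vector[k]^2
}

# 2. Vectorized version: the operation is applied to every element at once
squares_vec <- random.vector^2

# 3. sapply() version: applies a function to each element and collects the results
squares_sapply <- sapply(random.vector, function(z) z^2)

squares_loop
```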
dplyr is a package that makes your life much easier when you work with data, or at least with most of the operations you need to do with data, that is, summarizing and transforming them. We will see here a few functions that you may find useful in the future. All the things that you do with dplyr can also be done with basic R functions; however, the package makes them easier, less verbose and often faster.
Normally you would need to install dplyr using the command
install.packages("dplyr")
however dplyr came together with the tidyverse package that we installed before, so we don’t need to install it again. However, we still need to load it. We can load the entire tidyverse package, so that some other useful packages are loaded as well.
library(tidyverse)
We will look at some things that you can do (and if you work with data, you will soon need to do) with the package.
select()
The select() command selects a set of columns from your data. For instance, the command below creates a new data frame keeping only a subset of variables from our data object.
data_ctry <- select(data, CCODE, YQ1, YQ2)
head(data_ctry)
## CCODE YQ1 YQ2
## 1 Austria 35 Strongly Agree
## 2 Austria 24 Agree
## 3 Austria 24 Agree
## 4 Austria 33 Agree
## 5 Austria 23 Agree
## 6 Austria 34 Strongly Agree
Note: some of the objects created by dplyr look like data frames, but in fact they are “tibbles” (you can recognize them because they are printed with a header such as “# A tibble: 11 x 2”). They are pretty much the same as normal data frames, but if you treat them like data frames using some basic R functions, R might complain. You can always turn tibbles into data frames manually:
data_ctry <- as.data.frame(data_ctry)
filter()
The filter() command does pretty much what we did before with square brackets. For instance, the command below keeps only the respondents from Hungary who are younger than 25 and either “agree” or “strongly agree” with the statement of YQ2.
data_ctry <- filter(data_ctry,
CCODE == "Hungary",
YQ1 < 25,
YQ2 %in% c("Strongly Agree", "Agree"))
head(data_ctry)
## CCODE YQ1 YQ2
## 1 Hungary 20 Strongly Agree
## 2 Hungary 19 Strongly Agree
## 3 Hungary 20 Strongly Agree
## 4 Hungary 24 Strongly Agree
## 5 Hungary 22 Strongly Agree
## 6 Hungary 24 Strongly Agree
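For comparison, the same kind of selection can be written with square brackets in base R. The sketch below uses a small made-up data frame called toy (not the CUPESSE data), with the same variable names as above.

```r
# Made-up data with the same variable names as the survey data
toy <- data.frame(
  CCODE = c("Hungary", "Hungary", "Austria", "Hungary"),
  YQ1 = c(20, 30, 22, 24),
  YQ2 = c("Agree", "Agree", "Strongly Agree", "Disagree"),
  stringsAsFactors = FALSE
)

# Conditions are combined with & ("and"); the result has one TRUE/FALSE per row
keep <- toy$CCODE == "Hungary" &
  toy$YQ1 < 25 &
  toy$YQ2 %in% c("Strongly Agree", "Agree")

toy_sub <- toy[keep, ]   # only the first row satisfies all three conditions
toy_sub
```

One difference to keep in mind: when the condition is missing (NA) for some rows, square brackets return those rows filled with NA, while filter() drops them.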
arrange()
This command sorts observations based on a specific variable or set of variables
data_ctry <- arrange(data_ctry, YQ1, YQ2)
data_ctry[1:15, ]
## CCODE YQ1 YQ2
## 1 Hungary 18 Agree
## 2 Hungary 18 Agree
## 3 Hungary 18 Agree
## 4 Hungary 18 Strongly Agree
## 5 Hungary 18 Strongly Agree
## 6 Hungary 18 Strongly Agree
## 7 Hungary 19 Agree
## 8 Hungary 19 Agree
## 9 Hungary 19 Agree
## 10 Hungary 19 Strongly Agree
## 11 Hungary 20 Agree
## 12 Hungary 20 Agree
## 13 Hungary 20 Strongly Agree
## 14 Hungary 20 Strongly Agree
## 15 Hungary 20 Strongly Agree
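The base R counterpart of arrange() is order(), which returns the positions of the rows in sorted order, so that they can be used inside square brackets. A minimal sketch on made-up data (the object names toy and toy_sorted are ours):

```r
toy <- data.frame(
  YQ1 = c(20, 18, 19, 18),
  YQ2 = c("Agree", "Strongly Agree", "Agree", "Agree"),
  stringsAsFactors = FALSE
)

# Sort by YQ1 first; ties are broken by YQ2 (alphabetically)
row_order <- order(toy$YQ1, toy$YQ2)
toy_sorted <- toy[row_order, ]
toy_sorted
```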
mutate()
This function is very useful to create new variables in the data set. For instance, in the command below we center age around the mean value, so respondents who are of average age (for the sample) will get a value of 0, respondents who are younger than the average will get a negative value, and respondents who are older than the average will get a positive value.
data_ctry <- mutate(data_ctry, age_c = YQ1 - mean(YQ1, na.rm = T))
data_ctry[1:10, ]
## CCODE YQ1 YQ2 age_c
## 1 Hungary 18 Agree -2.636364
## 2 Hungary 18 Agree -2.636364
## 3 Hungary 18 Agree -2.636364
## 4 Hungary 18 Strongly Agree -2.636364
## 5 Hungary 18 Strongly Agree -2.636364
## 6 Hungary 18 Strongly Agree -2.636364
## 7 Hungary 19 Agree -1.636364
## 8 Hungary 19 Agree -1.636364
## 9 Hungary 19 Agree -1.636364
## 10 Hungary 19 Strongly Agree -1.636364
Note the option na.rm = T inside of the mean() function. This tells R to ignore missing data. If we do not include this option, a single missing observation is enough to make the mean missing too.
summarize()
Of course, we can also just create an object that contains the data in summarized form. For instance, we may want to store the mean of the centered age variable in a separate object (as expected, it is zero up to rounding error):
mean_age <- summarize(data_ctry, age_c_mean = mean(age_c, na.rm = T))
mean_age
## age_c_mean
## 1 -3.232237e-16
However, to understand the real value of the summarize() command, we need to learn about the most important function of dplyr: the pipe operator.
%>%
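The pipe operator allows us to chain all the functions that we have seen: it takes the result of the expression on its left and passes it as the first argument to the function on its right. For instance, everything that we did before the summarize() command can be done in one single run: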
data_ctry <- data %>%
select(CCODE, YQ1, YQ2) %>%
filter(CCODE == "Hungary", YQ1 < 25, YQ2 %in% c("Strongly Agree", "Agree")) %>%
arrange(YQ1, YQ2) %>%
mutate(age_c = YQ1 - mean(YQ1, na.rm = T))
data_ctry[1:10, ]
## CCODE YQ1 YQ2 age_c
## 1 Hungary 18 Agree -2.636364
## 2 Hungary 18 Agree -2.636364
## 3 Hungary 18 Agree -2.636364
## 4 Hungary 18 Strongly Agree -2.636364
## 5 Hungary 18 Strongly Agree -2.636364
## 6 Hungary 18 Strongly Agree -2.636364
## 7 Hungary 19 Agree -1.636364
## 8 Hungary 19 Agree -1.636364
## 9 Hungary 19 Agree -1.636364
## 10 Hungary 19 Strongly Agree -1.636364
We skipped the summarize() command because it would have created an object with one observation only; however, we will soon see a much more useful way to summarize the data.
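To see what a pipe does mechanically, it can help to compare a nested call with its piped version on a simple numeric vector. The sketch below assumes R 4.1 or later, which includes a native pipe written `|>`; for simple chains like this one it behaves like `%>%`, and it needs no package.

```r
# Nested call: you have to read it from the inside out
nested <- round(mean(sqrt(c(1, 4, 9))), 1)

# Piped call: you read it from left to right, one step at a time
piped <- c(1, 4, 9) |>
  sqrt() |>      # square roots: 1, 2, 3
  mean() |>      # their mean: 2
  round(1)       # rounded to one decimal

identical(nested, piped)
```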
group_by()
Another very important feature of dplyr, perhaps the one that drove most people to use it, is the possibility to do operations by group. This might seem trivial, but those who learned data cleaning the hard way with basic R functions know how painful it can be to do some things that are conceptually straightforward, like getting group means.
For instance, if we want to extract the mean age in each country in the sample:
group_means <- data %>%
select(CCODE, YQ1) %>%
group_by(CCODE) %>%
summarize(mean_age = mean(YQ1, na.rm = T))
group_means
## # A tibble: 11 x 2
## CCODE mean_age
## <fctr> <dbl>
## 1 Austria 26.6
## 2 Czech Republic 27.1
## 3 Denmark 26.4
## 4 Germany 27.3
## 5 Greece 28.7
## 6 Hungary 26.6
## 7 Italy 27.6
## 8 Spain 26.6
## 9 Switzerland 25.6
## 10 Turkey 25.0
## 11 United Kingdom 26.2
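For comparison, one base R way to get group means is tapply(), which applies a function to a variable separately within each group. The sketch below uses a small made-up data frame called toy rather than the CUPESSE data.

```r
toy <- data.frame(
  CCODE = c("Austria", "Austria", "Hungary", "Hungary"),
  YQ1 = c(20, 30, 22, NA)
)

# Mean of YQ1 within each value of CCODE; na.rm = TRUE is passed on to mean()
group_means_base <- tapply(toy$YQ1, toy$CCODE, mean, na.rm = TRUE)
group_means_base
```

The result is a named vector rather than a data frame, which is one reason why the dplyr version is often more convenient for further processing.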
Of course the group_by() command can be combined with others, so we can easily create group-mean centered variables in one run:
data <- data %>% group_by(CCODE) %>% mutate(age_c = YQ1 - mean(YQ1, na.rm = T))
head(data)
## # A tibble: 6 x 6
## # Groups: CCODE [1]
## CCODE YIMODE YQ1 YQ2 YQ42 age_c
## <fctr> <fctr> <dbl> <fctr> <fctr> <dbl>
## 1 Austria Online 35.0 Strongly Agree Male 8.44
## 2 Austria Online 24.0 Agree Female -2.56
## 3 Austria Online 24.0 Agree Male -2.56
## 4 Austria Online 33.0 Agree Female 6.44
## 5 Austria Online 23.0 Agree Female -3.56
## 6 Austria Online 34.0 Strongly Agree Female 7.44
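Group-mean centering can also be sketched in base R with ave(), which returns, for every observation, the summary of the group it belongs to. The data below are made up and the object name toy is ours.

```r
toy <- data.frame(
  CCODE = c("Austria", "Austria", "Hungary", "Hungary"),
  YQ1 = c(20, 30, 22, 24)
)

# ave() returns one group mean per row, aligned with the original data
group_mean <- ave(toy$YQ1, toy$CCODE,
                  FUN = function(z) mean(z, na.rm = TRUE))

toy$age_c <- toy$YQ1 - group_mean   # distance from the mean of one's own country
toy
```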
A very cool thing about R is that it allows you to make very beautiful charts, and nice charts are an important ingredient for making friends with your audience. The tidyverse includes an essential package called ggplot2, which makes some of the nicest plots around. Besides the aesthetic value, ggplot2 also makes it much easier to produce complex visualizations than base R (you don’t want to know how painful it is to make such plots with base R functions).
For instance, we may want to see a histogram of the age variable in our sample:
ggplot(data, aes(x = YQ1)) +
geom_histogram(alpha = 0.2, col = "black") +
theme_bw()
Moreover, we may want to see the same histogram just plotted for each country separately:
ggplot(data, aes(x = YQ1)) +
geom_histogram(alpha = 0.2, col = "black") +
facet_wrap(~CCODE) +
theme_bw()
We can also make bivariate plots, for instance to see the distribution of age for each category of the YQ2 variable:
ggplot(data, aes(y = YQ1, x = YQ2)) +
geom_jitter(alpha = 0.5) +
geom_boxplot(fill = "black", alpha = 0.2) +
coord_flip() +
theme_bw()