This is a brief tutorial meant to provide an overview of the language and the utilities that we will be using during the course. It does not substitute for an introductory course to R: if you are completely new to it, this tutorial won’t help. However, if you have some basic knowledge of how R works, this might be useful for you.

Getting help

Let’s say you want to calculate the mean of a series of numbers. You know that R provides a function called “mean”, but you don’t know how to use it. The “help” function will show you a page with a more or less detailed description of what a function does, and what arguments it accepts (i.e. what input it needs).

help(mean)
# or
?mean  # This is the shortest version

Sometimes you won’t find a direct link to the help page for a certain function. In such cases you can use the “help.search” function.

help.search("mean")
# or
??mean  # This will return a list of help files containing the word "mean"

Finally, the function example(mean) will run the examples taken from the help page of a function (“mean” in this case), so you can see how it is used in practice.
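When you do not remember the exact name of a function, or you only need a quick reminder of its arguments, two other base R functions can help. The lines below are just a small sketch: “apropos” lists the objects whose name contains a given string, and “args” prints the arguments of a function without opening the full help page. The object name “matches” is made up for this illustration.

```r
# Names of all available objects that contain "mean"
matches <- apropos("mean")
head(matches)

# Arguments accepted by the "sd" function, with their defaults
args(sd)
```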

Objects

You can think of an object as a variable, to which you can assign values.

x <- 7

To see what an object contains you can simply “call” its name:

x
## [1] 7

Or use the “print” function:

print(x)
## [1] 7

Objects can contain character values

char <- "hello!"  # Note the quotation marks
char
## [1] "hello!"

Why do we need quotation marks? Because otherwise R will think that we are referring to another object:

char <- hello
## Error in eval(expr, envir, enclos): object 'hello' not found

Objects can contain multiple values at once, like with vectors

v <- c(1, 2, 10, 30, 10000)
v
## [1]     1     2    10    30 10000

or matrices

m <- matrix(c(1, 2, 10, 30), ncol = 2, nrow = 2)
m
##      [,1] [,2]
## [1,]    1   10
## [2,]    2   30

However, note that vectors and matrices cannot contain both numbers and character values at the same time. Because character values are more general (they are just “labels”), when you add a character to a numeric vector, all numbers will be converted to characters as well:

v <- c(v, "f")
v  
## [1] "1"     "2"     "10"    "30"    "10000" "f"

You can tell that numbers are treated as characters because now they are surrounded by quotation marks.
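A more direct way to check how R is treating a vector is to ask for its class. The sketch below repeats the example with new, made-up object names (“nums”, “mixed”, “back”) and also shows how to convert the values back to numbers with “as.numeric”.

```r
nums <- c(1, 2, 10, 30, 10000)
class(nums)                      # Numeric vector

mixed <- c(nums, "f")            # Adding one character value...
class(mixed)                     # ...turns the whole vector into character

back <- as.numeric(mixed[1:5])   # Convert the first five elements back to numbers
class(back)
sum(back)                        # Arithmetic works again
```

Note that “as.numeric” applied to a value like "f" would return NA with a warning, which is why only the first five elements are converted here.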

To select elements within objects, you need to use square brackets and specify which element you need. For instance, to see the first element of the vector “v”:

v[1]
## [1] "1"

According to the same logic you can select multiple elements

v[c(1, 3, 5)]
## [1] "1"     "10"    "10000"

For multidimensional objects, such as matrices, you need to specify the position of the element in every dimension: the first number refers to the row, the second to the column:

m[1, 2]  # Will pick the number in the first row and second column
## [1] 10

You can also select entire columns or rows

m[1, ]  # First row
## [1]  1 10
m[, 1]  # First column
## [1] 1 2

An important class of objects are the “data frames”. They look like matrices, but rows are assumed to be observations and columns are assumed to be variables. This implies that different columns can contain different classes of values:

dat <- data.frame(X1 = c(1, 2, 3, 10), 
                  X2 = c("a", "b", "c", "d"))
dat
##   X1 X2
## 1  1  a
## 2  2  b
## 3  3  c
## 4 10  d

Since columns are regarded as variables, we can see what variables are contained in a data frame by asking for the “names”:

names(dat)
## [1] "X1" "X2"

You can select variables in a data frame by using square brackets:

dat[, 1]  # Select the first column, AKA the first variable
## [1]  1  2  3 10

or by calling their name directly, using the dollar sign:

dat$X1
## [1]  1  2  3 10

Square brackets are very useful in data frames when you need to select some specific observations:

dat[1, ]  # Shows all the variables for the observation in the first row
##   X1 X2
## 1  1  a
dat[dat$X1 > 2, ]  # Shows all observations for which X1 is bigger than 2
##   X1 X2
## 3  3  c
## 4 10  d
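The same logic extends to more than one condition, and to selecting observations and variables at the same time. The following is a minimal sketch that rebuilds the small data frame from above; the object names “both” and “x2.selected” are made up for this illustration.

```r
dat <- data.frame(X1 = c(1, 2, 3, 10),
                  X2 = c("a", "b", "c", "d"))

# Combine two conditions with "&" (and): X1 bigger than 2 AND X2 different from "d"
both <- dat[dat$X1 > 2 & dat$X2 != "d", ]
both

# Keep only the variable X2 for the observations where X1 is bigger than 2
x2.selected <- dat[dat$X1 > 2, "X2"]
x2.selected
```

The condition goes before the comma because it selects rows; whatever comes after the comma selects columns.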

In the R console, you can see what objects are stored in the workspace (R’s memory) by typing:

ls()
## [1] "char" "dat"  "m"    "v"    "x"

or

objects()
## [1] "char" "dat"  "m"    "v"    "x"

However, RStudio makes your life easier by showing in the top-right panel the list of objects that are stored in your workspace, divided by categories (e.g. Data, Values), and with a brief description of their content.

Functions

Functions are a very important tool in R: every operation that R performs is a function call. A function takes some arguments as input, performs a certain operation, and returns some output. Functions usually follow the form f(argument1, argument2, ...) and, like any other value, can be stored in objects:

plusone <- function(x) x + 1
plusone(5)
## [1] 6

You can often see the code for a function by typing its name:

plusone
## function(x) x + 1

Even when they take only one argument, many functions can be applied to entire vectors or matrices: because arithmetic in R works element by element, the function takes each entry as input and returns another vector or matrix where every number has passed through the function.

plusone(c(1, 3, 5, 99))
## [1]   2   4   6 100

Of course, functions can be far more complex than this, and return all kinds of results. However, the logic is always the same: you define what argument(s) a function accepts, you specify what the function does with those arguments, and the function will give you a result.
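As an illustration of a slightly richer function, here is a sketch with two arguments, one of which has a default value. The function name “center.scale” and its arguments are made up for this example; it subtracts the mean from a vector and, if asked, divides by the standard deviation.

```r
# "center.scale" is a made-up name, used only for this illustration
center.scale <- function(x, scale = FALSE) {
    out <- x - mean(x)              # Subtract the mean from every value
    if (scale) out <- out / sd(x)   # Optionally divide by the standard deviation
    return(out)
}

centered <- center.scale(c(1, 2, 3, 10))                     # Uses the default, scale = FALSE
standardized <- center.scale(c(1, 2, 3, 10), scale = TRUE)   # Overrides the default
centered
mean(standardized); sd(standardized)
```

Arguments with a default can be left out when calling the function, which keeps the most common call short.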

Some useful functions

seq(from = 1, to = 5, by = 0.5)  # It returns a sequence of numbers from 1 to 5, with 0.5 increment
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
seq(from = 1, to = 2, length.out = 10)  # It returns a sequence of 10 numbers from 1 to 2
##  [1] 1.000000 1.111111 1.222222 1.333333 1.444444 1.555556 1.666667
##  [8] 1.777778 1.888889 2.000000
rep(1:3, 5)  # Repeats the sequence from 1 to 3 for five times
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
rep(1:3, each = 5)  # Repeats each number in the sequence from 1 to 3 for five times
##  [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
rep(c("a", "b", "c"), 5)  # Accepts character values as well
##  [1] "a" "b" "c" "a" "b" "c" "a" "b" "c" "a" "b" "c" "a" "b" "c"
sample(c(1:199), size = 2)  # It samples two numbers from the population "c(1:199)"
## [1]  77 130
sample(c(1:5), size = 6, replace = TRUE)  # Samples with replacement
## [1] 4 2 5 5 1 4

Packages

While R can do many things, it cannot do all of them by itself. Base R is actually a relatively limited tool from the point of view of statistical analysis. However, a lot more can be done, often more than with commercial software, by using “packages”. Generally speaking, a package is a set of functions, help files and data files that have been put together and are related to one another. R offers a lot of packages (at the time of writing, the CRAN network has 8869 available packages!). Many of them are written by statisticians and make it possible to apply cutting-edge techniques that in other contexts would require a lot of programming skills.

Even if you have just downloaded R, you already have a number of packages installed on your computer. However, not all of them are loaded in your session: some packages take a lot of memory, so it is better to just “call” them when needed. Moreover, some packages use functions that have the same name as functions in other packages that do different things. In such a case R does not return an error: the package loaded last “masks” the function with the same name from the package loaded earlier, and R prints a message listing the masked objects when the package is loaded. Calling the function by its bare name then runs the version from the most recently loaded package, which may not be the one you meant.
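When two packages define a function with the same name, one way to be explicit about which one you want is the double colon, written as package::function. The sketch below only uses packages that ship with R (“stats” and “tools”), and the object “x” is made up for this illustration.

```r
x <- c(2, 4, 4, 10)

# Same function, called without and with the package prefix
median(x)
stats::median(x)

# The prefix also works for a package that has not been loaded with "library":
# "file_ext" from the "tools" package returns the extension of a file name
tools::file_ext("somedata.dta")
```

Writing the prefix is more verbose, but it removes any doubt about which version of a function is being called.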

You can see what packages are loaded in your session at the moment:

(.packages())
## [1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [7] "base"

You can also see all the packages that you have installed, which you can load when you need them:

(.packages(all.available = TRUE))

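To load a package you use the “library” or the “require” function. If the package is not installed, “library” stops with an error, while “require” only gives a warning and returns FALSE. For instance, the “foreign” package allows you to read and write data in formats that are used by other statistical software, including Stata and SPSS.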

library(foreign)
# or
require(foreign)

If a package is not installed on your machine, you can install it using the “install.packages” function:

install.packages("ggplot2", dependencies = TRUE)  # This package makes very nice plots, we will use it a lot

Some packages are built on other packages, i.e. they will use functions that are taken from other user-developed packages. That’s why it is better (but not necessary) to set the option “dependencies = TRUE”. It tells R to search for all the packages that the one you want to install depends on, and install them on your computer as well.
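Since installing a package takes time and needs an internet connection, a common pattern is to install it only when it is missing. The following is a sketch of that pattern, not the only way to do it: “requireNamespace” checks whether a package is installed without loading it, and the object name “pkg” is made up for this illustration.

```r
pkg <- "foreign"   # Any package name would do; "foreign" is only an example

# requireNamespace() returns TRUE if the package is installed, FALSE otherwise
if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, dependencies = TRUE)
}

# character.only = TRUE is needed because the package name is stored in an object
library(pkg, character.only = TRUE)
```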

Loading data

We can load data in R in two ways: either the data are in our computer, or they are online. In the first case, we have to tell R in which folder the data are. For instance, I want to load some data from my Dropbox:

setwd("~/Dropbox/Teaching/GLM_ECPR_2017/Lab/Day1")  

Change the directory above to the one where you keep your data. From now on, every time we ask R to read or write data (and other things too, like figures), R will use the directory that we have specified.

Then, we can load the data by putting them into a “data.frame” object (these data are just made up, you should try with your .dta file):

data <- read.dta("somedata.dta")
head(data)  # This will show you the first few lines of the dataset
##   var1      var2
## 1    1  4.576232
## 2    2  3.720291
## 3    3  6.896662
## 4    4  6.535181
## 5    5  8.819222
## 6    6 11.219568

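The function “read.dta” is used to read data in Stata file format (.dta). Other functions exist for other formats: for example, “read.csv” (in base R) reads comma-separated text files, and “read.spss” (in the “foreign” package) reads SPSS files.

All these functions require different arguments, which you can look up in their help pages in case you need to use them.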
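If you do not have a data file at hand, you can still try the whole read/write cycle with a plain text format. The sketch below uses “write.csv” and “read.csv” from base R on a small made-up data frame, and saves it in a temporary file so that nothing is left in your working directory. All object names are made up for this illustration.

```r
made.up <- data.frame(var1 = 1:3,
                      var2 = c(4.5, 3.7, 6.9))

csv.file <- tempfile(fileext = ".csv")             # Path to a temporary file
write.csv(made.up, csv.file, row.names = FALSE)    # Write the data, without row names

from.csv <- read.csv(csv.file)                     # Read the data back into a data frame
head(from.csv)
```

With your own data you would replace “csv.file” with the name of your file, which R looks for in the working directory.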

Another way to load data in R is getting them directly from the internet. We will not need it for this course (all the data that you need is in the moodle), however if you are interested in accessing online databases directly from R, there is a pretty recent post on R-Bloggers offering a good review of the methods available.

Loops

Loops are control structures by which we tell the software to repeat a certain operation a number of times, or until something happens (e.g. until the end of a vector is reached, until a counter reaches a certain value, etc.). They are useful in some cases to recode variables (although there are more efficient ways to do it). In many cases, what you do with a loop can also be done with other specific functions (which might call a loop internally anyway), but using loops makes it somewhat easier to go back to your code and see what you did. Here we talk about three loop structures: “repeat”, “while” and “for”. We start with “repeat”:

i <- 0          # This object works as a counter, which we update inside the loop
repeat{if (i <= 25) {print(i); i <- i + 5} else break}
## [1] 0
## [1] 5
## [1] 10
## [1] 15
## [1] 20
## [1] 25

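Without the “break” command above, the loop would never stop. A “while” loop, instead, checks a condition before every iteration and stops as soon as the condition is false: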

a <- 0
i <- 1
while(i <= 10){
    a <- c(a, i)
    i <- i + 1
}
print(a)
##  [1]  0  1  2  3  4  5  6  7  8  9 10

Example of a “while” loop: the Fibonacci sequence. In the sequence, each number is the sum of the two previous numbers, given the first two numbers 0 and 1 (see the Wikipedia entry for details). Below we calculate the first 20 numbers in the sequence:

i <- 1
im1 <- 1
im2 <- 0
fib.seq <- NULL
fib.seq[1] <- im2
fib.seq[2] <- im1
while(i <= 18){    # We have already set the first two numbers, so we need to repeat 18 times
    fib.seq[i + 2] <- im1 + im2
    temp <- im1
    im1 <- im1 + im2
    im2 <- temp
    i <- i + 1
}
fib.seq
##  [1]    0    1    1    2    3    5    8   13   21   34   55   89  144  233
## [15]  377  610  987 1597 2584 4181

Finally, a “for” loop runs its body once for each element of a vector:

random.vector <- c(2, 5, 13, 0.5, 8, 100)
for(i in random.vector){
    print(i^2)    # Inside a loop you need "print" to show the value
}
## [1] 4
## [1] 25
## [1] 169
## [1] 0.25
## [1] 64
## [1] 10000
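As mentioned at the beginning of this section, what a loop does can often be done without writing one. The sketch below computes the same squares in three ways: with a “for” loop that stores the results, with plain element-by-element arithmetic, and with “sapply”. The object names for the results are made up for this illustration.

```r
random.vector <- c(2, 5, 13, 0.5, 8, 100)

# 1. A loop that stores the results instead of printing them
squares.loop <- numeric(length(random.vector))   # Empty vector to fill in
for (i in seq_along(random.vector)) {
    squares.loop[i] <- random.vector[i]^2
}

# 2. Arithmetic in R works element by element, so no loop is needed
squares.vec <- random.vector^2

# 3. "sapply" applies a function to every element and collects the results
squares.sapply <- sapply(random.vector, function(x) x^2)

squares.vec
```

The three results are the same; the second version is the shortest and usually the fastest.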

Generate random numbers

Instead of working with data, you can also do a lot of things by using random numbers. This is a very useful tool that we will use often in the next days.

normal <- rnorm(n = 1000, mean = 0, sd = 1)  # A thousand draws from a standard normal distribution
mean(normal); sd(normal)
## [1] -0.01407824
## [1] 0.9894228
hist(normal)

uniform <- runif(n = 1000, min = 0, max = 1)
mean(uniform)
## [1] 0.5030516
hist(uniform)

bernoulli <- rbinom(n = 100, size = 1, prob = 0.5)  # We flip a single coin for 100 times, with 50% probability
table(bernoulli) 
## bernoulli
##  0  1 
## 58 42
mean(bernoulli)
## [1] 0.42

The mean of the draws reflects the probability that we specified, just as the mean and standard deviation of the random normal draws above reflect the two parameters that we specified there. In this case the mean is bounded between 0 and 1: since we flip only one coin every time, even if we set the probability to 100%, we will just obtain a set of 1s. However, the binomial distribution is more general, and it allows us to flip more than one coin in each draw and count the successes:

general.bin <- rbinom(n = 100, size = 5, prob = 0.5)  # Each of the 100 draws flips 5 coins; "prob" is the probability of success of every single flip
table(general.bin) 
## general.bin
##  0  1  2  3  4  5 
##  4 13 30 36 14  3
mean(general.bin)
## [1] 2.52
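The mean of a binomial variable is the number of coins times the probability of success, here 5 × 0.5 = 2.5, which is why the sample mean above is close to 2.5. The sketch below checks this with more draws, and also builds the same variable “by hand” by flipping single coins and counting the 1s. The seed and all object names are arbitrary choices made for this illustration.

```r
set.seed(2017)   # Arbitrary seed, only to make this sketch reproducible

size <- 5
prob <- 0.5
draws <- rbinom(n = 10000, size = size, prob = prob)

size * prob      # Theoretical mean
mean(draws)      # Sample mean, close to the theoretical one

# Equivalent by hand: flip 5 single coins in each draw and count the 1s
flips <- matrix(rbinom(n = 10000 * size, size = 1, prob = prob), ncol = size)
by.hand <- rowSums(flips)
mean(by.hand)
```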

Suppose we want to create a bivariate normal distribution, i.e. two variables that are normally distributed (with parameters set by us) and are also correlated with each other (with the degree of correlation again set by us). As a first step, we need to specify a variance/covariance matrix. Since we have two variables, we need to specify a 2x2 matrix where the diagonal values are the variances of the two variables, and the off-diagonal values their covariance. Here we are specifying \(var=0.8\) for each variable and \(cov=0.5\) between them:

vcov <- matrix(c(0.8, 0.5, 0.5, 0.8), nrow = 2, ncol = 2)
vcov
##      [,1] [,2]
## [1,]  0.8  0.5
## [2,]  0.5  0.8

Second, we generate a vector of means for the two variables. We set the two means at 10 and 0:

means <- c(10, 0)

Finally, we use the “mvrnorm” function to generate the correlated data. We put it into a data frame so it is more intuitive to deal with (otherwise, the output is a matrix):

library(MASS)    # We need to load the package "MASS" in the library before using functions from it
mv.data <- data.frame(mvrnorm(n = 500, mu = means, Sigma = vcov))
names(mv.data) <- c("X1","X2")

Let’s see what the data look like:

c(mean(mv.data$X1), mean(mv.data$X2))    # Means
## [1]  9.88869800 -0.09250917
c(var(mv.data$X1), var(mv.data$X2))      # Variances
## [1] 0.7759183 0.8274134
cov(mv.data$X1, mv.data$X2)              # Covariance
## [1] 0.5038523
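Note that the covariance is not the correlation: the correlation is the covariance divided by the product of the two standard deviations, so with variances of 0.8 and a covariance of 0.5 it equals 0.5 / 0.8 = 0.625. The sketch below computes the implied correlation from the matrix and compares it with the correlation in a fresh simulated sample. The seed, the sample size of 5000 and the object names are arbitrary choices made for this illustration.

```r
library(MASS)    # Provides "mvrnorm"
set.seed(123)    # Arbitrary seed, only to make this sketch reproducible

Sigma <- matrix(c(0.8, 0.5, 0.5, 0.8), nrow = 2, ncol = 2)

# Correlation implied by the matrix: cov / (sd1 * sd2)
implied.cor <- Sigma[1, 2] / sqrt(Sigma[1, 1] * Sigma[2, 2])
implied.cor
cov2cor(Sigma)   # Base R does the same conversion for the whole matrix

# With many draws, the sample correlation should be close to the implied one
sim <- mvrnorm(n = 5000, mu = c(10, 0), Sigma = Sigma)
sample.cor <- cor(sim[, 1], sim[, 2])
sample.cor
```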

What does their bivariate distribution look like?

with(mv.data, plot(X1, X2, pch = 20))

set.seed(4682)
mean(rbinom(n = 100, size = 5, seq(from = 0, to = 1, length.out = 5)))
## [1] 2.53

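The “set.seed” function used above fixes the starting point of the random number generator, so the same code returns the same “random” numbers every time it is run. This is very useful if we want to make sure that our results are replicable.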
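A quick way to convince yourself is to draw random numbers twice after setting the same seed. The following is a minimal sketch; the object names “first”, “second” and “third” are made up for this illustration.

```r
set.seed(4682)
first <- rnorm(3)

set.seed(4682)       # Same seed again...
second <- rnorm(3)   # ...same three numbers

identical(first, second)

third <- rnorm(3)    # No seed reset: the generator moves on, so the numbers differ
identical(first, third)
```

This is why the seed is usually set once, at the top of a script, rather than before every single command.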