# Before starting:
# 1. The hash ("#") in R is equivalent to the star ("*") in Stata: it marks a comment.
# 2. In RStudio you type Ctrl+Enter to send a command from the script to the console. If you are using the basic R GUI, you type Ctrl+R instead.
# 3. When your code is longer than one line (it will happen frequently), you need to select all of it if you want R to run it. However, unlike in Stata, selecting is not always mandatory: to run a single line of syntax, you can just put the cursor on it and type Ctrl+Enter. The positive thing is that you can also highlight/select only a part of your syntax, and R will execute only that part.

#######################
# Getting help with R #
#######################

# This part is very important. You will use the help facilities a lot, at any level of expertise.
# Let's imagine you want to calculate the mean of a series of numbers. You know that R provides a function "mean()", but you don't know what to write in it. You only need to type
help(mean)
# or
?mean # (that's the shortest way to ask for help)

# Sometimes you won't find a direct link to the help page for a certain function. In this case, you can call help.search()
help.search("mean")
# or
??mean
# This will return a list of help files containing the word "mean"

# Finally
example(mean)
# will provide you with examples of how to use a function

###########################
# Basic Operations with R #
###########################

# R works very well as a calculator (like Stata, just without the need to use the "display" command):
6 + 6 + 6
6 + 6 * 6
(6 + 6) * 6
6^4 # Raises 6 to the power of 4
sqrt(9) # Square root of 9
abs(-7) # Returns the absolute value of a number

# Note the "[1]" next to the output in the console. In R, every number you write or send to the console is interpreted as a 'vector'. The "[1]" indicates that the index of the first item displayed is "1", i.e. it is the first number in the vector.
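
# A quick check of the point above. This is a minimal sketch added for illustration, using only the base functions "length()" and "is.vector()":
length(6)      # how many elements does the "single number" 6 have?
is.vector(6)   # is it treated as a vector?
6[1]           # you can even ask for its first (and only) element
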
# We can also work with entire vectors in the form of series of numbers...
c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34) # (the "c" means "combine")
# ...or sequences:
1:100
# This is the same as writing:
c(1:100)
# Notice the number at the beginning of each line of the output. It tells you which position within the vector the first number appearing on the line occupies.

# Some examples to see the R logic:
# here you're performing an operation on two vectors
c(1, 2, 3, 4) * c(100, 200, 300, 400)
# here too, but you'll get the same result as the following (shorter) command
c(1, 2, 3, 4) * c(100, 100, 100, 100)
# multiply a vector and a scalar
c(1, 2, 3, 4) * 100
# when the length of one vector is a multiple of the length of the other
c(1, 2, 3, 4) * c(100, 200)
c(1, 2, 3, 4, 5, 6) * c(100, 200, 300)
c(1, 2, 3, 4) * c(100, 200, 300) # meet the "Warning message"

# When you get a Warning message, R is telling you that it is doing what you required, but things are not going as smoothly as you planned.
# In our case, the length of one vector is not a multiple of the length of the other. R always "recycles" the shorter vector, i.e. it starts again from its first element once it runs out of values. Here the shorter vector runs out after three elements, so the 4th number of the longer vector is multiplied by the 1st number of the shorter one, and R warns you that the recycling did not come out even.
# A different story is the "Error message". That tells you that R does not know what to do with the command you gave. 9 times out of 10 it's due to a typo in the syntax. However, it can also be due to wrong commands, specifications, problems with the type of data, etc.
c(1, 2, 3, a) # "a" is not a number, so R assumes it's an object; since no object called "a" exists yet, you get an error.

###########
# Objects #
###########

# What is an object?
# In abstract terms, objects are items that contain a certain amount of information. Since R is an object-oriented language, you can do (and in fact you do) everything with objects. Objects are like Stata global macros, but they are more general, as they can contain more types of information.
# You can think of an object as a variable, to which you can assign values.
x <- 7
x
# Notice that something appeared in the top-right panel. This is one of the cool points of RStudio.

# You can see what is inside an object in two ways:
x # either you just 'call' the object
print(x) # or you apply the "print" function

# Objects can be given arbitrary names
Beatles <- c("John", "Paul", "George", "Ringo")
print(Beatles)
`Sgt. Peppers Lonely Heart's Club Band` <- Beatles
`Sgt. Peppers Lonely Heart's Club Band`
# or
print(`Sgt. Peppers Lonely Heart's Club Band`)
# If the name contains special characters (or spaces), then it must be enclosed in backticks "`".

# Also notice that objects do not necessarily have to contain numbers.
# However, to specify that the item that you add to the object is to be read as a character string (and not as a number or an object name) you need to use quotation marks, single or double: '', ""
# So:
object <- word # Produces an error
object <- "word" # This works
object <- 'word' # This works too
object

# The symbol "=" has the same function as "<-"
one <- 1
two = 2
# That is different from the "equal to" operator
one == two # The "==" means "Is the value of 'one' the same as the value of 'two'?"
# To avoid confusion I recommend that you keep using "<-" to assign values to objects

# Other relational operators can be used to compare objects:
one < two
one <= two
one > two
one != two

# How to see the objects in memory
# (This is not a big issue with RStudio, since they are shown in the top-right panel)
ls() # This shows you what is stored in R's memory
objects() # This works too

# Objects can also be deleted, if you don't need them
rm(one)
one # Error: the object no longer exists
# This may seem useless right now, but it becomes very important when you have very large objects (e.g.
# full datasets) in memory, and R slows down

# You can also remove all the objects in memory
rm(list=ls())

###########################
# Operations with objects #
###########################

y <- 1:10
x <- 7
y / x
# Which is the same as writing
1:10 / 7

# Objects can be combined together into other objects
z <- 6
a <- c(x, y, z)

# You can select any element in any vector by using the square brackets
a[1]
a[12]
# You can select multiple elements
a[1:4]
# Think about how this can be related to selecting variables in datasets

# Elements in a vector can be given names
vec <- 1:3
vec
names(vec) <- c("a", "b", "c")
vec
vec["a"] # Names must be specified with quotation marks

# You can select elements using relational and logical operators
a[a > 5]
a[a > 5 & a < 8]
a[a == 3 | a > 7]
a[a > 7 & a != 10]

# Notice the difference between
a[7]
# and
a[a==7]
# The first example selects the 7th item in "a", which does not correspond to the number "7"

# A 2-dimensional object: the MATRIX
b <- matrix(data = 1:25, nrow = 5, ncol = 5)
b
# Notice the commas next to the row and column indexes
# Also, notice that the numbers are ordered by columns, rather than by rows. This is called "column-major order".
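
# A minimal sketch of what column-major order implies (added for illustration; "m" is a made-up name, so that "b" is left untouched). Because a matrix is stored column after column, you can also pick an element with a single index that counts down the columns:
m <- matrix(data = 1:25, nrow = 5, ncol = 5)
m[6]           # the 6th stored value, i.e. the first element of the 2nd column
m[1, 2]        # the same element, selected by row and column
as.vector(m)   # the underlying values, in the order in which they are stored
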
# You can fill the matrix by row by adding the option "byrow"
b.row <- matrix(data = 1:25, nrow = 5, ncol = 5, byrow = TRUE)
b.row

# When selecting elements within matrices, you need to type two numbers: the first number refers to the row, the second to the column
b[1, 4]
# If you only type the number BEFORE the comma, you select the whole row
b[1, ]
# If you only type the number AFTER the comma, you select the whole column
b[, 4]

# You can replace single elements within a matrix by selecting them in the same way
b[1, 4] <- 1000
b

# You can select columns and perform operations with them
b[, 1] + b[, 3]
# Think about this as the sum of two variables in a dataset

# You can select items in an arbitrary order
b[, c(1, 5, 2)] # and they will be returned in the order you asked for
b[c(4, 2), ] # another example, with rows
# Notice that the output does not report the "original" row and column numbers, but the numbers that the selected rows and columns have in the newly defined object.

# Moreover, you can select rows or columns of a matrix and put them into another object to make a new matrix
e <- b[c(4, 2), ] # Now the 1st row of "e" is the 4th row of "b"
e[1, ]
b[4, ]

# The same rules for operations apply here
b + 1
b * x
d <- -2:2
b * d # Notice that the vector is recycled down the columns, rather than along the rows
# The command above gives an element-by-element multiplication of the numbers in the matrix with the numbers in the vector.
# If you want to perform a matrix multiplication, then you need a different symbol:
b %*% d
# The command above gives you the result of the matrix multiplication of b by d, which is equivalent, for each row, to:
b[1, 1]*d[1] + b[1, 2]*d[2] + b[1, 3]*d[3] + b[1, 4]*d[4] + b[1, 5]*d[5]
b[2, 1]*d[1] + b[2, 2]*d[2] + b[2, 3]*d[3] + b[2, 4]*d[4] + b[2, 5]*d[5]
b[3, 1]*d[1] + b[3, 2]*d[2] + b[3, 3]*d[3] + b[3, 4]*d[4] + b[3, 5]*d[5]
b[4, 1]*d[1] + b[4, 2]*d[2] + b[4, 3]*d[3] + b[4, 4]*d[4] + b[4, 5]*d[5]
b[5, 1]*d[1] + b[5, 2]*d[2] + b[5, 3]*d[3] + b[5, 4]*d[4] + b[5, 5]*d[5]

# You can add rows and columns to a matrix through the "rbind" and "cbind" functions
cbind(b, d)
rbind(b, d)
b <- cbind(b, d)
b
# This will help a lot when you have to "append" data (remember, from Stata?)

# Again, you can give names to rows and columns in matrices, and call elements by name:
rownames(b) <- c("r1", "r2", "r3", "r4", "r5")
colnames(b) <- c("c1", "c2", "c3", "c4", "c5", "c6")
b
b[, "c6"]
b["r3", ]
b["r2", "c3"]

#----------#
# Exercise #
#----------#

# Create a matrix called "max" with 10 rows and 10 columns with the numbers from 1 to 100
# Create a vector called "vex" containing the numbers from 1 to 10
# Add "vex" as a row at the bottom of "max"
# Which number occupies the position 67 of "max"?
# Which numbers are in column 3?
# Which number occupies the position in row 8, column 4? Replace it with 5500
# Delete the last row of "max" (the one added with "vex")

##########################
# Other types of objects #
##########################

# So far we have seen objects like SCALARS (single numbers), VECTORS (or, more precisely, "atomic vectors", i.e. sequences of elements of the same type) and MATRICES (two-dimensional sets of elements of the same type).
# Other types of objects that R can manage are ARRAYS, LISTS and DATA FRAMES

# ARRAYS are like vectors or matrices, but they can have more than 2 dimensions.
# When you define an array, you need to add an additional attribute for the dimensionality.
arr <- array(1:27, dim = c(3, 3, 3))
arr
# You can select elements in the same way as we have seen for the matrices
arr[1, 1, 1]
arr[1, 3, 1]
arr[1, 1, 3]
arr[, , 1] # a "slice", i.e. a matrix which is a subset of our array
arr[1, , ] # a matrix made of the first row of each submatrix
arr[, 1, ] # a matrix made of the first column of each submatrix

# LISTS are vectors of other objects, and can contain elements of any type
lt <- list(arr, b, 1:10)
lt[[1]] # This contains the "arr" array
lt[[2]] # This contains our "b" matrix
lt[[3]] # This contains a sequence from 1 to 10

# Finally, we have DATA FRAMES
# This is the type of object that you will use most as you work with data.
# Data frames are tables where rows are assumed to be observations and columns variables.
# Unlike a matrix, a data frame can contain a different type of data in every column:
data <- data.frame(var1 = 1:10, var2 = letters[1:10])
data
# Since data frames are two-dimensional, the same selection rules that you saw for matrices apply:
data[, 1]
data[1, ]
# However, when selecting variables it is more comfortable to call them by their name, using the dollar ("$") symbol:
data$var1
data$var2

# You can also form data frames using "cbind":
data <- data.frame(cbind(1:10, letters[1:10]))
# However, in this case R will assign the variables default names ("X1", "X2"). Also note that "cbind" first builds a matrix, which can hold only one type of data, so the numbers are no longer stored as numbers.
data
data$X1
data$X2
# You can change all variable names:
names(data) <- c("var1", "var2")
data
# or just change a single variable's name:
names(data)[1] <- "newvar1"
data
# You can select and rename a specific variable by its name using the "which()" function:
names(data)[which(names(data)=="var2")] <- "newvar2"
data
# This last command will be extremely useful when working with data.

#############
# Functions #
#############

# All the operations that R performs are functions. They usually follow the form
# f(argument1, argument2, ...)
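
# A minimal sketch of this form, using the "matrix()" function that we met earlier (added for illustration; "m1", "m2" and "m3" are made-up names). Arguments can be matched by position or by name; when you name them, the order no longer matters:
m1 <- matrix(1:6, 2, 3)                         # arguments matched by position
m2 <- matrix(data = 1:6, nrow = 2, ncol = 3)    # the same arguments, matched by name
m3 <- matrix(ncol = 3, nrow = 2, data = 1:6)    # named arguments in a different order
identical(m1, m2)
identical(m1, m3)
args(matrix)                                    # shows the arguments that a function accepts
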
# Of course, they can also be put into objects
f <- function(x, y){c(x + 1, y + 1)}
f(1, 2)
# The one above will take any object as a single entry, even when it has more than 1 value
onetwo <- c(1, 2)
f(onetwo) # Error: the function wants 2 entries, you gave only one object
threefour <- c(3, 4)
f(onetwo, threefour)

# Often you can see the code for a function by typing its name
f
# This is very handy if you are trying to write a function and want to borrow some ideas from a similar one

# Some common functions:
numbers <- (0:10)
sum(numbers) # returns the sum of all the elements in a numeric vector
length(numbers) # returns how many values are in a vector
mean(numbers) # returns the Mean of a set of numbers
var(numbers) # returns the Variance
sd(numbers) # returns the Standard Deviation
median(numbers) # returns the Median value
min(numbers) # returns the Minimum value
max(numbers) # returns the Maximum value

# Less common yet useful functions:

# With "seq()" you create a sequence of numbers
seq(from = 1, to = 10, by = 0.5) # gives you a sequence of numbers from 1 to 10, with 0.5 increment
seq(from = 1, to = 2, length.out = 100) # gives you a sequence of 100 evenly spaced numbers from 1 to 2

# With "rep()" you create a sequence in which the same number (or sequence) is repeated n times
rep(5, 10) # Repeats the number 5 ten times
rep(1:3, 10) # Repeats the sequence from 1 to 3 ten times
rep(1:3, each = 10) # Repeats each number from 1 to 3 ten times

# "table()" makes cross-tables
# (crosstabs in R are particular types of array)
se.q <- rep(1:3, 10)
table(se.q)
# How do you see percentages?
# With table() you have to calculate them yourself
table(se.q)/length(se.q)*100

# With "round()" you can round numbers, in case you are bothered by many decimals
tab <- table(se.q)/length(se.q)*100
tab
round(tab, digits = 1)
tab <- round(tab, digits = 1)
tab

# "sample()" extracts some random values from a vector
sequence <- seq(0, 9, 3)
sequence
sample(sequence, size = 2) # "size = 2" means to draw two elements from the population "sequence"
# The size of your sample can be even larger than your population, but you need to add the option "replace"
sample(sequence, size = 6, replace = TRUE)
# With "replace = TRUE" every element is put back after being drawn, so the sampler draws from the whole population every time. This means that the same element can appear more than once

# You can also specify the probability for each element to be sampled
# EXAMPLE: a coin flip, a thousand times
flip <- sample(c(0, 1), size = 1000, prob = c(0.5,0.5), replace = TRUE)
table(flip)
table(flip)/length(flip)*100
# What if we change the probabilities?

#----------#
# Exercise #
#----------#

# Simulate a Russian Roulette
# You have a gun with 6 chambers, you put a bullet in one chamber, you spin the cylinder, you shoot.
# Use the "sample()" function to simulate this process.
# First try it once, and see what happens: did you make it?
# Then try it 100 times, and see what happens: how many times would you have made it?
# Put your trials in an object called "r.r", and make a table out of it. What is the chance of survival?
# Calculate the mean of "r.r". What does that value tell you?

##################
# Special Values #
##################

# R has some special values that you will meet from time to time

# NA: this is a missing value, like the "."
# in Stata
somedata <- c(1, 2, 3)
length(somedata) <- 4
somedata # if the length of a vector is bigger than the data that you have, R will plug in an NA

# Inf and -Inf
# Sometimes numbers are just too big
2^1024
-2^1024
# Or, if you divide by zero
1/0

# NaN: not a number
# Something does not make sense
0/0
Inf - Inf

# NULL: just a null object, the emptiness
NULL
NULL + 1
c(NULL , 1, 2, 3)

############
# Packages #
############

# What is a package?
# Generally speaking, a package is a set of functions, help files and data files that have been put together and are somehow related to one another. Many packages are written and managed by the "R core team", but way more are written and managed by users.
# If you need to do something very specific or advanced, there is probably an R package that can do it.
# R offers a lot of packages - at the time of writing, the CRAN network has 7822 contributed packages!

# To see the packages that you have loaded at the moment you type:
(.packages())
# To see all the packages that you have installed, and that you can load from the library, you can type:
(.packages(all.available = TRUE))
# or:
library() # This will open a new window though

# You will also use the "library" function to load a package
library(foreign)
# In this way you are loading the package "foreign", which allows you to read data from other formats, including Stata and SPSS.
# An equivalent function is:
require(foreign)
# You can generally use "library" and "require" interchangeably.

# R does not load all the installed packages because some of them may slow down the execution significantly, and because some packages may use functions that have the same name but do different things. In such a case, when you call a function, R uses the version from the package that was loaded most recently (it "masks" the other one), which may not be the one you meant. R prints a message about the masked functions when you load such a package.
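
# A minimal sketch of how to avoid any ambiguity between functions with the same name (added for illustration): the "::" operator calls a function from a specific package. The packages "stats" and "base" ship with every R installation, so this runs as it is.
stats::sd(c(1, 2, 3, 4))    # "sd" taken explicitly from the "stats" package
base::mean(c(1, 2, 3, 4))   # "mean" taken explicitly from the "base" package
# To see which of the loaded packages define a function called "sd":
find("sd")
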
# However, there are packages that are not installed by default with the version of R that you have on your computer. To use those packages, you have to install them.
# Example: the package "ggplot2" is used to create very nice charts
install.packages("ggplot2", dependencies = TRUE)
# Some packages build on other packages, i.e. they use functions that are taken from other user-developed packages. That's why it is better (but not necessary) to set the option "dependencies = TRUE". This option tells R to search for all the packages that the one you want to install depends on, and install them on your computer as well.
# You can also remove packages that you don't need, using "remove.packages()"

#############################
# Load and explore the data #
#############################

rm(list=ls())

# First we set the working directory - we tell R where it can find the dataset
setwd("~/Dropbox/CEPIS Stats Bootcamp 2016/Data/CMP_2015") # set to your working directory, like "cd" in Stata
setwd("C:/Users/Vanguard/Dropbox/CEPIS Stats Bootcamp 2016/Data/CMP_2015")
# From now on, for everything we ask R to load, it will assume it is in the directory we have specified.

# Let's put the CMP data into an object called "MyData" using the "read.dta" function
MyData <- read.dta("manifesto2015.dta", convert.factors = FALSE) # This function is used to read the Stata file format (.dta)
# Other functions are:
# read.spss() = reads data in the SPSS .sav format
# read.table() = reads data in .dat format, or tab delimited data also in .txt
# read.csv() = reads data in .csv format, a special case of "read.table()"

# Note that we can also specify the full path within the read.dta function, in case we use data from several directories.
MyData <- read.dta("~/Dropbox/CEPIS Stats Bootcamp 2016/Data/CMP_2015/manifesto2015.dta", convert.factors = FALSE)

# To see the data, in RStudio we can simply click on the object in the top-right panel.
# This will automatically call the "View" function, and open the data matrix in a new window
View(MyData)

# To make it quick, you can just look at the names of the variables in your dataset
names(MyData)

# The "summary" function gives you a very basic description of the distribution of the variables contained in the data set. Conceptually it's similar to the summarize command in Stata, and when the object is the whole data frame it will show summary statistics for every variable
summary(MyData)

# As we saw, the way to refer to single variables in data frames is with the dollar "$" sign
MyData$rile # will show you the whole vector making up the variable "rile"

# Descriptive statistics for individual variables
summary(MyData$rile)
mean(MyData$rile) # what's going on here?
sd(MyData$rile) # what's going on here?
min(MyData$rile) # what's going on here?
max(MyData$rile) # what's going on here?

# Missing values are a bit problematic when we calculate statistics on variables. Basically, it's enough to have 1 NA in our data, and the result of many operations will also be an NA. To avoid that, different functions have different options. For this set of functions the option is "na.rm = TRUE", which stands for "remove NAs from the calculation"
mean(MyData$rile, na.rm = TRUE)
sd(MyData$rile, na.rm = TRUE)
min(MyData$rile, na.rm = TRUE)
max(MyData$rile, na.rm = TRUE)
# Note that you ALWAYS have to specify the name of the data frame where the variable comes from

# You can also refer to specific subgroups, using relational and logical operators.
# You need to put them in square brackets, as if you were selecting some cases within a vector (which is in fact what you are doing)
MyData$rile[MyData$rile > 0]
MyData$rile[MyData$rile > 0 & MyData$rile < 10]

# The "subset()" function helps a lot in subsetting the data, both in terms of observations and variables
MyDataRight <- subset(MyData, rile > 50, select = c(countryname, date, partyname)) # This returns a data set containing only the country, election date and party name for the parties which are positioned on the extreme right (at least according to the rile index)
MyDataRight

########################
# Operations by groups #
########################

# Doing operations by different groups is less intuitive in R than in Stata. For instance, how can we see the mean left-right position for different party families? In Stata, it was enough to combine the "tabulate" and "summarize" commands; here there is nothing like that.
# For most of the counterintuitive things we need to do, in R we have 2 choices:
# 1) Try with base R syntax - it can take hours.
# 2) Use a package that makes it easier for us.
# We don't want to waste our day on syntax, so we will take a package that makes it easy to perform operations and obtain descriptive statistics for different groups
install.packages("plyr")
library(plyr)
# NOTE that "plyr" is not the newest (hence best) package available at the moment to do what we need to do here. However, it's a first step in the right direction.
# The package "plyr" has a vast range of functions that work on 3 types of objects: data frames, lists, and arrays. We will focus here on a very narrow set of functions that allow us to calculate simple descriptive statistics on data by different groups in a data frame.
# The syntax is simple:
ddply(MyData, .(parfam), summarize, m_rile = mean(rile, na.rm = TRUE))
# What are the components of this command?
# Note 2 things:
# 1) The "mean" function that I used here is exactly the same as the 'simple' mean function in R. The only difference is that, by putting it into the "ddply" function, I am applying it to different groups.
# 2) The "summarize" option: here we are telling R that we don't want to transform the data, just look at them. Let's change the option to "transform" and see what happens.
ddply(MyData, .(parfam), transform, m_rile = mean(rile, na.rm = TRUE))
# The "transform" option just creates a new dataset with the new variable containing group-level means added to it. This will be extremely useful in many cases. You can put such a dataset into a new object:
MyDataMeanRile <- ddply(MyData, .(parfam), transform, m_rile = mean(rile, na.rm = TRUE))
names(MyDataMeanRile)

# Note that we can combine different grouping variables.
ddply(MyData, .(countryname,date), summarize, m_rile = mean(rile, na.rm = TRUE))

# Moreover, note that you can ask for more than one statistic at the same time:
round(ddply(MyData, .(parfam), summarize,
            m_rile = mean(rile, na.rm = TRUE),
            sd_rile = sd(rile, na.rm = TRUE),
            min_rile = min(rile, na.rm = TRUE),
            max_rile = max(rile, na.rm = TRUE)), 2)
# This output is similar to using "tabulate" together with "summarize" in Stata.

#####################################################
# Create new variables and recode the existing ones #
#####################################################

# Example: let's create a variable to identify extreme parties on the left-right. So we make a dummy variable that is equal to 1 if the 'rile' position is smaller than -50 or larger than 50
MyData$rile_ext <- NA
MyData$rile_ext[MyData$rile < -50 | MyData$rile > 50] <- 1
MyData$rile_ext[MyData$rile >= -50 & MyData$rile <= 50] <- 0
mean(MyData$rile_ext, na.rm = TRUE) # About 2.8% of the parties are so extreme

# A faster way to do this is by using the "ifelse" function.
# "ifelse" is a conditional function.
# The structure of the function is simple: you tell R that IF a condition applies DO X, otherwise DO Y.
MyData$rile_ext_alt <- ifelse(MyData$rile < -50 | MyData$rile > 50, 1, 0)
table(MyData$rile_ext, MyData$rile_ext_alt, useNA = "always")

# We may want to divide a continuous variable into categories. Let's take the rile index as an example:
summary(MyData$rile)
hist(MyData$rile)
# We want 5 categories here:
# 1 = from lowest to -20
# 2 = from -20 to -5
# 3 = from -5 to +5
# 4 = from +5 to +20
# 5 = from +20 to highest
# One way to do it is to define every single category by itself
MyData$rile_cat <- NA
MyData$rile_cat[MyData$rile <= -20] <- 1
MyData$rile_cat[MyData$rile > -20 & MyData$rile <= -5] <- 2
MyData$rile_cat[MyData$rile > -5 & MyData$rile <= 5] <- 3
MyData$rile_cat[MyData$rile > 5 & MyData$rile <= 20] <- 4
MyData$rile_cat[MyData$rile > 20] <- 5
table(MyData$rile_cat)
# A faster way to do this is to use the "cut" function:
MyData$rile_cat_alt <- cut(MyData$rile, breaks = c(-Inf, -20, -5, 5, 20, Inf), labels = c(1, 2, 3, 4, 5), right = TRUE)
table(MyData$rile_cat_alt, MyData$rile_cat)

# Doing operations with different variables is straightforward.
# As we did with Stata, let's compute the party positions on a "cultural traditions" dimension made of national way of life and traditional morality.
# This time, let's do it in one line (it's possible with Stata too, btw)
MyData$cultrad <- (MyData$per601 + MyData$per603) - (MyData$per602 + MyData$per604)
hist(MyData$cultrad) # Again, looks bad.
# EXERCISE: Remember how the log transformation worked? Try that (in 1 line of code)

# As you may have noticed, NAs need a bit more attention in R than in Stata.
# The function "is.na" indicates which elements of a vector are missing, and we can use it to exclude them or flag them somehow:
MyData[is.na(MyData$rile), ] # Note the comma. Why is it there?
MyData$partyname[is.na(MyData$rile)] # And why isn't it there now?
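
# A minimal sketch of the two selections above on a made-up data frame (added for illustration; "toy", "party" and "score" are hypothetical names, not part of the CMP data):
toy <- data.frame(party = c("A", "B", "C"), score = c(10, NA, -5))
is.na(toy$score)              # one TRUE/FALSE per observation
toy[is.na(toy$score), ]       # a data frame has rows AND columns: rows go before the comma, and an empty space after it means "all columns"
toy$party[is.na(toy$score)]   # a single variable is a vector with only one dimension, so there is nothing to put after a comma
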
# We can use "is.na" combined with the exclamation mark ("!") to exclude observations from the data that have missing values in some variables. We can put the new reduced data into a new object.
MyDataClean <- MyData[!is.na(MyData$rile),]
summary(MyDataClean$rile)

#################
# Save the data #
#################

# With R we can also save data sets in the format that we prefer.
# The "foreign" package allows us to write in Stata and SPSS format
# to Stata
write.dta(MyDataClean, "data_clean.dta")
# to SPSS
write.foreign(MyDataClean,
              "~/Dropbox/CEPIS Stats Bootcamp 2016/Data/CMP_2015/data_clean.txt",
              "~/Dropbox/CEPIS Stats Bootcamp 2016/Data/CMP_2015/data_clean.sps",
              package = "SPSS")
# In fact, this routine exports the data into a text data file and creates an SPSS syntax file that reads it and converts it into an SPSS data file. However, since the "sps" file needs the address of the "txt" file, you have to provide the complete path for SPSS to find it.

# The "write.table" function is a generic function that you can use to write data in several basic formats, including "csv".
write.table(MyDataClean, "data_clean.csv", na = "", sep = ",")
# However, "write.csv()" will work as well.
write.csv(MyDataClean, "data_clean.csv", na = "")

# You can also save the entire workspace (i.e. your working environment, consisting of all the objects that you have defined).
save.image("MyWorkspace.RData")
# Then you can load the workspace in the future, by using the "load" command.
load("MyWorkspace.RData")
# This is convenient if your workspace includes objects that take time to create (for instance outputs of statistical models that take a long time to calculate).
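
# If only one object is expensive to re-create, an alternative is to save just that object. A minimal sketch with the base functions "saveRDS()" and "readRDS()" (added for illustration; "toy", "tmp" and "toy_back" are made-up names, and the file goes to a temporary location so that nothing in your working directory is touched):
toy <- data.frame(id = 1:3, value = c(2.5, NA, 7))
tmp <- tempfile(fileext = ".rds")
saveRDS(toy, tmp)             # writes a single object to disk
toy_back <- readRDS(tmp)      # reads it back; you choose the name it gets
identical(toy, toy_back)      # checks that nothing was lost on the way
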
#----------#
# Exercise #
#----------#

# 1) Use ddply to obtain the mean, standard deviation, minimum and maximum of the 'rile' index by party families (as we did above) taking ONLY Croatian parties (use 'summarize', not 'transform')
# 2) Put the data frame containing these values into an object called "rileHR"
# 3) Save the object in .csv format

######################
# Visualize the data #
######################

# There are two ways to visualize the data: with the basic "graphics" package, which comes by default with R, or using some more sophisticated packages that allow you to create nicer and possibly more sophisticated graphs. Note that with basic R you can already make more or less every kind of plot that you want. However, that can be extremely difficult, verbose (i.e. it will require a lot of code) and frustrating.
# Here we will focus a bit on basic R plots, and on the package "ggplot2", because it's one of the best ones (if not the best one) around

# HISTOGRAMS
hist(MyData$rile)
hist(MyData$rile, col = "blue", main = "", xlab = "RILE Index")
hist(MyData$rile, col = "blue", main = "", xlab = "RILE Index", freq = FALSE)

library(ggplot2)
ggplot(MyData, aes(x = rile)) + geom_histogram()
ggplot(MyData, aes(x = rile)) + geom_histogram(color = "black", fill = "blue") + xlab("")
ggplot(MyData, aes(x = rile, ..density..)) + geom_histogram(color = "black", fill = "blue") + xlab("")
ggplot(MyData, aes(x = rile, ..density..)) + geom_histogram(color = "black", fill = "blue") + xlab("") + theme_bw()

# If you want to make a histogram for every country:
ggplot(MyData, aes(x = rile, ..density..)) +
  geom_histogram(color = "black", fill = "blue") +
  facet_wrap(~countryname) +
  xlab("") +
  theme_bw()
# It is recommended to 'fold' the code when you write very long commands, so everything is better organized

# Example: let's visualize how the left-right position of the British parties changed over time.
# To do this we need three variables: the rile index, the election date, and some indicator to see what party we are talking about.
# Moreover, we need to select only the subgroup of observations belonging to the UK
MyDataUK <- subset(MyData, countryname == "Great Britain", select = c(rile, date, partyname))
# Note: the line width is a fixed value, not a variable, so it goes outside "aes()"
ggplot(MyDataUK, aes(x = date, y = rile)) +
  geom_line(aes(color = partyname), lwd = 3) +
  theme_bw()
# Looks messy, better to plot single parties in individual panels
ggplot(MyDataUK, aes(x = date, y = rile)) +
  geom_line() +
  facet_wrap(~partyname) +
  theme_bw()

# We can replicate the plot from yesterday, with different colors for different parties (recall we didn't make it in Stata) - let's do it only for British parties
# First, we add the "welfare" variable:
MyDataUK <- subset(MyData, countryname == "Great Britain", select = c(rile, date, partyname, welfare))
ggplot(MyDataUK, aes(x = welfare, y = rile)) +
  geom_point(aes(color = partyname)) +
  theme_bw()
# We can see whether there are different patterns of association for different parties:
ggplot(MyDataUK, aes(x = welfare, y = rile)) +
  geom_point(aes(color = partyname)) +
  facet_wrap(~partyname) +
  theme_bw()

# This is only one small example of the potential of "ggplot2". However, before being able to make better looking and more informative plots you will need to master data cleaning and management. We cannot delve too much into plotting here: there is not enough time and there are too many things to cover. But I recommend that you play with it as much as possible, so you will learn how to enrich your papers (or reports, or presentations) with highly informative and pretty visualizations. This is one of the directions where data work is going.

#########################
# Append and Merge data #
#########################

# Yesterday we saw how to append (add observations) and merge (add variables) datasets. In R this is even easier than in Stata.
# First, we load the other dataset that we want to append, and we put it into another object:
MyDataUSA <- read.dta("manifesto2015_USA.dta", convert.factors = FALSE)
# Do the two files have the same variables?
length(names(MyData))
length(names(MyDataUSA))
# Nope, we created new variables in our data that are not in the USA data. If you try to append data frames that do not have the same variables, R will refuse to do it. To get the data back to their original form, we can simply reload the data and put them into the same object:
MyData <- read.dta("manifesto2015.dta", convert.factors = FALSE)
# Then, we use the command "rbind()" that we saw earlier:
MyDataBig <- rbind(MyData,MyDataUSA)

# Merging is even easier. The function "merge()" is more intuitive here than in Stata. Again, we load the data that we want to merge into a new object
MyDataVote <- read.dta("manifesto2015_votseats.dta", convert.factors = FALSE)
MyDataBig <- merge(MyDataBig,MyDataVote, by = c("party","date"))
names(MyDataBig)
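
# A minimal sketch of appending and merging on made-up data frames, so that you can see what the two functions do without the CMP files (added for illustration; all object and variable names below are hypothetical):
d1 <- data.frame(party = c(1, 2), date = c(2010, 2010), rile = c(-12, 8))
d2 <- data.frame(party = 3, date = 2012, rile = 20)
both <- rbind(d1, d2)   # append: works because the two data frames have the same variables
both
votes <- data.frame(party = c(1, 2, 3), date = c(2010, 2010, 2012), pervote = c(35, 40, 25))
merged <- merge(both, votes, by = c("party", "date"))   # merge: adds "pervote" by matching on both keys
merged
# By default "merge" keeps only the observations found in both data frames; "all.x = TRUE" keeps every observation of the first one and fills the gaps with NA
partial <- merge(both, votes[-3, ], by = c("party", "date"), all.x = TRUE)
partial
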