* Before starting: /* 1. COMMENTS are a part of a code that Stata will ignore (i.e. will not interpret as a command). You use comments to tell your future self or the reader why you are doing something. In this case, comments are used as a substitute for slides, so we won't have to switch all the time. In Stata you can: - comment a sentence by putting a star (*) before it - comment a whole chunk of text by putting slash+star before and after your text - put a comment AFTER a command by putting double slash // 2. When you write a command in the do-file you have to select it and "run" it, i.e. tell Stata it should perform that action. You can do it by clocking on the "Do" icon (top-right of the window) However, you can also run it using the keyboard, by typing Ctrl+d (Windows) or Cmd+shift+d (Mac) 3. ALWAYS REMEMBER TO SELECT THE PART OF THE CODE THAT YOU WANT TO RUN, OTHERWISE STATA WILL RUN ALL THE SCRIPT FROM THE BEGINNING */ /*********************** GETTING HELP IN STATA * ***********************/ /* To get help in Stata, you type "help" followed by a command. Even easier, you can just type "h" followed a command. Stata will open the help tool for you, which will tell you everything you need. It's as simple as that */ /******************** BASIC STATA SYNTAX * ********************/ * Even without loading the data, you can perform several operations. * The most basic one is using Stata as a calculator with the "display" command followed by an expression display 2 + 2 * TIP: Stata is smart enough to guess sometimes what you want to ask it, that's why it's way more comfortable to use ABBREVIATIONS di (10 - 7) * 3 *---------------* * Read the data * *---------------* * This prevents something very irritating from happening: set more off * This "empties" Stata memory from any dataset or temporary object we created previously in the session clear* * This tells Stata in which directory we want to work (you should put the directory where you keep the file in your own computer here) cd "~/Dropbox/CEPIS Stats Bootcamp 2016/Data/CMP_2015/" * This tells you the address of the working directory, in case you forgot it pwd * This loads the data use "manifesto2015.dta" *----------------* * Write the data * *----------------* /* After doing some transformations to your dataset (e.g. recoding some variables, adding more variables, we will see) you may want to save it into a different file (you should. Always). "save" and "saveold" are used to write the dataset currently loaded in Stata into a new file The difference between the two is that "saveold" will write the data to an older version of Stata (yes, changing versions of Stata implies changing compatibility). R does not read the newer Stata files, so I usually use "saveold" */ saveold "mydata.dta" * If you look at the working directory now, the file will be there * Note that in some cases you want to overwrite an existing file. For that, you will need to add the option "replace" saveold "mydata.dta", replace * Remember: you should never overwrite the original data. Always save copies, then you can overwrite them as many times you want *------------------* * Explore the data * *------------------* * "describe" will give you an overview of the dataset describe * "codebook" will provide some information about the variable(s) we care about help codebook codebook country // a numeric categorical variable codebook countryname // a string variable codebook rile // a continuous variable * "summarize" will give you quick summary statistics summarize rile * note that you can add options to most of the Stata commands * the options appear after a comma "," * you can learn which options every variable offers via the help function h summarize sum rile, detail * "tabulate" is used to make crosstables, and it's very useful to explore categorical variables tabulate eumember * by default it shows you the value labels, instead of the actual values * in case you want to see the values (e.g. to decide how to recode a variable) you will need to specify it tabulate eumember, nolabel * also, by default "tabulate" does not show you whether a variable contains missing values * you will need to specify that as an option tab coderyear // note the reduced number of cases tab coderyear, missing * you can also combine "tabulate" with "summarize" to obtain some group statistics * for instance, let's see what is the left-right position of the average party belonging to each party family tab parfam, summarize(rile) * using "tabulate" with two variables will produce the classical crosstables * E.g. are some party families overrepresented in EU countries? tab parfam eumember * you can have row or column percentages tab parfam eumember, row tab parfam eumember, column * "if" statements allow you to focus on some specific subsets within your data. * Let's suppose we want to see the left-right position of different party families in Croatia tab parfam if countryname == "Croatia", sum(rile) * Note that "if" statements always go BEFORE the options * using "if" we can make more slections tab parfam if countryname == "Croatia" & date < 200000, sum(rile) tab parfam if countryname == "Croatia" & date > 200000, sum(rile) tab parfam if countryname == "Croatia" & date > 200000 | countryname == "Croatia" & date < 199500, sum(rile) * note the combination of 'or' and 'and' * "bysort" works in a similar way: it repeats the same operation for different subgroups in the data; * it requires a colon ":" bysort parfam: sum rile * it is a generalization of the "by" command, which won't work unless data are sorted (in this case it will work anyway because by calling "bysort" we have just sorted the data) by parfam: sum rile * "sort" orders the data by the values of a given variable. * it has absolutely no effect on the associations between variables, it just changes the order of the observations sort countryname sort oecdmember sort rile *----------* * Graphics * *----------* * to explore some data types, it's just better to plot * "histogram" will make, well, a histogram histogram rile * you can make histograms for different subgroups hist rile, by(parfam) * "if" statements work here as well hist rile if date > 200000, by(parfam) * "plot" is a very quick-and-dirty solution for visualizing bivariate relationships plot rile welfare * however, when it gets bivariate, there are more sophisticated options * "twoway" is one of them, but it will require some specifications * in fact, plotting "twoway" is the only case where it's just better to go point-and-click *--------------------* * Transform the data * *--------------------* * "generate" is what we use to create a new variable h generate * create a numeric constant gen constant = 1 tab constant * create a string variable gen constring = A * when we want to create a string variable we need to put quotation marks, otherwise Stata will think that "A" is a variable gen constring = "A" tab constring * Of course, we can also do more useful computations * For instance, we may want to see the party positions on a "cultural traditions" dimension made of national way of life and traditional morality. * To do so, we can sum all the positive quasi-sentences and substract all the negative quasi sentences on the two topics gen cultrad_pos = per601 + per603 gen cultrad_neg = per602 + per604 gen cultrad = cultrad_pos - cultrad_neg sum cultrad hist cultrad * Distribution looks nasty. Better to take log ratios gen cultrad_log = log( (cultrad_pos + 0.5) / (cultrad_neg + 0.5) ) hist cultrad_log plot cultrad cultrad_log * Whatis that big mass on zero? We can figure it out using conditonal statements sum per601 per602 per603 per604 if cultrad_log == 0 sum cultrad if cultrad_log == 0 * Looks like they are cases where the total of positive sentences minus the total of negative sentences is exactly zero. We can't do much about it * "gen" can be used also with conditional statements. For instance, we want to create a dummy variable for Austrian parties only gen austria = 1 if countryname == "Austria" tab austria, missing * however, this will add values ONLY for the selected cases * to create a proper dichotomous variable we will need to replace the values for the cases that are not from Austria * "replace" replace austria = 0 if countryname != "Austria" tab austria, missing * in general, "replace" is used to change variable values replace constant = 2 tab constant replace constant = constant + 100 * however, watch out replace constring = 5 * the error "type mismtch" means that we are assigning a variable some values that are not consistent with the type. In this case, because "constring" vas created as a string variable, we can't give it a numeric value replace constring = "B" * to convert numeric to string variables and vice versa, two commands are "tostring" and "destring" * in both cases, you can choose whether to generate a new variable or to replace the existing one tostring constant, gen(new) tab new tostring constant, replace destring constring, gen(new2) * this will not work, unless the string contains numbers destring constant, gen(new2) tab new2 destring constant, replace * "generate" and "replace" are typically used together to create variables gen benelux = 0 replace benelux = 1 if countryname == "Netherlands" | countryname == "Belgium" | countryname == "Luxembourg" tab benelux, missing tab countryname benelux * "drop" followed by a variable name is used to delete variables drop constant constring austria new new2 benelux * Another way to create dummy variables is to combine "tabulate" with "generate" tab eumember, gen(eu) tab eumember eu1 tab eumember eu2 tab eumember eu3 /* TIP: if you want to inspect/transform more variables with the same prefix and different numbers, you can just type "*" instead of the variable number. This way, Stata will understand that you want to do something with all the variables with the same prefix. */ sum eu* * WATCH OUT: note that it also included "eumember" - be careful with using this when prefixes are very generic! * Alternatively, if you want to do an operation on a group of consecutive variables, you can use the "-" sign. sum eu1 - eu3 drop eu1 - eu3 * For reasons, "generate" only allows a limited number of actions. For instance, you can't generate a variable containing the mean value of another variable. * "egen" solves this problem. It is very flexible, and does more or less all you need in terms of variable creation h egen * mean left-right position among all parties in the data egen m_rile = mean(rile) sum m_rile * standard deviation of the left-right position among all parties in the data egen sd_rile = sd(rile) sum sd_rile * minimum and maximum values (to see what are the extremes) egen min_rile = min(rile) egen max_rile = max(rile) * it can be easily vectorized using "bysort" bysort countryname: egen cm_rile = mean(rile) sum cm_rile tab countryname, sum(cm_rile) * note that "bysort" can be used for more than 1 factor bysort countryname date: egen cym_rile = mean(rile) sum cym_rile * "egen" can also do oprations by row * For instance, let's calculate the mean and standard deviation of all the "per" values within each party (it won't tell us much, it's just for the sake of exercise) egen rm_per = rowmean(per101 - per7062) egen rsd_per = rowsd(per101 - per7062) sum rm_per hist rm_per sum rsd_per hist rsd_per /* EXERCISE: RESCALING & STANDARDIZING VARIABLES Sometimes we want to change the range of one or more variables. Reasons are multiple: maybe we want to compare the means two variables that are on completely different scales, or we may want to put standardized predictors into a regression model to be able to compare their effects. "Rescaling" means changing the values of a variable, but not its distribution: For instance, if a variable has values x = [1, 2, 3], we can rescale it dividing it by 2, so the resulting variable will be x_r = [0.5, 1, 1.5] What I want you to do is: 1. Rescale the variables per101, per102, per103, per104 and per105 to range from 0 to 1. For each variable, create a new one with suffix "_01" where to store the rescaled values Plot each of the rescaled variables with the original one, using both the "plot" function and the "twoway" function (scatter plots). In the latter case, try to make the plots look as nice as possible. */ /* 2. Rescale the 'rile' index to go from -1 to +1; Store the rescaled values in a variable called "rile_rsc" */ /* 3. Center the 'rile' index around the mean TIP: To 'mean-center' a variable means that the mean (or another arbitrary value of it) should be equal to 0, so all the values greater than the mean should be positive, and all the values smaller than the mean should be negative; Store the centered values in a variable called 'rile_mc' */ /* 4. Standardize the 'rile' index TIP: To 'standardize' a variable implies first centering it, and then dividing it by one or two standard deviations; Store the standardized values in a variable called 'rile_std' */ /* 5. Group-mean center the 'rile' index Same as exercise 3, but now instead of taking the "grand mean" (the mean in the whole sample) you should take the mean for each country (which may be different from the grand mean); Store the group-mean centered values in a variable called 'rile_gmc' */ /* 6. Group-standardize the 'rile' index Same as exercise 4, but you will need to use the country means and the country standard deviations; Store the group-standardized values in a variable called 'rile_gstd' */ /* 7. Plot: A. 'rile' and 'rile_std' B. 'rile_mc' and 'rile_gmc' C. 'rile_std' and 'rile_gstd' BY COUNTRY all using the twoway function (scatter plots) */ *-----------------------* * Append and Merge Data * *-----------------------* /* Sometimes we need to put together two or more datasets. This can happen in 2 ways. 1) ADD OBSERVATIONS Imagine we have a cross-country study, where the same variables are measured in different countries. We need to add data from one more country to the larger dataset including all the other countries. */ tab countryname // We miss the United States here, but we have their data on a different file /* In this case, we want to add observations for the USA to the ones that we already have. This is done by the command "append". "append" literally adds more row to the data. These rows come from another data file. IMPORTANT: for that to work, we need to have the EXACT SAME VARIABLES in both data files (ours and the one that we want to append). How do we know if this is the case? The package "cfvars" will help us with that. "cfvars" is a user-written package, that compares 2 datasets and tells which variables they have in common, and which they don't. It does not come with the standard version of Stata, so you have to download it separately. You can do it by using the "findit" function: */ findit cfvars cfvars "manifesto2015.dta" "manifesto2015_USA.dta" * We are good to go append using "manifesto2015_USA.dta" tab countryname * Now we can save the data - let's do it in the file we created earlier saveold "mydata.dta", replace /* 2) ADD VARIABLES In other cases, we have all the observations that we need, but we need to add some variables. For instance, in our CMP data, we want to add the vote share, number of seats and some other party information that might be useful for us. "merge" is the command used to add variables to our dataset However, adding variables is not as simple as adding observations: we need to make sure that the right values are assigned to the right cases. For that, we need that both datasets have one or more "key" variables that we can use to assign the values to the right observations. */ cfvars "mydata.dta" "manifesto2015_votseats.dta" merge 1:1 date party using "manifesto2015_votseats.dta" * Note the creation of a new variable called '_merge' tab _merge * To understand its function we can look at the help file h merge saveold "mydata.dta", replace /* EXERCISE: We have dealt here with the simplest case: both our datasetas had the same number of observations, and the combination of 'date' and 'party' identifies uniquely each observation. However, in many cases you want to merge data measured at different levels. For instance, we might want to merge some variables that are observed at the country level, while our level of observation here is the single party (within country). To do so, we need to change a bit the code that we saw above. The file called "postcom.dta" contains a dummy variable which tells whether a country belong to the "eastern" (post-communist) block. It is observed at the country level, so there are way less observations than in the file that we are using - one for each country. I want you to figure out how to merge the two files together, so our CMP data will include this country-level dummy variable. Open the file postcom.dta, and have a look at its structure; Write the code to merge it with "mydata.dta" */ /********************************** * MORE ADVANCED STATA PROGRAMMING * **********************************/ *--------* * Macros * *--------* /* Macros are like virtual "objects" saved in Stata's memory, which contain values, expressions, variables, and a few other things. They have a NAME and a CONTENT. When you call a macro's name, you will be calling its content. They are very similar to what in R language we call 'objects' (that's why we are covering them). There are two types of macros: 'local' and 'global' */ * LOCAL macros can be only called or modified within the same execution where they are defined. Once the execution is finished, local macros disappear. * They are called using single quotes: `' * They can contain numbers local one = 1 di `one' * expressions local one = 1 local two = `one' + 1 di `two' * variable names local myvar = "per101" sum `myvar' local myvar1 = "per" local myvar2 = "101" sum `myvar1'`myvar2' * commands local myvar = "per101" local command = "sum" `command' `myvar' * GLOBAL macros stay there for all the session, so they don't disappear once the execution is over, and can be called by other do-files or programs. * They are called using the dollar sign: $ global one = 1 * try execute these two commands separately di $one global one = 1 global two = $one + 1 di $two global myvar = "per101" sum $myvar global myvar1 = "per" global myvar2 = "101" sum $myvar1$myvar2 global myvar = "per101" global command = "sum" $command $myvar * you can use global macros to attribute commands to function keys global F5 = "tab countryname" *-------* * Loops * *-------* /* Sometimes we have to repeat the same observation a lot of times. Loops help us doing that. They save us time and space when we need to do repetitive tasks. We consider here two types of loops: */ * "foreach"/"forvalues" loops: they repeat the same operation for every element of a certain macro (foreach) or variable (forvalues) * the macro can be directly a list of variables foreach var of varlist per101 - per110 { sum `var' } * which is the same as doing local varlist = "per101 per102 per103 per104 per105 per106 per107 per108 per109 per110" foreach var of local varlist { sum `var' } * the local can be a series of values, that we then use to obtain variable names foreach n of numlist 101/110{ sum per`n' } * which is the same as doing local numlist = "101 102 103 104 105 106 107 108 109 110" foreach n of local numlist { sum per`n' } * however in this case we do not have to define the macro `numlist', we can just use "forvalues" forvalues n = 101(1)110 { sum per`n' } * We can also loop over subgroups within our data * We can use the function "levelsof" to ask Stata to return all the possible values of a variable, and save them into a local macro levelsof country, local(countries) foreach c of local countries { sum per101 if country == `c' } levelsof party, local(pty) foreach p of local pty { sum rile if party == `p' } * "while" loops: they evaluate an expression and IF it's true they execute the command in brackets. When the expression becomes false they stop * again, it can be used to loop over subgroups tab country if country < 15 local i = 1 while `i' < 15 { sum per101 if country == `i' local i = `i' + 1 } * or over variables local i = 101 while `i' <= 110 { sum per`i' local i = `i' + 1 } * loops can be nested levelsof country, local(countries) foreach c of local countries { local i = 101 while `i' <= 110 { sum per`i' if country == `c' local i = `i' + 1 } } *-----------* * Log files * *-----------* /* With Stata we can also record our session, so e.g. we can leave our computer running and make sure that at the end everything that appears in the results window will be saved. We do that by using log files. */ log using session1 * then we do some stuff, like replicating the megaloop we ran before (this will also appear) levelsof country, local(countries) foreach c of local countries { local i = 101 while `i' <= 110 { sum per`i' if country == `c' local i = `i' + 1 } } * then we close the log log close * Since log files are saved in the 'smcl' format, we can convert them into 'pdf' with the command "translate" translate session1.smcl session1.pdf * Note that if you want to write on an existing log file, you'll need to add the "replace" option log using session1, replace sum per101 log close * same for the "translate" command translate session1.smcl session1.pdf, replace *------------------* * Other data files * *------------------* clear* * What if our data are not in .dta format? One way is to transform them (with StatTransfer or just R) * However, Stata can import some specific types of data files. * Two examples: .csv and .xls * 'csv' is a rather common format to store the data * with "insheet" we can import them in the same way as it it were a 'dta' file cd "~/Dropbox/CEPIS Stats Bootcamp 2016/Data/CMP_2015/" insheet using manifestocomma.csv * xlsx is the common format in which Excel saves the data * the "import" command will do the trick clear* import excel using manifestoexcel.xlsx, firstrow