For this session we will need a few packages. The syntax below is a convenient way to list a group of packages that we know we will use, install those that we do not have, and load them all. You can copy and paste this syntax and use it in any situation in which you need several packages that might not be installed already. You only need to edit the list at the beginning of the script, putting in the names of the packages that you need. (syntax stolen from Pablo Barberà)
# Check if the packages that we need are installed
want = c("foreign", "ggplot2", "dplyr", "haven", "survey", "srvyr")
have = want %in% rownames(installed.packages())
# Install the packages that we miss
if ( any(!have) ) { install.packages( want[!have] ) }
# Load the packages
junk <- lapply(want, library, character.only = T)
# Remove the objects we created
rm(have, want, junk)
First of all, if you need to read a Stata file that has been saved with a version newer than 12, then the foreign package will not work. You will need to use the haven package instead (which we installed above). Note that the relevant function is now called read_dta(). This package also loads SPSS files while keeping the variable and value labels, so it's a good helper.
rm(list=ls())
setwd("~/Dropbox/Teaching/Survey_Heidelberg_2018/lab/day1")
data.ex <- read_dta("cupesse.dta")
Now let us start with the topic of the day. In this session we will work with ESS data. I downloaded data from the 7th wave of the ESS (collected in 2014) for all EU countries. I selected a subset of variables (thanks to the ESS' excellent download Wizard utility), cleaned them, and put them into a .csv file. We can load the file using the read.csv() function, which comes with base R.
rm(list=ls())
setwd("~/Dropbox/Teaching/Survey_Heidelberg_2018/lab/day2")
data <- read.csv("ess_w7.csv", na.strings = "")
The ESS data contains 3 types of weights:

dweight: These are the so-called design weights. Quoting the ESS website: "the main purpose of the design weights is to correct for the fact that in some countries respondents have different probabilities to be part of the sample due to the sampling design used." These weights have been built to correct for the coverage error, that is, the error created by the different chances that individuals from the target population are covered in the sampling frame.

pspwght: These are the post-stratification weights. According to the ESS website, these "are a more sophisticated weighting strategy that uses auxiliary information to reduce the sampling error and potential non-response bias." These weights have been computed after the data was collected, to correct for differences between the frequencies observed in the sample and the "true" population frequencies (in this case, those provided by the EU-funded Labour Force Survey, available on Eurostat). Unlike the design weights, which are based on the probability of inclusion of different groups of individuals in the sampling frames, these have been calculated from variables that are present in the data, and are really an "adjustment" of the design weights made so that the observed frequencies match those of the target population.

pweight: These are the population size weights. Their purpose is to match the number of observations collected in each country to the country's population size. They are to be used only when we calculate statistics on multiple countries (for instance, unemployment in Scandinavia). Their value is the same for all observations within the same country.

(you can find more exhaustive information on the weights included in the ESS data here)
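The intuition behind design weights can be sketched in a couple of lines of base R: a group's weight is (proportional to) the inverse of its probability of being included in the sample. The probabilities below are invented for illustration.

```r
# Invented inclusion probabilities for two groups of respondents:
# group_b is half as likely as group_a to end up in the sample
p <- c(group_a = 0.02, group_b = 0.01)
w <- 1 / p        # inverse-probability (design) weights
w / mean(w)       # normalised so that the average weight is 1
##   group_a   group_b
## 0.6666667 1.3333333
```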
The summary() function gives you basic descriptive statistics for the variables of interest. Let us have a look at the three weights provided with the ESS data.
summary(data$dweight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.002272 0.943700 1.000000 1.001000 1.054000 4.000000
summary(data$pspwght)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.002338 0.717500 0.928500 1.001000 1.199000 4.044000
summary(data$pweight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.05401 0.21830 0.49770 0.96230 2.24900 2.79700
Let us have a look at the design weights more closely. If we take the minimum and the maximum value in each country, we can see where a correction was applied (i.e. where there are weights different from 1). We can do that very easily using the dplyr package.
data %>%
group_by(cntry) %>%
summarize(min_dw = min(dweight),
max_dw = max(dweight))
## # A tibble: 18 x 3
## cntry min_dw max_dw
## <fctr> <dbl> <dbl>
## 1 AT 0.787 1.47
## 2 BE 1.00 1.00
## 3 CZ 0.00227 2.91
## 4 DE 0.531 1.23
## 5 DK 1.00 1.00
## 6 EE 1.00 1.00
## 7 ES 1.00 2.00
## 8 FI 1.00 1.00
## 9 FR 0.146 4.00
## 10 GB 0.527 4.00
## 11 HU 1.00 1.00
## 12 IE 0.472 3.77
## 13 LT 0.142 3.59
## 14 NL 0.500 4.00
## 15 PL 0.788 1.48
## 16 PT 0.0155 4.00
## 17 SE 1.00 1.00
## 18 SI 1.00 1.00
In Belgium, Denmark, Estonia, Finland, Hungary, Sweden and Slovenia the design weights have not been applied. How can we tell?
Let us look at the population weights. What do these values mean?
data %>%
group_by(cntry) %>%
summarize(n = n(),
mean = mean(pweight))
## # A tibble: 18 x 3
## cntry n mean
## <fctr> <int> <dbl>
## 1 AT 1789 0.406
## 2 BE 1769 0.526
## 3 CZ 2117 0.416
## 4 DE 3032 2.30
## 5 DK 1502 0.310
## 6 EE 2045 0.0540
## 7 ES 1925 2.05
## 8 FI 2087 0.218
## 9 FR 1914 2.80
## 10 GB 2243 2.34
## 11 HU 1698 0.498
## 12 IE 2380 0.150
## 13 LT 2249 0.112
## 14 NL 1916 0.728
## 15 PL 1615 2.25
## 16 PT 1265 0.551
## 17 SE 1790 0.447
## 18 SI 1224 0.144
The srvyr package

In R there are a few packages to work with survey weights. Two very useful ones are the survey package and the srvyr package. The second is actually built on the first: it takes functions from the survey package and "wraps" them so that they can be used with the same syntax as the dplyr package and the other packages in the tidyverse (most importantly, the pipe operator).
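To see the relationship concretely, here is a minimal sketch (toy data, assuming both packages are installed): the same design declared once with survey::svydesign() and once with srvyr::as_survey().

```r
library(survey)
library(srvyr)

# Toy data standing in for the ESS extract (column names as in the text)
df <- data.frame(agea    = c(30, 40, 50, 60),
                 pweight = c(0.5, 0.5, 2, 2))

# survey: classic formula interface (ids = ~1 means no clustering)
des_survey <- svydesign(ids = ~1, weights = ~pweight, data = df)

# srvyr: the same design, declared pipe-style
des_srvyr <- df %>% as_survey(weights = pweight)

# Both can be passed to the same estimators, e.g. svymean():
svymean(~agea, des_survey)  # weighted mean = (15 + 20 + 100 + 120) / 5 = 51
```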
The survey_total() function

Let us see what happens to the country frequencies when we apply the country population weights. Whereas in dplyr we use the n() function, here the function that does the same job is called survey_total(). Note that it produces a standard error as well. This is a function of the size of the weights: the further they are from 1 (that is, the more we have to weight each observation to reach the population frequency), the greater our uncertainty, and the larger the error.
data %>%
as_survey(weights = c(pweight)) %>%
group_by(cntry) %>%
summarize(n = survey_total())
## # A tibble: 18 x 3
## cntry n n_se
## <fctr> <dbl> <dbl>
## 1 AT 726 16.7
## 2 BE 930 21.5
## 3 CZ 881 18.5
## 4 DE 6986 121
## 5 DK 466 11.8
## 6 EE 110 2.37
## 7 ES 3945 87.4
## 8 FI 456 9.67
## 9 FR 5353 119
## 10 GB 5248 107
## 11 HU 845 20.0
## 12 IE 358 7.08
## 13 LT 251 5.12
## 14 NL 1396 31.0
## 15 PL 3633 88.3
## 16 PT 698 19.3
## 17 SE 799 18.4
## 18 SI 176 4.94
We can make a few comparisons with the raw frequencies and the values of the population weights we observed above:
# Austria
1789*0.406
## [1] 726.334
# Belgium
1769*0.526
## [1] 930.494
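The point made above, that the standard error grows with the spread of the weights, can be illustrated with a base-R sketch. This is a with-replacement approximation, not the exact estimator the survey package uses: Var(sum of w_i) is roughly n * var(w).

```r
# Rough approximation of the SE of a weighted total (NOT the exact estimator
# used by the survey package): sqrt(n * var(w))
se_total <- function(w) sqrt(length(w) * var(w))

set.seed(123)
w_flat   <- rep(1, 100)          # no weighting: all weights equal to 1
w_spread <- runif(100, 0.1, 4)   # weights far from 1, as in the ESS pweight

se_total(w_flat)    # 0: equal weights add no extra uncertainty to the total
se_total(w_spread)  # clearly positive: the more spread, the larger the SE
```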
The survey_mean() function

What difference does it make to use country population weights? In some cases a bit, in others not much. Let us look at the average age:
# Without weighting
mean(data$agea, na.rm = T)
## [1] 49.57983
# With population weights
data %>%
as_survey(weights = c(pweight)) %>%
summarize(age_m = survey_mean(agea, na.rm = T))
## # A tibble: 1 x 2
## age_m age_m_se
## <dbl> <dbl>
## 1 49.7 0.141
Considering the error, the difference between weighted and unweighted data is almost negligible. But let us look at the average left-right self-placement. The variable lrscale goes from 0 (left) to 10 (right), with 5 being the exact center.
# Without weighting
mean(data$lrscale, na.rm = T)
## [1] 5.019931
# With population weight
data %>%
as_survey(weights = c(pweight)) %>%
summarize(lr_m = survey_mean(lrscale, na.rm = T))
## # A tibble: 1 x 2
## lr_m lr_m_se
## <dbl> <dbl>
## 1 4.93 0.0174
# With population weight and design weight
data %>%
as_survey(weights = c(dweight, pweight)) %>%
summarize(lr_m = survey_mean(lrscale, na.rm = T))
## # A tibble: 1 x 2
## lr_m lr_m_se
## <dbl> <dbl>
## 1 4.95 0.0185
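Under the hood, the point estimate that survey_mean() reports is just the familiar weighted mean, sum(w * x) / sum(w); the standard error is more involved. A minimal base-R sketch with made-up numbers:

```r
# survey_mean() point estimate = sum(w * x) / sum(w); numbers below are made up
x <- c(4, 5, 6, 7)     # e.g. left-right self-placements
w <- c(2, 1, 1, 0.5)   # e.g. population weights
sum(w * x) / sum(w)    # 5
weighted.mean(x, w)    # same result: 5
```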
Here, if we take into account the country population sizes, the mean self-placement of the respondents shifts a bit to the left. However, once we also take into account the design weights, the mean shifts a little bit to the right again. What do these results imply?
Let us compare female and male respondents. Is there a difference in terms of left-right self placement? Does this difference change across countries? Let us look first at the unweighted data.
# Create summary data
lr_gen <- data %>%
mutate(sex = ifelse(gndr == 1, "Male", "Female")) %>%
group_by(cntry, sex) %>%
summarize(n = mean(lrscale, na.rm = T))
# Plot it
ggplot(lr_gen, aes(x = sex, y = n)) +
geom_bar(aes(fill = sex), stat = "identity") +
facet_wrap(~cntry) +
theme_bw()
First we apply the design weights. Notice that now we also get a measure of uncertainty on the estimates (here confidence intervals), so we can add error bars to our plots.
# Create summary data
lr_gen_w1 <- data %>%
mutate(sex = ifelse(gndr == 1, "Male", "Female")) %>%
as_survey_design(weights = c(dweight)) %>%
group_by(cntry, sex) %>%
summarize(n = survey_mean(lrscale, na.rm = T, vartype = "ci"))
# Plot it
ggplot(lr_gen_w1, aes(x = sex, y = n)) +
geom_bar(aes(fill = sex), stat = "identity") +
geom_errorbar(aes(ymin = n_low, ymax = n_upp), width = 0.2) +
facet_wrap(~cntry) +
theme_bw()
Then we apply the design weights and the post-stratification weights together.
# Create summary data
lr_gen_w2 <- data %>%
mutate(sex = ifelse(gndr == 1, "Male", "Female")) %>%
as_survey_design(weights = c(dweight, pspwght)) %>%
group_by(cntry, sex) %>%
summarize(n = survey_mean(lrscale, na.rm = T, vartype = "ci"))
# Plot it
ggplot(lr_gen_w2, aes(x = sex, y = n)) +
geom_bar(aes(fill = sex), stat = "identity") +
geom_errorbar(aes(ymin = n_low, ymax = n_upp), width = 0.2) +
facet_wrap(~cntry) +
theme_bw()
Of course you can also run a regression model including weights in the calculation. The survey package includes several functions to do that. Here we just make an example to answer the following research question: how do attitudes towards immigration correlate with left-right self-placement in Europe? To do so, we run a regression model with lrscale as dependent variable, and a set of attitudes towards immigration as predictors. We do not talk about causation here: we just want to see how the variables correlate. The predictors are:

imsmetn: Allow many/few immigrants of same race/ethnic group as majority

imdfetn: Allow many/few immigrants of different race/ethnic group from majority

impcntr: Allow many/few immigrants from poorer countries outside Europe

imbgeco: Immigration bad or good for country's economy

imueclt: Country's cultural life undermined or enriched by immigrants

imwbcnt: Immigrants make country worse or better place to live

First, we run a simple linear regression model with all these variables predicting left-right self-placement. We use the glm() function from base R, which fits a regression model using maximum likelihood estimation. It can be used with several types of dependent variables; here, since our outcome variable is an 11-point scale, it assumes by default a Gaussian distribution.
m_1 <- glm(lrscale~imsmetn + imdfetn + impcntr + imbgeco + imueclt + imwbcnt, data = data)
summary(m_1)
##
## Call:
## glm(formula = lrscale ~ imsmetn + imdfetn + impcntr + imbgeco +
## imueclt + imwbcnt, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.0171 -1.2281 -0.0422 1.2796 5.9129
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.148589 0.084227 49.255 < 2e-16 ***
## imsmetn -0.043641 0.023517 -1.856 0.0635 .
## imdfetn 0.178009 0.027294 6.522 7.06e-11 ***
## impcntr 0.230499 0.022168 10.398 < 2e-16 ***
## imbgeco 0.033147 0.007589 4.368 1.26e-05 ***
## imueclt -0.048504 0.007693 -6.305 2.92e-10 ***
## imwbcnt 0.011733 0.008730 1.344 0.1789
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 4.603576)
##
## Null deviance: 133091 on 28210 degrees of freedom
## Residual deviance: 129839 on 28204 degrees of freedom
## (6349 observations deleted due to missingness)
## AIC: 123142
##
## Number of Fisher Scoring iterations: 2
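As a sanity check on the claim above: with the default gaussian family, glm() produces the same point estimates as ordinary least squares via lm(). A quick sketch on toy data:

```r
# With family = gaussian (the default), glm() matches lm() coefficient by
# coefficient; toy data below
df <- data.frame(y = c(1, 3, 2, 5), x = c(0, 1, 2, 3))
coef(lm(y ~ x, data = df))
coef(glm(y ~ x, data = df))
all.equal(coef(lm(y ~ x, data = df)), coef(glm(y ~ x, data = df)))  # TRUE
```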
Then we run it on weighted data. To do so, we first need to create a "weighted data" object, with the syntax we have already seen. We use all the weights that we have.
data_we <- data %>%
as_survey(weights = c(dweight, pspwght, pweight))
Then we run the regression using the svyglm()
function from the survey
package.
m_2 <- svyglm(lrscale~imsmetn + imdfetn + impcntr + imbgeco + imueclt + imwbcnt, design = data_we)
summary(m_2)
##
## Call:
## svyglm(formula = lrscale ~ imsmetn + imdfetn + impcntr + imbgeco +
## imueclt + imwbcnt, design = data_we)
##
## Survey design:
## Called via srvyr
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.41657 0.15427 28.630 < 2e-16 ***
## imsmetn -0.11855 0.04629 -2.561 0.0104 *
## imdfetn 0.27853 0.05195 5.362 8.31e-08 ***
## impcntr 0.22445 0.04264 5.264 1.42e-07 ***
## imbgeco 0.02684 0.01469 1.828 0.0676 .
## imueclt -0.09997 0.01553 -6.438 1.23e-10 ***
## imwbcnt 0.00337 0.01805 0.187 0.8519
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 4.441555)
##
## Number of Fisher Scoring iterations: 2
Some results seem to change quite a bit, so for instance, some variables that were significant before are not significant any longer. How do you interpret this?