For this session we will need a few packages. The syntax below is a handy way to select a group of packages that we know we will use, install those we do not have, and load them all. You can copy and paste this syntax in any situation in which you need several packages that might not be installed already. You only need to edit the list at the beginning of the script, putting in the names of the packages you need. (syntax stolen from Pablo Barberà)

# Check whether the packages that we need are installed
want <- c("foreign", "ggplot2", "dplyr", "haven", "survey", "srvyr")
have <- want %in% rownames(installed.packages())
# Install the packages that are missing
if (any(!have)) { install.packages(want[!have]) }
# Load the packages
junk <- lapply(want, library, character.only = TRUE)
# Remove the objects we created
rm(have, want, junk)

First of all, if you need to read a Stata file written with a version \(> 12\), the foreign package will not work. You will need the haven package instead (which we installed above). Note that the relevant function is now called read_dta(). This package also reads SPSS files while letting you keep the variable and value labels, so it's a good helper.

rm(list=ls())
setwd("~/Dropbox/Teaching/Survey_Heidelberg_2018/lab/day1")
data.ex <- read_dta("cupesse.dta")

Now let us start with the topic of the day. In this session we will work with ESS data. I downloaded data from the 7th wave of the ESS (collected in 2014) for all EU countries. I selected a subset of variables (thanks to the ESS's excellent download Wizard utility), cleaned them, and put them into a .csv file. We can load them using the read.csv() function from base R.

rm(list=ls())
setwd("~/Dropbox/Teaching/Survey_Heidelberg_2018/lab/day2")
data <- read.csv("ess_w7.csv", na.strings = "")

The ESS data contains three types of weights:

  • dweight: the design weight, which corrects for unequal inclusion probabilities within each country's sample
  • pspwght: the post-stratification weight, which additionally corrects for sampling error and non-response bias using auxiliary information
  • pweight: the population size weight, which corrects for the fact that countries with very different population sizes have samples of roughly similar size

(you can find more exhaustive information on the weights included in the ESS data here)

The summary() function gives you basic descriptive statistics for the variables of interest. Let us have a look at the three weights provided with the ESS data.

summary(data$dweight)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.002272 0.943700 1.000000 1.001000 1.054000 4.000000
summary(data$pspwght)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.002338 0.717500 0.928500 1.001000 1.199000 4.044000
summary(data$pweight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.05401 0.21830 0.49770 0.96230 2.24900 2.79700

Let us look at the design weights more closely. If we take the minimum and the maximum value in each country, we can see where a correction was applied (i.e. where some weights differ from 1). We can do that very easily with the dplyr package.

data %>%
  group_by(cntry) %>%
  summarize(min_dw = min(dweight),
            max_dw = max(dweight))
## # A tibble: 18 x 3
##    cntry   min_dw max_dw
##    <fctr>   <dbl>  <dbl>
##  1 AT     0.787     1.47
##  2 BE     1.00      1.00
##  3 CZ     0.00227   2.91
##  4 DE     0.531     1.23
##  5 DK     1.00      1.00
##  6 EE     1.00      1.00
##  7 ES     1.00      2.00
##  8 FI     1.00      1.00
##  9 FR     0.146     4.00
## 10 GB     0.527     4.00
## 11 HU     1.00      1.00
## 12 IE     0.472     3.77
## 13 LT     0.142     3.59
## 14 NL     0.500     4.00
## 15 PL     0.788     1.48
## 16 PT     0.0155    4.00
## 17 SE     1.00      1.00
## 18 SI     1.00      1.00

In Belgium, Denmark, Estonia, Finland, Hungary, Sweden and Slovenia the design weights have not been applied. How can we tell?
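One way to check programmatically: a country used no design-weight correction exactly when every weight equals 1, so both the minimum and maximum are 1. A minimal base-R sketch on made-up data (the country codes and values here are hypothetical, not from the ESS file):

```r
# Hypothetical toy data (not the ESS file): country "AA" got no
# design-weight correction, country "BB" did.
toy <- data.frame(
  cntry   = rep(c("AA", "BB"), each = 4),
  dweight = c(1, 1, 1, 1, 0.5, 1.2, 0.9, 1.4)
)

# If all of a country's design weights equal 1, no correction was applied
no_correction <- tapply(toy$dweight, toy$cntry, function(w) all(w == 1))
no_correction
```

The same logic works on the real data by replacing `toy` with the ESS data frame.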

Let us look at the population weights. What do these values mean?

data %>%
  group_by(cntry) %>%
  summarize(n = n(),
            mean = mean(pweight))
## # A tibble: 18 x 3
##    cntry      n   mean
##    <fctr> <int>  <dbl>
##  1 AT      1789 0.406 
##  2 BE      1769 0.526 
##  3 CZ      2117 0.416 
##  4 DE      3032 2.30  
##  5 DK      1502 0.310 
##  6 EE      2045 0.0540
##  7 ES      1925 2.05  
##  8 FI      2087 0.218 
##  9 FR      1914 2.80  
## 10 GB      2243 2.34  
## 11 HU      1698 0.498 
## 12 IE      2380 0.150 
## 13 LT      2249 0.112 
## 14 NL      1916 0.728 
## 15 PL      1615 2.25  
## 16 PT      1265 0.551 
## 17 SE      1790 0.447 
## 18 SI      1224 0.144

The srvyr package

In R there are a few packages to work with survey weights. Two very useful packages are the survey package, and the srvyr package. The second is actually built on the first, that is, it takes functions that come from the survey package and “wraps” them in a way that they are more easily usable with the same syntax used in the dplyr package and other packages in the tidyverse (most importantly, the pipe operator).

The survey_total() function

Let us see what happens to the country frequencies when we apply the country population weights. Whereas in dplyr we use the n() function, here the function that does the same job is called survey_total(). Note that it produces a standard error as well. This is a function of the weights: the farther they are from 1 (that is, the more we have to weight each observation up or down to reach the population frequency), the greater our uncertainty, and the larger the standard error.
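The link between weight variability and uncertainty can be sketched with Kish's classic rule of thumb (an illustration added here, not part of the srvyr output): unequal weights shrink the effective sample size by a design effect of roughly 1 + CV(w)², where CV is the coefficient of variation of the weights.

```r
# Kish's approximation (weighting losses only): deff = 1 + CV(w)^2,
# so n_eff = n / deff. More spread in the weights means a smaller
# effective sample, and hence larger standard errors.
effective_n <- function(w) {
  deff <- 1 + (sd(w) / mean(w))^2  # design effect from unequal weighting
  length(w) / deff                 # effective sample size
}

effective_n(rep(1, 1000))          # equal weights: no precision loss, 1000
effective_n(runif(1000, 0.1, 4))   # spread-out weights: well below 1000
```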

data %>%
  as_survey(weights = c(pweight)) %>%
  group_by(cntry) %>%
  summarize(n = survey_total())
## # A tibble: 18 x 3
##    cntry      n   n_se
##    <fctr> <dbl>  <dbl>
##  1 AT       726  16.7 
##  2 BE       930  21.5 
##  3 CZ       881  18.5 
##  4 DE      6986 121   
##  5 DK       466  11.8 
##  6 EE       110   2.37
##  7 ES      3945  87.4 
##  8 FI       456   9.67
##  9 FR      5353 119   
## 10 GB      5248 107   
## 11 HU       845  20.0 
## 12 IE       358   7.08
## 13 LT       251   5.12
## 14 NL      1396  31.0 
## 15 PL      3633  88.3 
## 16 PT       698  19.3 
## 17 SE       799  18.4 
## 18 SI       176   4.94

We can make a few comparisons between the raw frequencies and the values of the population weights we observed above:

# Austria
1789*0.406
## [1] 726.334
# Belgium
1769*0.526
## [1] 930.494

The survey_mean() function

What difference does it make to use country population weights? In some cases a bit, in others not much. Let us look at the average age:

# Without weighting
mean(data$agea, na.rm = T)
## [1] 49.57983
# With population weights
data %>%
  as_survey(weights = c(pweight)) %>%
  summarize(lr_m = survey_mean(agea, na.rm = T))
## # A tibble: 1 x 2
##    lr_m lr_m_se
##   <dbl>   <dbl>
## 1  49.7   0.141

Given the standard error, the difference between weighted and unweighted data is almost negligible. But let's look at the average left-right self placement. The variable lrscale runs from 0 (left) to 10 (right), with 5 as the exact center.

# Without weighting
mean(data$lrscale, na.rm = T)
## [1] 5.019931
# With population weight
data %>%
  as_survey(weights = c(pweight)) %>%
  summarize(lr_m = survey_mean(lrscale, na.rm = T))
## # A tibble: 1 x 2
##    lr_m lr_m_se
##   <dbl>   <dbl>
## 1  4.93  0.0174
# With population weight and design weight
data %>%
  as_survey(weights = c(dweight, pweight)) %>%
  summarize(lr_m = survey_mean(lrscale, na.rm = T))
## # A tibble: 1 x 2
##    lr_m lr_m_se
##   <dbl>   <dbl>
## 1  4.95  0.0185

Here, once we take into account the country population sizes, the mean self placement of the respondents shifts a bit to the left. However, once we also take into account the design weights, the mean shifts slightly back to the right. What do these results imply?
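For intuition, survey_mean() with a single weight variable boils down to the ordinary weighted mean, sum(w * x) / sum(w). A toy base-R check (the numbers are made up, not ESS values):

```r
x <- c(2, 4, 6, 8)       # hypothetical left-right placements
w <- c(0.5, 1, 1, 2.5)   # hypothetical weights

sum(w * x) / sum(w)      # weighted mean by hand: 6.2
weighted.mean(x, w)      # base R agrees: 6.2
mean(x)                  # unweighted mean, for comparison: 5
```

The last respondent counts five times as much as the first, which pulls the weighted mean above the unweighted one.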

Let us compare female and male respondents. Is there a difference in terms of left-right self placement? Does this difference change across countries? Let us look first at the unweighted data.

# Create summary data
lr_gen <- data %>%
  mutate(sex = ifelse(gndr == 1, "Male", "Female")) %>%
  group_by(cntry, sex) %>%
  summarize(n = mean(lrscale, na.rm = T))

# Plot it
ggplot(lr_gen, aes(x = sex, y = n)) +
  geom_bar(aes(fill = sex), stat = "identity") +
  facet_wrap(~cntry) +
  theme_bw()

First we apply the design weight. Notice that now we get standard errors on the means as well, so we can add error bars to our plots.

# Create summary data
lr_gen_w1 <- data %>%
  mutate(sex = ifelse(gndr == 1, "Male", "Female")) %>%
  as_survey_design(weights = c(dweight)) %>%
  group_by(cntry, sex) %>%
  summarize(n = survey_mean(lrscale, na.rm = T, vartype = "ci"))

# Plot it
ggplot(lr_gen_w1, aes(x = sex, y = n)) +
  geom_bar(aes(fill = sex), stat = "identity") +
  geom_errorbar(aes(ymin = n_low, ymax = n_upp), width = 0.2) +
  facet_wrap(~cntry) +
  theme_bw()

Then we apply the design weight and the post-stratification weight together.

# Create summary data
lr_gen_w2 <- data %>%
  mutate(sex = ifelse(gndr == 1, "Male", "Female")) %>%
  as_survey_design(weights = c(dweight, pspwght)) %>%
  group_by(cntry, sex) %>%
  summarize(n = survey_mean(lrscale, na.rm = T, vartype = "ci"))

# Plot it
ggplot(lr_gen_w2, aes(x = sex, y = n)) +
  geom_bar(aes(fill = sex), stat = "identity") +
  geom_errorbar(aes(ymin = n_low, ymax = n_upp), width = 0.2) +
  facet_wrap(~cntry) +
  theme_bw()

Regression with weights

Of course you can also run a regression model that includes weights in the calculation. The survey package provides several functions to do so. Here we work through an example answering the following research question: how do attitudes towards immigration correlate with left-right self placement in Europe? To do so, we run a regression model with lrscale as the dependent variable and a set of attitudes towards immigration as predictors. We do not talk about causation here: we just want to see how the variables correlate. The predictors are:

  • imsmetn: Allow many/few immigrants of same race/ethnic group as majority
  • imdfetn: Allow many/few immigrants of different race/ethnic group from majority
  • impcntr: Allow many/few immigrants from poorer countries outside Europe
  • imbgeco: Immigration bad or good for country's economy
  • imueclt: Country's cultural life undermined or enriched by immigrants
  • imwbcnt: Immigrants make country worse or better place to live

First, we run a simple linear regression model with all these variables predicting left-right self placement. We use the glm() function from base R, which fits regression models by maximum likelihood. It accommodates a variety of dependent variables; here, since our outcome is an 11-point scale treated as continuous, it assumes a Gaussian family by default, which is equivalent to ordinary least squares.
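As a quick sanity check that a Gaussian glm() really is OLS, its coefficients match lm() exactly; a toy comparison on simulated data (not the ESS variables):

```r
set.seed(42)
toy <- data.frame(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50))

# gaussian is glm()'s default family, so this is ordinary least squares
coef(glm(y ~ x1 + x2, data = toy))
coef(lm(y ~ x1 + x2, data = toy))   # identical estimates
```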

m_1 <- glm(lrscale~imsmetn + imdfetn + impcntr + imbgeco + imueclt + imwbcnt, data = data)
summary(m_1)
## 
## Call:
## glm(formula = lrscale ~ imsmetn + imdfetn + impcntr + imbgeco + 
##     imueclt + imwbcnt, data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.0171  -1.2281  -0.0422   1.2796   5.9129  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.148589   0.084227  49.255  < 2e-16 ***
## imsmetn     -0.043641   0.023517  -1.856   0.0635 .  
## imdfetn      0.178009   0.027294   6.522 7.06e-11 ***
## impcntr      0.230499   0.022168  10.398  < 2e-16 ***
## imbgeco      0.033147   0.007589   4.368 1.26e-05 ***
## imueclt     -0.048504   0.007693  -6.305 2.92e-10 ***
## imwbcnt      0.011733   0.008730   1.344   0.1789    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 4.603576)
## 
##     Null deviance: 133091  on 28210  degrees of freedom
## Residual deviance: 129839  on 28204  degrees of freedom
##   (6349 observations deleted due to missingness)
## AIC: 123142
## 
## Number of Fisher Scoring iterations: 2

Then we run it on weighted data. To do so, we first need to create a "weighted data" object, with the syntax we saw already. We use all the weights that we have.

data_we <- data %>%
  as_survey(weights = c(dweight, pspwght, pweight))

Then we run the regression using the svyglm() function from the survey package.

m_2 <- svyglm(lrscale~imsmetn + imdfetn + impcntr + imbgeco + imueclt + imwbcnt, design = data_we)
summary(m_2)
## 
## Call:
## svyglm(formula = lrscale ~ imsmetn + imdfetn + impcntr + imbgeco + 
##     imueclt + imwbcnt, design = data_we)
## 
## Survey design:
## Called via srvyr
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.41657    0.15427  28.630  < 2e-16 ***
## imsmetn     -0.11855    0.04629  -2.561   0.0104 *  
## imdfetn      0.27853    0.05195   5.362 8.31e-08 ***
## impcntr      0.22445    0.04264   5.264 1.42e-07 ***
## imbgeco      0.02684    0.01469   1.828   0.0676 .  
## imueclt     -0.09997    0.01553  -6.438 1.23e-10 ***
## imwbcnt      0.00337    0.01805   0.187   0.8519    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 4.441555)
## 
## Number of Fisher Scoring iterations: 2

Some results change quite a bit: for instance, some variables that were significant before are no longer significant. How do you interpret this?