For this session we will need a few packages. The syntax below is a convenient way to list a group of packages that we know we will use, install those that we do not have, and load them all. You can copy and paste this syntax and use it in any situation in which you need several packages that might not be installed already. You only need to edit the list at the beginning of the script, putting in the names of the packages that you need. (syntax stolen from Pablo Barberà)
# Check if the packages that we need are installed
want = c("foreign", "ggplot2", "dplyr", "haven", "survey", "srvyr")
have = want %in% rownames(installed.packages())
# Install the packages that we miss
if ( any(!have) ) { install.packages( want[!have] ) }
# Load the packages
junk <- lapply(want, library, character.only = T)
# Remove the objects we created
rm(have, want, junk)
First of all, if you need to read a Stata file that has been saved with a version newer than 12, then the foreign package will not work. You will need to use the haven package instead (which we installed above). Note that the relevant function is now called read_dta(). This package also loads SPSS files while keeping the variable and value labels, so it's a good helper.
rm(list=ls())
setwd("~/Dropbox/Teaching/Survey_Heidelberg_2018/lab/day1")
data.ex <- read_dta("cupesse.dta")
Now let us start with the topic of the day. In this session we will work with ESS data. I downloaded data from the 7th wave of the ESS (collected in 2014) for all EU countries. I selected a subset of variables (thanks to the ESS' excellent download Wizard utility), cleaned them, and put them into a .csv file. We can load the file using the read.csv() function, which comes with base R.
rm(list=ls())
setwd("~/Dropbox/Teaching/Survey_Heidelberg_2018/lab/day2")
data <- read.csv("ess_w7.csv", na.strings = "")
The ESS data contains 3 types of weights:

dweight: These are the so-called design weights. Quoting the ESS website: "the main purpose of the design weights is to correct for the fact that in some countries respondents have different probabilities to be part of the sample due to the sampling design used." These weights have been built to correct for the coverage error, that is, the error created by the different chances that individuals from the target population are covered in the sampling frame.

pspwght: These are the post-stratification weights. According to the ESS website, these "are a more sophisticated weighting strategy that uses auxiliary information to reduce the sampling error and potential non-response bias." These weights have been computed after the data was collected, to correct for differences between the frequencies observed in the sample and the "true" population frequencies (in this case, those provided by the EU-funded Labour Force Survey, available on Eurostat). Unlike the design weights, which are based on the probability of inclusion of different groups of individuals in the sampling frames, these have been calculated from variables that are present in the data, and are really an "adjustment" of the design weights made so that the observed frequencies match those of the target population.

pweight: These are the population size weights. Their purpose is to match the number of observations collected in each country to the country's population size. They are to be used only when we calculate statistics on multiple countries (for instance, unemployment in Scandinavia). Their value is the same for all observations within the same country.

(you can find more exhaustive information on the weights included in the ESS data here)
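The intuition behind design weights can be sketched in a couple of lines of base R: a group's weight is (proportional to) the inverse of its probability of being included in the sample. The probabilities below are invented for illustration.

```r
# Invented inclusion probabilities for two groups of respondents:
# group_b is half as likely as group_a to end up in the sample
p <- c(group_a = 0.02, group_b = 0.01)
w <- 1 / p        # inverse-probability (design) weights
w / mean(w)       # normalised so that the average weight is 1
##   group_a   group_b
## 0.6666667 1.3333333
```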
The summary() function gives you basic descriptive statistics for the variables of interest. Let us have a look at the three weights provided with the ESS data.
summary(data$dweight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.002272 0.943700 1.000000 1.001000 1.054000 4.000000
summary(data$pspwght)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.002338 0.717500 0.928500 1.001000 1.199000 4.044000
summary(data$pweight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.05401 0.21830 0.49770 0.96230 2.24900 2.79700
Let us have a look at the design weights more closely. If we take the minimum and the maximum value in each country, we can see where a correction was applied (i.e. where there are weights different from 1). We can do that very easily using the dplyr package.
data %>%
group_by(cntry) %>%
summarize(min_dw = min(dweight),
max_dw = max(dweight))
## # A tibble: 18 x 3
## cntry min_dw max_dw
## <fctr> <dbl> <dbl>
## 1 AT 0.787 1.47
## 2 BE 1.00 1.00
## 3 CZ 0.00227 2.91
## 4 DE 0.531 1.23
## 5 DK 1.00 1.00
## 6 EE 1.00 1.00
## 7 ES 1.00 2.00
## 8 FI 1.00 1.00
## 9 FR 0.146 4.00
## 10 GB 0.527 4.00
## 11 HU 1.00 1.00
## 12 IE 0.472 3.77
## 13 LT 0.142 3.59
## 14 NL 0.500 4.00
## 15 PL 0.788 1.48
## 16 PT 0.0155 4.00
## 17 SE 1.00 1.00
## 18 SI 1.00 1.00
In Belgium, Denmark, Estonia, Finland, Hungary, Sweden and Slovenia the design weights have not been applied. How can we tell?
Let us look at the population weights. What do these values mean?
data %>%
group_by(cntry) %>%
summarize(n = n(),
mean = mean(pweight))
## # A tibble: 18 x 3
## cntry n mean
## <fctr> <int> <dbl>
## 1 AT 1789 0.406
## 2 BE 1769 0.526
## 3 CZ 2117 0.416
## 4 DE 3032 2.30
## 5 DK 1502 0.310
## 6 EE 2045 0.0540
## 7 ES 1925 2.05
## 8 FI 2087 0.218
## 9 FR 1914 2.80
## 10 GB 2243 2.34
## 11 HU 1698 0.498
## 12 IE 2380 0.150
## 13 LT 2249 0.112
## 14 NL 1916 0.728
## 15 PL 1615 2.25
## 16 PT 1265 0.551
## 17 SE 1790 0.447
## 18 SI 1224 0.144
The srvyr package

In R there are a few packages to work with survey weights. Two very useful ones are the survey package and the srvyr package. The second is actually built on the first: it takes functions from the survey package and "wraps" them so that they can be used with the same syntax as the dplyr package and the other packages in the tidyverse (most importantly, the pipe operator).
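To see the relationship concretely, here is a minimal sketch (toy data, assuming both packages are installed): the same design declared once with survey::svydesign() and once with srvyr::as_survey().

```r
library(survey)
library(srvyr)

# Toy data standing in for the ESS extract (column names as in the text)
df <- data.frame(agea    = c(30, 40, 50, 60),
                 pweight = c(0.5, 0.5, 2, 2))

# survey: classic formula interface (ids = ~1 means no clustering)
des_survey <- svydesign(ids = ~1, weights = ~pweight, data = df)

# srvyr: the same design, declared pipe-style
des_srvyr <- df %>% as_survey(weights = pweight)

# Both can be passed to the same estimators, e.g. svymean():
svymean(~agea, des_survey)  # weighted mean = (15 + 20 + 100 + 120) / 5 = 51
```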
The survey_total() function

Let us see what happens to the country frequencies when we apply the country population weights. Whereas in dplyr we use the n() function, here the function that does the same job is called survey_total(). Note that it produces a standard error as well. This is a function of the size of the weights: the further they are from 1 (that is, the more we have to weight each observation to reach the population frequency), the greater our uncertainty, and the larger the error.
data %>%
as_survey(weights = c(pweight)) %>%
group_by(cntry) %>%
summarize(n = survey_total())
## # A tibble: 18 x 3
## cntry n n_se
## <fctr> <dbl> <dbl>
## 1 AT 726 16.7
## 2 BE 930 21.5
## 3 CZ 881 18.5
## 4 DE 6986 121
## 5 DK 466 11.8
## 6 EE 110 2.37
## 7 ES 3945 87.4
## 8 FI 456 9.67
## 9 FR 5353 119
## 10 GB 5248 107
## 11 HU 845 20.0
## 12 IE 358 7.08
## 13 LT 251 5.12
## 14 NL 1396 31.0
## 15 PL 3633 88.3
## 16 PT 698 19.3
## 17 SE 799 18.4
## 18 SI 176 4.94
We can make a few comparisons with the raw frequencies and the values of the population weights we observed above:
# Austria
1789*0.406
## [1] 726.334
# Belgium
1769*0.526
## [1] 930.494
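The point made above, that the standard error grows with the spread of the weights, can be illustrated with a base-R sketch. This is a with-replacement approximation, not the exact estimator the survey package uses: Var(sum of w_i) is roughly n * var(w).

```r
# Rough approximation of the SE of a weighted total (NOT the exact estimator
# used by the survey package): sqrt(n * var(w))
se_total <- function(w) sqrt(length(w) * var(w))

set.seed(123)
w_flat   <- rep(1, 100)          # no weighting: all weights equal to 1
w_spread <- runif(100, 0.1, 4)   # weights far from 1, as in the ESS pweight

se_total(w_flat)    # 0: equal weights add no extra uncertainty to the total
se_total(w_spread)  # clearly positive: the more spread, the larger the SE
```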
The survey_mean() function

What difference does it make to use country population weights? In some cases a bit, in others not much. Let us look at the average age:
# Without weighting
mean(data$agea, na.rm = T)
## [1] 49.57983
# With population weights
data %>%
as_survey(weights = c(pweight)) %>%
summarize(age_m = survey_mean(agea, na.rm = T))
## # A tibble: 1 x 2
## age_m age_m_se
## <dbl> <dbl>
## 1 49.7 0.141
Considering the error, the difference between weighted and unweighted data is almost negligible. But let us look at the average left-right self-placement. The variable lrscale goes from 0 (left) to 10 (right), with 5 being the exact center.
# Without weighting
mean(data$lrscale, na.rm = T)
## [1] 5.019931
# With population weight
data %>%
as_survey(weights = c(pweight)) %>%
summarize(lr_m = survey_mean(lrscale, na.rm = T))
## # A tibble: 1 x 2
## lr_m lr_m_se
## <dbl> <dbl>
## 1 4.93 0.0174
# With population weight and design weight
data %>%
as_survey(weights = c(dweight, pweight)) %>%
summarize(lr_m = survey_mean(lrscale, na.rm = T))
## # A tibble: 1 x 2
## lr_m lr_m_se
## <dbl> <dbl>
## 1 4.95 0.0185
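Under the hood, the point estimate that survey_mean() reports is just the familiar weighted mean, sum(w * x) / sum(w); the standard error is more involved. A minimal base-R sketch with made-up numbers:

```r
# survey_mean() point estimate = sum(w * x) / sum(w); numbers below are made up
x <- c(4, 5, 6, 7)     # e.g. left-right self-placements
w <- c(2, 1, 1, 0.5)   # e.g. population weights
sum(w * x) / sum(w)    # 5
weighted.mean(x, w)    # same result: 5
```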
Here, if we take into account the country population sizes, the mean self-placement of the respondents shifts a bit to the left. However, once we also take into account the design weights, the mean shifts a little bit to the right again. What do these results imply?
Let us compare female and male respondents. Is there a difference in terms of left-right self placement? Does this difference change across countries? Let us look first at the unweighted data.
# Create summary data
lr_gen <- data %>%
mutate(sex = ifelse(gndr == 1, "Male", "Female")) %>%
group_by(cntry, sex) %>%
summarize(n = mean(lrscale, na.rm = T))
# Plot it
ggplot(lr_gen, aes(x = sex, y = n)) +
geom_bar(aes(fill = sex), stat = "identity") +
facet_wrap(~cntry) +
theme_bw()
First we apply the design weights. Notice that now we also get a measure of uncertainty on the estimates (here confidence intervals), so we can add error bars to our plots.
# Create summary data
lr_gen_w1 <- data %>%
mutate(sex = ifelse(gndr == 1, "Male", "Female")) %>%
as_survey_design(weights = c(dweight)) %>%
group_by(cntry, sex) %>%
summarize(n = survey_mean(lrscale, na.rm = T, vartype = "ci"))
# Plot it
ggplot(lr_gen_w1, aes(x = sex, y = n)) +
geom_bar(aes(fill = sex), stat = "identity") +
geom_errorbar(aes(ymin = n_low, ymax = n_upp), width = 0.2) +
facet_wrap(~cntry) +
theme_bw()
Then we apply the design weights and the post-stratification weights together.
# Create summary data
lr_gen_w2 <- data %>%
mutate(sex = ifelse(gndr == 1, "Male", "Female")) %>%
as_survey_design(weights = c(dweight, pspwght)) %>%
group_by(cntry, sex) %>%
summarize(n = survey_mean(lrscale, na.rm = T, vartype = "ci"))
# Plot it
ggplot(lr_gen_w2, aes(x = sex, y = n)) +
geom_bar(aes(fill = sex), stat = "identity") +
geom_errorbar(aes(ymin = n_low, ymax = n_upp), width = 0.2) +
facet_wrap(~cntry) +
theme_bw()
Of course you can also run a regression model including weights in the calculation. The survey package includes several functions to do that. Here we just make an example to answer the following research question: how do attitudes towards immigration correlate with left-right self-placement in Europe? To do so, we run a regression model with lrscale as dependent variable, and a set of attitudes towards immigration as predictors. We do not talk about causation here: we just want to see how the variables correlate. The predictors are:

imsmetn: Allow many/few immigrants of same race/ethnic group as majority

imdfetn: Allow many/few immigrants of different race/ethnic group from majority

impcntr: Allow many/few immigrants from poorer countries outside Europe

imbgeco: Immigration bad or good for country's economy

imueclt: Country's cultural life undermined or enriched by immigrants

imwbcnt: Immigrants make country worse or better place to live

First, we run a simple linear regression model with all these variables predicting left-right self-placement. We use the glm() function from base R, which fits a regression model using maximum likelihood estimation. It can be used with several types of dependent variables; here, since our outcome variable is an 11-point scale, it assumes by default a Gaussian distribution.
m_1 <- glm(lrscale~imsmetn + imdfetn + impcntr + imbgeco + imueclt + imwbcnt, data = data)
summary(m_1)
##
## Call:
## glm(formula = lrscale ~ imsmetn + imdfetn + impcntr + imbgeco +
## imueclt + imwbcnt, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.0171 -1.2281 -0.0422 1.2796 5.9129
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.148589 0.084227 49.255 < 2e-16 ***
## imsmetn -0.043641 0.023517 -1.856 0.0635 .
## imdfetn 0.178009 0.027294 6.522 7.06e-11 ***
## impcntr 0.230499 0.022168 10.398 < 2e-16 ***
## imbgeco 0.033147 0.007589 4.368 1.26e-05 ***
## imueclt -0.048504 0.007693 -6.305 2.92e-10 ***
## imwbcnt 0.011733 0.008730 1.344 0.1789
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 4.603576)
##
## Null deviance: 133091 on 28210 degrees of freedom
## Residual deviance: 129839 on 28204 degrees of freedom
## (6349 observations deleted due to missingness)
## AIC: 123142
##
## Number of Fisher Scoring iterations: 2
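As a sanity check on the claim above: with the default gaussian family, glm() produces the same point estimates as ordinary least squares via lm(). A quick sketch on toy data:

```r
# With family = gaussian (the default), glm() matches lm() coefficient by
# coefficient; toy data below
df <- data.frame(y = c(1, 3, 2, 5), x = c(0, 1, 2, 3))
coef(lm(y ~ x, data = df))
coef(glm(y ~ x, data = df))
all.equal(coef(lm(y ~ x, data = df)), coef(glm(y ~ x, data = df)))  # TRUE
```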
Then we run it on weighted data. To do so, we first need to create a "weighted data" object, with the syntax we have already seen. We use all the weights that we have.
data_we <- data %>%
as_survey(weights = c(dweight, pspwght, pweight))
Then we run the regression using the svyglm()
function from the survey
package.
m_2 <- svyglm(lrscale~imsmetn + imdfetn + impcntr + imbgeco + imueclt + imwbcnt, design = data_we)
summary(m_2)
##
## Call:
## svyglm(formula = lrscale ~ imsmetn + imdfetn + impcntr + imbgeco +
## imueclt + imwbcnt, design = data_we)
##
## Survey design:
## Called via srvyr
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.41657 0.15427 28.630 < 2e-16 ***
## imsmetn -0.11855 0.04629 -2.561 0.0104 *
## imdfetn 0.27853 0.05195 5.362 8.31e-08 ***
## impcntr 0.22445 0.04264 5.264 1.42e-07 ***
## imbgeco 0.02684 0.01469 1.828 0.0676 .
## imueclt -0.09997 0.01553 -6.438 1.23e-10 ***
## imwbcnt 0.00337 0.01805 0.187 0.8519
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 4.441555)
##
## Number of Fisher Scoring iterations: 2
Some results seem to change quite a bit, so for instance, some variables that were significant before are not significant any longer. How do you interpret this?