Federico Vegetti
ECPR Summer School in Methods and Techniques
Let’s suppose we want to explain why Slovakian people voted at the European Parliament elections in 2009. We have a few ideas about which predictors might have influenced people’s decision whether to go vote or not. More specifically, we think that those who voted at the EP elections were highly interested in politics, and had high trust in the EU as an institution. Moreover, we have a conditional hypothesis: people who are dissatisfied with the national government will be less likely to turn out, but only if they also have low trust in the EU institutions. We want to test these expectations with our data, so we have a look at the Slovakian data from the European Election Study of 2009.
install.packages("lmtest") # We will need this package to perform a Likelihood Ratio test
library(foreign)
library(ggplot2)
library(lmtest)
library(MASS)
setwd("~/Dropbox/Teaching/GLM_ECPR_2017/Lab/Day3")
data <- read.dta("EES_2009_SK.dta")
head(data)
## turnout polint govdis trusteu
## 1 0 0 1 0
## 2 0 0 NA NA
## 3 0 0 NA 0
## 4 0 0 NA NA
## 5 1 2 1 2
## 6 0 1 1 2
We have 4 variables:

- “turnout”: whether the respondent reports having voted at the 2009 EP elections (1) or not (0)
- “polint”: the respondent’s interest in politics
- “govdis”: whether the respondent is dissatisfied with the national government (1) or not (0)
- “trusteu”: the respondent’s trust in the EU as an institution
Let’s first have a look at how the data are distributed:
table(data$turnout)
##
## 0 1
## 537 475
Note that the actual turnout in Slovakia in 2009 was 19.63%, much lower than what our sample suggests: turnout is typically overreported in post-election surveys.
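A quick check of the sample turnout share makes the gap explicit (mean() of a 0/1 variable returns a proportion):

mean(data$turnout, na.rm = TRUE) # about 0.47, against an official turnout of about 0.20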
table(data$govdis)
##
## 0 1
## 477 377
par(mfrow=c(1,2))
barplot(table(data$polint), col = "maroon")
barplot(table(data$trusteu), col = "purple")
The distributions look ok. Let’s run a logit model of “turnout” with all the covariates:
mod.1 <- glm(turnout ~ polint + govdis + trusteu,
family = binomial(link = logit),
data = data)
summary(mod.1)
##
## Call:
## glm(formula = turnout ~ polint + govdis + trusteu, family = binomial(link = logit),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2912 -1.0162 -0.4269 1.0859 1.8218
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.26706 0.25081 -9.039 < 2e-16 ***
## polint 0.90117 0.11223 8.030 9.78e-16 ***
## govdis -0.08251 0.15284 -0.540 0.589
## trusteu 0.52831 0.08103 6.520 7.05e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1136.52 on 819 degrees of freedom
## Residual deviance: 999.47 on 816 degrees of freedom
## (196 observations deleted due to missingness)
## AIC: 1007.5
##
## Number of Fisher Scoring iterations: 3
Looks like “govdis” has no effect on turnout, so dropping it from the model can do no harm. We can compare the fit of the models with and without “govdis” and see whether the variable contributes to the model fit at all. Let’s run a second model without “govdis” and look at the AIC:
mod.2 <- glm(turnout ~ polint + trusteu,
family = binomial(link = logit),
data = data)
summary(mod.2)
##
## Call:
## glm(formula = turnout ~ polint + trusteu, family = binomial(link = logit),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2530 -1.0310 -0.5635 1.1315 1.9589
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.22347 0.21610 -10.289 < 2e-16 ***
## polint 0.94145 0.10394 9.058 < 2e-16 ***
## trusteu 0.46369 0.07431 6.240 4.38e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1331.2 on 961 degrees of freedom
## Residual deviance: 1174.6 on 959 degrees of freedom
## (54 observations deleted due to missingness)
## AIC: 1180.6
##
## Number of Fisher Scoring iterations: 3
Wow, it seems that by excluding “govdis” we lose a lot in terms of model fit. However, look more closely at the output: the first model drops 196 observations due to missingness, while the second drops only 54. The two models are estimated on different samples, so their deviances and AICs are not directly comparable.
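We can verify this by checking how many observations each model actually used (nobs() returns the number of cases used in the fit):

nobs(mod.1) # 820 observations
nobs(mod.2) # 962 observations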
Let’s remove all the observations that have missing responses, so that both models are fit to the same data.
data <- na.omit(data)
This kills almost 200 observations. Now rerun both models.
mod.1 <- glm(turnout ~ polint + govdis + trusteu,
family = binomial(link = logit),
data = data)
mod.2 <- glm(turnout ~ polint + trusteu,
family = binomial(link = logit),
data = data)
summary(mod.1)
##
## Call:
## glm(formula = turnout ~ polint + govdis + trusteu, family = binomial(link = logit),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2912 -1.0162 -0.4269 1.0859 1.8218
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.26706 0.25081 -9.039 < 2e-16 ***
## polint 0.90117 0.11223 8.030 9.78e-16 ***
## govdis -0.08251 0.15284 -0.540 0.589
## trusteu 0.52831 0.08103 6.520 7.05e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1136.52 on 819 degrees of freedom
## Residual deviance: 999.47 on 816 degrees of freedom
## AIC: 1007.5
##
## Number of Fisher Scoring iterations: 3
summary(mod.2)
##
## Call:
## glm(formula = turnout ~ polint + trusteu, family = binomial(link = logit),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2774 -1.0342 -0.4359 1.1012 1.8019
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.30619 0.24038 -9.594 < 2e-16 ***
## polint 0.90253 0.11212 8.050 8.29e-16 ***
## trusteu 0.52853 0.08101 6.524 6.82e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1136.52 on 819 degrees of freedom
## Residual deviance: 999.76 on 817 degrees of freedom
## AIC: 1005.8
##
## Number of Fisher Scoring iterations: 3
We can also compare the BIC.
BIC(mod.1); BIC(mod.2)
## [1] 1026.306
## [1] 1019.888
Question: As we exclude “govdis” from the model, the residual deviance increases a tiny bit, while the AIC & BIC decrease. Why so?
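Hint: for a binomial GLM with a binary outcome, the residual deviance equals -2 times the log-likelihood, so both criteria can be recomputed by hand from the output above (a quick sketch, assuming mod.1 and mod.2 are still in memory):

k1 <- length(coef(mod.1)) # 4 estimated parameters
k2 <- length(coef(mod.2)) # 3 estimated parameters
n <- nobs(mod.1) # 820 complete cases

# AIC = deviance + 2*k: the smaller penalty more than offsets the slightly higher deviance
deviance(mod.1) + 2*k1 # 1007.47
deviance(mod.2) + 2*k2 # 1005.76

# BIC = deviance + log(n)*k: the per-parameter penalty is even larger
deviance(mod.1) + log(n)*k1 # 1026.31
deviance(mod.2) + log(n)*k2 # 1019.89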
Let’s do a likelihood ratio test.
lrtest(mod.2,mod.1)
## Likelihood ratio test
##
## Model 1: turnout ~ polint + trusteu
## Model 2: turnout ~ polint + govdis + trusteu
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 3 -499.88
## 2 4 -499.73 1 0.2914 0.5893
According to the test, the two models are not significantly different from one another. In other words, including “govdis” in the model does not significantly improve the likelihood that the model generated the data.
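In fact, the test statistic can be computed by hand: it is twice the difference in log-likelihoods, which here is just the difference between the residual deviances of the two nested models, compared against a chi-squared distribution with 1 degree of freedom (one restricted parameter):

lr <- deviance(mod.2) - deviance(mod.1) # 0.2914
pchisq(lr, df = 1, lower.tail = FALSE) # 0.5893, the same p-value reported by lrtest()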
We want to present our results in a meaningful and compelling way. To do so, we want to create some predicted probabilities of Y given some interesting values of one or more independent variables X.
We focus on “polint”: how much does the probability of voting at the EU elections change as political interest increases, holding all the other predictors constant?
To present some predicted probabilities with an appropriate account of the error term, we need to simulate a distribution of predicted Y for different levels of X, based on the values of the coefficients and their standard errors.
Let’s use the results of model 1.
summary(mod.1)
##
## Call:
## glm(formula = turnout ~ polint + govdis + trusteu, family = binomial(link = logit),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2912 -1.0162 -0.4269 1.0859 1.8218
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.26706 0.25081 -9.039 < 2e-16 ***
## polint 0.90117 0.11223 8.030 9.78e-16 ***
## govdis -0.08251 0.15284 -0.540 0.589
## trusteu 0.52831 0.08103 6.520 7.05e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1136.52 on 819 degrees of freedom
## Residual deviance: 999.47 on 816 degrees of freedom
## AIC: 1007.5
##
## Number of Fisher Scoring iterations: 3
First, let’s save the coefficients and the variance-covariance matrix.
betas <- coef(mod.1)
varcov <- vcov(mod.1)
Set the seed, for replication. This command controls the random number generator in R, so by setting the same seed we’ll get the same results when we run the same simulation again. This is very useful in case you want to submit simulation-based results to a journal.
set.seed(4682)
Set the number of simulations:
n <- 1000
Generate a multivariate normal distribution of coefficients based on their means and variance/covariance matrix (note that the function “mvrnorm” comes from the package MASS, which we loaded at the beginning; remember to load it whenever you want to use the function):
sim.pred <- data.frame(mvrnorm(n = n,mu = betas,Sigma = varcov))
names(sim.pred) <- c("intercept",names(betas)[2:4])
What we did here is the core of the exercise. We created a data frame with 1000 observations and 4 variables. Each column contains 1000 plausible values of one coefficient, drawn given the estimated variances and covariances. Their means should be similar to the actual coefficients, and their standard deviations should be similar to the actual standard errors:
data.frame(true.coef = betas, mean.sim.coef = apply(sim.pred, 2, mean))
## true.coef mean.sim.coef
## (Intercept) -2.26706319 -2.26989701
## polint 0.90116888 0.89871428
## govdis -0.08250697 -0.07687921
## trusteu 0.52831202 0.52794602
data.frame(true.se = sqrt(diag(varcov)), sd.sim.coef = apply(sim.pred, 2, sd))
## true.se sd.sim.coef
## (Intercept) 0.2508083 0.25123319
## polint 0.1122305 0.11201996
## govdis 0.1528438 0.15325847
## trusteu 0.0810343 0.08207092
Now that we have a distribution of 1000 possible coefficients, we need to choose some values of X that we are interested in, and multiply them by the simulated coefficients. This will produce a distribution of possible predicted probabilities.
We want to see how the probability of turning out changes when “polint” goes from its minimum to its maximum. So we take the range of the variable and put it into an object:
polint.range <- seq(min(data$polint), max(data$polint), length.out = 1000)
We need to hold the other independent variables constant at meaningful values, otherwise they will implicitly be set to 0. We set “trusteu” to its mean, and “govdis” to 0: for dummy variables it makes more sense to use either 0 or 1. We extract these values and put them into two other objects:
m.dis <- 0
m.tru <- mean(data$trusteu)
We need to create an empty matrix to store our predictions.
res <- matrix(NA,nrow = length(polint.range), ncol = 4)
Moreover, we need the inverse logit function to transform our predicted values into probabilities.
inv.logit.fun <- function(x) exp(x)/(1 + exp(x))
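As an aside, base R already provides this function: plogis() is the CDF of the logistic distribution, which is exactly the inverse logit. We define our own version here for transparency, but the two are interchangeable:

all.equal(inv.logit.fun(1.5), plogis(1.5)) # TRUE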
Now we can simulate the values and the confidence intervals. We make a loop that goes through the range of “polint”, and for each level we create a distribution of predicted probabilities. From there we keep the mean (i.e. the predicted probability for that level of “polint”) and two quantiles, 2.5% and 97.5%: these constitute the bounds of the 95% confidence interval.
for(i in 1:length(polint.range)){
  a <- polint.range[i]
  p.1 <- inv.logit.fun(sim.pred$intercept +
                         a*sim.pred$polint +
                         m.dis*sim.pred$govdis + m.tru*sim.pred$trusteu)
  res[i,] <- c(a,
               mean(p.1),
               quantile(p.1, p = 0.025),
               quantile(p.1, p = 0.975))
}
data.sim <- data.frame(res)
names(data.sim) <- c("X","Y","LO","HI")
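Before plotting, we can sanity-check the simulated point predictions against R’s built-in predict() function, which computes the same fitted probabilities analytically; the means of the simulated draws should come very close (though not exactly equal) to these values:

nd <- data.frame(polint = polint.range, govdis = m.dis, trusteu = m.tru)
head(predict(mod.1, newdata = nd, type = "response")) # compare with head(data.sim$Y)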
Finally, we plot the simulated predicted probabilities and confidence intervals:
ggplot(data.sim, aes(x = X, y = Y)) +
  geom_line() +
  geom_ribbon(aes(ymin = LO, ymax = HI), fill = "gray70", alpha = 0.5) +
  scale_y_continuous(limits = c(0,1)) +
  xlab('Political interest') +
  ylab('Turnout') +
  theme_bw()
What does this plot tell us?
At the beginning we came up with a conditional hypothesis: people who are dissatisfied with the national government will be less likely to turn out, but only if they also have low trust in the EU institutions. In other words, the effect of “govdis” depends on the value of “trusteu”: when “trusteu” is 0 (the respondent does not trust the EU at all), “govdis” should have a negative effect on turnout, while when “trusteu” is high the effect of “govdis” should disappear (or even turn positive, for that matter). Practically, we will fit a new model with an interaction between “govdis” and “trusteu”. We expect the main effect of “govdis” to be negative, and the interaction effect of “govdis” with “trusteu” to be positive.
mod.3 <- glm(turnout ~ polint + govdis*trusteu,
family = binomial(link = logit),
data = data)
summary(mod.3)
##
## Call:
## glm(formula = turnout ~ polint + govdis * trusteu, family = binomial(link = logit),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1669 -0.9911 -0.3388 1.0619 2.0291
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.8533 0.2981 -6.218 5.05e-10 ***
## polint 0.9068 0.1129 8.029 9.81e-16 ***
## govdis -0.9756 0.4092 -2.384 0.01713 *
## trusteu 0.3450 0.1097 3.147 0.00165 **
## govdis:trusteu 0.3883 0.1645 2.360 0.01829 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1136.52 on 819 degrees of freedom
## Residual deviance: 993.82 on 815 degrees of freedom
## AIC: 1003.8
##
## Number of Fisher Scoring iterations: 3
The coefficients seem to confirm our theory. All good, we can go for a beer. But first, let’s plot the results using the same procedure as above:
betas.2 <- coef(mod.3)
varcov.2 <- vcov(mod.3)
set.seed(4682)
n <- 1000
sim.pred.2 <- data.frame(mvrnorm(n = n,mu = betas.2,Sigma = varcov.2))
names(sim.pred.2) <- c("intercept",names(betas.2)[2:4],"int")
Now we care about the effect of “trusteu”, so we extract its range:
trusteu.range <- seq(min(data$trusteu), max(data$trusteu), length.out = 100)
And we set “polint” to its mean:
m.int <- mean(data$polint)
How about “govdis”? This is also part of our story, so we should let it vary as well. However, it is a dummy variable, so there is no range to be stored, only two levels: 0 and 1.
This time we create two empty matrices to save our predictions: one for the predicted probabilities when “govdis” = 0, the other for the predicted probabilities when “govdis” = 1.
res.0 <- matrix(NA,nrow = length(trusteu.range), ncol = 4)
res.1 <- matrix(NA,nrow = length(trusteu.range), ncol = 4)
Now we can run the loop:
for(i in 1:length(trusteu.range)){
  a <- trusteu.range[i]
  p.0 <- inv.logit.fun(sim.pred.2$intercept +
                         m.int*sim.pred.2$polint +
                         0*sim.pred.2$govdis + a*sim.pred.2$trusteu + 0*a*sim.pred.2$int)
  # This is the set of predictions with "govdis" = 0
  res.0[i,] <- c(a,
                 mean(p.0),
                 quantile(p.0, p = 0.025),
                 quantile(p.0, p = 0.975))
  p.1 <- inv.logit.fun(sim.pred.2$intercept +
                         m.int*sim.pred.2$polint +
                         1*sim.pred.2$govdis + a*sim.pred.2$trusteu + 1*a*sim.pred.2$int)
  # This is the set of predictions with "govdis" = 1
  res.1[i,] <- c(a,
                 mean(p.1),
                 quantile(p.1, p = 0.025),
                 quantile(p.1, p = 0.975))
}
data.sim.2 <- data.frame(rbind(res.0,res.1)) # We put the two sets of predictions together
names(data.sim.2) <- c("X","Y","LO","HI")
data.sim.2$GR <- rep(c("Approve","Disapprove"), each = length(trusteu.range)) # rows from res.0 ("govdis" = 0) come first, so they are labeled "Approve"
And now let’s do the plot:
ggplot(data.sim.2, aes(x = X, y = Y, group = GR)) +
  geom_line(aes(colour = GR)) +
  geom_ribbon(aes(ymin = LO, ymax = HI), fill = "gray70", alpha = 0.5) +
  scale_colour_manual(values = c("red","blue"), "") +
  scale_y_continuous(limits = c(0,1)) +
  xlab('Trust in the EU') +
  ylab('Turnout') +
  theme_bw()
Looks like that beer will have to wait. Once we plot the predicted probabilities with their confidence intervals, the effect is not there anymore: the confidence bands for the two groups largely overlap across the range of “trusteu”.