Exercise 1: Poisson regression

install.packages("pscl")    # This package allows to fit zero-inflated models
options(scipen=9)    # To fix notation
library(foreign)
library(ggplot2)
library(MASS)
library(texreg)
library(pscl)
rm(list=ls())

Poisson regression is used in instances when the dependent variable in our analyses follows a “discrete probability distribution”, which refers to the probability of a given number of events to occur in a given amount of time. In we think of an empirical situation, if we were to be interested in the distribution of the number of accidents that occur daily on a crowded street in Budapest, this particular distribution would be a Poisson one. It would have most cases in the 0 category (since usually no accidents happen in a day), with a few cases in the 1 category (just 1 accident that day), and even fewer in the 2 category (2 accidents), and so on. For these types of distributions (see below an example with \(\lambda = 0.5\)), other common types of models are of no help. OLS clearly isn’t the right choice, since any such distribution of a count of events is bounded at 0, making it a poor approximation for a normal distribution. We could try to model it as an ordinal variable, but this would result in loss of information, as we know that the probability of j number of events is less than the probability of j-1 events occurring, for any j greater or equal than 1 (see the graph in the figure again).

In this short tutorial, we continue the topic of the last lab session: online and offline mobilization in Belgium. This time, the goal is to determine what predicts the number of online mobilization attempts an individual is subjected to during a political campaign. Our theoretical expectation is that this variable is distributed according to a Poisson distribution: most individuals would have never been contacted by a party, only some would have been contacted once, and only a small minority would have been subjected to two contact attempts.

Please note that the we will be using the dataset “EES_2009_BE_D5.sav”, which is not identical to the one used yesterday. For the analyses we will use the variable “online” rather than “onl_mob” (the one used yesterday). This is because “onl_mob” was converted into a dichotomous measure for convenience. For the analysis now, however, we will use the indicator of online mobilization in the way it was measured (as a count). The new version of the dataset inclused the following variables:

setwd("~/Dropbox/Teaching/GLM_ECPR_2017/Lab/Day5")
data <- read.spss("EES_2009_BE_D5.sav", to.data.frame=TRUE, use.value.labels=FALSE)
data <- data[!is.na(data$online), ]
ggplot(data, aes(x = online)) +
  geom_bar(fill="brown") + 
  xlab("Online mobilization") + ylab("Frequency") + 
  theme_bw()

The graph confirms this: most individuals have never been contacted by a party through online means during the campaign. Looking at the parameters of the distribution (mean, variance) it looks like the variable is not even that much overdispersed.

mean(data$online); var(data$online)
## [1] 0.1875733
## [1] 0.2182961

Let’s run two Poisson models: one with a few sociodemographic predictors, the other including two crucial predictors: interest for politics and political knowledge.

poi.1 <- glm(online ~ age + male + married + union + larg_urb + edu_rec,
             family="poisson", 
             data = subset(data, !is.na(pol_int) & !is.na(indx_kno)))
poi.2 <- glm(online ~ age + male + married + union + larg_urb + edu_rec + 
               pol_int + indx_kno,
             family="poisson", 
             data = data)

htmlreg(list(poi.1, poi.2), single.row = T, stars = c(0.001, 0.01, 0.05), bold = 0.01,
        caption = "Poisson models of online political mobilization")
Poisson models of online political mobilization
Model 1 Model 2
(Intercept) -2.52 (0.22)*** -3.98 (0.36)***
age -0.02 (0.01)*** -0.03 (0.01)***
male 0.33 (0.18) 0.15 (0.19)
married 0.12 (0.19) 0.17 (0.19)
union 0.33 (0.18) 0.27 (0.18)
larg_urb 0.49 (0.18)** 0.45 (0.18)*
edu_rec 0.12 (0.03)*** 0.06 (0.04)
pol_int 0.31 (0.12)**
indx_kno 0.28 (0.06)***
AIC 681.93 650.94
BIC 714.04 692.23
Log Likelihood -333.96 -316.47
Deviance 448.43 413.44
Num. obs. 726 726
p < 0.001, p < 0.01, p < 0.05

It looks like by including political interest and knowledge in the model we kill the effect of education, and substantially reduce the effect of living in a large urban area or suburb. The effect of age is still there though: not surprisingly, being older implies being subject to less online attempts of mobilization (most likely because older people are less likely to go online in the first place). How do we interpret the coefficients? The units presented in the table are log counts. This means that moving one step in political interest increases the log count of online mobilization attempts by 0.31. We can probably all agree that this is not intuitive at all.

When we exponentiate the coefficients we get incident rate ratios:

data.frame(coef = coef(poi.2), IRR = exp(coef(poi.2)))
##                    coef        IRR
## (Intercept) -3.97668515 0.01874768
## age         -0.03465234 0.96594117
## male         0.14571199 1.15686296
## married      0.17192128 1.18758434
## union        0.27297838 1.31387184
## larg_urb     0.44687605 1.56342050
## edu_rec      0.05609770 1.05770101
## pol_int      0.31240687 1.36671065
## indx_kno     0.28423385 1.32874362

Question: How do we interpret these values?

We saw that our dependent variable “online” is not really overdispersed – AKA mean and variance are pretty similar. However, for the sake of exercise, we will repeat the same two models using the negative binomial regression. The package MASS has a function “glm.nb()” which does what we need.

nb.1 <- glm.nb(online ~ age + male + married + union + larg_urb + edu_rec,
             data = subset(data, !is.na(pol_int) & !is.na(indx_kno)))
nb.2 <- glm.nb(online ~ age + male + married + union + larg_urb + edu_rec + 
               pol_int + indx_kno,
             data = data)

htmlreg(list(nb.1, nb.2), single.row = T, stars = c(0.001, 0.01, 0.05), bold = 0.01,
        caption = "Negative Binomial models of online political mobilization")
Negative Binomial models of online political mobilization
Model 1 Model 2
(Intercept) -2.54 (0.23)*** -3.99 (0.37)***
age -0.02 (0.01)*** -0.03 (0.01)***
male 0.34 (0.20) 0.16 (0.19)
married 0.13 (0.21) 0.19 (0.20)
union 0.32 (0.20) 0.27 (0.19)
larg_urb 0.49 (0.20)* 0.45 (0.19)*
edu_rec 0.12 (0.04)*** 0.06 (0.04)
pol_int 0.31 (0.12)*
indx_kno 0.29 (0.06)***
AIC 680.00 652.45
BIC 716.70 698.33
Log Likelihood -332.00 -316.23
Deviance 378.94 389.85
Num. obs. 726 726
p < 0.001, p < 0.01, p < 0.05

Results are pretty similar, but some coefficients are now accompained by larger errors. The negative binomial model in most of the cases produces more conservative estimates than the Poisson model.

Unfortunately, when reporting the outputs from a glm.nb object, it seems that “texreg” does not include the dispersion parameter \(\theta\). We can extract it directly from the object containing the model:

nb.1$theta
## [1] 1.481192
nb.2$theta
## [1] 4.935071

Remember, in the slides we parametrized the dispersion of the negative binomial distribution like \(\lambda + \frac{\lambda^2}{\theta}\). I chose this type of parametrization (different from the direct one that is suggested in other places, like e.g. Benoit’s article that was a reading for today) because it is more congruent of how R intends \(\theta\). More specifically, the result that R gives you is the inverse of the dispersion parameter given by other softwares (like Stata) where it might be called \(\alpha\) or \(\sigma^2\). So in model “nb.1” \(\theta\) has value 1.481, whereas in Stata it would be 1/1.481 = 0.675.

The “inverse” nature of \(\theta\) in this context implies that smaller values mean more overdispersion, while larger values mean less overdispersion. As \(\theta\) approaches 0, the overdispersion increases, while as \(\theta\) approaches infinity the negative binomial distribution approaches the Poisson distribution. In our case, in model “nb.1”, 1 + 1^2/1.481 = 1.6752, which means that the dispersion of the distribution is 1.68 times as much as the mean (or the expected count). This should be reflected by calculating the value of the variance given \(\theta\):

var(data$online)
## [1] 0.2182961
mean(data$online) + mean(data$online)^2/nb.1$theta
## [1] 0.2113269

Why is the value so different for model “nb.2”, though? Here the inflation of the variance is only 1 + 1^2/4.935 = 1.2026, AKA the dispersion of the distribution of “online” is 1.2 times as much as the mean. This suggests that \(\theta\) is an indicator of the variance of the residuals: the larger its value, the more of the overdispersion is explained by our model. In this case, the variance gets even closer to the mean:

mean(data$online)
## [1] 0.1875733
var(data$online)
## [1] 0.2182961
mean(data$online) + mean(data$online)^2/nb.2$theta
## [1] 0.1947026

This means that our two predictors, political interest and information, take care very well of the overdispersion of our dependent variable. Given these results, we could justify using the more “parsimonious” Poisson model.

Let’s use the results of the “poi.2” model to estimate predicted counts for different levels of political knowledge. We can use the function “predict()”, so we can see how it produces standard errors (it did not work yesterday with the ordinal model).

First, we set the predictors at some useful values:

sim.data <- as.data.frame(cbind(age = 0, 
                                male = 0, 
                                married = 0, 
                                union = 0, 
                                larg_urb = 0,
                                edu_rec = 0, 
                                pol_int = mean(data$pol_int, na.rm = T),
                                indx_kno = seq(min(data$indx_kno, na.rm = T),
                                               max(data$indx_kno, na.rm = T),
                                               length.out = 10)))
pred <- predict(poi.2, sim.data, type = "response", se.fit=TRUE)
head(pred)
## $fit
##          1          2          3          4          5          6 
## 0.02940653 0.03668205 0.04575761 0.05707858 0.07120049 0.08881633 
##          7          8          9         10 
## 0.11079053 0.13820140 0.17239405 0.21504635 
## 
## $se.fit
##          1          2          3          4          5          6 
## 0.01003751 0.01123214 0.01259090 0.01425569 0.01650652 0.01982417 
##          7          8          9         10 
## 0.02492180 0.03272975 0.04438245 0.06127046 
## 
## $residual.scale
## [1] 1

Looks like the object we created is a list, so we need to extract the predictions and their errors and merge them with the data we created:

# Predictions
sim.data$pred <- pred$fit
# SEs
sim.data$se <- pred$se.fit
# Calculate CIs
sim.data$lower <- sim.data$pred - 1.96*sim.data$se
sim.data$upper <- sim.data$pred + 1.96*sim.data$se
# That's how it looks like
round(cbind(sim.data$indx_kno, sim.data$pred, sim.data$lower, sim.data$upper), digits = 3)
##        [,1]  [,2]  [,3]  [,4]
##  [1,] 0.000 0.029 0.010 0.049
##  [2,] 0.778 0.037 0.015 0.059
##  [3,] 1.556 0.046 0.021 0.070
##  [4,] 2.333 0.057 0.029 0.085
##  [5,] 3.111 0.071 0.039 0.104
##  [6,] 3.889 0.089 0.050 0.128
##  [7,] 4.667 0.111 0.062 0.160
##  [8,] 5.444 0.138 0.074 0.202
##  [9,] 6.222 0.172 0.085 0.259
## [10,] 7.000 0.215 0.095 0.335

Lets plot it!

ggplot(sim.data, aes(x = indx_kno, y = pred)) + 
  geom_line() + 
  geom_line(aes(x = indx_kno, y = lower), linetype = 2) + 
  geom_line(aes(x = indx_kno, y = upper), linetype = 2) +
  theme_bw() + xlab("Political Knowledge") + ylab("Predicted number of events")

How do we describe this result?

Finally, let’s check for one little thing: our theory is that the negative effect of age on online mobilization can be explained because older people are less likely to use the internet. This means that for them, a zero count is due to the fact that they are not even exposed to attempts of mobilization via online means. In other words, we can theorize that, in our sample, there are some cases with structural zeros, and that age is one predictor of the probability to be in this group. Another likely predictor for that is education: less educated people are less likely to be internet users (or at least, this used to be the case a few years ago, before smartphones were so common).

So let’s fit a zero-inflated model with the same predictors as the “poi.2” model plus age and education predicting the probability to be in the “zero count” group. The package “pscl” has a function called “zeroinfl()” that is perfect for our purpose:

zinf.1 <- zeroinfl(online ~ age + male + married + union + larg_urb + edu_rec + 
                     pol_int + indx_kno | 
                     age + edu_rec,
                   dist = "poisson", link = "logit",
                   data = data)
summary(zinf.1)
## 
## Call:
## zeroinfl(formula = online ~ age + male + married + union + larg_urb + 
##     edu_rec + pol_int + indx_kno | age + edu_rec, data = data, dist = "poisson", 
##     link = "logit")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9364 -0.4195 -0.2953 -0.1517  8.3713 
## 
## Count model coefficients (poisson with log link):
##             Estimate Std. Error z value          Pr(>|z|)    
## (Intercept) -3.33846    0.45022  -7.415 0.000000000000121 ***
## age         -0.01585    0.01172  -1.352            0.1764    
## male         0.18340    0.19310   0.950            0.3422    
## married      0.13200    0.21248   0.621            0.5345    
## union        0.24513    0.19244   1.274            0.2027    
## larg_urb     0.44141    0.19185   2.301            0.0214 *  
## edu_rec     -0.05110    0.06047  -0.845            0.3981    
## pol_int      0.32146    0.12507   2.570            0.0102 *  
## indx_kno     0.28023    0.06597   4.248 0.000021606924877 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.47320    0.63056  -0.750   0.4530  
## age          0.05419    0.02723   1.990   0.0466 *
## edu_rec     -0.32565    0.13273  -2.453   0.0141 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 38 
## Log-likelihood: -313.5 on 12 Df

Question: how do you describe these results?