Federico Vegetti
ECPR Summer School in Methods and Techniques
For a long time, theories of voting behavior in Europe focused on people’s political loyalties, regarded as a political expression of their socio-structural characteristics. This was due to the strong legacy of the “cleavage” theory. In the last decades, however, other models relating shorter-term factors (such as issues or party characteristics) to the vote have replaced the old idea of party preferences as long-term loyalties. In this view, party preferences do not come as a direct consequence of people’s membership in social groups (like class or religion), but depend to a greater extent on what parties do - e.g. what positions they take on issues or ideology, how competent they are in dealing with salient problems, and so on.
Germany is quite a textbook example of a political system where theories of voting ‘work’, so let’s have a look at how theories of voting based on social group membership perform there. To do so, we take German data from the European Elections of 2009. Back then, five parties competed for most of the votes:
CDU, or Christian Democratic Union, the party of the (then as now) chancellor Angela Merkel. The CDU is a rather conservative Christian-democratic party, which in Bavaria has its expression in the CSU - the somewhat more conservative Christian Social Union. Note that in our file there is only one option for both CDU and CSU, as the two parties are regarded as functionally equivalent.
SPD, or Social Democratic Party of Germany, the biggest left-wing German party: a rather moderate social-democratic party that has been in government several times, both on its own and in a grand coalition with the CDU.
The Greens, a middle-sized party that is decidedly placed on the left, in spite of being a bit more conservative than other green parties in Europe (like, for instance, GroenLinks in the Netherlands). Born as a single-issue environmental party, it has been gaining more and more support among young left-wingers who care more about social policy issues than economic policy issues.
FDP, or Free Democratic Party, the German liberal party. As the label suggests, this is a liberal party in terms of both economic and social policies (unlike the CDU, which on the latter dimension is more concerned with traditional family values).
The Linke, an old-fashioned left-wing party, born from a merger between the leftmost fringe of the SPD and the PDS, the descendant of the East German ruling socialist party.
install.packages("mlogit") # This is my favorite package to do multinomial logit in R
options(scipen=9) # To avoid scientific notation in the output
library(foreign)
library(ggplot2)
library(MASS)
library(xtable)
library(mlogit)
library(reshape2) # We will need this package at the very end
setwd("~/Dropbox/Teaching/GLM_ECPR_2017/Lab/Day4")
data <- read.dta("EES_2009_DE.dta")
names(data)
## [1] "vote_int" "lrsp" "eusp" "polint" "polinfo"
## [6] "educ" "female" "age" "class" "workclass"
## [11] "midclass" "upclass" "relig"
head(data)
## vote_int lrsp eusp polint polinfo educ female age class workclass
## 1 NA 4 2 2 5 1 1 70 3 0
## 2 NA 10 3 2 2 NA 0 20 1 1
## 3 1 5 1 1 1 1 1 48 3 0
## 4 4 3 10 1 4 2 0 68 2 0
## 5 1 5 5 2 4 2 1 43 3 0
## 6 NA 5 5 0 1 1 1 54 1 1
## midclass upclass relig
## 1 1 0 8
## 2 0 0 3
## 3 1 0 1
## 4 1 0 10
## 5 1 0 7
## 6 0 0 4
The variables that we have are: “vote_int” (vote intention, coded as below), “lrsp” and “eusp” (left-right and EU integration self-placement scales), “polint” (political interest), “polinfo” (political information), “educ” (education), “female” (a gender dummy), “age” (age in years), “class” (subjective social class, from which the three dummies “workclass”, “midclass” and “upclass” are derived), and “relig” (religiosity, on a 0-10 scale).
The party codes are: 1 = CDU/CSU, 2 = SPD, 3 = Greens, 4 = Linke, 5 = FDP.
table(data$vote_int, useNA = "always")
##
## 1 2 3 4 5 <NA>
## 288 161 84 51 104 316
There are quite a few missing observations. We can drop them.
data <- data[!is.na(data$vote_int),]
This drops more than 300 cases, but we would not have used them anyway.
For a multinomial model of vote choice we pick social class (two of the three dummies, leaving middle class as the reference category), religiosity, and a bunch of controls: gender, age, education and political information.
The function “mlogit()” (from the package of the same name) will do the work. The syntax is a bit different from that of R’s “glm()” function, but in an intuitive way. The intercept (the “1”) is specified explicitly (while in “glm()” it is included by default), and we need to add two more arguments: “shape”, which tells the function how the data are organized (“wide” here, i.e. one row per respondent), and “reflevel”, the choice category to use as the reference. We pick “1”, that is CDU/CSU: after all, it’s the most important party in Germany, so it makes sense to compare all the other parties to it.
mnl.1 <- mlogit(vote_int ~ 1 | age + female + educ + polinfo +
workclass + upclass +
relig,
data = data, shape = "wide", reflevel = "1")
summary(mnl.1)
##
## Call:
## mlogit(formula = vote_int ~ 1 | age + female + educ + polinfo +
## workclass + upclass + relig, data = data, reflevel = "1",
## shape = "wide", method = "nr", print.level = 0)
##
## Frequencies of alternatives:
## 1 2 3 4 5
## 0.424437 0.228296 0.118971 0.072347 0.155949
##
## nr method
## 16 iterations, 0h:0m:1s
## g'(-H)^-1g = 3.78E-07
## gradient close to zero
##
## Coefficients :
## Estimate Std. Error t-value Pr(>|t|)
## 2:(intercept) -0.5494873 0.6039645 -0.9098 0.3629277
## 3:(intercept) -2.8011235 0.8436574 -3.3202 0.0008995 ***
## 4:(intercept) -3.4008645 1.0762256 -3.1600 0.0015777 **
## 5:(intercept) -0.0175918 0.6545086 -0.0269 0.9785571
## 2:age 0.0085560 0.0067985 1.2585 0.2082086
## 3:age -0.0142864 0.0088833 -1.6082 0.1077849
## 4:age 0.0255630 0.0112477 2.2727 0.0230422 *
## 5:age -0.0119719 0.0074704 -1.6026 0.1090282
## 2:female 0.1361002 0.2290180 0.5943 0.5523267
## 3:female 0.8890121 0.2963157 3.0002 0.0026978 **
## 4:female -0.3204562 0.3807699 -0.8416 0.4000115
## 5:female -0.2210516 0.2622717 -0.8428 0.3993212
## 2:educ 0.0548316 0.2249891 0.2437 0.8074572
## 3:educ 0.8279904 0.3002898 2.7573 0.0058280 **
## 4:educ 0.9082849 0.3942995 2.3035 0.0212485 *
## 5:educ 0.0184443 0.2496423 0.0739 0.9411036
## 2:polinfo -0.0779493 0.0695449 -1.1208 0.2623520
## 3:polinfo 0.2084637 0.0942486 2.2118 0.0269771 *
## 4:polinfo -0.0159211 0.1166877 -0.1364 0.8914719
## 5:polinfo 0.0086176 0.0785360 0.1097 0.9126248
## 2:workclass 0.4276752 0.2938251 1.4555 0.1455189
## 3:workclass -0.7570396 0.6387980 -1.1851 0.2359778
## 4:workclass 0.8381831 0.4620890 1.8139 0.0696932 .
## 5:workclass -0.3083670 0.3973650 -0.7760 0.4377315
## 2:upclass 0.9718464 1.0281899 0.9452 0.3445562
## 3:upclass -14.8264715 2680.7259079 -0.0055 0.9955871
## 4:upclass 1.7267618 1.3263762 1.3019 0.1929628
## 5:upclass 0.9324537 1.0253184 0.9094 0.3631240
## 2:relig -0.0950517 0.0386427 -2.4598 0.0139031 *
## 3:relig -0.0765886 0.0492395 -1.5554 0.1198433
## 4:relig -0.3139166 0.0637395 -4.9250 0.0000008436 ***
## 5:relig -0.0576338 0.0444641 -1.2962 0.1949107
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -841.84
## McFadden R^2: 0.056196
## Likelihood ratio test : chisq = 100.25 (p.value = 0.00000000046142)
The output of “mlogit()” is not very handy (at least not as well organized as the output of a multinomial logit in Stata), but we can work with it a bit. Let’s extract the betas and standard errors for the different categories and tidy them up.
betas <- coef(mnl.1)
varcov <- vcov(mnl.1)
se <- sqrt(diag(varcov))
# Betas for each alternative (vs. the reference, CDU/CSU): the vector
# alternates over alternatives, so each alternative gets every 4th element
cf.2 <- betas[c(1, 5, 9, 13, 17, 21, 25, 29)] # SPD
cf.3 <- betas[c(2, 6, 10, 14, 18, 22, 26, 30)] # Greens
cf.4 <- betas[c(3, 7, 11, 15, 19, 23, 27, 31)] # Linke
cf.5 <- betas[c(4, 8, 12, 16, 20, 24, 28, 32)] # FDP
# Same indexing for the standard errors
se.2 <- se[c(1, 5, 9, 13, 17, 21, 25, 29)]
se.3 <- se[c(2, 6, 10, 14, 18, 22, 26, 30)]
se.4 <- se[c(3, 7, 11, 15, 19, 23, 27, 31)]
se.5 <- se[c(4, 8, 12, 16, 20, 24, 28, 32)]
tab <- xtable(data.frame(SPD.beta = cf.2, SPD.se = se.2, Greens.beta = cf.3, Greens.se = se.3,
Linke.beta = cf.4, Linke.se = se.4, FDP.beta = cf.5, FDP.se = se.5), digits = 3)
print.xtable(tab, type = "html")
| | SPD.beta | SPD.se | Greens.beta | Greens.se | Linke.beta | Linke.se | FDP.beta | FDP.se |
|---|---|---|---|---|---|---|---|---|
| 2:(intercept) | -0.549 | 0.604 | -2.801 | 0.844 | -3.401 | 1.076 | -0.018 | 0.655 |
| 2:age | 0.009 | 0.007 | -0.014 | 0.009 | 0.026 | 0.011 | -0.012 | 0.007 |
| 2:female | 0.136 | 0.229 | 0.889 | 0.296 | -0.320 | 0.381 | -0.221 | 0.262 |
| 2:educ | 0.055 | 0.225 | 0.828 | 0.300 | 0.908 | 0.394 | 0.018 | 0.250 |
| 2:polinfo | -0.078 | 0.070 | 0.208 | 0.094 | -0.016 | 0.117 | 0.009 | 0.079 |
| 2:workclass | 0.428 | 0.294 | -0.757 | 0.639 | 0.838 | 0.462 | -0.308 | 0.397 |
| 2:upclass | 0.972 | 1.028 | -14.826 | 2680.726 | 1.727 | 1.326 | 0.932 | 1.025 |
| 2:relig | -0.095 | 0.039 | -0.077 | 0.049 | -0.314 | 0.064 | -0.058 | 0.044 |
What do you read into this?
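As an aside, picking indices by hand (as above) is error-prone. Since the coefficient vector keeps the ordering shown in the summary (alternatives varying fastest within each variable), the same split can be done in one step. A minimal sketch, assuming that ordering:
# Reshape the coefficient vector into a matrix: one row per alternative
# (2 = SPD, 3 = Greens, 4 = Linke, 5 = FDP), one column per variable
cf.mat <- matrix(betas, nrow = 4,
                 dimnames = list(c("SPD", "Greens", "Linke", "FDP"),
                                 c("intercept", "age", "female", "educ",
                                   "polinfo", "workclass", "upclass", "relig")))
se.mat <- matrix(se, nrow = 4, dimnames = dimnames(cf.mat))
round(cf.mat, 3)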
Let’s simulate the predicted probabilities of voting for each party as a function of religiosity. First, let’s look at the coefficients and standard errors of “relig” for the four parties:
tab.relig <- data.frame(beta = betas[c(29:32)], se = se[c(29:32)])
row.names(tab.relig) <- c("SPD", "Greens", "Linke", "FDP")
print.xtable(xtable(tab.relig, digits = 3), type = "html")
| | beta | se |
|---|---|---|
| SPD | -0.095 | 0.039 |
| Greens | -0.077 | 0.049 |
| Linke | -0.314 | 0.064 |
| FDP | -0.058 | 0.044 |
For the SPD and, more clearly, the Linke, the slopes of “relig” are significantly different from the CDU’s. However, these results do not tell us much about the effect of religiosity on the probability to vote for the CDU itself. By simulating predicted probabilities, we will be able to see that as well.
Let’s set up the simulation.
set.seed(461982)
n <- 1000
# Draw n coefficient vectors from the estimated sampling distribution
sim.pred <- mvrnorm(n, betas, varcov)
Extract the coefficient names for each alternative (we will use them to index the simulated draws).
c.n.2 <- names(betas)[c(1, 5, 9, 13, 17, 21, 25, 29)]
c.n.3 <- names(betas)[c(2, 6, 10, 14, 18, 22, 26, 30)]
c.n.4 <- names(betas)[c(3, 7, 11, 15, 19, 23, 27, 31)]
c.n.5 <- names(betas)[c(4, 8, 12, 16, 20, 24, 28, 32)]
Set the independent variables to some meaningful values, storing them in an object. The order must match that of the coefficients: age, female, educ, polinfo, workclass, upclass (the intercept and “relig” are added separately inside the loop).
means <- c(mean(data$age, na.rm = T), 1, mean(data$educ, na.rm = T),
mean(data$polinfo, na.rm = T), 0, 0) # a middle-class woman with average age, education and information
x.range <- seq(min(data$relig, na.rm = T), max(data$relig, na.rm = T), length.out = 10)
Create empty matrices for the predictions.
res.1 <- matrix(NA, nrow = length(x.range), ncol = 4)
res.2 <- matrix(NA, nrow = length(x.range), ncol = 4)
res.3 <- matrix(NA, nrow = length(x.range), ncol = 4)
res.4 <- matrix(NA, nrow = length(x.range), ncol = 4)
res.5 <- matrix(NA, nrow = length(x.range), ncol = 4)
Run the loop.
for(i in 1:length(x.range)){
a <- x.range[i]
# Generate vectors of predictions on the linear predictor
b.2 <- sim.pred[, c.n.2]%*%c(1, means,a)
b.3 <- sim.pred[, c.n.3]%*%c(1, means,a)
b.4 <- sim.pred[, c.n.4]%*%c(1, means,a)
b.5 <- sim.pred[, c.n.5]%*%c(1, means,a)
# Generate vectors of probabilities (notice that category 1 is the baseline)
p.1 <- 1/(1 + exp(b.2) + exp(b.3) + exp(b.4) + exp(b.5))
p.2 <- exp(b.2)/(1 + exp(b.2) + exp(b.3) + exp(b.4) + exp(b.5))
p.3 <- exp(b.3)/(1 + exp(b.2) + exp(b.3) + exp(b.4) + exp(b.5))
p.4 <- exp(b.4)/(1 + exp(b.2) + exp(b.3) + exp(b.4) + exp(b.5))
p.5 <- exp(b.5)/(1 + exp(b.2) + exp(b.3) + exp(b.4) + exp(b.5))
# Put means and CIs in 4 matrices
res.1[i, ] <- c(a,
mean(p.1),
quantile(p.1, p = 0.025),
quantile(p.1, p = 0.975))
res.2[i, ] <- c(a,
mean(p.2),
quantile(p.2, p = 0.025),
quantile(p.2, p = 0.975))
res.3[i, ] <- c(a,
mean(p.3),
quantile(p.3, p = 0.025),
quantile(p.3, p = 0.975))
res.4[i, ] <- c(a,
mean(p.4),
quantile(p.4, p = 0.025),
quantile(p.4, p = 0.975))
res.5[i, ] <- c(a,
mean(p.5),
quantile(p.5, p = 0.025),
quantile(p.5, p = 0.975))
}
data.sim <- data.frame(rbind(res.1, res.2, res.3, res.4, res.5))
names(data.sim) <- c("X", "Y", "LO", "HI")
data.sim$GR <- rep(c("1. CDU", "2. SPD", "3. Greens", "4. Linke", "5. FDP"), each = length(x.range))
Let’s now see what we just did.
ggplot(data.sim, aes(x = X, y = Y, group = GR)) +
geom_line() +
geom_ribbon(aes(ymin = LO, ymax = HI), fill = "gray70", alpha = 0.5) +
facet_wrap(~ GR, ncol = 3) +
scale_y_continuous(limits = c(0,1)) +
scale_x_continuous(breaks = seq(0, 10, by = 1)) +
xlab("Religiosity") +
ylab("Probability to vote for party") +
theme_bw()
Questions: what do you conclude about the effect of religiosity on the probability of voting for each party? For which parties does it make the biggest difference?
Suppose we want to study people’s engagement with the European Elections campaign in Belgium. Specifically, our research question is: does online mobilization have an effect on psychological engagement with the political campaign, even after controlling for a variety of socio-demographic predictors and, most importantly, for offline mobilization?
BE.data <- read.spss(file="~/Dropbox/Teaching/GLM_ECPR_2017/Lab/Day4/EES_2009_BE.sav",
to.data.frame=TRUE, use.value.labels=FALSE)
head(BE.data)
## age male employed union eng_tri edu_rec off_mob onl_mob
## 1 -1 1 1 0 1 1 NA NA
## 2 23 0 0 1 2 2 0 0
## 3 31 1 0 0 1 -5 0 0
## 4 30 0 0 0 1 1 0 0
## 5 9 0 1 1 2 6 1 1
## 6 NA 0 NA 0 1 NA NA NA
The variables in the dataset are: “age” (age), “male” (a gender dummy), “employed” (an employment dummy), “union” (union membership), “eng_tri” (psychological engagement with the campaign, in three categories: 0 = low, 1 = middle, 2 = high), “edu_rec” (education, recoded), “off_mob” (offline mobilization) and “onl_mob” (online mobilization).
Let’s see how “eng_tri” is distributed:
BE.data <- BE.data[!is.na(BE.data$eng_tri), ] # Kill the missing cases
BE.data$eng_tri <- as.factor(BE.data$eng_tri) # Convert it into a factor (required by R)
round(table(BE.data$eng_tri)/length(BE.data$eng_tri), digits = 2)
##
## 0 1 2
## 0.27 0.55 0.18
ggplot(BE.data, aes(x = eng_tri)) + geom_bar(col = "black", fill = "brown") + theme_bw()
OLS can’t really work here (the outcome takes only three ordered values), and binary logit does not apply either. How does engagement covary with the two variables indicating the two types of mobilization?
round(prop.table(table(BE.data$onl_mob, BE.data$eng_tri), 1), digits = 2)
##
## 0 1 2
## 0 0.29 0.54 0.17
## 1 0.20 0.51 0.30
round(prop.table(table(BE.data$off_mob, BE.data$eng_tri), 1), digits = 2)
##
## 0 1 2
## 0 0.30 0.54 0.16
## 1 0.15 0.54 0.31
Both cross-tabs offer descriptive evidence that online and offline mobilization have an effect on political engagement: mobilized respondents are noticeably more likely to be in the highest engagement category.
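As a rough check before any modeling, we can test whether engagement is independent of each mobilization variable. This is just a sketch: the chi-squared test ignores the ordering of the engagement categories.
# Chi-squared tests of independence (these ignore that eng_tri is ordinal)
chisq.test(table(BE.data$onl_mob, BE.data$eng_tri))
chisq.test(table(BE.data$off_mob, BE.data$eng_tri))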
Ordinal logistic regression can be run with the “polr()” command from the “MASS” package. Note that there is also a package called “ordinal” that deals specifically with ordinal models. We will use “polr()” here, but if you need to specify more complex ordinal models (e.g. multilevel models), the “ordinal” package is the way to go. Before doing anything substantive, let’s get one thing straight: how do we interpret the estimated thresholds? To get an idea, let’s run an ordinal logit model without predictors:
ord.0 <- polr(eng_tri ~ 1,
method = "logistic", Hess = T,
data = BE.data)
summary(ord.0)
## Call:
## polr(formula = eng_tri ~ 1, data = BE.data, Hess = T, method = "logistic")
##
## No coefficients
##
## Intercepts:
## Value Std. Error t value
## 0|1 -0.9983 0.0717 -13.9195
## 1|2 1.5361 0.0833 18.4328
##
## Residual Deviance: 1950.707
## AIC: 1954.707
In “polr” objects, the threshold values are stored in a vector called “zeta”. Let’s extract and exponentiate them:
thr <- ord.0$zeta
exp(thr)
## 0|1 1|2
## 0.3684939 4.6462619
To understand what the threshold values represent, let’s calculate the cumulative odds of ending up in category 0 (versus 1 and 2) and in categories 0 or 1 (versus category 2):
table(BE.data$eng_tri)
##
## 0 1 2
## 266 547 175
# Odds of y = 0 (vs. y = 1 or 2)
(266)/(547+175)
## [1] 0.3684211
# Odds of y = 0 or 1 (vs. y = 2)
(266+547)/(175)
## [1] 4.645714
This should clarify that, in ordinal logit models, the threshold values are the cumulative log odds of observing y at or below the threshold rather than above it (when all predictors are 0 - which doesn’t matter here, since we have no predictors). We can also convert them into probabilities:
# What is the probability (when all predictors are 0) to end up in category 0?
exp(thr[1])/(1 + exp(thr[1]))
## 0|1
## 0.2692697
# What is the probability to end up in category 0 OR 1?
exp(thr[2])/(1 + exp(thr[2]))
## 1|2
## 0.8228917
# What is the probability to end up exactly in category 1?
exp(thr[2])/(1 + exp(thr[2])) - exp(thr[1])/(1 + exp(thr[1]))
## 1|2
## 0.553622
# What is the probability to end up in category 2?
1 - exp(thr[2])/(1 + exp(thr[2]))
## 1|2
## 0.1771083
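As a side note, R’s built-in “plogis()” function is exactly this inverse-logit transformation, so the same probabilities can be computed more concisely:
# plogis(q) is exp(q)/(1 + exp(q)), the inverse logit
plogis(thr[1]) # P(y = 0)
plogis(thr[2]) - plogis(thr[1]) # P(y = 1)
1 - plogis(thr[2]) # P(y = 2)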
Let’s now get down to business and run an ordinal logit model with just the socio-demographic indicators:
ord.1 <- polr(eng_tri ~ age + union + male + edu_rec + employed,
method = "logistic", Hess = T,
data = BE.data)
summary(ord.1)
## Call:
## polr(formula = eng_tri ~ age + union + male + edu_rec + employed,
## data = BE.data, Hess = T, method = "logistic")
##
## Coefficients:
## Value Std. Error t value
## age 0.01322 0.003879 3.4084
## union -0.25230 0.141188 -1.7870
## male 0.35487 0.134962 2.6294
## edu_rec 0.10525 0.023335 4.5104
## employed 0.08211 0.140408 0.5848
##
## Intercepts:
## Value Std. Error t value
## 0|1 -0.8859 0.1248 -7.0985
## 1|2 1.7951 0.1384 12.9695
##
## Residual Deviance: 1657.993
## AIC: 1671.993
## (122 observations deleted due to missingness)
Let’s see the probability to end up in the 3 categories for men (vs. women). Note that “thr” still holds the thresholds from the intercept-only model “ord.0”; strictly speaking, predictions from “ord.1” should use its own thresholds (“ord.1$zeta”), so take the numbers below as illustrative.
c.m <- as.numeric(coef(ord.1)["male"])
# Probability to end up in category 0 when you are a man (and all other predictors are 0)
p.0.m <- exp(thr[1] - (1*c.m))/(1 + exp(thr[1] - (1*c.m)))
p.0.m
## 0|1
## 0.2053477
# Let's compare it with women
p.0.w <- exp(thr[1] - (0*c.m))/(1 + exp(thr[1] - (0*c.m)))
p.0.w
## 0|1
## 0.2692697
# Probability to end up exactly in category 1 when you are a man
p.1.m <- exp(thr[2] - (1*c.m))/(1 + exp(thr[2] - (1*c.m))) - exp(thr[1] - (1*c.m))/(1 + exp(thr[1] - (1*c.m)))
p.1.m
## 1|2
## 0.5598148
# Let's compare it with women
p.1.w <- exp(thr[2] - (0*c.m))/(1 + exp(thr[2] - (0*c.m))) - exp(thr[1] - (0*c.m))/(1 + exp(thr[1] - (0*c.m)))
p.1.w
## 1|2
## 0.553622
# Probability to end up exactly in category 2 when you are a man
p.2.m <- 1 - exp(thr[2] - (1*c.m))/(1 + exp(thr[2] - (1*c.m)))
p.2.m
## 1|2
## 0.2348375
# Let's compare it with women
p.2.w <- 1 - exp(thr[2] - (0*c.m))/(1 + exp(thr[2] - (0*c.m)))
p.2.w
## 1|2
## 0.1771083
Let’s plot these values:
prob.dat <- data.frame(cat = as.factor(rep(c(0, 1, 2), each = 2)),
p = c(p.0.m, p.0.w, p.1.m, p.1.w, p.2.m, p.2.w),
sex = rep(c("Men", "Women"), 3))
ggplot(prob.dat, aes(x = cat, y = p, group = sex)) +
geom_bar(aes(fill = sex), stat = "identity", position = "dodge") +
theme_bw()
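Alternatively, “predict()” can return these probabilities directly. Note that it uses the thresholds of “ord.1” itself, so the values will differ somewhat from the manual calculations above, which reused the “ord.0” thresholds stored in “thr”:
# Predicted probabilities for a man (first row) and a woman (second row),
# with all other predictors set to 0
newd <- data.frame(age = 0, union = 0, male = c(1, 0),
                   edu_rec = 0, employed = 0)
predict(ord.1, newdata = newd, type = "probs")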
Let’s focus on the original question, which concerned online and offline mobilization:
ord.2 <- polr(eng_tri ~ age + union + male + edu_rec + employed +
off_mob + onl_mob,
method = "logistic", Hess = T,
data = BE.data)
summary(ord.2)
## Call:
## polr(formula = eng_tri ~ age + union + male + edu_rec + employed +
## off_mob + onl_mob, data = BE.data, Hess = T, method = "logistic")
##
## Coefficients:
## Value Std. Error t value
## age 0.01049 0.004242 2.4730
## union -0.37587 0.151803 -2.4760
## male 0.37410 0.145315 2.5744
## edu_rec 0.09523 0.025902 3.6763
## employed -0.06165 0.151116 -0.4079
## off_mob 0.74097 0.187717 3.9473
## onl_mob 0.61689 0.214837 2.8714
##
## Intercepts:
## Value Std. Error t value
## 0|1 -0.7680 0.1381 -5.5620
## 1|2 1.8892 0.1556 12.1422
##
## Residual Deviance: 1423.613
## AIC: 1441.613
## (242 observations deleted due to missingness)
Surprisingly, there is an effect for both variables, even when they enter the model together. Let’s convert the coefficients into odds ratios and inspect their confidence intervals:
mat.res <- cbind(exp(ord.2$coefficients), exp(confint.default(ord.2)))
colnames(mat.res) <- c("OR", "Lower CI", "Upper CI")
rownames(mat.res) <- c("Age", "Union membership", "Male", "Education", "Employed", "Offline mobilization", "Online mobilization")
mat.res
## OR Lower CI Upper CI
## Age 1.0105460 1.0021788 1.0189832
## Union membership 0.6866921 0.5099734 0.9246484
## Male 1.4536883 1.0933989 1.9326979
## Education 1.0999065 1.0454606 1.1571880
## Employed 0.9402160 0.6991934 1.2643228
## Offline mobilization 2.0979702 1.4521606 3.0309862
## Online mobilization 1.8531480 1.2163000 2.8234461
Question: How do you interpret the coefficients?
These coefficients are called proportional odds ratios, and one interprets them similarly to the effects in a binary logistic regression. Since “onl_mob” is a dummy, we would say that going from 0 to 1 (i.e. from not being mobilized online to being mobilized), the odds of moving from “low engagement” to “middle” or “high” (or, from “low” and “middle” to “high engagement”) increase by a factor of 1.85.
What “proportional odds” refers to in “polr()” is a very important assumption of ordinal logistic (and ordinal probit) regression: the coefficient that describes the relationship between the lowest category (“low engagement”) and the two higher ones is the same as the coefficient that describes the relationship between the two lower categories and the highest one. This is what allows the model to estimate only one set of coefficients.
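A quick, informal way to probe this assumption is to fit one binary logit per cumulative split of the outcome and compare the coefficients across the two equations: if the assumption holds, each predictor should show roughly similar values in both. This is a minimal sketch of that check (a formal version is known as the Brant test):
# Informal check of the proportional odds assumption: one binary logit
# per cumulative split of eng_tri, then compare the coefficients
BE.data$eng.num <- as.numeric(as.character(BE.data$eng_tri))
split.1 <- glm(I(eng.num >= 1) ~ age + union + male + edu_rec + employed +
                 off_mob + onl_mob, family = binomial, data = BE.data)
split.2 <- glm(I(eng.num >= 2) ~ age + union + male + edu_rec + employed +
                 off_mob + onl_mob, family = binomial, data = BE.data)
round(cbind(split.1 = coef(split.1), split.2 = coef(split.2)), 3)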
Let’s plot the predicted probabilities of engagement as a function of “age”, using a different technique from the one we saw yesterday and earlier today.
First, let’s create a new data matrix where we set our predictors at interesting values (age in a sequence between minimum and maximum, and all the other predictors fixed at zero, which makes sense since most of them are dummies and education is centered around the middle):
sim.data <- cbind(age = seq(min(BE.data$age, na.rm=TRUE),
max(BE.data$age, na.rm=TRUE), length.out = 100),
union = rep(0, each = 100),
male = rep(0, each = 100),
edu_rec = rep(0, each = 100),
employed = rep(0, each = 100),
off_mob = rep(0, each = 100),
onl_mob = rep(0, each = 100))
head(sim.data)
## age union male edu_rec employed off_mob onl_mob
## [1,] -26.00000 0 0 0 0 0 0
## [2,] -25.27273 0 0 0 0 0 0
## [3,] -24.54545 0 0 0 0 0 0
## [4,] -23.81818 0 0 0 0 0 0
## [5,] -23.09091 0 0 0 0 0 0
## [6,] -22.36364 0 0 0 0 0 0
Second, we use the “predict()” function to obtain predicted probabilities given the model coefficients and the values of X that we have specified in “sim.data”. This function does something similar to what we did with the loop, the only difference being that the predicted probabilities are not “simulated” here but obtained analytically. Note, however, that calling “predict()” on a “polr()” model does not seem to offer any way to extract the standard errors; hence we won’t be able to add confidence intervals to our plot (a simulation-based workaround is sketched below).
predictions <- data.frame(predict(ord.2, sim.data, type = "prob"))
names(predictions) <- c("0", "1", "2")
predictions$age <- seq(18, 90, length.out = 100) # Remember the original range for age
head(predictions)
## 0 1 2 age
## 1 0.3786621 0.5181164 0.1032215 18.00000
## 2 0.3768687 0.5192014 0.1039299 18.72727
## 3 0.3750787 0.5202787 0.1046426 19.45455
## 4 0.3732920 0.5213484 0.1053596 20.18182
## 5 0.3715088 0.5224102 0.1060809 20.90909
## 6 0.3697291 0.5234643 0.1068066 21.63636
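That said, if we really wanted uncertainty estimates, the simulation approach we used for the multinomial model carries over. Here is a rough sketch for a single respondent profile; it assumes that “vcov(ord.2)” returns the parameters in the order shown in the summary (the seven slopes first, then the two thresholds):
# Simulate parameter draws and compute P(y = 0) for one profile
sim.par <- mvrnorm(1000, c(coef(ord.2), ord.2$zeta), vcov(ord.2))
b <- sim.par[, 1:7] # slope draws
z <- sim.par[, 8:9] # threshold draws
x <- c(mean(BE.data$age, na.rm = TRUE), 0, 0, 0, 0, 0, 0) # average age, all else 0
xb <- as.vector(b %*% x)
p.0 <- plogis(z[, 1] - xb) # P(y = 0), as in the manual calculations above
quantile(p.0, c(0.025, 0.5, 0.975)) # median and 95% interval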
To make a nice plot, we need to restructure the data a little. Specifically, we need the predictions for the three categories to be “stacked” on top of each other in a single variable (with an additional variable identifying which rows belong to which category). We can do this using the function “melt()” from the “reshape2” package (which we loaded at the beginning):
pred.melt <- melt(predictions, id.vars = "age")
head(pred.melt)
## age variable value
## 1 18.00000 0 0.3786621
## 2 18.72727 0 0.3768687
## 3 19.45455 0 0.3750787
## 4 20.18182 0 0.3732920
## 5 20.90909 0 0.3715088
## 6 21.63636 0 0.3697291
Finally, we can plot our data. We can make an “area plot”, since we do not have confidence intervals to show anyway:
ggplot(pred.melt, aes(x = age, y = value, group = variable)) +
geom_area(aes(fill = variable)) +
scale_fill_manual(values = c("gray80", "gray50", "gray20"), "Category") +
ylab("Probability") + xlab("age") +
theme_bw()
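Since the three probabilities sum to one at every value of age, the stacked areas always fill the whole panel. If you prefer to follow each category separately, a simple line plot of the same data works as well:
ggplot(pred.melt, aes(x = age, y = value, linetype = variable)) +
  geom_line() +
  scale_linetype_discrete("Category") +
  ylab("Probability") + xlab("age") +
  theme_bw()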