Exercise 1: Multinomial logit

For a long time, theories of voting behavior in Europe focused on people’s political loyalties, regarded as a political expression of their socio-structural characteristics. This was due to the strong legacy of “cleavage” theory. In recent decades, however, models relating shorter-term factors (such as issues or party characteristics) to the vote have replaced the old idea of party preferences as long-term loyalties. In this view, party preferences do not come as a direct consequence of people’s membership in social groups (like class or religion), but depend to a greater extent on what parties do - e.g. what positions they take on issues or ideology, how competent they are in dealing with salient problems, and so on.

Germany is quite a textbook example of a political system where theories of voting ‘work’, so let’s have a look at how theories of voting based on social group membership perform there. To do so, we take German data from the European Elections of 2009. Back then, 5 parties competed for most of the votes:

install.packages("mlogit")    # This is my favorite package to do multinomial logit in R
options(scipen=9)    # Penalize scientific notation in printed output
library(foreign)
library(ggplot2)
library(MASS)
library(xtable)
library(mlogit)
library(reshape2)    # We will need this package at the very end
setwd("~/Dropbox/Teaching/GLM_ECPR_2017/Lab/Day4")
data <- read.dta("EES_2009_DE.dta")
names(data)
##  [1] "vote_int"  "lrsp"      "eusp"      "polint"    "polinfo"  
##  [6] "educ"      "female"    "age"       "class"     "workclass"
## [11] "midclass"  "upclass"   "relig"
head(data)
##   vote_int lrsp eusp polint polinfo educ female age class workclass
## 1       NA    4    2      2       5    1      1  70     3         0
## 2       NA   10    3      2       2   NA      0  20     1         1
## 3        1    5    1      1       1    1      1  48     3         0
## 4        4    3   10      1       4    2      0  68     2         0
## 5        1    5    5      2       4    2      1  43     3         0
## 6       NA    5    5      0       1    1      1  54     1         1
##   midclass upclass relig
## 1        1       0     8
## 2        0       0     3
## 3        1       0     1
## 4        1       0    10
## 5        1       0     7
## 6        0       0     4

The variables that we have are (labels inferred from the names and from how they are used below):

  - “vote_int”: vote intention (party codes below)
  - “lrsp”: left-right self-placement
  - “eusp”: self-placement on European integration
  - “polint”: political interest
  - “polinfo”: political information
  - “educ”: education
  - “female”: gender dummy (1 = female)
  - “age”: age in years
  - “class”: social class, with “workclass”, “midclass” and “upclass” as the corresponding dummies
  - “relig”: religiosity

The party codes are:

  1. CDU/CSU
  2. SPD
  3. Greens
  4. Linke
  5. FDP
table(data$vote_int, useNA = "always")
## 
##    1    2    3    4    5 <NA> 
##  288  161   84   51  104  316

There are quite a few missing observations. We can drop them.

data <- data[!is.na(data$vote_int),]

This kills more than 300 cases, but we wouldn’t use them anyway.

For a multinomial model of vote choice we pick social class (two of the three class dummies, with middle class as the reference category), religiosity, and a bunch of controls: gender, age, education and political information.

The function “mlogit()” (from the package of the same name) will do the work. The syntax is a bit different from that of R’s “glm()” function, but in an intuitive way. The intercept (the “1”) is specified explicitly (while in “glm” it is included by default), and we need to add two more arguments:

  1. “shape”, which tells the function how the data are organized (here “wide”: one row per respondent);
  2. “reflevel”, which sets the baseline category against which all the other alternatives are compared.

For the baseline we pick “CDU/CSU”. After all, it’s the most important party in Germany, so it makes sense to compare all the other parties to it.

mnl.1 <- mlogit(vote_int ~ 1 | age + female + educ + polinfo +
                  workclass + upclass +
                  relig,
                data = data, shape = "wide", reflevel = "1")
summary(mnl.1)
## 
## Call:
## mlogit(formula = vote_int ~ 1 | age + female + educ + polinfo + 
##     workclass + upclass + relig, data = data, reflevel = "1", 
##     shape = "wide", method = "nr", print.level = 0)
## 
## Frequencies of alternatives:
##        1        2        3        4        5 
## 0.424437 0.228296 0.118971 0.072347 0.155949 
## 
## nr method
## 16 iterations, 0h:0m:1s 
## g'(-H)^-1g = 3.78E-07 
## gradient close to zero 
## 
## Coefficients :
##                   Estimate   Std. Error t-value     Pr(>|t|)    
## 2:(intercept)   -0.5494873    0.6039645 -0.9098    0.3629277    
## 3:(intercept)   -2.8011235    0.8436574 -3.3202    0.0008995 ***
## 4:(intercept)   -3.4008645    1.0762256 -3.1600    0.0015777 ** 
## 5:(intercept)   -0.0175918    0.6545086 -0.0269    0.9785571    
## 2:age            0.0085560    0.0067985  1.2585    0.2082086    
## 3:age           -0.0142864    0.0088833 -1.6082    0.1077849    
## 4:age            0.0255630    0.0112477  2.2727    0.0230422 *  
## 5:age           -0.0119719    0.0074704 -1.6026    0.1090282    
## 2:female         0.1361002    0.2290180  0.5943    0.5523267    
## 3:female         0.8890121    0.2963157  3.0002    0.0026978 ** 
## 4:female        -0.3204562    0.3807699 -0.8416    0.4000115    
## 5:female        -0.2210516    0.2622717 -0.8428    0.3993212    
## 2:educ           0.0548316    0.2249891  0.2437    0.8074572    
## 3:educ           0.8279904    0.3002898  2.7573    0.0058280 ** 
## 4:educ           0.9082849    0.3942995  2.3035    0.0212485 *  
## 5:educ           0.0184443    0.2496423  0.0739    0.9411036    
## 2:polinfo       -0.0779493    0.0695449 -1.1208    0.2623520    
## 3:polinfo        0.2084637    0.0942486  2.2118    0.0269771 *  
## 4:polinfo       -0.0159211    0.1166877 -0.1364    0.8914719    
## 5:polinfo        0.0086176    0.0785360  0.1097    0.9126248    
## 2:workclass      0.4276752    0.2938251  1.4555    0.1455189    
## 3:workclass     -0.7570396    0.6387980 -1.1851    0.2359778    
## 4:workclass      0.8381831    0.4620890  1.8139    0.0696932 .  
## 5:workclass     -0.3083670    0.3973650 -0.7760    0.4377315    
## 2:upclass        0.9718464    1.0281899  0.9452    0.3445562    
## 3:upclass      -14.8264715 2680.7259079 -0.0055    0.9955871    
## 4:upclass        1.7267618    1.3263762  1.3019    0.1929628    
## 5:upclass        0.9324537    1.0253184  0.9094    0.3631240    
## 2:relig         -0.0950517    0.0386427 -2.4598    0.0139031 *  
## 3:relig         -0.0765886    0.0492395 -1.5554    0.1198433    
## 4:relig         -0.3139166    0.0637395 -4.9250 0.0000008436 ***
## 5:relig         -0.0576338    0.0444641 -1.2962    0.1949107    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Log-Likelihood: -841.84
## McFadden R^2:  0.056196 
## Likelihood ratio test : chisq = 100.25 (p.value = 0.00000000046142)
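
As a quick cross-check (a sketch, not run here): the same model can be fit with “multinom()” from the “nnet” package, which ships with R. With “factor(vote_int)” the baseline is the first factor level (“1”, i.e. CDU/CSU), so the coefficients should match the ones above up to optimizer noise.

library(nnet)    # part of the standard R distribution
mnl.check <- multinom(factor(vote_int) ~ age + female + educ + polinfo +
                        workclass + upclass + relig, data = data)
summary(mnl.check)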

The output of “mlogit()” is not very handy (at least not as well organized as the output of multinomial logit in Stata), but we can work with it a bit. Let’s extract the betas and standard errors for the different categories and tidy them up.

betas <- coef(mnl.1)
varcov <- vcov(mnl.1)
se <- sqrt(diag(varcov))
# Coefficients are stored alternative-by-alternative within each variable,
# so every 4th element belongs to the same alternative (2 = SPD, ..., 5 = FDP)
cf.2 <- betas[c(1, 5, 9,  13, 17, 21, 25, 29)]
cf.3 <- betas[c(2, 6, 10, 14, 18, 22, 26, 30)]
cf.4 <- betas[c(3, 7, 11, 15, 19, 23, 27, 31)]
cf.5 <- betas[c(4, 8, 12, 16, 20, 24, 28, 32)]
se.2 <- se[c(1, 5, 9,  13, 17, 21, 25, 29)]
se.3 <- se[c(2, 6, 10, 14, 18, 22, 26, 30)]
se.4 <- se[c(3, 7, 11, 15, 19, 23, 27, 31)]
se.5 <- se[c(4, 8, 12, 16, 20, 24, 28, 32)]
tab <- xtable(data.frame(SPD.beta = cf.2, SPD.se = se.2, Greens.beta = cf.3, Greens.se = se.3, 
           Linke.beta = cf.4, Linke.se = se.4, FDP.beta = cf.5, FDP.se = se.5), digits = 3)
print.xtable(tab, type = "html")
            SPD.beta SPD.se Greens.beta Greens.se Linke.beta Linke.se FDP.beta  FDP.se
(intercept)   -0.549  0.604      -2.801     0.844     -3.401    1.076   -0.018   0.655
age            0.009  0.007      -0.014     0.009      0.026    0.011   -0.012   0.007
female         0.136  0.229       0.889     0.296     -0.320    0.381   -0.221   0.262
educ           0.055  0.225       0.828     0.300      0.908    0.394    0.018   0.250
polinfo       -0.078  0.070       0.208     0.094     -0.016    0.117    0.009   0.079
workclass      0.428  0.294      -0.757     0.639      0.838    0.462   -0.308   0.397
upclass        0.972  1.028     -14.826  2680.726      1.727    1.326    0.932   1.025
relig         -0.095  0.039      -0.077     0.049     -0.314    0.064   -0.058   0.044
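
Incidentally, since “mlogit” stores the coefficients alternative-by-alternative within each variable, the same split can be done in one shot (a minimal sketch; the row and column labels are filled in by hand):

cf.mat <- matrix(betas, nrow = 4,
                 dimnames = list(c("SPD", "Greens", "Linke", "FDP"),
                                 c("(intercept)", "age", "female", "educ",
                                   "polinfo", "workclass", "upclass", "relig")))
se.mat <- matrix(se, nrow = 4, dimnames = dimnames(cf.mat))
round(cf.mat, 3)    # same numbers as in the table above, one party per row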

What do you read into this?

Let’s simulate the predicted probabilities to vote for each party as a function of religiosity. First let’s have a look at the coefficients and standard errors for the four parties:

tab.relig <- data.frame(beta = betas[c(29:32)], se = se[c(29:32)])
row.names(tab.relig) <- c("SPD", "Greens", "Linke", "FDP")
print.xtable(xtable(tab.relig, digits = 3), type = "html")
         beta    se
SPD    -0.095 0.039
Greens -0.077 0.049
Linke  -0.314 0.064
FDP    -0.058 0.044

At least the SPD, and more clearly the Linke, have slopes for “relig” that are significantly different from the CDU baseline. However, these coefficients do not tell us much about the effect of religiosity on the probability of voting for the CDU itself. By simulating predicted probabilities, we will be able to see that as well.
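
For a quick look at the Wald statistics behind that claim (a one-line sketch using the little table we just built):

# z-scores for the "relig" slopes, relative to the CDU baseline
round(tab.relig$beta / tab.relig$se, 2)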

Let’s set up the simulation: we draw 1,000 coefficient vectors from the multivariate normal distribution implied by the point estimates and their variance-covariance matrix.

set.seed(461982)
n <- 1000
sim.pred <- mvrnorm(n, betas, varcov)

Extract the variable names.

c.n.2 <- names(betas)[c(1, 5, 9,  13, 17, 21, 25, 29)]
c.n.3 <- names(betas)[c(2, 6, 10, 14, 18, 22, 26, 30)]
c.n.4 <- names(betas)[c(3, 7, 11, 15, 19, 23, 27, 31)]
c.n.5 <- names(betas)[c(4, 8, 12, 16, 20, 24, 28, 32)]

Set the independent variables to some meaningful values, storing them in an object. The order must match the model: age, female, educ, polinfo, workclass, upclass (religiosity will be appended in the loop).

means <- c(mean(data$age, na.rm = T), 1, mean(data$educ, na.rm = T),
           mean(data$polinfo, na.rm = T), 0, 0)    # average age/educ/polinfo, female, middle class
x.range <- seq(min(data$relig, na.rm = T), max(data$relig, na.rm = T), length.out = 10)

Create empty matrices for the predictions.

res.1 <- matrix(NA, nrow = length(x.range), ncol = 4)
res.2 <- matrix(NA, nrow = length(x.range), ncol = 4)
res.3 <- matrix(NA, nrow = length(x.range), ncol = 4)
res.4 <- matrix(NA, nrow = length(x.range), ncol = 4)
res.5 <- matrix(NA, nrow = length(x.range), ncol = 4)

Run the loop.

for(i in 1:length(x.range)){
  a <- x.range[i]
  # Generate vectors of predictions on the linear predictor
  b.2 <- sim.pred[, c.n.2]%*%c(1, means,a)
  b.3 <- sim.pred[, c.n.3]%*%c(1, means,a)
  b.4 <- sim.pred[, c.n.4]%*%c(1, means,a)
  b.5 <- sim.pred[, c.n.5]%*%c(1, means,a)
  # Generate vectors of probabilities (notice that category 1 is the baseline)
  p.1 <- 1/(1 + exp(b.2) + exp(b.3) + exp(b.4) + exp(b.5))
  p.2 <- exp(b.2)/(1 + exp(b.2) + exp(b.3) + exp(b.4) + exp(b.5))
  p.3 <- exp(b.3)/(1 + exp(b.2) + exp(b.3) + exp(b.4) + exp(b.5))
  p.4 <- exp(b.4)/(1 + exp(b.2) + exp(b.3) + exp(b.4) + exp(b.5))
  p.5 <- exp(b.5)/(1 + exp(b.2) + exp(b.3) + exp(b.4) + exp(b.5))
  # Put the x value, mean and 95% CI into the 5 results matrices
  res.1[i, ] <- c(a,
                  mean(p.1), 
                  quantile(p.1, p = 0.025),
                  quantile(p.1, p = 0.975))
  res.2[i, ] <- c(a,
                  mean(p.2), 
                  quantile(p.2, p = 0.025),
                  quantile(p.2, p = 0.975))
  res.3[i, ] <- c(a,
                  mean(p.3), 
                  quantile(p.3, p = 0.025),
                  quantile(p.3, p = 0.975))
  res.4[i, ] <- c(a,
                  mean(p.4), 
                  quantile(p.4, p = 0.025),
                  quantile(p.4, p = 0.975))
  res.5[i, ] <- c(a,
                  mean(p.5), 
                  quantile(p.5, p = 0.025),
                  quantile(p.5, p = 0.975))
}

data.sim <- data.frame(rbind(res.1, res.2, res.3, res.4, res.5))
names(data.sim) <- c("X", "Y", "LO", "HI")
data.sim$GR <- rep(c("1. CDU", "2. SPD", "3. Greens", "4. Linke", "5. FDP"), each = length(x.range))

Let’s now see what we just did.

ggplot(data.sim, aes(x = X, y = Y, group = GR)) +
  geom_line() +
  geom_ribbon(aes(ymin = LO, ymax = HI), fill = "gray70", alpha = 0.5) +
  facet_wrap(~ GR, ncol = 3) +
  scale_y_continuous(limits = c(0,1)) +
  scale_x_continuous(breaks = seq(0, 10, by = 1)) +
  xlab("Religiosity") +
  ylab("Probability to vote for party") +
  theme_bw()

Questions:

  1. Imagine that you had to discuss this figure in a paper. How would you describe it?
  2. Run a second model which includes all the predictors that are in “mnl.1” plus “eusp” and “lrsp”.
    1. Do the two new variables have an effect on vote choice in Germany? If yes, how does their effect change across different parties?
    2. Calculate predicted probabilities to vote for all 5 parties when “lrsp” changes from minimum to maximum, holding all the other predictors constant (TIP: keeping “lrsp” as the last variable in the model will make your life easier).
    3. Describe the model results and the predicted probabilities as you would in a research paper.

Exercise 2: Ordinal logit

Suppose we want to study people’s engagement with the European Elections campaign in Belgium. Specifically, our research question is: does online mobilization have an effect on psychological engagement with the political campaign, even after controlling for a variety of socio-demographic predictors and, most importantly, for offline mobilization?

BE.data <- read.spss(file="~/Dropbox/Teaching/GLM_ECPR_2017/Lab/Day4/EES_2009_BE.sav", 
                     to.data.frame=TRUE, use.value.labels=FALSE)
head(BE.data)
##   age male employed union eng_tri edu_rec off_mob onl_mob
## 1  -1    1        1     0       1       1      NA      NA
## 2  23    0        0     1       2       2       0       0
## 3  31    1        0     0       1      -5       0       0
## 4  30    0        0     0       1       1       0       0
## 5   9    0        1     1       2       6       1       1
## 6  NA    0       NA     0       1      NA      NA      NA

The variables in the dataset are (again, labels inferred from the names and from how they are used below):

  - “age”: age, apparently centered (its raw values map onto an original range of 18-90)
  - “male”: gender dummy (1 = male)
  - “employed”: employment dummy
  - “union”: union membership dummy
  - “eng_tri”: engagement with the campaign, in 3 ordered categories (0 = low, 1 = middle, 2 = high)
  - “edu_rec”: education, recoded and centered around the middle
  - “off_mob”: offline mobilization dummy
  - “onl_mob”: online mobilization dummy

Let’s see how “eng_tri” is distributed:

BE.data <- BE.data[!is.na(BE.data$eng_tri), ]    # Kill the missing cases
BE.data$eng_tri <- as.factor(BE.data$eng_tri)    # Convert it into a factor (required by R)
round(table(BE.data$eng_tri)/length(BE.data$eng_tri), digits = 2)
## 
##    0    1    2 
## 0.27 0.55 0.18
ggplot(BE.data, aes(x = eng_tri)) + geom_bar(col = "black", fill = "brown") + theme_bw()

OLS can’t really work here (the outcome has just three ordered categories), and clearly neither can a binary logit. How does engagement covary with the two variables indicating the two types of mobilization?

round(prop.table(table(BE.data$onl_mob, BE.data$eng_tri), 1), digits = 2)
##    
##        0    1    2
##   0 0.29 0.54 0.17
##   1 0.20 0.51 0.30
round(prop.table(table(BE.data$off_mob, BE.data$eng_tri), 1), digits = 2)
##    
##        0    1    2
##   0 0.30 0.54 0.16
##   1 0.15 0.54 0.31

So there is at least some preliminary, bivariate evidence that both online and offline mobilization have an effect on political engagement.

Ordinal logistic regression can be run with the “polr()” command from the “MASS” package. Note that there is also a package called “ordinal” that deals specifically with ordinal models. We will use “polr()” here, but if you need to specify more complex ordinal models (e.g. multilevel models), the package “ordinal” is the way to go. Before doing anything substantive, let’s get one thing straight: how do we interpret the estimated thresholds? To get an idea, let’s run an ordinal logit model without predictors:

ord.0 <- polr(eng_tri ~ 1,
              method = "logistic", Hess = T,
              data = BE.data)
summary(ord.0)
## Call:
## polr(formula = eng_tri ~ 1, data = BE.data, Hess = T, method = "logistic")
## 
## No coefficients
## 
## Intercepts:
##     Value    Std. Error t value 
## 0|1  -0.9983   0.0717   -13.9195
## 1|2   1.5361   0.0833    18.4328
## 
## Residual Deviance: 1950.707 
## AIC: 1954.707
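
(For reference, the equivalent null model in the “ordinal” package would look like the sketch below; “clm()” is that package’s basic fitting function. This assumes the package is installed; we don’t run it here.)

library(ordinal)
ord.0.clm <- clm(eng_tri ~ 1, data = BE.data, link = "logit")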

In “polr” objects, the values of the threshold are stored in a vector called “zeta”. Let’s extract them and exponentiate them:

thr <- ord.0$zeta
exp(thr)
##       0|1       1|2 
## 0.3684939 4.6462619

To understand what the values of the thresholds are, let’s calculate the cumulative odds of ending up in category 0 (versus 1 and 2) and in categories 0 and 1 (versus category 2):

table(BE.data$eng_tri)
## 
##   0   1   2 
## 266 547 175
# Odds of y = 0/(1 + 2)
(266)/(547+175)
## [1] 0.3684211
# Odds of y = (0 + 1)/2
(266+547)/(175)
## [1] 4.645714

This should clarify that, in ordinal logit models, the threshold values are the cumulative log odds of observing y at or below the threshold rather than above it (when all predictors are 0; here this doesn’t matter because we have no predictors). We can also convert them into probabilities:

# What is the probability (when all predictors are 0) to end up in category 0?
exp(thr[1])/(1 + exp(thr[1]))
##       0|1 
## 0.2692697
# What is the probability to end up in category 0 OR 1?
exp(thr[2])/(1 + exp(thr[2]))
##       1|2 
## 0.8228917
# What is the probability to end up exactly in category 1?
exp(thr[2])/(1 + exp(thr[2])) - exp(thr[1])/(1 + exp(thr[1]))
##      1|2 
## 0.553622
# What is the probability to end up in category 2?
1 - exp(thr[2])/(1 + exp(thr[2]))
##       1|2 
## 0.1771083
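
As a sanity check (a one-liner sketch): “predict()” on the null model returns exactly these probabilities, and since there are no predictors, every respondent gets the same row.

# Fitted probabilities from polr itself; look at the first row
head(predict(ord.0, type = "probs"), 1)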

Let’s get down to business now, and run an ordinal logit model with just socio-demographic indicators:

ord.1 <- polr(eng_tri ~ age + union + male + edu_rec + employed, 
              method = "logistic", Hess = T,
              data = BE.data)
summary(ord.1)
## Call:
## polr(formula = eng_tri ~ age + union + male + edu_rec + employed, 
##     data = BE.data, Hess = T, method = "logistic")
## 
## Coefficients:
##             Value Std. Error t value
## age       0.01322   0.003879  3.4084
## union    -0.25230   0.141188 -1.7870
## male      0.35487   0.134962  2.6294
## edu_rec   0.10525   0.023335  4.5104
## employed  0.08211   0.140408  0.5848
## 
## Intercepts:
##     Value   Std. Error t value
## 0|1 -0.8859  0.1248    -7.0985
## 1|2  1.7951  0.1384    12.9695
## 
## Residual Deviance: 1657.993 
## AIC: 1671.993 
## (122 observations deleted due to missingness)

Let’s see the probability of ending up in the 3 categories for men (vs. women). Note that, for illustration, we keep using the thresholds stored in “thr”, which come from the null model “ord.0”; for exact predictions from “ord.1” you would use its own thresholds, “ord.1$zeta”.

c.m <- as.numeric(coef(ord.1)["male"])
# Probability to end up in category 0 when you are a man (and all other predictors are 0)
p.0.m <- exp(thr[1] - (1*c.m))/(1 + exp(thr[1] - (1*c.m)))
p.0.m
##       0|1 
## 0.2053477
# Let's compare it with women
p.0.w <- exp(thr[1] - (0*c.m))/(1 + exp(thr[1] - (0*c.m)))
p.0.w
##       0|1 
## 0.2692697
# Probability to end up exactly in category 1 when you are a man
p.1.m <- exp(thr[2] - (1*c.m))/(1 + exp(thr[2] - (1*c.m))) - exp(thr[1] - (1*c.m))/(1 + exp(thr[1] - (1*c.m)))
p.1.m
##       1|2 
## 0.5598148
# Let's compare it with women
p.1.w <- exp(thr[2] - (0*c.m))/(1 + exp(thr[2] - (0*c.m))) - exp(thr[1] - (0*c.m))/(1 + exp(thr[1] - (0*c.m)))
p.1.w
##      1|2 
## 0.553622
# Probability to end up exactly in category 2 when you are a man
p.2.m <- 1 - exp(thr[2] - (1*c.m))/(1 + exp(thr[2] - (1*c.m)))
p.2.m
##       1|2 
## 0.2348375
# Let's compare it with women
p.2.w <- 1 - exp(thr[2] - (0*c.m))/(1 + exp(thr[2] - (0*c.m)))
p.2.w
##       1|2 
## 0.1771083
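
A quick check that nothing went wrong: within each group the three probabilities must sum to one.

p.0.m + p.1.m + p.2.m    # men
p.0.w + p.1.w + p.2.w    # women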

Let’s plot these values:

prob.dat <- data.frame(cat = as.factor(rep(c(0, 1, 2), each = 2)), 
                       p = c(p.0.m, p.0.w, p.1.m, p.1.w, p.2.m, p.2.w),
                       sex = rep(c("Men", "Women"), 3))
ggplot(prob.dat, aes(x = cat, y = p, group = sex)) +
  geom_bar(aes(fill = sex), stat = "identity", position = "dodge") +
  theme_bw()

Let’s focus on the original question, which concerned online and offline mobilization:

ord.2 <- polr(eng_tri ~ age + union + male + edu_rec + employed + 
                off_mob + onl_mob, 
              method = "logistic", Hess = T,
              data = BE.data)
summary(ord.2)
## Call:
## polr(formula = eng_tri ~ age + union + male + edu_rec + employed + 
##     off_mob + onl_mob, data = BE.data, Hess = T, method = "logistic")
## 
## Coefficients:
##             Value Std. Error t value
## age       0.01049   0.004242  2.4730
## union    -0.37587   0.151803 -2.4760
## male      0.37410   0.145315  2.5744
## edu_rec   0.09523   0.025902  3.6763
## employed -0.06165   0.151116 -0.4079
## off_mob   0.74097   0.187717  3.9473
## onl_mob   0.61689   0.214837  2.8714
## 
## Intercepts:
##     Value   Std. Error t value
## 0|1 -0.7680  0.1381    -5.5620
## 1|2  1.8892  0.1556    12.1422
## 
## Residual Deviance: 1423.613 
## AIC: 1441.613 
## (242 observations deleted due to missingness)

Interestingly, there is an effect for both variables, even net of all the controls. Let’s express the coefficients as proportional odds ratios with their confidence intervals:

mat.res <- cbind(exp(ord.2$coefficients), exp(confint.default(ord.2)))
colnames(mat.res) <- c("OR", "Lower CI", "Upper CI")
rownames(mat.res) <- c("Age", "Union membership", "Male", "Education", "Employed", "Offline mobilization", "Online mobilization")
mat.res
##                             OR  Lower CI  Upper CI
## Age                  1.0105460 1.0021788 1.0189832
## Union membership     0.6866921 0.5099734 0.9246484
## Male                 1.4536883 1.0933989 1.9326979
## Education            1.0999065 1.0454606 1.1571880
## Employed             0.9402160 0.6991934 1.2643228
## Offline mobilization 2.0979702 1.4521606 3.0309862
## Online mobilization  1.8531480 1.2163000 2.8234461

Question: How do you interpret the coefficients?

These exponentiated coefficients are called proportional odds ratios, and one interprets them much like odds ratios in binary logistic regression. For online mobilization, we would say that going from 0 (not mobilized online) to 1 (mobilized online), the odds of moving from “low engagement” to “middle” or “high” (or, from “low” and “middle” to “high engagement”) increase by a factor of 1.85.
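
We can verify this interpretation with the quantities computed earlier for “male” from “ord.1” (a small sketch): under the model, the ratio of cumulative odds between women and men is the same at both thresholds and equals exp(beta).

odds.m.0 <- p.0.m / (1 - p.0.m)        # odds of y <= 0 for men
odds.w.0 <- p.0.w / (1 - p.0.w)        # odds of y <= 0 for women
odds.m.1 <- (p.0.m + p.1.m) / p.2.m    # odds of y <= 1 for men
odds.w.1 <- (p.0.w + p.1.w) / p.2.w    # odds of y <= 1 for women
odds.w.0 / odds.m.0                    # the two ratios should be identical...
odds.w.1 / odds.m.1
exp(c.m)                               # ...and equal to the odds ratio for "male"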

What “proportional odds” refers to in “polr()” is a very important assumption of ordinal logistic (and ordinal probit) regression: the coefficients that describe the relationship between the lowest group (“low engagement”) and the two higher groups are the same as the ones that describe the relationship between the two lower groups combined and the highest group - the “proportional odds” assumption. This is what allows the model to estimate only one set of coefficients.
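
For an informal check of this assumption (a sketch in the spirit of the Brant test; the dichotomized variables “y.ge1” and “y.ge2” are created here just for this purpose): fit one binary logit per cutpoint and compare the coefficients, which under proportional odds should be roughly equal.

# Dichotomize the outcome at each cutpoint
BE.data$y.ge1 <- as.numeric(as.character(BE.data$eng_tri)) >= 1
BE.data$y.ge2 <- as.numeric(as.character(BE.data$eng_tri)) >= 2
bin.1 <- glm(y.ge1 ~ age + union + male + edu_rec + employed + off_mob + onl_mob,
             family = binomial, data = BE.data)
bin.2 <- glm(y.ge2 ~ age + union + male + edu_rec + employed + off_mob + onl_mob,
             family = binomial, data = BE.data)
round(cbind(cutpoint.1 = coef(bin.1), cutpoint.2 = coef(bin.2)), 3)    # should look similar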

Let’s plot predicted probabilities of engagement as a function of “age”, using a different technique from the one we saw yesterday and earlier today.

First, let’s create a new data matrix where we set our predictors at interesting values (age in a sequence between minimum and maximum, and all the other predictors fixed at zero, which makes sense since most of them are dummies and education is centered around the middle):

sim.data <- cbind(age = seq(min(BE.data$age, na.rm=TRUE), 
                            max(BE.data$age, na.rm=TRUE), length.out = 100),
                  union = rep(0, each = 100), 
                  male = rep(0, each = 100), 
                  edu_rec = rep(0, each = 100), 
                  employed = rep(0, each = 100), 
                  off_mob = rep(0, each = 100), 
                  onl_mob = rep(0, each = 100))
head(sim.data)
##            age union male edu_rec employed off_mob onl_mob
## [1,] -26.00000     0    0       0        0       0       0
## [2,] -25.27273     0    0       0        0       0       0
## [3,] -24.54545     0    0       0        0       0       0
## [4,] -23.81818     0    0       0        0       0       0
## [5,] -23.09091     0    0       0        0       0       0
## [6,] -22.36364     0    0       0        0       0       0

Second, we use the “predict()” function to obtain predicted probabilities given the model coefficients and the values of X that we have specified in “sim.data”. This does something similar to what we did with the loop, the only difference being that the predicted probabilities are not “simulated” here but obtained analytically. Note, however, that calling “predict()” on a “polr” object does not provide standard errors - hence we won’t be able to add confidence intervals to our plot (see the sketch at the very end for a workaround).

predictions <- data.frame(predict(ord.2, sim.data, type = "prob"))
names(predictions) <- c("0", "1", "2")
predictions$age <- seq(18, 90, length.out = 100)  # Remember the original range for age
head(predictions)
##           0         1         2      age
## 1 0.3786621 0.5181164 0.1032215 18.00000
## 2 0.3768687 0.5192014 0.1039299 18.72727
## 3 0.3750787 0.5202787 0.1046426 19.45455
## 4 0.3732920 0.5213484 0.1053596 20.18182
## 5 0.3715088 0.5224102 0.1060809 20.90909
## 6 0.3697291 0.5234643 0.1068066 21.63636

To make a nice plot, we need to restructure the data a little. Specifically, we need the predictions for the three categories to be “stacked” on top of each other in a single variable (with an additional variable identifying which rows belong to which category). We can do this with the function “melt()” from the “reshape2” package (which we loaded at the beginning):

pred.melt <- melt(predictions, id.vars = "age")
head(pred.melt)
##        age variable     value
## 1 18.00000        0 0.3786621
## 2 18.72727        0 0.3768687
## 3 19.45455        0 0.3750787
## 4 20.18182        0 0.3732920
## 5 20.90909        0 0.3715088
## 6 21.63636        0 0.3697291

Finally, we can plot our data. We can make an “area plot”, since we do not have confidence intervals to show anyway:

ggplot(pred.melt, aes(x = age, y = value, group = variable)) +
  geom_area(aes(fill = variable)) +
  scale_fill_manual(values = c("gray80", "gray50", "gray20"), "Category") +
  ylab("Probability") + xlab("age") +
  theme_bw()
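
As a final aside: “predict()” gave us no uncertainty, but if we did want confidence bands we could recycle the simulation approach from Exercise 1. This is a minimal sketch; it assumes that “vcov()” on a “polr” object returns the joint variance-covariance matrix of the slopes and the thresholds, in that order (which is what the layout of “summary()” suggests).

# Draw slopes and thresholds jointly from their sampling distribution
draws <- mvrnorm(1000, c(coef(ord.2), ord.2$zeta), vcov(ord.2))
xb <- draws[, 1:7] %*% t(sim.data)    # 1000 draws x 100 age values
p.cat0 <- plogis(draws[, 8] - xb)     # P(y <= 0), i.e. P(y = 0)
p.le1  <- plogis(draws[, 9] - xb)     # P(y <= 1)
# 95% bands for the lowest and highest categories at each age value
ci.cat0 <- apply(p.cat0, 2, quantile, probs = c(0.025, 0.975))
ci.cat2 <- apply(1 - p.le1, 2, quantile, probs = c(0.025, 0.975))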