Federico Vegetti
ECPR Summer School in Methods and Techniques
Let’s start with an exercise from Adler’s book “R in a Nutshell” (O’Reilly, 2010, pp. 435-439). We have a dataset with several American Football statistics, including whether a player attempted a “field goal” (i.e. kicking the ball through the goalposts, somewhat like the goals we know from European football) with or without success. We presume that the chance of succeeding or failing depends on the distance from which the kick is made: goals attempted closer to the end zone are more likely to succeed.
options(scipen = 9, digits = 2)
setwd("~/Dropbox/Teaching/GLM_ECPR_2017/Lab/Day1")
library(foreign)
library(ggplot2) # This is a package to make nice plots
library(plyr) # This is a very useful package to perform transformations in your data
load("field.goals.rda")
head(field.goals)
## home.team week qtr away.team offense defense play.type player
## 1 ARI 14 3 WAS ARI WAS FG good 1-N.Rackers
## 2 ARI 2 4 STL ARI STL FG good 1-N.Rackers
## 3 ARI 7 3 TEN ARI TEN FG good 1-N.Rackers
## 4 ARI 12 2 JAC JAC ARI FG good 10-J.Scobee
## 5 ARI 2 3 STL ARI STL FG good 1-N.Rackers
## 6 ARI 7 4 TEN TEN ARI FG aborted 15-C.Hentrich
## yards stadium.type
## 1 20 Out
## 2 35 Out
## 3 24 Out
## 4 30 Out
## 5 48 Out
## 6 33 Out
We are interested in the variables “play.type” and “yards”. It’s always a good idea to inspect the data visually first:
ggplot(field.goals, aes(x = play.type)) + geom_bar(fill = "brown", col = "black") + theme_bw()
ggplot(field.goals, aes(x = yards)) + geom_histogram(fill = "brown", col = "black", bins = 30) + theme_bw()
We only care whether the field goal was successful, so we create a dichotomous variable that is 1 if the outcome of “play.type” is “FG good”, and 0 otherwise:
field.goals$good <- ifelse(field.goals$play.type == "FG good", 1, 0)
table(field.goals$good)
##
## 0 1
## 195 787
Question: What is the probability that a player scored a goal in our data?
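(A hint, in code: since “good” is a 0/1 variable, the share of 1s is simply its mean.)
mean(field.goals$good) # 787 / (787 + 195), roughly 0.80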
Our data now present a dichotomous outcome.
We want to see whether the chance to score increases as the players get closer to the goal. We assume that the probability of a “good” outcome varies as a linear function of the distance in yards from the target. To estimate the effect of “distance”, we run a linear regression with the dichotomous variable “good” as dependent variable, and the continuous variable “yards” as independent variable. This is a linear probability model:
linear <- lm(good ~ yards, data = field.goals)
summary(linear)
##
## Call:
## lm(formula = good ~ yards, data = field.goals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.025 -0.011 0.126 0.250 0.496
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.29891 0.04603 28.2 <2e-16 ***
## yards -0.01371 0.00122 -11.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.38 on 980 degrees of freedom
## Multiple R-squared: 0.113, Adjusted R-squared: 0.112
## F-statistic: 125 on 1 and 980 DF, p-value: <2e-16
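A well-known drawback of the linear probability model is already visible here: for short distances the fitted values exceed 1, which is not a valid probability. A quick check, using the coefficients reported above (your exact numbers may differ slightly):
predict(linear, newdata = data.frame(yards = c(18, 20, 40, 62)))
# roughly 1.05, 1.02, 0.75, 0.45: the first two are impossible as probabilities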
We can also plot the two variables to see what the linear relation looks like. The “geom_smooth” function of ggplot allows us to fit a line with confidence intervals over a scatter plot. The method “lm” stands for linear model: it fits a line equivalent to the predictions of a bivariate regression.
ggplot(field.goals, aes(x = yards, y = good)) +
geom_jitter(height = 0.1) +
geom_smooth(method = "lm") +
theme_bw()
Questions:
Alternatively, we can look at the share of successful attempts at different values of distance. An intuitive way to do it is to make a table:
score.tab <- table(field.goals$yards, field.goals$good)
score.tab
##
## 0 1
## 18 0 1
## 19 0 12
## 20 1 24
## 21 1 28
## 22 1 24
## 23 1 29
## 24 0 30
## 25 0 18
## 26 0 27
## 27 3 22
## 28 5 26
## 29 5 32
## 30 2 22
## 31 6 21
## 32 7 30
## 33 5 31
## 34 3 21
## 35 0 25
## 36 4 20
## 37 3 23
## 38 11 29
## 39 6 35
## 40 7 27
## 41 5 32
## 42 6 21
## 43 11 15
## 44 5 24
## 45 9 16
## 46 12 15
## 47 11 26
## 48 10 18
## 49 9 14
## 50 5 11
## 51 8 9
## 52 11 12
## 53 10 10
## 54 3 2
## 55 1 1
## 56 2 3
## 57 1 0
## 58 1 1
## 59 1 0
## 60 1 0
## 61 1 0
## 62 1 0
The rows in the table are the unique values of the variable “yards”: they are all the distances from which players have attempted a field goal in our data. The cells contain counts: the first column holds the count of failed attempts, and the second column the count of successful attempts. To estimate the probability to score for each value of distance, we can look at the share of 1s in every row. For instance, the probability to score from 46 yards in our data is 15/(12+15) = 0.56.
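As a quick check, the same row-wise shares can be read off the table object directly with base R’s “prop.table” function:
round(prop.table(score.tab, margin = 1)[, "1"], 2) # share of successful attempts at each distance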
We can calculate the share of successful attempts in each row using the “ddply” function from the “plyr” package:
shares <- ddply(field.goals, .(yards), summarise, prob = mean(good))
shares
## yards prob
## 1 18 1.00
## 2 19 1.00
## 3 20 0.96
## 4 21 0.97
## 5 22 0.96
## 6 23 0.97
## 7 24 1.00
## 8 25 1.00
## 9 26 1.00
## 10 27 0.88
## 11 28 0.84
## 12 29 0.86
## 13 30 0.92
## 14 31 0.78
## 15 32 0.81
## 16 33 0.86
## 17 34 0.88
## 18 35 1.00
## 19 36 0.83
## 20 37 0.88
## 21 38 0.72
## 22 39 0.85
## 23 40 0.79
## 24 41 0.86
## 25 42 0.78
## 26 43 0.58
## 27 44 0.83
## 28 45 0.64
## 29 46 0.56
## 30 47 0.70
## 31 48 0.64
## 32 49 0.61
## 33 50 0.69
## 34 51 0.53
## 35 52 0.52
## 36 53 0.50
## 37 54 0.40
## 38 55 0.50
## 39 56 0.60
## 40 57 0.00
## 41 58 0.50
## 42 59 0.00
## 43 60 0.00
## 44 61 0.00
## 45 62 0.00
Now we can plot the share of 1s against the distance in yards:
ggplot(shares, aes(x = yards, y = prob)) +
geom_point() +
theme_bw()
This looks much nicer. We can clearly see that as the distance from the target increases, the probability to score approaches zero. However, the plot also shows the inherently non-linear nature of the relationship between distance and the probability to score. How can we model these values linearly, so that we obtain meaningful parameters? We can transform the probabilities into log-odds. Recall the formula: \[ log\left(\frac{\pi}{1-\pi}\right) \] We can turn this formula into an R function, transform the values of “prob” into log-odds, and then model them linearly.
logit.fun <- function(x) log(x/(1 - x))
Let’s see how it works:
logit.fun(seq(0, 1, by = 0.1))
## [1] -Inf -2.20 -1.39 -0.85 -0.41 0.00 0.41 0.85 1.39 2.20 Inf
Note the “-Inf” and “Inf” values. This is because \(log(0)=-\infty\) and \(log\left(\frac{1}{0}\right)=log(\infty)=\infty\). In other words, the logit transformation cannot deal with “certainties”: it requires probabilities to lie strictly within the bounds, never to reach them.
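As an aside, base R already provides this transformation as “qlogis” (the quantile function of the logistic distribution), so writing our own function is mainly for transparency:
all.equal(logit.fun(seq(0.1, 0.9, by = 0.1)), qlogis(seq(0.1, 0.9, by = 0.1))) # TRUE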
Let’s apply the function to our data, and let’s plot it against distance:
shares$logodds <- logit.fun(shares$prob)
shares$logodds
## [1] Inf Inf 3.178 3.332 3.178 3.367 Inf Inf Inf 1.992
## [11] 1.649 1.856 2.398 1.253 1.455 1.825 1.946 Inf 1.609 2.037
## [21] 0.969 1.764 1.350 1.856 1.253 0.310 1.569 0.575 0.223 0.860
## [31] 0.588 0.442 0.788 0.118 0.087 0.000 -0.405 0.000 0.405 -Inf
## [41] 0.000 -Inf -Inf -Inf -Inf
ggplot(shares, aes(x = yards, y = logodds)) +
geom_point() +
theme_bw()
Note the presence of some out-of-bounds values. These are the cases where we have exact 0s and 1s. Funnily enough, ggplot places them on the margins of the plot, telling us that some other values are there but have no room in the panel. Either way, the relation between “yards” and “logodds” looks far more linear than the one between “yards” and “prob”.
Let’s drop the infinite values, and fit a linear regression with “logodds” over “yards”:
shares$logodds <- ifelse(shares$prob == 0 | shares$prob == 1, NA, shares$logodds)
ggplot(shares, aes(x = yards, y = logodds)) +
geom_point() +
geom_smooth(method = "lm") +
theme_bw()
linear.2 <- lm(logodds ~ yards, data = shares)
summary(linear.2)
##
## Call:
## lm(formula = logodds ~ yards, data = shares)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8015 -0.3448 0.0175 0.3293 0.6693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.74277 0.27492 17.2 < 2e-16 ***
## yards -0.08673 0.00666 -13.0 2.5e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.42 on 32 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.841, Adjusted R-squared: 0.836
## F-statistic: 169 on 1 and 32 DF, p-value: 2.49e-14
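To see that these parameters are indeed meaningful, we can map the fitted log-odds back to the probability scale. A small worked example at 40 yards, using the coefficients reported above (exact numbers may differ slightly due to rounding):
b <- coef(linear.2) # intercept and slope of the model above
eta.40 <- unname(b[1] + b[2] * 40) # fitted log-odds at a distance of 40 yards
exp(eta.40) / (1 + exp(eta.40)) # about 0.78, close to the observed share of 0.79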
What we just did can be regarded as a rudimentary form of GLM. However, we were dealing with aggregate data here (and we even had to get rid of the 0s and 1s, because the logit function cannot handle them), while in most cases we want to work with individual-level data directly. Tomorrow we will see how GLM works in those cases.
A short review:
Let’s now work through an example of a data generating process using the logistic function and a conceptualization similar to the latent variable approach. This exercise takes the inverse route to what we have seen so far: we start from the independent variable, relate it to the linear predictor, transform the linear predictor into a probability, and generate binary data (our dependent variable) based on that. This is a narration of the process we assume has generated the data whenever we model a binary outcome.
Let’s assume we live in a country whose political system consists of two parties. Obviously, they compete against one another at every election and they govern alone: at every election one party is the incumbent, and one party is in opposition. When political competition is so “one-to-one”, it is common for citizens who are not satisfied with the government’s performance to vote for the opposition party, while voters who are satisfied keep voting for the incumbent party. Let’s assume that for the citizens this is all that counts: if a party did well it will be rewarded, if it did badly it will be punished. Basically, we are simulating the world as it is in the mind of rational choice theorists.
As we discussed, the linear predictor is a meaningless quantity generated in GLM to allow us to model non-linear DVs in a linear way. However, for this example we imagine that our linear predictor has a meaning of its own, like a latent variable. So in this case our linear predictor is a latent trait – the propensity to vote for the government. When it takes a positive value, citizens are expected to vote for the incumbent party; when it takes a negative value, they are expected to vote for the opposition party.
Our predictor is the economic condition at the moment of the election compared to the previous year. If the individual finances of our citizens have improved, they credit the government for it, so they will have a higher propensity to vote for the incumbent party. When their finances have worsened, they blame the government, and their propensity to vote for the incumbent will be lower. We measure this variable as the change in net monthly income, in hundreds of Euro, compared to last year. People who are better off have a positive value, people who are worse off have a negative value. We assume that in our population this change can be as good as 1000€ more, or as bad as 1000€ less, per month, hence we generate a uniform distribution going from -10 to +10.
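The simulation below relies on random draws (first “runif”, later “rbinom”), so the exact numbers will change from run to run. If you want to reproduce them exactly, you can fix a seed before simulating; the value used here is arbitrary:
set.seed(1234) # optional: any fixed value makes the draws below reproducible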
n <- 1000 # Let's set our population to 1000 individuals
X <- round(runif(n, min = -10, max = 10), 0) # We round the simulated values to avoid decimals
Now let’s create our linear predictor as a function of X. We set two “true” parameters, i.e. the way X and the linear predictor are related in reality. We set the constant at -1 (beta0, the intercept) and the slope at 1 (beta1). The intercept at -1 implies that people are, on average, less likely to vote for the incumbent party, regardless of their finances. The slope at 1 means that, for every hundred-euro increase, people’s propensity to vote for the government is 1 point higher on the scale of the linear predictor.
beta0 <- -1
beta1 <- 1
lin.pred <- beta0 + beta1*X
data <- data.frame(lin.pred, X)
Question: we are linking our linear predictor to X in a deterministic way. In other words, “lin.pred” is a direct function of “X”, without any random noise that would make the process probabilistic. Why?
We define the probability to vote for the incumbent party as a function of our latent trait, “lin.pred”. This means that we need a mathematical formula that transforms our continuous trait into something that looks like a probability. This is the response function, and we have learned today that in binary models we can use the “inverse logit” function:
\[ \pi = \frac{exp(\eta)}{1 + exp(\eta)} \]
Where \(\eta\) is precisely our linear predictor. We only need to convert this into an R function:
inv.logit.fun <- function(x) exp(x)/(1 + exp(x))
This function turns a continuous variable into a variable that is bounded between 0 and 1 - like a probability. Moreover, it “centers” the distribution on 0.5: when the input is a negative value, the returned probability is smaller than 0.5 (and therefore it is more likely that the observed outcome is a 0), and when the input is a positive value the probability is bigger than 0.5 (so the observed outcome is more likely to be a 1).
values <- seq(-5, 5, by = 1)
cbind(values, inv.logit.fun(values))
## values
## [1,] -5 0.0067
## [2,] -4 0.0180
## [3,] -3 0.0474
## [4,] -2 0.1192
## [5,] -1 0.2689
## [6,] 0 0.5000
## [7,] 1 0.7311
## [8,] 2 0.8808
## [9,] 3 0.9526
## [10,] 4 0.9820
## [11,] 5 0.9933
Note that this is the same function that is normally applied to obtain predicted probabilities from the coefficients of a logit model (we will make extensive use of it on Day 3).
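(Base R also ships this response function ready-made as “plogis”, the cumulative distribution function of the logistic distribution; it returns exactly the same values as our hand-written function.)
all.equal(inv.logit.fun(values), plogis(values)) # TRUE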
How does the resulting set of probabilities relate to the linear predictor?
data$prob <- inv.logit.fun(data$lin.pred)
ggplot(data, aes(x = lin.pred, y = prob)) + geom_point() + theme_bw()
Now, based on our vector of probabilities, we generate our response. This part adds the randomness to our data generating process: while the propensity to vote for the government is a deterministic function of our predictor, the observed score (1 or 0) is produced by a process that contains a stochastic component. This is an assumption of GLM. We can simulate the response by taking single draws from a binomial distribution, with the probabilities being the values that we have just generated.
data$response <- rbinom(n = n, size = 1, prob=data$prob)
table(data$response)
##
## 0 1
## 566 434
This is everything that we can usually observe about the data generating process.
Question: why are there more 0s than 1s?
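(A numeric hint: the expected share of 1s equals the average of the underlying probabilities, and with a negative intercept that average falls below 0.5.)
mean(data$prob) # expected share of 1s, below 0.5 because beta0 = -1
mean(data$response) # observed share of 1s, which fluctuates around the value above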
How does the response relate to our predictor?
ggplot(data, aes(x = X, y = response)) + geom_jitter(height = 0.1) +
theme_bw() + xlab("Change in net monthly income compared to last year") + ylab("Vote for the incumbent party")
It looks like our independent variable has a huge effect on the probability to vote for the incumbent party. You might want to play with the values of “beta1” to see how this changes.
Let’s also see how the response relates to the probability:
ggplot(data, aes(x = prob, y = response)) + geom_jitter(height = 0.1, width = 0.1) +
theme_bw()
Here we can see the randomness at work. If the relationship were deterministic, all the observations where the probability is bigger than 0.5 would have value 1. However, in GLM the linear predictor \(X\beta\) is linked to the conditional mean of the dependent variable, not to the individual outcomes. You can see what this means by comparing the mean of Y (i.e. the share of 1s) for every level of X with the probabilities that we calculated by transforming the linear predictor.
ddply(data, .(X), summarize, mean.Y = mean(response), prob = mean(prob))
## X mean.Y prob
## 1 -10 0.000 0.000017
## 2 -9 0.000 0.000045
## 3 -8 0.000 0.000123
## 4 -7 0.000 0.000335
## 5 -6 0.000 0.000911
## 6 -5 0.000 0.002473
## 7 -4 0.000 0.006693
## 8 -3 0.021 0.017986
## 9 -2 0.056 0.047426
## 10 -1 0.094 0.119203
## 11 0 0.234 0.268941
## 12 1 0.457 0.500000
## 13 2 0.796 0.731059
## 14 3 0.867 0.880797
## 15 4 0.979 0.952574
## 16 5 0.980 0.982014
## 17 6 0.979 0.993307
## 18 7 1.000 0.997527
## 19 8 1.000 0.999089
## 20 9 1.000 0.999665
## 21 10 1.000 0.999877
Now let’s forget about the linear predictor for a second. In reality we never have information about the linear predictor; we can only observe the “response” variable, i.e. our set of 0s and 1s. Therefore, if we want to run a regression model with “Vote for the incumbent party” as dependent variable, we need to tell R how the dependent variable is distributed and how that distribution is linked to our linear predictor. R will then estimate the effect of our independent variable “Change in net monthly income” on the linear predictor. Only at the end, if we want, can we convert the results of the regression into quantities that are interpretable in terms of the phenomenon we are observing.
Meet the “glm” function. It is the main function to specify generalized linear models in R, and it works in a very similar way to the “lm” function used to fit linear regressions. However, it requires two more pieces of information: the family, i.e. the distribution of the dependent variable, and the link function.
In our simulated data, we know the family (the random process that was used to generate our dependent variable, “rbinom”) and we know the link function – indeed, we wrote the response function ourselves to generate the probabilities. The resulting model, with family “binomial” and link “logit”, is the classic logistic regression. In other words, logistic regression performs the inverse of the process we have gone through so far.
logit.1 <- glm(response ~ X, data = data, family = binomial(link = "logit"))
summary(logit.1)
##
## Call:
## glm(formula = response ~ X, family = binomial(link = "logit"),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.229 -0.170 -0.007 0.104 2.912
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0815 0.1624 -6.66 0.000000000028 ***
## X 1.0484 0.0802 13.08 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1368.82 on 999 degrees of freedom
## Residual deviance: 322.86 on 998 degrees of freedom
## AIC: 326.9
##
## Number of Fisher Scoring iterations: 7
Now recall the “true” parameters that we set initially, beta0 = -1 and beta1 = 1, and compare them to the coefficients of the model. We would obtain these values if we could run a linear model directly on the linear predictor (impossible in reality, since the linear predictor is never observed):
linear.1 <- lm(lin.pred ~ X, data = data)
summary(linear.1)
##
## Call:
## lm(formula = lin.pred ~ X, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.15e-13 3.00e-16 4.00e-16 6.00e-16 1.40e-15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.00e+00 4.18e-16 -2390405241922256 <2e-16 ***
## X 1.00e+00 7.44e-17 13441221255571476 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.3e-14 on 998 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.81e+32 on 1 and 998 DF, p-value: <2e-16
Question: why do we get such strange results – and a Warning message in some cases?
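Putting the three sets of numbers next to each other makes the point concrete. A small sketch, assuming the objects created above are still in the workspace:
round(rbind(true = c(beta0, beta1),
            logit = coef(logit.1),
            direct.lm = coef(linear.1)), 3)
# the logit estimates are close to the true values; the direct regression recovers them exactly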
Of course we would obtain very different results if we ran a linear model on our binary response directly:
linear.2 <- lm(response ~ X, data = data)
summary(linear.2)
##
## Call:
## lm(formula = response ~ X, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8776 -0.2218 -0.0032 0.1953 0.7782
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.44041 0.00883 49.9 <2e-16 ***
## X 0.07286 0.00157 46.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.28 on 998 degrees of freedom
## Multiple R-squared: 0.683, Adjusted R-squared: 0.683
## F-statistic: 2.15e+03 on 1 and 998 DF, p-value: <2e-16
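Finally, one way to see why the linear probability model struggles with this outcome: with the coefficients reported above, its fitted values fall outside the [0, 1] interval at the extremes of X (approximate values):
predict(linear.2, newdata = data.frame(X = c(-10, 0, 10)))
# roughly -0.29, 0.44, 1.17: the first and the last are not valid probabilities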