Exercise 1: Playing with Maximum Likelihood

In this part we want to get a grasp of how Maximum Likelihood (ML) estimation works in practice. We will start by finding the parameter \(p\) of a binary variable (i.e. the probability of supporting the government) using ML estimation.

We start by setting the parameter value, i.e. the probability that an average citizen supports the government, and generating a sample of 10 individuals:

p <- 0.6
gov.app <- rbinom(n = 10, size = 1, prob = p)
table(gov.app)
## gov.app
## 0 1 
## 5 5

Analytically, we can easily find the most likely value of p: the ML estimate is simply the sample mean of the variable:

mean(gov.app)
## [1] 0.5

However, we want to find the parameter numerically, that is, use ML to see what the most “likely” value of p is. To do so, we first need to write a function that defines the probability of our data. We saw in the lecture that the likelihood function for a binomial distribution with \(N=1\) is: \[ L(p \mid y, n) = \prod_{i=1}^{n} p_i^{y_i} (1-p_i)^{1-y_i} \] We only have to translate this function into R:

p.lik <- function(data, p){
  lik <- 1
  # Loop through all observations to multiply their probabilities together
  for (i in 1:length(data)){ 
    y <- data[i]
    lik <- lik*(p^y*(1-p)^(1-y)) # Bernoulli probability of observation i
  }
  return(lik) # Return the final product: the likelihood
}
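As a cross-check (this is just an alternative sketch, not part of the exercise), the same likelihood can be computed without an explicit loop, using R's built-in dbinom() for the Bernoulli probabilities and prod() for the product:

# Alternative, loop-free version of the same likelihood (a sketch):
# dbinom(y, size = 1, prob = pp) gives the Bernoulli probability of each y,
# prod() multiplies them, and sapply() handles a whole vector of p values.
p.lik2 <- function(data, p){
  sapply(p, function(pp) prod(dbinom(data, size = 1, prob = pp)))
}
p.lik2(gov.app, 0.5) # should equal p.lik(gov.app, 0.5)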

Let’s have a look at some values. Remember, we set p = 0.6 but mean(gov.app) = 0.5. With only 10 observations, it is not surprising that the sample mean differs from the value of p that we set.

options(scipen=999)  # This command kills the scientific notation
p.lik(gov.app, 0.5)
## [1] 0.0009765625
p.lik(gov.app, 0.6)
## [1] 0.0007962624
p.lik(gov.app, 0.7)
## [1] 0.0004084101

We can see this more clearly by plotting it:

p.seq <- seq(0, 1, by = 0.001)
plot(x = p.seq, y = p.lik(gov.app, p.seq), type = "n", 
     xlab = "p", ylab = "likelihood")
lines(x = p.seq, y = p.lik(gov.app, p.seq))
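As a quick visual check (optional), we can mark the sample mean on the plot; the likelihood curve should peak exactly there:

abline(v = mean(gov.app), lty = 2) # analytic ML estimate: the sample mean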

ML estimates are usually obtained iteratively: the algorithm tries some possible values of the parameter(s) to be estimated and keeps updating them, based on the first and second derivatives, until it reaches the maximum. R has a base function called optimize() that automatically returns the maximum (or minimum) of a given function over an interval:

opt <- optimize(f = p.lik,         # The function to be optimized
                data = gov.app,    # The data
                interval = c(0,1), # The interval of possible values
                maximum = TRUE)    # The default is to minimize; we ask to maximize
opt
## $maximum
## [1] 0.5
## 
## $objective
## [1] 0.0009765625

Question: As we saw in the lecture, the log-likelihood function for a binomial distribution with \(N=1\) is: \[ \log L(p \mid y, n) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1-y_i) \log (1-p_i) \right] \] Based on this formula and the syntax that we used above:

  1. Write a log-likelihood function
  2. Plot the obtained values with a range of probabilities going from 0 to 1
  3. Look for the maximum using the “optimize()” function

To make your life easier with the R function, below there’s a “skeleton” that you can fill in.

p.log.lik <- function(,){
  l.lik <- 
  for (i in 1:){ 
    y <- 
    l.lik <- 
  }
  return() 
}
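For reference, one possible way to fill in the skeleton is sketched below (try it on your own before peeking); it simply replaces the product in p.lik() with a sum of logged Bernoulli terms, following the formula above:

# A possible completion of the skeleton (a sketch, not the only solution):
p.log.lik <- function(data, p){
  l.lik <- 0                        # sums start at 0, products start at 1
  for (i in 1:length(data)){ 
    y <- data[i]
    l.lik <- l.lik + y*log(p) + (1 - y)*log(1 - p) # log of the Bernoulli term
  }
  return(l.lik) 
}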

Exercise 2: ML and Logistic Regression

Let’s go back to the “field.goals” data that we used yesterday. Now we have the tools to estimate the effect of “yards” on the probability of success using ML estimation by hand.

setwd("~/Dropbox/Teaching/GLM_ECPR_2017/Lab/Day2")
library(ggplot2)
library(plyr)
load("field.goals.rda")
field.goals$good <- ifelse(field.goals$play.type == "FG good", 1, 0)
table(field.goals$good)
## 
##   0   1 
## 195 787

As a first step, we have to write the log-likelihood function for a logit model. In the lecture we saw that this function is: \[ l(\beta) = \sum_{i=1}^{n} y_i X_i \beta - \sum_{i=1}^{n} \log[1 + \exp(X_i\beta)] \] which, simplifying the notation, can also be written as: \[ l(\beta) = \sum_{i=1}^{n} \left\{ y_i X_i \beta - \log[1 + \exp(X_i\beta)] \right\} \]

We will use the optim function, which is slightly different from the optimize function: optimize searches for the ML parameters within a given interval (see the previous exercise), while optim is a more general-purpose optimization routine, where we plug in some starting values for the parameters and the algorithm searches for the ML values iteratively. By default the algorithm performs minimization, so we have to make our function return the negative of the quantity we are interested in: the sample log-likelihood is a negative number, and maximizing it is the same as minimizing its negative.

Let’s convert this into an R function.

logit.Lik <- function(par, X, Y) {
  beta0 <- par[1] # Intercept
  beta1 <- par[2] # Slope of X
  # Negative log-likelihood; the "1 *" multiplying beta0 stands for the constant term
  out <- -sum(Y*(1 * beta0 + X * beta1) - log(1 + exp(1 * beta0 + X * beta1)))
  return(out)
}

The function above takes three arguments: par is a vector of parameter values (the values at which the likelihood is evaluated, given the data); Y and X are two vectors of data, the dependent and the independent variable. Now we can use the optim function to find the parameter values that maximize the likelihood.

hand.logit = optim(par = c(0.1,0.1),      # The starting values
                   fn = logit.Lik,        # The function
                   X = field.goals$yards, # Vector of X
                   Y = field.goals$good,  # Vector of Y
                   hessian=TRUE)          # Ask to return the "Hessian" matrix
hand.logit$par
## [1]  5.18085084 -0.09730862
hand.logit$hessian
##           [,1]       [,2]
## [1,]  137.3036   5654.138
## [2,] 5654.1378 243041.990

As we discussed, the Hessian matrix is the matrix of second partial derivatives of the log-likelihood function with respect to the estimated parameters. Because we asked optim to minimize the negative log-likelihood, the Hessian it returns is the observed information matrix, and its inverse is the variance-covariance matrix of the ML estimates, analogous to the variance-covariance matrix in linear regression.

solve(hand.logit$hessian)
##              [,1]           [,2]
## [1,]  0.173446094 -0.00403505631
## [2,] -0.004035056  0.00009798621

If we want to see the standard errors, we need to take the square root of the diagonal of the variance-covariance matrix.

sqrt(diag(solve(hand.logit$hessian)))
## [1] 0.416468599 0.009898799
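If we also want the z values that glm's summary reports (an optional sketch), we can divide the estimates by their standard errors:

hand.logit$par / sqrt(diag(solve(hand.logit$hessian))) # z values: estimate / SE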

Now let’s run a logit model with the glm function (which we will use in the coming days) and compare the results obtained with the two methods:

logit <- glm(good ~ yards, data = field.goals,family = binomial(link = "logit"))
summary(logit)
## 
## Call:
## glm(formula = good ~ yards, family = binomial(link = "logit"), 
##     data = field.goals)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5582   0.2916   0.4664   0.6978   1.3789  
## 
## Coefficients:
##              Estimate Std. Error z value            Pr(>|z|)    
## (Intercept)  5.178856   0.416201  12.443 <0.0000000000000002 ***
## yards       -0.097261   0.009892  -9.832 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 978.90  on 981  degrees of freedom
## Residual deviance: 861.22  on 980  degrees of freedom
## AIC: 865.22
## 
## Number of Fisher Scoring iterations: 5
# Logit model by hand
data.frame(beta = hand.logit$par, se = sqrt(diag(solve(hand.logit$hessian))))
##          beta          se
## 1  5.18085084 0.416468599
## 2 -0.09730862 0.009898799
# Logit model with "glm()"
data.frame(beta = coef(logit), se = sqrt(diag(vcov(logit))))
##                    beta          se
## (Intercept)  5.17885616 0.416201278
## yards       -0.09726075 0.009891874

Let’s add a fitted line to the plot we made yesterday. First, we reproduce the plot of the observed success shares by distance:

shares <- ddply(field.goals, .(yards), summarise, prob = mean(good))
ggplot(shares, aes(x = yards, y = prob)) +
  geom_point() +
  theme_bw()

Given the model coefficients, we can calculate the predicted value for each distance (on the scale of the linear predictor):

shares$pred <- coef(logit)[1] + shares$yards*coef(logit)[2]
head(shares)
##   yards      prob     pred
## 1    18 1.0000000 3.428163
## 2    19 1.0000000 3.330902
## 3    20 0.9600000 3.233641
## 4    21 0.9655172 3.136380
## 5    22 0.9600000 3.039120
## 6    23 0.9666667 2.941859

Since the coefficients are on the scale of the linear predictor, the resulting values are predicted log-odds. To be able to plot them together with the shares, we need to convert them into probabilities. To do so, we can use the “inverse logit” function that we saw yesterday:

inv.logit.fun <- function(x) exp(x)/(1 + exp(x))
shares$pred <- inv.logit.fun(shares$pred)
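Base R also ships plogis(), which is the same inverse-logit transformation, so the conversion could equivalently be written in one line (just an alternative sketch):

# Equivalent conversion using base R's logistic CDF, i.e. the inverse logit:
shares$pred <- plogis(coef(logit)[1] + shares$yards*coef(logit)[2])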

Now we can add the fitted line to the plot:

ggplot(shares, aes(x = yards, y = prob)) +
  geom_point() +
  geom_line(aes(x = yards, y = pred)) +
  theme_bw()

Question: How does the model fit the data?

One way to tell is to build a classification table: based on the model results and the distance in yards, we can predict for every observation whether the attempt would be successful (1) or not (0), and then compare the predicted outcomes with the observed ones.

  1. The “predict” function in R applies to several types of models and produces predictions based on the model results. Use it to obtain predicted probabilities of success for each observation in our data
  2. Convert the probabilities into actual outcomes
  3. Make a classification table with predicted and actual outcomes
  4. Calculate the share of correctly predicted cases
field.goals$p.pred <- predict( ... )
field.goals$good.pred <- ...
clas.tab <- table( ... , ... )
...
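If you get stuck, one possible way to fill in the skeleton is sketched below. It assumes a 0.5 cut-off for turning a predicted probability into a predicted success, which is a common but arbitrary choice:

# A possible solution sketch (the 0.5 cut-off is an assumption, not the only option):
field.goals$p.pred <- predict(logit, type = "response")          # predicted probabilities
field.goals$good.pred <- ifelse(field.goals$p.pred >= 0.5, 1, 0) # predicted outcomes
clas.tab <- table(field.goals$good.pred, field.goals$good)       # predicted vs observed
mean(field.goals$good.pred == field.goals$good)                  # share correctly predicted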

Another approach is to run the Hosmer–Lemeshow test. R has a function that does it automatically in the “ResourceSelection” package. Here too you will need the predicted values coming out of the “logit” object (the predicted probabilities, not the outcomes) and the observed values of the dependent variable. Is the test significant? If so, what does it mean?

library(ResourceSelection)
hl <- hoslem.test( ... )
...
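In case you need a hint, the call could look roughly like the sketch below; it assumes the argument order documented in ?hoslem.test (x for the observed outcomes, y for the fitted probabilities, g for the number of groups):

# A possible call (a sketch; check ?hoslem.test for the exact arguments):
hl <- hoslem.test(x = field.goals$good, y = fitted(logit), g = 10)
hl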