Introduction to Generalized Linear Models
FW8051 Statistics for Ecologists
Learning Objectives
Understand the role of random variables and common statistical distributions in formulating modern statistical regression models
Be able to fit appropriate models to count data and binary data (yes/no, presence/absence) in both R and JAGS
Be able to evaluate model goodness-of-fit
Be able to describe a variety of statistical models and their assumptions using equations and text and match parameters in these equations to estimates in computer output.
Outline
- Introduction to generalized linear models
- Models for count data (Poisson and Negative Binomial regression)
- Models for Binary data (logistic regression)
- Models for data with lots of zeros
Linear Regression
Often written in terms of “signal + error”:
\[y_i = \underbrace{\beta_0 + x_i\beta_1}_\text{Signal} + \underbrace{\epsilon_i}_\text{error}, \mbox{ with}\]
\[\epsilon_i \sim N(0, \sigma^2)\]
Possible because the Normal distribution has separate parameters that describe:
- mean: \(E[Y_i|X_i] = \mu_i = \beta_0 + x_i\beta_1\)
- variance: \(Var[Y_i|X_i] = \sigma^2\)
Remember: for the Poisson and Binomial distributions, the variance is a function of the mean.
Linear Regression
\[Y_i|X_i \sim N(\mu_i, \sigma^2)\] \[\mu_i=\beta_0 + \beta_1X_{1,i} + \ldots + \beta_pX_{p,i}\]
This description highlights:
- The distribution of \(Y_i\) depends on a set of predictor variables \(X_i\)
- The distribution of the response variable, conditional on the predictor variables, is Normal
- The mean of the Normal distribution depends on the predictor variables (\(X_1\) through \(X_p\)) and the regression coefficients (\(\beta_0\) through \(\beta_p\))
- The variance is constant and given by \(\sigma^2\).
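As a quick sketch of how these parameters map onto computer output, one might simulate data and fit the model with `lm()` in R (coefficient values and variable names below are made up for illustration):

```r
## Simulate data from the linear regression model above (assumed values)
set.seed(1)
n <- 100
x <- runif(n, 0, 10)                            # a single continuous predictor
y <- 2 + 0.5 * x + rnorm(n, mean = 0, sd = 1)   # beta0 = 2, beta1 = 0.5, sigma = 1

fit <- lm(y ~ x)
summary(fit)   # '(Intercept)' estimates beta0; 'x' estimates beta1
sigma(fit)     # 'Residual standard error' estimates sigma
```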
Linear Regression = General Linear Model
Linear regression is sometimes referred to as the General Linear Model; special cases include:
- t-test (categorical predictor with 2 categories)
- ANOVA (categorical predictor with \(> 2\) categories)
- ANCOVA (continuous and categorical predictor, no interaction so common slope)
- Continuous and categorical variables, with possible interactions
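As a small illustration (simulated data, hypothetical group labels), a two-sample t-test is just a linear model with a two-level categorical predictor:

```r
## A t-test expressed as a linear model (illustrative simulation)
set.seed(5)
group <- factor(rep(c("A", "B"), each = 20))
y <- rnorm(40, mean = ifelse(group == "A", 5, 7), sd = 1)

summary(lm(y ~ group))                  # 'groupB' = estimated difference in means
t.test(y ~ group, var.equal = TRUE)     # equivalent two-sample comparison
```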
Generalized Linear Models
Generalized linear models further unify several different regression models:
- General linear model
- Logistic regression
- Poisson regression
A rather elegant general theory has been developed for the exponential family of distributions
Generalized Linear Models (glm)
Systematic component: \(g(\mu_i) = \eta_i = \beta_0 + \beta_1x_1 + \ldots + \beta_px_p\)
Some transformation of the mean, \(g(\mu_i)\), results in a linear model.
- \(g( )\) is called the link function
- \(\eta_i = \beta_0 + \beta_1x_1 + \ldots + \beta_px_p\) is called the linear predictor.
- \(\mu_i = g^{-1}(\eta_i) = g^{-1}(\beta_0 + \beta_1x_1 + \ldots + \beta_px_p)\)
Random component: \(Y_i|X_i \sim f(y_i|x_i), i=1, \ldots, n\)
- \(f(y_i|x_i)\) is in the exponential family (includes normal, Poisson, binomial, gamma, inverse Gaussian)
- \(f(y_i|x_i)\) describes unmodeled variation about \(\mu_i = E[Y_i|X_i]\)
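In R, the random component and the link function are bundled together in a `family` object; a small sketch of how to inspect them:

```r
## Each family stores its link name, link function, and inverse link
poisson()$link            # "log"
binomial()$linkfun(0.5)   # logit(0.5) = 0
binomial()$linkinv(0)     # inverse logit of 0 = 0.5
gaussian()$link           # "identity"
```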
Generalized Linear Models (glm)
Linear Regression:
- \(f(y_i|x_i) = N(\mu_i, \sigma^2)\)
- \(E[Y_i|X_i]= \mu_i = \beta_0 + \beta_1x_1 + \ldots + \beta_px_p\)
- \(g(\mu_i) = \eta_i = \mu_i\), the identity link
- \(\mu_i = g^{-1}(\eta_i) = \eta_i = \beta_0 + \beta_1x_1 + \ldots + \beta_px_p\)
Poisson regression:
- \(f(y_i|x_i) = Poisson(\lambda_i)\)
- \(E[Y_i|X_i] = \mu_i = \lambda_i\)
- \(g(\mu_i) = \eta_i = log(\lambda_i) = \beta_0 + \beta_1x_1 + \ldots + \beta_px_p\)
- \(\mu_i = g^{-1}(\eta_i) = \exp(\eta_i) = \exp(\beta_0 + \beta_1x_1 + \ldots + \beta_px_p)\)
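A minimal sketch of fitting a Poisson regression in R with `glm()` (simulated counts, assumed coefficient values):

```r
## Poisson regression: simulated counts with assumed beta0 = 0.5, beta1 = 1.2
set.seed(2)
n <- 200
x <- runif(n)
y <- rpois(n, lambda = exp(0.5 + 1.2 * x))

fit_pois <- glm(y ~ x, family = poisson(link = "log"))
summary(fit_pois)     # coefficients are reported on the log (link) scale
exp(coef(fit_pois))   # back-transformed: multiplicative effects on the mean
```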
Other GLMs
Logistic regression:
- \(f(y_i|x_i) =\) Bernoulli\((p_i)\)
- \(E[Y_i|X_i] = p_i\)
- \(g(\mu_i) = \eta_i = logit(p_i) = log(\frac{p_i}{1-p_i}) = \beta_0 + \beta_1x_1 + \ldots + \beta_px_p\)
- \(\mu_i = g^{-1}(\eta_i) = \frac{\exp(\eta_i)}{1+\exp(\eta_i)} = \frac{\exp(\beta_0 + \beta_1x_1 + \ldots + \beta_px_p)}{1+\exp(\beta_0 + \beta_1x_1 + \ldots + \beta_px_p)}\)
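Similarly, a minimal sketch of fitting a logistic regression with `glm()` (simulated binary data, assumed coefficient values):

```r
## Logistic regression: simulated 0/1 data with assumed beta0 = -1, beta1 = 2
set.seed(3)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 2 * x))

fit_logit <- glm(y ~ x, family = binomial(link = "logit"))
summary(fit_logit)    # coefficients are on the logit scale
exp(coef(fit_logit))  # odds ratios
```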
Link functions and sample space
Link functions allow the “structural component” (\(\beta_0 + \beta_1x_1 + \ldots + \beta_px_p\)) to live on \((-\infty, \infty)\) while keeping the \(\mu_i\) consistent with the range of the response variable.
Poisson (counts): sample space = \(\{0, 1, 2, \ldots\}\)
- \(g(\mu_i) = \eta_i = log(\lambda_i) = \beta_0 + \beta_1x_1 + \ldots + \beta_px_p\), range = \((-\infty, \infty)\)
- \(\mu_i = \exp(\beta_0 + \beta_1x_1 + \ldots + \beta_px_p)\), range = \((0, \infty)\)
Logistic regression:
- \(g(\mu_i) = \eta_i = log(\frac{p_i}{1-p_i}) = \beta_0 + \beta_1x_1 + \ldots + \beta_px_p\), range = \((-\infty, \infty)\)
- \(\mu_i = g^{-1}(\eta_i) = \frac{\exp(\eta_i)}{1+\exp(\eta_i)} = \frac{\exp(\beta_0 + \beta_1x_1 + \ldots + \beta_px_p)}{1+\exp(\beta_0 + \beta_1x_1 + \ldots + \beta_px_p)}\), range = \((0, 1)\)
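These link and inverse-link functions are easy to evaluate directly in R (arbitrary example values):

```r
## Log link (Poisson): maps means on (0, Inf) to (-Inf, Inf) and back
eta <- c(-2, 0, 3)   # arbitrary linear-predictor values
exp(eta)             # inverse link: always positive
log(exp(eta))        # link recovers the linear predictor

## Logit link (logistic): maps probabilities on (0, 1) to (-Inf, Inf) and back
qlogis(0.25)         # logit(0.25) = log(0.25 / 0.75)
plogis(eta)          # inverse logit: always between 0 and 1
```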
Example: Poisson Regression for Pheasant Counts
- \(X_i\) = amount of grassland cover
- \(Y_i|X_i \sim Poisson(\lambda_i)\)
- \(log(\lambda_i) = \beta_0 + \beta_1X_{i}\)
Because the mean of the Poisson distribution is \(\lambda\):
- \(E[Y_i|X_i] = \lambda_i = \exp(\beta_0 + \beta_1X_i) = \exp(\beta_0)\exp(\beta_1X_i)\)
- The mean number of pheasants increases by a factor of \(\exp(\beta_1)\) as we increase \(X_i\) by 1 unit.
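A quick numerical check of this multiplicative interpretation (coefficient and covariate values below are made up):

```r
## Ratio of means when grassland cover increases by 1 unit (hypothetical values)
beta0 <- 0.2; beta1 <- 0.03; x <- 40
exp(beta0 + beta1 * (x + 1)) / exp(beta0 + beta1 * x)   # ratio of means
exp(beta1)                                              # the same factor
```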
Assumptions
- Poisson Response: The response variable is a count per unit of time or space, described by a Poisson distribution.
- Independence: The observations must be independent of one another.
- Mean=Variance: By definition, the mean of a Poisson random variable must be equal to its variance.
- Linearity: The log of the mean rate, log(\(\lambda\)), must be a linear function of x.
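One informal way to look at the Mean = Variance assumption is to compare the sample mean and variance of the counts (simulated here for illustration); for a fitted model, a residual deviance much larger than its degrees of freedom suggests overdispersion:

```r
## Mean and variance of Poisson counts should be roughly equal (simulated example)
set.seed(4)
y <- rpois(500, lambda = 4)
c(mean = mean(y), variance = var(y))

## For a fitted glm, compare residual deviance to residual df, e.g.:
# fit_pois$deviance / fit_pois$df.residual   # values >> 1 suggest overdispersion
```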
Next Steps
- Fit regression models (using ML and Bayesian methods) appropriate for count data
- Poisson regression models
- Negative Binomial regression
- Use an offset to model rates and densities, accounting for variable survey effort
- Use simple tools to assess model fit
- Residuals (deviance and Pearson)
- Goodness-of-fit tests
Next Steps
- Interpret estimated coefficients and describe their uncertainty using confidence and credible intervals
- Use deviances and AIC to compare models.
- Be able to describe statistical models and their assumptions using equations and text and match parameters in these equations to estimates in computer output.