Bayesian Linear Model

FW8051 Statistics for Ecologists

Learning Objectives

Fit a linear model to the LionNoses data set in a Bayesian framework

  • Will emphasize the assumed statistical distribution, \(Y_i \sim N(\mu_i, \sigma^2)\)
  • Gives us practice at implementing models in JAGS
  • See how to implement residual plots, goodness-of-fit testing
  • Revisit confidence (credible) and prediction intervals from a linear model

In-class

  • Write code to fit the Bayesian model
  • Compare to results using lm
  • Look at residual plots

Goodness-of-fit Test

  1. State null and alternative hypotheses.

\(H_0\): the data are consistent with the assumed linear model.

\(H_A\): the data are not consistent with the assumed model.

  2. Determine a test statistic, \(T\), which will measure the degree to which our data are consistent with the null hypothesis:

\[T_{obs} = \sum_{i = 1}^n(Y_i - \hat{Y}_i)^2 = \sum_{i = 1}^n(Y_i - \hat{\beta}_0 - \hat{\beta}_1X_i)^2 = SSE_{obs}\]

One wrinkle here…

  • \(\hat{Y}_i\) has a posterior distribution, which means that \(T_{obs}\) also has a posterior distribution.

Goodness-of-fit Test

  3. Determine the sampling distribution of the test statistic when the null hypothesis is true.
  • We can use JAGS to generate data from the assumed model and calculate the test statistic:
for(i in 1:n){
  ...
  y.new[i] ~ dnorm(mu[i], tau)           # simulate a new observation from the fitted model
  sq.new[i] <- pow(y.new[i] - mu[i], 2)  # squared error for the simulated observation
}
fit.new <- sum(sq.new[])                 # SSE for the simulated data

This gives us:

\[T_{H_0} = \sum_{i = 1}^n(Y^{new}_i - \hat{Y}_i)^2 = \sum_{i = 1}^n(Y^{new}_i - \hat{\beta}_0 - \hat{\beta}_1X_i)^2 = SSE_{new}\]

Which will also have a posterior distribution!

Goodness-of-fit Test

  4. Lastly, we compare the posterior distribution of our test statistic calculated using the observed data, \(T_{obs}\), to the posterior distribution using data generated under the null hypothesis, \(T_{H_0}\):

\[\text{Bayesian p-value} = \frac{1}{M}\sum_{k=1}^M I(T_{H_0}^k \ge T_{obs}^k)\]

where \(I(\cdot)\) equals 1 when its argument is true and 0 otherwise, and \(k\) indexes the \(M\) different posterior samples generated using JAGS.
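Given the \(M\) posterior draws of the two test statistics, the Bayesian p-value is just the proportion of draws where \(T_{H_0}^k \ge T_{obs}^k\). A minimal Python sketch (the posterior draws below are invented for illustration; in practice they come from monitoring fit and fit.new in the JAGS output):

```python
# Hypothetical posterior draws of the two test statistics, one per MCMC sample
T_obs = [10.2, 11.5, 9.8, 12.0, 10.9]  # SSE computed from the observed data
T_H0  = [9.1, 12.3, 8.7, 10.5, 9.9]    # SSE computed from simulated data

# Bayesian p-value: proportion of posterior draws where T_H0 >= T_obs
M = len(T_obs)
p_value = sum(t_new >= t_obs for t_new, t_obs in zip(T_H0, T_obs)) / M
```

A well-fitting model gives a p-value near 0.5: the simulated data look neither systematically better nor worse than the observed data.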

If our model fits poorly, then we should expect the sum of squared errors to be smaller when the data are generated from the assumed model than when calculated from the observed data:

\(T_{H_0} < T_{obs}\) for most posterior samples (leading to a small p-value)

  • small p-value \(\implies\) the model is a poor one
  • large p-value \(\implies\) we do not have evidence to conclude our assumed model of the data generating process is a poor one.

Confidence (credible) intervals: We can calculate these after the fact, using:

  • the posterior distributions of \(\beta\)
  • Or, more directly, using the posterior distribution of \(\mu\)
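For a given \(X\), each posterior draw of \((\beta_0, \beta_1)\) gives a draw of \(\mu = \beta_0 + \beta_1 X\), and a 95% interval is just the 2.5th and 97.5th percentiles of those draws. A minimal Python sketch (the posterior draws are invented stand-ins for JAGS output):

```python
import random

random.seed(1)

# Hypothetical posterior draws of the regression coefficients
beta0 = [random.gauss(10.0, 0.5) for _ in range(4000)]
beta1 = [random.gauss(2.0, 0.2) for _ in range(4000)]

x = 3.0  # predictor value at which we want the interval

# Posterior distribution of the mean response, mu = beta0 + beta1 * x
mu = sorted(b0 + b1 * x for b0, b1 in zip(beta0, beta1))

# 95% credible interval for mu: 2.5th and 97.5th percentiles of the draws
M = len(mu)
lower, upper = mu[int(0.025 * M)], mu[int(0.975 * M)]
```

Monitoring mu[i] directly in JAGS gives these draws without recomputing \(\beta_0 + \beta_1 X\) by hand.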

Prediction intervals: We can calculate these after the fact, using:

  • the posterior distributions of \(\beta\) or \(\mu\) and adding random variation using rnorm and the posterior distribution of \(\sigma\)
  • Or, more directly, using y.new (\(Y^{new}\)) (data simulated from the fitted model)!
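The prediction interval works the same way, except each draw of \(\mu\) gets residual noise added using the corresponding draw of \(\sigma\), which mimics monitoring y.new directly in JAGS. A minimal Python sketch with invented posterior draws:

```python
import random

random.seed(2)

# Hypothetical posterior draws of the coefficients and residual SD
beta0 = [random.gauss(10.0, 0.5) for _ in range(4000)]
beta1 = [random.gauss(2.0, 0.2) for _ in range(4000)]
sigma = [abs(random.gauss(1.5, 0.1)) for _ in range(4000)]

x = 3.0  # predictor value at which we want the interval

# For each posterior draw, simulate a new observation: mu plus Normal noise
y_new = sorted(
    random.gauss(b0 + b1 * x, s) for b0, b1, s in zip(beta0, beta1, sigma)
)

# 95% prediction interval: 2.5th and 97.5th percentiles of simulated data
M = len(y_new)
lower, upper = y_new[int(0.025 * M)], y_new[int(0.975 * M)]
```

Because it incorporates both parameter uncertainty and residual variation, the prediction interval is always wider than the credible interval for \(\mu\) at the same \(X\).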