Frequentist versus Bayesian statistics

FW8051 Statistics for Ecologists

Learning Objectives

  • Understand differences in how probability is defined in Frequentist and Bayesian statistics

  • Understand how to estimate parameters and their uncertainty using Bayesian methods

  • Compare Bayesian and Frequentist inference, starting with a simple problem that we can solve analytically.

Frequentist Statistics

  • Probability = relative frequency of some event across an infinite sequence of random, repeatable experiments or trials
  • Data are random; parameters are fixed and estimated using:
    • Maximum Likelihood: P(data; parameters)
  • Data are used to test hypotheses which are either True or False.
    • p-value = P(getting a statistic as or more extreme than our sample statistic | Null hypothesis is true)

Goal: make ‘good’ decisions with high probability (across potential repeated experiments)

Bayesian Statistics

  • Probability reflects “belief” about the system, taking into account prior expectations and data
  • Random variables are used to model all sources of uncertainty
    • Parameters have a distribution
  • Data are fixed. Inference is performed, conditional on the data.
  • P(Hypothesis | data), which ranges from 0 to 1

Still want to make ‘good’ decisions with high probability (across potential repeated experiments)…calibrated Bayes!

Key Difference: Probability

Frequentist: relative frequency of events

Bayesian: belief about the system

Let's compare inference from the two methods with a simple example.

What is the probability, \(p\), that a MNDNR biologist will detect a moose when flying in a helicopter?

Moose within circular fields of view with varying levels of cover. Pictures were taken from a helicopter.

Goal: estimate \(p\) and characterize uncertainty with respect to \(p\).

Maximum Likelihood Using Binomial Distribution

  1. Collect data: \(n\) = 124 trials, \(y\) = 59 moose observed …

  2. Make assumptions about the data generating process:

  • Each trial is independent with constant probability of detection. What distribution would you use?
  • \(y \sim Binomial(124, p)\), with \(p\) = Probability of a success (i.e., probability of seeing a moose)
  3. Estimate \(p\) using Maximum Likelihood

Maximum Likelihood

If \(y \sim\) Binomial(n, \(p\)), then:

\(L(p | y, n) = \frac{n!}{y!(n-y)!}p^{y}(1-p)^{n-y}\)

\(\log(L) = \log\left(\frac{n!}{y!(n-y)!}\right) + y\log(p) + (n-y)\log(1-p)\)

Maximize \(\log[L(p | y)]\) with respect to \(p\) (take the derivative, set it equal to 0, and solve) [On Board]:

\(\hat{p} = y/n\)

\(var(\hat{p}) = var(y/n) = var(y)/n^2 = p(1-p)/n\)

\(= I^{-1}(p), \mbox{ where } I(p) = E\left(-\frac{\partial^2 logL(p)}{\partial p^2}\right)\)
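
As a quick numerical check (a sketch, not part of the derivation), we can maximize the binomial log-likelihood in R with optimize and confirm that the maximum occurs at \(\hat{p} = y/n\):

# Numerically maximize the binomial log-likelihood (sketch)
logL <- function(p, y = 59, n = 124) dbinom(y, n, p, log = TRUE)
optimize(logL, interval = c(0, 1), maximum = TRUE)$maximum  # ~ 59/124 = 0.476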

Frequentist Inference for \(p\)

For large \(n\), a 95% CI = \(\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)

Where does this come from?

  • For large \(n\), \(\hat{p} \sim N(p, I^{-1}(p))\). This is the approximate sampling distribution of \(\hat{p}\)!
  • \(I^{-1}(p)= \frac{p(1-p)}{n}\)

If \(\hat{p} \sim N(p, I^{-1}(p))\), then \(\frac{\hat{p}-p}{\sqrt{var(\hat{p})}} \sim N(0, 1)\)

\(P(-z_{1-\alpha/2} \le \frac{\hat{p}-p}{\sqrt{var(\hat{p})}} \le z_{1-\alpha/2}) = 1-\alpha\), where \(z_{1-\alpha/2}\) is the \(1-\alpha/2\) quantile of the standard normal distribution.

To get a (1-\(\alpha\)) CI, solve the above expression for \(p\), substituting \(\hat{p}\) for \(p\) in \(var(\hat{p})\).

\[P(\hat{p}+ 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}> p > \hat{p}- 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}) \approx 0.95\]

Frequentist Inference

Calculate the large sample, normal-based confidence interval in R. \(\hat{p} = 59/124 = 0.48\).

# Estimate and SE
(theta.hat<-59/124) 
[1] 0.4758065
(se.theta.hat<-sqrt(theta.hat*(1-theta.hat)/124))
[1] 0.04484873
# Confidence Interval
round(rep(theta.hat,2)+ c(-1.96,1.96)*se.theta.hat,2)
[1] 0.39 0.56

How well do these work?

  • Simulate 10,000 binomial random variables using:
    • x <- rbinom(10000, size = n, prob = p)
    • with n = 15, 30, 100 and p = 0.1, 0.5, 0.9 (the code below uses n = 124, p = 59/124)
  • Estimate 10,000 \(\hat{p}\)’s = x/n and 10,000 95% CIs
  • Determine how many of these CIs include \(p\).
# 10,000 repeated samples of size 124 and theta = theta.hat
ys<-rbinom(10000,size=124,prob=59/124) 

# Calculate 10,000 theta^'s, SE(theta^)'s, CI's
theta.hats<-ys/124
se.theta.hats<-sqrt(theta.hats*(1-theta.hats)/124)
up.CIs<-theta.hats+1.96*se.theta.hats
low.CIs<-theta.hats-1.96*se.theta.hats

# Determine coverage
inCI<-I(low.CIs < 59/124 & up.CIs > 59/124)  # true theta is in the interval
sum(inCI)/10000
[1] 0.9407

Confidence intervals

Frequentist Interpretation

In reality, we get 1 data set. We ended up with \(\hat{p}\) = 0.48, with 95% CI = (0.39, 0.56)

How do we interpret this CI?

\(p\) is either in the confidence interval or not! \(P(p \in CI) = 0 \text{ or } 1\)

The procedure we used should result in an interval that contains the true parameter 95% of the time.

Bayesian Inference

How does it differ? What are the steps?

  1. Specify a likelihood for the data, \(L(y | p)\)
  2. Specify a prior distribution for the parameters, \(\pi(p)\), reflecting our a priori belief about \(p\)
  3. Use Bayes rule to determine the posterior distribution of \(p\) given the data, \(p(p | y)\):

\[p(p | y) = \frac{L(y | p)\pi(p)}{p(y)} = \frac{L(y | p)\pi(p)}{\int L(y | p)\pi(p)dp}\]

The posterior distribution captures our belief about the parameters after having collected data!

Bayes Theorem

\[p(p | y) = \frac{L(y | p)\pi(p)}{p(y)} = \frac{L(y | p)\pi(p)}{\int L(y | p)\pi(p)dp}\]

\(p(y) = \int L(y | p)\pi(p)dp\) is the marginal distribution of \(y\), which requires integrating over \(p\).

  • This is a continuous version of the law of total probability formula we saw previously
  • This integral is often difficult to solve, so Markov chain Monte Carlo (MCMC) is used to generate samples from, and summaries of, the posterior distribution, \(p(p | y)\) (see the sketch below)
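
To make MCMC concrete, here is a minimal random-walk Metropolis sampler for our moose example (a sketch only: the starting value, proposal standard deviation, and number of iterations are arbitrary choices, and it uses the Beta(1, 1) prior introduced below):

# Random-walk Metropolis sampler for p(p | y), with y = 59, n = 124,
# and a Beta(1, 1) prior (a sketch; tuning values are arbitrary)
set.seed(1)
log.post <- function(p) dbinom(59, 124, p, log = TRUE) + dbeta(p, 1, 1, log = TRUE)
n.iter <- 10000
p.samples <- numeric(n.iter)
p.cur <- 0.5                                # starting value
for (i in 1:n.iter) {
  p.prop <- rnorm(1, p.cur, 0.05)           # propose a value near p.cur
  if (p.prop > 0 && p.prop < 1 &&
      log(runif(1)) < log.post(p.prop) - log.post(p.cur)) {
    p.cur <- p.prop                         # accept; otherwise keep p.cur
  }
  p.samples[i] <- p.cur
}
quantile(p.samples, c(0.025, 0.975))        # close to the analytical answer below

The quantiles of these samples closely match the analytical Beta(60, 66) posterior we derive below.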

In words…

Posterior distribution \(\propto\) Likelihood x prior distribution

Bayesian Inference

Likelihood (from binomial): \(p(y | p) \propto p^{59}(1-p)^{124-59}\)

Prior probability distribution for \(p\)?

  • \(\pi(p)\) must have support on \(0 \le p \le 1\)
  • A natural choice is the Beta(\(\alpha, \beta\)) distribution: \(\pi(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} p^{\alpha-1}(1-p)^{\beta-1}\)
  • Here, we choose \(p \sim\) Beta(1, 1), equivalent to a Uniform(0, 1)

[Plot this using curve(dbeta(x, 1, 1), from=0, to=1)]

Use \(\pi(p)\) and \(p(y|p)\) and Bayes Theorem to calculate \(p(p | y)\), the posterior distribution.

Bayes Theorem

\[p(p | y) = \frac{p(y | p)\pi(p)}{p(y)} = \frac{p(y | p)\pi(p)}{\int_{0}^{1}p(y | p)\pi(p)\,dp}\]

\[p(p | y) \propto p(y | p)\pi(p)\]

\[p(p |y) \propto p^{59}(1-p)^{124-59}\cdot 1\]

\[p(p |y) \propto p^{60-1}(1-p)^{66-1}\]

This is a beta distribution with parameters (60, 66).

The posterior distribution gives us the probability distribution of the parameter, given the data and our prior beliefs.
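
As a numeric sanity check (a sketch; the grid resolution is an arbitrary choice), we can normalize likelihood \(\times\) prior on a grid and confirm it matches the Beta(60, 66) density:

# Grid-based check that likelihood x prior, once normalized,
# matches the Beta(60, 66) posterior (sketch)
p.grid <- seq(0.001, 0.999, length = 999)
unnorm <- dbinom(59, 124, p.grid) * dbeta(p.grid, 1, 1)      # L(y | p) * prior
post.grid <- unnorm / sum(unnorm * (p.grid[2] - p.grid[1]))  # normalize numerically
max(abs(post.grid - dbeta(p.grid, 60, 66)))                  # ~ 0, up to grid error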

Bayesian Inference

Use curve to plot the posterior distribution = Beta(60,66).

par(bty="L", mar=c(2,4.1,1,2.1))
# Plot the posterior distribution of theta, Beta(60, 66)
# (curve draws the plot itself; no need to wrap it in plot)
curve(dbeta(x, 60, 66), from=0, to=1, xlab=expression(theta),
      ylab=expression(p(group("", theta, "|")*y)))

Plot of a beta distribution with parameters alpha = 60 and beta = 66.

Credible interval

Find the endpoints, \(x_1\) and \(x_2\) such that \(P(p \in (x_1,x_2)) = 0.95\).

Use qbeta [remember, \(\alpha = 60, \beta = 66\)]

# 95% credible interval
round(qbeta(c(0.025, 0.975), 60,66),2)
[1] 0.39 0.56

Same endpoints as Frequentist confidence interval, different interpretation!

Interpretation: \(p\) has a 95% chance of being in the interval.

  • Represents our belief based on data and our prior assumptions
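
We can check this statement directly from the posterior distribution (a quick sketch):

# Posterior probability that p lies in the interval (0.39, 0.56)
pbeta(0.56, 60, 66) - pbeta(0.39, 60, 66)  # ~ 0.95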

Frequentist vs. Bayesian

Table from http://www.austincc.edu/mparker/stat/nov04/ comparing and contrasting Bayesian and Frequentist analyses.

Advantages of Bayesian statistics

  • Easier to fit complex models using Bayesian methods
  • Easy to characterize uncertainty for functions of the parameters (see the sketch after this list)
  • Intuitive appeal of credible intervals (vs. confidence intervals)
  • Coherent philosophy of statistics
    • All inferences come from the posterior distribution
    • No separate theories for estimation, hypothesis testing, multiple comparisons, etc.
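
For example, here is a minimal sketch showing how posterior draws make inference for a derived quantity easy: we transform draws from the Beta(60, 66) posterior to get a credible interval for the odds of detection, \(p/(1-p)\):

# Credible interval for a function of p (the odds), obtained by
# transforming draws from the Beta(60, 66) posterior (sketch)
set.seed(1)
p.draws <- rbeta(10000, 60, 66)
odds.draws <- p.draws / (1 - p.draws)
round(quantile(odds.draws, c(0.025, 0.975)), 2)  # 95% credible interval for the odds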

Disadvantages of Bayesian statistics

  • With small samples, priors can make a big difference
  • Perceived subjectivity
  • Computationally demanding when using MCMC

“Ecologists should be aware that Bayesian methods constitute a radically different way of doing science. Bayesian statistics is not just another tool to be added into ecologists’ repertoire of statistical methods. Instead, Bayesians categorically reject various tenets of statistics and the scientific method that are currently widely accepted in ecology and other sciences. The Bayesian approach has split the statistics world into warring factions (ecologists’ “density independence” vs “density dependence” debates of the 1950s pale by comparison), and it is fair to say that the Bayesian approach is growing rapidly in influence” - Brian Dennis (1996, Ecological Applications 6:1095-1103).

Pragmatic Statistician

Many do consider Bayesian methods another tool in the toolbox…

We will often fit models using both frequentist and Bayesian statistics (often, with similar answers)!

When is Bayesian Inference “Easier” or Preferred?

Dorazio 2016. Population Ecology 58:31-44

  • Hierarchical models that combine sampling and ecological processes.
  • Inference for latent (i.e., unobserved) state variables
  • Missing data problems
  • Intractable likelihood functions
  • Complex models that combine different sources and types of data (with shared parameters)