FW 8051: Slide Presentations – HW 02: Multiple Regression

Coding of categorical variables

We covered 2 main ways to code for categorical variables:

Reference coding (the book also calls this effects coding following Marc Kery’s book, Introduction to Winbugs for Ecologists)
Means coding

Several students used a third type of coding, sometimes ALSO called effects coding (not to be confused with reference/effects coding in the book).

Models

males<-c(120, 107, 110, 116, 114, 111, 113, 117, 114, 112)
females<-c(110, 111, 107, 108, 110, 105, 107, 106, 111, 111)
jawdat <- data.frame(jaws = c(males, females),
                     sex = c(rep("M",10), rep("F", 10)))

lm.ref <- lm(jaws ~ sex, data = jawdat)
lm.means <- lm(jaws ~ sex - 1, data = jawdat)
lm.effects <-lm(jaws ~ sex, data = jawdat, 
                contrasts = list(sex = contr.sum))

Design matrices

summary(lm.ref)


Call:
lm(formula = jaws ~ sex, data = jawdat)

Residuals:
   Min     1Q Median     3Q    Max 
  -6.4   -1.8    0.1    2.4    6.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 108.6000     0.9741 111.486  < 2e-16 ***
sexM          4.8000     1.3776   3.484  0.00265 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.08 on 18 degrees of freedom
Multiple R-squared:  0.4028,    Adjusted R-squared:  0.3696 
F-statistic: 12.14 on 1 and 18 DF,  p-value: 0.002647

jawdat[c(1,11),]

   jaws sex
1   120   M
11  110   F

model.matrix(lm.ref)[c(1,11),]

   (Intercept) sexM
1            1    1
11           1    0

Design matrices

summary(lm.means)


Call:
lm(formula = jaws ~ sex - 1, data = jawdat)

Residuals:
   Min     1Q Median     3Q    Max 
  -6.4   -1.8    0.1    2.4    6.6 

Coefficients:
     Estimate Std. Error t value Pr(>|t|)    
sexF 108.6000     0.9741   111.5   <2e-16 ***
sexM 113.4000     0.9741   116.4   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.08 on 18 degrees of freedom
Multiple R-squared:  0.9993,    Adjusted R-squared:  0.9992 
F-statistic: 1.299e+04 on 2 and 18 DF,  p-value: < 2.2e-16

jawdat[c(1,11),]

   jaws sex
1   120   M
11  110   F

model.matrix(lm.means)[c(1,11),]

   sexF sexM
1     0    1
11    1    0

Design matrices

summary(lm.effects)


Call:
lm(formula = jaws ~ sex, data = jawdat, contrasts = list(sex = contr.sum))

Residuals:
   Min     1Q Median     3Q    Max 
  -6.4   -1.8    0.1    2.4    6.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 111.0000     0.6888 161.150  < 2e-16 ***
sex1         -2.4000     0.6888  -3.484  0.00265 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.08 on 18 degrees of freedom
Multiple R-squared:  0.4028,    Adjusted R-squared:  0.3696 
F-statistic: 12.14 on 1 and 18 DF,  p-value: 0.002647

jawdat[c(1,11),]

   jaws sex
1   120   M
11  110   F

model.matrix(lm.effects)[c(1,11),]

   (Intercept) sex1
1            1   -1
11           1    1

modelsummary::modelsummary(list(lm.ref, lm.means, lm.effects),   
                           estimate  = c("estimate"),  gof_map = NA)

	(1)	(2)	(3)
(Intercept)	108.600		111.000
	(0.974)		(0.689)
sexM	4.800	113.400
	(1.378)	(0.974)
sexF		108.600
		(0.974)
sex1			-2.400
			(0.689)

RIKZdat <- RIKZdat |> dplyr::mutate(weekf = as.factor(week))
lm.effects <-lm(Richness ~ NAP + weekf, data = RIKZdat, 
                contrasts = list(weekf = contr.sum))

RIKZdat[c(10, 20, 30, 25),c("Richness", "week", "NAP")]

   Richness week    NAP
10       17    1 -1.334
20        4    2 -0.811
30        4    3  0.766
25        6    4  0.054

model.matrix(lm.effects)[c(10, 20, 30, 25),]

   (Intercept)    NAP weekf1 weekf2 weekf3
10           1 -1.334      1      0      0
20           1 -0.811      0      1      0
30           1  0.766      0      0      1
25           1  0.054     -1     -1     -1

The methods/tools from the last class (getting confidence and prediction intervals for GLS models) are also useful here (and elsewhere)!

Estimate species richness (when NAP = 0) in week 4: \(\hat{\beta}_0 - \hat{\beta}_{weekf1} - \hat{\beta}_{weekf2} - \hat{\beta}_{weekf3}\). Let X = [1 0 -1 -1 -1]. Then:

Estimate: X%*%coef(lm.effects)
SE: sqrt(X%%vcov(lm.effects)%%t(X))

Equations

In general, when writing down an equation to describe a model it is important to:

Be very clear about any (probability) distribution assumptions

\[Y_i = \beta_0 + \beta_1flipperln_i + \beta_2I(sex = male)_i + \epsilon_i\] \[\epsilon_i \sim N(0, \sigma^2)\]

\[Y_i \sim N(\mu_i, \sigma^2)\] \[\mu_i = \beta_0 + \beta_1flipperln_i + \beta_2I(sex = male)_i\]

If I ask for equations to give predictions, that is a little different. In this case, it is reasonable to:

replace \(Y_i\) with \(\hat{Y_i}\) (or \(\widehat{E[Y_i | X_i]}\) or \(\hat{\mu_i}\)), add ^ to the \(\beta\)’s, and drop the \(\epsilon_i\).
it is also reasonable to replace the coefficients with specific values from the regression output

\(\widehat{Body mass}_i = \hat{\beta}_0 + \hat{\beta}_1flipperln_i + \hat{\beta}_2\) (if the prediction is for a male)