
Will draw from: Graham (2003) and the textbook.
Collinearity - when one predictor variable is correlated with another predictor variable.
Multicollinearity - when multiple predictor variables are correlated with each other.
Multicollinearity implies one of the explanatory variables can be predicted by the others with a high degree of accuracy.
Intrinsic collinearity: multiple variables measure the same inherent quantity
Compositional variables: proportions that must sum to 1 (the last category is completely determined by the others)
[Think-pair-share] Do you have similar examples from your study systems?
Interpretation of \(\beta_j\) in a multiple regression: the expected change in the response for a one-unit increase in \(x_j\), holding all other predictors constant. With strong collinearity, "holding all other predictors constant" may be nearly impossible in the data, which is what makes the coefficients hard to interpret.
Multicollinearity can be measured using a variance inflation factor (VIF)
\(VIF(\hat{\beta}_j) = \frac{1}{1-R^2_{x_j|x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_p}}\), where
\(R^2_{x_j|x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_p}\) = multiple \(R^2\) from: lm(\(x_j \sim x_1 + \ldots + x_{j-1} + x_{j+1} + \ldots + x_p\))
If a variable can be predicted from the other variables in the regression model, then it will have a large VIF.
Calculate using the vif function in the car package
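A minimal sketch of the calculation in R (the model, data, and variable names here are hypothetical placeholders):

library(car)
fit <- lm(y ~ x1 + x2 + x3, data = dat)  # some fitted multiple regression
vif(fit)                                 # one VIF per predictor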
Rules of Thumb in Published Literature: cutoffs of VIF > 10 (or, more conservatively, VIF > 5) are commonly used to flag problematic collinearity.
Look at the sleep example from the book.
Truth:
\[Y_i = 10 + 3X_{1,i} + 3X_{2,i} + \epsilon_i \text{ and } X_{2,i} = \tau X_{1,i} + \gamma_i\]
For each value of tau in tau <- seq(0, 9, 3) (i.e., \(\tau \in \{0, 3, 6, 9\}\)), simulated 2000 data sets and to each fit:
lm(Y ~ X1)
lm(Y ~ X1 + X2)
Substituting for \(X_{2,i}\) shows what the model without \(X_2\) actually estimates:
\[Y_i = 10 + 3X_{1,i} + 3(\tau X_{1,i} + \gamma_i) + \epsilon_i\]
\[Y_i = 10 + (3+3\tau)X_{1,i} + (3\gamma_i + \epsilon_i)\]
so the coefficient on \(X_{1,i}\) in lm(Y ~ X1) targets \(3 + 3\tau\), not 3.
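A minimal sketch of one replicate of this simulation (the sample size and error distributions are assumptions; the notes do not state them):

set.seed(1)
n     <- 100                   # assumed sample size
tau   <- 3                     # one of the values in seq(0, 9, 3)
X1    <- rnorm(n)
gamma <- rnorm(n)
X2    <- tau * X1 + gamma
Y     <- 10 + 3 * X1 + 3 * X2 + rnorm(n)
coef(lm(Y ~ X1))       # slope lands near 3 + 3 * tau = 12
coef(lm(Y ~ X1 + X2))  # slopes land near 3 and 3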

Models with collinear variables (coefficient estimates are imprecise; variances are inflated)
Models in which confounding variables are left out (coefficient estimates are biased, as in the derivation above)
When possible, try to eliminate confounding variables via study design (e.g., experiments, matching)
If the only goal is prediction, you may choose to ignore multicollinearity
For estimation, there are methods that introduce some bias in exchange for improved precision, e.g., ridge regression (see the sketch below)
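The notes do not show a ridge fit; a minimal sketch using MASS::lm.ridge on the kelp data (the formula and lambda grid are assumptions for illustration):

library(MASS)
# Ridge regression shrinks coefficients toward zero: a little bias, less variance
ridge.fit <- lm.ridge(Response ~ OD + BD + LTD + W, data = Kelp,
                      lambda = seq(0, 10, by = 0.1))
select(ridge.fit)  # HKB, L-W, and GCV suggestions for the penalty lambda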
Graham (2003) and the textbook also briefly consider other approaches.
Example from Graham (2003):
      OD       BD      LTD        W
2.574934 2.355055 1.175270 2.094319
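Output like this could be produced with something along these lines (the exact formula and object name are assumptions based on the variables shown):

kelp.lm <- lm(Response ~ OD + BD + LTD + W, data = Kelp)
car::vif(kelp.lm)  # one VIF per predictor: OD, BD, LTD, W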
Always look at the relationships among your predictors (without the response variable) as a first step in assessing collinearity!
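In R, a quick first look might be (using the kelp predictors as an example):

pairs(Kelp[, c("OD", "BD", "LTD", "W")])  # scatterplot matrix, predictors only
cor(Kelp[, c("OD", "BD", "LTD", "W")])    # pairwise correlations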
Prioritize different variables to include sequentially:
How to Prioritize?
Graham considered (newly formed) predictors in this order: OD, then W.g.OD, then LTD.g.W.OD, then BD.g.W.OD.LTD. (A name like W.g.OD denotes the part of W not explained by OD, i.e., the residuals from regressing W on OD, so each new predictor is uncorrelated with those entered before it.)
Call:
lm(formula = Response ~ OD + W.g.OD + LTD.g.W.OD + BD.g.W.OD.LTD,
    data = Kelp)

Residuals:
      Min        1Q    Median        3Q       Max
-0.284911 -0.098861 -0.002388  0.099031  0.301931

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     2.747588   0.078192  35.139  < 2e-16 ***
OD              0.194243   0.028877   6.726 1.16e-07 ***
W.g.OD          0.008082   0.003953   2.045   0.0489 *
LTD.g.W.OD     -0.055333   0.141350  -0.391   0.6980
BD.g.W.OD.LTD  -0.004295   0.021137  -0.203   0.8402
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1431 on 33 degrees of freedom
Multiple R-squared:  0.6006,    Adjusted R-squared:  0.5522
F-statistic: 12.41 on 4 and 33 DF,  p-value: 2.893e-06
# Original model
seq.lm <- lm(Response ~ OD + W.g.OD + LTD.g.W.OD + BD.g.W.OD.LTD, data = Kelp)
summary(seq.lm)$coef
                   Estimate  Std. Error    t value     Pr(>|t|)
(Intercept)     2.747587774  0.07819170 35.1391212 1.000393e-27
OD              0.194243475  0.02887749  6.7264678 1.156387e-07
W.g.OD          0.008082141  0.00395292  2.0446003 4.893874e-02
LTD.g.W.OD     -0.055333263  0.14135008 -0.3914626 6.979717e-01
BD.g.W.OD.LTD  -0.004294572  0.02113731 -0.2031750 8.402459e-01
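The construction of the residualized predictors is not shown in the notes; a sketch of how they could be formed (the naming is inferred from the model formula above):

Kelp$W.g.OD        <- resid(lm(W ~ OD, data = Kelp))            # W given OD
Kelp$LTD.g.W.OD    <- resid(lm(LTD ~ W + OD, data = Kelp))      # LTD given W, OD
Kelp$BD.g.W.OD.LTD <- resid(lm(BD ~ W + OD + LTD, data = Kelp)) # BD given W, OD, LTD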
Advantages:
The new predictors are uncorrelated with one another, so the collinearity is eliminated.
Each coefficient reflects a variable's contribution beyond the variables entered before it.
Disadvantages:
The results depend on the chosen ordering of the predictors.
The residualized predictors can be harder to interpret than the original variables.
Form new predictors as linear combinations of the correlated variables:
\(pca_1 = \lambda_{1,1}X_1 + \lambda_{1,2}X_2 + \ldots + \lambda_{1,p}X_p\)
\(pca_2 = \lambda_{2,1}X_1 + \lambda_{2,2}X_2 + \ldots + \lambda_{2,p}X_p\)
\(\vdots\)
\(pca_p = \lambda_{p,1}X_1 + \lambda_{p,2}X_2 + \ldots + \lambda_{p,p}X_p\), where the loadings \(\lambda_{i,j}\) are chosen so that the resulting components are uncorrelated with one another.
           PC1        PC2         PC3         PC4
OD   0.5479919 -0.2901058 -0.15915149 -0.76825404
BD   0.5453470 -0.1793692 -0.58088137  0.57706165
LTD -0.3384653 -0.9335391  0.06706729  0.09720099
W    0.5364166 -0.1103180  0.79545560  0.25949479
      OD   BD   LTD    W         PC1        PC2         PC3         PC4
1 2.0176 4.87 -0.59 -4.1 -0.19127827  1.7527358 -0.66278941  0.24694830
2 1.9553 4.78 -0.75  4.7  0.62234092  2.5023873  0.18091063  0.46900655
3 1.8131 3.14 -0.38 -4.9 -1.33268779  0.9190480 -0.03361542 -0.05590063
4 2.5751 3.28 -0.16 -3.2 -1.08056344 -0.5416139  0.01891911 -0.55322453
5 2.2589 3.28  0.01  5.6 -1.03524778 -1.4381622  1.00570204  0.11858908
6 2.5448 4.87 -0.19  4.1 -0.05452203 -0.6398905  0.18695547  0.22937274
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.6017 0.8975 0.60895 0.50822
Proportion of Variance 0.6413 0.2014 0.09271 0.06457
Cumulative Proportion  0.6413 0.8427 0.93543 1.00000
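The loadings, scores, and variance summary shown above are consistent with a PCA of the standardized predictors; a sketch of how they could be obtained (the call itself does not appear in the notes):

kelp.pca <- prcomp(Kelp[, c("OD", "BD", "LTD", "W")], scale. = TRUE)
kelp.pca$rotation               # the loadings (lambda's)
Kelp <- cbind(Kelp, kelp.pca$x) # append PC scores as new predictor columns
summary(kelp.pca)               # importance of components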
The first principal component explains 64% of the variation in (OD, BD, LTD, W)
Choose one or more \(pca_i\) to include as new regressors (Graham 2003 suggests including all of them).
Call:
lm(formula = Response ~ PC1 + PC2 + PC3 + PC4, data = Kelp)

Residuals:
      Min        1Q    Median        3Q       Max
-0.284911 -0.098861 -0.002388  0.099031  0.301931

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.24984    0.02321 140.035  < 2e-16 ***
PC1          0.09806    0.01468   6.678 1.33e-07 ***
PC2         -0.02971    0.02620  -1.134    0.265
PC3          0.03612    0.03862   0.935    0.356
PC4         -0.07826    0.04628  -1.691    0.100
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1431 on 33 degrees of freedom
Multiple R-squared:  0.6006,    Adjusted R-squared:  0.5522
F-statistic: 12.41 on 4 and 33 DF,  p-value: 2.893e-06
The main disadvantage is that the principal components can be difficult to interpret. (Note that the residuals, \(R^2\), and F-statistic match the sequential model exactly: both models use the same four predictors, just re-expressed as different linear combinations.)
Options:
“The suite of techniques described herein complement each other and offer ecologists useful alternatives to standard multiple regression for identifying ecologically relevant patterns in collinear data. Each comes with its own set of benefits and limitations, yet together they allow ecologists to directly address the nature of shared variance contributions in ecological data.”