
Learning objectives:
Be able to implement common approaches for modeling non-linear relationships between \(X_i\) and \(Y_i\)
poly function in Rns function (splines library)



So far, we have focused on linear models of the form:
\(Y_i = \beta_0 + X_i\beta + \epsilon_i\)
or
\(Y_i = \beta_0 + X_{i,1}\beta_1 + X_{i,2}\beta_2 + \ldots + \epsilon_i\)
The model can be written as a “linear combination” of parameters.
These options still lead to linear models:
\(Y_i = \beta_0 + X_i\beta_1 + X^2_i\beta_2+ \ldots + \epsilon_i\)
\(\sqrt{Y_i} = \beta_0 + log(X_i)\beta_1 + \ldots + \epsilon_i\)
So, we can use all the same tools we’ve learned about (e.g., residual plots, t-tests, F-tests, AIC, etc) [note: try writing out the above models in matrix notation!]
Plant species richness for 29 islands in the Galapagos Islands archipelago (Johnson and Raven 1973)
Call:
lm(formula = Species ~ logarea + logarea.squared, data = gala)
Residuals:
Min 1Q Median 3Q Max
-151.009 -27.361 -1.033 20.825 178.805
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.1530 14.5607 0.972 0.340010
logarea 12.6226 4.8614 2.596 0.015293 *
logarea.squared 3.5641 0.9445 3.773 0.000842 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 59.88 on 26 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7338
F-statistic: 39.6 on 2 and 26 DF, p-value: 1.285e-08
Call:
lm(formula = Species ~ poly(logarea, 2, raw = TRUE), data = gala)
Residuals:
Min 1Q Median 3Q Max
-151.009 -27.361 -1.033 20.825 178.805
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.1530 14.5607 0.972 0.340010
poly(logarea, 2, raw = TRUE)1 12.6226 4.8614 2.596 0.015293 *
poly(logarea, 2, raw = TRUE)2 3.5641 0.9445 3.773 0.000842 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 59.88 on 26 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7338
F-statistic: 39.6 on 2 and 26 DF, p-value: 1.285e-08
Anova Table (Type II tests)
Response: Species
Sum Sq Df F value Pr(>F)
logarea 24175 1 6.7417 0.0152925 *
logarea.squared 51058 1 14.2387 0.0008418 ***
Residuals 93232 26
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A linear model is a model that is linear in the parameters:
\(Y_i = \sum_{j=1}^P \beta_j b_j(X_i) + \epsilon_i\)
The \(b_j(X_i)\) are called basis functions or basis vectors.
\(Y_i = \beta_0 + \beta_2X_i + \beta_3X_i^2 + \ldots + \epsilon_i\)
\(b_j(X_i) = 1, X, X^2, X^3, \ldots\)
\(E[Y_i|X_i] = \beta_01 + \beta_2X_i + \beta_3X_i^2\)
\(E[Y | X]\) is given by a linear combination of a horizontal line (1), a line through the origin (\(X\)), a quadratic centered on the origin (\(X^2\)).
\(Species_i = 14.15 + 12.62X_i + 3.56X^2_i\)
A polynomial of degree D is a function formed by linear combinations of the powers of its argument up to D:
\(y = \beta_0 + \beta_1x + \beta_2x^2 + \ldots + \beta_Dx^D\)
Specific polynomials:
The design matrix for a regression model with \(n\) observations and \(p\) predictors is the matrix with \(n\) rows and \(p\) columns such that the value of the \(j^{th}\) predictor for the \(i^{th}\) observation is located in column \(j\) of row \(i\).
Design matrix for a polynomial of degree D
\(\begin{bmatrix} 1 & x_1 & x_1^2 & x_1^3 & ... & x_1^D \\ 1 & x_2 & x_2^2 & x_2^3 & ... & x_2^D \\ 1 & x_3 & x_3^2 & x_3^3 & ... & x_3^D \\ & & \vdots & & \\ 1 & x_n & x_n^2 & x_n^3 & ... & x_n^D \\ \end{bmatrix}\)
Standard polynomials can cause numerical issues due to differences in scale:
\(X = 100\) \(X^3 = 1,000,000\)
Centering and scaling \(X\) can help.
Alternatively, we can use “orthogonal polynomials” created using poly(raw=FALSE) (the default). See Section 4.10 in the book.
Linear models are often a good approximation over small ranges of \(x\).
Splines are piecewise polynomials used in curve fitting.
A linear spline is a continuous function formed by connecting linear segments. The points where the segments connect are called the knots of the spline.
Call:
lm(formula = Species ~ logarea + logarea.1 + logarea.4.2, data = gala)
Residuals:
Min 1Q Median 3Q Max
-160.691 -16.547 -4.209 13.133 166.430
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.869 17.384 1.373 0.1819
logarea 5.213 8.956 0.582 0.5658
logarea.1 17.464 18.836 0.927 0.3627
logarea.4.2 44.815 23.156 1.935 0.0643 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 58.97 on 25 degrees of freedom
Multiple R-squared: 0.7695, Adjusted R-squared: 0.7418
F-statistic: 27.82 on 3 and 25 DF, p-value: 3.934e-08
A spline of degree D is a function formed by connecting polynomial segments of degree D so that:
Linear splines (D = 1): first derivative is not constant (can go from increasing to decreasing at a knot)
Fits a cubic polynomial on segments of the data
D-1 = 2 continuous derivatives everywhere (even at the knot locations)
The truncated polynomial of degree D associated with a knot \(\xi_k\) is the function which is equal to 0 to the left of \(\xi_k\) and equal to \((x - \xi_k)^D\) to the right of \(\xi_k\).
\((x - \xi_k)^D_+ = \left\{ \begin{array}{ll} 0 & \mbox{if } x < \xi_k\\ (x - \xi_k)^D & \mbox{if } x \ge \xi_k \end{array} \right.\)
The equation for a spline of degree D with K knots is:
\(y = \beta_0 +\sum_{d=1}^D \beta_Dx^d + \sum_{k=1}^K b_k(x - \xi_k)^D_+\)
The design matrix for a cubic spline with K knots is the n by 1 + 3 + K matrix with entries:
\(\begin{bmatrix} 1 & x_1 & x_1^2 & x_1^3 & (x_1 - \xi_1)^3_+ & ... & (x_1 - \xi_k)^3_+\\ 1 & x_2 & x_2^2 & x_2^3 & (x_2 - \xi_1)^3_+ & ... & (x_2 - \xi_k)^3_+\\ 1 & x_3 & x_3^2 & x_3^3 & (x_3 - \xi_1)^3_+ & ... & (x_3 - \xi_k)^3_+\\ & & \vdots & & \\ 1 & x_n & x_n^2 & x_n^3 & (x_n - \xi_1)^3_+ & ... & (x_n - \xi_k)^3_+\\ \end{bmatrix}\)
Truncated power basis:
Bsplines ( in splines package)
Natural or restricted cubic splines ( in splines package; in rcs package)
Call:
lm(formula = Species ~ ns(logarea, df = 3), data = gala)
Residuals:
Min 1Q Median 3Q Max
-156.173 -13.819 -5.998 13.922 170.555
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.468 43.542 0.034 0.9734
ns(logarea, df = 3)1 47.790 45.957 1.040 0.3084
ns(logarea, df = 3)2 276.125 102.146 2.703 0.0122 *
ns(logarea, df = 3)3 381.743 45.084 8.467 8.22e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 59.48 on 25 degrees of freedom
Multiple R-squared: 0.7655, Adjusted R-squared: 0.7374
F-statistic: 27.21 on 3 and 25 DF, p-value: 4.859e-08
a <- ggplot(gala, aes(x = logarea, y = Species)) + geom_point(size = 3) +
geom_smooth(method = "lm",
formula = y ~ ns(x, 3),
se = TRUE) + ggtitle("Spline Model") + ylim(c(-100, 500))
b <- ggplot(gala, aes(x = logarea, y = Species)) + geom_point(size = 3) +
geom_smooth(method = "lm",
formula = y ~ poly(x, 2),
se = TRUE) + ggtitle("Polynomial Model")+ ylim(c(-100, 500))
gridExtra::grid.arrange(a, b, ncol = 2) df AIC
lmfit 3 335.1547
lm.poly1.raw 4 324.4895
lm.sp 5 324.4646
lm.ns 5 324.9600
Any and all approaches fit better than a linear model!
The shape of a spline can be controlled by carefully choosing the number of knots and their exact locations in order to:
Could in principle compare models (e.g., using AIC) that have varying numbers of knots, or different knot locations
Choose a small number of knots (df), based on how much data you have and how complex you expect the relationship to be a priori
Choose knot locations based on quantiles (what ns does by default if you do not provide knot locations)
\(E[Y|X] = \beta_0 + f(x_1)\)
where \(f(x_1)\) can be modeled in a variety of ways
Smoothing splines: Use lots of knots, but then attempt to balance overfitting and smoothness by controlling the size of the spline coefficients.
Klappstein, N. J., Michelot, T., Fieberg, J., Pedersen, E. J., & Mills Flemming, J. (2024). Step selection functions with non‐linear and random effects. Methods in ecology and evolution, 15(8), 1332-1346.
What if you want to allow for multiple non-linear relationships?
ns(x1, 3) + ns(x2, 4) or multiple smoothing splines
There are cyclical splines that
ensure ends meet at 0 and 24 hours
(or, Jan 1 and Dec 31).

\(Y \sim f(x,\beta)\), where \(f(x,\beta)\) may have a strong theoretical motivation.
We will eventually learn how to fit these models using Maximum likelihood and Bayesian methods.