Understanding Generalized Additive Models (GAMs)

A Practical Introduction with R

What are GAMs?

Generalized Additive Models (GAMs) are an extension of Generalized Linear Models (GLMs) that:

Allow for non-linear relationships between predictors and response
Maintain interpretability of effects
Automatically determine the optimal amount of smoothing
Combine the flexibility of non-linear models with interpretability of linear models

Why Use GAMs?

Real-world relationships are often non-linear
More flexible than linear models but more interpretable than black-box models
Excellent for:
- Environmental data
- Time series analysis
- Spatial modeling
- Any data with smooth underlying patterns

Key Advantages

Statistical Benefits:

Automatically handles non-linear relationships
Controls overfitting through smoothing
Maintains interpretability

Practical Benefits:

Visualization of smooth effects
Confidence intervals included
Works with various response types

Statistical Framework of GAMs

Basic form: \[g(μᵢ) = β₀ + f₁(x₁ᵢ) + f₂(x₂ᵢ) + ... + fₖ(xₖᵢ)\]

Where:

\(μᵢ = E(Yᵢ)\) is the expected value
\(g()\) is the link function
\(fⱼ\) are smooth functions
Response \(Y\) follows exponential family distribution

Let’s Code: Setup and Data

# Load required packages
library(mgcv)
library(tidyverse)
library(lubridate)
library(gamair)

Load Chicago air pollution data

Load chicago data

data(chicago)

First 6 observations

chicago |> 
  slice_head(n = 6)

  death pm10median pm25median  o3median  so2median    time tmpd
1   130 -7.4335443         NA -19.59234  1.9280426 -2556.5 31.5
2   150         NA         NA -19.03861 -0.9855631 -2555.5 33.0
3   101 -0.8265306         NA -20.21734 -1.8914161 -2554.5 33.0
4   135  5.5664557         NA -19.67567  6.1393413 -2553.5 29.0
5   126         NA         NA -19.21734  2.2784649 -2552.5 32.0
6   130  6.5664557         NA -17.63400  9.8585839 -2551.5 40.0

Fitting a Basic GAM

name the model as model1
use gam function
smooth function for temperature and median pm 10
outcome is death which is count (integer)
hence we assume death follows Poisson distribution given the predictors

# Fit GAM with smooth terms for tmpd and pm10median
model1 <- gam(death ~ s(tmpd) + s(pm10median),
              data = chicago,
              family = poisson())

Smooth Functions in mgcv

Common basis types:

tp: Thin plate regression splines
- Default in mgcv
- Good for most situations
- Computationally intensive for large datasets

Smooth Functions in mgcv

cr: Cubic regression splines
- Knots at evenly spaced locations
- Computationally efficient
- Good for one-dimensional smooths

Smooth Functions in mgcv

bs: B-splines
- Local support
- Good for cyclic patterns
- Efficient for large datasets

Specifying Basis Functions - thin plate

# Using thin plate regression splines (default)
s(x, bs = "tp", k = 10)

$term
[1] "x"

$bs.dim
[1] 10

$fixed
[1] FALSE

$dim
[1] 1

$p.order
[1] NA

$by
[1] "NA"

$label
[1] "s(x)"

$xt
NULL

$id
NULL

$sp
NULL

attr(,"class")
[1] "tp.smooth.spec"

Specifying Basis Functions - cubic regression spline

# Using cubic regression splines
s(x, bs = "cr", k = 10)

$term
[1] "x"

$bs.dim
[1] 10

$fixed
[1] FALSE

$dim
[1] 1

$p.order
[1] NA

$by
[1] "NA"

$label
[1] "s(x)"

$xt
NULL

$id
NULL

$sp
NULL

attr(,"class")
[1] "cr.smooth.spec"

Specifying Basis Functions - B-spline

# Using B-splines
s(x, bs = "bs", k = 10)

$term
[1] "x"

$bs.dim
[1] 10

$fixed
[1] FALSE

$dim
[1] 1

$p.order
[1] NA

$by
[1] "NA"

$label
[1] "s(x)"

$xt
NULL

$id
NULL

$sp
NULL

attr(,"class")
[1] "bs.smooth.spec"

Results of gam

summary(model1)


Family: poisson 
Link function: log 

Formula:
death ~ s(tmpd) + s(pm10median)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 4.746282   0.001337    3549   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                edf Ref.df Chi.sq p-value    
s(tmpd)       8.343  8.852 1789.7  <2e-16 ***
s(pm10median) 7.822  8.616  134.2  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.181   Deviance explained = 19.3%
UBRE = 0.56431  Scale est. = 1         n = 4863

Understanding Summary Output

Key components:

Parametric coefficients: - Intercept and linear terms
Smooth terms:
- edf: Effective degrees of freedom
- Ref.df: Reference degrees of freedom
- F-statistic and p-value

Visualizing Smooth Terms

par(mfrow = c(1, 2))
# Plot smooth effects
plot(model1, pages = 1)

Checking Model Assumptions

par(mfrow = c(2, 2))
# Model diagnostics
gam.check(model1)


Method: UBRE   Optimizer: outer newton
full convergence after 6 iterations.
Gradient range [2.524847e-07,2.604434e-06]
(score 0.5643101 & scale 1).
Hessian positive definite, eigenvalue range [0.0001783152,0.0002205525].
Model rank =  19 / 19 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

                k'  edf k-index p-value    
s(tmpd)       9.00 8.34    0.93  <2e-16 ***
s(pm10median) 9.00 7.82    1.01    0.78    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adding Time Trend

add smooth function for time

# More complex model with time trend
model2 <- gam(death ~ s(tmpd) + s(pm10median) + s(time),
              data = chicago,
              family = poisson())

Visualizing results

Using gratia for enhanced visualization

library(gratia)
draw(model2)

Visualizing Time Effect

# Plot the time trend
plot(model2, select = 3)  # select=3 shows the time smooth

Interpreting Results

temperature Effect:
- Non-linear relationship with mortality
- Higher mortality at tmpd extremes
- Optimal tmpd range visible

Interpreting Results

pm10median Effect:
- Generally positive association
- Increasing pm10median associated with higher mortality
- Relationship flattens at higher levels

Interpreting Results

Time Trend:
- Captures long-term patterns
- Accounts for seasonal variations

Model Diagnostics Show:

Residuals:
- Generally well-behaved
- No obvious patterns
- QQ-plot shows good fit

Model Diagnostics Show:

Smoothing Parameters:
- Appropriate degrees of freedom
- No over-smoothing detected

Best Practices

Start with exploratory data analysis
Begin with simple models
Check model assumptions
Compare different smoothing bases
Validate basis dimensions
Consider interactions if needed
Use visualization for interpretation

Additional Resources

Wood, S.N. (2017) Generalized Additive Models: An Introduction with R
mgcv package documentation
gratia package vignettes
Lots of examples on Stack Overflow

Questions?

Thank you for your attention!

Contact: drkamarul@usm.my