Understanding Generalized Additive Models (GAMs)

A Practical Introduction with R

What are GAMs?

Generalized Additive Models (GAMs) are an extension of Generalized Linear Models (GLMs) that:

  • Allow for non-linear relationships between predictors and response
  • Maintain interpretability of effects
  • Automatically determine the optimal amount of smoothing
  • Combine the flexibility of non-linear models with interpretability of linear models

Why Use GAMs?

  • Real-world relationships are often non-linear
  • More flexible than linear models but more interpretable than black-box models
  • Excellent for:
    • Environmental data
    • Time series analysis
    • Spatial modeling
    • Any data with smooth underlying patterns

Key Advantages

Statistical Benefits:

  • Automatically handles non-linear relationships
  • Controls overfitting through smoothing
  • Maintains interpretability

Practical Benefits:

  • Visualization of smooth effects
  • Confidence intervals included
  • Works with various response types

Statistical Framework of GAMs

Basic form: \[g(μᵢ) = β₀ + f₁(x₁ᵢ) + f₂(x₂ᵢ) + ... + fₖ(xₖᵢ)\]

Where:

  • \(μᵢ = E(Yᵢ)\) is the expected value
  • \(g()\) is the link function
  • \(fⱼ\) are smooth functions
  • Response \(Y\) follows exponential family distribution

Let’s Code: Setup and Data

# Load required packages
library(mgcv)
library(tidyverse)
library(lubridate)
library(gamair)

Load Chicago air pollution data

  • Load chicago data
data(chicago)

First 6 observations

chicago |> 
  slice_head(n = 6)
  death pm10median pm25median  o3median  so2median    time tmpd
1   130 -7.4335443         NA -19.59234  1.9280426 -2556.5 31.5
2   150         NA         NA -19.03861 -0.9855631 -2555.5 33.0
3   101 -0.8265306         NA -20.21734 -1.8914161 -2554.5 33.0
4   135  5.5664557         NA -19.67567  6.1393413 -2553.5 29.0
5   126         NA         NA -19.21734  2.2784649 -2552.5 32.0
6   130  6.5664557         NA -17.63400  9.8585839 -2551.5 40.0

Fitting a Basic GAM

  • name the model as model1
  • use gam function
  • smooth function for temperature and median pm 10
  • outcome is death which is count (integer)
  • hence we assume death follows Poisson distribution given the predictors
# Fit GAM with smooth terms for tmpd and pm10median
model1 <- gam(death ~ s(tmpd) + s(pm10median),
              data = chicago,
              family = poisson())

Smooth Functions in mgcv

Common basis types:

  1. tp: Thin plate regression splines
    • Default in mgcv
    • Good for most situations
    • Computationally intensive for large datasets

Smooth Functions in mgcv

  1. cr: Cubic regression splines
    • Knots at evenly spaced locations
    • Computationally efficient
    • Good for one-dimensional smooths

Smooth Functions in mgcv

  1. bs: B-splines
    • Local support
    • Good for cyclic patterns
    • Efficient for large datasets

Specifying Basis Functions - thin plate

# Using thin plate regression splines (default)
s(x, bs = "tp", k = 10)
$term
[1] "x"

$bs.dim
[1] 10

$fixed
[1] FALSE

$dim
[1] 1

$p.order
[1] NA

$by
[1] "NA"

$label
[1] "s(x)"

$xt
NULL

$id
NULL

$sp
NULL

attr(,"class")
[1] "tp.smooth.spec"

Specifying Basis Functions - cubic regression spline

# Using cubic regression splines
s(x, bs = "cr", k = 10)
$term
[1] "x"

$bs.dim
[1] 10

$fixed
[1] FALSE

$dim
[1] 1

$p.order
[1] NA

$by
[1] "NA"

$label
[1] "s(x)"

$xt
NULL

$id
NULL

$sp
NULL

attr(,"class")
[1] "cr.smooth.spec"

Specifying Basis Functions - B-spline

# Using B-splines
s(x, bs = "bs", k = 10)
$term
[1] "x"

$bs.dim
[1] 10

$fixed
[1] FALSE

$dim
[1] 1

$p.order
[1] NA

$by
[1] "NA"

$label
[1] "s(x)"

$xt
NULL

$id
NULL

$sp
NULL

attr(,"class")
[1] "bs.smooth.spec"

Results of gam

summary(model1)

Family: poisson 
Link function: log 

Formula:
death ~ s(tmpd) + s(pm10median)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 4.746282   0.001337    3549   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                edf Ref.df Chi.sq p-value    
s(tmpd)       8.343  8.852 1789.7  <2e-16 ***
s(pm10median) 7.822  8.616  134.2  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.181   Deviance explained = 19.3%
UBRE = 0.56431  Scale est. = 1         n = 4863

Understanding Summary Output

Key components:

  1. Parametric coefficients: - Intercept and linear terms

  2. Smooth terms:

    • edf: Effective degrees of freedom
    • Ref.df: Reference degrees of freedom
    • F-statistic and p-value

Visualizing Smooth Terms

par(mfrow = c(1, 2))
# Plot smooth effects
plot(model1, pages = 1)

Checking Model Assumptions

par(mfrow = c(2, 2))
# Model diagnostics
gam.check(model1)

Method: UBRE   Optimizer: outer newton
full convergence after 6 iterations.
Gradient range [2.524847e-07,2.604434e-06]
(score 0.5643101 & scale 1).
Hessian positive definite, eigenvalue range [0.0001783152,0.0002205525].
Model rank =  19 / 19 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

                k'  edf k-index p-value    
s(tmpd)       9.00 8.34    0.93  <2e-16 ***
s(pm10median) 9.00 7.82    1.01    0.78    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adding Time Trend

  • add smooth function for time
# More complex model with time trend
model2 <- gam(death ~ s(tmpd) + s(pm10median) + s(time),
              data = chicago,
              family = poisson())

Visualizing results

  • Using gratia for enhanced visualization
library(gratia)
draw(model2)

Visualizing Time Effect

# Plot the time trend
plot(model2, select = 3)  # select=3 shows the time smooth

Interpreting Results

  1. temperature Effect:
    • Non-linear relationship with mortality
    • Higher mortality at tmpd extremes
    • Optimal tmpd range visible

Interpreting Results

  1. pm10median Effect:
    • Generally positive association
    • Increasing pm10median associated with higher mortality
    • Relationship flattens at higher levels

Interpreting Results

  1. Time Trend:
    • Captures long-term patterns
    • Accounts for seasonal variations

Model Diagnostics Show:

  1. Residuals:
    • Generally well-behaved
    • No obvious patterns
    • QQ-plot shows good fit

Model Diagnostics Show:

  1. Smoothing Parameters:
    • Appropriate degrees of freedom
    • No over-smoothing detected

Best Practices

  1. Start with exploratory data analysis
  2. Begin with simple models
  3. Check model assumptions
  4. Compare different smoothing bases
  5. Validate basis dimensions
  6. Consider interactions if needed
  7. Use visualization for interpretation

Additional Resources

  • Wood, S.N. (2017) Generalized Additive Models: An Introduction with R
  • mgcv package documentation
  • gratia package vignettes
  • Lots of examples on Stack Overflow

Questions?

Thank you for your attention!

Contact: drkamarul@usm.my