library(tidyverse)
library(mgcv)
library(gamair)Generalized Additive Model (GAM)
Quick Introduction to GAM using mgcv
Prepare environment
- open new R project
- find and select a folder (that contains the data)
- load the packages
- tidyverse for data management
- mgcv for generalized additive model
- gamair to get access to gam codes and dataset
Data 1
- Data
chicago
data(chicago)Explore data
- outcome is death (count)
- median of pm10
- median of pm2.5
- median of ozone
- time
- temperature
summary(chicago) death pm10median pm25median o3median
Min. : 69.0 Min. :-37.3761 Min. :-16.426 Min. :-24.779
1st Qu.:105.0 1st Qu.:-13.1082 1st Qu.: -6.588 1st Qu.:-10.232
Median :114.0 Median : -3.5391 Median : -1.326 Median : -3.326
Mean :115.4 Mean : -0.1464 Mean : 0.243 Mean : -2.179
3rd Qu.:124.0 3rd Qu.: 8.3029 3rd Qu.: 5.344 3rd Qu.: 4.468
Max. :411.0 Max. :320.7248 Max. : 38.150 Max. : 43.688
NA's :251 NA's :4387
so2median time tmpd
Min. :-8.2061 Min. :-2556 Min. :-16.00
1st Qu.:-2.6894 1st Qu.:-1278 1st Qu.: 35.00
Median :-1.2183 Median : 0 Median : 51.00
Mean :-0.6361 Mean : 0 Mean : 50.19
3rd Qu.: 0.8316 3rd Qu.: 1278 3rd Qu.: 67.00
Max. :28.9034 Max. : 2556 Max. : 92.00
NA's :27
chicago |>
ggplot(aes(x = time, y = death)) +
geom_line(linewidth = 0.25) +
geom_point(size = 0.25)Analysis with GAM
- only one variable with smooth function
ap0 <- gam(death ~ s(time, bs="cr", k = 200) +
pm10median + so2median+ o3median + tmpd,
data = chicago, family = poisson)- the observed numbers of deaths are Poisson random variables
- an underlying mean that is the product of a basic, time varying, death rate
- modified through multiplication by pollution dependent effects.
- A cubic regression spline has been used for
fto speed up computation a little
Results
summary(ap0)
Family: poisson
Link function: log
Formula:
death ~ s(time, bs = "cr", k = 200) + pm10median + so2median +
o3median + tmpd
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.7006394 0.0091690 512.666 < 2e-16 ***
pm10median 0.0004321 0.0000889 4.861 1.17e-06 ***
so2median 0.0008184 0.0005523 1.482 0.138
o3median 0.0009306 0.0002032 4.581 4.63e-06 ***
tmpd 0.0009318 0.0001784 5.223 1.76e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(time) 168.6 188 2090 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.351 Deviance explained = 38.7%
UBRE = 0.25467 Scale est. = 1 n = 4841
gam.check(ap0)
Method: UBRE Optimizer: outer newton
full convergence after 3 iterations.
Gradient range [3.514596e-08,3.514596e-08]
(score 0.2546689 & scale 1).
Hessian positive definite, eigenvalue range [0.004247567,0.004247567].
Model rank = 204 / 204
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(time) 199 169 0.92 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Plots
plot(ap0, n = 1000) # n increased to make plot smoothplot(ap0, residuals = TRUE, n = 1000)- and four gross outliers, in close proximity to each other, are clearly visible.
- examination of the data indicates that the outliers are the four highest daily death rates occurring in the data
- they occurred on consecutive days,
Adding more smooth functions
- some non-linear response of death rate to temperature and ozone is required
- replacing the linear dependencies on the air quality covariates with smooth functions
- so that the model structure becomes
ap1 <- gam(death ~ s(time, bs = "cr", k = 200) +
s(pm10median, bs = "cr") +
s(so2median, bs = "cr") +
s(o3median, bs = "cr") +
s(tmpd, bs = "cr"),
data = chicago, family = poisson)summary(ap1)
Family: poisson
Link function: log
Formula:
death ~ s(time, bs = "cr", k = 200) + s(pm10median, bs = "cr") +
s(so2median, bs = "cr") + s(o3median, bs = "cr") + s(tmpd,
bs = "cr")
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.744645 0.001342 3534 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(time) 167.933 187.535 1827.171 <2e-16 ***
s(pm10median) 6.863 7.695 14.480 0.0526 .
s(so2median) 7.382 8.141 9.221 0.3415
s(o3median) 1.579 1.985 1.916 0.3493
s(tmpd) 8.270 8.850 105.360 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.361 Deviance explained = 39.8%
UBRE = 0.24107 Scale est. = 1 n = 4841
Plots
plot(ap1, n = 1000) # n increased to make plot smoothplot(ap1, residuals = TRUE, n = 1000)Data 2
- let’s say we have a simulated data with only
- outcome is admission
- a predictor which is
pm2.5
Data with one predictor
glimpse(data_simple)Rows: 1,000
Columns: 2
$ pm25 <dbl> 28.757752, 78.830514, 40.897692, 88.301740, 94.046728, 4.55…
$ admissions <dbl> 20, 60, 36, 79, 75, 11, 35, 66, 42, 34, 84, 35, 58, 45, 11,…
Data with three predictors
- outcome is admission (count)
- predictors are
- Particulate Matter 2.5
PM2.5 - Temperature
temp - Relative Humidity
rh
- Particulate Matter 2.5
data_complex |>
View()glimpse(data_complex)Rows: 1,000
Columns: 4
$ pm25 <dbl> 28.757752, 78.830514, 40.897692, 88.301740, 94.046728, 4.55…
$ temp <dbl> 15.89507, 18.46371, 15.48951, 23.13534, 25.60178, 30.63607,…
$ rh <dbl> 44.91779, 89.36779, 73.02731, 69.10370, 44.43337, 35.20757,…
$ admissions <dbl> 34, 90, 40, 104, 129, 22, 65, 99, 71, 55, 118, 48, 72, 77, …
GAM with one predictor
- Estimate effect of PM2.5
# Fit simple GAM model
model_simple <- gam(admissions ~ s(pm25),
family = poisson(), # for count data
data = data_simple)
summary(model_simple)
Family: poisson
Link function: log
Formula:
admissions ~ s(pm25)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.547039 0.005907 600.4 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(pm25) 3.295 4.093 11246 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.955 Deviance explained = 93.1%
UBRE = -0.023647 Scale est. = 1 n = 1000
- Used
s()smooth function for PM2.5 - Poisson family as we’re modeling count data
Arguments explained:
s() creates smooth term family = poisson() specifies distribution family data is input dataset
Predict mortality
- values for predictions
# Generate predictions
pm25_seq <- seq(min(pm25), max(pm25), length.out = 100)- make predictions
pred <- predict(model_simple,
newdata = data.frame(pm25 = pm25_seq),
se.fit = TRUE)Plot results
- smooth term
# Plot smooth term
plot(model_simple, shade = TRUE,
main = "Smooth function of PM2.5")
# Add observed data points
points(data_simple$pm25,
data_simple$admissions, col = "black", pch = 20)Model checking
# Model checking
gam.check(model_simple)
Method: UBRE Optimizer: outer newton
full convergence after 4 iterations.
Gradient range [7.028009e-08,7.028009e-08]
(score -0.02364664 & scale 1).
Hessian positive definite, eigenvalue range [0.002102548,0.002102548].
Model rank = 10 / 10
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(pm25) 9.0 3.3 1.01 0.66
Compare with glm
- glm model
glm_simple <- glm(admissions ~ pm25, family = poisson(), data = data_simple)- compare
anova(glm_simple, model_simple, test = "Chisq")Analysis of Deviance Table
Model 1: admissions ~ pm25
Model 2: admissions ~ s(pm25)
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 998.0 1118.08
2 995.7 967.76 2.295 150.32 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
GAM with four predictors
# Fit complex GAM model with interaction
model_complex <- gam(admissions ~ s(pm25) + s(temp) + s(rh) +
ti(pm25, temp), # tensor product interaction
family = poisson(),
data = data_complex)- Added multiple smooth terms
s()for each predictor - Included tensor product interaction
ti()between PM2.5 and temperature
Key arguments:
ti()is the tensor product interaction- Multiple
s()terms for each predictor
Results
# Model summary
summary(model_complex)
Family: poisson
Link function: log
Formula:
admissions ~ s(pm25) + s(temp) + s(rh) + ti(pm25, temp)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.990664 0.004603 866.9 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(pm25) 3.343 4.149 12446.74 <2e-16 ***
s(temp) 1.768 2.249 154.46 <2e-16 ***
s(rh) 1.492 1.831 40.49 <2e-16 ***
ti(pm25,temp) 1.342 1.602 0.27 0.734
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.971 Deviance explained = 96.3%
UBRE = -0.44909 Scale est. = 1 n = 1000
# Plot results for complex model
plot(model_complex, pages = 1)Model checking
# For complex model
gam.check(model_complex)
Method: UBRE Optimizer: outer newton
full convergence after 7 iterations.
Gradient range [-2.15315e-07,1.51255e-08]
(score -0.4490946 & scale 1).
Hessian positive definite, eigenvalue range [2.143967e-07,0.001694236].
Model rank = 44 / 44
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(pm25) 9.00 3.34 0.95 0.06 .
s(temp) 9.00 1.77 1.03 0.73
s(rh) 9.00 1.49 1.00 0.53
ti(pm25,temp) 16.00 1.34 1.02 0.82
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
gam.check() provides:
- Q-Q plot of residuals
- Residuals vs linear predictor
- Histogram of residuals
- Response vs fitted values