Poisson Regression for Count and Rate Data

Kamarul Imran

Wan Nor Arifin

2025-09-19

Prerequisites

Required Packages

Load the necessary packages for our analysis:

# Load required packages using tidy principles
library(tidyverse)  # For data manipulation and visualization
library(gtsummary)  # For creating summary tables
library(haven)      # For reading data files
library(here)
library(broom)      # For tidying model outputs

Part A: Concepts of Poisson Regression

What is Poisson Regression?

Poisson regression is a regression analysis for count and rate data
It’s a type of Generalized Linear Model (GLM)
Models the linear relationship between:
- Outcome: count variable (transformed to natural log scale) or rate (count with offset)
- Predictors: numerical and categorical variables

Count Data

Characteristics of Count Data

Discrete numerical data (0, 1, 2, 3, …)
Examples in public health:
- Number of hospital admissions
- Number of asthmatic attacks per year
- Number of disease cases
- Number of deaths

Why Not Use Linear Regression?

May produce negative predictions (illogical for counts)
Assumes continuous data distribution
Violates assumptions for count data

Rate Data

Characteristics of Rate Data

Count proportional to a denominator
Examples:
- Death rate per 100,000 population
- Disease incidence rate per person-years
- Accident rate per mile driven

Why Use Rates?

Allows fair comparison between different population sizes
Accounts for exposure time or population at risk
Essential for epidemiological studies

Poisson Distribution

Properties

Models discrete events occurring in fixed intervals
Mean equals variance (λ = μ = σ²)
Skewed distribution for small means
Approaches normal for large means

Mathematical Form

For count outcome: \[P(Y = k) = \frac{e^{-\lambda}\lambda^k}{k!}\]

where λ is the rate parameter

Poisson Regression Model

For Count Data

\[ln(count) = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ\]

For Rate Data (with offset)

\[ln(\frac{count}{denominator}) = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ\]

\[ln(count) = ln(denominator) + β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ\]

Interpretation

Coefficients: log rate ratios
exp(β): Rate ratios (RR) or Incidence Rate Ratios (IRR)

Part B: Demonstration Using R

Part B1: Poisson Regression for Count Data

Dataset: Asthma Attacks

Load the asthma dataset
Explore data structure

asthma <- read_csv(here("data", "asthma.csv"))
glimpse(asthma)

Rows: 120
Columns: 4
$ gender  <chr> "female", "male", "male", "female", "male", "male", "female", …
$ res_inf <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes", …
$ ghq12   <dbl> 21, 17, 30, 22, 27, 33, 24, 23, 25, 28, 14, 20, 24, 5, 32, 17,…
$ attack  <dbl> 6, 4, 8, 5, 2, 3, 2, 1, 2, 2, 4, 4, 3, 0, 2, 2, 4, 1, 5, 0, 2,…

Data Exploration

Summary statistics

summary(asthma)

    gender            res_inf              ghq12           attack     
 Length:120         Length:120         Min.   : 0.00   Min.   :0.000  
 Class :character   Class :character   1st Qu.: 7.00   1st Qu.:1.000  
 Mode  :character   Mode  :character   Median :19.00   Median :2.000  
                                       Mean   :16.34   Mean   :2.458  
                                       3rd Qu.:25.00   3rd Qu.:4.000  
                                       Max.   :33.00   Max.   :9.000

Data Exploration (continued)

Descriptive table using gtsummary

asthma |> 
  tbl_summary() |>
  as_gt()

Characteristic	N = 120¹
gender
female	67 (56%)
male	53 (44%)
res_inf	69 (58%)
ghq12	19 (7, 25)
attack	2.00 (1.00, 4.00)
¹ n (%); Median (Q1, Q3)

Step 1: Univariable Analysis

Model for Gender

pois_gender <- glm(attack ~ gender, data = asthma, family = "poisson")
summary(pois_gender)


Call:
glm(formula = attack ~ gender, family = "poisson", data = asthma)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.02105    0.07332  13.925   <2e-16 ***
gendermale  -0.30000    0.12063  -2.487   0.0129 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 229.56  on 119  degrees of freedom
Residual deviance: 223.23  on 118  degrees of freedom
AIC: 500.3

Number of Fisher Scoring iterations: 5

Step 1: Univariable Analysis

Model for Respiratory Infection

pois_res_inf <- glm(attack ~ res_inf, data = asthma, family = "poisson")
summary(pois_res_inf)


Call:
glm(formula = attack ~ res_inf, family = "poisson", data = asthma)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.2877     0.1213   2.372   0.0177 *  
res_infyes    0.9032     0.1382   6.533 6.44e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 229.56  on 119  degrees of freedom
Residual deviance: 180.49  on 118  degrees of freedom
AIC: 457.56

Number of Fisher Scoring iterations: 5

Step 1: Univariable Analysis

Model for GHQ12 Score

pois_ghq12 <- glm(attack ~ ghq12, data = asthma, family = "poisson")
summary(pois_ghq12)


Call:
glm(formula = attack ~ ghq12, family = "poisson", data = asthma)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.230923   0.159128  -1.451    0.147    
ghq12        0.059500   0.006919   8.599   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 229.56  on 119  degrees of freedom
Residual deviance: 145.13  on 118  degrees of freedom
AIC: 422.2

Number of Fisher Scoring iterations: 5

Step 2: Multivariable Model (No Interaction)

pois_multi <- glm(attack ~ res_inf + ghq12, data = asthma, family = "poisson")
summary(pois_multi)


Call:
glm(formula = attack ~ res_inf + ghq12, family = "poisson", data = asthma)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.34051    0.16823  -2.024  0.04296 *  
res_infyes   0.42816    0.15282   2.802  0.00508 ** 
ghq12        0.04989    0.00779   6.404 1.51e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 229.56  on 119  degrees of freedom
Residual deviance: 136.80  on 117  degrees of freedom
AIC: 415.86

Number of Fisher Scoring iterations: 5

Step 3: Multivariable Model with Interaction

pois_interaction <- glm(attack ~ res_inf * ghq12, data = asthma, family = "poisson")
summary(pois_interaction)


Call:
glm(formula = attack ~ res_inf * ghq12, family = "poisson", data = asthma)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -0.63436    0.23408  -2.710  0.00673 ** 
res_infyes        1.01927    0.32822   3.105  0.00190 ** 
ghq12             0.06834    0.01186   5.763 8.29e-09 ***
res_infyes:ghq12 -0.03135    0.01531  -2.048  0.04056 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 229.56  on 119  degrees of freedom
Residual deviance: 132.67  on 116  degrees of freedom
AIC: 413.74

Number of Fisher Scoring iterations: 5

Step 4: Generate Predictions

pois_augmented <- augment(pois_multi, type.predict = "response")
head(pois_augmented)

# A tibble: 6 × 9
  attack res_inf ghq12 .fitted .resid   .hat .sigma .cooksd .std.resid
   <dbl> <chr>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>      <dbl>
1      6 yes        21    3.11  1.45  0.0147   1.08 0.0136       1.46 
2      4 no         17    1.66  1.53  0.0248   1.08 0.0287       1.55 
3      8 yes        30    4.88  1.29  0.0346   1.08 0.0248       1.32 
4      5 yes        22    3.27  0.886 0.0148   1.08 0.00463      0.893
5      2 yes        27    4.20 -1.20  0.0219   1.08 0.00879     -1.21 
6      3 yes        33    5.66 -1.23  0.0571   1.08 0.0268      -1.27

Model Results Presentation

# Present results with IRRs
pois_multi |>
  tbl_regression() |>
  as_gt() |>
  gt::tab_source_note(gt::md("*constant is -0.34*"))

Characteristic	log(IRR)	95% CI	p-value
res_inf
no	—	—
yes	0.43	0.13, 0.73	0.005
ghq12	0.05	0.03, 0.07	<0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio
constant is -0.34

Calculation for fitted value for observation 1 is \(ln(count) = -0.34 + 0.43*1 + 0.05*21\)
expected count \(= \exp{1.14} = 3.13\)

Model Results Presentation

# Present results with IRRs
pois_multi |>
  tbl_regression(exponentiate = TRUE) |>
  as_gt()

Characteristic	IRR	95% CI	p-value
res_inf
no	—	—
yes	1.53	1.14, 2.08	0.005
ghq12	1.05	1.04, 1.07	<0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio

Interpretation of Count Model

Respiratory Infection: Those with recurrent respiratory infection have a rate ratio of 1.53 for asthmatic attacks (95% CI: 1.14, 2.08)
GHQ12 Score: Each one-point increase in GHQ12 score increases the rate of attacks by 5% (RR = 1.05, 95% CI: 1.04, 1.07)

Part B2: Poisson Regression for Rate Data

Dataset: Smoking and Lung Cancer

# Load the smoking dataset
smoke <- read_csv(here("data", "smoke.csv"))

# Explore data structure
glimpse(smoke)

Rows: 63
Columns: 4
$ smoke_yrs  <chr> "15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-4…
$ cigar_day  <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.2, 5.2, 5.2,…
$ person_yrs <dbl> 10366, 8162, 5969, 4496, 3512, 2201, 1421, 1121, 826, 3121,…
$ case       <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 2, 0, 3, 0, 0, 1,…

Rate Data Exploration

# Calculate observed rates
smoke |> 
  mutate(rate = round(case/person_yrs, 4)) |>
  head(10)

# A tibble: 10 × 5
   smoke_yrs cigar_day person_yrs  case   rate
   <chr>         <dbl>      <dbl> <dbl>  <dbl>
 1 15-19           0        10366     1 0.0001
 2 20-24           0         8162     0 0     
 3 25-29           0         5969     0 0     
 4 30-34           0         4496     0 0     
 5 35-39           0         3512     0 0     
 6 40-44           0         2201     0 0     
 7 45-49           0         1421     0 0     
 8 50-54           0         1121     0 0     
 9 55-59           0          826     2 0.0024
10 15-19           5.2       3121     0 0

Step 1: Univariable Analysis with Offset

Model for Cigarettes per Day

pois_cigar <- glm(case ~ cigar_day, data = smoke, family = "poisson", 
                  offset = log(person_yrs))
summary(pois_cigar)


Call:
glm(formula = case ~ cigar_day, family = "poisson", data = smoke, 
    offset = log(person_yrs))

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -8.159268   0.175063  -46.61   <2e-16 ***
cigar_day    0.070359   0.006468   10.88   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 445.10  on 62  degrees of freedom
Residual deviance: 324.71  on 61  degrees of freedom
AIC: 448.55

Number of Fisher Scoring iterations: 6

Step 1: Univariable Analysis with Offset

Model for Smoking Years

pois_smoke_yrs <- glm(case ~ smoke_yrs, data = smoke, family = "poisson", 
                      offset = log(person_yrs))
summary(pois_smoke_yrs)


Call:
glm(formula = case ~ smoke_yrs, family = "poisson", data = smoke, 
    offset = log(person_yrs))

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -10.335      1.000 -10.335  < 2e-16 ***
smoke_yrs20-24    1.117      1.155   0.968 0.333149    
smoke_yrs25-29    1.990      1.080   1.842 0.065423 .  
smoke_yrs30-34    3.563      1.020   3.493 0.000477 ***
smoke_yrs35-39    3.615      1.024   3.532 0.000412 ***
smoke_yrs40-44    4.580      1.013   4.521 6.15e-06 ***
smoke_yrs45-49    4.785      1.016   4.707 2.52e-06 ***
smoke_yrs50-54    5.085      1.020   4.987 6.14e-07 ***
smoke_yrs55-59    5.384      1.024   5.261 1.44e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 445.10  on 62  degrees of freedom
Residual deviance: 170.17  on 54  degrees of freedom
AIC: 308.01

Number of Fisher Scoring iterations: 5

Step 2: Multivariable Model (No Interaction)

pois_rate_multi <- glm(case ~ cigar_day + smoke_yrs, data = smoke, 
                       family = "poisson", offset = log(person_yrs))
summary(pois_rate_multi)


Call:
glm(formula = case ~ cigar_day + smoke_yrs, family = "poisson", 
    data = smoke, offset = log(person_yrs))

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -11.317162   1.007492 -11.233  < 2e-16 ***
cigar_day        0.064361   0.006401  10.054  < 2e-16 ***
smoke_yrs20-24   0.962417   1.154694   0.833  0.40457    
smoke_yrs25-29   1.710894   1.080420   1.584  0.11330    
smoke_yrs30-34   3.207676   1.020378   3.144  0.00167 ** 
smoke_yrs35-39   3.242776   1.024187   3.166  0.00154 ** 
smoke_yrs40-44   4.208361   1.013726   4.151 3.30e-05 ***
smoke_yrs45-49   4.448972   1.017054   4.374 1.22e-05 ***
smoke_yrs50-54   4.893674   1.019945   4.798 1.60e-06 ***
smoke_yrs55-59   5.372600   1.023404   5.250 1.52e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 445.099  on 62  degrees of freedom
Residual deviance:  68.015  on 53  degrees of freedom
AIC: 207.86

Number of Fisher Scoring iterations: 5

Step 3: Multivariable Model with Interaction

pois_rate_interaction <- glm(case ~ cigar_day * smoke_yrs, data = smoke,
                            family = "poisson", offset = log(person_yrs))
summary(pois_rate_interaction)


Call:
glm(formula = case ~ cigar_day * smoke_yrs, family = "poisson", 
    data = smoke, offset = log(person_yrs))

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)               -9.2463     1.0000  -9.246  < 2e-16 ***
cigar_day                 -2.8137   319.6351  -0.009 0.992976    
smoke_yrs20-24            -0.7514     1.5114  -0.497 0.619084    
smoke_yrs25-29            -0.2539     1.3518  -0.188 0.851030    
smoke_yrs30-34             1.2493     1.1032   1.132 0.257489    
smoke_yrs35-39             0.5856     1.1660   0.502 0.615498    
smoke_yrs40-44             1.9706     1.0790   1.826 0.067800 .  
smoke_yrs45-49             2.1070     1.0982   1.919 0.055044 .  
smoke_yrs50-54             3.0710     1.0761   2.854 0.004318 ** 
smoke_yrs55-59             3.5704     1.0706   3.335 0.000853 ***
cigar_day:smoke_yrs20-24   2.8609   319.6351   0.009 0.992859    
cigar_day:smoke_yrs25-29   2.8736   319.6351   0.009 0.992827    
cigar_day:smoke_yrs30-34   2.8736   319.6351   0.009 0.992827    
cigar_day:smoke_yrs35-39   2.8997   319.6351   0.009 0.992762    
cigar_day:smoke_yrs40-44   2.8845   319.6351   0.009 0.992800    
cigar_day:smoke_yrs45-49   2.8886   319.6351   0.009 0.992790    
cigar_day:smoke_yrs50-54   2.8669   319.6351   0.009 0.992844    
cigar_day:smoke_yrs55-59   2.8641   319.6351   0.009 0.992851    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 445.099  on 62  degrees of freedom
Residual deviance:  60.597  on 45  degrees of freedom
AIC: 216.44

Number of Fisher Scoring iterations: 18

Step 4: Generate Predictions for Rate

pois_rate_augmented <- augment(pois_rate_multi, type.predict = "response")
head(pois_rate_augmented)

# A tibble: 6 × 10
   case cigar_day smoke_yrs `(offset)` .fitted .resid   .hat .sigma .cooksd
  <dbl>     <dbl> <chr>          <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
1     1         0 15-19           9.25   0.126  1.55  0.128    1.12 0.102  
2     0         0 20-24           9.01   0.260 -0.721 0.0918   1.14 0.00289
3     0         0 25-29           8.69   0.402 -0.896 0.0766   1.14 0.00361
4     0         0 30-34           8.41   1.35  -1.64  0.0897   1.12 0.0146 
5     0         0 35-39           8.16   1.09  -1.48  0.0816   1.12 0.0106 
6     0         0 40-44           7.70   1.80  -1.90  0.0959   1.11 0.0211 
# ℹ 1 more variable: .std.resid <dbl>

Rate Model Results

# Present rate model results
pois_rate_multi |>
  tbl_regression(exponentiate = TRUE) |>
  as_gt()

Characteristic	IRR	95% CI	p-value
cigar_day	1.07	1.05, 1.08	<0.001
smoke_yrs
15-19	—	—
20-24	2.62	0.34, 52.9	0.4
25-29	5.53	0.94, 105	0.11
30-34	24.7	5.23, 442	0.002
35-39	25.6	5.35, 459	0.002
40-44	67.2	14.6, 1,195	<0.001
45-49	85.5	18.3, 1,525	<0.001
50-54	133	28.3, 2,385	<0.001
55-59	215	45.1, 3,862	<0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio

Interpretation of Rate Model

Cigarettes per day:
- Each additional cigarette per day increases lung cancer rate by 7% (RR = 1.07, 95% CI: 1.05, 1.08)
Smoking years:
- Those smoking 30-34 years have 24.7 times higher rate compared to 15-19 years (RR = 24.7, 95% CI: 5.23, 442)

Model Diagnostics

# Check model fit using scaled Pearson chi-square
qpois_check <- summary(glm(attack ~ res_inf + ghq12, data = asthma, 
                          family = "quasipoisson"))
cat("Dispersion parameter:", qpois_check$dispersion)

Dispersion parameter: 1.052383

# Values close to 1 indicate good model fit

Summary

When to Use Poisson Regression

Count outcomes: Number of events, cases, attacks
Rate outcomes: Events per unit time or population
Small counts: Especially when zeros are common
Alternative to linear regression for count data

Advantages

Appropriate distribution for count data
Prevents negative predictions
Handles rate data with offsets
Interpretable results via rate ratios

Best Practices

Check assumptions: Mean ≈ variance
Model diagnostics: Scaled Pearson chi-square
Consider overdispersion: Use quasi-Poisson if needed
Interpret carefully: Rate ratios, not odds ratios
Use offsets for rate data

Thank You

Resources:

Fleiss, Levin & Paik (2003). Statistical Methods for Rates and Proportions
Agresti (2015). Foundations of Linear and Generalized Linear Models
Kamarul Imran, Wan Nor Arifin, Tengku Muhammad Hanis (2024). Data Analysis in Health and Medicine Using R. (Online)