Poisson Regression for Count and Rate Data

Kamarul Imran
Wan Nor Arifin

2025-09-19

Prerequisites

Required Packages

Load the necessary packages for our analysis:

# Load required packages using tidy principles
library(tidyverse)  # For data manipulation and visualization
library(gtsummary)  # For creating summary tables
library(haven)      # For reading data files
library(here)
library(broom)      # For tidying model outputs

Part A: Concepts of Poisson Regression

What is Poisson Regression?

  • Poisson regression is a regression analysis for count and rate data
  • It’s a type of Generalized Linear Model (GLM)
  • Models the linear relationship between:
    • Outcome: count variable (transformed to natural log scale) or rate (count with offset)
    • Predictors: numerical and categorical variables

Count Data

Characteristics of Count Data

  • Discrete numerical data (0, 1, 2, 3, …)
  • Examples in public health:
    • Number of hospital admissions
    • Number of asthmatic attacks per year
    • Number of disease cases
    • Number of deaths

Why Not Use Linear Regression?

  • May produce negative predictions (illogical for counts)
  • Assumes continuous data distribution
  • Violates assumptions for count data

Rate Data

Characteristics of Rate Data

  • Count proportional to a denominator
  • Examples:
    • Death rate per 100,000 population
    • Disease incidence rate per person-years
    • Accident rate per mile driven

Why Use Rates?

  • Allows fair comparison between different population sizes
  • Accounts for exposure time or population at risk
  • Essential for epidemiological studies

Poisson Distribution

Properties

  • Models discrete events occurring in fixed intervals
  • Mean equals variance (λ = μ = σ²)
  • Skewed distribution for small means
  • Approaches normal for large means

Mathematical Form

For count outcome: \[P(Y = k) = \frac{e^{-\lambda}\lambda^k}{k!}\]

where λ is the rate parameter

Poisson Regression Model

For Count Data

\[ln(count) = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ\]

For Rate Data (with offset)

\[ln(\frac{count}{denominator}) = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ\]

\[ln(count) = ln(denominator) + β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ\]

Interpretation

  • Coefficients: log rate ratios
  • exp(β): Rate ratios (RR) or Incidence Rate Ratios (IRR)

Part B: Demonstration Using R

Part B1: Poisson Regression for Count Data

Dataset: Asthma Attacks

  • Load the asthma dataset
  • Explore data structure
asthma <- read_csv(here("data", "asthma.csv"))
glimpse(asthma)
Rows: 120
Columns: 4
$ gender  <chr> "female", "male", "male", "female", "male", "male", "female", …
$ res_inf <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes", …
$ ghq12   <dbl> 21, 17, 30, 22, 27, 33, 24, 23, 25, 28, 14, 20, 24, 5, 32, 17,…
$ attack  <dbl> 6, 4, 8, 5, 2, 3, 2, 1, 2, 2, 4, 4, 3, 0, 2, 2, 4, 1, 5, 0, 2,…

Data Exploration

  • Summary statistics
summary(asthma)
    gender            res_inf              ghq12           attack     
 Length:120         Length:120         Min.   : 0.00   Min.   :0.000  
 Class :character   Class :character   1st Qu.: 7.00   1st Qu.:1.000  
 Mode  :character   Mode  :character   Median :19.00   Median :2.000  
                                       Mean   :16.34   Mean   :2.458  
                                       3rd Qu.:25.00   3rd Qu.:4.000  
                                       Max.   :33.00   Max.   :9.000  

Data Exploration (continued)

  • Descriptive table using gtsummary
asthma |> 
  tbl_summary() |>
  as_gt()
Characteristic N = 1201
gender
    female 67 (56%)
    male 53 (44%)
res_inf 69 (58%)
ghq12 19 (7, 25)
attack 2.00 (1.00, 4.00)
1 n (%); Median (Q1, Q3)

Step 1: Univariable Analysis

Model for Gender

pois_gender <- glm(attack ~ gender, data = asthma, family = "poisson")
summary(pois_gender)

Call:
glm(formula = attack ~ gender, family = "poisson", data = asthma)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.02105    0.07332  13.925   <2e-16 ***
gendermale  -0.30000    0.12063  -2.487   0.0129 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 229.56  on 119  degrees of freedom
Residual deviance: 223.23  on 118  degrees of freedom
AIC: 500.3

Number of Fisher Scoring iterations: 5

Step 1: Univariable Analysis

Model for Respiratory Infection

pois_res_inf <- glm(attack ~ res_inf, data = asthma, family = "poisson")
summary(pois_res_inf)

Call:
glm(formula = attack ~ res_inf, family = "poisson", data = asthma)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.2877     0.1213   2.372   0.0177 *  
res_infyes    0.9032     0.1382   6.533 6.44e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 229.56  on 119  degrees of freedom
Residual deviance: 180.49  on 118  degrees of freedom
AIC: 457.56

Number of Fisher Scoring iterations: 5

Step 1: Univariable Analysis

Model for GHQ12 Score

pois_ghq12 <- glm(attack ~ ghq12, data = asthma, family = "poisson")
summary(pois_ghq12)

Call:
glm(formula = attack ~ ghq12, family = "poisson", data = asthma)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.230923   0.159128  -1.451    0.147    
ghq12        0.059500   0.006919   8.599   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 229.56  on 119  degrees of freedom
Residual deviance: 145.13  on 118  degrees of freedom
AIC: 422.2

Number of Fisher Scoring iterations: 5

Step 2: Multivariable Model (No Interaction)

pois_multi <- glm(attack ~ res_inf + ghq12, data = asthma, family = "poisson")
summary(pois_multi)

Call:
glm(formula = attack ~ res_inf + ghq12, family = "poisson", data = asthma)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.34051    0.16823  -2.024  0.04296 *  
res_infyes   0.42816    0.15282   2.802  0.00508 ** 
ghq12        0.04989    0.00779   6.404 1.51e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 229.56  on 119  degrees of freedom
Residual deviance: 136.80  on 117  degrees of freedom
AIC: 415.86

Number of Fisher Scoring iterations: 5

Step 3: Multivariable Model with Interaction

pois_interaction <- glm(attack ~ res_inf * ghq12, data = asthma, family = "poisson")
summary(pois_interaction)

Call:
glm(formula = attack ~ res_inf * ghq12, family = "poisson", data = asthma)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -0.63436    0.23408  -2.710  0.00673 ** 
res_infyes        1.01927    0.32822   3.105  0.00190 ** 
ghq12             0.06834    0.01186   5.763 8.29e-09 ***
res_infyes:ghq12 -0.03135    0.01531  -2.048  0.04056 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 229.56  on 119  degrees of freedom
Residual deviance: 132.67  on 116  degrees of freedom
AIC: 413.74

Number of Fisher Scoring iterations: 5

Step 4: Generate Predictions

pois_augmented <- augment(pois_multi, type.predict = "response")
head(pois_augmented)
# A tibble: 6 × 9
  attack res_inf ghq12 .fitted .resid   .hat .sigma .cooksd .std.resid
   <dbl> <chr>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>      <dbl>
1      6 yes        21    3.11  1.45  0.0147   1.08 0.0136       1.46 
2      4 no         17    1.66  1.53  0.0248   1.08 0.0287       1.55 
3      8 yes        30    4.88  1.29  0.0346   1.08 0.0248       1.32 
4      5 yes        22    3.27  0.886 0.0148   1.08 0.00463      0.893
5      2 yes        27    4.20 -1.20  0.0219   1.08 0.00879     -1.21 
6      3 yes        33    5.66 -1.23  0.0571   1.08 0.0268      -1.27 

Model Results Presentation

# Present results with IRRs
pois_multi |>
  tbl_regression() |>
  as_gt() |>
  gt::tab_source_note(gt::md("*constant is -0.34*"))
Characteristic log(IRR) 95% CI p-value
res_inf


    no
    yes 0.43 0.13, 0.73 0.005
ghq12 0.05 0.03, 0.07 <0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio
constant is -0.34
  • Calculation for fitted value for observation 1 is \(ln(count) = -0.34 + 0.43*1 + 0.05*21\)
  • expected count \(= \exp{1.14} = 3.13\)

Model Results Presentation

# Present results with IRRs
pois_multi |>
  tbl_regression(exponentiate = TRUE) |>
  as_gt()
Characteristic IRR 95% CI p-value
res_inf


    no
    yes 1.53 1.14, 2.08 0.005
ghq12 1.05 1.04, 1.07 <0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio

Interpretation of Count Model

  • Respiratory Infection: Those with recurrent respiratory infection have a rate ratio of 1.53 for asthmatic attacks (95% CI: 1.14, 2.08)

  • GHQ12 Score: Each one-point increase in GHQ12 score increases the rate of attacks by 5% (RR = 1.05, 95% CI: 1.04, 1.07)

Part B2: Poisson Regression for Rate Data

Dataset: Smoking and Lung Cancer

# Load the smoking dataset
smoke <- read_csv(here("data", "smoke.csv"))

# Explore data structure
glimpse(smoke)
Rows: 63
Columns: 4
$ smoke_yrs  <chr> "15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-4…
$ cigar_day  <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.2, 5.2, 5.2,…
$ person_yrs <dbl> 10366, 8162, 5969, 4496, 3512, 2201, 1421, 1121, 826, 3121,…
$ case       <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 2, 0, 3, 0, 0, 1,…

Rate Data Exploration

# Calculate observed rates
smoke |> 
  mutate(rate = round(case/person_yrs, 4)) |>
  head(10)
# A tibble: 10 × 5
   smoke_yrs cigar_day person_yrs  case   rate
   <chr>         <dbl>      <dbl> <dbl>  <dbl>
 1 15-19           0        10366     1 0.0001
 2 20-24           0         8162     0 0     
 3 25-29           0         5969     0 0     
 4 30-34           0         4496     0 0     
 5 35-39           0         3512     0 0     
 6 40-44           0         2201     0 0     
 7 45-49           0         1421     0 0     
 8 50-54           0         1121     0 0     
 9 55-59           0          826     2 0.0024
10 15-19           5.2       3121     0 0     

Step 1: Univariable Analysis with Offset

Model for Cigarettes per Day

pois_cigar <- glm(case ~ cigar_day, data = smoke, family = "poisson", 
                  offset = log(person_yrs))
summary(pois_cigar)

Call:
glm(formula = case ~ cigar_day, family = "poisson", data = smoke, 
    offset = log(person_yrs))

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -8.159268   0.175063  -46.61   <2e-16 ***
cigar_day    0.070359   0.006468   10.88   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 445.10  on 62  degrees of freedom
Residual deviance: 324.71  on 61  degrees of freedom
AIC: 448.55

Number of Fisher Scoring iterations: 6

Step 1: Univariable Analysis with Offset

Model for Smoking Years

pois_smoke_yrs <- glm(case ~ smoke_yrs, data = smoke, family = "poisson", 
                      offset = log(person_yrs))
summary(pois_smoke_yrs)

Call:
glm(formula = case ~ smoke_yrs, family = "poisson", data = smoke, 
    offset = log(person_yrs))

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -10.335      1.000 -10.335  < 2e-16 ***
smoke_yrs20-24    1.117      1.155   0.968 0.333149    
smoke_yrs25-29    1.990      1.080   1.842 0.065423 .  
smoke_yrs30-34    3.563      1.020   3.493 0.000477 ***
smoke_yrs35-39    3.615      1.024   3.532 0.000412 ***
smoke_yrs40-44    4.580      1.013   4.521 6.15e-06 ***
smoke_yrs45-49    4.785      1.016   4.707 2.52e-06 ***
smoke_yrs50-54    5.085      1.020   4.987 6.14e-07 ***
smoke_yrs55-59    5.384      1.024   5.261 1.44e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 445.10  on 62  degrees of freedom
Residual deviance: 170.17  on 54  degrees of freedom
AIC: 308.01

Number of Fisher Scoring iterations: 5

Step 2: Multivariable Model (No Interaction)

pois_rate_multi <- glm(case ~ cigar_day + smoke_yrs, data = smoke, 
                       family = "poisson", offset = log(person_yrs))
summary(pois_rate_multi)

Call:
glm(formula = case ~ cigar_day + smoke_yrs, family = "poisson", 
    data = smoke, offset = log(person_yrs))

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -11.317162   1.007492 -11.233  < 2e-16 ***
cigar_day        0.064361   0.006401  10.054  < 2e-16 ***
smoke_yrs20-24   0.962417   1.154694   0.833  0.40457    
smoke_yrs25-29   1.710894   1.080420   1.584  0.11330    
smoke_yrs30-34   3.207676   1.020378   3.144  0.00167 ** 
smoke_yrs35-39   3.242776   1.024187   3.166  0.00154 ** 
smoke_yrs40-44   4.208361   1.013726   4.151 3.30e-05 ***
smoke_yrs45-49   4.448972   1.017054   4.374 1.22e-05 ***
smoke_yrs50-54   4.893674   1.019945   4.798 1.60e-06 ***
smoke_yrs55-59   5.372600   1.023404   5.250 1.52e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 445.099  on 62  degrees of freedom
Residual deviance:  68.015  on 53  degrees of freedom
AIC: 207.86

Number of Fisher Scoring iterations: 5

Step 3: Multivariable Model with Interaction

pois_rate_interaction <- glm(case ~ cigar_day * smoke_yrs, data = smoke,
                            family = "poisson", offset = log(person_yrs))
summary(pois_rate_interaction)

Call:
glm(formula = case ~ cigar_day * smoke_yrs, family = "poisson", 
    data = smoke, offset = log(person_yrs))

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)               -9.2463     1.0000  -9.246  < 2e-16 ***
cigar_day                 -2.8137   319.6351  -0.009 0.992976    
smoke_yrs20-24            -0.7514     1.5114  -0.497 0.619084    
smoke_yrs25-29            -0.2539     1.3518  -0.188 0.851030    
smoke_yrs30-34             1.2493     1.1032   1.132 0.257489    
smoke_yrs35-39             0.5856     1.1660   0.502 0.615498    
smoke_yrs40-44             1.9706     1.0790   1.826 0.067800 .  
smoke_yrs45-49             2.1070     1.0982   1.919 0.055044 .  
smoke_yrs50-54             3.0710     1.0761   2.854 0.004318 ** 
smoke_yrs55-59             3.5704     1.0706   3.335 0.000853 ***
cigar_day:smoke_yrs20-24   2.8609   319.6351   0.009 0.992859    
cigar_day:smoke_yrs25-29   2.8736   319.6351   0.009 0.992827    
cigar_day:smoke_yrs30-34   2.8736   319.6351   0.009 0.992827    
cigar_day:smoke_yrs35-39   2.8997   319.6351   0.009 0.992762    
cigar_day:smoke_yrs40-44   2.8845   319.6351   0.009 0.992800    
cigar_day:smoke_yrs45-49   2.8886   319.6351   0.009 0.992790    
cigar_day:smoke_yrs50-54   2.8669   319.6351   0.009 0.992844    
cigar_day:smoke_yrs55-59   2.8641   319.6351   0.009 0.992851    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 445.099  on 62  degrees of freedom
Residual deviance:  60.597  on 45  degrees of freedom
AIC: 216.44

Number of Fisher Scoring iterations: 18

Step 4: Generate Predictions for Rate

pois_rate_augmented <- augment(pois_rate_multi, type.predict = "response")
head(pois_rate_augmented)
# A tibble: 6 × 10
   case cigar_day smoke_yrs `(offset)` .fitted .resid   .hat .sigma .cooksd
  <dbl>     <dbl> <chr>          <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
1     1         0 15-19           9.25   0.126  1.55  0.128    1.12 0.102  
2     0         0 20-24           9.01   0.260 -0.721 0.0918   1.14 0.00289
3     0         0 25-29           8.69   0.402 -0.896 0.0766   1.14 0.00361
4     0         0 30-34           8.41   1.35  -1.64  0.0897   1.12 0.0146 
5     0         0 35-39           8.16   1.09  -1.48  0.0816   1.12 0.0106 
6     0         0 40-44           7.70   1.80  -1.90  0.0959   1.11 0.0211 
# ℹ 1 more variable: .std.resid <dbl>

Rate Model Results

# Present rate model results
pois_rate_multi |>
  tbl_regression(exponentiate = TRUE) |>
  as_gt()
Characteristic IRR 95% CI p-value
cigar_day 1.07 1.05, 1.08 <0.001
smoke_yrs


    15-19
    20-24 2.62 0.34, 52.9 0.4
    25-29 5.53 0.94, 105 0.11
    30-34 24.7 5.23, 442 0.002
    35-39 25.6 5.35, 459 0.002
    40-44 67.2 14.6, 1,195 <0.001
    45-49 85.5 18.3, 1,525 <0.001
    50-54 133 28.3, 2,385 <0.001
    55-59 215 45.1, 3,862 <0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio

Interpretation of Rate Model

  • Cigarettes per day:
    • Each additional cigarette per day increases lung cancer rate by 7% (RR = 1.07, 95% CI: 1.05, 1.08)
  • Smoking years:
    • Those smoking 30-34 years have 24.7 times higher rate compared to 15-19 years (RR = 24.7, 95% CI: 5.23, 442)

Model Diagnostics

# Check model fit using scaled Pearson chi-square
qpois_check <- summary(glm(attack ~ res_inf + ghq12, data = asthma, 
                          family = "quasipoisson"))
cat("Dispersion parameter:", qpois_check$dispersion)
Dispersion parameter: 1.052383
# Values close to 1 indicate good model fit

Summary

When to Use Poisson Regression

  • Count outcomes: Number of events, cases, attacks
  • Rate outcomes: Events per unit time or population
  • Small counts: Especially when zeros are common
  • Alternative to linear regression for count data

Advantages

  • Appropriate distribution for count data
  • Prevents negative predictions
  • Handles rate data with offsets
  • Interpretable results via rate ratios

Best Practices

  1. Check assumptions: Mean ≈ variance
  2. Model diagnostics: Scaled Pearson chi-square
  3. Consider overdispersion: Use quasi-Poisson if needed
  4. Interpret carefully: Rate ratios, not odds ratios
  5. Use offsets for rate data

Thank You

Resources:

  • Fleiss, Levin & Paik (2003). Statistical Methods for Rates and Proportions
  • Agresti (2015). Foundations of Linear and Generalized Linear Models
  • Kamarul Imran, Wan Nor Arifin, Tengku Muhammad Hanis (2024). Data Analysis in Health and Medicine Using R. (Online)