--- title: "Estimating Prevalence Ratios with prLogistic" author: "Leila D. Amorim & Raydonal Ospina" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 number_sections: true vignette: > %\VignetteIndexEntry{Estimating Prevalence Ratios with prLogistic} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: references.bib --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4.5 ) library(prLogistic) ``` # Introduction In cross-sectional epidemiological studies, the prevalence ratio (PR) is often the measure of association of interest. When logistic regression is used to adjust for confounders, the model directly yields odds ratios (OR), not PRs. Although OR approximates PR when the outcome is rare (< 10%), the approximation breaks down for common outcomes and can substantially overestimate the strength of the association [@Zhang1998]. The **prLogistic** package estimates adjusted PRs — and their confidence intervals — directly from logistic regression models, using two standardisation procedures [@Wilcosky1985; @Amorim2021]: - **Conditional standardisation**: PR at fixed covariate values (reference profile). - **Marginal standardisation**: population-averaged PR over the observed covariate distribution. Both procedures support four model types: | Function | Model class | Package | Use case | |-----------------------|-------------|-----------|------------------------------| | `prLogisticDelta()` | `glm` | stats | Independent observations | | `prLogisticDelta()` | `glmerMod` | lme4 | Clustered / multilevel data | | `prLogisticGEE()` | `geeglm` | geepack | Longitudinal / GEE | | `prLogisticSurvey()` | `svyglm` | survey | Complex survey designs | --- # Installation ```r # CRAN (stable) install.packages("prLogistic") # Development version (GitHub) # install.packages("remotes") remotes::install_github("Raydonal/prLogistic") ``` --- # Independent observations — `glm` ## Data We use the `birthwt` dataset from the **MASS** package, a retrospective study of risk factors for low birth weight (n = 189). ```{r data} data(birthwt, package = "MASS") # Recode predictors birthwt$smoke <- factor(birthwt$smoke, labels = c("Non-smoker", "Smoker")) birthwt$race <- factor(birthwt$race, labels = c("White", "Black", "Other")) birthwt$ht <- factor(birthwt$ht, labels = c("No", "Yes")) birthwt$ui <- factor(birthwt$ui, labels = c("No", "Yes")) # Outcome prevalence mean(birthwt$low) # 31 % — common outcome, OR is a poor approximation ``` ## Fitting the model ```{r glm-fit} fit_glm <- glm(low ~ smoke + race + age + lwt + ht + ui, family = binomial, data = birthwt) ``` ## Delta method — conditional standardisation Reference profile: binary predictors at 0 (reference category), continuous predictors (`age`, `lwt`) at their sample medians. ```{r delta-cond} res_cond <- prLogisticDelta(fit_glm, standardisation = "conditional") res_cond ``` ## Delta method — marginal standardisation ```{r delta-marg} res_marg <- prLogisticDelta(fit_glm, standardisation = "marginal") res_marg ``` ## Custom reference values Use `ref_values` to fix continuous predictors at clinically meaningful values (e.g., a 25-year-old woman weighing 55 kg): ```{r ref-values} prLogisticDelta(fit_glm, standardisation = "conditional", ref_values = list(age = 25, lwt = 55)) ``` ## Forest plot ```{r forest-plot, fig.cap = "Forest plot: conditional PR estimates with 95% CI"} plot(res_cond, main = "Prevalence Ratios — conditional (birthwt)") ``` ## Bootstrap confidence intervals Bootstrap is recommended as a sensitivity check, especially with small samples. ```{r bootstrap, cache = TRUE} set.seed(2024) res_boot_c <- prLogisticBootCond(fit_glm, data = birthwt, R = 999) res_boot_c ``` Extract a specific CI type: ```{r confint-boot} # Percentile intervals confint(res_boot_c, type = "percentile") ``` --- # Clustered / multilevel data — `glmer` (lme4) ```{r glmer, eval = requireNamespace("lme4", quietly = TRUE)} library(lme4) # Treat race as a clustering variable (illustrative) fit_glmer <- glmer(low ~ smoke + age + lwt + ht + (1 | race), family = binomial, data = birthwt) prLogisticDelta(fit_glmer, standardisation = "marginal") ``` The random effect `(1 | race)` accounts for unobserved between-group heterogeneity. Fixed-effect coefficients — and hence PRs — are conditional on the random effects. --- # Longitudinal data — GEE via `geepack` GEE models yield **population-averaged** estimates and naturally accommodate within-subject correlation. The `prLogisticGEE()` wrapper sets `marginal` as the default standardisation. ```{r gee, eval = requireNamespace("geepack", quietly = TRUE)} library(geepack) data(ohio, package = "geepack") # Respiratory symptoms in children (4 repeated measures per child) fit_gee <- geeglm(resp ~ smoke + age, family = binomial, id = id, corstr = "exchangeable", data = ohio) prLogisticGEE(fit_gee) ``` The robust sandwich variance from `vcov(fit_gee)` is used automatically, providing valid inference even if the working correlation structure is misspecified. --- # Complex survey data — `svyglm` (survey) ```{r survey, eval = requireNamespace("survey", quietly = TRUE)} library(survey) data(api, package = "survey") apiclus2$target_met <- as.numeric(apiclus2$sch.wide == "Yes") # Two-stage cluster sample dclus2 <- svydesign(id = ~dnum + snum, fpc = ~fpc1 + fpc2, data = apiclus2) fit_svy <- svyglm(target_met ~ meals + stype, design = dclus2, family = quasibinomial) prLogisticSurvey(fit_svy, standardisation = "conditional") ``` Design-consistent (sandwich) standard errors are incorporated automatically through the `vcov()` method for `svyglm` objects. --- # Comparing OR and PR The odds ratio overestimates PR when the outcome is common. The table below illustrates this for the smoking predictor in the `birthwt` example: ```{r or-vs-pr} OR <- exp(coef(fit_glm)["smokeSmoker"]) PR <- coef(res_cond)["smokeSmoker"] data.frame( Measure = c("Odds Ratio (logistic)", "Prevalence Ratio (conditional)", "Prevalence Ratio (marginal)"), Estimate = round(c(OR, PR, coef(res_marg)["smokeSmoker"]), 3) ) ``` With a 31% baseline prevalence the OR (`r round(OR, 2)`) substantially overstates the PR (`r round(PR, 2)`). --- # Methodological notes ## Conditional standardisation For predictor $X_j$ (binary), the adjusted PR is: $$ \widehat{PR}_j = \frac{\operatorname{expit}(\hat\beta_0 + \hat\beta_j + \sum_{k \neq j} \hat\beta_k r_k)} {\operatorname{expit}(\hat\beta_0 + \sum_{k \neq j} \hat\beta_k r_k)} $$ where $r_k$ is the reference value of covariate $k$ (0 for binary/dummy predictors; sample median or mean for continuous predictors). ## Marginal standardisation $$ \widehat{PR}_j = \frac{n^{-1}\sum_i \operatorname{expit}(\hat\eta_i^{(1)})} {n^{-1}\sum_i \operatorname{expit}(\hat\eta_i^{(0)})} $$ where $\hat\eta_i^{(1)}$ and $\hat\eta_i^{(0)}$ are the linear predictors with $X_{ij}$ set to 1 and 0, respectively. ## Delta method Confidence intervals use the first-order Taylor (delta method) approximation [@Oliveira1997]: $$ \widehat{\operatorname{Var}}[\log(\widehat{PR})] \approx \mathbf{x}^*{}' \hat\Sigma \, \mathbf{x}^* $$ where $\mathbf{x}^* = (1-\hat p_1)\mathbf{x}_1 - (1-\hat p_0)\mathbf{x}_0$ and $\hat\Sigma$ is the estimated covariance matrix of $\hat{\boldsymbol\beta}$. --- # Session information ```{r session} sessionInfo() ``` # References