The {mxfda}
package
contains tools for analyzing spatial single-cell data, including
multiplex imaging and spatial transcriptomics, using methods from
functional data analysis. Analyses for this package are executed and
stored using an S4 object of class mxFDA
. This vignette
outlines how to perform functional regression using spatial summary
functions of multiplex image samples as model covariates. To set up an
mxFDA
object from spatial single cell data, calculate
spatial summary functions, and perform exploratory data analysis and
visualization of these spatial summary functions, see
vignette("mx_fda")
. To perform feature extraction using
functional principal component analysis, see
vignette("mx_fpca")
.
Here we load data from the Ovarian cancer dataset where univariate
nearest-neighbor G-functions for immune cells have already been
estimated. See the vignette mxfda::mx_fda
for more details
on extracting spatial summary functions.
We visualize these functions below.
plot(ovarian_FDA, y = "fundiff", what = "uni g") +
geom_hline(yintercept = 0, color = "red", linetype = 2) +
theme_minimal() +
ggtitle("Nearest neighbor G-functions for immune cells")
The plot shows the G-function values as a function of radius (r) for multiple immune cell samples, where each black line is an individual sample. The red dashed line at g = 0 indicates a baseline where no spatial clustering or dispersion is observed. Deviations from this baseline highlight potential patterns of clustering or dispersion of immune cells at varying radii.
Single-cell imaging is common in cancer applications, where the analysis goal is to relate cell-level spatial information to patient-level time-to-event outcomes like overall survival or time-to-recurrence. Cox regression is a popular regression technique for assessing the association between covariates and a survival outcome. The standard Cox model is given by
$$\log h_i(t) = \log h_0(t) + \sum_{p = 1}^P \gamma_p Z_{ip},$$
where hi(t) is the hazard rate for subject i and h0(t) is an unspecified baseline hazard. Each Zip is a scalar covariate, and P is the total number of scalar covariates. These are covariates like age, sex, treatment group, etc and can be contrasted with spatial summary functions which are known as functional covariates. The γp are regression coefficients; these are interpreted as log hazard ratios.
For single cell spatial data, we have a spatial summary function for each subject (e.g. univariate Ripley’s K), Xi(r), that we want to add to the Cox regression model as a functional covariate. Below we provide three different strategies for incorporating Xi(r) into a Cox regression model as a covariate.
In this approach, we first decompose the spatial summary functions using FPCA, then use subject-specific scores from FPCA as covariates in a Cox regression model. Conceptually, we are fitting the model given by
$$\log h_i(t) = \log h_0(t) + \sum_{p = 1}^P \gamma_p Z_{ip} + \sum_{k = 1}^K \beta_k c_{ik},$$
where cik is
the kth score from functional
principal components analysis for the ith subject. K represents the number of principal
components, usually selected to explain at least 95% of the variance in
the spatial summary functions (see vignette("mx_fpca")
).
The βk are
regression coefficients that tell us about the association between
overall survival and the kth
functional principal component. Like the γp, the the
βk are
interpreted as log hazard ratios, and eβk
is a hazard ratio. This is the approach to Cox regression with spatial
summary functions as covariates taken in Vu et
al. (2023).
The code below runs FPCA the G-functions from the ovarian cancer data
and calculates principal components that explain up to 99% variance
using the argument pve
. The argument
lightweight = TRUE
returns only what is needed for
downstream analysis. To return all items calculate by FPCA when using
refund::fpca.face
, use
lightweight = FALSE
.
ovarian_FDA <- run_fpca(ovarian_FDA, metric = "uni g", r = "r", value = "fundiff",
lightweight = TRUE,
pve = .99)
ovarian_FDA
#> mxFDA Object:
#> Subjects: 128
#> Samples: 128
#> Has spatial data
#> Univariate Summaries: Gest
#> Bivariate Summaries: None
#> Multivariate Summaries: None
#> FPCs Calculated:
#> Gest: 6 FPCs describe 99.2% variance
#> MFPCs not yet calculated
#> FCMs not yet calculated
#> MFCMs not yet calculated
#> Scalar on Functional Regression not calculated
Six FPCs were returned. Below, we extract the fpca scores from the
ovarian_FDA
object and visualize the relationship between
the first two FPC scores and survival time.
Gdf_fpc = extract_fpca_scores(ovarian_FDA, 'uni g fpca')
p1 = Gdf_fpc %>%
mutate(event = factor(event, levels = 0:1, labels = c("censored", "died"))) %>%
ggplot(aes(fpc1, survival_time, color = event)) +
geom_point() +
labs(y = "survival time (days)", title = "fpc1") +
theme(legend.position = c(.5, .7))
p2 = Gdf_fpc %>%
mutate(event = factor(event, levels = 0:1, labels = c("censored", "died"))) %>%
ggplot(aes(fpc2, survival_time, color = event)) +
geom_point() +
labs(y = "survival time (days)", title = "fpc2") +
theme(legend.position = "none")
ggarrange(p1, p2, nrow = 1, ncol = 2)
The figure above shows scatter plots of survival time versus functional principal components for FPC 1 and FPC 2. The left panel shows the relationship between survival time (in days) and the first functional principal component (fpc1) score, while the right panel illustrates the relationship with the second component (fpc2). Data points are color-coded based on whether a given patient experienced the event, with red representing censoring and blue representing individuals who died. These plots highlight the variation in survival times as explained by the two principal components. In general, it appears that subjects with lower scores for FPC 1 tended to have longer survival times.
Next, we use these scores as covariates in a Cox regression model. We also control for age.
library(survival)
phmod_fpc = coxph(Surv(survival_time, event) ~ fpc1 + fpc2 + fpc3 + fpc4 + age,
data = Gdf_fpc)
The results from this model are printed below in a tidy format. Note that all FPCs except FPC 2 have a statistically significant relationship with overall survival. The hazard ratio above 1 for FPC 1 indicates that higher scores are associate with worse survival outcomes, which is consistent with the exploratory analysis in the plot above.
tidy(phmod_fpc, exp = TRUE, conf.int = TRUE) %>%
mutate(p.value = format.pval(p.value, digits = 1)) %>%
select(term, hazard_ratio = estimate, conf.low, conf.high, p = p.value) %>%
knitr::kable(digits = 2)
term | hazard_ratio | conf.low | conf.high | p |
---|---|---|---|---|
fpc1 | 4.38 | 2.58 | 7.46 | 5e-08 |
fpc2 | 1.16 | 0.46 | 2.89 | 0.754 |
fpc3 | 62.67 | 7.24 | 542.16 | 2e-04 |
fpc4 | 0.04 | 0.00 | 0.67 | 0.026 |
age | 1.03 | 1.01 | 1.05 | 0.009 |
Linear and additive functional Cox regression models were used in Vu et al. (2022) to model the relationship between survival and spatial summary functions without having to first decompose the spatial summary functions using FPCA. Linear functional cox regression was first introduced by Gellar et al. (2015) and takes the form:
$$\log h_i(t;Z_i, X_i) = \log h_0(t) + \sum_{p = 1}^P \gamma_p Z_{ip} + \int_{r = 0}^R \beta(r)X_i(r)dr,$$
where Xi(r) is the spatial summary function for subject i, and β(r) is a coefficient function. For this model, the coefficient function is a log hazard ratio for the spatial summary function at radius r. In effect, this allows us to model the association between of spatial clustering and survival at many different radii simultaneously. Here Xi(r) is considered a functional covariate, and the Zip are scalar covariates.
The code below runs an LFCM for the G-functions from the ovarian
cancer data with age as a scalar covariate. Setting the argument
afcm = FALSE
runs an LFCM instead of an AFCM.
ovarian_FDA = run_fcm(ovarian_FDA, model_name = "fit_lfcm",
formula = survival_time ~ age, event = "event",
metric = "uni g", r = "r", value = "fundiff",
afcm = FALSE)
Below we print the class and summary of the extracted model.
summary(extract_model(ovarian_FDA, 'uni g', 'cox', 'fit_lfcm'))
#>
#> Family: Cox PH
#> Link function: identity
#>
#> Formula:
#> survival_time ~ age + s(t_int, by = l_int * func, bs = "cr",
#> k = 20)
#>
#> Parametric coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> age 0.03083 0.01136 2.714 0.00664 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df Chi.sq p-value
#> s(t_int):l_int * func 4.286 4.759 41.02 4.76e-07 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Deviance explained = 5.72%
#> -REML = 314.14 Scale est. = 1 n = 128
Finally, we visualized the exponentiated estimated coefficient function, eβ̂(r), from this model. The code below extracts this quantity, then plots eβ̂(r) as a solid black line and 95% pointwise confidence intervals as dotted black lines. For the linear functional Cox model, eβ̂(r) is interpreted as a hazard ratio as a function of radius r. The red dotted line represents the hazard ratio of 1. Values above the red line indicate an increased hazard, while values below it suggest a decreased hazard. Values of the radius r where the 95% confidence interval does not contain one are statistically significant.
Additive functional cox regression was first introduced by Cui, Crainiceanu, and Leroux (2021) and takes the form:
$$\log h_i(t;Z_i, X_i) = \log h_0(t) + \sum_{p = 1}^P \gamma_p Z_{ip} + \int_{r = 0}^R F\left(r, X_i(r)\right)dr.$$
This model is slightly more flexible than the LFCM, in that it allows the relationship between Xi(r) and survival to vary nonlinearly with r.
The code below runs an AFCM with age as a scalar covariate. Setting
the argument afcm = TRUE
runs an AFCM.
ovarian_FDA <- run_fcm(ovarian_FDA, model_name = "fit_afcm",
formula = survival_time ~ age, event = "event",
metric = "uni g", r = "r", value = "fundiff",
afcm = TRUE)
summary(extract_model(ovarian_FDA, 'uni g', 'cox', 'fit_afcm'))
#>
#> Family: Cox PH
#> Link function: identity
#>
#> Formula:
#> survival_time ~ age + ti(t_int, func, by = l_int, bs = c("cr",
#> "cr"), k = c(10, 10), mc = c(FALSE, TRUE))
#>
#> Parametric coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> age 0.02852 0.01162 2.454 0.0141 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df Chi.sq p-value
#> ti(t_int,func):l_int 8.248 10.52 55.88 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Deviance explained = 10.7%
#> -REML = 310.05 Scale est. = 1 n = 128
The AFCM model produces a coefficient surface, F̂(r, Xi(r)), rather than a coefficient function, illustrating the relationship between G value and radius r in a continuous gradient. The code below extracts this estimated surface, then on the left plots the surface, and on the right plots regions of the surface that are statistically significant at the 95% confidence level. The color scale represents the magnitude of the hazard ratio.
For example, at r = 30 and G = 0.3, we observe a statistically significant region with a color corresponding to a hazard ratio value between 1.5 and 2.0. This indicates that at a radius of 30mm, individuals with a G value of 0.3, experience an expected 50% to 100% increased hazard compared to the baseline hazard.
C-index is a good way to compare across models, especially when incorporating cross validation. Larger c-index is indicative of a better predictive model. Below, we calculate the c-index for each model.
fit_afcm = extract_model(ovarian_FDA, 'uni g', 'cox', 'fit_afcm')
fit_lfcm = extract_model(ovarian_FDA, 'uni g', 'cox', 'fit_lfcm')
c_index = c(
phmod_fpc$concordance[["concordance"]],
extract_c(fit_lfcm, Gdf_fpc$survival_time, Gdf_fpc$event),
extract_c(fit_afcm, Gdf_fpc$survival_time, Gdf_fpc$event)
)
tibble(model = c("fpc", "lfcm", "afcm"), c_index) %>%
knitr::kable(digits = 2)
model | c_index |
---|---|
fpc | 0.71 |
lfcm | 0.73 |
afcm | 0.76 |
Models with continuous or binary scalar outcomes and functional covariates are also possible, and are referred to as scalar-on-function regression (sofr) models in the functional data analysis literature (Goldsmith et al. (2011)).
Below we fit a sofr model with age as the outcome, G-functions from
the ovarian cancer data as the functional covariate, and no additional
scalar covariates using the run_sofr()
function.
ovarian_FDA <- run_sofr(ovarian_FDA,
model_name = "fit_sofr_age",
formula = age ~ 1,
metric = "uni g", r = "r", value = "fundiff")
This model also produces an estimated coefficient surface β̂(r), which can be visualized. The code below extracts this quantity, then plots β̂(r) as a solid black line and 95% pointwise confidence intervals as dotted black lines. Values of the radius r where the 95% confidence interval does not contain zero are statistically significant. At each radius value r, for a continuous outcome β̂(r) can be interpreted as a regular linear regression coefficient.
Below we fit a sofr model with stage as the outcome, G-functions from
the ovarian cancer data as the functional covariate, and age as a scalar
covariate. Stage is a binary outcome, and as a result this model can
also be called functional logistic regression. When the outcome
is binary, you need to specify the argument
family = "binomial"
. In addition, the outcome should be
coded as a numeric rather than a factor or character variable.
ovarian_FDA <- run_sofr(ovarian_FDA,
model_name = "fit_sofr_stage",
formula = stage ~ age,
family = "binomial",
metric = "uni g", r = "r", value = "fundiff")
When the outcome is binary β̂(r) is interpreted as a log odds ratio at each radius r.
model = extract_model(ovarian_FDA, 'uni g', type = 'sofr', model_name = 'fit_sofr_stage')
plot(model, ylab=expression(paste(beta(t))), xlab="t")