The {mxfda}
package
contains tools for analyzing spatial single-cell data using methods from
functional data analysis. Analyses for this package are executed and
stored using an S4 object of class mxFDA
. This vignette
outlines how to perform feature extraction using functional principal
component analysis and multilevel functional principal component
analysis. To set up an mxFDA
object from spatial single
cell imaging data, calculate spatial summary functions, and perform
exploratory data analysis and visualization of these spatial summary
functions, see vignette("mx_fda")
. To perform functional
regression on spatial summary functions from multiplex imaging data, see
vignette("mx_funreg")
.
The basic unit of observation is the curve Xi(r)
for subjects i ∈ …, I
in the cross-sectional setting and Xij(r)
for subject i at sample j ∈ …, Ji
for the multilevel structure, which occurs when there are multiple
samples for each subject. Methods for functional data are typically
presented in terms of continuous functions, but in practice data are
observed on a discrete grid, t. In the {mxfda}
package, Xij(r)
is the spatial summary function (for example, Ripley’s K) value at
radius r for sample j from subject i.
FPCA characterizes modes of variability by decomposing functional observations into population level basis functions and subject-specific scores. The basis functions have a clear interpretation, analogous to that of PCA: the first basis function explains the largest direction of variation, and each subsequent basis function describes less. The FPCA model is typically written
$$ X_i(r) = \mu(r) + \sum_{k=1}^{K} c_{ik}\psi_{k}(r) + \epsilon_i(r) $$
where μ(r) is the population mean, ψk(r) are a set of orthonormal population-level functional principal components, cik are subject-specific scores with mean zero and variance λk, and ϵi(r) are residual curves. Estimated functional principal components ψ̂1(r), ψ̂2(r), …, ψ̂K(r) and corresponding variances λ̂1 ≥ λ̂2 ≥ … ≥ λ̂K are obtained from a truncated Karhunen-Lo`eve decomposition of the sample covariance $\widehat{\Sigma}(s,t) = \widehat{\mbox{Cov}}(Y_i(s), X_i(r))$. The truncation lag K is often chosen so that the resulting approximation accounts for at least 95% of observed variance (Wrobel et al. (2016)).
Conceptually, functional principal components can be thought of as patterns in the data that explain the most variance, and scores cik are weights that describe how much each functional principal component contributes to the shape of a given subject’s spatial summary function.
Here we load data from the VectraPolarisData ovarian cancer dataset, which contains tumor microarray data for 128 women with high-grade serous ovarian cancer, where univariate nearest-neighbor G-functions for immune cells have already been extracted. This data has been modified, but the original data and experiment are described in Steinhart et al. (2021). Each subject has one image.
data("ovarian_FDA")
ovarian_FDA
#> mxFDA Object:
#> Subjects: 128
#> Samples: 128
#> Has spatial data
#> Univariate Summaries: Gest
#> Bivariate Summaries: None
#> Multivariate Summaries: None
#> FPCs not yet calculated
#> MFPCs not yet calculated
#> FCMs not yet calculated
#> MFCMs not yet calculated
#> Scalar on Functional Regression not calculated
These functions are visualized below.
plot(ovarian_FDA, y = "fundiff", what = "uni g", sampleID = "patient_id") +
geom_hline(yintercept = 0, color = "red", linetype = 2) +
theme_minimal()
Values above the dotted red line are interpreted as having a higher probability of observing a neighboring immune cell at radius r than would be expected under complete spatial randomness (CSR). Conversely, values below the dotted line exhibit lower probability of observing a neighboring immune cell than expected under CSR.
Below we run FPCA on these G-function curves, using the
run_fpca()
function, and store these results in the
mxFDAobject
called ovarian_FDA
. The argument
metric
is used to specify which spatial summary function to
estimate FPCA for - in this case "uni g"
specifies the
univariate G function, r
denotes which variable specifies
the radius, and value
denotes which value are the values of
the spatial summary function. The argument pve
takes a
number from 0 to 1, and specifies the percent variance explained by the
FPCA decomposition. For example, if pve = 0.98
is selected,
the resulting FPCA will return the number of FPCs that explain 98% of
variance in the data.
ovarian_FDA <- run_fpca(ovarian_FDA,
metric = "uni g",
r = "r",
value = "fundiff",
pve = .95)
#> 128 sample have >= 4 values for FPCA; removing 0 samples
Note that the summary of this object now shows the number of functional principal components (FPCs) that have been calculated:
summary(ovarian_FDA)
#> mxFDA Object:
#> Subjects: 128
#> Samples: 128
#> Has spatial data
#> Univariate Summaries: Gest
#> Bivariate Summaries: None
#> Multivariate Summaries: None
#> FPCs Calculated:
#> Gest: 4 FPCs describe 96.9% variance
#> MFPCs not yet calculated
#> FCMs not yet calculated
#> MFCMs not yet calculated
#> Scalar on Functional Regression not calculated
The FPCA results are stored in the functional_pca
slot
of the ovarian_FDA
object. To access this slot directly and
view the extracted summary functions, type:
We can visualize the results of FPCA using the plot()
function (see ?plot.mxFDA
for details). The output of
plot()
is a {ggplot2}
object which can then be
easily added to/manipulated as any ggplot plot would. Below we plot the
first two functional principal components from our FPCA analysis of the
G summary functions.
p1 = plot(ovarian_FDA, what = 'uni g fpca', pc_choice = 1)
p2 = plot(ovarian_FDA, what = 'uni g fpca', pc_choice = 2)
ggarrange(p1, p2, nrow = 1, ncol = 2)
The plots above show $\hat{\mu}(r)\pm \sqrt{\hat{\lambda}_1\hat{\psi_k}(t)}$ for the first and second FPC. The black line in each plot is the mean of the data- this shows that the peak in probability of observing a neighboring immune cell occurs, on average, at about radius r = 12. In these plots, the blue and red lines represent one standard deviation of the FPC added (blue) or subtracted (red) from the population mean. For example in the plot for FPC 1, we can interpret the pattern of this FPC as a shift up or down from the population mean. Subjects who have a high score, ci1 for FPC 1 have, across radii, a larger probability of observing a neighboring immune cell than average in the dataset. This pattern explains 76.2% of variance in the data.
We can further explore the results of FPCA interactively using the
{refund.shiny}
package from Wrobel
et al. (2016). Execute the code below to launch a Shiny app with
interactive plots.
Multilevel functional principal components analysis (MFPCA) extends the ideas of FPCA to functional data with a multilevel structure. In the case of multiplex imaging data, this structure arises from multiple samples or ROIs for each subject. MFPCA models the within-subject correlation induced by repeated measures as well as the between-subject correlation modeled by classic FPCA. This leads to a two-level FPC decomposition, where level 1 concerns subject-specific effects and level 2 concerns sample-specific effects. Population-level basis functions and subject-specific scores are calculated for both levels (Di et al. (2009), Cui et al. (2023)). The MFPCA model is:
$$ X_{ij}(r) = \mu(r) + \sum_{k_1=1}^{K_1} c_{ik}^{(1)}\psi_{k}^{(1)}(r) + \sum_{k_2=1}^{K_2}c^{(2)}_{ijk}\psi_{k}^{(2)}(r) + \epsilon_{ij}(r) $$
where μ(r) is the population mean, ψk(1)(r) and ψk(2)(r) are the functional principal components for levels 1 and 2, respectively, and cik(1) and cijk(2) are the subject-specific and subject-sample-specific scores. K1 and K2 represent the number of FPCs for levels 1 and 2, respectively.
For MFPCA we will use the non-small cell lung carcinoma data, since we need multiple samples per subject. Below we extract the univariate K functions from the lung data.
data(lung_df)
clinical = lung_df %>%
select(image_id, patient_id, patientImage_id, gender, age, survival_days, survival_status, stage) %>%
distinct()
spatial = lung_df %>%
select(-image_id, -gender, -age, -survival_days, -survival_status, -stage)
mxFDAobject = make_mxfda(metadata = clinical,
spatial = spatial,
subject_key = "patient_id",
sample_key = "patientImage_id"
)
mxFDAobject = extract_summary_functions(mxFDAobject,
extract_func = univariate,
summary_func = Kest,
r_vec = seq(0, 100, by = 1),
edge_correction = "iso",
markvar = "immune",
mark1 = "immune")
#> Using Theoretical Complete Spatial Randomness for Ripley's K
#> ■■■■■■■■■■■■ 37% | ETA: 6s
#> ■■■■■■■■■■■■■■■■■■■■■■■ 72% | ETA: 2s
These K-functions (with CSR substracted) are visualized below. Each black line is an individual, and values above zero at a particular radius indicate spatial clustering beyond what is expected under CSR.
plot(mxFDAobject, y = "fundiff", what = "uni k", sampleID = "patientImage_id") +
geom_hline(yintercept = 0, color = "red", linetype = 2)
Below we run MFPCA on these K-function curves, using the
run_mfpca()
function, and store these results in the
mxFDAobject
lung cancer data object. Other arguments are
the same as the run_fpca()
function. From the summary of
the object we see that 2 level 1 FPCs and 2 level 2 FPCs were
calculated.
mxFDAobject <- run_mfpca(mxFDAobject,
metric = "uni k",
r = "r",
value = "fundiff",
pve = .99)
#> 247 sample have >= 4 values for FPCA; removing 0 samples
#> Joining with `by = join_by(patient_id)`
mxFDAobject
#> mxFDA Object:
#> Subjects: 50
#> Samples: 247
#> Has spatial data
#> Univariate Summaries: Kest
#> Bivariate Summaries: None
#> Multivariate Summaries: None
#> FPCs not yet calculated
#> MFPCs Calculated:
#> Kest: 1 Level1 MFPCs and 2 Level2 MFPCs explaining 99.7% and 99.1% variance, respectively
#> FCMs not yet calculated
#> MFCMs not yet calculated
#> Scalar on Functional Regression not calculated
We can also visualize the results of MFPCA using the
plot()
function (see ?plot.mxFDA
for details).
Below we plot the first FPC for level 1 and the first FPC for level
2.
p = plot(mxFDAobject, what = 'uni k mfpca', level1 = 1, level2 = 1)
ggarrange(plotlist = p, nrow = 1, ncol = 2)
The left plot shows $\hat{\mu}(r)\pm \sqrt{\hat{\lambda}_1^{(1)}}\hat{\psi_1}^{(1)}(t)$, and indicates that the first level 1 FPC explains essentially a shift up or down from the population mean. Subjects who have a high score, ci1 for FPC 1 have, across radii, have more spatial clustering of immune cells than average in the dataset. The right plot shows $\hat{\mu}(r)\pm \sqrt{\hat{\lambda}_1^{(2)}}\hat{\psi}_1^{(2)}(t)$. This plot shows the pattern explained by FPC 1 from level 2. Essentially, it is the main pattern of variation explain differences across samples within a given subjects and shows that within a subject, differences are primarily driven by some images having higher overall K-function values and some being lower.