Grouped Hyper Data Frame

Introduction

This vignette of package groupedHyperframe (CRAN, GitHub, RPubs) documents the creation of groupedHyperframe objects, the batch processes for a groupedHyperframe, and the aggregation of various statistics over a multi-level grouping structure.

Prerequisite

Package groupedHyperframe may require the development versions of the spatstat family.

devtools::install_github('spatstat/spatstat')
devtools::install_github('spatstat/spatstat.data')
devtools::install_github('spatstat/spatstat.explore')
devtools::install_github('spatstat/spatstat.geom')
devtools::install_github('spatstat/spatstat.linnet')
devtools::install_github('spatstat/spatstat.model')
devtools::install_github('spatstat/spatstat.random')
devtools::install_github('spatstat/spatstat.sparse')
devtools::install_github('spatstat/spatstat.univar')
devtools::install_github('spatstat/spatstat.utils')

Note to Users

Examples in this vignette require the following packages on the search path:

library(groupedHyperframe)
library(spatstat.data)
library(survival) # to help hyperframe understand Surv object

To engage all CPU cores on the current host under macOS, users should remove the argument mc.cores = 1L from all examples. This vignette sets mc.cores = 1L only to pass CRAN’s submission check.
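As a minimal sketch of how an mc.cores value is consumed, the following base-R snippet passes a core count to parallel::mclapply. The fallback to a single core (when parallel::detectCores cannot determine the count, or on Windows where forking is unavailable) is an assumption of this sketch; how groupedHyperframe resolves its own default is not shown in this vignette.

```r
library(parallel)

# Number of cores reported by the host; may be NA on some platforms
n_cores = detectCores()

# Assumption of this sketch: fall back to one core when the count is
# unknown, or on Windows where mclapply() cannot fork
if (is.na(n_cores) || .Platform$OS.type == 'windows') n_cores = 1L

# Toy workload: square each element, one task per element
res = 1:4 |> mclapply(FUN = function(i) i^2, mc.cores = n_cores)
unlist(res)
```

Setting mc.cores = 1L, as the examples in this vignette do, runs the same tasks sequentially and returns the same result.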

Terms and Abbreviations

Term / Abbreviation       Description                                    Reference
|>                        Forward pipe operator                          ?base::pipeOp introduced in R 4.1.0
attr                      Attributes                                     base::attr; base::attributes
CRAN, R                   The Comprehensive R Archive Network            https://cran.r-project.org
data.frame                Data frame                                     base::data.frame
formula                   Formula                                        stats::formula
fv, fv.object, fv.plot    (Plot of) function value table                 spatstat.explore::fv.object, spatstat.explore::plot.fv
groupedData, ~ g1/.../gm  Grouped data frame; nested grouping structure  nlme::groupedData; nlme::lme
hypercolumns, hyperframe  (Hyper columns of) hyper data frame            spatstat.geom::hyperframe
inherits                  Class inheritance                              base::inherits
kerndens                  Kernel density                                 stats::density.default()$y
mc.cores                  Number of CPU cores to use                     parallel::mclapply; parallel::detectCores
multitype                 Multitype object                               spatstat.geom::is.multitype
object.size               Memory allocation                              utils::object.size
pmean, pmedian            Parallel mean and median                       groupedHyperframe::pmean; groupedHyperframe::pmedian
pmax, pmin                Parallel maxima and minima                     base::pmax; base::pmin
ppp, ppp.object           (Marked) point pattern                         spatstat.geom::ppp.object
quantile                  Quantile                                       stats::quantile
save, xz                  Save with xz compression                       base::save(., compress = 'xz'); base::saveRDS(., compress = 'xz'); https://en.wikipedia.org/wiki/XZ_Utils
S3, generic, methods      S3 object oriented system                      base::UseMethod; utils::methods; utils::getS3method; https://adv-r.hadley.nz/s3.html
search                    Search path                                    base::search
Surv                      Survival object                                survival::Surv
trapz, cumtrapz           (Cumulative) trapezoidal integration           pracma::trapz; pracma::cumtrapz; https://en.wikipedia.org/wiki/Trapezoidal_rule

Acknowledgement

This work is supported by NCI R01CA222847 (I. Chervoneva, T. Zhan, and H. Rui) and R01CA253977 (H. Rui and I. Chervoneva).

groupedHyperframe Class

The S3 class groupedHyperframe inherits from the hyperframe class, just as the groupedData class inherits from the data.frame class.

In addition to everything a hyperframe object carries, a groupedHyperframe object has the attribute

  • attr(., 'group'), a formula to specify the (nested) grouping structure
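The inheritance-plus-attribute pattern can be sketched in base R. The snippet below uses a plain data.frame and a hypothetical class name groupedDF as stand-ins, so that it runs without the spatstat family; it is an illustration of the S3 mechanism, not the package's constructor.

```r
# Toy table with a nested structure: brick nested in id
x = data.frame(id = c('a', 'a', 'b'), brick = 1:3)

# Store the grouping structure as a formula attribute
attr(x, which = 'group') = ~ id/brick

# Prepend a (hypothetical) subclass; the parent class is retained
class(x) = c('groupedDF', class(x))

inherits(x, what = 'data.frame')  # parent-class methods still apply
all.vars(attr(x, which = 'group'))  # grouping variables, highest level first
```

Because the parent class stays in the class vector, any method written for the parent continues to dispatch on the new object.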

Create a groupedHyperframe

From a hyperframe

The S3 method as.groupedHyperframe.hyperframe() converts a hyperframe to a groupedHyperframe. In the data set spatstat.data::osteo, the serial number of the sampling volume (brick) is nested in the bone sample id:

osteo |> as.groupedHyperframe(group = ~ id/brick)
#> Grouped Hyperframe: ~id/brick
#> 
#> 40 brick nested in
#> 4 id
#> 
#>        id shortid brick   pts depth
#> 1  c77za4       4     1 (pp3)    45
#> 2  c77za4       4     2 (pp3)    60
#> 3  c77za4       4     3 (pp3)    55
#> 4  c77za4       4     4 (pp3)    60
#> 5  c77za4       4     5 (pp3)    85
#> 6  c77za4       4     6 (pp3)    90
#> 7  c77za4       4     7 (pp3)    95
#> 8  c77za4       4     8 (pp3)    65
#> 9  c77za4       4     9 (pp3)   100
#> 10 c77za4       4    10 (pp3)   100

From a data.frame

The S3 method as.groupedHyperframe.data.frame() converts a data.frame to a groupedHyperframe. This function inspects the input by the (nested) grouping structure, identifies the column(s) whose elements are not identical within the lowest group, and converts them into hypercolumns. Data set Ki67. in this package has one such column, logKi67, under the nested grouping structure ~ patientID/tissueID.

(Ki67g = Ki67. |> as.groupedHyperframe(group = ~ patientID/tissueID, mc.cores = 1L))
#> Grouped Hyperframe: ~patientID/tissueID
#> 
#> 6 tissueID nested in
#> 6 patientID
#> 
#>     logKi67 tissueID Tstage  PFS recfreesurv_mon recurrence adj_rad adj_chemo
#> 1 (numeric) TJUe_I17      2 100+             100          0   FALSE     FALSE
#> 2 (numeric) TJUe_G17      1   22              22          1   FALSE     FALSE
#> 3 (numeric) TJUe_F17      1  99+              99          0   FALSE        NA
#> 4 (numeric) TJUe_D17      1  99+              99          0   FALSE      TRUE
#> 5 (numeric) TJUe_J18      1  112             112          1    TRUE      TRUE
#> 6 (numeric) TJUe_N17      4   12              12          1    TRUE     FALSE
#>   histology  Her2   HR  node  race age patientID
#> 1         3  TRUE TRUE  TRUE White  66   PT00037
#> 2         3 FALSE TRUE FALSE Black  42   PT00039
#> 3         3 FALSE TRUE FALSE White  60   PT00040
#> 4         3 FALSE TRUE  TRUE White  53   PT00042
#> 5         3 FALSE TRUE  TRUE White  52   PT00054
#> 6         2  TRUE TRUE  TRUE Black  51   PT00059
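The inspection described above (find the columns that vary within the lowest group, then collapse the rest) can be sketched in base R. The toy data and the variable names below are made up for illustration, and this is not the package's implementation.

```r
# Toy data: two tissues, several cells each; only logKi67 varies within a tissue
d = data.frame(
  patientID = c('p1', 'p1', 'p1', 'p2', 'p2'),
  tissueID = c('t1', 't1', 't1', 't2', 't2'),
  age = c(60, 60, 60, 45, 45),
  logKi67 = c(1.2, 0.8, 1.5, 2.1, 1.9)
)

# Lowest grouping level of ~ patientID/tissueID
g = interaction(d$patientID, d$tissueID, drop = TRUE)

# For each column: is it constant within every lowest-level group?
is_const = vapply(d, FUN = function(x) {
  all(vapply(split(x, f = g), FUN = function(v) length(unique(v)) == 1L, FUN.VALUE = NA))
}, FUN.VALUE = NA)

names(d)[!is_const]  # candidate hypercolumn(s)

# One row per group for the constant columns
collapsed = d[!duplicated(g), is_const, drop = FALSE]

# One vector per group for the varying column
hyper = split(d$logKi67, f = g)
```

Here collapsed has one row per tissue, and hyper holds the per-tissue vectors that a hypercolumn would store.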

Converting a data.frame with cell intensities, etc., into a groupedHyperframe reduces memory allocation, but does not much reduce the size of the saved file if xz compression is used.

unclass(object.size(Ki67g)) / unclass(object.size(Ki67.))
#> [1] 0.1148083
f_g = tempfile(fileext = '.rds')
Ki67g |> saveRDS(file = f_g, compress = 'xz')
f = tempfile(fileext = '.rds')
Ki67. |> saveRDS(file = f, compress = 'xz')
file.size(f_g) / file.size(f) # not much reduction
#> [1] 0.9629481

Create a groupedHyperframe with ppp-hypercolumn

Function grouped_ppp() creates a groupedHyperframe with one-and-only-one ppp-hypercolumn. In the following example, the argument formula specifies

  • the marks, e.g., numeric mark hladr and multitype mark phenotype, on the left-hand-side
  • the additional predictors and/or endpoints for downstream analysis, e.g., OS, gender and age, before the | separator on the right-hand-side
  • the grouping structure, e.g., image_id nested in patient_id, after the | separator on the right-hand-side.

(s = grouped_ppp(formula = hladr + phenotype ~ OS + gender + age | patient_id/image_id, 
                 data = wrobel_lung, mc.cores = 1L))
#> Grouped Hyperframe: ~patient_id/image_id
#> 
#> 25 image_id nested in
#> 5 patient_id
#> 
#>       OS gender age    patient_id          image_id  ppp.
#> 1  3488+      F  85 #01 0-889-121 [40864,18015].im3 (ppp)
#> 2  3488+      F  85 #01 0-889-121 [42689,19214].im3 (ppp)
#> 3  3488+      F  85 #01 0-889-121 [42806,16718].im3 (ppp)
#> 4  3488+      F  85 #01 0-889-121 [44311,17766].im3 (ppp)
#> 5  3488+      F  85 #01 0-889-121 [45366,16647].im3 (ppp)
#> 6   1605      M  66 #02 1-037-393 [56576,16907].im3 (ppp)
#> 7   1605      M  66 #02 1-037-393 [56583,15235].im3 (ppp)
#> 8   1605      M  66 #02 1-037-393 [57130,16082].im3 (ppp)
#> 9   1605      M  66 #02 1-037-393 [57396,17896].im3 (ppp)
#> 10  1605      M  66 #02 1-037-393 [57403,16934].im3 (ppp)
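The three parts of such a formula can be pulled apart with base R alone, because R parses a ~ b | c as a two-sided formula whose right-hand side is a call to |. The following is a minimal sketch of that decomposition; it does not claim to reproduce how grouped_ppp() processes its formula argument internally.

```r
f = hladr + phenotype ~ OS + gender + age | patient_id/image_id

lhs = f[[2L]]  # left-hand side: the marks
rhs = f[[3L]]  # right-hand side: a call to the | operator

as.character(rhs[[1L]])  # the separator

all.vars(lhs)         # marks
all.vars(rhs[[2L]])   # predictors and/or endpoints, before |
all.vars(rhs[[3L]])   # grouping structure, after |
```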

Batch Process on ppp-hypercolumn

In this section, we outline the batch processes of spatial point pattern analyses applicable to the one-and-only-one ppp-hypercolumn of a hyperframe. We do not plan to extend these batch processes to a hyperframe with multiple ppp-hypercolumns in the foreseeable future, as that would require checking for name clashes in the $marks from multiple ppp-hypercolumns.

… which adds a fv-hypercolumn

Batch Process  Workhorse in spatstat.explore  Applicable To    fv-hypercolumn Suffix
Emark_()       Emark()                        numeric marks    .E
Vmark_()       Vmark()                        numeric marks    .V
markcorr_()    markcorr()                     numeric marks    .k
markvario_()   markvario()                    numeric marks    .gamma
Gcross_()      Gcross()                       multitype marks  .G
Kcross_()      Kcross()                       multitype marks  .K
Jcross_()      Jcross()                       multitype marks  .J

… which adds a numeric-hypercolumn

Batch Process  Workhorse in spatstat.geom     Applicable To    numeric-hypercolumn Suffix
nncross_()     nncross.ppp(., what = 'dist')  multitype marks  .nncross
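To make concrete what a nearest-neighbour distance from one mark type to another means, here is a base-R sketch on a toy pattern with made-up coordinates and type labels. It computes, for each point of type A, the distance to the nearest point of type B by brute force; spatstat.geom::nncross.ppp is the actual workhorse and is not called here.

```r
# Toy multitype pattern: coordinates plus a type label per point
x = c(0, 1, 4, 6)
y = c(0, 0, 0, 0)
type = c('A', 'B', 'A', 'B')

i = (type == 'A')  # 'from' points
j = (type == 'B')  # 'to' points

# Pairwise Euclidean distances; rows = type-A points, columns = type-B points
d = sqrt(outer(x[i], x[j], FUN = '-')^2 + outer(y[i], y[j], FUN = '-')^2)

# Nearest type-B neighbour of each type-A point
nn = apply(d, MARGIN = 1L, FUN = min)
nn
```

The result has one distance per type-A point, which is why the batch process yields a numeric vector per point pattern.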

Pipe operator compatible

Multiple batch processes may be applied to a hyperframe (or groupedHyperframe) in a pipeline.

r = seq.int(from = 0, to = 250, by = 10)
out = s |>
  Emark_(r = r, correction = 'best', mc.cores = 1L) |> # slow
  # Vmark_(r = r, correction = 'best', mc.cores = 1L) |> # slow
  # markcorr_(r = r, correction = 'best', mc.cores = 1L) |> # slow
  # markvario_(r = r, correction = 'best', mc.cores = 1L) |> # slow
  Gcross_(i = 'CK+.CD8-', j = 'CK-.CD8+', r = r, correction = 'best', mc.cores = 1L) |> # fast
  # Kcross_(i = 'CK+.CD8-', j = 'CK-.CD8+', r = r, correction = 'best', mc.cores = 1L) |> # fast
  nncross_(i = 'CK+.CD8-', j = 'CK-.CD8+', correction = 'best', mc.cores = 1L) # fast

The returned hyperframe (or groupedHyperframe) has

  • fv-hypercolumn hladr.E, created by function Emark_() on numeric mark hladr
  • fv-hypercolumn phenotype.G, created by function Gcross_() on multitype mark phenotype
  • numeric-hypercolumn phenotype.nncross, created by function nncross_() on multitype mark phenotype

out
#> Grouped Hyperframe: ~patient_id/image_id
#> 
#> 25 image_id nested in
#> 5 patient_id
#> 
#>       OS gender age    patient_id          image_id  ppp. hladr.E phenotype.G
#> 1  3488+      F  85 #01 0-889-121 [40864,18015].im3 (ppp)    (fv)        (fv)
#> 2  3488+      F  85 #01 0-889-121 [42689,19214].im3 (ppp)    (fv)        (fv)
#> 3  3488+      F  85 #01 0-889-121 [42806,16718].im3 (ppp)    (fv)        (fv)
#> 4  3488+      F  85 #01 0-889-121 [44311,17766].im3 (ppp)    (fv)        (fv)
#> 5  3488+      F  85 #01 0-889-121 [45366,16647].im3 (ppp)    (fv)        (fv)
#> 6   1605      M  66 #02 1-037-393 [56576,16907].im3 (ppp)    (fv)        (fv)
#> 7   1605      M  66 #02 1-037-393 [56583,15235].im3 (ppp)    (fv)        (fv)
#> 8   1605      M  66 #02 1-037-393 [57130,16082].im3 (ppp)    (fv)        (fv)
#> 9   1605      M  66 #02 1-037-393 [57396,17896].im3 (ppp)    (fv)        (fv)
#> 10  1605      M  66 #02 1-037-393 [57403,16934].im3 (ppp)    (fv)        (fv)
#>    phenotype.nncross
#> 1          (numeric)
#> 2          (numeric)
#> 3          (numeric)
#> 4          (numeric)
#> 5          (numeric)
#> 6          (numeric)
#> 7          (numeric)
#> 8          (numeric)
#> 9          (numeric)
#> 10         (numeric)

Aggregation Over Nested Grouping Structure

When a nested grouping structure ~g1/g2/.../gm is present, we may aggregate the

  • fv-hypercolumns
  • numeric-hypercolumns
  • numeric marks in the ppp-hypercolumn

by any one of the grouping levels ~g1, ~g2, …, or ~gm. If the lowest grouping level ~gm is specified, then no aggregation is performed.
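The effect of the chosen level can be illustrated with stats::aggregate on an ordinary data.frame. The toy data below (image nested in patient, one scalar value per image) is made up; the package functions operate on hypercolumns rather than scalars, so this is only an analogy.

```r
# Toy nested structure ~ patient/image, one value per image
d = data.frame(
  patient = c('p1', 'p1', 'p2', 'p2'),
  image = c('i1', 'i2', 'i3', 'i4'),
  value = c(1, 3, 10, 20)
)

# Aggregating by the higher level collapses the images of each patient
by_patient = aggregate(value ~ patient, data = d, FUN = mean)

# Aggregating by the lowest level leaves one row per image: nothing is collapsed
by_image = aggregate(value ~ patient + image, data = d, FUN = mean)

by_patient
nrow(by_image) == nrow(d)
```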

Aggregation of fv-hypercolumns

Function aggregate_fv() aggregates

  • the function values, i.e., the solid black curve of fv.plot. In the following example, we have
    • numeric-hypercolumns hladr.E.value and phenotype.G.value, the aggregated function values from fv-hypercolumns hladr.E and phenotype.G
  • the cumulative trapezoidal integral under the solid black curve. In the following example, we have
    • numeric-hypercolumns hladr.E.cumtrapz and phenotype.G.cumtrapz, the aggregated cumulative trapezoidal integrals from fv-hypercolumns hladr.E and phenotype.G

(afv = out |>
  aggregate_fv(by = ~ patient_id, f_aggr_ = pmean, mc.cores = 1L))
#> Column(s) image_id removed; as they are not identical per aggregation-group
#> Hyperframe:
#>      OS gender age    patient_id hladr.E.value hladr.E.cumtrapz
#> 1 3488+      F  85 #01 0-889-121     (numeric)        (numeric)
#> 2  1605      M  66 #02 1-037-393     (numeric)        (numeric)
#> 3   176      M  84 #03 2-080-378     (numeric)        (numeric)
#> 4 2042+      M  79 #04 2-223-153     (numeric)        (numeric)
#> 5 3747+      M  68 #05 2-286-740     (numeric)        (numeric)
#>   phenotype.G.value phenotype.G.cumtrapz
#> 1         (numeric)            (numeric)
#> 2         (numeric)            (numeric)
#> 3         (numeric)            (numeric)
#> 4         (numeric)            (numeric)
#> 5         (numeric)            (numeric)

Each of the numeric-hypercolumns contains tabulated values on the common grid of r. One “slice” of this grid may be extracted by

afv$hladr.E.cumtrapz |> .slice(j = '50')
#>         1         2         3         4         5 
#> 10.489960 10.463419 31.248955  3.162186 23.635120
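The three ideas in this subsection (an element-wise mean across images, a cumulative trapezoidal integral along r, and a slice at one value of r) can be sketched in base R. The helpers pmean_sketch and cumtrapz_sketch are hypothetical stand-ins written for this illustration; they are not groupedHyperframe::pmean or pracma::cumtrapz, and the order in which the package integrates and averages is not asserted here.

```r
# Common grid of r, and function values from two images of one patient (toy numbers)
r = c(0, 10, 20, 30)
fv_values = list(img1 = c(0, 1, 2, 3), img2 = c(0, 3, 4, 5))

# Hypothetical stand-in for an element-wise ('parallel') mean
pmean_sketch = function(lst) Reduce(f = `+`, x = lst) / length(lst)

# Hypothetical stand-in for cumulative trapezoidal integration
cumtrapz_sketch = function(x, y) {
  c(0, cumsum(diff(x) * (y[-1L] + y[-length(y)]) / 2))
}

avg = pmean_sketch(fv_values)    # 0, 2, 3, 4
ct = cumtrapz_sketch(r, y = avg) # 0, 10, 35, 70

# Name the values by r, so that a 'slice' is a lookup by name
names(ct) = r
ct['20']
```

Since both the mean and the trapezoidal rule are linear, averaging first and integrating second gives the same numbers as the reverse order in this sketch.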

Aggregation of numeric-hypercolumns and numeric mark(s) in ppp-hypercolumn

Function aggregate_quantile() aggregates the quantiles of

  • the numeric-hypercolumns. In the following example, we have
    • numeric-hypercolumn phenotype.nncross.quantile, aggregated quantile of numeric-hypercolumn phenotype.nncross
  • the numeric mark(s) in the ppp-hypercolumn. In the following example, we have
    • numeric-hypercolumn hladr.quantile, aggregated quantile of numeric mark hladr in ppp-hypercolumn

out |>
  aggregate_quantile(by = ~ patient_id, probs = seq.int(from = 0, to = 1, by = .1), mc.cores = 1L)
#> Column(s) image_id removed; as they are not identical per aggregation-group
#> Hyperframe:
#>      OS gender age    patient_id phenotype.nncross.quantile hladr.quantile
#> 1 3488+      F  85 #01 0-889-121                  (numeric)      (numeric)
#> 2  1605      M  66 #02 1-037-393                  (numeric)      (numeric)
#> 3   176      M  84 #03 2-080-378                  (numeric)      (numeric)
#> 4 2042+      M  79 #04 2-223-153                  (numeric)      (numeric)
#> 5 3747+      M  68 #05 2-286-740                  (numeric)      (numeric)
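One way to picture a quantile aggregation is sketched below in base R. It assumes that quantiles are computed per image at a common set of probabilities and then averaged element-wise within a patient; that order of operations is an assumption of this sketch, not something this vignette states about aggregate_quantile(). The toy distances are made up.

```r
# Toy nearest-neighbour distances from two images of one patient
nn = list(img1 = c(1, 2, 3, 4, 5), img2 = c(2, 4, 6, 8, 10))

# Common probabilities, so that per-image quantiles line up
probs = c(0, .5, 1)

# One column of quantiles per image
q = vapply(nn, FUN = quantile, probs = probs, FUN.VALUE = numeric(length(probs)))

# Element-wise mean across images
rowMeans(q)
```

A shared probs vector plays the same role as the common grid of r for fv-hypercolumns: it makes the per-image vectors comparable position by position.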

Function aggregate_kerndens() aggregates the kernel density of

  • the numeric-hypercolumns. In the following example, we have
    • numeric-hypercolumn phenotype.nncross.kerndens, aggregated kernel density of numeric-hypercolumn phenotype.nncross
  • the numeric mark(s) in the ppp-hypercolumn. In the following example, we have
    • numeric-hypercolumn hladr.kerndens, aggregated kernel density of numeric mark hladr in ppp-hypercolumn

(mdist = out$phenotype.nncross |> unlist() |> max())
#> [1] 354.2968
out |> 
  aggregate_kerndens(by = ~ patient_id, from = 0, to = mdist, mc.cores = 1L)
#> Column(s) image_id removed; as they are not identical per aggregation-group
#> Hyperframe:
#>      OS gender age    patient_id phenotype.nncross.kerndens hladr.kerndens
#> 1 3488+      F  85 #01 0-889-121                  (numeric)      (numeric)
#> 2  1605      M  66 #02 1-037-393                  (numeric)      (numeric)
#> 3   176      M  84 #03 2-080-378                  (numeric)      (numeric)
#> 4 2042+      M  79 #04 2-223-153                  (numeric)      (numeric)
#> 5 3747+      M  68 #05 2-286-740                  (numeric)      (numeric)
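The role of the from and to arguments can be seen in a base-R sketch with stats::density. With shared limits, every image's kernel density is evaluated on the same grid, so the $y vectors can be combined position by position. The toy data and the element-wise mean are assumptions of this sketch, not a description of the internals of aggregate_kerndens().

```r
# Toy distances from two images of one patient
x1 = c(1, 2, 3, 5, 8)
x2 = c(2, 3, 5, 8, 13)

# Shared upper limit, as with mdist in the example above
mdist = max(x1, x2)

k1 = density(x1, from = 0, to = mdist)
k2 = density(x2, from = 0, to = mdist)

# Same evaluation grid for both images
identical(k1$x, k2$x)

# Element-wise mean of the density values
avg = (k1$y + k2$y) / 2
length(avg)
```

Without shared limits, density() chooses its grid from each input's own range, and the resulting vectors would not be comparable element by element.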