---
title: "Grouped Hyper Data Frame"
output: rmarkdown::html_vignette
author: Tingting Zhan
vignette: >
  %\VignetteIndexEntry{intro}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
library(knitr)
opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(rmarkdown.html_vignette.check_title = FALSE)
```

# Introduction

This vignette of package **`groupedHyperframe`** ([CRAN](https://cran.r-project.org/package=groupedHyperframe), [Github](https://github.com/tingtingzhan/groupedHyperframe), [RPubs](https://rpubs.com/tingtingzhan/groupedHyperframe)) documents the creation of `groupedHyperframe` object, the batch processes for a `groupedHyperframe`, and aggregations of various statistics over multi-level grouping structure.

## Prerequisite

Package **`groupedHyperframe`** may require the development versions of the **`spatstat`** family.

```{r eval = FALSE}
devtools::install_github('spatstat/spatstat')
devtools::install_github('spatstat/spatstat.data')
devtools::install_github('spatstat/spatstat.explore')
devtools::install_github('spatstat/spatstat.geom')
devtools::install_github('spatstat/spatstat.linnet')
devtools::install_github('spatstat/spatstat.model')
devtools::install_github('spatstat/spatstat.random')
devtools::install_github('spatstat/spatstat.sparse')
devtools::install_github('spatstat/spatstat.univar')
devtools::install_github('spatstat/spatstat.utils')
```

## Note to Users

Examples in this vignette require that the `search` path has

```{r setup}
library(groupedHyperframe)
library(spatstat.data)
library(survival) # to help hyperframe understand Surv object
```

Users should remove the parameter `mc.cores = 1L` from all examples to engage all CPU cores on the current host under macOS. The authors of package **`groupedHyperframe`** are forced to have `mc.cores = 1L` in this vignette to pass `CRAN`'s submission check.

## Terms and Abbreviations

```{r echo = FALSE, results = 'asis'}
c(
  '', 'Forward pipe operator', '`?base::pipeOp` introduced in `R` 4.1.0', 
  '`attr`', 'Attributes', '`base::attr`; `base::attributes`',
  '`CRAN`, `R`', 'The Comprehensive R Archive Network', 'https://cran.r-project.org',
  '`data.frame`', 'Data frame', '`base::data.frame`',
  '`formula`', 'Formula', '`stats::formula`',
  '`fv`, `fv.object`, `fv.plot`', '(Plot of) function value table', '`spatstat.explore::fv.object`, `spatstat.explore::plot.fv`',
  '`groupedData`, `~ g1/.../gm`', 'Grouped data frame; nested grouping structure', '`nlme::groupedData`; `nlme::lme`',
  '`hypercolumns`, `hyperframe`', '(Hyper columns of) hyper data frame', '`spatstat.geom::hyperframe`',
  '`inherits`', 'Class inheritance', '`base::inherits`',
  '`kerndens`', 'Kernel density', '`stats::density.default()$y`',
  # '`matrix`', 'Matrix', '`base::matrix`', # hahaha!!!
  '`mc.cores`', 'Number of CPU cores to use', '`parallel::mclapply`; `parallel::detectCores`',
  '`multitype`', 'Multitype object', '`spatstat.geom::is.multitype`',
  '`object.size`', 'Memory allocation', '`utils::object.size`',
  '`pmean`, `pmedian`', 'Parallel mean and median', '`groupedHyperframe::pmean`; `groupedHyperframe::pmedian`',
  '`pmax`, `pmin`', 'Parallel maxima and minima', '`base::pmax`; `base::pmin`',
  '`ppp`, `ppp.object`', '(Marked) point pattern', '`spatstat.geom::ppp.object`',
  '`quantile`', 'Quantile', '`stats::quantile`',
  '`save`, `xz`', 'Save with `xz` compression', '`base::save(., compress = \'xz\')`; `base::saveRDS(., compress = \'xz\')`; https://en.wikipedia.org/wiki/XZ_Utils', 
  '`S3`, `generic`, `methods`', '`S3` object oriented system',  '`base::UseMethod`; `utils::methods`; `utils::getS3method`; https://adv-r.hadley.nz/s3.html',
  '`search`', 'Search path', '`base::search`',
  '`Surv`', 'Survival object', '`survival::Surv`',
  '`trapz`, `cumtrapz`', '(Cumulative) trapezoidal integration', '`pracma::trapz`; `pracma::cumtrapz`; https://en.wikipedia.org/wiki/Trapezoidal_rule'
) |>
  matrix(nrow = 3L, dimnames = list(c('Term / Abbreviation', 'Description', 'Reference'), NULL)) |>
  t.default() |>
  as.data.frame.matrix() |> 
  kable(format = 'html') 
# ?knitr::kable
# default: `|` shown as &...
# format = 'html': `>` shown as &..
```

## Acknowledgement

This work supported by NCI R01CA222847 ([I. Chervoneva](https://orcid.org/0000-0002-9104-4505), [T. Zhan](https://orcid.org/0000-0001-9971-4844), and [H. Rui](https://orcid.org/0000-0002-8778-261X)) and R01CA253977 (H. Rui and I. Chervoneva).

# `groupedHyperframe` Class

The `S3` class `groupedHyperframe` `inherits` from the `hyperframe` class, in a similar fashion as the `groupedData` class inherits from the `data.frame` class.

A `groupedHyperframe` object, in addition to a `hyperframe` object, has attribute(s)

-   `attr(., 'group')`, a `formula` to specify the (nested) grouping structure

## Create a `groupedHyperframe`

### From a `hyperframe`

The `S3` method dispatch `as.groupedHyperframe.hyperframe()` converts a `hyperframe` to `groupedHyperframe`. Data set `spatstat.data::osteo` has the serial number of sampling volume `brick` nested in the bone sample `id`,

```{r}
osteo |> as.groupedHyperframe(group = ~ id/brick)
```

### From a `data.frame`

The `S3` method dispatch `as.groupedHyperframe.data.frame()` converts a `data.frame` to a `groupedHyperframe.` This function inspects the input by the (nested) grouping structure, identifies the column(s) with elements not identical within the lowest group, and converts them into `hypercolumns`. Data set **`Ki67.`** in this package has non-identical column *`logKi67`* in the nested grouping structure *`~ patientID/tissueID`*.

```{r}
(Ki67g = Ki67. |> as.groupedHyperframe(group = ~ patientID/tissueID, mc.cores = 1L))
```

Converting a `data.frame` with cell intensities, etc., into a `groupedHyperframe` reduces memory allocation, but does not reduce much the `save`d files size if `xz` compression is used.

```{r}
unclass(object.size(Ki67g)) / unclass(object.size(Ki67.))
```
```{r}
f_g = tempfile(fileext = '.rds')
Ki67g |> saveRDS(file = f_g, compress = 'xz')
f = tempfile(fileext = '.rds')
Ki67. |> saveRDS(file = f, compress = 'xz')
file.size(f_g) / file.size(f) # not much reduction
```

## Create a `groupedHyperframe` with `ppp`-`hypercolumn`

Function `grouped_ppp()` creates a `groupedHyperframe` with *one-and-only-one* `ppp`-`hypercolumn`. In the following example, the argument `formula` specifies

-   the marks, e.g., `numeric` mark *`hladr`* and `multitype` mark *`phenotype`*, on the left-hand-side
-   the additional predictors and/or endpoints for downstream analysis, e.g., *`OS`*, *`gender`* and *`age`*, before the `|` separator on the right-hand-side
-   the grouping structure, e.g., *`image_id`* nested in *`patient_id`*, after the `|` separator on the right-hand-side.

```{r}
(s = grouped_ppp(formula = hladr + phenotype ~ OS + gender + age | patient_id/image_id, 
                 data = wrobel_lung, mc.cores = 1L))
```

# Batch Process on `ppp`-`hypercolumn`

In this section, we outline the batch processes of spatial point pattern analyses applicable to the *one-and-only-one* `ppp`-`hypercolumn` of a `hyperframe`. These batch processes are not intended for a `hyperframe` with multiple `ppp`-`hypercolumns` in the foreseeable future, as that would require checking for name clashes in the `$marks` from multiple `ppp`-`hypercolumns`.

## ... which adds a `fv`-`hypercolumn`

```{r echo = FALSE, results = 'asis'}
c(
  '`Emark_()`', '`Emark()`', '`numeric` marks', '`.E`',
  '`Vmark_()`', '`Vmark()`', '`numeric` marks', '`.V`',
  '`markcorr_()`', '`markcorr()`', '`numeric` marks', '`.k`',
  '`markvario_()`', '`markvario()`', '`numeric` marks', '`.gamma`',
  '`Gcross_()`', '`Gcross()`', '`multitype` marks', '`.G`',
  '`Kcross_()`', '`Kcross()`', '`multitype` marks', '`.K`',
  '`Jcross_()`', '`Jcross()`', '`multitype` marks', '`.J`'
) |>
  matrix(nrow = 4L, dimnames = list(c('Batch Process', 'Workhorse in **`spatstat.explore`**', 'Applicable To', '`fv`-`hypercolumn` Suffix'), NULL)) |>
  t.default() |>
  as.data.frame.matrix() |> 
  kable()
```

## ... which adds a `numeric`-`hypercolumn`

```{r echo = FALSE, results = 'asis'}
c(
  '`nncross_()`', '`nncross.ppp(., what = \'dist\')`', '`multitype` marks', '`.nncross`'
) |>
  matrix(nrow = 4L, dimnames = list(c('Batch Process', 'Workhorse in **`spatstat.geom`**', 'Applicable To', '`numeric`-`hypercolumn` Suffix'), NULL)) |>
  t.default() |>
  as.data.frame.matrix() |> 
  kable()
```

## Pipe operator compatible

Multiple batch processes may be applied to a `hyperframe` (or `groupedHyperframe`) in a pipeline.

```{r}
r = seq.int(from = 0, to = 250, by = 10)
out = s |>
  Emark_(r = r, correction = 'best', mc.cores = 1L) |> # slow
  # Vmark_(r = r, correction = 'best', mc.cores = 1L) |> # slow
  # markcorr_(r = r, correction = 'best', mc.cores = 1L) |> # slow
  # markvario_(r = r, correction = 'best', mc.cores = 1L) |> # slow
  Gcross_(i = 'CK+.CD8-', j = 'CK-.CD8+', r = r, correction = 'best', mc.cores = 1L) |> # fast
  # Kcross_(i = 'CK+.CD8-', j = 'CK-.CD8+', r = r, correction = 'best', mc.cores = 1L) |> # fast
  nncross_(i = 'CK+.CD8-', j = 'CK-.CD8+', correction = 'best', mc.cores = 1L) # fast
```

The returned `hyperframe` (or `groupedHyperframe`) has

-   `fv`-`hypercolumn` *`hladr.E`*, created by function `Emark_()` on `numeric` mark *`hladr`*
-   `fv`-`hypercolumn` *`phenotype.G`*, created by function `Gcross_()` on `multitype` mark *`phenotype`*
-   `numeric`-`hypercolumn` *`phenotype.nncross`*, created by function `nncross_()` on `multitype` mark *`phenotype`*

```{r}
out
```

# Aggregation Over Nested Grouping Structure

When nested grouping structure `~g1/g2/.../gm` is present, we may aggregate over the

-   `fv`-`hypercolumns`
-   `numeric`-`hypercolumns`
-   `numeric` marks in the `ppp`-`hypercolumn`

by either one of the grouping levels `~g1`, `~g2`, ..., or `~gm`. If the lowest grouping `~gm` is specified, then no aggregation is performed.

## Aggregation of `fv`-`hypercolumns`

Function `aggregate_fv()` aggregates

-   the **function values**, i.e., the black-solid-curve of `fv.plot`. In the following example, we have
    -   `numeric`-`hypercolumns` *`hladr.E.value`* and *`phenotype.G.value`*, aggregated function values from `fv`-`hypercolumns` *`hladr.E`* and *`phenotype.G`*
-   the **cumulative trapezoidal integration** under the black-solid-curve. In the following example, we have
    -   `numeric`-`hypercolumns` *`hladr.E.cumtrapz`* and *`phenotype.G.cumtrapz`*, aggregated cumulative trapezoidal integration from `fv`-`hypercolumns` *`hladr.E`* and *`phenotype.G`*

```{r}
(afv = out |>
  aggregate_fv(by = ~ patient_id, f_aggr_ = pmean, mc.cores = 1L))
```

Each of the `numeric`-`hypercolumns` contains tabulated values on the common grid of `r`. One "slice" of this grid may be extracted by

```{r}
afv$hladr.E.cumtrapz |> .slice(j = '50')
```

## Aggregation of `numeric`-`hypercolumns` and `numeric` mark(s) in `ppp`-`hypercolumn`

Function `aggregate_quantile()` aggregates the quantile of

-   the `numeric`-`hypercolumns`. In the following example, we have
    -   `numeric`-`hypercolumn` *`phenotype.nncross.quantile`*, aggregated quantile of `numeric`-`hypercolumn` *`phenotype.nncross`*
-   the `numeric` mark(s) in the `ppp`-`hypercolumn.` In the following example, we have
    -   `numeric`-`hypercolumn` *`hladr.quantile`*, aggregated quantile of `numeric` mark *`hladr`* in `ppp`-`hypercolumn`

```{r}
out |>
  aggregate_quantile(by = ~ patient_id, probs = seq.int(from = 0, to = 1, by = .1), mc.cores = 1L)
```

Function `aggregate_kerndens()` aggregates the kernel density of

-   the `numeric`-`hypercolumns`. In the following example, we have
    -   `numeric`-`hypercolumn` *`phenotype.nncross.kerndens`*, aggregated kernel density of `numeric`-`hypercolumn` *`phenotype.nncross`*
-   the `numeric` mark(s) in the `ppp`-`hypercolumn`. In the following example, we have
    -   `numeric`-`hypercolumn` *`hladr.kerndens`*, aggregated kernel density of `numeric` mark *`hladr`* in `ppp`-`hypercolumn`

```{r}
(mdist = out$phenotype.nncross |> unlist() |> max())
out |> 
  aggregate_kerndens(by = ~ patient_id, from = 0, to = mdist, mc.cores = 1L)
```