---
title: "Overview of the prevtoinc package"
author: "Niklas Willrich"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Overview of the prevtoinc package}
  %\VignetteEncoding{UTF-8}
bibliography: main.bib
---

\newcommand{\Prev}{P} 
\newcommand{\Prhame}{P_{rhame}}
\newcommand{\Ipp}{I_{pp}} 
\newcommand{\Ipd}{I} 
\newcommand{\Ipphat}{\hat{I}_{pp}}
\newcommand{\Ipdhat}{\hat{I}}
\newcommand{\Prevhat}{\hat{P}}
\newcommand{\EE}{\mathbb{E}}
\newcommand{\Prob}{\mathbb{P}}
\newcommand{\age}{A} 
\newcommand{\XX}{X} 
\newcommand{\xx}{x}
\newcommand{\LL}{L} 
\newcommand{\Aloi}{\age_{loi}}
\newcommand{\Xloi}{\XX_{loi}}
\newcommand{\Lloi}{\LL_{loi}}
\newcommand{\aloi}{a_{loi}}
\newcommand{\xloi}{x_{loi}}
\newcommand{\lloi}{l_{loi}}
\newcommand{\Alos}{\age_{los}}
\newcommand{\Xlos}{\XX_{los}}
\newcommand{\Llos}{\LL_{los}}
\newcommand{\alos}{a_{los}}
\newcommand{\xlos}{\xx_{los}}
\newcommand{\llos}{l_{los}}
\newcommand{\Alnint}{\age_{LN-INT}}
\newcommand{\Xlnint}{\XX_{LN-INT}}
\newcommand{\Llnint}{\LL_{LN-INT}}
\newcommand{\alnint}{a_{LN-INT}}
\newcommand{\xlnint}{\xx_{LN-INT}}
\newcommand{\llnint}{l_{LN-INT}}
\newcommand{\median}{\operatorname{median}}
\newcommand{\mean}{\operatorname{mean}}
\newcommand{\xloigren}{x_{loi,\text{gren}}}
\newcommand{\xgren}{\hat{x}_{\text{gren}}}
\newcommand{\xrear}{\hat{x}_{\text{rear}}}


```{r, echo=FALSE}
knitr::opts_chunk$set(fig.width=5, fig.height=3) 
```

```{r,  warning=F, message=F}
require(prevtoinc)
```

This package implements functions to convert prevalence to incidence based on data obtained in point-prevalence studies (PPSs) along the lines of [@rhame],[@freeman1980] and is a companion to an upcoming paper [@willrich2019].
It also implements methods to simulate PPS-data to benchmark different estimation methods. Notation will follow the companion paper. So a good idea is to read the paper first and afterwards the vignette.

# The basic structure of the package

The package has functions to simulate PPS-data based on distributions of length of infection and length of stay. Results of PPS simulations and incidence calculations are stored as tibbles. The functionality in the package can be divided in three parts:

* functions for PPS simulation,
* functions to estimate incidence from PPS data,
* helper functions dealing with length-biased distributions and monotone smoothing.

In the following sections, we will go through these different aspects, starting with simulation of PPS-data.

## Simulating PPS data
<!-- 
export(simulate_incidence_stats)
export(simulate_incidence_stats_fast)
export(simulate_pps_data)
export(simulate_pps_fast)
-->
### Simulating a PPS

The function `simulate_pps_fast` can be used to generate PPS data.
This functions simulates a PPS on the basis of a given prevalence `P` using
a vector of probabilities `dist.X.loi` for the values 1:length(dist.X.loi) of $\Xloi$.
It directly samples the time of infection up to date based on `dist.X.loi`.
Optionally, the length of stay is also sampled independently  using `dist.X.los` which is in the same format as `dist.X.loi`. The sample is generated by treating the marginal distributions of length of stay and length of infection as independent
by assumption. 
Because of this non-joint sampling rows should not be interpreted as individual patients.


```{r}
example.dist <- create_dist_vec(function(x) dpois(x-1, 7), max.dist = 70)
example.dist.los <- create_dist_vec(function(x) dpois(x-1, lambda = 12),
                                    max.dist = 70)
data.pps.fast <- simulate_pps_fast(n.sample=5000,
                               P=0.05,
                               dist.X.loi = example.dist,
                               dist.X.los = example.dist.los)
head(data.pps.fast)
```

Values of zero for `A.loi` and `L.loi` indicate absence of a HAI.

The incidence rate per patient day $I$ and the expected length of infection in the whole population $\xloi$ for a given distribution of $X_{loi}$ and given $P$ can be calculated with `simulate_incidence_stats_fast` by supplying a prevalence `P` and a vector of probabilities `dist.X.loi` for $\Xloi$. Optionally one can also calculate $\Ipp$ and $\xlos$ if one supplies a vector of probabilites `dist.X.los` for $\Xlos$.


```{r}
data.fast.inc.theo <- simulate_incidence_stats_fast(P=0.05, 
                                                    dist.X.loi = example.dist,
                                                    dist.X.los = example.dist.los)
data.fast.inc.theo
```

## Extended simulation method

While the above method of simulation is fast and efficient and is useful for larger simulation studies, it is useful to have a more explicit simulation technique which samples from the joint distribution of $\Xlos$ and $\Xloi$ and gives more control over subpopulations of patients.

The setup of this simulation model is described in the following.

### Modelling assumptions

We assume the following setup. Patients arrive sequentially at a hospital X.
A `hospital` is a named list with the following named elements:

* a general factor `inc.factor` which modifies the risk of nosocomial
infections for all types of patients,
* a list of patients `patient.list` characterized below,
* a distribution of the patient types for the hospital given by `pat.dist` .

A `patient.type` is a list with the following named elements

* a vector `dist.X.los` of probabilities for the values `1:length(dist.X.los)` of $\Xlos$.,
* a vector `dist.X.loi` of probabilities for the values `1:length(dist.X.loi)` of $\Xloi$.,
* and the base incidence rate (per patient-day) `I.p` for this type of patient.

The base-value of the length of stay is additive with the possible length of a nosocomial infection. Clustering of infections is not explicitly modelled.

### Simulation environment
As an example we define a hospital with two different patient types.
```{r}
pat.1 <- list(dist.X.los = 
                create_dist_vec(function(x) dpois(x-1, lambda = 12),
                                70),
              I.p = 0.008,
              dist.X.loi = 
                create_dist_vec(function(x) dpois(x-1, lambda = 10),
                                70))

pat.2 <- list(dist.X.los = 
                create_dist_vec(function(x) dpois(x-1, lambda = 10),
                                70),
              I.p = 0.02,
              dist.X.loi = 
                create_dist_vec(function(x) dpois(x-1, lambda = 7),
                                70))

patient.list <- list(pat.1, pat.2)


# define distribution of patients
pat.1.prob <- 0.4; pat.2.prob <- 0.6
pat.dist.hosp <-  c(pat.1.prob, pat.2.prob)
hospital.1 <- list(inc.factor = 1,
                   pat.dist = pat.dist.hosp,
                   patient.list = patient.list)
```

Using `simulate_pps_data` one can generate PPS data by simulating the evolution of `n.sample` beds for `steps` days.

```{r, warning = FALSE}
data.pps <- simulate_pps_data(n.sample=5000, 
                          steps=200, 
                          hospital=hospital.1)
head(data.pps)
```
To get additional theoretical quantities based on the whole population, one can use `simulate_incidence_stats`.
```{r}
data.inc.theo <- simulate_incidence_stats(hospital.1, 365 * 1000)
# gives incidence rate I
data.inc.theo$I
# gives incidence proportion per admission
data.inc.theo$I.pp
# average length of stay of patients who did not have a HAI
data.inc.theo$x.los.wo.noso
# average length of stay of patients who had at least one HAI during their stay
data.inc.theo$x.los.only.noso
```


## Estimating incidence from PPS data

<!--
export(calculate_I_smooth)
export(calculate_I_rhame)
export(calculate_I)
export(generate_I_fast)
-->

### The new estimator

To use the newly proposed estimator _gren_ presented in the companion paper, one can use the function `calculate_I_smooth` with `method="gren"`. The `data` should be supplied as a data frame with at least a column named `A.loi` giving lengths of infection up to date of PPS.
Values of zero for `A.loi` indicate absence of a HAI. Optionally, the data frame can also contain a column `A.los` supplying lengths of stay up to PPS to estimate $\xlos$ with the same method as well.

```{r}
calculate_I_smooth(data = data.pps,
                  method = "gren")
data.inc.theo$I

calculate_I_smooth(data = data.pps.fast,
                  method = "gren")
data.fast.inc.theo$I
```

There is another variation of this estimator specified with `method = "rear"`.
This uses the rearrangement estimator studied in [@jankowski2009] instead of the Grenander estimator as an estimator for the monotonously decreasing distribution of $\Aloi$ and $\Alos$. We will denote this type of estimator by _rear_.

### Confidence intervals for the new estimator

There are two helper functions to calculate confidence intervals for the estimates of $\Ipp$ with the _gren_ estimator:
One ( `calculate_CI_I_pp`) is based on the typical output of `calculate_I_smooth`:
```{r}
gren_est <- calculate_I_smooth(data = data.pps.fast, method = "gren")
gren_est
calculate_CI_I_pp(gren_est, method = "asymptotic", alpha = 0.05)
```
The other (`CI_np_bs`) is based on a bootstrapping approach which resamples from the estimates of the distributions for $\Aloi$ and $Alos$ based on the Grenander estimator. It works with data as output by `simulate_pps_data*`.
```{r}
CI_np_bs(data.pps.fast)
```

### Other estimators

The function `calculate_I_rhame` can be used to calculate the incidence with a user-supplied value `x.loi.hat` for the estimated length of infection $\xloi$ and an optional specification of `x.los.hat` for the estimated length of stay $\xlos$ to get an estimate of $\Ipp$ too. Here we take the example of an estimator where 
`x.loi.hat` and `x.los.hat` are fixed to their theoretical values and which depends only on the estimate of $\Prev$. We will call this type of estimator _rhame.theo_ .
```{r}
calculate_I_rhame(data.pps,
                  x.loi.hat = data.inc.theo$x.loi,
                  x.los.hat = data.inc.theo$x.los,
                  method = "method identifier")
data.inc.theo$I
data.inc.theo$I.pp

calculate_I_rhame(data.pps.fast,
                  x.loi.hat = data.fast.inc.theo$x.loi,
                  x.los.hat = data.fast.inc.theo$x.los,
                  method = "method identifier")
data.fast.inc.theo$I
data.fast.inc.theo$I.pp
```

As a convenience function, one can use `calculate_I` to get estimates of I for a range of estimators (including the ones studied in [@willrich2019]) based on PPS data and the accompanying theoretical data .

The estimators are the following:

* _gren_ - new method using Grenander estimator,
* _rear_ - new method using rearrangement estimator,
* _pps.mixed_ - mixed estimator described in the paper,
* _pps.median_ - estimator based on the median duration up to PPS (used in  [@surveyecdc] ),
* _pps.mean_ - alternative estimator used in [@surveyecdc] based on the mean instead of the median
* _L.full_ - estimator based on samples from the PPS with information on full length of stay/infection.
* _rhame.theo_ - estimator using theoretical values for x.loi/x.los.

```{r}
  calculate_I(data.pps.fast, data.fast.inc.theo)
```

If one wants to combine the (fast) simulation step with the estimation step one can 
use `generate_I_fast`. This is just a wrapper for first calling `simulate_pps_fast` and then calling `calculate_I`.

```{r}
generate_I_fast(n.sample = 10000,
                P = 0.05,
                dist.X.loi = example.dist,
                data.theo = data.fast.inc.theo)
```

## Helper functions

<!--
export(epmf)
export(length_biased_dist)
export(length_unbiased_mean)
export(monotone_smoother)
export(A_to_X_dist)
export(X_to_A_dist)
-->

The function `monotone_smoother` implements the rearrangement estimator and Grenander estimator described in [@jankowski2009].

```{r}
A.loi.sample <- data.pps$A.loi[data.pps$A.loi>0]
# raw histogram of data
hist(A.loi.sample)
A.loi.smoothed <- monotone_smoother(A.loi.sample, method = "gren")
# estimated monotonously decreasing distribution
plot(A.loi.smoothed)
```

For creating length-biased distributions there is `length_biased_dist`, which takes a vector of probabilities of a discrete positive distribution as an argument.
 
```{r}
# geometric distribution starting in 1 and cutoff at 70 with mean at about 8.
geom.dist <- create_dist_vec(geom_dist_fct, max.dist = 70)
# calculate mean
sum(1:length(geom.dist)*geom.dist)
# plot original distribution
plot(geom.dist)
geom.dist.lb <- length_biased_dist(geom.dist)
# plot length biased distribution
plot(geom.dist.lb)
```

To calculate the mean of the original distribution based on the length-biased distribution one can use `length_unbiased_mean`.

```{r}
# length biased distribution from chunk above
length_unbiased_mean(geom.dist.lb)
```


# References