---
title: "Introduction to `ssp.quantreg`: Subsampling for Quantile Regression"
output: rmarkdown::html_vignette
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{Introduction to `ssp.quantreg`: Subsampling for Quantile Regression}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(subsampling)
```

This vignette introduces the usage of `ssp.quantreg`. The statistical theory
and algorithms behind this implementation can be found in the relevant reference
papers.

Quantile regression aims to estimate conditional quantiles by minimizing the
following loss function:

$$
\min_{\beta} L(\beta) = \frac{1}{N} \sum_{i=1}^{N} \rho_\tau \left( y_i - \beta^\top x_i \right) = 
\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \beta^\top x_i \right) \left\{ \tau - I \left( y_i < \beta^\top x_i \right) \right\},
$$
where $\tau$ is the quantile of interest, $y$ is the response variable, $x$ is covariates vector and $N$ is the number of observations in full dataset.

The idea of subsampling methods is as follows: instead of fitting the model on
the size $N$ full dataset, a subsampling probability is assigned to each
observation and a smaller, informative subsample is drawn. The model is then
fitted on the subsample to obtain an estimator with reduced computational cost.

## Terminology

- Full dataset: The whole dataset used as input.

- Full data estimator: The estimator obtained by fitting the model on the full
  dataset.

- Subsample: A subset of observations drawn from the full dataset.

- Subsample estimator: The estimator obtained by fitting the model on the
  subsample.

- Subsampling probability ($\pi$): The probability assigned to each observation
  for inclusion in the subsample.

## Example

We introduce `ssp.quantreg` with simulated data. $X$ contains $d=6$ covariates
drawn from multinormal distribution and $Y$ is the response variable. The full
data size is $N = 1 \times 10^4$. The interested quantile $\tau=0.75$.

```{r}
set.seed(1)
N <- 1e4
tau <- 0.75
beta.true <- rep(1, 7)
d <- length(beta.true) - 1
corr  <- 0.5
sigmax  <- matrix(0, d, d)
for (i in 1:d) for (j in 1:d) sigmax[i, j] <- corr^(abs(i-j))
X <- MASS::mvrnorm(N, rep(0, d), sigmax)
err <- rnorm(N, 0, 1) - qnorm(tau)
Y <- beta.true[1] + X %*% beta.true[-1] + err * rowMeans(abs(X))
data <- as.data.frame(cbind(Y, X))
colnames(data) <- c("Y", paste("V", 1:ncol(X), sep=""))
formula <- Y ~ .
head(data)
```

## Key Arguments

The function usage is

```{r, eval = FALSE}
ssp.quantreg(
  formula,
  data,
  subset = NULL,
  tau = 0.5,
  n.plt,
  n.ssp,
  B = 5,
  boot = TRUE,
  criterion = "optL",
  sampling.method = "withReplacement",
  likelihood = c("weighted"),
  control = list(...),
  contrasts = NULL,
  ...
)
```

The core functionality of `ssp.quantreg` revolves around three key questions:

- How are subsampling probabilities computed? (Controlled by the `criterion`
  argument)

- How is the subsample drawn? (Controlled by the `sampling.method` argument)

- How is the likelihood adjusted to correct for bias? (Controlled by the
  `likelihood` argument)

### `criterion`

`criterion` stands for the criterion we choose to compute the sampling
probability for each observation. The choices of `criterion` include
`optL`(default) and `uniform`. In `optL`, the optimal subsampling probability is
by minimizing a transformation of the asymptotic variance of subsample
estimator. `uniform` is a baseline method.

### `sampling.method`

The options for the `sampling.method` argument include `withReplacement`
(default) and `poisson`. `withReplacement` stands for drawing $n.ssp$ subsamples
from full dataset with replacement, using the specified subsampling
probabilities. `poisson` stands for drawing subsamples one by one by comparing the
subsampling probability with a realization of uniform random variable
$U(0,1)$. The expected number of drawn samples are $n.ssp$.

### `likelihood`

The available choice for `likelihood` in `ssp.quantreg` is `weighted`. It takes
the inverse of sampling probabblity as the weights in likelihood function to
correct the bias introduced by unequal subsampling probabilities.

### `boot` and `B`

An option for drawing $B$ subsamples (each with expected size `n.ssp`) and deriving
subsample estimator and asymptotic covariance matrix based on them. After
getting $\hat{\beta}_{b}$ on the $b$-th subsample, $b=1,\dots B$, it calculates

$$
\hat{\beta}_I = \frac{1}{B} \sum_{b=1}^{B} \hat{\beta}_{b}
$$
as the final subsample estimator and 
$$
\hat{V}(\hat{\beta}_I) = \frac{1}{r_{ef} B (B - 1)} 
\sum_{b=1}^{B} \left( \hat{\beta}_{b} - \hat{\beta}_I  \right)^{\otimes 2},
$$
where $r_{ef}$ is a correction term for effective subsample size since the
observations in each subsample can be replicated.  For more details, see
@wang2021optimal.

## Outputs

After drawing subsample(s), `ssp.quantreg` utilizes `quantreg::rq` to fit the
model on the subsample(s). Arguments accepted by `quantreg::rq` can be passed
through `...` in `ssp.quantreg`.

Below are two examples demonstrating the use of `ssp.quantreg` with different
configurations.

```{r}
B <- 5
n.plt <- 200
n.ssp <- 200
ssp.results1 <- ssp.quantreg(formula, 
                             data, 
                             tau = tau, 
                             n.plt = n.plt,
                             n.ssp = n.ssp,
                             B = B, 
                             boot = TRUE, 
                             criterion = 'optL',
                             sampling.method = 'withReplacement', 
                             likelihood = 'weighted'
                             )

ssp.results2 <- ssp.quantreg(formula, 
                             data, 
                             tau = tau, 
                             n.plt = n.plt,
                             n.ssp = n.ssp,
                             B = B, 
                             boot = FALSE, 
                             criterion = 'optL',
                             sampling.method = 'withReplacement', 
                             likelihood = 'weighted'
                             )
```

### Returned object 

The returned object contains estimation results and index of drawn
subsample in the full dataset. 

```{r}
names(ssp.results1)
```

```{r}
summary(ssp.results1)
```

```{r}
summary(ssp.results2)
```
Some key returned variables:

- `index.plt` and `index` are the row indices of drawn pilot subsamples and
optimal subsamples in the full data.

- `coef.ssp` is the subsample estimator for $\beta$ and `coef` is the linear 
combination of `coef.plt` (pilot estimator) and `coef.ssp`. 

- `cov.ssp` and `cov` are estimated covariance matrices of `coef.ssp` and
`coef`. If `boot=FALSE`, covariance matrix would not be estimated and a size
`n.ssp * B` subsample would be drawn.

See the help documentation of `ssp.quantreg` for details.

## References