Introduction to ssp.glm: Subsampling for Generalized Linear Models

This vignette introduces the usage of the ssp.glm using logistic regression as an example of generalized linear models (GLM). The statistical theory and algorithms in this implementation can be found in the relevant reference papers.

The log-likelihood function for a GLM is

$$ \max_{\beta} L(\beta) = \frac{1}{N} \sum_{i=1}^N \left\{y_i u(\beta^{\top} x_i) - \psi \left[ u(\beta^{\top} x_i) \right] \right\}, $$ where u and ψ are known functions depend on the distribution from the exponential family. For the binomial distribution, the log-likelihood function becomes

$$ \max_{\beta} L(\beta) = \frac{1}{N} \sum_{i=1}^N \left[y_i \beta^{\top} x_i - \log\left(1 + e^{\beta^\top x_i}\right) \right]. $$

The idea of subsampling methods is as follows: instead of fitting the model on the size N full dataset, a subsampling probability is assigned to each observation and a smaller, informative subsample is drawn. The model is then fitted on the subsample to obtain an estimator with reduced computational cost.

Installation

You can install the development version of subsampling from GitHub with:

# install.packages("devtools")
devtools::install_github("dqksnow/Subsampling")
library(subsampling)

Terminology

  • Full dataset: The whole dataset used as input.

  • Full data estimator: The estimator obtained by fitting the model on the full dataset.

  • Subsample: A subset of observations drawn from the full dataset.

  • Subsample estimator: The estimator obtained by fitting the model on the subsample.

  • Subsampling probability (π): The probability assigned to each observation for inclusion in the subsample.

Example: Logistic Regression with Simulated Data

We introduce the usage of ssp.glm with simulated data. X contains d = 6 covariates drawn from multinormal distribution, and Y is the binary response variable. The full dataset size is N = 1 × 104.

set.seed(1)
N <- 1e4
beta0 <- rep(-0.5, 7)
d <- length(beta0) - 1
corr <- 0.5
sigmax  <- matrix(corr, d, d) + diag(1-corr, d)
X <- MASS::mvrnorm(N, rep(0, d), sigmax)
colnames(X) <- paste("V", 1:ncol(X), sep = "")
P <- 1 - 1 / (1 + exp(beta0[1] + X %*% beta0[-1]))
Y <- rbinom(N, 1, P)
data <- as.data.frame(cbind(Y, X))
formula <- Y ~ .
head(data)
#>   Y           V1         V2          V3         V4           V5         V6
#> 1 1 -0.857566061 -0.7773284  0.15764195 -0.3034826 -0.662692711 -0.4273442
#> 2 0  0.003672707 -0.5238648  0.34997799  0.9248544 -0.078952968  0.1658720
#> 3 1 -0.875670649 -1.2112225  0.04210680 -1.1333718 -0.004493163 -0.6466801
#> 4 0  0.773300579  0.6924800  2.01842380  0.7793008  2.490899158  0.5560907
#> 5 0  0.464432153 -0.2713982 -0.05759442  0.3116189  0.012120133  1.0508158
#> 6 0 -1.076277578 -0.6697823 -0.76237732  0.8525745 -0.910222847 -1.1937730

Key Arguments

The function usage is

ssp.glm(
  formula,
  data,
  subset = NULL,
  n.plt,
  n.ssp,
  family = "quasibinomial",
  criterion = "optL",
  sampling.method = "poisson",
  likelihood = "weighted",
  control = list(...),
  contrasts = NULL,
  ...
  )

The core functionality of ssp.glm revolves around three key questions:

  • How are subsampling probabilities computed? (Controlled by the criterion argument)

  • How is the subsample drawn? (Controlled by the sampling.method argument)

  • How is the likelihood adjusted to correct for bias? (Controlled by the likelihood argument)

criterion

The choices of criterion include optA, optL(default), LCC and uniform. The optimal subsampling criterion optA and optL are derived by minimizing the asymptotic covariance of subsample estimator, proposed by Wang, Zhu, and Ma (2018). LCC and uniform are baseline methods.

sampling.method

The options for the sampling.method argument include withReplacement and poisson (default). withReplacement stands for drawing n.ssp subsamples from full dataset with replacement, using the specified subsampling probabilities. poisson stands for drawing subsamples one by one by comparing the subsampling probability with a realization of uniform random variable U(0, 1). The expected number of drawn samples are n.ssp. More details see Wang (2019).

likelihood

The available choices for likelihood include weighted (default) and logOddsCorrection. Both of these likelihood functions can derive an unbiased estimator. Theoretical results indicate that logOddsCorrection is more efficient than weighted in the context of logistic regression. See Wang and Kim (2022).

Outputs

After drawing subsample, ssp.glm utilizes survey::svyglm to fit the model on the subsample, which eventually uses glm. Arguments accepted by svyglm can be passed through ... in ssp.glm.

Below are two examples demonstrating the use of ssp.glm with different configurations.

n.plt <- 200
n.ssp <- 600
ssp.results <- ssp.glm(formula = formula,
                       data = data,
                       n.plt = n.plt,
                       n.ssp = n.ssp,
                       family = "quasibinomial",
                       criterion = "optL",
                       sampling.method = "withReplacement",
                       likelihood = "weighted"
                       )
summary(ssp.results)
#> Model Summary
#> 
#> Call:
#> 
#> ssp.glm(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp, 
#>     family = "quasibinomial", criterion = "optL", sampling.method = "withReplacement", 
#>     likelihood = "weighted")
#> 
#> Subsample Size:
#>                                
#> 1       Total Sample Size 10000
#> 2 Expected Subsample Size   600
#> 3   Actual Subsample Size   600
#> 4   Unique Subsample Size   561
#> 5  Expected Subample Rate    6%
#> 6    Actual Subample Rate    6%
#> 7    Unique Subample Rate 5.61%
#> 
#> Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept  -0.5876     0.0867 -6.7749  <0.0001
#> V1         -0.5959     0.1082 -5.5079  <0.0001
#> V2         -0.4197     0.1069 -3.9271  <0.0001
#> V3         -0.5781     0.1062 -5.4424  <0.0001
#> V4         -0.5430     0.1073 -5.0612  <0.0001
#> V5         -0.4836     0.1148 -4.2136  <0.0001
#> V6         -0.6125     0.1137 -5.3896  <0.0001
ssp.results <- ssp.glm(formula = formula,
                       data = data,
                       n.plt = n.plt,
                       n.ssp = n.ssp,
                       family = "quasibinomial",
                       criterion = "optA",
                       sampling.method = "poisson",
                       likelihood = "logOddsCorrection"
                       )
summary(ssp.results)
#> Model Summary
#> 
#> Call:
#> 
#> ssp.glm(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp, 
#>     family = "quasibinomial", criterion = "optA", sampling.method = "poisson", 
#>     likelihood = "logOddsCorrection")
#> 
#> Subsample Size:
#>                                
#> 1       Total Sample Size 10000
#> 2 Expected Subsample Size   600
#> 3   Actual Subsample Size   634
#> 4   Unique Subsample Size   634
#> 5  Expected Subample Rate    6%
#> 6    Actual Subample Rate 6.34%
#> 7    Unique Subample Rate 6.34%
#> 
#> Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept  -0.5052     0.0775 -6.5171  <0.0001
#> V1         -0.6170     0.0956 -6.4519  <0.0001
#> V2         -0.3829     0.1010 -3.7922   0.0001
#> V3         -0.5155     0.0973 -5.2963  <0.0001
#> V4         -0.5154     0.1025 -5.0270  <0.0001
#> V5         -0.5641     0.1028 -5.4858  <0.0001
#> V6         -0.4538     0.0997 -4.5509  <0.0001

As recommended by survey::svyglm, when working with binomial models, it is advisable to use use family=quasibinomial() to avoid a warning issued by glm. Refer to svyglm() help documentation Details. The ‘quasi’ version of the family objects provide the same point estimates.

Returned object

The object returned by ssp.glm contains estimation results and indices of the drawn subsample in the full dataset.

names(ssp.results)
#>  [1] "model.call"            "coef.plt"              "coef.ssp"             
#>  [4] "coef"                  "cov.ssp"               "cov"                  
#>  [7] "index.plt"             "index"                 "N"                    
#> [10] "subsample.size.expect" "terms"

Some key returned variables:

  • index.plt and index are the row indices of drawn pilot subsamples and optimal subsamples in the full data.

  • coef.ssp is the subsample estimator for β and coef is the linear combination of coef.plt (pilot estimator) and coef.ssp.

  • cov.ssp and cov are estimated covariance matrices of coef.ssp and coef.

The coefficients and standard errors printed by summary() are coef and the square root of diag(cov). See the help documentation of ssp.glm for details.

Other Families

We also provide examples for poisson regression and gamma regression in the help documentation of ssp.glm. Note that likelihood = logOddsCorrection is currently implemented only for logistic regression (family = binomial or quasibonomial).

References

Wang, HaiYing. 2019. “More Efficient Estimation for Logistic Regression with Optimal Subsamples.” Journal of Machine Learning Research 20 (132): 1–59.
Wang, HaiYing, and Jae Kwang Kim. 2022. “Maximum Sampled Conditional Likelihood for Informative Subsampling.” Journal of Machine Learning Research 23 (332): 1–50.
Wang, HaiYing, Rong Zhu, and Ping Ma. 2018. “Optimal Subsampling for Large Sample Logistic Regression.” Journal of the American Statistical Association 113 (522): 829–44.