Package: subsampling 0.1.1
subsampling: Optimal Subsampling Methods for Statistical Models
Subsampling techniques balance computational and statistical efficiency, offering a practical way to analyze large-scale data. They make statistical modeling of massive datasets feasible by drawing representative subsamples from the full dataset according to tailored sampling probabilities. These probabilities are optimized for specific goals, such as minimizing the variance of coefficient estimates or reducing prediction error.
# Install 'subsampling' in R:
install.packages('subsampling', repos = 'https://cloud.r-project.org')
Bug tracker: https://github.com/dqksnow/subsampling/issues (0 issues)
Last updated 5 months ago from: 897268b17f. Checks: 3 OK. Indexed: no.
Target | Result | Latest binary |
---|---|---|
Doc / Vignettes | OK | Mar 06 2025 |
R-4.5-linux-x86_64 | OK | Mar 06 2025 |
R-4.4-linux-x86_64 | OK | Mar 06 2025 |
Exports: ssp.glm, ssp.quantreg, ssp.relogit, ssp.softmax
Dependencies: DBI, expm, lattice, MASS, Matrix, MatrixModels, minqa, mitools, nnet, numDeriv, quantreg, Rcpp, RcppArmadillo, SparseM, survey, survival
Introduction to ssp.glm: Subsampling for Generalized Linear Models
Rendered from ssp-logit.Rmd using knitr::rmarkdown on Mar 06 2025. Last update: 2024-11-05. Started: 2024-11-05.
Introduction to ssp.quantreg: Subsampling for Quantile Regression
Rendered from ssp-quantreg.Rmd using knitr::rmarkdown on Mar 06 2025. Last update: 2024-11-05. Started: 2024-11-05.
Introduction to ssp.relogit: Subsampling for Logistic Regression Model with Rare Events
Rendered from ssp-relogit.Rmd using knitr::rmarkdown on Mar 06 2025. Last update: 2024-11-05. Started: 2024-11-05.
Introduction to ssp.softmax: Subsampling for Softmax (Multinomial) Regression Model
Rendered from ssp-softmax.Rmd using knitr::rmarkdown on Mar 06 2025. Last update: 2024-11-05. Started: 2024-11-05.
Citation
To cite package ‘subsampling’ in publications use:
Dong Q, Yao Y, Wang H (2024). subsampling: Optimal Subsampling Methods for Statistical Models. R package version 0.1.1, https://CRAN.R-project.org/package=subsampling.
Corresponding BibTeX entry:
@Manual{,
  title = {subsampling: Optimal Subsampling Methods for Statistical Models},
  author = {Qingkai Dong and Yaqiong Yao and Haiying Wang},
  year = {2024},
  note = {R package version 0.1.1},
  url = {https://CRAN.R-project.org/package=subsampling},
}
Readme and manuals
subsampling
A major challenge in big data statistical analysis is the demand for computing resources. For example, when fitting a logistic regression model to a binary response variable with an $N \times d$ covariate matrix, the computational complexity of estimating the coefficients using the IRLS algorithm is $O(\zeta N d^2)$, where $\zeta$ is the number of iterations. When $N$ is large, the cost can be prohibitive, especially if high performance computing resources are unavailable. Subsampling has become a widely used technique for balancing computational efficiency against statistical efficiency.
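To make the $O(\zeta N d^2)$ cost concrete, here is a minimal base-R sketch of IRLS (Newton's method) for logistic regression. The function name `irls_logistic` and its arguments are hypothetical and chosen for illustration; this is not code from the package. The weighted cross-product inside the loop is the $O(N d^2)$ step, and it runs once per iteration.

```r
# Illustrative IRLS for logistic regression (hypothetical helper, not package code).
# X: N x d design matrix (include a column of ones for an intercept); y: 0/1 response.
irls_logistic <- function(X, y, max_iter = 25, tol = 1e-8) {
  beta <- rep(0, ncol(X))
  for (iter in seq_len(max_iter)) {
    p <- plogis(drop(X %*% beta))   # fitted probabilities, O(N d)
    w <- p * (1 - p)                # IRLS working weights
    H <- crossprod(X, X * w)        # X' W X: the O(N d^2) step in each iteration
    g <- crossprod(X, y - p)        # score vector, O(N d)
    step <- drop(solve(H, g))       # Newton step, O(d^3)
    beta <- beta + step
    if (max(abs(step)) < tol) break
  }
  list(coefficients = beta, iterations = iter)
}

# Usage on simulated data: the result should agree closely with glm.fit().
set.seed(42)
N <- 2000
X <- cbind(1, matrix(rnorm(N * 2), N, 2))
y <- rbinom(N, 1, plogis(drop(X %*% c(-0.5, 1, -1))))
fit <- irls_logistic(X, y)
ref <- glm.fit(X, y, family = binomial())$coefficients
max(abs(fit$coefficients - ref))    # difference should be close to zero
```

Fitting on a subsample of size $n \ll N$ replaces $N$ with $n$ in the dominant term, which is where the computational saving comes from.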
The R package subsampling provides optimal subsampling methods for
various statistical models, including generalized linear models (GLM),
softmax (multinomial) regression, rare event logistic regression, and
quantile regression. Specialized subsampling techniques are
provided to address specific challenges across different models and
datasets.
Installation
You can install the development version of subsampling from GitHub with:
# install.packages("devtools")
devtools::install_github("dqksnow/subsampling")
Getting Started
The online documentation provides a quick-start guide for each model:
- Generalized linear model.
- Rare event logistic regression.
- Softmax (multinomial) regression.
- Quantile regression.
Example
This example applies a subsampling method to logistic regression:
library(subsampling)
set.seed(1)

# Simulate N observations with d correlated covariates
N <- 1e4
beta0 <- rep(-0.5, 7)          # true coefficients: intercept plus 6 slopes
d <- length(beta0) - 1
corr <- 0.5
sigmax <- matrix(corr, d, d) + diag(1-corr, d)   # unit variances, pairwise correlation 0.5
X <- MASS::mvrnorm(N, rep(0, d), sigmax)
colnames(X) <- paste("V", 1:ncol(X), sep = "")

# Logistic success probabilities and binary responses
P <- 1 - 1 / (1 + exp(beta0[1] + X %*% beta0[-1]))
Y <- rbinom(N, 1, P)
data <- as.data.frame(cbind(Y, X))
formula <- Y ~ .

# Pilot sample size and expected subsample size
n.plt <- 200
n.ssp <- 600
ssp.results <- ssp.glm(formula = formula,
                       data = data,
                       n.plt = n.plt,
                       n.ssp = n.ssp,
                       family = "quasibinomial",
                       criterion = "optL",
                       sampling.method = "poisson",
                       likelihood = "weighted"
                       )
summary(ssp.results)
#> Model Summary
#>
#> Call:
#>
#> ssp.glm(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> family = "quasibinomial", criterion = "optL", sampling.method = "poisson",
#> likelihood = "weighted")
#>
#> Subsample Size:
#>
#> 1 Total Sample Size 10000
#> 2 Expected Subsample Size 600
#> 3 Actual Subsample Size 635
#> 4 Unique Subsample Size 635
#> 5 Expected Subample Rate 6%
#> 6 Actual Subample Rate 6.35%
#> 7 Unique Subample Rate 6.35%
#>
#> Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept -0.4149 0.0803 -5.1694 <0.0001
#> V1 -0.5874 0.0958 -6.1286 <0.0001
#> V2 -0.4723 0.1086 -4.3499 <0.0001
#> V3 -0.5492 0.1014 -5.4164 <0.0001
#> V4 -0.4044 0.1012 -3.9950 <0.0001
#> V5 -0.3725 0.1045 -3.5649 0.0004
#> V6 -0.6703 0.0973 -6.8859 <0.0001
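The call above combines a pilot sample, an optimality criterion, Poisson sampling, and a weighted likelihood. As a rough illustration of how such a two-step procedure can work, the following base-R sketch uses only `glm.fit()`. It is not the package's implementation: it assumes a uniformly drawn pilot sample and scores of the form $|y_i - \hat p_i| \, \lVert x_i \rVert$, one common form of the L-optimality criterion in the subsampling literature, and all variable names (`incl_prob`, `keep`, and so on) are hypothetical.

```r
# Illustrative two-step subsampling for logistic regression (not package code).
set.seed(1)
N <- 1e4
d <- 3
X <- cbind(1, matrix(rnorm(N * d), N, d))       # design matrix with intercept
beta_true <- rep(-0.5, d + 1)
y <- rbinom(N, 1, plogis(drop(X %*% beta_true)))

n_plt <- 200   # pilot sample size
n_ssp <- 600   # expected subsample size

# Step 1: uniform pilot sample (assumption) for a rough coefficient estimate.
plt_idx <- sample(N, n_plt)
beta_plt <- glm.fit(X[plt_idx, ], y[plt_idx], family = binomial())$coefficients

# Step 2: assumed L-optimality-style scores |y - p| * ||x||, scaled so that
# the inclusion probabilities sum to roughly n_ssp (capped at 1).
p_plt <- plogis(drop(X %*% beta_plt))
score <- abs(y - p_plt) * sqrt(rowSums(X^2))
incl_prob <- pmin(1, n_ssp * score / sum(score))

# Step 3: Poisson sampling keeps each row independently, so the realized
# subsample size is random and only equals n_ssp in expectation.
keep <- runif(N) < incl_prob

# Step 4: inverse-probability-weighted fit on the subsample; quasibinomial()
# avoids the non-integer-weights warning that binomial() would raise.
fit <- glm.fit(X[keep, ], y[keep],
               weights = 1 / incl_prob[keep],
               family = quasibinomial())

sum(keep)            # realized subsample size
fit$coefficients     # compare with beta_true
```

Because rows are kept independently under Poisson sampling, the actual subsample size in the summary above (635) differs from the expected size (600).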
Help Manual
Help page | Topics |
---|---|
Optimal Subsampling Methods for Generalized Linear Models | ssp.glm |
Optimal Subsampling Methods for Quantile Regression Model | ssp.quantreg |
Optimal Subsampling for Logistic Regression Model with Rare Events Data | ssp.relogit |
Optimal Subsampling Method for Softmax (multinomial logistic) Regression Model | ssp.softmax |
Optimal Subsampling Methods for Statistical Models | subsampling-package subsampling |