ssp.softmax: Subsampling for
Softmax (Multinomial) Regression ModelThis vignette introduces the usage
of ssp.softmax, which draws optimal subsample from full
data and fit softmax (multinomial) regression on the subsample. The
statistical theory and algorithms in this implementation can be found in
the relevant reference papers.
Denote \(y\) as multi-category response variable and \(K+1\) is the number of categories. \(N\) is the number of observations in the full dataset. \(X\) is the \(N \times d\) covariates matrix. Softmax regression model assumes that \[ P(y_{i,k} = 1 \mid \mathbf{x}_i) = \frac{\exp(\mathbf{x}_i^\top \boldsymbol{\beta}_k)}{\sum_{l=0}^{K} \exp(\mathbf{x}_i^\top \boldsymbol{\beta}_l)} \] for \(i = 1, \ldots, N\) and \(k = 0, 1, \ldots, K\), where \(\boldsymbol{\beta}_k\)’s are \(d\)-dimensional unknown coefficients.
The log-likelihood function of softmax regression is
\[ \max_{\beta} L(\beta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \sum_{k=0}^{K} y_{i,k} \mathbf{x}_i^\top \boldsymbol{\beta}_k - \ln \left\{ \sum_{l=0}^{K} \exp(\mathbf{x}_i^\top \boldsymbol{\beta}_l) \right\} \right]. \]
The idea of subsampling methods is as follows: instead of fitting the model on the size \(N\) full dataset, a subsampling probability is assigned to each observation and a smaller, informative subsample is drawn. The model is then fitted on the subsample to obtain an estimator with reduced computational cost.
Full dataset: The whole dataset used as input.
Full data estimator: The estimator obtained by fitting the model on the full dataset.
Subsample: A subset of observations drawn from the full dataset.
Subsample estimator: The estimator obtained by fitting the model on the subsample.
Subsampling probability (\(\pi\)): The probability assigned to each observation for inclusion in the subsample.
We introduce the usage of ssp.softmax with simulated
data. \(X\) contains \(d=3\) covariates drawn from multinormal
distribution and \(Y\) is the
multicategory response variable with \(K+1=3\) categories. The full data size is
\(N = 1 \times
10^4\).
set.seed(1)
d <- 3
K <- 2
G <- rbind(rep(-1/(K+1), K), diag(K) - 1/(K+1)) %x% diag(d)
N <- 1e4
beta.true.baseline <- cbind(rep(0, d), matrix(-1.5, d, K))
beta.true.summation <- cbind(rep(1, d), 0.5 * matrix(-1, d, K))
mu <- rep(0, d)
sigma <- matrix(0.5, nrow = d, ncol = d)
diag(sigma) <- rep(1, d)
X <- MASS::mvrnorm(N, mu, sigma)
prob <- exp(X %*% beta.true.summation)
prob <- prob / rowSums(prob)
Y <- apply(prob, 1, function(row) sample(0:K, size = 1, prob = row))
data <- as.data.frame(cbind(Y, X))
colnames(data) <- c("Y", paste("V", 1:ncol(X), sep=""))
head(data)
#> Y V1 V2 V3
#> 1 0 0.647375927 0.8457239 0.04139233
#> 2 2 -0.008594357 0.3076438 -0.74888187
#> 3 0 0.311521132 1.3853692 0.34997340
#> 4 1 -2.419603531 -0.1512300 -1.33679039
#> 5 1 0.330649245 -0.3186678 -0.81910733
#> 6 0 0.506201892 1.0142579 0.48926915The function usage is
ssp.softmax(
formula,
data,
subset,
n.plt,
n.ssp,
criterion = "MSPE",
sampling.method = "poisson",
likelihood = "MSCLE",
constraint = "summation",
control = list(...),
contrasts = NULL,
...
)The core functionality of ssp.softmax revolves around
three key questions:
How are subsampling probabilities computed? (Controlled by the
criterion argument)
How is the subsample drawn? (Controlled by the
sampling.method argument)
How is the likelihood adjusted to correct for bias? (Controlled
by the likelihood argument)
criterionThe choices of criterion include optA,
optL, ,MSPE(default), LUC and
uniform. The default criterion MSPE minimizes
the mean squared prediction error between subsample estimator and full
data estimator. Criterion optA and optL are
derived by minimizing the asymptotic covariance of subsample estimator.
LUC and uniform are baseline methods. See
Yao et al. (2023) and Wang and Kim (2022) for details.
sampling.methodThe options for sampling.method include
withReplacement and poisson (default).
withReplacement. stands for drawing \(n.ssp\) subsamples from full dataset with
replacement, using the specified subsampling probability.
poisson stands for drawing subsamples one by one by
comparing the subsampling probability with a realization of uniform
random variable \(U(0,1)\). The
expected number of drawn samples are \(n.ssp\).
likelihoodThe available choices for likelihood include
weighted and MSCLE(default).
MSCLE stands for maximum sampled conditional likelihood.
Both of these likelihood functions can derive an unbiased optimal
subsample estimator. See Wang and Kim
(2022) for details about MSCLE.
constraintSoftmax model needs constraint on unknown coefficients for
identifiability. The options for constraint include
summation and baseline (default). The baseline
constraint assumes the coefficient for the baseline category are \(0\). Without loss of generality,
ssp.softmax sets the category \(Y=0\) as the baseline category so that
\(\boldsymbol{\beta}_0=0\). The
summation constraint \(\sum_{k=0}^{K}
\boldsymbol{\beta}_k\) can also used in the subsampling method
for the purpose of calculating optimal subsampling probability. These
two constraints lead to different interpretation of coefficients but are
equal for computing \(P(y_{i,k} = 1 \mid
\mathbf{x}_i)\). The estimation of coefficients returned by
ssp.softmax() is under baseline constraint.
After drawing subsample, ssp.softmax utilizes
nnet::multinom to fit the model on the subsample. Arguments
accepted by nnet::multinom can be passed through
... in ssp.softmax.
Below are two examples demonstrating the use of
ssp.softmax with different configurations.
n.plt <- 200
n.ssp <- 600
formula <- Y ~ . -1
ssp.results1 <- ssp.softmax(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
criterion = 'MSPE',
sampling.method = 'withReplacement',
likelihood = 'weighted',
constraint = 'baseline'
)
#> [1] "Message from nnet::multinom: "
#> # weights: 12 (6 variable)
#> initial value 219.722458
#> iter 10 value 139.767781
#> final value 139.762297
#> converged
#> [1] "Message from nnet::multinom: "
#> # weights: 12 (6 variable)
#> initial value 5671281.524906
#> iter 10 value 4189655.105108
#> final value 4189652.860667
#> converged
summary(ssp.results1)
#> Model Summary
#>
#>
#> Call:
#>
#> ssp.softmax(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> criterion = "MSPE", sampling.method = "withReplacement",
#> likelihood = "weighted", constraint = "baseline")
#>
#> Subsample Size:
#>
#> 1 Total Sample Size 10000
#> 2 Expected Subsample Size 600
#> 3 Actual Subsample Size 600
#> 4 Unique Subsample Size 563
#> 5 Expected Subample Rate 6%
#> 6 Actual Subample Rate 6%
#> 7 Unique Subample Rate 5.63%
#>
#> Coefficients:
#>
#> [,1] [,2]
#> V1 -1.490486 -1.501145
#> V2 -1.282840 -1.120447
#> V3 -1.266387 -1.364313
#>
#> Std. Errors:
#>
#> [,1] [,2]
#> V1 0.1662016 0.1695415
#> V2 0.1547625 0.1540065
#> V3 0.1410814 0.1463690summary(ssp.results1) shows that it draws 600
observations out of 10000, where the number of unique indices is less
than 600 since we use sampling.method = 'withReplacement'.
After fitting softmax model on subsample using the chosen
weighted likelihood function, we get coefficients
estimation and standard errors as above.
ssp.results2 <- ssp.softmax(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
criterion = 'MSPE',
sampling.method = 'poisson',
likelihood = 'MSCLE',
constraint = 'baseline'
)
#> [1] "Message from nnet::multinom: "
#> # weights: 12 (6 variable)
#> initial value 219.722458
#> iter 10 value 128.011207
#> final value 127.954692
#> converged
#> [1] "Message from nnet::multinom: "
#> # weights: 21 (6 variable)
#> initial value 790.698184
#> iter 10 value 591.242598
#> final value 591.236607
#> converged
summary(ssp.results2)
#> Model Summary
#>
#>
#> Call:
#>
#> ssp.softmax(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> criterion = "MSPE", sampling.method = "poisson", likelihood = "MSCLE",
#> constraint = "baseline")
#>
#> Subsample Size:
#>
#> 1 Total Sample Size 10000
#> 2 Expected Subsample Size 600
#> 3 Actual Subsample Size 624
#> 4 Unique Subsample Size 624
#> 5 Expected Subample Rate 6%
#> 6 Actual Subample Rate 6.24%
#> 7 Unique Subample Rate 6.24%
#>
#> Coefficients:
#>
#> [,1] [,2]
#> V1 -1.635380 -1.458310
#> V2 -1.249560 -1.290106
#> V3 -1.604003 -1.680821
#>
#> Std. Errors:
#>
#> [,1] [,2]
#> V1 0.1428372 0.1438973
#> V2 0.1491207 0.1522322
#> V3 0.1476266 0.1495299The returned object contains estimation results and index of drawn subsamples in the full dataset.
names(ssp.results1)
#> [1] "model.call" "coef.plt" "coef.ssp"
#> [4] "coef" "coef.plt.sum" "coef.ssp.sum"
#> [7] "coef.sum" "cov.plt" "cov.ssp"
#> [10] "cov" "cov.plt.sum" "cov.sum"
#> [13] "cov.ssp.sum" "index.plt" "index.ssp"
#> [16] "N" "subsample.size.expect" "terms"Some key returned variables:
index.plt and index are the row indices
of drawn pilot subsamples and optimal subsamples in the full
data.
coef.ssp is the subsample estimator for \(\beta\) and coef is the linear
combination of coef.plt (pilot estimator) and
coef.ssp.
cov.ssp and cov are estimated
covariance matrices of coef.ssp and
coef.