The sprintr package contains an implementation of a computationally efficient method, called sprinter, for fitting large interaction models based on the reluctant interaction selection principle. The details of the method can be found in Yu, Bien, and Tibshirani (2019), Reluctant interaction modeling. In particular, sprinter is a multi-stage method that fits the following pairwise interaction model:
$$
y = \sum_{j = 1}^p X_j \beta^\ast_j + \sum_{\ell \leq k} X_{\ell} X_k \gamma^\ast_{\ell k} + \varepsilon.
$$
This document introduces the package using a simple simulated data example.
We consider the following simple simulation setting, where X ∼ N(0, I_p). There are two nonzero main effects, β1 = 1 and β2 = −2, with βj = 0 for j > 2. The two important interactions are X1 * X3 with γ13 = 3 and X4 * X5 with γ45 = −4. With ε ∼ N(0, 1), the following code simulates n = 100 observations from the model above with p = 200.
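A minimal base R sketch of this simulation is given here. The seed and the variable names x and y are assumptions made for illustration; only n, p, and the coefficient values come from the description above.

```r
# Simulate from the pairwise interaction model described above.
set.seed(123)  # assumed seed, used only for reproducibility
n <- 100
p <- 200

# Design matrix with independent standard normal entries, i.e. X ~ N(0, I_p).
x <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Two main effects, two interactions, and standard normal noise.
y <- x[, 1] - 2 * x[, 2] +
  3 * x[, 1] * x[, 3] - 4 * x[, 4] * x[, 5] +
  rnorm(n)
```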
sprinter function

The function sprinter implements the sprinter method (please note that the function name sprinter is different from the package name sprintr), which involves the following three main steps:

1. Fit a lasso of the response on the main effects X only (if square = FALSE) or on both main effects and squared effects (X, X^2) (if square = TRUE).
2. Screen the candidate interactions using the residual from the previous step, keeping the num_keep interactions with the highest absolute correlation with the residual.
3. With a path of tuning parameter lambda, fit a lasso of the response on main effects, squared effects (if square = TRUE), and selected interactions from the previous step.

There are two tuning parameters: num_keep (used in Step 2) and lambda (used in Step 3). If num_keep is not specified, it is set to n/⌈log n⌉ (see, e.g., Fan & Lv (2008)). If lambda is not specified, sprinter computes its own path of tuning parameter values based on the number of values in the path (nlam) and the range of the path (lam_min_ratio).
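A call to sprinter on the simulated data might look like the sketch below. It assumes that the design matrix and the response are the first two arguments (they are passed positionally here) and leaves num_keep and lambda at their defaults; the seed and the object name mod are also assumptions. The fit is attempted only when the sprintr package is installed.

```r
# Simulated data as in the example above (seed is an assumption).
set.seed(123)
n <- 100
p <- 200
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- x[, 1] - 2 * x[, 2] +
  3 * x[, 1] * x[, 3] - 4 * x[, 4] * x[, 5] +
  rnorm(n)

# Fit only when sprintr is available.
# Assumption: x and y are the first two arguments of sprinter().
if (requireNamespace("sprintr", quietly = TRUE)) {
  mod <- sprintr::sprinter(x, y, square = FALSE)
}
```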
The output of sprinter is an S3 object with several useful components. In particular, it includes a matrix idx that contains the index pairs of all variables considered in Step 3:
mod$idx[(p + 1) : nrow(mod$idx), ]
#> index_1 index_2
#> [1,] 96 140
#> [2,] 5 79
#> [3,] 7 113
#> [4,] 113 155
#> [5,] 94 148
#> [6,] 5 173
#> [7,] 108 144
#> [8,] 7 165
#> [9,] 17 77
#> [10,] 158 175
#> [11,] 4 102
#> [12,] 58 108
#> [13,] 5 97
#> [14,] 30 168
#> [15,] 5 182
#> [16,] 5 25
#> [17,] 135 168
#> [18,] 115 175
#> [19,] 3 177
#> [20,] 4 96
#> [21,] 1 3
#> [22,] 4 5
Since Step 3 of sprinter always includes the main effects, mod$idx[(p + 1):nrow(mod$idx), ] contains the indices of all the interactions selected in Step 2. The two columns of this output represent the index pair (ℓ, k) of a selected interaction Xℓ * Xk, where ℓ ≤ k. Note that here the last two rows are the true interactions X1 * X3 and X4 * X5.
If the first entry of an index pair is zero, i.e., (ℓ = 0, k), then it represents a main effect Xk.
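To make the index-pair convention concrete, the sketch below converts a two-column index matrix into readable term labels. The helper label_terms and the small matrix idx_toy are hypothetical and made up for illustration; they are not part of the package or its output.

```r
# Hypothetical helper: turn index pairs (l, k) into readable labels.
# A zero first index denotes a main effect, as described above.
label_terms <- function(idx) {
  ifelse(idx[, 1] == 0,
         paste0("X", idx[, 2]),
         paste0("X", idx[, 1], " * X", idx[, 2]))
}

# Small hand-made example (not package output).
idx_toy <- rbind(c(0, 1), c(0, 2), c(1, 3), c(4, 5))
colnames(idx_toy) <- c("index_1", "index_2")

label_terms(idx_toy)
#> [1] "X1"      "X2"      "X1 * X3" "X4 * X5"
```

With a fitted object, the same helper could be applied to mod$idx, assuming it has the two-column layout shown above.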
The output mod$coef is an nrow(mod$idx)-by-length(mod$lambda) matrix. Each column of mod$coef is a vector of estimates of all variable coefficients considered in Step 3, corresponding to one value of the lasso tuning parameter lambda. For example, for the 30th tuning parameter, we have the corresponding coefficient estimate:
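One way to extract the nonzero estimates at a given value of lambda is sketched below. The helper nonzero_at and the object toy are hypothetical; the sketch only assumes that the fitted object has components idx and coef laid out as described above.

```r
# Hypothetical helper: nonzero coefficients at the k-th value of lambda.
# Assumes `fit` has components `idx` and `coef` laid out as described above.
nonzero_at <- function(fit, k) {
  estimate <- fit$coef[, k]
  out <- cbind(fit$idx, estimate)
  out[estimate != 0, , drop = FALSE]
}

# Toy object with the same layout (hand-made, not package output):
# three variables and two values of lambda.
toy <- list(
  idx  = cbind(index_1 = c(0, 0, 1), index_2 = c(1, 2, 2)),
  coef = cbind(c(0, 0, 0), c(0.5, 0, -1.2))
)

# Keeps the first and third rows, whose estimates are nonzero.
nonzero_at(toy, 2)
```

For the fit in this document, the analogous call would be nonzero_at(mod, 30).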
cv.sprinter
The function cv.sprinter() performs cross-validation to select the value of the lasso tuning parameter lambda used in Step 3, while holding the value of num_keep fixed.
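A cross-validated fit might be obtained as in the sketch below. As before, passing the design matrix and response positionally is an assumption, as are the seed and the object name mod_cv; all other arguments are left at their defaults. The call runs only when sprintr is installed.

```r
# Simulated data as in the example above (seed is an assumption).
set.seed(123)
n <- 100
p <- 200
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- x[, 1] - 2 * x[, 2] +
  3 * x[, 1] * x[, 3] - 4 * x[, 4] * x[, 5] +
  rnorm(n)

# Cross-validate over lambda only when sprintr is available.
# Assumption: x and y are the first two arguments of cv.sprinter().
if (requireNamespace("sprintr", quietly = TRUE)) {
  mod_cv <- sprintr::cv.sprinter(x, y, square = FALSE)
}
```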
The output of cv.sprinter is an S3 object. The most interesting component is mod_cv$compact, which is a matrix with three columns. The first two columns show the index pairs of all variables finally selected by the lasso in Step 3, and the last column is the coefficient estimate corresponding to that selected variable.
mod_cv$compact
#> index_1 index_2 coefficient
#> [1,] 0 1 0.9730877309
#> [2,] 0 2 -1.4431692880
#> [3,] 0 3 0.0066158779
#> [4,] 0 4 -0.2105449793
#> [5,] 0 36 0.0665947662
#> [6,] 0 66 -0.0489653687
#> [7,] 0 87 -0.0202139156
#> [8,] 0 94 0.1239236233
#> [9,] 0 112 0.0271044551
#> [10,] 0 123 -0.0293707954
#> [11,] 0 126 -0.0108057274
#> [12,] 0 157 -0.1948085716
#> [13,] 7 113 -0.0002098571
#> [14,] 108 144 0.0354711242
#> [15,] 158 175 0.0493387336
#> [16,] 5 97 0.0736288986
#> [17,] 17 77 0.0364564902
#> [18,] 30 168 -0.0605726982
#> [19,] 5 182 0.0256102439
#> [20,] 115 175 -0.0374660411
#> [21,] 1 3 2.4892512069
#> [22,] 4 5 -3.6153254041
We see (from the first two rows and the last two rows) that the fit selected by cross-validation includes all four important variables in the model, with relatively accurate estimates of their coefficients.
Finally, there is a predict function for the S3 object returned by cv.sprinter that computes the prediction for a new data matrix of main effects:
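A prediction call might look like the following sketch. The new data matrix (20 rows, an arbitrary choice), the seed, and the object names are assumptions, and the new data are passed as the second positional argument of predict. The model is fitted and used only when sprintr is installed.

```r
# Simulated training data as in the example above (seed is an assumption).
set.seed(123)
n <- 100
p <- 200
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- x[, 1] - 2 * x[, 2] +
  3 * x[, 1] * x[, 3] - 4 * x[, 4] * x[, 5] +
  rnorm(n)

# New observations of the main effects (20 rows is an arbitrary choice).
newdata <- matrix(rnorm(20 * p), nrow = 20, ncol = p)

# Fit and predict only when sprintr is available.
# Assumption: the new data matrix is the second argument of the predict method.
if (requireNamespace("sprintr", quietly = TRUE)) {
  mod_cv <- sprintr::cv.sprinter(x, y, square = FALSE)
  pred <- predict(mod_cv, newdata)
}
```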