Title: Linear Model with Tree-Based Lasso Regularization for Rare Features
Description: Implementation of an alternating direction method of multipliers (ADMM) algorithm for fitting a linear model with tree-based lasso regularization, proposed in Algorithm 1 of Yan and Bien (2018) <arXiv:1803.06675>. The package allows efficient model fitting over the entire two-dimensional regularization path for large datasets, and its complete set of functions makes tuning the regularization parameters and visualizing the results hassle-free.
Authors: Xiaohan Yan [aut, cre], Jacob Bien [aut]
Maintainer: Xiaohan Yan <[email protected]>
License: GPL-3
Version: 0.1.1
Built: 2024-11-22 06:49:51 UTC
Source: CRAN
The package fits the linear model with tree-based lasso regularization proposed in Yan and Bien (2018) using an alternating direction method of multipliers (ADMM) algorithm, given in Algorithm 1 of the same paper. The package also provides tools for tuning the regularization parameters, making predictions from the fitted model, and visualizing recovered groups of the covariates in a dendrogram.
Its main functions are rarefit, rarefit.cv, rarefit.predict, group.recover and group.plot.
Xiaohan Yan <[email protected]>, Jacob Bien
Yan, X. and Bien, J. (2018) Rare Feature Selection in High Dimensions, https://arxiv.org/abs/1803.06675.
data.dtm

A 500-by-200 document-term matrix for 200 adjectives appearing in 500 TripAdvisor reviews, stored in sparse format.

Format: an object of class dgCMatrix with 500 rows and 200 columns.
data.hc

An hclust tree for the 200 adjectives appearing in the TripAdvisor reviews. The tree was generated from 100-dimensional word embeddings for the adjectives, pre-trained by GloVe (Pennington et al., 2014) on the Gigaword5 and Wikipedia2014 corpora.

Format: an object of class hclust of length 7.

Source: embeddings available at http://nlp.stanford.edu/data/glove.6B.zip

Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
data.rating

A length-500 vector of TripAdvisor review ratings on the scale 1 to 5.

Format: an object of class integer of length 500.

Source: TripAdvisor data set used in https://www.cs.virginia.edu/~hw5x/paper/rp166f-wang.pdf
The function recursively finds all leaves that are descendants of a node in an hclust tree.
find.leaves(ind, merge)
ind: Index of the tree node. Negative indices denote leaves (leaf j is coded as -j, the hclust convention), and positive indices denote interior nodes, i.e., rows of merge.

merge: A (p-1)-by-2 matrix: the merge component of an hclust object fit on p leaves.

Value: returns a sequence of indices for the descendant leaves in the leaf set {1, ..., p}. Unlike the notation used in ind, positive integers denote leaves here.
## Not run:
hc <- hclust(dist(USArrests), "ave")
# Descendant leaves of the 10th leaf (should be itself)
find.leaves(-10, hc$merge)
# Descendant leaves of the 10th interior node
find.leaves(10, hc$merge)
# Descendant leaves of the root (should be all leaves)
ind_root <- nrow(hc$merge)
all.equal(find.leaves(ind_root, hc$merge), hc$order)
## End(Not run)
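The recursion over the merge matrix can be sketched in plain base R (hypothetical helper `descendant_leaves`, not the package's implementation): in `merge`, a negative entry -j denotes leaf j, and a positive entry k refers to the cluster formed at row k.

```r
# Plain-R sketch of descendant-leaf lookup over an hclust merge matrix
descendant_leaves <- function(ind, merge) {
  if (ind < 0) return(-ind)                  # leaf: return its positive index
  c(descendant_leaves(merge[ind, 1], merge), # interior node: recurse on
    descendant_leaves(merge[ind, 2], merge)) # both children
}

hc <- hclust(dist(USArrests), "ave")
# The root's descendant set is the full leaf set {1, ..., p}
sort(descendant_leaves(nrow(hc$merge), hc$merge))
```

The traversal visits each node once, so the cost is linear in the size of the subtree rooted at `ind`.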
The function plots an hclust tree with branches and leaves colored based on group membership. The groups span the covariate indices {1, ..., nvars}. Covariates from the same group share an equal coefficient (beta), and sibling groups have different coefficients. The function determines groups based on the sparsity in gamma. In an hclust tree with beta[i] on the i-th leaf, the branch and leaf are colored blue, red or gray according to beta[i] being positive, negative or zero, respectively. The larger the magnitude of beta[i], the darker the color. Branches and leaves from the same group therefore share the same color.
group.plot(beta, gamma, A, hc, nbreaks = 20)
beta: Length-nvars vector of covariate coefficients.

gamma: Length-nnodes vector of latent variables corresponding to the tree nodes; its sparsity pattern determines the groups.

A: nvars-by-nnodes binary matrix encoding ancestor-descendant relationships between leaves and nodes in the tree.

hc: An hclust tree of nvars leaves, where each leaf corresponds to a covariate.

nbreaks: Number of breaks in binning beta elements for color coding (default = 20).
## Not run:
# See vignette for more details.
set.seed(100)
ts <- sample(1:length(data.rating), 400) # Train set indices
# Fit the model on train set
ourfit <- rarefit(y = data.rating[ts], X = data.dtm[ts, ], hc = data.hc,
                  lam.min.ratio = 1e-6, nlam = 20, nalpha = 10, rho = 0.01,
                  eps1 = 1e-5, eps2 = 1e-5, maxite = 1e4)
# Cross validation
ourfit.cv <- rarefit.cv(ourfit, y = data.rating[ts], X = data.dtm[ts, ],
                        rho = 0.01, eps1 = 1e-5, eps2 = 1e-5, maxite = 1e4)
# Visualize the groups at optimal beta and gamma
ibest.lambda <- ourfit.cv$ibest[1]
ibest.alpha <- ourfit.cv$ibest[2]
beta.opt <- ourfit$beta[[ibest.alpha]][, ibest.lambda]
gamma.opt <- ourfit$gamma[[ibest.alpha]][, ibest.lambda] # works if ibest.alpha > 1
group.plot(beta.opt, gamma.opt, ourfit$A, data.hc)
## End(Not run)
The function finds aggregated groups of leaf indices by traversing the non-zero gamma elements and collecting the descendant leaves at each such element. In our problem, gamma are latent variables corresponding to tree nodes. The order of the traversal is post-order, i.e., a node is visited after its descendants.
group.recover(gamma, A, postorder = seq(ncol(A)))
gamma: Length-nnodes vector of latent variables corresponding to the tree nodes; its non-zero pattern determines the groups.

A: nvars-by-nnodes binary matrix encoding ancestor-descendant relationships between leaves and nodes in the tree.

postorder: Length-nnodes vector giving a post-order traversal of the tree nodes. The default is seq(ncol(A)), which is valid when the columns of A are already arranged in post-order.

Value: returns a list of recovered groups of leaf indices.
## Not run:
# See vignette for more details.
set.seed(100)
ts <- sample(1:length(data.rating), 400) # Train set indices
# Fit the model on train set
ourfit <- rarefit(y = data.rating[ts], X = data.dtm[ts, ], hc = data.hc,
                  lam.min.ratio = 1e-6, nlam = 20, nalpha = 10, rho = 0.01,
                  eps1 = 1e-5, eps2 = 1e-5, maxite = 1e4)
# Cross validation
ourfit.cv <- rarefit.cv(ourfit, y = data.rating[ts], X = data.dtm[ts, ],
                        rho = 0.01, eps1 = 1e-5, eps2 = 1e-5, maxite = 1e4)
# Group recovered at optimal beta and gamma
ibest.lambda <- ourfit.cv$ibest[1]
ibest.alpha <- ourfit.cv$ibest[2]
gamma.opt <- ourfit$gamma[[ibest.alpha]][, ibest.lambda] # works if ibest.alpha > 1
groups.opt <- group.recover(gamma.opt, ourfit$A)
## End(Not run)
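The idea behind the traversal can be illustrated on a toy tree in base R (a hypothetical sketch, not the package's code): for the 4-leaf perfect binary tree, columns 1-4 of A are the leaves and columns 5-7 the interior nodes, and a group is the set of descendant leaves of a node with non-zero gamma, minus any leaves already claimed by a node visited earlier.

```r
# Toy sketch of group recovery from a sparse gamma and binary matrix A
A <- cbind(diag(4), c(1, 1, 0, 0), c(0, 0, 1, 1), c(1, 1, 1, 1))
gamma <- c(0, 0, 0.5, 0, 1.2, 0, 0)  # non-zero at leaf 3 and interior node 5

groups <- list()
claimed <- integer(0)
for (j in seq_len(ncol(A))) {        # columns here happen to respect post-order
  if (gamma[j] != 0) {
    leaves <- setdiff(which(A[, j] == 1), claimed)
    if (length(leaves)) {
      groups <- c(groups, list(leaves))
      claimed <- c(claimed, leaves)
    }
  }
}
groups  # leaf 3 forms one group; leaves 1 and 2 (fused at node 5) form another
```

Visiting nodes in post-order guarantees that deeper (more specific) nodes claim their leaves before their ancestors do.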
Fit the rare feature selection model proposed in Yan and Bien (2018) using an alternating direction method of multipliers (ADMM) algorithm described in Algorithm 1 of the same paper. The regularization path is computed over a two-dimensional grid of regularization parameters: lambda and alpha. Of the two, lambda controls the overall amount of regularization, and alpha controls the tradeoff between sparsity and fusion of beta (larger alpha induces more fusion in beta).
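For reference, a sketch of the underlying optimization problem (the displayed formula was lost in extraction; this follows the paper's formulation with the constraint beta = A gamma, omitting the exact loss scaling and any node weights):

```latex
\min_{\beta,\,\gamma}\;\; \tfrac{1}{2}\,\lVert y - X\beta \rVert_2^2
  \;+\; \lambda\Bigl( \alpha\,\lVert \gamma_{-\mathrm{root}} \rVert_1
  \;+\; (1-\alpha)\,\lVert \beta \rVert_1 \Bigr)
\quad\text{subject to}\quad \beta = A\gamma
```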
rarefit(y, X, A = NULL, Q = NULL, hc, intercept = T, lambda = NULL,
        alpha = NULL, nlam = 50, lam.min.ratio = 1e-04, nalpha = 10,
        rho = 0.01, eps1 = 1e-06, eps2 = 1e-05, maxite = 1e+06)
y: Length-nobs response vector.

X: nobs-by-nvars input matrix; each row is an observation and each column a covariate (associated with a leaf of hc).

A: nvars-by-nnodes binary matrix encoding ancestor-descendant relationships between leaves and nodes in the tree. Default is NULL, in which case A is generated from hc (see tree.matrix).

Q: Matrix with columns forming an orthonormal basis for the null space of the linear constraint linking beta and gamma. Default is NULL, in which case Q is computed internally from A.

hc: An hclust tree of nvars leaves, where each leaf corresponds to a covariate.

intercept: Whether an intercept should be fitted (default = TRUE) or set to zero (FALSE).

lambda: A user-supplied lambda sequence. Default is NULL, in which case the sequence is generated using nlam and lam.min.ratio.

alpha: A user-supplied alpha sequence of values in [0, 1]. Default is NULL, in which case a length-nalpha sequence is generated.

nlam: Number of lambda values (default = 50).

lam.min.ratio: Smallest value for lambda, as a fraction of the largest lambda value on the path (default = 1e-4).

nalpha: Number of alpha values (default = 10).

rho: Penalty parameter for the quadratic penalty in the ADMM algorithm. The default value is 0.01.

eps1: Convergence threshold in terms of the absolute tolerance level for the ADMM algorithm. The default value is 1e-6.

eps2: Convergence threshold in terms of the relative tolerance level for the ADMM algorithm. The default value is 1e-5.

maxite: Maximum number of passes over the data for every pair of (lambda, alpha). The default value is 1e6.
The function splits the model fitting path by alpha. At each alpha value, the model is fit on the entire sequence of lambda with warm starts. We recommend including an intercept (by setting intercept = T) unless the input data have been centered.
Returns the regression coefficients beta and gamma and the intercept beta0. We use a matrix-nested-within-list structure to store the coefficients: each list item corresponds to an alpha value; the matrix (or vector) in that list item stores coefficients at the various lambda values by columns (or entries).
beta0: Length-nalpha list; each item is a length-nlam vector of fitted intercepts.

beta: Length-nalpha list; each item is an nvars-by-nlam matrix of fitted beta coefficients, one column per lambda value.

gamma: Length-nalpha list; each item is an nnodes-by-nlam matrix of fitted gamma coefficients, one column per lambda value.

lambda: Sequence of lambda values used in the fit.

alpha: Sequence of alpha values used in the fit.

A: Binary matrix encoding ancestor-descendant relationships between leaves and nodes in the tree.

Q: Matrix with columns forming an orthonormal basis for the null space of the linear constraint linking beta and gamma.

intercept: Whether an intercept is included in the model fit.
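The matrix-nested-within-list layout can be indexed as below, illustrated on a mock object with the same shape (hypothetical values, not a real fit):

```r
# Mock coefficient container mirroring rarefit's return shape:
# one list item per alpha value; each item holds one column per lambda value.
nvars <- 3; nnodes <- 5; nlam <- 4; nalpha <- 2
fit <- list(
  beta  = lapply(seq_len(nalpha), function(a) matrix(0, nvars,  nlam)),
  gamma = lapply(seq_len(nalpha), function(a) matrix(0, nnodes, nlam)),
  beta0 = lapply(seq_len(nalpha), function(a) numeric(nlam))
)
# Coefficients at the 2nd alpha value and 3rd lambda value:
beta.23  <- fit$beta[[2]][, 3]   # length-nvars vector
gamma.23 <- fit$gamma[[2]][, 3]  # length-nnodes vector
b0.23    <- fit$beta0[[2]][3]    # scalar intercept
```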
Yan, X. and Bien, J. (2018) Rare Feature Selection in High Dimensions, https://arxiv.org/abs/1803.06675.
## Not run:
# See vignette for more details.
set.seed(100)
ts <- sample(1:length(data.rating), 400) # Train set indices
# Fit the model on train set
ourfit <- rarefit(y = data.rating[ts], X = data.dtm[ts, ], hc = data.hc,
                  lam.min.ratio = 1e-6, nlam = 20, nalpha = 10, rho = 0.01,
                  eps1 = 1e-5, eps2 = 1e-5, maxite = 1e4)
## End(Not run)
The function performs K-fold cross validation (CV) to choose an optimal pair (lambda, alpha) at which the model performs best according to the chosen error metric: mean squared error or mean absolute error.
rarefit.cv(fitObj, y, X, errtype = "mean-squared-error", nfolds = 5, ...)
fitObj: Output of rarefit, i.e., a rarefit object.

y: Response variable.

X: nobs-by-nvars input matrix; each row is an observation and each column a covariate.

errtype: Type of error metric used in cross validation. Available choices are mean-squared-error (default) and mean-absolute-error.

nfolds: Number of folds (default is 5).

...: Other arguments that can be passed to rarefit.

Value:

folds: A length-nfolds list of the observation indices in each fold.

errs: An nlam-by-nalpha-by-nfolds array of CV errors, one slice per fold.

m: An nlam-by-nalpha matrix of CV errors averaged across the folds.

se: An nlam-by-nalpha matrix of standard errors of the CV errors across the folds.

ibest: Indices of the pair of (lambda, alpha) values minimizing the CV error.

lambda.best: Value of lambda in the pair minimizing the CV error.

alpha.best: Value of alpha in the pair minimizing the CV error.
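How ibest relates to m can be illustrated on a toy mean-error matrix (hypothetical values; the package computes this internally):

```r
# Toy nlam-by-nalpha matrix of CV errors averaged across folds
m <- matrix(c(1.2, 0.9, 1.1,
              0.8, 1.0, 1.3), nrow = 3)   # 3 lambda values, 2 alpha values
# Locate the (lambda, alpha) pair with the smallest mean CV error
ibest <- which(m == min(m), arr.ind = TRUE)[1, ]
ibest  # first entry indexes lambda, second indexes alpha
```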
## Not run:
# See vignette for more details.
set.seed(100)
ts <- sample(1:length(data.rating), 400) # Train set indices
# Fit the model on train set
ourfit <- rarefit(y = data.rating[ts], X = data.dtm[ts, ], hc = data.hc,
                  lam.min.ratio = 1e-6, nlam = 20, nalpha = 10, rho = 0.01,
                  eps1 = 1e-5, eps2 = 1e-5, maxite = 1e4)
# Cross validation
ourfit.cv <- rarefit.cv(ourfit, y = data.rating[ts], X = data.dtm[ts, ],
                        rho = 0.01, eps1 = 1e-5, eps2 = 1e-5, maxite = 1e4)
## End(Not run)
The function makes predictions using a rarefit object at the optimal (lambda, alpha) pair chosen by rarefit.cv.
rarefit.predict(fitObj, cvObj, newx)
fitObj: Output of rarefit, i.e., a rarefit object.

cvObj: Output of rarefit.cv, i.e., a rarefit.cv object.

newx: Matrix of new values for x at which predictions are made.
Returns a sequence of predictions.
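The returned predictions amount to ordinary linear-model algebra at the selected coefficients; a mock sketch with hypothetical values (not the package's code):

```r
# Two new observations on three covariates, with mock selected coefficients
newx <- matrix(c(1, 0, 2,
                 0, 1, 1), nrow = 2, byrow = TRUE)
beta.opt  <- c(0.5, 0, -0.2)   # mock beta at the CV-chosen (lambda, alpha)
beta0.opt <- 3                 # mock intercept
pred <- as.numeric(newx %*% beta.opt + beta0.opt)
pred  # c(3.1, 2.8)
```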
## Not run:
# See vignette for more details.
set.seed(100)
ts <- sample(1:length(data.rating), 400) # Train set indices
# Fit the model on train set
ourfit <- rarefit(y = data.rating[ts], X = data.dtm[ts, ], hc = data.hc,
                  lam.min.ratio = 1e-6, nlam = 20, nalpha = 10, rho = 0.01,
                  eps1 = 1e-5, eps2 = 1e-5, maxite = 1e4)
# Cross validation
ourfit.cv <- rarefit.cv(ourfit, y = data.rating[ts], X = data.dtm[ts, ],
                        rho = 0.01, eps1 = 1e-5, eps2 = 1e-5, maxite = 1e4)
# Prediction on test set
pred <- rarefit.predict(ourfit, ourfit.cv, data.dtm[-ts, ])
pred.error <- mean((pred - data.rating[-ts])^2)
## End(Not run)
The function generates the binary matrix A defined in Yan and Bien (2018). The matrix encodes ancestor-descendant relationships between leaves and tree nodes in an hclust tree.
tree.matrix(hc)
hc: An hclust tree of nvars leaves, where each leaf corresponds to a covariate.
Returns a nvars-by-nnodes binary matrix A, where nvars is the number of leaves (we associate each covariate with a leaf) and nnodes is the number of tree nodes (including both leaves and interior nodes). For an hclust tree, nnodes = 2*nvars-1. A[i,j] is 1 if the i-th leaf is a descendant of the j-th node in the tree, and 0 otherwise. By default, the first nvars columns correspond to leaves and the remaining nvars-1 columns correspond to interior nodes. A is in sparse matrix format (inheriting from class sparseMatrix in package Matrix).
Yan, X. and Bien, J. (2018) Rare Feature Selection in High Dimensions, https://arxiv.org/abs/1803.06675.
find.leaves for finding descendant leaves of a node.
## Not run:
# For the perfect binary tree of depth 2 below
#
#      3
#     / \
#    1   2
#   /\   /\
# -1 -2 -3 -4
#
# A can be expressed as the following:
A_true <- cbind(diag(4), as.matrix(c(1, 1, 0, 0)),
                as.matrix(c(0, 0, 1, 1)), as.matrix(c(1, 1, 1, 1)))
# Now use tree.matrix to generate A
tree0 <- list()
tree0$merge <- matrix(c(-1, -2, -3, -4, 1, 2), ncol = 2, byrow = TRUE)
tree0$labels <- c("leaf1", "leaf2", "leaf3", "leaf4")
A <- tree.matrix(tree0)
all(A_true == as.matrix(A))
# Another example
hc <- hclust(dist(USArrests), "ave")
A <- tree.matrix(hc)
## End(Not run)