The rare package implements the rare feature selection procedure introduced in Yan, X. and Bien, J. (2018) Rare Feature Selection in High Dimensions. It includes functions for fitting the model, performing K-fold cross validation, making predictions for new observations, and visualizing aggregated groups of rare features in a colored dendrogram. In the following, we use a real data example to demonstrate how to use the relevant functions.
Rare features are hard to model because of their sparseness. For features measuring frequencies of rare events, Yan and Bien (2018) propose a regression framework for modeling rare features. They use a tree as side information to relate the p features based on their similarity and aggregate them in a flexible manner with a tree-based parametrization strategy. The tree, denoted 𝒯 and carrying the p feature coefficients β on its leaves, merges two features at an earlier stage if they are more “similar”. The tree can be learned from a different data source or based on other prior knowledge. To every tree node they assign a latent variable γ and express βj on the jth leaf as the sum of the γu over all nodes u that are ancestors of the jth leaf or the leaf itself: βj = ∑u ∈ ancestor(j) ∪ {j} γu. So sparsity in γ induces fusion of the βj’s in a subtree. For notational conciseness, we represent the equality constraint between β and γ for a tree by a binary matrix A ∈ {0, 1}p × |𝒯|: Ajk = 1{uk ∈ ancestor(j) ∪ {j}}.
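As a toy illustration of this parametrization, consider a tree with two leaves joined at a single root r. Then β1 = γ1 + γr and β2 = γ2 + γr, so the constraint β = Aγ reads $$ \begin{pmatrix}\beta_1\\ \beta_2\end{pmatrix} = \begin{pmatrix}1 & 0 & 1\\ 0 & 1 & 1\end{pmatrix}\begin{pmatrix}\gamma_1\\ \gamma_2\\ \gamma_r\end{pmatrix}, $$ with the columns of A ordered as (leaf 1, leaf 2, root). Zeroing out γ1 and γ2 fuses the coefficients to β1 = β2 = γr, which is exactly the aggregation induced by sparsity in γ.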
Under a linear model y = β0*1n + Xβ* + ϵ where y ∈ ℝn, X ∈ ℝn × p is the design matrix for p features and ϵ ∼ N(0, σ2In), our proposed estimator β̂ is the solution to the following optimization problem: $$ \min_{\beta\in\mathbb R^p, \gamma\in\mathbb R^{|\mathcal T|}}\left\{\frac{1}{2} \left\|y - X\beta - \beta_01_n \right\|_2^2 + \lambda \left(\alpha\left\|\gamma_{-root}\right\|_1 + (1-\alpha)\|\beta\|_1\right)\ \text{s.t. }\beta = A\gamma \right\} $$ where λ controls the overall regularization level while α determines the trade-off between fusion and sparsity in β̂. In practice, both λ and α are determined via cross validation. Refer to Yan and Bien (2018) for more details of the proposed framework.
To demonstrate the usage of rare, we use as an example a review data set crawled from TripAdvisor.com (used in https://www.cs.virginia.edu/~hw5x/paper/rp166f-wang.pdf). The original data set contains more than 200 thousand reviews and ratings. For the sake of demonstration, we randomly subset 500 reviews and 200 adjectives appearing in them from the data set. In each review, the user provides a rating on a scale ranging from 1 star to 5 stars. We model the rating with a Gaussian linear model: $y_i = \beta_0^* + \sum_{j=1}^pX_{ij}\beta_j^* + \epsilon_i$, where yi is the rating for the ith review, Xij counts the jth adjective in the ith review, and ϵi ∼ N(0, σ2) i.i.d. for σ > 0. We attach the sample data and a pre-trained hierarchical clustering tree to the package as data.rating, data.dtm and data.hc.
library(rare)
library(Matrix)
# Design matrix = document-term matrix
dim(data.dtm)
#> [1] 500 200
# Ratings for the reviews in data.dtm
length(data.rating)
#> [1] 500
The sample contains 200 adjectives, and most of them are highly sparse. Below is a histogram of the percentage of reviews using each adjective.
hist(colMeans(sign(data.dtm)) * 100, breaks = 50, main = "Histogram of Adjective Rarity in the TripAdvisor Sample",
xlab = "% of Reviews Using Adjective")
Our model relies on a hierarchical clustering tree as side
information to guide feature aggregation. In the example, we generate
the tree for adjectives in two steps: sentiment separation (negative
and positive) and hierarchical clustering within each sentiment
set. For sentiment separation, we use positive/negative emotion
words from NRC Emotion Lexicon (Mohammad and
Turney 2013) as train set to classify our adjectives to the two
sentiments using 5NN. In the hierarchical clustering step, we apply
hclust
on 100-dimensional word embeddings for adjectives,
which are pre-trained by GloVe (Pennington,
Socher, and Manning 2014) on Gigaword5 and Wikipedia2014 corpora.
The following dendrogram depicts the tree with 200 adjectives on the
leaves.
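The dendrogram can be drawn directly from the attached tree; a minimal sketch (with leaf labels suppressed so the 200 adjectives do not overlap) is:
# Plot the pre-trained hierarchical clustering tree over the 200 adjectives
plot(data.hc, labels = FALSE, hang = -1,
     main = "Hierarchical Clustering Tree for Adjectives", xlab = "", sub = "")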
We split the sample data into a training set and a test set at a ratio of 4:1. We let the program determine the λ and α sequences, after setting the sequence lengths to nlam = 20 and nalpha = 10. We fit the model on the training set over the two-dimensional grid of regularization parameters (λ, α). The rarefit function fits the model along α, i.e., at each α value the model is fitted over the entire sequence of λ values.
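The call below uses a vector ts of training-set indices. A minimal way to construct it, assuming a random 4:1 split (the seed is arbitrary), is:
# Randomly assign 80% of the 500 reviews to the training set
set.seed(100)
ts <- sample(length(data.rating), round(0.8 * length(data.rating)))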
ourfit <- rarefit(y = data.rating[ts], X = data.dtm[ts, ], hc = data.hc, lam.min.ratio = 1e-6,
nlam = 20, nalpha = 10, rho = 0.01, eps1 = 1e-5, eps2 = 1e-5, maxite = 1e4)
rarefit provides various options for users to customize the fit. We introduce some commonly used options here; they can all be specified as arguments to rarefit.
A is a p-by-|𝒯| binary matrix encoding ancestor-descendant relationships between the leaves (β) and the tree nodes (γ). If the tree 𝒯 is not generated by hclust, the user needs to provide A in sparse matrix format (inheriting from class sparseMatrix in the Matrix package). If 𝒯 is generated by hclust, the user can simply provide the tree via hc.
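As an illustration of the required sparse format, here is how such a matrix could be built with Matrix::sparseMatrix for a toy tree with three leaves, where node 4 merges leaves 1 and 2 and node 5 is the root. The node ordering rarefit expects for user-supplied trees should be checked against the package documentation, so treat this only as a sketch of the format:
# A[j, k] = 1 iff node k is the j-th leaf itself or one of its ancestors
A.toy <- Matrix::sparseMatrix(i = c(1, 1, 1, 2, 2, 2, 3, 3),
                              j = c(1, 4, 5, 2, 4, 5, 3, 5),
                              x = 1, dims = c(3, 5))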
Q is a (p + |𝒯|)-by-p matrix with columns forming an orthonormal basis for $\begin{pmatrix}I_p: -A \end{pmatrix}$. Computing Q can be time-consuming, especially when p is large. When fitting the model on the entire training set, the user does not need to compute Q separately (i.e., leaving it NULL is fine). Later, in cross validation, Q will be re-used every time the model is fitted on a different fold of the training set.
intercept is a Boolean value indicating whether an intercept should be fitted. The default is TRUE. We recommend always including an intercept unless the data set has been centered.
lambda can be provided, but typically it is not and the program constructs the sequence. When automatically generated, the λ sequence is determined by λmax and lam.min.ratio. The former is the smallest λ that sets all coefficients β to zero; the latter is the smallest λ value expressed as a fraction of λmax.
alpha is another sequence of regularization parameters and can also be provided. When automatically generated, the α sequence is a length-nalpha sequence of equally spaced values between 0 and 1. In practice, however, the user may find that the optimal α tends to lie in a smaller region within the [0, 1] interval. In that case, the user may consider providing their own alpha sequence, e.g., alpha = c(1 - exp(seq(0, log(1e-2), len = nalpha - 1)), 1), which is more granular towards 1.
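For instance, to refit with such a user-supplied α sequence (a hypothetical variant of the earlier call, with the other arguments kept as before):
# Denser alpha grid near 1, then refit on the training set
nalpha <- 10
alpha.seq <- c(1 - exp(seq(0, log(1e-2), len = nalpha - 1)), 1)
ourfit2 <- rarefit(y = data.rating[ts], X = data.dtm[ts, ], hc = data.hc,
                   lam.min.ratio = 1e-6, nlam = 20, alpha = alpha.seq,
                   rho = 0.01, eps1 = 1e-5, eps2 = 1e-5, maxite = 1e4)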
rho, eps1, eps2 and maxite are hyperparameters used in the ADMM algorithm for solving our optimization problem. Refer to Algorithm 1 in Yan and Bien (2018) for details.
rarefit returns the estimated coefficients β̂0, β̂ and γ̂ as length-nalpha lists: the jth entry of each list corresponds to the coefficients estimated at αj. In particular, β̂0[j] is a length-nlambda vector whose ith entry is the estimated intercept at (λi, αj); β̂[j] is a p-by-nlambda matrix whose ith column is the estimated β at (λi, αj); and γ̂[j] is a |𝒯|-by-nlambda matrix whose ith column is the estimated γ at (λi, αj). When α = 0, our problem reduces to the lasso on β and rarefit returns NA for γ̂ (because we use glmnet to solve the lasso on β); for all other, nonzero α values, γ̂ is solved for numerically.
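For example, the estimates at the ith λ and jth α value can be pulled out as follows (illustrative indices; the beta and gamma indexing mirrors how the fit is used later, and we assume the intercepts are stored as beta0):
# Estimates at the i-th lambda value and j-th alpha value
i <- 5; j <- 3
beta0.ij <- ourfit$beta0[[j]][i]    # estimated intercept at (lambda_i, alpha_j)
beta.ij  <- ourfit$beta[[j]][, i]   # length-p coefficient vector
gamma.ij <- ourfit$gamma[[j]][, i]  # length-|T| latent vector (NA when alpha_j = 0)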
To choose the optimal (λ, α) from the two-dimensional solution paths, we use K-fold cross validation. The function rarefit.cv performs K-fold cross validation based on the model fit from rarefit on the entire training set. rarefit.cv first randomly splits the training set into K folds of roughly equal size. At round k, rarefit.cv fits the model on all but the kth fold and predicts on the kth fold, generating the error metric errfun(y(k), ŷ(k)(λ, α)). The optimal tuning parameter pair is the minimizer of the average of these metrics across the K folds: $$ (\hat\lambda, \hat\alpha) = \arg\min_{\lambda, \alpha}\frac{1}{K} \sum_{k=1}^K errfun\left(y^{(k)}, \hat y^{(k)}(\lambda, \alpha)\right). $$ An option that allows the user to customize CV is errtype, a character string indicating the type of error function. Two error types are allowed: errtype = "mean-squared-error" or errtype = "mean-absolute-error". The default value for K is nfolds = 5.
# Cross validation
ourfit.cv <- rarefit.cv(ourfit, y = data.rating[ts], X = data.dtm[ts, ],
rho = 0.01, eps1 = 1e-5, eps2 = 1e-5, maxite = 1e4)
Note that CV is performed on the same sequences of λ and α as in the previous model fit, ourfit.
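For example, to cross-validate with mean absolute error and 10 folds instead of the defaults (a hypothetical variant of the call above):
# CV with mean absolute error and 10 folds
ourfit.cv.mae <- rarefit.cv(ourfit, y = data.rating[ts], X = data.dtm[ts, ],
                            errtype = "mean-absolute-error", nfolds = 10,
                            rho = 0.01, eps1 = 1e-5, eps2 = 1e-5, maxite = 1e4)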
After choosing the optimal (λ, α) via CV, we evaluate our model’s performance on the held-out test set (100 reviews and ratings from the sample). The function rarefit.predict is the one-click function for making new predictions, based on the model fit object ourfit and the CV object ourfit.cv (used for choosing the optimal (λ, α)).
# Prediction on test set
pred <- rarefit.predict(ourfit, ourfit.cv, data.dtm[-ts, ])
pred.error <- mean((pred - data.rating[-ts])^2)
pred.error
The predictions are made at (β̂0(λ̂CV, α̂CV), β̂(λ̂CV, α̂CV)), i.e., the regression coefficients (β̂0, β̂) estimated in ourfit at the CV-chosen optimal (λ̂CV, α̂CV).
In addition to the prediction performance of the model, we may also be interested in seeing how the model aggregates rare adjectives into groups. We provide two functions that allow the user to view the recovered groups at a given (β̂, γ̂): group.recover and group.plot.
The function group.recover determines the aggregated groups of leaf indices (i.e., elements of β) based on the sparsity in γ. In particular, we iterate over all non-zero γ elements in postorder; at every non-zero γ, we make its descendant leaves a group after excluding all leaves that have already appeared in previous groups. For example, suppose v1 and v2 are the only two children of some node u, with γv1 ≠ 0, γv2 = 0, γu ≠ 0 and γw = 0 for all w ∈ descendant(v1) ∪ descendant(v2). At node v1, we recover ℒ(𝒯v1) (the leaf set of the subtree rooted at v1) as a group. Then we move to u and recover ℒ(𝒯u) ∖ ℒ(𝒯v1) as a group. The postorder traversal across nodes with non-zero γ ensures that we recover the correct groups.
Since rarefit returns NA for γ̂ when solving at α = 0, group.recover (and the following group.plot) only work for the α ≠ 0 cases.
In the following, we find the groups aggregated at (β̂(λ̂CV, α̂CV), γ̂(λ̂CV, α̂CV)).
# Find recovered groups at optimal beta and gamma
ibest.lambda <- ourfit.cv$ibest[1]
ibest.alpha <- ourfit.cv$ibest[2]
beta.opt <- ourfit$beta[[ibest.alpha]][, ibest.lambda]
gamma.opt <- ourfit$gamma[[ibest.alpha]][, ibest.lambda]
groups.opt <- group.recover(gamma.opt, ourfit$A)
length(groups.opt) # total number of aggregated groups
#> [1] 32
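To see which adjectives make up a particular group, the leaf indices can be mapped back to the columns of the document-term matrix (assuming its columns are named by adjective):
# Adjectives in the first recovered group
colnames(data.dtm)[groups.opt[[1]]]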
In addition to returning a list of leaf indices representing the aggregated groups, we can visualize the groups on a dendrogram. The function group.plot colors the branches and leaves of an hclust tree based on the corresponding β values. In an hclust tree with βi on the ith leaf, the branch and leaf are colored blue, red or gray according to whether βi is positive, negative or zero, respectively. The larger the magnitude of βi, the darker the color. Thus, branches and leaves from the same group have the same color. In the following, we visualize the groups aggregated at (β̂(λ̂CV, α̂CV), γ̂(λ̂CV, α̂CV)).
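A sketch of the call, assuming group.plot takes the coefficient vector, the latent vector, the A matrix and the hclust tree in that order (see ?group.plot for the exact signature):
# Color the dendrogram by the coefficients estimated at the CV-selected (lambda, alpha)
group.plot(beta.opt, gamma.opt, ourfit$A, data.hc)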