Title: | Fast Searches for Interactions |
---|---|
Description: | Performs fast detection of interactions in large-scale data using the method of random intersection trees introduced in Shah, R. D. and Meinshausen, N. (2014) <http://www.jmlr.org/papers/v15/shah14a.html>. The algorithm finds potentially high-order interactions in high-dimensional binary two-class classification data, without requiring lower order interactions to be informative. The search is particularly fast when the matrices of predictors are sparse. It can also be used to perform market basket analysis when supplied with a single binary data matrix. Here it will find collections of columns which for many rows contain all 1's. |
Authors: | Hyun Jik Kim, Rajen D. Shah |
Maintainer: | Rajen D. Shah <[email protected]> |
License: | GPL-2 |
Version: | 0.1.2 |
Built: | 2024-11-05 06:19:32 UTC |
Source: | CRAN |
Function to perform random intersection trees. When two binary data matrices z (class 1) and z0 (class 0) are supplied, it searches for interactions. More precisely, since the data matrices are binary, each row of each matrix can be represented by the set of column indices with non-zero entries. The function searches for sets (interactions) that are more prevalent in class 1 than class 0, and then sets that are more prevalent in class 0 than class 1. When given a single binary matrix z with the argument z0 omitted, the function simply finds sets with high prevalence. Prevalences of interactions returned are estimated using min-wise hashing.
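The set representation described above can be made concrete with a small base R sketch. The helper `grow_tree` is hypothetical and not part of the package: it grows one simplified intersection tree by repeatedly intersecting a node's set with the set of a randomly chosen row, using a fixed number of children per node. It is only meant to show why sets shared by many rows tend to survive repeated intersection; the package's actual procedure is not reproduced here.

```r
## Hypothetical helper (not the package's implementation):
## grow one simplified intersection tree on a binary matrix.
grow_tree <- function(z, branch = 2L, depth = 3L) {
  n <- nrow(z)
  ## Represent each row by the set of column indices with non-zero entries
  rows_as_sets <- lapply(seq_len(n), function(i) which(z[i, ] != 0))
  ## Root node: the set of one randomly chosen observation
  nodes <- list(rows_as_sets[[sample.int(n, 1)]])
  for (d in seq_len(depth)) {
    children <- list()
    for (node in nodes) {
      for (b in seq_len(branch)) {
        ## Child = parent set intersected with a random observation's set
        child <- intersect(node, rows_as_sets[[sample.int(n, 1)]])
        if (length(child) > 0L) children[[length(children) + 1L]] <- child
      }
    }
    nodes <- children
  }
  unique(nodes)  # distinct non-empty sets remaining at the final depth
}

set.seed(1)
z <- matrix(rbinom(50 * 8, 1, 0.3), 50, 8)
z[, 1] <- 1  # columns 1 and 2 are active in every row,
z[, 2] <- 1  # so the set {1, 2} survives every intersection
leaves <- grow_tree(z, branch = 2L, depth = 3L)
```

In this toy matrix every surviving set contains variables 1 and 2, because they are active in every row; less prevalent variables are progressively removed by the intersections.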
RIT(z, z0, branch = 5, depth = 10L, n_trees = 100L, theta0 = 0.5, theta1 = theta0, min_inter_sz = 2L, L = 100L, n_cores = 1L, output_list = FALSE)
Argument | Description |
---|---|
z | data matrix where each row corresponds to an observation and columns correspond to variables. Can be in sparse matrix format (inheriting from class "sparseMatrix" in the Matrix package). |
z0 | optional second data matrix with the same number of columns as z. |
branch | average number of branches to use when creating each tree. |
depth | maximum depth of trees. |
n_trees | number of trees to be constructed. |
theta0 | when searching for sets of variables that are more prevalent in class 1 than class 0, the maximum threshold for prevalence in class 0. |
theta1 | as above but with class 1 and class 0 interchanged. |
min_inter_sz | minimum size of the interactions to be returned. |
L | number of rows of the min-wise hash matrix used to estimate prevalences. A larger value will result in more accurate estimates, but computation time will increase linearly with L. |
n_cores | number of cores for parallel processing. Only used when OpenMP is installed. |
output_list | if TRUE, each interaction is reported as an integer vector; if FALSE (the default), the output is a data frame or a list of two data frames. |
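To give some intuition for the L argument, the following base R sketch shows the general idea of min-wise hashing on the columns of a binary matrix. The helpers `minhash_matrix` and `jaccard_estimate` are hypothetical names, not package functions, and how the package turns such hashes into prevalence estimates is not shown here. Under a random ordering of the observations, the variables in a set all share the same min-hash exactly when the first observation in which any of them is active has all of them active, so the fraction of matching hash rows estimates the ratio of the intersection to the union of their supports.

```r
## Hypothetical helper: min-wise hash matrix with L rows, one column per variable.
## Entry [l, k] is the smallest permuted position of an observation with z[i, k] != 0.
minhash_matrix <- function(z, L) {
  n <- nrow(z)
  H <- matrix(NA_integer_, L, ncol(z))
  for (l in seq_len(L)) {
    perm <- sample.int(n)  # random ordering of the observations
    H[l, ] <- apply(z, 2, function(col) {
      active <- which(col != 0)
      if (length(active) == 0L) n + 1L else min(perm[active])
    })
  }
  H
}

## Fraction of hash rows on which all variables in `set` share one min-hash:
## an estimate of |intersection of supports| / |union of supports|.
jaccard_estimate <- function(H, set) {
  sub <- H[, set, drop = FALSE]
  mean(apply(sub, 1, function(h) all(h == h[1])))
}

set.seed(1)
z <- matrix(rbinom(200 * 6, 1, 0.5), 200, 6)
H <- minhash_matrix(z, L = 1000)

## Compare with the exact ratio for the pair of variables 1 and 2
exact <- sum(z[, 1] == 1 & z[, 2] == 1) / sum(z[, 1] == 1 | z[, 2] == 1)
est <- jaccard_estimate(H, c(1, 2))
```

Each additional hash row costs one more pass over the data, which is consistent with computation time growing linearly in L, while the estimate's variance shrinks as more rows are averaged.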
There are two tasks which can be performed with this function depending on whether or not z0 is supplied (note z must always be supplied).
1. If z0 is omitted, the function finds prevalent sets in z, and theta0 and theta1 are ignored.
2. If z0 is supplied, it searches for sets that are prevalent in z but have prevalence at most theta0 in z0. Next, sets that are prevalent in z0 but have prevalence in z at most theta1 are found.
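The role of theta0 in the second task can be illustrated with exact prevalences on a toy example. This is a sketch only: `prevalence` is a hypothetical helper, the candidate sets are supplied by hand, and exact row counts stand in for the min-wise hashing estimates that the function reports.

```r
## Hypothetical helper: exact prevalence of a set of columns, i.e. the
## proportion of rows in which every variable in the set is non-zero.
prevalence <- function(z, set) {
  mean(rowSums(z[, set, drop = FALSE] != 0) == length(set))
}

## Toy class 1 and class 0 matrices (4 observations, 3 variables each)
z  <- rbind(c(1, 1, 0), c(1, 1, 1), c(1, 1, 0), c(0, 1, 1))
z0 <- rbind(c(1, 0, 1), c(0, 1, 1), c(1, 1, 1), c(0, 0, 1))

## Hand-picked candidate interactions (assumed for illustration)
candidates <- list(c(1, 2), c(2, 3), c(1, 3))
theta0 <- 0.3  # illustrative threshold, not the default of 0.5

prev1 <- sapply(candidates, prevalence, z = z)   # prevalence in class 1
prev0 <- sapply(candidates, prevalence, z = z0)  # prevalence in class 0

## Keep candidates whose class 0 prevalence does not exceed theta0
keep <- prev0 <= theta0
candidates[keep]
```

Here only the set of variables 1 and 2 passes the threshold: it appears in three of the four rows of z but in only one of the four rows of z0.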
If output_list is FALSE (the default), the output is either a data frame (if z0 is omitted) or a list of two data frames (if z0 is supplied). The first column of each data frame is a character vector of interaction sets, with the variables in each set separated by spaces; the second column contains the estimated prevalences. When z0 is supplied, the interactions in the first component of the list, named Class1, are those which are prevalent in z, and their prevalences in z are reported. The second component, named Class0, contains those interactions prevalent in z0, and their prevalences in z0 are reported.
When output_list is TRUE, each interaction is reported as an integer vector and so the collection of interactions is a list of such vectors.
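When working with the data frame output, it can be convenient to convert the space-separated interaction strings into integer vectors. The sketch below uses a mock data frame in place of real output; the column names `Interaction` and `Prevalence` and the values are assumptions made for illustration, so the columns are best accessed by position when in doubt.

```r
## Mock output shaped like the data frame described above
## (column names and values are assumed, not taken from a real run)
out <- data.frame(
  Interaction = c("1 2", "3 7 9"),
  Prevalence  = c(0.31, 0.12),
  stringsAsFactors = FALSE
)

## Split each string on spaces and convert the pieces to integers
sets <- lapply(strsplit(out[[1]], " ", fixed = TRUE), as.integer)

## Size of each interaction, e.g. for filtering by order
sizes <- lengths(sets)
```

The resulting `sets` object has the same shape as the collection described for output_list = TRUE: a list with one integer vector per interaction.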
Hyun Jik Kim, Rajen D. Shah
Shah, R. D. and Meinshausen, N. (2014) Random Intersection Trees. Journal of Machine Learning Research, 15, 629–654.
    ## Generate two binary matrices
    z <- matrix(rbinom(250*500, 1, 0.3), 250, 500)
    z0 <- matrix(rbinom(250*500, 1, 0.3), 250, 500)

    ## Make the first and second cols of z identical
    ## so the set 1, 2 has prevalence roughly 0.3 compared
    ## to roughly 0.09 for any other pair of columns
    z[, 1] <- z[, 2]

    ## Similarly for z0
    z0[, 3] <- z0[, 4]

    ## Market basket analysis
    out1 <- RIT(z)
    out1[1:5, ]

    ## Finding interactions
    out2 <- RIT(z, z0)
    out2$Class1[1:5, ]
    out2$Class0[1:5, ]

    ## Can also perform the above using sparse matrices
    if (require(Matrix)) {
      S <- Matrix(z, sparse=TRUE)
      S0 <- Matrix(z0, sparse=TRUE)
      out3 <- RIT(S, S0)
    }