Title: | Variable Selection by Revisited Knockoffs Procedures |
---|---|
Description: | Performs variable selection for many types of L1-regularised regressions using the revisited knockoffs procedure. This procedure uses a matrix of knockoffs of the covariates that is independent of the response variable Y. A covariate is judged to belong to the model according to whether it enters the model before or after its knockoff. The procedure is suitable for a wide range of regressions with various types of response variables. The available regression models come from the R packages 'glmnet' and 'ordinalNet'. Based on the paper Gegout-Petit A., Gueudin A., Karmann C. (2019) <arXiv:1907.03153>. |
Authors: | Clemence Karmann [aut, cre], Aurelie Gueudin [aut] |
Maintainer: | Clemence Karmann <[email protected]> |
License: | GPL-3 |
Version: | 0.0.1 |
Built: | 2024-11-28 06:29:18 UTC |
Source: | CRAN |
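The "enters before or after its knockoff" comparison from the description can be illustrated in base R. This is a minimal sketch, not the package's implementation: the entry values `Z` and `Zk` are made-up numbers standing for the largest penalty at which each covariate and its knockoff enter the model, and the signed-maximum form of W is the one used in the classical knockoff filter, assumed here purely for illustration.

```r
# Hypothetical entry values: the largest penalty lambda at which each
# covariate (Z) and its knockoff (Zk) first gets a nonzero coefficient.
# Lambda decreases along the path, so a larger value means earlier entry.
Z  <- c(0.90, 0.75, 0.10, 0.05)
Zk <- c(0.20, 0.15, 0.30, 0.05)

# Signed-maximum statistic (assumed form): positive when the covariate
# enters before its knockoff, negative when the knockoff enters first,
# and zero on ties.
W <- pmax(Z, Zk) * sign(Z - Zk)
W
```

Large positive entries of `W` are the ones a selection rule would favour; negative entries indicate that the knockoff entered the model first.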
Returns the vector of statistics W of the revisited knockoffs procedure for regressions available in the R package glmnet. Most of the parameters come from glmnet(). See the glmnet documentation for more details.
ko.glm(x, y, family = "gaussian", alpha = 1, type.gaussian = ifelse(nvars < 500, "covariance", "naive"), type.logistic = "Newton", type.multinomial = "ungrouped", nVal = 50, random = FALSE)
x |
Input matrix, of dimension nobs x nvars; each row is an observation vector. Can be in sparse matrix format (inherit from class " |
y |
Response variable. Quantitative for |
family |
Response type: "gaussian","binomial","poisson","multinomial","cox". Not available for "mgaussian". |
alpha |
The elasticnet mixing parameter, with 0 <= |
type.gaussian |
See |
type.logistic |
See |
type.multinomial |
See |
nVal |
Length of lambda sequence - default is 50. |
random |
If |
A vector of dimension nvars corresponding to the statistics W.
# see ko.sel
Returns the vector of statistics W of the revisited knockoffs procedure for regressions available in the R package ordinalNet. Most of the parameters come from ordinalNet(). See the ordinalNet documentation for more details.
ko.ordinal(x, y, family = "cumulative", reverse = FALSE, link = "logit", alpha = 1, parallelTerms = TRUE, nonparallelTerms = FALSE, nVal = 100, warn = FALSE, random = FALSE)
x |
Covariate matrix, of dimension nobs x nvars; each row is an observation vector. It is recommended that categorical covariates are converted to a set of indicator variables with a variable for each category (i.e. no baseline category); otherwise the choice of baseline category will affect the model fit. |
y |
Response variable. Can be a factor, ordered factor, or a matrix where each row is a multinomial vector of counts. A weighted fit can be obtained using the matrix option, since the row sums are essentially observation weights. Non-integer matrix entries are allowed. |
family |
Specifies the type of model family. Options are "cumulative" for cumulative probability, "sratio" for stopping ratio, "cratio" for continuation ratio, and "acat" for adjacent category. |
reverse |
Logical. If TRUE, then the "backward" form of the model is fit, i.e. the model is defined with response categories in reverse order. For example, the reverse cumulative model with K+1 response categories applies the link function to the cumulative probabilities P(Y >= 2), …, P(Y >= K+1), rather than P(Y <= 1), …, P(Y <= K). |
link |
Specifies the link function. The options supported are logit, probit, complementary log-log, and cauchit. |
alpha |
The elastic net mixing parameter, with |
parallelTerms |
Logical. If |
nonparallelTerms |
Logical. If |
nVal |
Length of lambda sequence - default is 100. |
warn |
Logical. If |
random |
If |
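The recommendations for `x` and `y` above can be sketched in base R. This is only an illustration under stated assumptions: the factor `colour`, the response `y_fac`, and the object names are hypothetical, and the sketch shows one way to build indicator columns without a baseline category and a count-matrix response, not how the package handles them internally.

```r
# Hypothetical categorical covariate with three levels.
colour <- factor(c("red", "green", "blue", "green", "red"))

# One indicator column per category: removing the intercept ("- 1")
# keeps every level, so no baseline category is singled out.
x_cat <- model.matrix(~ colour - 1)
colnames(x_cat)

# Hypothetical ordinal response with three ordered categories.
y_fac <- factor(c("low", "high", "mid", "mid", "low"),
                levels = c("low", "mid", "high"), ordered = TRUE)

# Matrix form of the same response: each row is a vector of counts,
# here a single 1 in the observed category, so every row sum
# (observation weight) equals 1.
y_mat <- diag(nlevels(y_fac))[as.integer(y_fac), ]
colnames(y_mat) <- levels(y_fac)
y_mat
```

Multiplying a row of `y_mat` by a positive number would give that observation a correspondingly larger weight, since the row sums act as observation weights.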
A vector of dimension nvars corresponding to the statistics W.
Setting nonparallelTerms = TRUE is strongly discouraged because the knockoffs procedure is not well suited to this setting.
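To make the `reverse` argument concrete, the forward and backward cumulative probabilities it refers to can be computed by hand. A minimal sketch, assuming a made-up probability vector `p` for a single observation with K + 1 = 4 response categories; the variable names are illustrative only.

```r
# Hypothetical class probabilities for one observation (K + 1 = 4 categories).
p <- c(0.1, 0.2, 0.3, 0.4)
K <- length(p) - 1

# Forward form: P(Y <= 1), ..., P(Y <= K).
forward <- cumsum(p)[1:K]

# Backward (reverse) form: P(Y >= 2), ..., P(Y >= K + 1).
backward <- rev(cumsum(rev(p)))[2:(K + 1)]

# Logit link applied to each set of cumulative probabilities.
qlogis(forward)
qlogis(backward)
```

Because P(Y >= k + 1) = 1 - P(Y <= k), the two sets of logits differ only in sign for this single probability vector.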
# see ko.sel
Performs variable selection from an object (vector of statistics W) returned by ko.glm or ko.ordinal.
ko.sel(W, print = FALSE, method = "stats")
W |
A vector of length nvars corresponding to the statistics W. Object returned by the functions |
print |
Logical. If |
method |
Can be |
A list containing two elements:
threshold
A positive real value corresponding to the threshold used.
estimation
A binary vector of length nvars corresponding to the variable selection: 1*(W >= threshold). 1 indicates that the associated covariate belongs to the estimated model.
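The `estimation` element is defined above as 1*(W >= threshold). The following sketch applies that rule to made-up numbers; the values of `W` and `threshold` are hypothetical, and in practice the threshold is the one chosen by ko.sel according to `method`.

```r
# Hypothetical statistics W, of the kind returned by ko.glm or ko.ordinal.
W <- c(2.1, 0.0, -0.4, 1.3, 0.2, -1.1)

# Hypothetical positive threshold (ko.sel would determine it itself).
threshold <- 1.0

# Binary selection vector: 1 marks a covariate kept in the estimated model.
estimation <- 1 * (W >= threshold)
estimation

# Indices of the selected covariates.
which(estimation == 1)
```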
Gegout-Petit Anne, Gueudin Aurelie, Karmann Clemence (2019). The revisited knockoffs method for variable selection in L1-penalised regressions, arXiv:1907.03153.
library(graphics)

# linear Gaussian regression
n = 100
p = 20
set.seed(11)
x = matrix(rnorm(n*p), nrow = n, ncol = p)
beta = c(rep(1,5), rep(0,15))
y = x%*%beta + rnorm(n)
W = ko.glm(x, y)
ko.sel(W, print = TRUE)

# logistic regression
n = 100
p = 20
set.seed(11)
x = matrix(runif(n*p, -1, 1), nrow = n, ncol = p)
u = runif(n)
beta = c(c(3:1), rep(0,17))
y = rep(0, n)
a = 1/(1+exp(0.1-x%*%beta))
y = 1*(u>a)
W = ko.glm(x, y, family = 'binomial', nVal = 50)
ko.sel(W, print = TRUE)

# cumulative logit regression
n = 100
p = 10
set.seed(11)
x = matrix(runif(n*p), nrow = n, ncol = p)
u = runif(n)
beta = c(3, rep(0,9))
y = rep(0, n)
a = 1/(1+exp(0.8-x%*%beta))
b = 1/(1+exp(-0.6-x%*%beta))
y = 1*(u<a) + 2*((u>=a) & (u<b)) + 3*(u>=b)
W = ko.ordinal(x, as.factor(y), nVal = 20)
ko.sel(W, print = TRUE)

# adjacent logit regression
n = 100
p = 10
set.seed(11)
x = matrix(rnorm(n*p), nrow = n, ncol = p)
U = runif(n)
beta = c(5, rep(0,9))
alpha = c(-2, 1.5)
M = 2
y = rep(0, n)
for(i in 1:n){
  eta = alpha + sum(beta*x[i,])
  u = U[i]
  Prob = rep(1, M+1)
  for(j in 1:M){
    Prob[j] = exp(sum(eta[j:M]))
  }
  Prob = Prob/sum(Prob)
  C = cumsum(Prob)
  C = c(0, C)
  j = 1
  while((C[j] > u) || (u >= C[j+1])){j = j+1}
  y[i] = j
}
W = ko.ordinal(x, as.factor(y), family = 'acat', nVal = 10)
ko.sel(W, method = 'manual')
0.4

# How to use randomness?
n = 100
p = 20
set.seed(11)
x = matrix(rnorm(n*p), nrow = n, ncol = p)
beta = c(5:1, rep(0,15))
y = x%*%beta + rnorm(n)
Esti = 0
for(i in 1:100){
  W = ko.glm(x, y, random = TRUE)
  Esti = Esti + ko.sel(W, method = 'gaps')$estimation
}
Esti
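In the last example, `Esti` accumulates the binary selections over 100 runs, so each entry counts how often a covariate was selected. One possible way to read such counts is sketched below; the counts are made up, and the 50% cutoff is an arbitrary assumption rather than a rule from the package or the paper.

```r
# Hypothetical selection counts over 100 randomised runs
# (one entry per covariate), standing in for the Esti vector.
n_runs <- 100
Esti <- c(100, 98, 95, 71, 40, 3, 0, 12, 1, 0)

# Convert counts to selection frequencies.
freq <- Esti / n_runs

# Keep covariates selected in at least half of the runs
# (the 0.5 cutoff is an illustrative choice).
cutoff <- 0.5
stable <- which(freq >= cutoff)
stable
```

A stricter or looser cutoff trades false selections against missed covariates, so the choice should be adapted to the application.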