This package provides a kernel knockoffs selection procedure, dubbed KKO, for the nonparametric additive model. The procedure integrates three key components: knockoffs, subsampling for stability, and random feature mapping for nonparametric function approximation. A finite-sample false discovery rate (FDR) control guarantee is established for KKO; see Dai et al. (2021).
Let us begin by creating some synthetic data. The data are generated from an additive polynomial function.
### generate regression coefficient
p=20 # number of predictors
sig_mag=10 # signal strength
s=5 # sparsity, number of nonzero component functions
reg_coef=c(rep(1,s),rep(0,p-s)) # regression coefficient
### generate response and design
n= 600 # sample size
X=matrix(rnorm(n*p),n,p) # generate design
X_k = create.second_order(X) # generate knockoff
y=generate_data(X,reg_coef,model) # response
We then apply the KKO method to generate importance scores for the variables.
rkernel="laplacian" # kernel choice
rk_scale=1 # scaling parameter of kernel
rfn_range=c(2,3,4) # number of random features
cv_folds=15 # folds of cross-validation in group lasso
n_stb=200 # number of subsamples for importance scores
n_stb_tune=100 # number of subsamples for tuning random feature number
frac_stb=1/2 # fraction of subsample
nCores_para=2 # number of cores for parallelization
### KKO selection
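To make the random feature mapping component more concrete, here is a minimal sketch in base R of random Fourier features for a one-dimensional Laplacian kernel. It is an illustration under stated assumptions, not the package's implementation: the helper `laplacian_rff`, its arguments, the kernel form exp(-|x - y| / scale), and the Cauchy sampling of frequencies are all choices made for this example.

```r
# Hypothetical helper (not part of KKO): random Fourier features for the
# one-dimensional Laplacian kernel k(x, y) = exp(-|x - y| / scale).
laplacian_rff <- function(x, n_features, scale = 1) {
  # The spectral density of this kernel is Cauchy with scale 1 / scale
  omega <- rcauchy(n_features, location = 0, scale = 1 / scale)
  b <- runif(n_features, 0, 2 * pi)  # random phases
  phase <- matrix(b, length(x), n_features, byrow = TRUE)
  # length(x) x n_features matrix with entries sqrt(2 / D) * cos(omega_k * x_i + b_k)
  sqrt(2 / n_features) * cos(outer(x, omega) + phase)
}

set.seed(1)
x <- c(0, 0.5, 2)
Z <- laplacian_rff(x, n_features = 20000)
K_approx <- Z %*% t(Z)                  # inner products of the feature maps
K_exact <- exp(-abs(outer(x, x, "-")))  # exact Laplacian kernel matrix
max(abs(K_approx - K_exact))            # small when n_features is large
```

The large `n_features` here only serves to show that the feature inner products approximate the kernel; the settings above use a handful of random features per predictor (`rfn_range`).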
The KKO importance scores are the differences in selection frequency between the variables and their knockoffs, ranging from −1 to 1. Active variables are expected to have high positive scores (close to one), while the scores of null variables are expected to stay centered at zero.
## [1] 10 10 -10 -10 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [20] 0
## [1] 0.703333333 0.160000000 0.870000000 0.886666667 0.776666667
## [6] -0.006666667 0.023333333 -0.040000000 -0.006666667 0.000000000
## [11] -0.003333333 -0.003333333 -0.003333333 0.000000000 -0.043333333
## [16] -0.016666667 -0.030000000 0.003333333 0.000000000 -0.003333333
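The definition of the scores as a difference of selection frequencies can be sketched in a few lines of base R. The selection indicators below are simulated purely for illustration, and the helper `importance_scores` and the matrix layout (one row per subsample, original variables in the first `p` columns, knockoffs in the last `p`) are assumptions of this sketch rather than the package's internals.

```r
# Hypothetical helper (not part of KKO): selection frequency of each variable
# minus the selection frequency of its knockoff copy.
importance_scores <- function(sel, p) {
  freq_var <- colMeans(sel[, 1:p, drop = FALSE])
  freq_knockoff <- colMeans(sel[, (p + 1):(2 * p), drop = FALSE])
  freq_var - freq_knockoff
}

set.seed(1)
p_demo <- 4      # number of variables in this toy example
n_sub <- 200     # number of subsamples
# Assumed selection probabilities: two active variables, two nulls, four knockoffs
prob <- c(0.9, 0.8, 0.1, 0.1, rep(0.1, p_demo))
# Simulated 0/1 indicators; column j is selected with probability prob[j]
sel <- matrix(rbinom(n_sub * 2 * p_demo, 1, rep(prob, each = n_sub)),
              n_sub, 2 * p_demo)
W_demo <- importance_scores(sel, p_demo)
W_demo
```

Because each frequency lies in [0, 1], the scores necessarily lie in [−1, 1], and a variable that is selected about as often as its knockoff ends up near zero.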
myplot = ggplot(mydata, aes(W, fill = var_group)) +
geom_histogram(color = "gray2",binwidth=1/p) + theme_bw()+
xlab("Importance scores")+ylab("Number of variables")+
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_bar()`).
We apply the knockoff filter to the KKO importance scores. The filter computes a threshold on the scores and picks the variables whose scores are at or above the threshold.
fdr=0.2 #FDR control level
thres = knockoff.threshold(W, fdr=fdr) # thresholding on scores by knockoff filter
selected = which(W >= thres)
selected # indices of selected variables
## [1] 1 2 3 4 5
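The thresholding rule can also be written out by hand. The sketch below follows the standard knockoff filter rule: take the smallest candidate t for which (offset + #{j : W_j <= -t}) / max(1, #{j : W_j >= t}) is at most the target FDR level. The function name `knockoff_threshold_sketch` is hypothetical and exists only for illustration; in practice `knockoff.threshold` from the knockoff package should be used. The scores are copied from the output printed earlier.

```r
# Hypothetical re-implementation, for illustration only
knockoff_threshold_sketch <- function(W, fdr = 0.2, offset = 1) {
  ts <- sort(c(0, abs(W)))  # candidate thresholds
  # Estimated false discovery proportion at each candidate threshold
  ratio <- sapply(ts, function(t) (offset + sum(W <= -t)) / max(1, sum(W >= t)))
  ok <- which(ratio <= fdr)
  if (length(ok) > 0) ts[ok[1]] else Inf  # Inf means nothing is selected
}

# Importance scores as printed above
W_printed <- c(0.703333333, 0.160000000, 0.870000000, 0.886666667, 0.776666667,
               -0.006666667, 0.023333333, -0.040000000, -0.006666667, 0.000000000,
               -0.003333333, -0.003333333, -0.003333333, 0.000000000, -0.043333333,
               -0.016666667, -0.030000000, 0.003333333, 0.000000000, -0.003333333)
thres_sketch <- knockoff_threshold_sketch(W_printed, fdr = 0.2)
selected_sketch <- which(W_printed >= thres_sketch)
```

On these scores the sketch yields a threshold of 0.16 and selects variables 1 to 5, in agreement with the output above. With `offset = 1` at least 1 / fdr variables must clear the threshold before anything is selected, which is why exactly five discoveries are possible at level 0.2 here.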