Title: | Using Fence Methods for Model Selection |
---|---|
Description: | This method is a new class of model selection strategies, for mixed model selection, which includes linear and generalized linear mixed models. The idea involves a procedure to isolate a subgroup of what are known as correct models (of which the optimal model is a member). This is accomplished by constructing a statistical fence, or barrier, to carefully eliminate incorrect models. Once the fence is constructed, the optimal model is selected from among those within the fence according to a criterion which can be made flexible. References: 1. Jiang J., Rao J.S., Gu Z., Nguyen T. (2008), Fence Methods for Mixed Model Selection. The Annals of Statistics, 36(4): 1669-1692. <DOI:10.1214/07-AOS517> <https://projecteuclid.org/euclid.aos/1216237296>. 2. Jiang J., Nguyen T., Rao J.S. (2009), A Simplified Adaptive Fence Procedure. Statistics and Probability Letters, 79, 625-629. <DOI:10.1016/j.spl.2008.10.014> <https://www.researchgate.net/publication/23991417_A_simplified_adaptive_fence_procedure> 3. Jiang J., Nguyen T., Rao J.S. (2010), Fence Method for Nonparametric Small Area Estimation. Survey Methodology, 36(1), 3-11. <http://publications.gc.ca/collections/collection_2010/statcan/12-001-X/12-001-x2010001-eng.pdf>. 4. Jiming Jiang, Thuan Nguyen and J. Sunil Rao (2011), Invisible fence methods and the identification of differentially expressed gene sets. Statistics and Its Interface, Volume 4, 403-415. <http://www.intlpress.com/site/pub/files/_fulltext/journals/sii/2011/0004/0003/SII-2011-0004-0003-a014.pdf>. 5. Thuan Nguyen & Jiming Jiang (2012), Restricted fence method for covariate selection in longitudinal data analysis. Biostatistics, 13(2), 303-314. <DOI:10.1093/biostatistics/kxr046> <https://academic.oup.com/biostatistics/article/13/2/303/263903/Restricted-fence-method-for-covariate-selection-in>. 6. Thuan Nguyen, Jie Peng, Jiming Jiang (2014), Fence Methods for Backcross Experiments. Statistical Computation and Simulation, 84(3), 644-662. <DOI:10.1080/00949655.2012.721885> <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3891925/>. 7. Jiang, J. (2014), The fence methods, in Advances in Statistics, Hindawi Publishing Corp., Cairo. <DOI:10.1155/2014/830821>. 8. Jiming Jiang and Thuan Nguyen (2015), The Fence Methods, World Scientific, Singapore. <https://www.abebooks.com/9789814596060/Fence-Methods-Jiming-Jiang-981459606X/plp>. |
Authors: | Jiming Jiang, Jianyang Zhao, J. Sunil Rao, Thuan Nguyen |
Maintainer: | Thuan Nguyen <[email protected]> |
License: | BSD_2_clause + file LICENSE |
Version: | 1.0 |
Built: | 2024-12-07 06:48:54 UTC |
Source: | CRAN |
Adaptive Fence model selection
adaptivefence(mf, f, ms, d, lf, pf, bs, grid = 101, bandwidth)
adaptivefence(mf, f, ms, d, lf, pf, bs, grid = 101, bandwidth)
mf |
function for fitting the model |
f |
formula of full model |
ms |
list of formula of candidates models |
d |
data |
lf |
measure lack of fit (to minimize) |
pf |
model selection criteria, e.g., model dimension |
bs |
bootstrap samples |
grid |
grid for c |
bandwidth |
bandwidth for kernel smooth function |
models |
list all model candidates in the model space |
B |
list the number of bootstrap samples that have been used |
lack_of_fit_matrix |
list a matrix of Qs for all model candidates (in columns). Each row is for each bootstrap sample |
Qd_matrix |
list a matrix of QM - QM.tilde for all model candidates. Each row is for each bootrap sample |
bandwidth |
list the value of bandwidth |
model_mat |
list a matrix of selected models at each c values in grid (in columns). Each row is for each bootstrap sample |
freq_mat |
list a matrix of coverage probabilities (frequency/smooth_frequency) of each selected models for a given c value (index) |
c |
list the adaptive choice of c value from which the parsimonious model is selected |
sel_model |
list the selected (parsimonious) model given the adaptive c value |
Jiming Jiang Jianyang Zhao J. Sunil Rao Thuan Nguyen
Jiang J., Rao J.S., Gu Z., Nguyen T. (2008), Fence Methods for Mixed Model Selection. The Annals of Statistics, 36(4): 1669-1692
Jiang J., Nguyen T., Rao J.S. (2009), A Simplified Adaptive Fence Procedure. Statistics and Probability Letters, 79, 625-629
Thuan Nguyen, Jie Peng, Jiming Jiang (2014), Fence Methods for Backcross Experiments. Statistical Computation and Simulation, 84(3), 644-662
## Not run: require(fence) #### Example 1 ##### data(iris) full = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + (1|Species) test_af = fence.lmer(full, iris) plot(test_af) test_af$sel_model #### Example 2 ##### r =1234; set.seed(r) p=8; n=150; rho = 0.6 id = rep(1:50,each=3) R = diag(p) for(i in 1:p){ for(j in 1:p){ R[i,j] = rho^(abs(i-j)) } } R = 1*R x=mvrnorm(n, rep(0, p), R) # all x's are time-varying dependence # colnames(x)=paste('x',1:p, sep='') tbetas = c(0,0.5,1,0,0.5,1,0,0.5) # non-zero beta 2,3,5,6,8 epsilon = rnorm(150) y = x%*%tbetas + epsilon colnames(y) = 'y' data = data.frame(cbind(x,y,id)) full = y ~ x1+x2+x3+x4+x5+x6+x7+x8+(1|id) #X = paste('x',1:p, sep='', collapse='+') #full = as.formula(paste('y~',X,'+(1|id)', sep="")) #same as previous one fence_obj = fence.lmer(full,data) # it takes 3-5 min # plot(fence_obj) fence_obj$sel_model ## End(Not run)
## Not run: require(fence) #### Example 1 ##### data(iris) full = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + (1|Species) test_af = fence.lmer(full, iris) plot(test_af) test_af$sel_model #### Example 2 ##### r =1234; set.seed(r) p=8; n=150; rho = 0.6 id = rep(1:50,each=3) R = diag(p) for(i in 1:p){ for(j in 1:p){ R[i,j] = rho^(abs(i-j)) } } R = 1*R x=mvrnorm(n, rep(0, p), R) # all x's are time-varying dependence # colnames(x)=paste('x',1:p, sep='') tbetas = c(0,0.5,1,0,0.5,1,0,0.5) # non-zero beta 2,3,5,6,8 epsilon = rnorm(150) y = x%*%tbetas + epsilon colnames(y) = 'y' data = data.frame(cbind(x,y,id)) full = y ~ x1+x2+x3+x4+x5+x6+x7+x8+(1|id) #X = paste('x',1:p, sep='', collapse='+') #full = as.formula(paste('y~',X,'+(1|id)', sep="")) #same as previous one fence_obj = fence.lmer(full,data) # it takes 3-5 min # plot(fence_obj) fence_obj$sel_model ## End(Not run)
Adaptive Fence model selection (Small Area Estmation)
adaptivefence.fh(mf, f, ms, d, lf, pf, bs, grid = 101, bandwidth, method)
adaptivefence.fh(mf, f, ms, d, lf, pf, bs, grid = 101, bandwidth, method)
mf |
Call function, for example: default calls: function(m, b) eblupFH(formula = m, vardir = D, data = b, method = "FH") |
f |
Full Model |
ms |
find candidate model, findsubmodel.fh(full) |
d |
Dimension number |
lf |
Measures lack of fit using function(res) -res$fit$goodness[1] |
pf |
Dimensions of model |
bs |
Bootstrap |
grid |
grid for c |
bandwidth |
bandwidth for kernel smooth function |
method |
Method to be used. Fay-Herriot method is the default. |
In Jiang et. al (2008), the adaptive c value is chosen from the highest peak in the p* vs. c plot. In Jiang et. al (2009), 95% CI is taken into account while choosing such an adaptive choice of c. In Thuan Nguyen et. al (2014), the adaptive c value is chosen from the first peak. This approach works better in the moderate sample size or weak signal situations. Empirically, the first peak becomes highest peak when sample size increases or signals become stronger
models |
list all model candidates in the model space |
B |
list the number of bootstrap samples that have been used |
lack_of_fit_matrix |
list a matrix of Qs for all model candidates (in columns). Each row is for each bootstrap sample |
Qd_matrix |
list a matrix of QM - QM.tilde for all model candidates. Each row is for each bootrap sample |
bandwidth |
list the value of bandwidth |
model_mat |
list a matrix of selected models at each c values in grid (in columns). Each row is for each bootstrap sample |
freq_mat |
list a matrix of coverage probabilities (frequency/smooth_frequency) of each selected models for a given c value (index) |
c |
list the adaptive choice of c value from which the parsimonious model is selected |
sel_model |
list the selected (parsimonious) model given the adaptive c value |
The current Fence package focuses on variable selection. However, Fence methods can be used to select other parameters of interest, e.g., tunning parameter, variance-covariance structure, etc.
The number of bootstrap samples is suggested to be increased, e.g., B=1000 when the sample size is small, or signals are weak
Jiming Jiang Jianyang Zhao J. Sunil Rao Thuan Nguyen
Jiang J., Rao J.S., Gu Z., Nguyen T. (2008), Fence Methods for Mixed Model Selection. The Annals of Statistics, 36(4): 1669-1692
Jiang J., Nguyen T., Rao J.S. (2009), A Simplified Adaptive Fence Procedure. Statistics and Probability Letters, 79, 625-629
Thuan Nguyen, Jie Peng, Jiming Jiang (2014), Fence Methods for Backcross Experiments. Statistical Computation and Simulation, 84(3), 644-662
## Not run: require(fence) ### example 1 #### data("kidney") data = kidney[-which.max(kidney$x),] # Delete a suspicious data point # data$x2 = data$x^2 data$x3 = data$x^3 data$x4 = data$x^4 data$D = data$sqrt.D.^2 plot(data$y ~ data$x) full = y~x+x2+x3+x4 testfh = fence.sae(full, data, B=1000, fence="adaptive", method="F-H", D = D) testfh$sel_model testfh$c ## End(Not run)
## Not run: require(fence) ### example 1 #### data("kidney") data = kidney[-which.max(kidney$x),] # Delete a suspicious data point # data$x2 = data$x^2 data$x3 = data$x^3 data$x4 = data$x^4 data$D = data$sqrt.D.^2 plot(data$y ~ data$x) full = y~x+x2+x3+x4 testfh = fence.sae(full, data, B=1000, fence="adaptive", method="F-H", D = D) testfh$sel_model testfh$c ## End(Not run)
Fence model selection (Linear Mixed Model)
fence.lmer(full, data, B = 100, grid = 101, fence = c("adaptive", "nonadaptive"), cn = NA, REML = TRUE, bandwidth = NA, cpus = parallel::detectCores())
fence.lmer(full, data, B = 100, grid = 101, fence = c("adaptive", "nonadaptive"), cn = NA, REML = TRUE, bandwidth = NA, cpus = parallel::detectCores())
full |
formula of full model |
data |
data |
B |
number of bootstrap samples, parametric bootstrap is used |
grid |
grid for c |
fence |
a procedure of the fence method to be used. It's suggested to choose nonadaptive procedure if c is known; otherwise nonadaptive must be chosen |
cn |
cn value for nonadaptive |
REML |
Restricted Maximum Likelihood approach |
bandwidth |
bandwidth for kernel smooth function |
cpus |
Number of parallel computers |
In Jiang et. al (2008), the adaptive c value is chosen from the highest peak in the p* vs. c plot. In Jiang et. al (2009), 95% CI is taken into account while choosing such an adaptive choice of c. In Thuan Nguyen et. al (2014), the adaptive c value is chosen from the first peak. This approach works better in the moderate sample size or weak signal situations. Empirically, the first peak becomes highest peak when sample size increases or signals become stronger
models |
list all model candidates in the model space |
B |
list the number of bootstrap samples that have been used |
lack_of_fit_matrix |
list a matrix of Qs for all model candidates (in columns). Each row is for each bootstrap sample |
Qd_matrix |
list a matrix of QM - QM.tilde for all model candidates. Each row is for each bootrap sample |
bandwidth |
list the value of bandwidth |
model_mat |
list a matrix of selected models at each c values in grid (in columns). Each row is for each bootstrap sample |
freq_mat |
list a matrix of coverage probabilities (frequency/smooth_frequency) of each selected models for a given c value (index) |
c |
list the adaptive choice of c value from which the parsimonious model is selected |
sel_model |
list the selected (parsimonious) model given the adaptive c value |
@note The current Fence package focuses on variable selection. However, Fence methods can be used to select other parameters of interest, e.g., tunning parameter, variance-covariance structure, etc.
Jiming Jiang Jianyang Zhao J. Sunil Rao Thuan Nguyen
Jiang J., Rao J.S., Gu Z., Nguyen T. (2008), Fence Methods for Mixed Model Selection. The Annals of Statistics, 36(4): 1669-1692
Jiang J., Nguyen T., Rao J.S. (2009), A Simplified Adaptive Fence Procedure. Statistics and Probability Letters, 79, 625-629
Thuan Nguyen, Jie Peng, Jiming Jiang (2014), Fence Methods for Backcross Experiments. Statistical Computation and Simulation, 84(3), 644-662
require(fence) library(snow) #### Example 1 ##### data(iris) full = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + (1|Species) # Takes greater than 5 seconds to run # test_af = fence.lmer(full, iris) # test_af$c # test_naf = fence.lmer(full, iris, fence = "nonadaptive", cn = 12) # plot(test_af) # test_af$sel_model # test_naf$sel_model
require(fence) library(snow) #### Example 1 ##### data(iris) full = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + (1|Species) # Takes greater than 5 seconds to run # test_af = fence.lmer(full, iris) # test_af$c # test_naf = fence.lmer(full, iris, fence = "nonadaptive", cn = 12) # plot(test_af) # test_af$sel_model # test_naf$sel_model
Fence model selection (Noparametric Model)
fence.NF(full, data, spline, ps = 1:3, qs = NA, B = 100, grid = 101, bandwidth = NA, lambda)
fence.NF(full, data, spline, ps = 1:3, qs = NA, B = 100, grid = 101, bandwidth = NA, lambda)
full |
formula of full model |
data |
data |
spline |
variable needed for spline terms |
ps |
order of power |
qs |
number of knots |
B |
number of bootstrap sample, parametric for lmer |
grid |
grid for c |
bandwidth |
bandwidth for kernel smooth function |
lambda |
A grid of lambda values |
models |
list all model candidates with p polynomial degrees and q knots in the model space |
Qd_matrix |
list a matrix of QM - QM.tilde for all model candidates. Each row is for each bootrap sample |
bandwidth |
list the value of bandwidth |
model_mat |
list a matrix of selected models at each c values in grid (in columns). Each row is for each bootstrap sample |
freq_mat |
list a matrix of coverage probabilities (frequency/smooth_frequency) of each selected models for a given c value (index) |
c |
list the adaptive choice of c value from which the parsimonious model is selected |
lambda |
penalty (or smoothing) parameter estimate given selected p and q |
sel_model |
list the selected (parsimonious) model given the adaptive c value |
beta.est.u |
A list of coefficient estimates given a lambda value |
f.x.hat |
A vector of fitted values obtained from a given lambda value and beta.est.u |
@note The current Fence method in Nonparametric model focuses on one spline variable. This method can be extended to a general case with more than one spline variables, and includes non-spline variables.
Jiming Jiang Jianyang Zhao J. Sunil Rao Bao-Qui Tran Thuan Nguyen
Jiang J., Rao J.S., Gu Z., Nguyen T. (2008), Fence Methods for Mixed Model Selection. The Annals of Statistics, 36(4): 1669-1692
Jiang J., Nguyen T., Rao J.S. (2009), A Simplified Adaptive Fence Procedure. Statistics and Probability Letters, 79, 625-629
Jiang J., Nguyen T., Rao J.S. (2010), Fence Method for Nonparametric Small Area Estimation. Survey Methodology, 36, 1, 3-11
## Not run: require(fence) n = 100 set.seed(1234) x=runif(n,0,3) y = 1-x+x^2- 2*(x-1)^2*(x>1) + 2*(x-2)^2*(x>2) + rnorm(n,sd=.2) lambda=exp((c(1:60)-30)/3) data=data.frame(cbind(x,y)) test_NF = fence.NF(full=y~x, data=data, spline='x', ps=c(1:3), qs=c(2,5), B=1000, lambda=lambda) plot(test_NF) summary <- summary(test_NF) model_sel <- summary[[1]] model_sel lambda_sel <- summary[[2]] lambda_sel ## End(Not run)
## Not run: require(fence) n = 100 set.seed(1234) x=runif(n,0,3) y = 1-x+x^2- 2*(x-1)^2*(x>1) + 2*(x-2)^2*(x>2) + rnorm(n,sd=.2) lambda=exp((c(1:60)-30)/3) data=data.frame(cbind(x,y)) test_NF = fence.NF(full=y~x, data=data, spline='x', ps=c(1:3), qs=c(2,5), B=1000, lambda=lambda) plot(test_NF) summary <- summary(test_NF) model_sel <- summary[[1]] model_sel lambda_sel <- summary[[2]] lambda_sel ## End(Not run)
Fence model selection (Small Area Estmation)
fence.sae(full, data, B = 100, grid = 101, fence = c("adaptive", "nonadaptive"), cn = NA, method = c("F-H", "NER"), D = NA, REML = FALSE, bandwidth = NA, cpus = parallel::detectCores())
fence.sae(full, data, B = 100, grid = 101, fence = c("adaptive", "nonadaptive"), cn = NA, method = c("F-H", "NER"), D = NA, REML = FALSE, bandwidth = NA, cpus = parallel::detectCores())
full |
formular of full model |
data |
data |
B |
number of bootstrap sample, parametric for lmer |
grid |
grid for c |
fence |
fence method to be used, e.g., adaptive, or nonadaptive. It's suggested to choose nonadaptive procedure if c is known; otherwise nonadaptive must be chosen |
cn |
cn for nonadaptive |
method |
Select method to use |
D |
vector containing the D sampling variances of direct estimators for each domain. The values must be sorted as the variables in formula. Only used in FH model |
REML |
Restricted Maximum Likelihood approach |
bandwidth |
bandwidth for kernel smooth function |
cpus |
Number of parallel computers |
In Jiang et. al (2008), the adaptive c value is chosen from the highest peak in the p* vs. c plot. In Jiang et. al (2009), 95% CI is taken into account while choosing such an adaptive choice of c. In Thuan Nguyen et. al (2014), the adaptive c value is chosen from the first peak. This approach works better in the moderate sample size or weak signal situations. Empirically, the first peak becomes highest peak when sample size increases or signals become stronger
models |
list all model candidates in the model space |
B |
list the number of bootstrap samples that have been used |
lack_of_fit_matrix |
list a matrix of Qs for all model candidates (in columns). Each row is for each bootstrap sample |
Qd_matrix |
list a matrix of QM - QM.tilde for all model candidates. Each row is for each bootrap sample |
bandwidth |
list the value of bandwidth |
model_mat |
list a matrix of selected models at each c values in grid (in columns). Each row is for each bootstrap sample |
freq_mat |
list a matrix of coverage probabilities (frequency/smooth_frequency) of each selected models for a given c value (index) |
c |
list the adaptive choice of c value from which the parsimonious model is selected |
sel_model |
list the selected (parsimonious) model given the adaptive c value |
The current Fence package focuses on variable selection. However, Fence methods can be used to select other parameters of interest, e.g., tunning parameter, variance-covariance structure, etc.
The number of bootstrap samples is suggested to be increased, e.g., B=1000 when the sample size is small, or signals are weak
Jiming Jiang Jianyang Zhao J. Sunil Rao Thuan Nguyen
Jiang J., Rao J.S., Gu Z., Nguyen T. (2008), Fence Methods for Mixed Model Selection. The Annals of Statistics, 36(4): 1669-1692
Jiang J., Nguyen T., Rao J.S. (2009), A Simplified Adaptive Fence Procedure. Statistics and Probability Letters, 79, 625-629
Thuan Nguyen, Jie Peng, Jiming Jiang (2014), Fence Methods for Backcross Experiments. Statistical Computation and Simulation, 84(3), 644-662
require(fence) library(snow) ### example 1 #### data("kidney") data = kidney[-which.max(kidney$x),] # Delete a suspicious data point # data$x2 = data$x^2 data$x3 = data$x^3 data$x4 = data$x^4 data$D = data$sqrt.D.^2 plot(data$y ~ data$x) full = y~x+x2+x3+x4 # Takes more than 5 seconds to run # testfh = fence.sae(full, data, B=100, fence="adaptive", method="F-H", D = D) # testfh$sel_model # testfh$c
require(fence) library(snow) ### example 1 #### data("kidney") data = kidney[-which.max(kidney$x),] # Delete a suspicious data point # data$x2 = data$x^2 data$x3 = data$x^3 data$x4 = data$x^4 data$D = data$sqrt.D.^2 plot(data$y ~ data$x) full = y~x+x2+x3+x4 # Takes more than 5 seconds to run # testfh = fence.sae(full, data, B=100, fence="adaptive", method="F-H", D = D) # testfh$sel_model # testfh$c
Invisible Fence model selection (Linear Model)
IF.lm(full, data, B = 100, cpus = 2, lftype = c("abscoef", "pvalue"))
IF.lm(full, data, B = 100, cpus = 2, lftype = c("abscoef", "pvalue"))
full |
formula of full model |
data |
data |
B |
number of bootstrap sample, parametric for lm |
cpus |
number of parallel computers |
lftype |
subtractive measure type, e.g., absolute value of coefficients, p-value, t-value, etc. |
This method (Jiang et. al, 2011) is motivated by computational expensive in complex and high dimensional problem. The idea of the method–there is the best model in each dimension (in model space). The boostrapping determines the coverage probability of the selected model in each dimensions. The parsimonious model is the selected model with the highest coverage probabily (except the one for the full model, always probability of 1.)
full |
list the full model |
B |
list the number of bootstrap samples that have been used |
freq |
list the coverage probabilities of the selected model for each dimension |
size |
list the number of variables in the parsimonious model |
term |
list variables included in the full model |
model |
list the variables selected in-the-order in the parsimonious model |
@note The current Invisible Fence focuses on variable selection. The current routine is applicable to the case in which the subtractive measure is the absolute value of the coefficients, p-value, t-value. However, the method can be extended to other subtractive measures. See Jiang et. al (2011) for more details.
Jiming Jiang Jianyang Zhao J. Sunil Rao Thuan Nguyen
Jiang J., Rao J.S., Gu Z., Nguyen T. (2008), Fence Methods for Mixed Model Selection. The Annals of Statistics, 36(4): 1669-1692
Jiming Jiang, Thuan Nguyen and J. Sunil Rao (2011), Invisible fence methods and the identification of differentially expressed gene sets. Statistics and Its Interface, Volume 4, 403-415.
library(fence) library(MASS) library(snow) r =1234; set.seed(r) p=10; n=300; rho = 0.6 R = diag(p) for(i in 1:p){ for(j in 1:p){ R[i,j] = rho^(abs(i-j)) } } R = 1*R x=mvrnorm(n, rep(0, p), R) colnames(x)=paste('x',1:p, sep='') X = cbind(rep(1,n),x) tbetas = c(1,1,1,0,1,1,0,1,0,0,0) # non-zero beta 1,2,4,5,7 epsilon = rnorm(n) y = as.matrix(X)%*%tbetas + epsilon colnames(y) = 'y' data = data.frame(cbind(X,y)) full = y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10 # Takes greater than 5 seconds (~`17 seconds) to run # obj1 = IF.lm(full = full, data = data, B = 100, lftype = "abscoef") # sort((names(obj1$model$coef)[-1])) # obj2 = IF.lm(full = full, data = data, B = 100, lftype = "pvalue") # sort(setdiff(names(data[c(-1,-12)]), names(obj2$model$coef)))
library(fence) library(MASS) library(snow) r =1234; set.seed(r) p=10; n=300; rho = 0.6 R = diag(p) for(i in 1:p){ for(j in 1:p){ R[i,j] = rho^(abs(i-j)) } } R = 1*R x=mvrnorm(n, rep(0, p), R) colnames(x)=paste('x',1:p, sep='') X = cbind(rep(1,n),x) tbetas = c(1,1,1,0,1,1,0,1,0,0,0) # non-zero beta 1,2,4,5,7 epsilon = rnorm(n) y = as.matrix(X)%*%tbetas + epsilon colnames(y) = 'y' data = data.frame(cbind(X,y)) full = y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10 # Takes greater than 5 seconds (~`17 seconds) to run # obj1 = IF.lm(full = full, data = data, B = 100, lftype = "abscoef") # sort((names(obj1$model$coef)[-1])) # obj2 = IF.lm(full = full, data = data, B = 100, lftype = "pvalue") # sort(setdiff(names(data[c(-1,-12)]), names(obj2$model$coef)))
Invisible Fence model selection (Linear Mixed Model)
IF.lmer(full, data, B = 100, REML = TRUE, method = c("marginal", "conditional"), cpus = parallel::detectCores(), lftype = c("abscoef", "tvalue"))
IF.lmer(full, data, B = 100, REML = TRUE, method = c("marginal", "conditional"), cpus = parallel::detectCores(), lftype = c("abscoef", "tvalue"))
full |
formula of full model |
data |
data |
B |
number of bootstrap sample, parametric for lmer |
REML |
Restricted maximum likelihood estimation |
method |
choose either marginal (e.g., GEE) or conditional model |
cpus |
Number of parallel computers |
lftype |
subtractive measure type, e.g., absolute value of coefficients, p-value, t-value, etc. |
This method (Jiang et. al, 2011) is motivated by computational expensive in complex and high dimensional problem. The idea of the method–there is the best model in each dimension (in model space). The boostrapping determines the coverage probability of the selected model in each dimensions. The parsimonious model is the selected model with the highest coverage probabily (except the one for the full model, always probability of 1.)
full |
list the full model |
B |
list the number of bootstrap samples that have been used |
freq |
list the coverage probabilities of the selected model for each dimension |
size |
list the number of variables in the parsimonious model |
term |
list variables included in the full model |
model |
list the variables selected in-the-order in the parsimonious model |
@note The current Invisible Fence focuses on variable selection. The current routine is applicable to the case in which the subtractive measure is the absolute value of the coefficients, p-value, t-value. However, the method can be extended to other subtractive measures. See Jiang et. al (2011) for more details.
Jiming Jiang Jianyang Zhao J. Sunil Rao Thuan Nguyen
Jiang J., Rao J.S., Gu Z., Nguyen T. (2008), Fence Methods for Mixed Model Selection. The Annals of Statistics, 36(4): 1669-1692
Jiming Jiang, Thuan Nguyen and J. Sunil Rao (2011), Invisible fence methods and the identification of differentially expressed gene sets. Statistics and Its Interface, Volume 4, 403-415.
require(fence) library(snow) library(MASS) data("X.lmer") data = data.frame(X.lmer) # non-zero beta I.col.2, I.col.3a, I.col.3b, V5, V7, V8, V9 beta = matrix(c(0, 1, 1, 1, 1, 0, 0.1, 0.05, 0.25, 0), ncol = 1) set.seed(1234) alpha = rep(rnorm(100), each = 3) mu = alpha + as.matrix(data[,-1]) %*% beta data$id = as.factor(data$id) data$y = mu + rnorm(300) raw = "y ~ (1|id)+I.col.2+I.col.3a+I.col.3b" for (i in 5:10) { raw = paste0(raw, "+V", i) } full = as.formula(raw) # The following output takes more than 5 seconds (~70 seconds) to run. # obj1.lmer = IF.lmer(full = full, data = data, B = 100, method="conditional",lftype = "abscoef") # sort(obj1.lmer$model) # obj2.lmer = IF.lmer(full = full, data = data, B = 100, method="conditional",lftype = "tvalue") # sort(obj2.lmer$model) # Similarly, the following scenarios can be run # obj2.lmer = IF.lmer(full = full, data = data, B = 100, method="conditional",lftype = "tvalue") # sort(obj2.lmer$model) # obj1.lm = IF.lmer(full = full, data = data, B = 100, method="marginal", lftype = "abscoef") # sort(names(obj1.lm$model$coefficients[-1])) # obj2.lm = IF.lmer(full = full, data = data, B = 100, method="marginal", lftype = "tvalue") # sort(names(obj2.lm$model$coefficients[-1]))
require(fence) library(snow) library(MASS) data("X.lmer") data = data.frame(X.lmer) # non-zero beta I.col.2, I.col.3a, I.col.3b, V5, V7, V8, V9 beta = matrix(c(0, 1, 1, 1, 1, 0, 0.1, 0.05, 0.25, 0), ncol = 1) set.seed(1234) alpha = rep(rnorm(100), each = 3) mu = alpha + as.matrix(data[,-1]) %*% beta data$id = as.factor(data$id) data$y = mu + rnorm(300) raw = "y ~ (1|id)+I.col.2+I.col.3a+I.col.3b" for (i in 5:10) { raw = paste0(raw, "+V", i) } full = as.formula(raw) # The following output takes more than 5 seconds (~70 seconds) to run. # obj1.lmer = IF.lmer(full = full, data = data, B = 100, method="conditional",lftype = "abscoef") # sort(obj1.lmer$model) # obj2.lmer = IF.lmer(full = full, data = data, B = 100, method="conditional",lftype = "tvalue") # sort(obj2.lmer$model) # Similarly, the following scenarios can be run # obj2.lmer = IF.lmer(full = full, data = data, B = 100, method="conditional",lftype = "tvalue") # sort(obj2.lmer$model) # obj1.lm = IF.lmer(full = full, data = data, B = 100, method="marginal", lftype = "abscoef") # sort(names(obj1.lm$model$coefficients[-1])) # obj2.lm = IF.lmer(full = full, data = data, B = 100, method="marginal", lftype = "tvalue") # sort(names(obj2.lm$model$coefficients[-1]))
Invisible Fence model selection
invisiblefence(mf, f, d, lf, bs)
invisiblefence(mf, f, d, lf, bs)
mf |
Call function, for example: default calls: function(m, b) eblupFH(formula = m, vardir = D, data = b, method = "FH") |
f |
Full model |
d |
Dimension number |
lf |
Measures lack of fit using function(res) -res$fit$goodness[1] |
bs |
Bootstrap |
This method (Jiang et. al, 2011) is motivated by computational expensive in complex and high dimensional problem. The idea of the method–there is the best model in each dimension (in model space). The boostrapping determines the coverage probability of the selected model in each dimensions. The parsimonious model is the selected model with the highest coverage probabily (except the one for the full model, always probability of 1.)
full |
list the full model |
B |
list the number of bootstrap samples that have been used |
freq |
list the coverage probabilities of the selected model for each dimension |
size |
list the number of variables in the parsimonious model |
term |
list variables included in the full model |
model |
list the variables selected in-the-order in the parsimonious model |
@note The current Invisible Fence focuses on variable selection. The current routine is applicable to the case in which the subtractive measure is the absolute value of the coefficients, p-value, t-value. However, the method can be extended to other subtractive measures. See Jiang et. al (2011) for more details.
Jiming Jiang Jianyang Zhao J. Sunil Rao Thuan Nguyen
Jiang J., Rao J.S., Gu Z., Nguyen T. (2008), Fence Methods for Mixed Model Selection. The Annals of Statistics, 36(4): 1669-1692
Jiming Jiang, Thuan Nguyen and J. Sunil Rao (2011), Invisible fence methods and the identification of differentially expressed gene sets. Statistics and Its Interface, Volume 4, 403-415.
## Not run: data("X.lmer") data = data.frame(X.lmer) beta = matrix(c(0, 1, 1, 1, 1, 0, 0.1, 0.05, 0.25, 0), ncol = 1) set.seed(1234) alpha = rep(rnorm(100), each = 3) mu = alpha + as.matrix(data[,-1]) %*% beta data$id = as.factor(data$id) data$y = mu + rnorm(300) raw = "y ~ (1|id)+I.col.2+I.col.3a+I.col.3b" for (i in 5:10) { raw = paste0(raw, "+V", i) } full = as.formula(raw) obj1.lmer = IF.lmer(full = full, data = data, B = 100, method="conditional",lftype = "abscoef") obj1.lmer$model$coefficients obj2.lmer = IF.lmer(full = full, data = data, B = 100, method="conditional",lftype = "tvalue") obj2.lmer$model$coefficients obj1.lm = IF.lmer(full = full, data = data, B = 100, method="marginal", lftype = "abscoef") obj1.lm$model$coefficients obj2.lm = IF.lmer(full = full, data = data, B = 100, method="marginal", lftype = "tvalue") obj2.lm$model$coefficients ## End(Not run)
## Not run: data("X.lmer") data = data.frame(X.lmer) beta = matrix(c(0, 1, 1, 1, 1, 0, 0.1, 0.05, 0.25, 0), ncol = 1) set.seed(1234) alpha = rep(rnorm(100), each = 3) mu = alpha + as.matrix(data[,-1]) %*% beta data$id = as.factor(data$id) data$y = mu + rnorm(300) raw = "y ~ (1|id)+I.col.2+I.col.3a+I.col.3b" for (i in 5:10) { raw = paste0(raw, "+V", i) } full = as.formula(raw) obj1.lmer = IF.lmer(full = full, data = data, B = 100, method="conditional",lftype = "abscoef") obj1.lmer$model$coefficients obj2.lmer = IF.lmer(full = full, data = data, B = 100, method="conditional",lftype = "tvalue") obj2.lmer$model$coefficients obj1.lm = IF.lmer(full = full, data = data, B = 100, method="marginal", lftype = "abscoef") obj1.lm$model$coefficients obj2.lm = IF.lmer(full = full, data = data, B = 100, method="marginal", lftype = "tvalue") obj2.lm$model$coefficients ## End(Not run)
Data used for kidney example
kidney
kidney
A data frame with 4 variables
Nonadaptive Fence model selection
nonadaptivefence(mf, f, ms, d, lf, pf, cn)
nonadaptivefence(mf, f, ms, d, lf, pf, cn)
mf |
function for fitting the model |
f |
formula of full model |
ms |
list of formula of candidates models |
d |
data |
lf |
measure lack of fit (to minimize) |
pf |
model selection criteria, e.g., model dimension |
cn |
given a specific c value |
models |
list all model candidates in the model space |
lack_of_fit |
list a vector of Qs for all model candidates |
formula |
list the model of the selected parsimonious model |
sel_model |
list the selected (parsimonious) model given the adaptive c value |
Jiming Jiang Jianyang Zhao J. Sunil Rao Thuan Nguyen
Jiang J., Rao J.S., Gu Z., Nguyen T. (2008), Fence Methods for Mixed Model Selection. The Annals of Statistics, 36(4): 1669-1692
Jiang J., Nguyen T., Rao J.S. (2009), A Simplified Adaptive Fence Procedure. Statistics and Probability Letters, 79, 625-629
Thuan Nguyen, Jie Peng, Jiming Jiang (2014), Fence Methods for Backcross Experiments. Statistical Computation and Simulation, 84(3), 644-662
## Not run: require(fence) #### Example 1 ##### data(iris) full = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + (1|Species) test_naf = fence.lmer(full, iris, fence = "nonadaptive", cn = 12) test_naf$sel_model ## End(Not run)
## Not run: require(fence) #### Example 1 ##### data(iris) full = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + (1|Species) test_naf = fence.lmer(full, iris, fence = "nonadaptive", cn = 12) test_naf$sel_model ## End(Not run)
Plot Adaptive Fence model selection
## S3 method for class 'AF' plot(x = res, ...)
## S3 method for class 'AF' plot(x = res, ...)
x |
Object to be plotted |
... |
Additional arguments. CNot currently used. |
Plot Nonparametric Fence model selection
## S3 method for class 'NF' plot(x = res, ...)
## S3 method for class 'NF' plot(x = res, ...)
x |
Object to be plotted |
... |
Additional arguments. CNot currently used. |
Adaptive Fence model selection (Restricted Fence)
RF(full, data, groups, B = 100, grid = 101, bandwidth = NA, plot = FALSE, method = c("marginal", "conditional"), id = "id", cpus = parallel::detectCores())
RF(full, data, groups, B = 100, grid = 101, bandwidth = NA, plot = FALSE, method = c("marginal", "conditional"), id = "id", cpus = parallel::detectCores())
full |
formula of full model |
data |
data |
groups |
A list of formulas of (full) model in each bins (groups) of variables |
B |
number of bootstrap sample, parametric for lmer |
grid |
grid for c |
bandwidth |
bandwidth for kernel smooth function |
plot |
Plot object |
method |
either marginal (GEE) or conditional approach is selected |
id |
Subject or cluster id variable |
cpus |
Number of parallel computers |
In Jiang et. al (2008), the adaptive c value is chosen from the highest peak in the p* vs. c plot. In Jiang et. al (2009), 95% CI is taken into account while choosing such an adaptive choice of c. In Thuan Nguyen et. al (2014), the adaptive c value is chosen from the first peak. This approach works better in the moderate sample size or weak signal situations. Empirically, the first peak becomes highest peak when sample size increases or signals become stronger
models |
list all model candidates in the model space |
B |
list the number of bootstrap samples that have been used |
lack_of_fit_matrix |
list a matrix of Qs for all model candidates (in columns). Each row is for each bootstrap sample |
Qd_matrix |
list a matrix of QM - QM.tilde for all model candidates. Each row is for each bootrap sample |
bandwidth |
list the value of bandwidth |
model_mat |
list a matrix of selected models at each c values in grid (in columns). Each row is for each bootstrap sample |
freq_mat |
list a matrix of coverage probabilities (frequency/smooth_frequency) of each selected models for a given c value (index) |
c |
list the adaptive choice of c value from which the parsimonious model is selected |
sel_model |
list the selected (parsimonious) model given the adaptive c value |
bandwidth = (cs[2] - cs[1]) * 3. So it's chosen as 3 times grid between two c values.
Jiang J., Rao J.S., Gu Z., Nguyen T. (2008), Fence Methods for Mixed Model Selection. The Annals of Statistics, 36(4): 1669-1692
Jiang J., Nguyen T., Rao J.S. (2009), A Simplified Adaptive Fence Procedure. Statistics and Probability Letters, 79, 625-629
Thuan Nguyen, Jiming Jiang (2012), Restricted fence method for covariate selection in longitudinal data analysis. Biostatistics, 13(2), 303-314
Thuan Nguyen, Jie Peng, Jiming Jiang (2014), Fence Methods for Backcross Experiments. Statistical Computation and Simulation, 84(3), 644-662
## Not run: r =1234; set.seed(r) n = 100; p=15; rho = 0.6 beta = c(1,1,1,0,1,1,0,1,0,0,1,0,0,0,0) # non-zero beta 1,2,3,V6,V7,V9,V12 id = rep(1:n,each=3) V.1 = rep(1,n*3) I.1 = rep(c(1,-1),each=150) I.2a = rep(c(0,1,-1),n) I.2b = rep(c(0,-1,1),n) x = matrix(rnorm(n*3*11), nrow=n*3, ncol=11) x = cbind(id,V.1,I.1,I.2a,I.2b,x) R = diag(3) for(i in 1:3){ for(j in 1:3){ R[i,j] = rho^(abs(i-j)) } } e=as.vector(t(mvrnorm(n, rep(0, 3), R))) y = as.vector(x[,-1]%*%beta) + e data = data.frame(x,y) raw = "y ~ V.1 + I.1 + I.2a +I.2b" for (i in 6:16) { raw = paste0(raw, "+V", i)}; full = as.formula(raw) bin1="y ~ V.1 + I.1 + I.2a +I.2b" for (i in 6:8) { bin1 = paste0(bin1, "+V", i)}; bin1 = as.formula(bin1) bin2="y ~ V9" for (i in 10:16){ bin2 = paste0(bin2, "+V", i)}; bin2 = as.formula(bin2) # May take longer than 30 min since there are two stages in this RF procedure obj1.RF = RF(full = full, data = data, groups = list(bin1,bin2), method="conditional") obj1.RF$sel_model obj2.RF = RF(full = full, data = data, groups = list(bin1,bin2), B=100, method="marginal") obj2.RF$sel_model ## End(Not run)
## Not run: r =1234; set.seed(r) n = 100; p=15; rho = 0.6 beta = c(1,1,1,0,1,1,0,1,0,0,1,0,0,0,0) # non-zero beta 1,2,3,V6,V7,V9,V12 id = rep(1:n,each=3) V.1 = rep(1,n*3) I.1 = rep(c(1,-1),each=150) I.2a = rep(c(0,1,-1),n) I.2b = rep(c(0,-1,1),n) x = matrix(rnorm(n*3*11), nrow=n*3, ncol=11) x = cbind(id,V.1,I.1,I.2a,I.2b,x) R = diag(3) for(i in 1:3){ for(j in 1:3){ R[i,j] = rho^(abs(i-j)) } } e=as.vector(t(mvrnorm(n, rep(0, 3), R))) y = as.vector(x[,-1]%*%beta) + e data = data.frame(x,y) raw = "y ~ V.1 + I.1 + I.2a +I.2b" for (i in 6:16) { raw = paste0(raw, "+V", i)}; full = as.formula(raw) bin1="y ~ V.1 + I.1 + I.2a +I.2b" for (i in 6:8) { bin1 = paste0(bin1, "+V", i)}; bin1 = as.formula(bin1) bin2="y ~ V9" for (i in 10:16){ bin2 = paste0(bin2, "+V", i)}; bin2 = as.formula(bin2) # May take longer than 30 min since there are two stages in this RF procedure obj1.RF = RF(full = full, data = data, groups = list(bin1,bin2), method="conditional") obj1.RF$sel_model obj2.RF = RF(full = full, data = data, groups = list(bin1,bin2), B=100, method="marginal") obj2.RF$sel_model ## End(Not run)
Summary Adaptive Fence model selection
## S3 method for class 'AF' summary(object = res, ...)
## S3 method for class 'AF' summary(object = res, ...)
object |
Object to be summarized |
... |
addition arguments. Not currently used |
Summary Nonparametric Fence model selection
## S3 method for class 'NF' summary(object = res, ...)
## S3 method for class 'NF' summary(object = res, ...)
object |
Object to be summarized |
... |
addition arguments. Not currently used |
Data used in the example for X.lmer
data(X.lmer)
data(X.lmer)
A data frame with 10 variables: