Title: | Iterative Bayesian Additive Regression Trees Descriptor Selection Method |
---|---|
Description: | A statistical method based on Bayesian Additive Regression Trees with Global Standard Error Permutation Test (BART-G.SE) for descriptor selection and symbolic regression. It finds the symbolic formula of the regression function y=f(x) as described in Ye, Senftle, and Li (2023) <arXiv:2110.10195>. |
Authors: | Shengbin Ye [aut, cre, cph] , Meng Li [aut] |
Maintainer: | Shengbin Ye <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.0.0 |
Built: | 2024-11-09 06:37:01 UTC |
Source: | CRAN |
Single-Atom Catalysis Data
catalysis
A list with 4 objects:
Primary feature matrix: physical properties of transition metals and oxide supports
Response variable: binding energy of metal/oxide pairs
Column names of X
Unit of columns of X
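A minimal sketch of loading and inspecting this data set, assuming the iBART package is installed. The names of the four list elements are not shown above, so the snippet prints them rather than assuming them.

```r
# Check whether iBART is installed before touching its bundled data.
has_ibart <- requireNamespace("iBART", quietly = TRUE)

if (has_ibart) {
  # Load the data set from the package into the current environment.
  data("catalysis", package = "iBART")
  # Print the element names and a one-level summary of the list.
  print(names(catalysis))
  str(catalysis, max.level = 1)
}
```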
A helper function to generate unit for iBART input
generate_unit(unit, dimension)
unit |
A vector of units of the primary features. For example, unit <- c("cm", "s"). Then the unit of |
dimension |
A vector of dimensions of the units. For example, unit <- c("cm", "s") and dimension <- c(2, 1) mean that the unit of |
A list that contains unit and dimension information.
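As a hedged illustration, the call below reuses the unit and dimension vectors from the argument descriptions. It assumes the iBART package is installed and that each entry of dimension gives the power of the matching unit; the structure of the returned list is printed rather than assumed.

```r
# One unit and one dimension per primary feature, as in the argument descriptions.
unit <- c("cm", "s")
dimension <- c(2, 1)

# The two vectors must describe the same number of primary features.
stopifnot(length(unit) == length(dimension))

if (requireNamespace("iBART", quietly = TRUE)) {
  # Build the unit information; it can then be passed to iBART(unit = ...).
  unit_info <- iBART::generate_unit(unit, dimension)
  str(unit_info)
}
```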
Finds a symbolic formula for the regression function y = f(X) using X and y as inputs.
iBART( X = NULL, y = NULL, head = NULL, unit = NULL, BART_var_sel_method = "global_se", num_trees = 20, num_burn_in = 10000, num_iterations_after_burn_in = 5000, num_reps_for_avg = 10, num_permute_samples = 50, type.measure = "deviance", nfolds = 10, nlambda = 100, relax = FALSE, gamma = c(0, 0.25, 0.5, 0.75, 1), opt = c("binary", "unary", "binary"), sin_cos = FALSE, apply_pos_opt_on_neg_x = TRUE, hold = 0, pre_screen = TRUE, corr_screen = TRUE, out_sample = FALSE, train_idx = NULL, train_ratio = 1, Lzero = TRUE, parallel = FALSE, K = ifelse(Lzero, 5, 0), aic = FALSE, standardize = TRUE, writeLog = FALSE, verbose = TRUE, count = NULL, seed = NULL )
X |
Input matrix of primary features |
y |
Response variable |
head |
Optional: name of primary features. |
unit |
Optional: units and their respective dimensions of primary features. This is used to perform dimensional analysis on generated descriptors to avoid generating unphysical descriptors, such as |
BART_var_sel_method |
Variable selection criterion used in BART. Three options are available: (1) "global_se", (2) "global_max", (3) "local". The default is "global_se". See |
num_trees |
BART parameter: number of trees to be grown in the sum-of-trees model. If you want different values for each iteration of BART, input a vector of length equal to the number of iterations. Default is 20. |
num_burn_in |
BART parameter: number of MCMC samples to be discarded as "burn-in". If you want different values for each iteration of BART, input a vector of length equal to the number of iterations. Default is 10000. |
num_iterations_after_burn_in |
BART parameter: number of MCMC samples to draw from the posterior distribution of |
num_reps_for_avg |
BART parameter: number of replicates to average over for the BART model's variable inclusion proportions. If you want different values for each iteration of BART, input a vector of length equal to the number of iterations. Default is 10. |
num_permute_samples |
BART parameter: number of permutations of the response to be made to generate the "null" permutation distribution. If you want different values for each iteration of BART, input a vector of length equal to the number of iterations. Default is 50. |
type.measure |
|
nfolds |
|
nlambda |
|
relax |
|
gamma |
|
opt |
A vector of operation order. For example, |
sin_cos |
Logical flag for using |
apply_pos_opt_on_neg_x |
Logical flag for applying non-negative-valued operators, such as |
hold |
Number of iterations to hold. This allows iBART to run consecutive operator transformations before screening. Note |
pre_screen |
Logical flag for pre-screening the primary features X using BART. Only selected primary features will be used to generate descriptors. Note that |
corr_screen |
Logical flag for screening out primary features that are independent of the response variable |
out_sample |
Logical flag for out-of-sample assessment. Default is FALSE. |
train_idx |
Numerical vector storing the row indices for training data. Please set |
train_ratio |
Proportion of data used to train the model. Value must lie in the interval (0, 1]. This is only needed when |
Lzero |
Logical flag for L-zero variable selection. Default is TRUE. |
parallel |
Logical flag for parallel L-zero variable selection. Default is FALSE. |
K |
If |
aic |
If |
standardize |
Logical flag for data standardization prior to model fitting in BART and LASSO. Default is TRUE. |
writeLog |
Logical flag for writing a log file to the working directory. The log file will contain information such as the descriptors selected by iBART, the RMSE of the linear model built on the selected descriptors, etc. Default is FALSE. |
verbose |
Logical flag for printing progress to console. Default is TRUE. |
count |
Internal parameter. Default is NULL. |
seed |
Optional: sets the seed in both R and Java. Default is NULL. |
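The train_idx and train_ratio arguments both describe a split of the rows of X into training and testing data. The sketch below builds such a split by hand in base R. The helper make_train_idx is hypothetical and not part of iBART, and how iBART itself draws the split when train_idx is NULL is not shown in this manual.

```r
# Hypothetical helper (not part of iBART): sample training row indices
# so that roughly train_ratio of the n rows are used for training.
make_train_idx <- function(n, train_ratio = 1, seed = NULL) {
  stopifnot(n >= 1, train_ratio > 0, train_ratio <= 1)
  if (!is.null(seed)) set.seed(seed)
  # Keep at least one training row even for very small ratios.
  n_train <- max(1L, as.integer(round(n * train_ratio)))
  sort(sample.int(n, n_train))
}

# 80/20 split of 100 rows; the remaining rows form the test set.
train_idx <- make_train_idx(n = 100, train_ratio = 0.8, seed = 1)
test_idx <- setdiff(seq_len(100), train_idx)
```

A vector built this way could be supplied as train_idx together with out_sample = TRUE, which keeps the same split reproducible across runs.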
A list of iBART output.
iBART_model |
The LASSO output of the last iteration of iBART. The predictors with non-zero coefficients are called the iBART selected descriptors. |
X_selected |
The numerical values of the iBART selected descriptors. |
descriptor_names |
The names of the iBART selected descriptors. |
coefficients |
Coefficients of the iBART model. The first element is an intercept. |
X_train |
The training matrix used in the last iteration. |
X_test |
The testing matrix used in the last iteration. |
iBART_gen_size |
The number of descriptors generated by iBART in each iteration. |
iBART_sel_size |
The number of descriptors selected by iBART in each iteration. |
iBART_in_sample_RMSE |
In sample RMSE of the LASSO model. |
iBART_out_sample_RMSE |
Out of sample RMSE of the LASSO model if |
Lzero_models |
The |
Lzero_names |
The name of the best |
Lzero_in_sample_RMSE |
In sample RMSE of the |
Lzero_out_sample_RMSE |
Out of sample RMSE of the |
Lzero_AIC_model |
The best |
Lzero_AIC_names |
The best |
Lzero_AIC_in_sample_RMSE |
In sample RMSE of the best |
Lzero_AIC_out_sample_RMSE |
Out of sample RMSE of the best |
runtime |
Runtime in seconds. |
Shengbin Ye
Ye, S., Senftle, T.P., and Li, M. (2023) Operator-induced structural variable selection for identifying materials genes, https://arxiv.org/abs/2110.10195.
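A hedged usage sketch on simulated data follows. The toy data, the wrapper run_iBART_demo, and the shortened MCMC settings are illustrative assumptions, not recommendations from the authors; all argument names come from the usage signature above. The fit is only attempted in an interactive session with iBART installed, since it relies on a Java backend and can take a while.

```r
# Simulate a small data set with strictly positive primary features.
set.seed(123)
n <- 50
p <- 4
X <- matrix(runif(n * p, min = 1, max = 2), nrow = n, ncol = p)
colnames(X) <- paste0("x", seq_len(p))

# Toy response built from a product and a ratio of primary features.
y <- 2 * X[, 1] * X[, 2] + X[, 3] / X[, 4] + rnorm(n, sd = 0.1)

# Hypothetical wrapper: one iBART call with fewer MCMC draws than the defaults.
run_iBART_demo <- function(X, y) {
  iBART::iBART(
    X = X, y = y, head = colnames(X),
    opt = c("binary", "unary", "binary"),
    num_burn_in = 1000,
    num_iterations_after_burn_in = 500,
    Lzero = TRUE, K = 3,
    seed = 123
  )
}

# Only fit when a user is present and the package is available.
if (interactive() && requireNamespace("iBART", quietly = TRUE)) {
  fit <- run_iBART_demo(X, y)
  print(fit$descriptor_names)  # names of the selected descriptors
  print(fit$coefficients)      # intercept followed by descriptor coefficients
}
```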
iBART result in the real data vignette
iBART_real_data
A list of iBART outputs
A cv.glmnet object storing the iBART selected model
...
iBART result in the simulation vignette
iBART_sim
A list of iBART outputs
A cv.glmnet object storing the iBART selected model
...
Best subset selection for linear regression
k_var_model( X_train, y_train, X_test = NULL, y_test = NULL, k = 1, parallel = FALSE )
X_train |
The design matrix used during training. |
y_train |
The response variable used during training. |
X_test |
The design matrix used during testing. Default is NULL. |
y_test |
The response variable used during testing. Default is NULL. |
k |
The maximum number of predictors allowed in the model. For example, |
parallel |
Logical flag for parallelization. Default is FALSE. |
A list of outputs.
models |
An |
names |
The variable names of the best k predictors. |
rmse_in |
In-sample RMSE of the model. |
rmse_out |
Out-of-sample RMSE of the model. |
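To illustrate the idea of best subset selection for linear regression, the base R sketch below searches every subset of each size up to k exhaustively and keeps the one with the lowest in-sample RMSE per size. The helper best_subset_sketch is hypothetical and is not the implementation of k_var_model; it ignores test data and parallelization.

```r
# Hypothetical helper (not the iBART implementation): exhaustive best subset
# search that returns the best linear model for each size 1, ..., k.
best_subset_sketch <- function(X, y, k = 1) {
  stopifnot(is.matrix(X), !is.null(colnames(X)), k >= 1, k <= ncol(X))
  lapply(seq_len(k), function(size) {
    best <- list(names = NULL, rmse_in = Inf, model = NULL)
    # Enumerate every subset of columns with exactly `size` predictors.
    for (idx in combn(ncol(X), size, simplify = FALSE)) {
      dat <- data.frame(y = y, X[, idx, drop = FALSE])
      fit <- lm(y ~ ., data = dat)
      rmse <- sqrt(mean(residuals(fit)^2))
      if (rmse < best$rmse_in) {
        best <- list(names = colnames(X)[idx], rmse_in = rmse, model = fit)
      }
    }
    best
  })
}

# Toy data: the response depends on columns "b" and "d" only.
set.seed(1)
X <- matrix(rnorm(200 * 4), ncol = 4,
            dimnames = list(NULL, c("a", "b", "c", "d")))
y <- 3 * X[, "b"] - 2 * X[, "d"] + rnorm(200, sd = 0.1)

res <- best_subset_sketch(X, y, k = 2)
print(res[[2]]$names)    # predictors of the best two-variable model
print(res[[2]]$rmse_in)  # its in-sample RMSE
```

The number of candidate subsets grows combinatorially with the number of columns, so an exhaustive search like this is only practical after the descriptor pool has been reduced to a handful of candidates.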