The single-atom catalysis data is stored in
data/single_atom_catalysis.RData
, and the raw data is
available at this Github
repo. Here we model the metal/oxide binding energy (the response
variable y) using p = 59 physical properties of the
transition metals and the oxide supports (the primary features X). The response variable y and the primary features X are treated as continuous
variables, and we aim to use iBART to find an
interpretable model with high predictive performance for the metal/oxide
binding energy.
A total of 13 transition metals (Cu, Ag, Au, Ni, Pd, Pt, Co, Rh, Ir, Fe, Ru, Mn, V) and 7 oxide supports (CeO2(111), MgO(100), CeO2(110), TbO2(111), ZnO(100), TiO2(011), α-Al2O3(0001)) were studied in the dataset, making a total of n = 13 × 7 = 91 metal/oxide pairs. The primary feature matrix X contains various physical properties of the transition metals and the oxide supports including Pauling Electronegativity (χP), (n − 1)th and nth Ionization Energies (IEn − 1, IEn), Electron Affinity (EA), HOMO Energy, LUMO Energy, Heat of Sublimation (ΔHsub), Oxidation Energy of oxide support (ΔHf,ox,bulk), Oxide Formation Enthalpy (ΔHf,ox), Zunger Orbital Radius (r), Atomic Number (Z), Meidema Parameters of metal atoms (η1/3, φ), Valance Electron (Nval), Oxygen Vacancy Energy of oxide support (ΔEvac), Workfunction of oxide support (WF), Surface Energy (γ), Coordination Number (CN), and Bond Valence of surface metal atom (BV). Most of these physical properties are defined for both the transition metals and the oxide supports while a few of them are only defined for either the transition metals or the oxide supports. A detailed description of the 59 primary features X can be find in pages 11–14 of the data supplementary materials published by O’Connor et al.
Before loading the iBART package, we must allocate enough memory for Java to avoid out of memory errors.
# Allocate 10GB of memory for Java. Must be called before library(iBART)
options(java.parameters = "-Xmx10g")
library(iBART)
Next, we load the real data set and examine what data are needed to run iBART.
load("../data/catalysis.rda") # load data
summary(catalysis)
#> Length Class Mode
#> X 5369 -none- numeric
#> y 91 -none- numeric
#> head 59 -none- character
#> unit 59 -none- list
The data set consists of 4 objects:
X
: a matrix
of physical properties of the
transition metals and the oxide supports described in Data Description. These are our primary features
(predictors).y
: a numeric
vector of metal/oxide binding
energy described in Data Description. This is our
response variable.head
: a character
vector storing the
column names of X
.unit
: a (optional) list
of named numeric
vectors. This stores the unit information of the primary features
X
. This can be generated using the helper function
generate_unit(unit, dimension)
. See
?iBART::generate_unit
for more detail.Now let’s apply iBART to this data set. Besides the usual regression
data (X,y)
, we need to specify the descriptor generating
strategy through opt
. Here we specify
opt = c("binary", "unary", "binary")
, meaning there will be
3 iterations and we want to alternate between binary and unary
operators, starting with binary operators 𝒪b. We can also use all
operators O in an iteration.
For example, opt = c("all", "all")
will apply all operators
O for 2 iterations.
iBART_real_data <- iBART(X = catalysis$X, y = catalysis$y,
head = catalysis$head, # colnames of X
unit = catalysis$unit, # units of X
opt = c("binary", "unary", "binary"), # binary operator first
out_sample = FALSE,
Lzero = TRUE,
K = 5, # maximum descriptors in l-zero model
standardize = FALSE,
seed = 888)
To save time, we cached the result of a complete run in
data("iBART_real_data", package = "iBART")
, which can be
replicated by using the above code. iBART()
returns many
interesting outputs. For example,
iBART_real_data$iBART_gen_size
and
iBART_real_data$iBART_sel_size
store dimension of the newly
generated / selected descriptor space for each iteration. Let’s plot
them and see how iBART use nonparametric variable
selection for dimension reduction.
library(ggplot2)
df_dim <- data.frame(dim = c(iBART_real_data$iBART_sel_size, iBART_real_data$iBART_gen_size),
iter = rep(0:3, 2),
type = rep(c("Selected", "Generated"), each = 4))
ggplot(df_dim, aes(x = iter, y = dim, colour = type, group = type)) +
theme(text = element_text(size = 15), legend.title = element_blank()) +
geom_line(size = 1) +
geom_point(size = 3, shape = 21, fill = "white") +
geom_text(data = df_dim, aes(label = dim, y = dim + 40, group = type),
position = position_dodge(0), size = 5, colour = "blue") +
labs(x = "Iteration", y = "Number of descriptors") +
scale_x_continuous(breaks = c(0, 1, 2, 3))
#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
#> ℹ Please use `linewidth` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
We can access the selected k-descriptor via
iBART_real_data$Lzero_names
and the corresponding
regression model in iBART_real_data$Lzero_models
. For
instance, the selected 3-descriptor model is
iBART_real_data$Lzero_names[[3]]
#> [1] "(s_EA*Hf)" "abs((Hfo/Oxv))"
#> [3] "abs(((m_n13/m_N_val)/Oxv))"
summary(iBART_real_data$Lzero_models[[3]])
#>
#> Call:
#> lm(formula = y_train ~ ., data = dat_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.70871 -0.42326 0.05825 0.44715 1.97315
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.01707 0.12675 -0.135 0.893
#> `(s_EA*Hf)` 0.40427 0.04441 9.104 2.75e-14 ***
#> `abs((Hfo/Oxv))` -0.58838 0.09857 -5.969 5.05e-08 ***
#> `abs(((m_n13/m_N_val)/Oxv))` -19.62963 4.25098 -4.618 1.33e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.6378 on 87 degrees of freedom
#> Multiple R-squared: 0.9534, Adjusted R-squared: 0.9518
#> F-statistic: 593.9 on 3 and 87 DF, p-value: < 2.2e-16
The OIS model differs from the non-OIS model in that the former builds on nonlinear descriptors (composition of 𝒪 on X) while the latter builds on the primary features X. The OIS model has many advantages. In particular, it reveals interpretable nonlinear relationship between y and X, and improves prediction accuracy over a simple linear regression model (or non-OIS model). We showcase the improved accuracy over non-OIS model using Figure 7 in the paper.
# Train a non-OIS model with 3 predictors
set.seed(123)
model_no_OIS <- k_var_model(X_train = catalysis$X, y_train = catalysis$y, k = 3, parallel = FALSE)
#### Figure 7 ####
library(ggpubr)
model_OIS <- iBART_real_data$Lzero_model[[3]]
# Prepare data for plotting
data_OIS <- data.frame(y = catalysis$y, y_hat = model_OIS$fitted.values)
data_no_OIS <- data.frame(y = catalysis$y, y_hat = model_no_OIS$models$fitted.values)
p1 <- ggplot(data_OIS, aes(x = y_hat, y = catalysis$y)) +
geom_point() +
geom_abline() +
xlim(c(min(data_OIS$y_hat, data_OIS$y) - 0.2, max(data_OIS$y_hat, data_OIS$y) + 0.2)) +
ylim(c(min(data_OIS$y_hat, data_OIS$y) - 0.2, max(data_OIS$y_hat, data_OIS$y) + 0.2)) +
xlab("") +
ylab("") +
annotate("text", x = -12, y = -3, parse = TRUE,
label = paste("R^{2} ==", round(summary(model_OIS)$r.squared, 4)))
p2 <- ggplot(data_no_OIS, aes(x = y_hat, y = catalysis$y)) +
geom_point() +
geom_abline() +
xlim(c(min(data_no_OIS$y_hat, data_no_OIS$y) - 0.2, max(data_no_OIS$y_hat, data_no_OIS$y) + 0.2)) +
ylim(c(min(data_no_OIS$y_hat, data_no_OIS$y) - 0.2, max(data_no_OIS$y_hat, data_no_OIS$y) + 0.2)) +
xlab("") +
ylab("") +
annotate("text", x = -12, y = -3, parse = TRUE,
label = paste("R^{2} ==", round(summary(model_no_OIS$models)$r.squared, 4)))
fig <- ggarrange(p1, p2,
labels = c("OIS", "non-OIS"),
ncol = 2, nrow = 1)
annotate_figure(fig,
bottom = text_grob("Predicted binding energy from descriptors (eV)"),
left = text_grob("DFT binding energy (eV)", rot = 90))
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8
#> [9] LC_ADDRESS=en_US.UTF-8 LC_TELEPHONE=en_US.UTF-8
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggpubr_0.6.0 ggplot2_3.5.1 iBART_1.0.0 rmarkdown_2.29
#>
#> loaded via a namespace (and not attached):
#> [1] tidyr_1.3.1 sass_0.4.9 utf8_1.2.4
#> [4] generics_0.1.3 bartMachineJARs_1.2.1 rstatix_0.7.2
#> [7] shape_1.4.6.1 lattice_0.22-6 digest_0.6.37
#> [10] magrittr_2.0.3 evaluate_1.0.1 grid_4.4.2
#> [13] iterators_1.0.14 fastmap_1.2.0 foreach_1.5.2
#> [16] jsonlite_1.8.9 glmnet_4.1-8 Matrix_1.7-1
#> [19] missForest_1.5 backports_1.5.0 Formula_1.2-5
#> [22] survival_3.7-0 gridExtra_2.3 bartMachine_1.3.4.1
#> [25] purrr_1.0.2 fansi_1.0.6 doRNG_1.8.6
#> [28] itertools_0.1-3 scales_1.3.0 codetools_0.2-20
#> [31] jquerylib_0.1.4 abind_1.4-8 cli_3.6.3
#> [34] rlang_1.1.4 cowplot_1.1.3 munsell_0.5.1
#> [37] splines_4.4.2 withr_3.0.2 cachem_1.1.0
#> [40] yaml_2.3.10 tools_4.4.2 parallel_4.4.2
#> [43] ggsignif_0.6.4 dplyr_1.1.4 colorspace_2.1-1
#> [46] rngtools_1.5.2 broom_1.0.7 buildtools_1.0.0
#> [49] vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.4
#> [52] randomForest_4.7-1.2 car_3.1-3 pkgconfig_2.0.3
#> [55] rJava_1.0-11 bslib_0.8.0 pillar_1.9.0
#> [58] gtable_0.3.6 glue_1.8.0 Rcpp_1.0.13-1
#> [61] tidyselect_1.2.1 xfun_0.49 tibble_3.2.1
#> [64] sys_3.4.3 knitr_1.49 farver_2.1.2
#> [67] htmltools_0.5.8.1 carData_3.0-5 labeling_0.4.3
#> [70] maketools_1.3.1 compiler_4.4.2