--- title: "Single-Atom Catalysis Data Analysis" output: rmarkdown::html_vignette urlcolor: blue vignette: > %\VignetteIndexEntry{Single-Atom Catalysis Data Analysis} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Data Description{#data} The single-atom catalysis data is stored in `data/single_atom_catalysis.RData`, and the raw data is available at [this Github repo](https://github.com/tsenftle/Metal-Oxide-LASSO-lo). Here we model the metal/oxide binding energy (the response variable $y$) using $p=59$ physical properties of the transition metals and the oxide supports (the primary features $X$). The response variable $y$ and the primary features $X$ are treated as continuous variables, and we aim to use **iBART** to find an interpretable model with high predictive performance for the metal/oxide binding energy. A total of 13 transition metals ($\text{Cu, Ag, Au, Ni, Pd, Pt,}$ $\text{Co, Rh, Ir, Fe, Ru, Mn, V}$) and 7 oxide supports ($\text{CeO}_2(111)$, $\text{MgO}(100)$, $\text{CeO}_2(110)$, $\text{TbO}_2(111)$, $\text{ZnO}(100)$, $\text{TiO}_2(011)$, $\alpha\text{-Al}_2\text{O}_3(0001)$) were studied in the dataset, making a total of $n = 13 \times 7 = 91$ metal/oxide pairs. The primary feature matrix $X$ contains various physical properties of the transition metals and the oxide supports including Pauling Electronegativity ($\chi_P$), $(n-1)^{\text{th}}$ and $n^{\text{th}}$ Ionization Energies ($\text{IE}_{n-1}$, $\text{IE}_n$), Electron Affinity ($\text{EA}$), $\text{HOMO}$ Energy, $\text{LUMO}$ Energy, Heat of Sublimation ($\Delta H_{\text{sub}}$), Oxidation Energy of oxide support ($\Delta H_{\text{f,ox,bulk}}$), Oxide Formation Enthalpy ($\Delta H_{\text{f,ox}}$), Zunger Orbital Radius ($r$), Atomic Number ($Z$), Meidema Parameters of metal atoms $(\eta^{1/3}, \varphi)$, Valance Electron ($\text{N}_{\text{val}}$), Oxygen Vacancy Energy of oxide support ($\Delta\text{E}_{\text{vac}}$), Workfunction of oxide support $(\text{WF})$, Surface Energy ($\gamma$), Coordination Number ($\text{CN}$), and Bond Valence of surface metal atom ($\text{BV}$). Most of these physical properties are defined for both the transition metals and the oxide supports while a few of them are only defined for either the transition metals or the oxide supports. A detailed description of the 59 primary features $X$ can be find in pages 11--14 of [the data supplementary materials](https://static-content.springer.com/esm/art%3A10.1038%2Fs41929-018-0094-5/MediaObjects/41929_2018_94_MOESM1_ESM.pdf) published by O'Connor et al. ## Package and Data Loading Before loading the **iBART** package, we must allocate enough memory for Java to avoid out of memory errors. ```{r load package} # Allocate 10GB of memory for Java. Must be called before library(iBART) options(java.parameters = "-Xmx10g") library(iBART) ``` Next, we load the real data set and examine what data are needed to run iBART. ```{r load data} load("../data/catalysis.rda") # load data summary(catalysis) ``` The data set consists of 4 objects: + `X`: a `matrix` of physical properties of the transition metals and the oxide supports described in [Data Description](#data). These are our primary features (predictors). + `y`: a `numeric` vector of metal/oxide binding energy described in [Data Description](#data). This is our response variable. + `head`: a `character` vector storing the column names of `X`. + `unit`: a (optional) `list` of named numeric vectors. This stores the unit information of the primary features `X`. This can be generated using the helper function `generate_unit(unit, dimension)`. See `?iBART::generate_unit` for more detail. ## iBART Now let's apply iBART to this data set. Besides the usual regression data `(X,y)`, we need to specify the descriptor generating strategy through `opt`. Here we specify `opt = c("binary", "unary", "binary")`, meaning there will be 3 iterations and we want to alternate between binary and unary operators, starting with binary operators $\mathcal{O}_b$. We can also use all operators $O$ in an iteration. For example, `opt = c("all", "all")` will apply all operators $O$ for 2 iterations. ```{r iBART, eval=FALSE} iBART_real_data <- iBART(X = catalysis$X, y = catalysis$y, head = catalysis$head, # colnames of X unit = catalysis$unit, # units of X opt = c("binary", "unary", "binary"), # binary operator first out_sample = FALSE, Lzero = TRUE, K = 5, # maximum descriptors in l-zero model standardize = FALSE, seed = 888) ``` ```{r load result, echo=FALSE} load("../data/iBART_real_data.rda") # load full result ``` To save time, we cached the result of a complete run in `data("iBART_real_data", package = "iBART")`, which can be replicated by using the above code. `iBART()` returns many interesting outputs. For example, `iBART_real_data$iBART_gen_size` and `iBART_real_data$iBART_sel_size` store dimension of the newly generated / selected descriptor space for each iteration. Let's plot them and see how **iBART** use nonparametric variable selection for dimension reduction. ```{r size_plot, fig.width=7, fig.height=3.5} library(ggplot2) df_dim <- data.frame(dim = c(iBART_real_data$iBART_sel_size, iBART_real_data$iBART_gen_size), iter = rep(0:3, 2), type = rep(c("Selected", "Generated"), each = 4)) ggplot(df_dim, aes(x = iter, y = dim, colour = type, group = type)) + theme(text = element_text(size = 15), legend.title = element_blank()) + geom_line(size = 1) + geom_point(size = 3, shape = 21, fill = "white") + geom_text(data = df_dim, aes(label = dim, y = dim + 40, group = type), position = position_dodge(0), size = 5, colour = "blue") + labs(x = "Iteration", y = "Number of descriptors") + scale_x_continuous(breaks = c(0, 1, 2, 3)) ``` We can access the selected $k$-descriptor via `iBART_real_data$Lzero_names` and the corresponding regression model in `iBART_real_data$Lzero_models`. For instance, the selected 3-descriptor model is ```{r k-descriptor} iBART_real_data$Lzero_names[[3]] summary(iBART_real_data$Lzero_models[[3]]) ``` ## OIS vs non-OIS The OIS model differs from the non-OIS model in that the former builds on nonlinear descriptors (composition of $\mathcal{O}$ on $X$) while the latter builds on the primary features $X$. The OIS model has many advantages. In particular, it reveals interpretable nonlinear relationship between $y$ and $X$, and improves prediction accuracy over a simple linear regression model (or non-OIS model). We showcase the improved accuracy over non-OIS model using Figure 7 in the paper. ```{r OIS vs non-OIS, fig.width=7, fig.height=3.5} # Train a non-OIS model with 3 predictors set.seed(123) model_no_OIS <- k_var_model(X_train = catalysis$X, y_train = catalysis$y, k = 3, parallel = FALSE) #### Figure 7 #### library(ggpubr) model_OIS <- iBART_real_data$Lzero_model[[3]] # Prepare data for plotting data_OIS <- data.frame(y = catalysis$y, y_hat = model_OIS$fitted.values) data_no_OIS <- data.frame(y = catalysis$y, y_hat = model_no_OIS$models$fitted.values) p1 <- ggplot(data_OIS, aes(x = y_hat, y = catalysis$y)) + geom_point() + geom_abline() + xlim(c(min(data_OIS$y_hat, data_OIS$y) - 0.2, max(data_OIS$y_hat, data_OIS$y) + 0.2)) + ylim(c(min(data_OIS$y_hat, data_OIS$y) - 0.2, max(data_OIS$y_hat, data_OIS$y) + 0.2)) + xlab("") + ylab("") + annotate("text", x = -12, y = -3, parse = TRUE, label = paste("R^{2} ==", round(summary(model_OIS)$r.squared, 4))) p2 <- ggplot(data_no_OIS, aes(x = y_hat, y = catalysis$y)) + geom_point() + geom_abline() + xlim(c(min(data_no_OIS$y_hat, data_no_OIS$y) - 0.2, max(data_no_OIS$y_hat, data_no_OIS$y) + 0.2)) + ylim(c(min(data_no_OIS$y_hat, data_no_OIS$y) - 0.2, max(data_no_OIS$y_hat, data_no_OIS$y) + 0.2)) + xlab("") + ylab("") + annotate("text", x = -12, y = -3, parse = TRUE, label = paste("R^{2} ==", round(summary(model_no_OIS$models)$r.squared, 4))) fig <- ggarrange(p1, p2, labels = c("OIS", "non-OIS"), ncol = 2, nrow = 1) annotate_figure(fig, bottom = text_grob("Predicted binding energy from descriptors (eV)"), left = text_grob("DFT binding energy (eV)", rot = 90)) ``` ## R Session Info ```{r sessioninfo} sessionInfo() ```