--- title: "glober package" author: "Mary E. Savino" date: " " output: pdf_document vignette: > %\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{glober package} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(message = FALSE,warning = FALSE) library(glober) library(genlasso) library(fda) library(ggplot2) library(plot3D) ``` # Introduction The package \textsf{glober} provides two tools to estimate the function $f$ in the following nonparametric regression model: \begin{equation} \label{eq:model} Y_i = f(x_i) + \varepsilon_i, \quad 1 \leq i \leq n, \end{equation} where the $\varepsilon_i$ are i.i.d centered random variables of variance $\sigma^2$, the $x_i$ are observation points which belong to a compact set $S$ of $\mathbb{R}^d$, $d=1$ or 2 and $n$ is the total number of observations. This estimation is performed using the GLOBER approach described in [1]. This method consists in estimating $f$ by approximating it with a linear combination of B-splines, where their knots are selected adaptively using the Generalized Lasso proposed by [2], since they can be seen as changes in the derivatives of the function to estimate. We refer the reader to [1] for further details. # Estimation of $f$ in the one-dimensional case ($d=1$) In the following, we apply our method to a function of one input variable $f_1$. This function is defined as a linear combination of quadratic B-splines with the set of knots $\mathbf{t} = (0.1, 0.27,0.745)$ and $\sigma = 0.1$ in \eqref{eq:model}. ## Description of the dataset We load the dataset of observations with $n=70$ provided within the package $(x_1, \ldots, x_{70})$: ```{r observations1D} ## --- Loading the values of the input variable --- ## data('x_1D') ``` and $(Y_1, \ldots, Y_{70})$: ```{r y1D} ## --- Loading the corresponding noisy values of the response variable --- ## data('y_1D') ``` We load the dataset containing the values of the input variable $\{x_1, \ldots, x_N\}$ for which an estimation of $f_1$ is sought. They correspond to the observation points as well as additional points where $f_1$ has not been observed. Here, $N = 201$. In order to have a better idea of the underlying function $f_1$, we load the corresponding evaluations of $f_1$ at these input values. ```{r xpred_1D} ## --- Loading the values of the input variable for which an estimation ## of f_1 is required --- ## data('xpred_1D') ## --- Loading the corresponding evaluations to plot the function --- ## data('f_1D') ``` We can visualize it for 201 input values by using the $\texttt{ggplot2}$ package: ```{r} ## -- Building dataframes to plot -- ## data_1D = data.frame(x = xpred_1D, f = f_1D) obs_1D = data.frame(x = x_1D, y = y_1D) real.knots = c(0.1, 0.27,0.745) ``` ```{r, echo=FALSE, out.width="50%", fig.align = 'center'} ggplot(data_1D, aes(x, f)) + geom_line(color = 'red') + geom_point(aes(x, y), data =obs_1D, shape= 4, color = "blue", size = 4)+ geom_vline(xintercept = real.knots, linetype = 'dashed', color = 'grey27') + xlab('x') + ylab('y') + theme_bw()+ theme(axis.title.x = element_text(size=20), axis.title.y = element_text(size =20), axis.text.x = element_text(size=19), axis.text.y = element_text(size=19)) ``` The vertical dashed lines represent the real knots $\mathbf{t}$ implied in the definition of $f_1$, the red curve describes the true underlying function $f_1$ to estimate and the blue crosses are the observation points. ## Application of $\texttt{glober.1d}$ to estimate $f_1$ The $\texttt{glober.1d}$ function of the $\texttt{glober}$ package is applied by using the following arguments: the input values $(x_i)_{1 \leq i \leq n}$ ($\texttt{x}$), the corresponding $(Y_i)_{1 \leq i \leq n}$ ($\texttt{y}$), $N$ input values $\{x_1, \ldots, x_N\}$ for which $f_1$ has to be estimated ($\texttt{xpred}$) and the order of the B-spline basis used to estimate $f_1$ ($\texttt{ord}$). ```{r} res = glober.1d(x = x_1D, y = y_1D, xpred = xpred_1D, ord = 3, parallel = FALSE) ``` Additional arguments can also be used in this function: * $\texttt{parallel}$: Logical, if set to TRUE then a parallelized version of the code is used. The default value is FALSE. * $\texttt{nb.Cores}$: Numerical, it represents the number of cores used for parallelization, if parallel is set to TRUE. The resulting outputs are the following: * $\texttt{festimated}$: the estimated values of $f_1$. * $\texttt{knotSelec}$: the selected knots used in the definition of the B-splines of the GLOBER estimator. * $\texttt{rss}$: Residual sum-of-squares (RSS) of the model defined as: $\sum_{k=1}^n (Y_i - \widehat{f_1}(x_i))^2$, where $\widehat{f_1}$ is the estimator of $f_1$. * $\texttt{rsq}$: R-squared of the model, calculated as $1 - RSS/TSS$ where TSS is the total sum-of-squares of the model defined as $\sum_{k=1}^n (Y_i - \bar{Y})^2$ with $\bar{Y} = (\sum_{i=1}^n Y_i)/n$. Thus, we can print the estimated values corresponding to the input values $\{x_1, \ldots, x_N\}$: ```{r} fhat = res$festimated head(fhat) ``` The value of the Residual Sum-of-square: ```{r} res$rss ``` The value of the R-squared: ```{r} res$rsq ``` We can get the set of the estimated knots $\mathbf{\widehat{t}}$: ```{r} knots.set = res$Selected.knots print(knots.set) ``` Finally, we can display the estimation of $f_1$ by using the $\texttt{ggplot2}$ package: ```{r, out.width="50%", fig.align = 'center'} ## Dataframe of selected knots ## idknots = which(xpred_1D %in% knots.set) yknots = f_1D[idknots] data_knots = data.frame(x.knots = knots.set, y.knots = yknots) ## Dataframe of the estimation ## data_res = data.frame(xpred = xpred_1D, fhat = fhat) plot_1D = ggplot(data_1D, aes(xpred_1D, f_1D)) + geom_line(color = 'red') + geom_line(data = data_1D, aes(x = xpred_1D, y = fhat), color = "black") + geom_vline(xintercept = real.knots, linetype = 'dashed', color = 'grey27') + geom_point(aes(x, y), data = obs_1D, shape = 4, color = "blue", size = 4)+ geom_point(aes(x.knots, y.knots), data = data_knots, shape = 19, color = "blue", size = 4)+ xlab('x') + ylab('y') + theme_bw()+ theme(axis.title.x = element_text(size = 20), axis.title.y = element_text(size = 20), axis.text.x = element_text(size = 19), axis.text.y = element_text(size = 19)) plot_1D ``` The vertical dashed lines represent the real knots $\mathbf{t}$ implied in the definition of $f_1$, the red curve describes the true underlying function $f_1$ to estimate, the black curve corresponds to the estimation with GLOBER, the blue crosses are the observation points and the blue bullets are the observation points chosen as estimated knots $\widehat{t}$. # Estimation of $f$ in the two-dimensional case ($d=2$) In the following, we apply our method to a function of two input variables $f_2$. This function is defined as a linear combination of tensor products of quadratic univariate B-splines with the sets of knots $\mathbf{t}_1 = (0.24, 0.545)$ and $\mathbf{t}_2 = (0.395, 0.645)$ and $\sigma = 0.01$ in \eqref{eq:model}. ## Description of the dataset We load the dataset of observations with $n=100$, provided within the package $(x_1, \ldots, x_{100})$ ```{r} ## --- Loading the values of the input variables --- ## data('x_2D') head(x_2D) ``` and $(Y_1, \ldots, Y_{100})$: ```{r} ## --- Loading the corresponding noisy values of the response variable --- ## data('y_2D') ``` We load the dataset containing the values of the input variables $\{x_1, \ldots, x_N\}$ for which an estimation of $f_2$ is sought. They correspond to the observation points as well as additional points where $f_2$ has not been observed. Here, $N = 10000$. In order to have a better idea of the underlying function $f_2$, we load the corresponding evaluations of $f_2$ at these input values. ```{r xpred_2D} ## --- Loading the values of the input variables for which an estimation ## of f_2 is required --- ## data('xpred_2D') head(xpred_2D) ## --- Loading the corresponding evaluations to plot the function --- ## data('f_2D') ``` We can visualize it for 10000 input values by using the $\texttt{plot3D}$ package: ```{r, message=FALSE, fig.align = 'center',echo=FALSE, out.width="60%", out.height="60%" } scatter3D(xpred_2D[,1], xpred_2D[,2], f_2D, bty = "g", pch = 18, col = gg.col(100), theta = 180, phi = 10) ``` ## Application of $\texttt{glober.2d}$ to estimate $f_2$ The $\texttt{glober.2d}$ function of the $\texttt{glober}$ package is applied by using the following arguments: the input values $(x_i)_{1 \leq i \leq n}$ ($\texttt{x}$), the corresponding $(Y_i)_{1 \leq i \leq n}$ ($\texttt{y}$), $N$ input values $\{x_1, \ldots, x_N\}$ for which $f_2$ has to be estimated ($\texttt{xpred}$) and the order of the B-spline basis used to estimate $f_2$ ($\texttt{ord}$). ```{r} res = glober.2d(x = x_2D, y = y_2D, xpred = xpred_2D, ord = 3, parallel = FALSE) ``` Additional arguments can also be used in this function: * $\texttt{parallel}$: Logical, if TRUE then a parallelized version of the code is used. Default is FALSE. * $\texttt{nb.Cores}$: Numerical, it corresponds to the number of cores used for parallelization, if parallel is set to TRUE. Outputs: * $\texttt{festimated}$: the estimated values of $f_2$. * $\texttt{knotSelec}$: the selected knots used in the definition of the B-splines of the GLOBER estimator. * $\texttt{rss}$: Residual sum-of-squares (RSS) of the model defined as: $\sum_{k=1}^n (Y_i - \widehat{f_2}(x_i))^2$, where $\widehat{f_2}$ is the estimator of $f_2$. * $\texttt{rsq}$: R-squared of the model, calculated as $1 - RSS/TSS$ where TSS is the total sum-of-squares of the model defined as $\sum_{k=1}^n (Y_i - \bar{Y})^2$. Thus, we can print the estimated values corresponding to the input values $\{x_1, \ldots, x_N\}$: ```{r} fhat_2D = res$festimated head(fhat_2D) ``` The value of the Residual Sum-of-square: ```{r} res$rss ``` The value of the R-squared: ```{r} res$rsq ``` We can get the set of estimated knots for each dimension $\mathbf{\widehat{t}_1}$ and $\mathbf{\widehat{t}_2}$: ```{r} knots.set = res$Selected.knots print('For the first dimension:') print(knots.set[[1]]) print('For the second dimension:') print(knots.set[[2]]) ``` As for $f_1$, we can visualize the corresponding estimation of $f_2$: ```{r, fig.align = 'center', out.width="40%", out.height="40%"} scatter3D(xpred_2D[,1], xpred_2D[,2], f_2D, bty = "g", pch = 18, col = 'red', theta = 180, phi = 10) ``` ```{r, fig.align = 'center', out.width="40%", out.height="40%"} scatter3D(xpred_2D[,1], xpred_2D[,2], fhat_2D, bty = "g", pch = 18, col = 'forestgreen', theta = 180, phi = 10) ``` The red surface describes the true underlying function $f_2$ to estimate and the green surface corresponds to the estimation with GLOBER. **References** [1] Savino, M. E. and Lévy-Leduc, C. A novel approach for estimating functions in the multivariate setting based on an adaptive knot selection for B-splines with an application to a chemical system used in geoscience (2023), arXiv:2306.00686. [2] Tibshirani, R. J. and J. Taylor (2011). The solution path of the generalized lasso. The Annals of Statistics 39(3), 1335 – 1371.