The package RandomForestsGLS fits non-linear regression models on dependent data with Generalised Least Squares (GLS) based Random Forest (RF-GLS), detailed in Saha, Basu and Datta (2020), https://arxiv.org/abs/2007.15421. We will start by loading the RandomForestsGLS R package.
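Assuming the package has already been installed (e.g., from CRAN), loading it is a single call:

```r
# Load the RandomForestsGLS package (assumed to be installed)
library(RandomForestsGLS)
```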
Next, we discuss how the RandomForestsGLS
package can be
used for estimation and prediction in a non-linear regression setup
under correlated errors in different scenarios.
We consider spatial point-referenced data with the following model:

yi = m(xi) + w(si) + ϵi;

where yi and xi respectively denote the observed response and the covariate corresponding to the ith observed location si; m(xi) denotes the covariate effect; the spatial random effect w(si) accounts for spatial dependence beyond the covariates; and ϵi accounts for the independent and identically distributed random Gaussian noise.
In the spatial mixed-model setting, the package RandomForestsGLS allows for fitting m(.) using RF-GLS. Spatial random effects are modeled using a Gaussian Process, as is standard practice. For model fitting, we use the computationally convenient Nearest Neighbor Gaussian Process (NNGP) (Datta, Banerjee, Finley, and Gelfand (2016)). Along with prediction of the covariate effect (mean function) m(.), we also offer kriging-based prediction of the spatial response at new locations.
We simulate data from the following model:

yi = 10sin(πxi) + w(si) + ϵi; ϵ ∼ N(0, τ²I), τ² = 0.1; w ∼ exponential GP, σ² = 10, ϕ = 1.

Here, the mean function is E(Y) = 10sin(πX); w accounts for the spatial correlation and is generated as a Gaussian process with exponential covariance, spatial variance σ² = 10 and spatial correlation decay ϕ = 1; and ϵ is the i.i.d. random noise with variance τ² = 0.1, also called the nugget in the spatial literature.
For illustration purposes, we simulate with n = 200:
#Simulate n draws from N(mu, V)
rmvn <- function(n, mu = 0, V = matrix(1)){
  p <- length(mu)
  if(any(is.na(match(dim(V), p))))
    stop("Dimension not right!")
  D <- chol(V)
  t(matrix(rnorm(n*p), ncol = p) %*% D + rep(mu, rep(n, p)))
}
set.seed(5)
n <- 200
coords <- cbind(runif(n,0,1), runif(n,0,1))
set.seed(2)
x <- matrix(runif(n), n, 1)
sigma.sq = 10
phi = 1
tau.sq = 0.1
D <- as.matrix(dist(coords))
R <- exp(-phi*D)
w <- rmvn(1, rep(0,n), sigma.sq*R)
y <- rnorm(n, 10*sin(pi * x) + w, sqrt(tau.sq))
In the package RandomForestsGLS, the working precision matrix used in the GLS loss is an NNGP approximation of the precision matrix corresponding to a Matérn covariance function.
In order to fit the model, the code requires:

- coords: an n × 2 matrix of 2-dimensional locations.
- y: an n-length vector of the response at the observed coordinates.
- X: an n × p matrix of the covariates at the observation coordinates.
- Xtest: an ntest × p matrix of the covariates where we want to estimate the function. It must have the same variables as X. Default is X.
- nthsize: the minimum node size. We recommend not setting this value too small, as that leads to very deep trees that take a long time to build and can produce unstable estimates. Default value is 20.

For details on the choice of other parameters, please refer to the help file of RFGLS_estimate_spatial, which can be accessed with ?RFGLS_estimate_spatial.
If the covariance parameters are known, we set param_estimate = FALSE (default value); the code additionally requires the following:

- cov.model: supported keywords are “exponential”, “matern”, “spherical”, and “gaussian” for the exponential, Matérn, spherical and Gaussian covariance functions, respectively. Default value is “exponential”.
- sigma.sq: the spatial variance. Default value is 1.
- tau.sq: the nugget. Default value is 0.01.
- phi: the spatial correlation decay parameter. Default value is 5.
- nu: the smoothing parameter corresponding to the Matérn covariance function. Default value is 0.5.

We can fit the model as follows:
set.seed(1)
est_known <- RFGLS_estimate_spatial(coords, y, x, ntree = 50, cov.model = "exponential",
nthsize = 20, sigma.sq = sigma.sq, tau.sq = tau.sq,
phi = phi)
The estimate of the function at the covariates Xtest is given in est_known$predicted. For the interpretation of the rest of the outputs, please see the help file of RFGLS_estimate_spatial. Use of covariance models other than the exponential is in the beta-testing stage.
If the covariance parameters are not known, we set param_estimate = TRUE; the code additionally requires the covariance model (cov.model) to be used for parameter estimation prior to the RF-GLS fitting. We fit the model with unknown covariance parameters as follows.
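The fitting chunk for this case is not shown above; a sketch consistent with the object name est_unknown used below (the seed and ntree mirror the known-parameter call and are assumptions):

```r
# Fit RF-GLS, estimating the exponential covariance parameters internally
set.seed(1)
est_unknown <- RFGLS_estimate_spatial(coords, y, x, ntree = 50,
                                      cov.model = "exponential",
                                      nthsize = 20, param_estimate = TRUE)
```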
Given a fitted model using RFGLS_estimate_spatial
, we
can estimate the mean function at new covariate values as follows:
Xtest <- matrix(seq(0,1, by = 1/10000), 10001, 1)
RFGLS_predict_known <- RFGLS_predict(est_known, Xtest)
We obtain the Mean Integrated Squared Error (MISE) of the RF-GLS estimate m̂ on [0,1] and compare it with that of the classical Random Forest (RF) obtained using the package randomForest (with a comparable minimum node size, nodesize = 20, as the default nodesize performs worse). We see that our method has a significantly smaller MISE. Additionally, we show that the MISE obtained with unknown parameters in RF-GLS is comparable to the MISE obtained with known covariance parameters.
library(randomForest)
set.seed(1)
RF_est <- randomForest(x, y, nodesize = 20)
RF_predict <- predict(RF_est, Xtest)
#RF MISE
mean((RF_predict - 10*sin(pi * Xtest))^2)
#> [1] 8.36778
#RF-GLS MISE
mean((RFGLS_predict_known$predicted - 10*sin(pi * Xtest))^2)
#> [1] 0.150152
RFGLS_predict_unknown <- RFGLS_predict(est_unknown, Xtest)
#RF-GLS unknown MISE
mean((RFGLS_predict_unknown$predicted - 10*sin(pi * Xtest))^2)
#> [1] 0.1851928
We plot the true m(x) = 10sin(πx) along with loess-smoothed versions of the estimates m̂(.) obtained from RF-GLS and RF, showing that the RF-GLS estimate approximates m(x) better than that of RF.
rfgls_loess_10 <- loess(RFGLS_predict_known$predicted ~ c(1:length(Xtest)), span=0.1)
rfgls_smoothed10 <- predict(rfgls_loess_10)
rf_loess_10 <- loess(RF_predict ~ c(1:length(RF_predict)), span=0.1)
rf_smoothed10 <- predict(rf_loess_10)
xval <- c(10*sin(pi * Xtest), rf_smoothed10, rfgls_smoothed10)
xval_tag <- c(rep("Truth", length(10*sin(pi * Xtest))), rep("RF", length(rf_smoothed10)),
rep("RF-GLS",length(rfgls_smoothed10)))
plot_data <- as.data.frame(xval)
plot_data$Methods <- xval_tag
coval <- c(rep(seq(0,1, by = 1/10000), 3))
plot_data$Covariate <- coval
library(ggplot2)
ggplot(plot_data, aes(x=Covariate, y=xval, color=Methods)) +
geom_point() + labs( x = "x") + labs( y = "f(x)")
Given a fitted model using RFGLS_estimate_spatial, we can predict the spatial response/outcome at new locations, provided the covariates at those locations. This approach performs kriging at a new location using the mean-function estimate at the corresponding covariate values. Here we partition the simulated data into training and test sets in a 4:1 ratio. Next, we perform prediction on the test set using a model fitted on the training set.
est_known_short <- RFGLS_estimate_spatial(coords[1:160,], y[1:160],
matrix(x[1:160,],160,1), ntree = 50, cov.model = "exponential",
nthsize = 20, param_estimate = TRUE)
pred_spatial <- RFGLS_predict_spatial(est_known_short, coords[161:200,],
                                      matrix(x[161:200,],40,1))
pred_mat <- as.data.frame(cbind(pred_spatial$prediction, y[161:200]))
colnames(pred_mat) <- c("Predicted", "Observed")
ggplot(pred_mat, aes(x=Observed, y=Predicted)) + geom_point() +
geom_abline(intercept = 0, slope = 1, color = "blue") +
ylim(0, 16) + xlim(0, 16)
The following example considers a setting where the parameters are estimated from a misspecified covariance model. We simulate the spatial correlation from a Matérn covariance function with smoothing parameter ν = 1.5. While fitting RF-GLS, we estimate the covariance parameters using an exponential covariance model (ν = 0.5) and show that the obtained MISE still compares favorably with that of classical RF.
#Data simulation from matern with nu = 1.5
nu = 3/2
R1 <- (D*phi)^nu/(2^(nu-1)*gamma(nu))*besselK(x=D*phi, nu=nu)
diag(R1) <- 1
set.seed(2)
w <- rmvn(1, rep(0,n), sigma.sq*R1)
y <- rnorm(n, 10*sin(pi * x) + w, sqrt(tau.sq))
#RF-GLS with exponential covariance
set.seed(3)
est_misspec <- RFGLS_estimate_spatial(coords, y, x, ntree = 50, cov.model = "exponential",
nthsize = 20, param_estimate = TRUE)
RFGLS_predict_misspec <- RFGLS_predict(est_misspec, Xtest)
#RF
set.seed(4)
RF_est <- randomForest(x, y, nodesize = 20)
RF_predict <- predict(RF_est, Xtest)
#RF-GLS MISE
mean((RFGLS_predict_misspec$predicted - 10*sin(pi * Xtest))^2)
#> [1] 0.1380569
#RF MISE
mean((RF_predict - 10*sin(pi * Xtest))^2)
#> [1] 2.295639
RF-GLS can also be used for function estimation in a time series setting under autoregressive errors. We consider time series data with errors from an AR(q) process as follows:

$$ y_t = m(\mathbf{x}_t) + e_t;\: e_t = \sum_{i = 1}^q\rho_i e_{t-i} + \eta_t $$

where yt and xt denote the response and the covariate corresponding to the tth time point, et is an AR(q) process, ηt denotes the i.i.d. white noise, and (ρ1, ⋯, ρq) are the model parameters that capture the dependence of et on (et−1, ⋯, et−q).
In the AR time series scenario, the package RandomForestsGLS allows for fitting m(.) using RF-GLS. RF-GLS exploits the sparsity of the closed-form precision matrix of the AR process for model fitting and prediction of the mean function m(.).

Here, we simulate from an AR(1) process as follows: y = 10sin(πx) + e; et = ρet−1 + ηt; ηt ∼ N(0, σ²); e1 = η1; ρ = 0.9; σ² = 10. Here, E(Y) = 10sin(πX); e, an AR(1) process, accounts for the temporal correlation; σ² denotes the variance of the white-noise part of the AR(1) process; and ρ captures the degree of dependence of et on et−1.
For illustration purposes, we simulate with n = 200:
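The simulation chunk is not shown; a base-R sketch consistent with the model above (the object names x and y are assumed from their later use; note that arima.sim draws from the stationary AR(1) distribution rather than initializing with e1 = η1 exactly):

```r
set.seed(2)
n <- 200
x <- matrix(runif(n), n, 1)   # covariate
sigma.sq <- 10                # white-noise variance
rho <- 0.9                    # AR(1) coefficient
# AR(1) errors: e_t = rho * e_{t-1} + eta_t, eta_t ~ N(0, sigma.sq)
eps <- arima.sim(list(order = c(1, 0, 0), ar = rho), n = n,
                 rand.gen = rnorm, sd = sqrt(sigma.sq))
y <- c(10 * sin(pi * x) + eps)
```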
In case of time series data, the code requires:

- y: an n-length vector of the response at the observed time points.
- X: an n × p matrix of the covariates at the observation time points.
- Xtest: an ntest × p matrix of the covariates where we want to estimate the function. It must have the same variables as X. Default is X.
- nthsize: the minimum node size. We recommend not setting this value too small, as that leads to very deep trees that take a long time to build and can produce unstable estimates. Default value is 20.

For details on the choice of other parameters, please refer to the help file of RFGLS_estimate_timeseries, which can be accessed with ?RFGLS_estimate_timeseries.
If the AR process parameters are known, we set param_estimate = FALSE (default value); the code additionally requires lag_params = c(ρ1, ⋯, ρq). We can fit the model as follows:
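The fitting chunk is not shown here; a sketch with the true AR(1) coefficient ρ = 0.9 supplied via lag_params (the object name est_temp_known is taken from its later use; the seed and ntree are assumptions):

```r
# Fit RF-GLS with the known AR(1) parameter rho = 0.9
set.seed(1)
est_temp_known <- RFGLS_estimate_timeseries(y, x, ntree = 50,
                                            lag_params = 0.9, nthsize = 20)
```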
If the AR process parameters are not known, we set param_estimate = TRUE; the code requires the order of the AR process, which is obtained from the length of the lag_params input vector. Hence, if we want to estimate the parameters of an AR(q) process, lag_params should be any vector of length q. Here we fit the model with q = 1.
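A sketch of this fitting call (the object name est_temp_unknown is taken from its later use; lag_params = 0 is a placeholder vector of length one, signaling an AR(1) fit, mirroring the misspecification example below):

```r
# Fit RF-GLS with the AR(1) coefficient estimated from the data
set.seed(1)
est_temp_unknown <- RFGLS_estimate_timeseries(y, x, ntree = 50, lag_params = 0,
                                              nthsize = 20, param_estimate = TRUE)
```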
Prediction of the mean function at new covariate values proceeds exactly as in the spatial setting:
Xtest <- matrix(seq(0,1, by = 1/10000), 10001, 1)
RFGLS_predict_temp_known <- RFGLS_predict(est_temp_known, Xtest)
Here also, as in the spatial scenario, RF-GLS outperforms classical RF in terms of MISE, both with the true and with the estimated AR process parameters.
library(randomForest)
set.seed(1)
RF_est_temp <- randomForest(x, y, nodesize = 20)
RF_predict_temp <- predict(RF_est_temp, Xtest)
#RF MISE
mean((RF_predict_temp - 10*sin(pi * Xtest))^2)
#> [1] 7.912517
#RF-GLS MISE
mean((RFGLS_predict_temp_known$predicted - 10*sin(pi * Xtest))^2)
#> [1] 2.471876
RFGLS_predict_temp_unknown <- RFGLS_predict(est_temp_unknown, Xtest)
#RF-GLS unknown MISE
mean((RFGLS_predict_temp_unknown$predicted - 10*sin(pi * Xtest))^2)
#> [1] 0.8791857
We now consider a scenario where the order of autoregression used for RF-GLS model fitting is misspecified: we simulate the AR errors from an AR(2) process and fit RF-GLS with an AR(1) process.
#Simulation from AR(2) process
rho1 <- 0.7
rho2 <- 0.2
set.seed(2)
b <- c(rho1, rho2)
s <- sqrt(sigma.sq)
eps = arima.sim(list(order = c(2,0,0), ar = b), n = n, rand.gen = rnorm, sd = s)
y <- c(eps + 10*sin(pi * x))
#RF-GLS with AR(1)
set.seed(3)
est_misspec_temp <- RFGLS_estimate_timeseries(y, x, ntree = 50, lag_params = 0,
nthsize = 20, param_estimate = TRUE)
RFGLS_predict_misspec_temp <- RFGLS_predict(est_misspec_temp, Xtest)
#RF
set.seed(4)
RF_est_temp <- randomForest(x, y, nodesize = 20)
RF_predict_temp <- predict(RF_est_temp, Xtest)
#RF-GLS MISE
mean((RFGLS_predict_misspec_temp$predicted - 10*sin(pi * Xtest))^2)
#> [1] 1.723218
#RF MISE
mean((RF_predict_temp - 10*sin(pi * Xtest))^2)
#> [1] 3.735003
For RFGLS_estimate_spatial, RFGLS_estimate_timeseries, RFGLS_predict and RFGLS_predict_spatial, one can also take advantage of parallelization, contingent upon the availability of multiple cores. The argument h in all of these functions determines the number of cores to be used. Here we demonstrate an example with h = 2.
#Simulation from a Gaussian process with exponential covariance
set.seed(5)
n <- 200
coords <- cbind(runif(n,0,1), runif(n,0,1))
set.seed(2)
x <- matrix(runif(n), n, 1)
sigma.sq = 10
phi = 1
tau.sq = 0.1
nu = 0.5
D <- as.matrix(dist(coords))
R <- exp(-phi*D)
w <- rmvn(1, rep(0,n), sigma.sq*R)
y <- rnorm(n, 10*sin(pi * x) + w, sqrt(tau.sq))
#RF-GLS model fitting and prediction with parallel computation
set.seed(1)
est_known_pl <- RFGLS_estimate_spatial(coords, y, x, ntree = 50, cov.model = "exponential",
nthsize = 20, sigma.sq = sigma.sq, tau.sq = tau.sq,
phi = phi, h = 2)
RFGLS_predict_known_pl <- RFGLS_predict(est_known_pl, Xtest, h = 2)
#MISE from single core
mean((RFGLS_predict_known$predicted - 10*sin(pi * Xtest))^2)
#> [1] 0.150152
#MISE from parallel computation
mean((RFGLS_predict_known_pl$predicted - 10*sin(pi * Xtest))^2)
#> [1] 0.150152
For RFGLS_estimate_spatial with a very small dataset (n) and a small number of trees (ntree), the communication overhead between the nodes outweighs the benefits of parallel computing; hence it is recommended to parallelize only for moderately large n and/or ntree. It is strongly recommended to keep the maximum value of h strictly less than the total number of cores available. Parallelization for RFGLS_estimate_timeseries can be handled identically. For RFGLS_predict and RFGLS_predict_spatial, single-core performance is very fast even for large datasets; hence, unless ntest and ntree are very high, we do not recommend using parallelization for RFGLS_predict and RFGLS_predict_spatial.