Title: | Random Forests for Dependent Data |
---|---|
Description: | Fits non-linear regression models on dependant data with Generalised Least Square (GLS) based Random Forest (RF-GLS) detailed in Saha, Basu and Datta (2021) <doi:10.1080/01621459.2021.1950003>. |
Authors: | Arkajyoti Saha [aut, cre], Sumanta Basu [aut], Abhirup Datta [aut] |
Maintainer: | Arkajyoti Saha <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.1.5 |
Built: | 2024-11-02 06:42:25 UTC |
Source: | CRAN |
The function RFGLS_estimate_spatial
fits univariate non-linear spatial regression models for
spatial data using RF-GLS in Saha et al. 2020. RFGLS_estimate_spatial
uses the sparse Cholesky representation
of Vecchia’s likelihood (Vecchia, 1988) developed in Datta et al., 2016 and Saha & Datta, 2018.
The fitted Random Forest (RF) model is used later for prediction via the RFGLS_predict
and RFGLS_predict_spatial
.
Some code blocks are borrowed from the R packages: spNNGP:
Spatial Regression Models for Large Datasets using Nearest
Neighbor Gaussian Process
https://CRAN.R-project.org/package=spNNGP and randomForest:
Breiman and Cutler's Random
Forests for Classification and Regression
https://CRAN.R-project.org/package=randomForest .
RFGLS_estimate_spatial(coords, y, X, Xtest = NULL, nrnodes = NULL, nthsize = 20, mtry = 1, pinv_choice = 1, n_omp = 1, ntree = 50, h = 1, sigma.sq = 1, tau.sq = 0.1, phi = 5, nu = 0.5, n.neighbors = 15, cov.model = "exponential", search.type = "tree", param_estimate = FALSE, verbose = FALSE)
RFGLS_estimate_spatial(coords, y, X, Xtest = NULL, nrnodes = NULL, nthsize = 20, mtry = 1, pinv_choice = 1, n_omp = 1, ntree = 50, h = 1, sigma.sq = 1, tau.sq = 0.1, phi = 5, nu = 0.5, n.neighbors = 15, cov.model = "exponential", search.type = "tree", param_estimate = FALSE, verbose = FALSE)
coords |
an |
y |
an |
X |
an |
Xtest |
an |
nrnodes |
the maximum number of nodes a tree can have. Default choice leads to the deepest tree contigent on |
nthsize |
minimum size of leaf nodes. We recommend not setting this value too small, as that will lead to very deep trees that takes a lot of time to be built and can produce unstable estimaes. Default value is 20. |
mtry |
number of variables randomly sampled at each partition as a candidate split direction. We recommend using
the value p/3 where p is the number of variables in |
pinv_choice |
dictates the choice of method for obtaining the pseudoinverse involved in the cost function and node
representative evaluation. if pinv_choice = 0, SVD is used (slower but more stable), if pinv_choice = 1,
orthogonal decomposition (faster, may produce unstable results if |
n_omp |
number of threads to be used, value can be more than 1 if source code is compiled with OpenMP support. Default is 1. |
ntree |
number of trees to be grown. This value should not be too small. Default value is 50. |
h |
number of core to be used in parallel computing setup for bootstrap samples. If h = 1, there is no parallelization. Default value is 1. |
sigma.sq |
value of sigma square. Default value is 1. |
tau.sq |
value of tau square. Default value is 0.1. |
phi |
value of phi. Default value is 5. |
nu |
value of nu, only required for matern covariance model. Default value is 0.5. |
n.neighbors |
number of neighbors used in the NNGP. Default value is 15. |
cov.model |
keyword that specifies the covariance function to be used in modelling the spatial dependence structure among the observations. Supported keywords are: "exponential", "matern", "spherical", and "gaussian" for exponential, matern, spherical and gaussian covariance function respectively. Default value is "exponential". |
search.type |
keyword that specifies type of nearest neighbor search algorithm to be used. Supported keywords are: "tree" and "brute". Both of them provide the same result, though "tree" should be faster. Default value is "tree". |
param_estimate |
if |
verbose |
if |
A list comprising:
P_matrix |
an |
predicted_matrix |
an |
predicted |
preducted values at the |
X |
the matrix |
y |
the vector |
RFGLS_Object |
object required for prediction. |
Arkajyoti Saha [email protected],
Sumanta Basu [email protected],
Abhirup Datta [email protected]
Saha, A., Basu, S., & Datta, A. (2020). Random Forests for dependent data. arXiv preprint arXiv:2007.15421.
Saha, A., & Datta, A. (2018). BRISC: bootstrap for rapid inference on spatial covariances. Stat, e184, DOI: 10.1002/sta4.184.
Datta, A., S. Banerjee, A.O. Finley, and A.E. Gelfand. (2016) Hierarchical Nearest-Neighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111:800-812.
Andrew Finley, Abhirup Datta and Sudipto Banerjee (2017). spNNGP: Spatial Regression Models for Large Datasets using Nearest Neighbor Gaussian Processes. R package version 0.1.1. https://CRAN.R-project.org/package=spNNGP
Andy Liaw, and Matthew Wiener (2015). randomForest: Breiman and Cutler's Random
Forests for Classification and Regression.
R package version 4.6-14.
https://CRAN.R-project.org/package=randomForest
rmvn <- function(n, mu = 0, V = matrix(1)){ p <- length(mu) if(any(is.na(match(dim(V),p)))) stop("Dimension not right!") D <- chol(V) t(matrix(rnorm(n*p), ncol=p)%*%D + rep(mu,rep(n,p))) } set.seed(1) n <- 200 coords <- cbind(runif(n,0,1), runif(n,0,1)) set.seed(2) x <- as.matrix(rnorm(n),n,1) sigma.sq = 1 phi = 5 tau.sq = 0.1 D <- as.matrix(dist(coords)) R <- exp(-phi*D) w <- rmvn(1, rep(0,n), sigma.sq*R) y <- rnorm(n, 10*sin(pi * x) + w, sqrt(tau.sq)) estimation_result <- RFGLS_estimate_spatial(coords, y, x, ntree = 10)
rmvn <- function(n, mu = 0, V = matrix(1)){ p <- length(mu) if(any(is.na(match(dim(V),p)))) stop("Dimension not right!") D <- chol(V) t(matrix(rnorm(n*p), ncol=p)%*%D + rep(mu,rep(n,p))) } set.seed(1) n <- 200 coords <- cbind(runif(n,0,1), runif(n,0,1)) set.seed(2) x <- as.matrix(rnorm(n),n,1) sigma.sq = 1 phi = 5 tau.sq = 0.1 D <- as.matrix(dist(coords)) R <- exp(-phi*D) w <- rmvn(1, rep(0,n), sigma.sq*R) y <- rnorm(n, 10*sin(pi * x) + w, sqrt(tau.sq)) estimation_result <- RFGLS_estimate_spatial(coords, y, x, ntree = 10)
The function RFGLS_estimate_spatial
fits univariate non-linear regression models for
time-series data using a RF-GLS in Saha et al. 2020. RFGLS_estimate_spatial
uses the sparse Cholesky representation
corresponsinding to AR(q)
process. The fitted Random Forest (RF) model is used later for
prediction via the RFGLS-predict
.
Some code blocks are borrowed from the R packages: spNNGP:
Spatial Regression Models for Large Datasets using Nearest Neighbor
Gaussian Processes
https://CRAN.R-project.org/package=spNNGP and
randomForest: Breiman and Cutler's Random Forests for Classification
and Regression
https://CRAN.R-project.org/package=randomForest .
RFGLS_estimate_timeseries(y, X, Xtest = NULL, nrnodes = NULL, nthsize = 20, mtry = 1, pinv_choice = 1, n_omp = 1, ntree = 50, h = 1, lag_params = 0.5, variance = 1, param_estimate = FALSE, verbose = FALSE)
RFGLS_estimate_timeseries(y, X, Xtest = NULL, nrnodes = NULL, nthsize = 20, mtry = 1, pinv_choice = 1, n_omp = 1, ntree = 50, h = 1, lag_params = 0.5, variance = 1, param_estimate = FALSE, verbose = FALSE)
y |
an |
X |
an |
Xtest |
an |
nrnodes |
the maximum number of nodes a tree can have. Default choice leads to the deepest tree contigent on |
nthsize |
minimum size of leaf nodes. We recommend not setting this value too small, as that will lead to very deep trees that takes a lot of time to be built and can produce unstable estimaes. Default value is 20. |
mtry |
number of variables randomly sampled at each partition as a candidate split direction. We recommend using
the value p/3 where p is the number of variables in |
pinv_choice |
dictates the choice of method for obtaining the pseudoinverse involved in the cost function and node
representative evaluation. if pinv_choice = 0, SVD is used (slower but more stable), if pinv_choice = 1,
orthogonal decomposition (faster, may produce unstable results if |
n_omp |
number of threads to be used, value can be more than 1 if source code is compiled with OpenMP support. Default is 1. |
ntree |
number of trees to be grown. This value should not be too small. Default value is 50. |
h |
number of core to be used in parallel computing setup for bootstrap samples. If h = 1, there is no parallelization. Default value is 1. |
lag_params |
|
variance |
variance of the white noise in temporal error. The function estimate is not affected by this. Default value is 1. |
param_estimate |
if |
verbose |
if |
A list comprising:
P_matrix |
an |
predicted_matrix |
an |
predicted |
preducted values at the |
X |
the matrix |
y |
the vector |
RFGLS_Object |
object required for prediction. |
Arkajyoti Saha [email protected],
Sumanta Basu [email protected],
Abhirup Datta [email protected]
Saha, A., Basu, S., & Datta, A. (2020). Random Forests for dependent data. arXiv preprint arXiv:2007.15421.
Saha, A., & Datta, A. (2018). BRISC: bootstrap for rapid inference on spatial covariances. Stat, e184, DOI: 10.1002/sta4.184.
Andy Liaw, and Matthew Wiener (2015). randomForest: Breiman and Cutler's Random
Forests for Classification and Regression. R package version 4.6-14.
https://CRAN.R-project.org/package=randomForest
Andrew Finley, Abhirup Datta and Sudipto Banerjee (2017). spNNGP: Spatial Regression Models for Large Datasets using Nearest Neighbor Gaussian Processes. R package version 0.1.1. https://CRAN.R-project.org/package=spNNGP
rmvn <- function(n, mu = 0, V = matrix(1)){ p <- length(mu) if(any(is.na(match(dim(V),p)))) stop("Dimension not right!") D <- chol(V) t(matrix(rnorm(n*p), ncol=p)%*%D + rep(mu,rep(n,p))) } set.seed(2) n <- 200 x <- as.matrix(rnorm(n),n,1) sigma.sq <- 1 rho <- 0.5 set.seed(3) b <- rho s <- sqrt(sigma.sq) eps = arima.sim(list(order = c(1,0,0), ar = b), n = n, rand.gen = rnorm, sd = s) y <- eps + 10*sin(pi * x) estimation_result <- RFGLS_estimate_timeseries(y, x, ntree = 10)
rmvn <- function(n, mu = 0, V = matrix(1)){ p <- length(mu) if(any(is.na(match(dim(V),p)))) stop("Dimension not right!") D <- chol(V) t(matrix(rnorm(n*p), ncol=p)%*%D + rep(mu,rep(n,p))) } set.seed(2) n <- 200 x <- as.matrix(rnorm(n),n,1) sigma.sq <- 1 rho <- 0.5 set.seed(3) b <- rho s <- sqrt(sigma.sq) eps = arima.sim(list(order = c(1,0,0), ar = b), n = n, rand.gen = rnorm, sd = s) y <- eps + 10*sin(pi * x) estimation_result <- RFGLS_estimate_timeseries(y, x, ntree = 10)
The function RFGLS_predict
predicts the mean function at a given set of covariates.
It uses a fitted RF-GLS model in Saha et al. 2020 to obtain the predictions.
Some code blocks are borrowed from the R package: randomForest: Breiman and Cutler's Random
Forests for Classification and Regression
https://CRAN.R-project.org/package=randomForest .
RFGLS_predict(RFGLS_out, Xtest, h = 1, verbose = FALSE)
RFGLS_predict(RFGLS_out, Xtest, h = 1, verbose = FALSE)
RFGLS_out |
an object obtained from |
Xtest |
an |
h |
number of core to be used in parallel computing setup for bootstrap samples. If |
verbose |
if |
A list comprising:
predicted_matrix |
an |
predicted |
preducted values at the |
Arkajyoti Saha [email protected],
Sumanta Basu [email protected],
Abhirup Datta [email protected]
Saha, A., Basu, S., & Datta, A. (2020). Random Forests for dependent data. arXiv preprint arXiv:2007.15421.
Andy Liaw, and Matthew Wiener (2015). randomForest: Breiman and Cutler's Random
Forests for Classification and Regression. R package version 4.6-14.
https://CRAN.R-project.org/package=randomForest
rmvn <- function(n, mu = 0, V = matrix(1)){ p <- length(mu) if(any(is.na(match(dim(V),p)))) stop("Dimension not right!") D <- chol(V) t(matrix(rnorm(n*p), ncol=p)%*%D + rep(mu,rep(n,p))) } set.seed(2) n <- 200 x <- as.matrix(rnorm(n),n,1) sigma.sq <- 1 rho <- 0.5 set.seed(3) b <- rho s <- sqrt(sigma.sq) eps = arima.sim(list(order = c(1,0,0), ar = b), n = n, rand.gen = rnorm, sd = s) y <- eps + 10*sin(pi * x[,1]) estimation_result <- RFGLS_estimate_timeseries(y, x, ntree = 10) Xtest <- matrix(seq(0,1, by = 1/1000), 1001, 1) RFGLS_predict <- RFGLS_predict(estimation_result, Xtest)
rmvn <- function(n, mu = 0, V = matrix(1)){ p <- length(mu) if(any(is.na(match(dim(V),p)))) stop("Dimension not right!") D <- chol(V) t(matrix(rnorm(n*p), ncol=p)%*%D + rep(mu,rep(n,p))) } set.seed(2) n <- 200 x <- as.matrix(rnorm(n),n,1) sigma.sq <- 1 rho <- 0.5 set.seed(3) b <- rho s <- sqrt(sigma.sq) eps = arima.sim(list(order = c(1,0,0), ar = b), n = n, rand.gen = rnorm, sd = s) y <- eps + 10*sin(pi * x[,1]) estimation_result <- RFGLS_estimate_timeseries(y, x, ntree = 10) Xtest <- matrix(seq(0,1, by = 1/1000), 1001, 1) RFGLS_predict <- RFGLS_predict(estimation_result, Xtest)
The function RFGLS_predict_spatial
performs fast prediction on a set of new locations by combining
non-linear mean estimate from a fitted RF-GLS model in Saha et al. 2020 with spatial kriging estimate obtained by using Nearest Neighbor Gaussian Processes (NNGP) (Datta et al., 2016).
Some code blocks are borrowed from the R packages: spNNGP:
Spatial Regression Models for Large Datasets using Nearest Neighbor Gaussian Processes
https://CRAN.R-project.org/package=spNNGP and randomForest: Breiman and Cutler's Random
Forests for Classification and Regression
https://CRAN.R-project.org/package=randomForest .
RFGLS_predict_spatial(RFGLS_out, coords.0, Xtest, h = 1, verbose = FALSE)
RFGLS_predict_spatial(RFGLS_out, coords.0, Xtest, h = 1, verbose = FALSE)
RFGLS_out |
an object obtained from |
coords.0 |
the spatial coordinates corresponding to prediction locations. |
Xtest |
an |
h |
number of core to be used in parallel computing setup for bootstrap samples. If |
verbose |
if |
A list comprising:
prediction |
predicted spatial response corresponding to |
Arkajyoti Saha [email protected],
Sumanta Basu [email protected],
Abhirup Datta [email protected]
Saha, A., Basu, S., & Datta, A. (2020). Random Forests for dependent data. arXiv preprint arXiv:2007.15421.
Saha, A., & Datta, A. (2018). BRISC: bootstrap for rapid inference on spatial covariances. Stat, e184, DOI: 10.1002/sta4.184.
Datta, A., S. Banerjee, A.O. Finley, and A.E. Gelfand. (2016) Hierarchical Nearest-Neighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111:800-812.
Andrew Finley, Abhirup Datta and Sudipto Banerjee (2017). spNNGP: Spatial Regression Models for Large Datasets using Nearest Neighbor Gaussian Processes. R package version 0.1.1. https://CRAN.R-project.org/package=spNNGP
Andy Liaw, and Matthew Wiener (2015). randomForest: Breiman and Cutler's Random
Forests for Classification and Regression. R package version 4.6-14.
https://CRAN.R-project.org/package=randomForest
rmvn <- function(n, mu = 0, V = matrix(1)){ p <- length(mu) if(any(is.na(match(dim(V),p)))) stop("Dimension not right!") D <- chol(V) t(matrix(rnorm(n*p), ncol=p)%*%D + rep(mu,rep(n,p))) } set.seed(1) n <- 250 coords <- cbind(runif(n,0,1), runif(n,0,1)) set.seed(2) x <- as.matrix(rnorm(n),n,1) sigma.sq = 1 phi = 5 tau.sq = 0.1 D <- as.matrix(dist(coords)) R <- exp(-phi*D) w <- rmvn(1, rep(0,n), sigma.sq*R) y <- rnorm(n, 10*sin(pi * x) + w, sqrt(tau.sq)) estimation_result <- RFGLS_estimate_spatial(coords[1:200,], y[1:200], matrix(x[1:200,],200,1), ntree = 10) prediction_result <- RFGLS_predict_spatial(estimation_result, coords[201:250,], matrix(x[201:250,],50,1))
rmvn <- function(n, mu = 0, V = matrix(1)){ p <- length(mu) if(any(is.na(match(dim(V),p)))) stop("Dimension not right!") D <- chol(V) t(matrix(rnorm(n*p), ncol=p)%*%D + rep(mu,rep(n,p))) } set.seed(1) n <- 250 coords <- cbind(runif(n,0,1), runif(n,0,1)) set.seed(2) x <- as.matrix(rnorm(n),n,1) sigma.sq = 1 phi = 5 tau.sq = 0.1 D <- as.matrix(dist(coords)) R <- exp(-phi*D) w <- rmvn(1, rep(0,n), sigma.sq*R) y <- rnorm(n, 10*sin(pi * x) + w, sqrt(tau.sq)) estimation_result <- RFGLS_estimate_spatial(coords[1:200,], y[1:200], matrix(x[1:200,],200,1), ntree = 10) prediction_result <- RFGLS_predict_spatial(estimation_result, coords[201:250,], matrix(x[201:250,],50,1))