Title: | Bayesian Regression with Meshed Gaussian Processes |
---|---|
Description: | Fits Bayesian regression models based on latent Meshed Gaussian Processes (MGP) as described in Peruzzi, Banerjee, Finley (2020) <doi:10.1080/01621459.2020.1833889>, Peruzzi, Banerjee, Dunson, and Finley (2021) <arXiv:2101.03579>, Peruzzi and Dunson (2022) <arXiv:2201.10080>. Funded by ERC grant 856506 and NIH grant R01ES028804. |
Authors: | Michele Peruzzi |
Maintainer: | Michele Peruzzi <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.2.3 |
Built: | 2024-12-16 07:08:46 UTC |
Source: | CRAN |
meshed
is a flexible package for Bayesian regression analysis on spatial or spatiotemporal datasets. The main function for fitting regression models is spmeshed
, which outputs posterior samples obtained from Markov chain Monte Carlo which can be summarised using standard tools. The package also provides a function rmeshedgp
for quickly simulating correlated spatial or spatiotemporal data at a very large number of locations.
The functions rmeshedgp
and spmeshed
are provided for prior and posterior sampling (respectively) of Bayesian spatial or spatiotemporal multivariate regression models based on Meshed Gaussian Processes as introduced by Peruzzi, Banerjee, and Finley (2020). Posterior sampling via spmeshed
proceeds by default via GriPS as detailed in Peruzzi, Banerjee, Dunson, and Finley (2021). When at least one outcome is not modeled with Gaussian errors, sampling proceeds taking advantage of Metropolis-adjusted Langevin dynamics as detailed in Peruzzi and Dunson (2022).
Michele Peruzzi
Peruzzi, M., Banerjee, S., and Finley, A.O. (2022) Highly Scalable Bayesian Geostatistical Modeling via Meshed Gaussian Processes on Partitioned Domains. Journal of the American Statistical Association, 117(538):969-982. doi:10.1080/01621459.2020.1833889
Peruzzi, M., Banerjee, S., Dunson, D.B., and Finley, A.O. (2021) Grid-Parametrize-Split (GriPS) for Improved Scalable Inference in Spatial Big Data Analysis. https://arxiv.org/abs/2101.03579
Peruzzi, M., Dunson, D.B. (2022) Spatial meshing for general Bayesian multivariate models. https://arxiv.org/abs/2201.10080
Sample from the posterior predictive distribution of the outcomes at new spatial or spatiotemporal locations after MCMC.
## S3 method for class 'spmeshed' predict(object, newx, newcoords, n_threads=4, verbose=FALSE, ...)
## S3 method for class 'spmeshed' predict(object, newx, newcoords, n_threads=4, verbose=FALSE, ...)
object |
Object output from |
newx |
matrix of covariate values at the new coordinates. |
newcoords |
matrix of new coordinates. |
n_threads |
integer number of OpenMP threads. This is ineffective if |
verbose |
boolean for progress messagging. |
... |
other arguments (unused). |
While this function can always be used to make predictions, in most cases it is more efficient to just include the prediction locations in the main data as NA
values; spmeshed
will sample from the posterior predictive distribution at those locations while doing MCMC. The predict
method is only recommended when all 4 of the following are true:
(1) spmeshed
was run with settings$forced_grid=FALSE
and
(2) the prediction locations are uniformly scattered on the domain (or rather, they are not clustered as a large empty area) and
(3) the number of prediction locations is a large portion of the number of observed data points and
(4) the prediction locations are not on a grid.
In all other cases the main spmeshed
function is setup to be more efficient in automatically performing predictions during MCMC.
coords_out |
matrix with the prediction location coordinates (order updated after predictions). |
preds_out |
array of dimension ( |
Michele Peruzzi [email protected]
Peruzzi, M., Banerjee, S., and Finley, A.O. (2020) Highly Scalable Bayesian Geostatistical Modeling via Meshed Gaussian Processes on Partitioned Domains. Journal of the American Statistical Association, in press. doi:10.1080/01621459.2020.1833889
Peruzzi, M., Banerjee, S., Dunson, D.B., and Finley, A.O. (2021) Grid-Parametrize-Split (GriPS) for Improved Scalable Inference in Spatial Big Data Analysis. https://arxiv.org/abs/2101.03579
# toy example with tiny dataset and short MCMC # on a univariate outcome library(magrittr) library(dplyr) library(meshed) set.seed(2021) SS <- 12 n <- SS^2 # total n. locations, including missing ones coords <- data.frame(Var1=runif(n), Var2=runif(n)) %>% as.matrix() # generate data sigmasq <- 2.3 phi <- 6 tausq <- .1 B <- c(-1,.5,1) CC <- sigmasq * exp(-phi * as.matrix(dist(coords))) LC <- t(chol(CC)) w <- LC %*% rnorm(n) p <- length(B) X <- rnorm(n * p) %>% matrix(ncol=p) y_full <- X %*% B + w + tausq^.5 * rnorm(n) set_missing <- rbinom(n, 1, 0.1) simdata <- data.frame(coords, y_full = y_full, w_latent = w) %>% mutate(y_observed = ifelse(set_missing==1, NA, y_full)) # MCMC setup mcmc_keep <- 500 mcmc_burn <- 100 mcmc_thin <- 2 y <- simdata$y_observed ybar <- mean(y, na.rm=TRUE) # training set y_in <- (y-ybar)[!is.na(y)] X_in <- X[!is.na(y),] coords_in <- coords[!is.na(y),] # suppose we dont want to have gridded knots # i.e. we are fixing the MGP reference set at the observed locations # (this may be inefficient in big data settings) meshout <- spmeshed(y_in, X_in, coords_in, axis_partition=c(4,4), n_samples = mcmc_keep, n_burn = mcmc_burn, n_thin = mcmc_thin, settings = list(forced_grid=FALSE, cache=FALSE), prior=list(phi=c(1,15)), verbose = 0, n_threads = 1) # test set coords_out <- coords[is.na(y),] X_out <- X[is.na(y),] df_predict <- predict(meshout, newx=X_out, newcoords=coords_out) y_posterior_predictive_mean <- df_predict$preds_out[,1,] %>% apply(1, mean) %>% add(ybar) df_predicted <- df_predict$coords_out %>% cbind(y_posterior_predictive_mean)
# toy example with tiny dataset and short MCMC # on a univariate outcome library(magrittr) library(dplyr) library(meshed) set.seed(2021) SS <- 12 n <- SS^2 # total n. locations, including missing ones coords <- data.frame(Var1=runif(n), Var2=runif(n)) %>% as.matrix() # generate data sigmasq <- 2.3 phi <- 6 tausq <- .1 B <- c(-1,.5,1) CC <- sigmasq * exp(-phi * as.matrix(dist(coords))) LC <- t(chol(CC)) w <- LC %*% rnorm(n) p <- length(B) X <- rnorm(n * p) %>% matrix(ncol=p) y_full <- X %*% B + w + tausq^.5 * rnorm(n) set_missing <- rbinom(n, 1, 0.1) simdata <- data.frame(coords, y_full = y_full, w_latent = w) %>% mutate(y_observed = ifelse(set_missing==1, NA, y_full)) # MCMC setup mcmc_keep <- 500 mcmc_burn <- 100 mcmc_thin <- 2 y <- simdata$y_observed ybar <- mean(y, na.rm=TRUE) # training set y_in <- (y-ybar)[!is.na(y)] X_in <- X[!is.na(y),] coords_in <- coords[!is.na(y),] # suppose we dont want to have gridded knots # i.e. we are fixing the MGP reference set at the observed locations # (this may be inefficient in big data settings) meshout <- spmeshed(y_in, X_in, coords_in, axis_partition=c(4,4), n_samples = mcmc_keep, n_burn = mcmc_burn, n_thin = mcmc_thin, settings = list(forced_grid=FALSE, cache=FALSE), prior=list(phi=c(1,15)), verbose = 0, n_threads = 1) # test set coords_out <- coords[is.na(y),] X_out <- X[is.na(y),] df_predict <- predict(meshout, newx=X_out, newcoords=coords_out) y_posterior_predictive_mean <- df_predict$preds_out[,1,] %>% apply(1, mean) %>% add(ybar) df_predicted <- df_predict$coords_out %>% cbind(y_posterior_predictive_mean)
Generates samples from a (univariate) MGP assuming a cubic directed acyclic graph and axis-parallel domain partitioning.
rmeshedgp(coords, theta, axis_partition = NULL, block_size = 100, n_threads=1, cache=TRUE, verbose=FALSE, debug=FALSE)
rmeshedgp(coords, theta, axis_partition = NULL, block_size = 100, n_threads=1, cache=TRUE, verbose=FALSE, debug=FALSE)
coords |
matrix of spatial or spatiotemporal coordinates with |
theta |
vector with covariance parameters. If |
axis_partition |
integer vector of length |
block_size |
integer specifying the (approximate) size of the blocks, i.e. how many spatial or spatiotemporal locations should be included in each block. Note: larger values correspond to an MGP that is closer to a full GP, but require more expensive computations. |
n_threads |
integer number of OpenMP threads. This is ineffective if |
cache |
bool: whether to use cache. Some computational speedup is associated to |
verbose |
bool: print some messages. |
debug |
bool: print more messages. |
Gaussian processes (GPs) lack in scalability to big datasets due to the assumed unrestricted dependence across the spatial or spatiotemporal domain.
Meshed GPs instead use a directed acyclic graph (DAG) with patterns, called mesh, to simplify the dependence structure across the domain. Each DAG node corresponds to a partition of the domain. MGPs can be interpreted as approximating the GP they originate from, or as standalone processes that can be sampled from. This function samples random MGPs and can thus be used to generate big spatial or spatiotemporal data.
The only requirement to sample from a MGP compared to a standard GP is the specification of the domain partitioning strategy. Here, either axis_partition
or block_size
can be used; the default block_size=100
can be used to quickly sample smooth surfaces at millions of locations.
Just like in a standard GP, one needs a covariance function or kernel which can be set as follows.
For spatial data (), the length of
theta
determines which model is used (see above). Letting where
and
are locations in the spatial domain, the exponential covariance is defined as:
whereas the Matern model is
where is the modified Bessel function of the second kind of order
.
For spatiotemporal data (
) the covariance function between locations
and
with distance
and time lag
is defined as
which is a special case of non-separable spacetime covariance as introduced by Gneiting (2002).
data.frame with the (reordered) supplied coordinates in the first d
columns, and the MGP sample in the last column, labeled w
.
Michele Peruzzi <[email protected]>
Gneiting, T (2002) Nonseparable, Stationary Covariance Functions for Space-Time Data. Journal of the American Statistical Association. doi:10.1198/016214502760047113
Peruzzi, M., Banerjee, S., and Finley, A.O. (2020) Highly Scalable Bayesian Geostatistical Modeling via Meshed Gaussian Processes on Partitioned Domains. Journal of the American Statistical Association, in press. doi:10.1080/01621459.2020.1833889
library(ggplot2) library(magrittr) library(meshed) # spatial domain (we choose a grid to make a nice image later) # this generates a dataset of size 6400 xx <- seq(0, 1, length.out=80) coords <- expand.grid(xx, xx) %>% as.matrix() raster_plot <- function(df){ ggplot(df, aes(Var1, Var2, fill=w)) + geom_raster() + scale_fill_viridis_c() + theme_minimal() } # spatial data, exponential covariance # phi=14, sigma^2=2 simdata <- rmeshedgp(coords, c(14, 2)) raster_plot(simdata) # spatial data, matern covariance # phi=14, nu=1, sigma^2=2 simdata <- rmeshedgp(coords, c(14, 1, 2)) raster_plot(simdata) # spacetime data, gneiting's covariance # 64000 locations stcoords <- expand.grid(xx, xx, seq(0, 1, length.out=10)) # it should take less than a couple of seconds simdata <- rmeshedgp(stcoords, c(1, 14, .5, 2)) # plot data at 7th time period raster_plot(simdata %>% dplyr::filter(Var3==unique(Var3)[7]))
library(ggplot2) library(magrittr) library(meshed) # spatial domain (we choose a grid to make a nice image later) # this generates a dataset of size 6400 xx <- seq(0, 1, length.out=80) coords <- expand.grid(xx, xx) %>% as.matrix() raster_plot <- function(df){ ggplot(df, aes(Var1, Var2, fill=w)) + geom_raster() + scale_fill_viridis_c() + theme_minimal() } # spatial data, exponential covariance # phi=14, sigma^2=2 simdata <- rmeshedgp(coords, c(14, 2)) raster_plot(simdata) # spatial data, matern covariance # phi=14, nu=1, sigma^2=2 simdata <- rmeshedgp(coords, c(14, 1, 2)) raster_plot(simdata) # spacetime data, gneiting's covariance # 64000 locations stcoords <- expand.grid(xx, xx, seq(0, 1, length.out=10)) # it should take less than a couple of seconds simdata <- rmeshedgp(stcoords, c(1, 14, .5, 2)) # plot data at 7th time period raster_plot(simdata %>% dplyr::filter(Var3==unique(Var3)[7]))
Fits Bayesian multivariate spatial or spatiotemporal regression models with latent MGPs via Markov chain Monte Carlo.
spmeshed(y, x, coords, k=NULL, family = "gaussian", axis_partition = NULL, block_size = 30, grid_size = NULL, grid_custom = NULL, n_samples = 1000, n_burnin = 100, n_thin = 1, n_threads = 4, verbose = 0, predict_everywhere = FALSE, settings = list(adapting=TRUE, forced_grid=NULL, cache=NULL, ps=TRUE, saving=TRUE, low_mem=FALSE, hmc=4), prior = list(beta=NULL, tausq=NULL, sigmasq = NULL, phi=NULL, a=NULL, nu = NULL, toplim = NULL, btmlim = NULL, set_unif_bounds=NULL), starting = list(beta=NULL, tausq=NULL, theta=NULL, lambda=NULL, v=NULL, a=NULL, nu = NULL, mcmcsd=.05, mcmc_startfrom=0), debug = list(sample_beta=TRUE, sample_tausq=TRUE, sample_theta=TRUE, sample_w=TRUE, sample_lambda=TRUE, verbose=FALSE, debug=FALSE), indpart=FALSE )
spmeshed(y, x, coords, k=NULL, family = "gaussian", axis_partition = NULL, block_size = 30, grid_size = NULL, grid_custom = NULL, n_samples = 1000, n_burnin = 100, n_thin = 1, n_threads = 4, verbose = 0, predict_everywhere = FALSE, settings = list(adapting=TRUE, forced_grid=NULL, cache=NULL, ps=TRUE, saving=TRUE, low_mem=FALSE, hmc=4), prior = list(beta=NULL, tausq=NULL, sigmasq = NULL, phi=NULL, a=NULL, nu = NULL, toplim = NULL, btmlim = NULL, set_unif_bounds=NULL), starting = list(beta=NULL, tausq=NULL, theta=NULL, lambda=NULL, v=NULL, a=NULL, nu = NULL, mcmcsd=.05, mcmc_startfrom=0), debug = list(sample_beta=TRUE, sample_tausq=TRUE, sample_theta=TRUE, sample_w=TRUE, sample_lambda=TRUE, verbose=FALSE, debug=FALSE), indpart=FALSE )
y |
matrix of multivariate outcomes with |
x |
matrix of covariates with |
coords |
matrix of coordinates with |
k |
integer |
family |
a vector with length |
axis_partition |
integer vector of size |
block_size |
integer approximate size of the blocks after domain partitioning. Only used if |
grid_size |
integer vector of size |
grid_custom |
list with elements |
n_samples |
integer number of MCMC samples at which all the unknowns are stored (including the latent effects). |
n_burnin |
integer number of MCMC samples to discard at the beginning of the chain. |
n_thin |
integer thinning parameter for the MCMC chain. Only the chain of latent effects ( |
n_threads |
integer number of OpenMP threads. This is ineffective if |
verbose |
integer. If |
predict_everywhere |
bool used if settings$forced_grid=T. Should predictions be made at the reference grid locations? If not, predictions will be made only at the supplied NA values of Y. |
settings |
list: |
prior |
list: setup for priors of unknown parameters. |
starting |
list: setup for starting values of unknown parameters. |
debug |
list: setup for debugging things. Some parts of MCMC can be turned off here. |
indpart |
bool defaults to |
This function targets the following model:
where is a
-dimensional vector of outcomes at spatial location
,
is a
-dimensional vector of covariates with static coefficients
,
is a matrix of factor loadings of size
,
is a
-dimensional vector which collects the realization of independent Gaussian processes
for
and where
is a correlation function.
is a coordinate in space (
) or space plus time (
). The Meshed GP implemented here associates an axis-parallel tessellation of the domain to a cubic directed acyclic graph (mesh).
coordsdata |
data.frame including the original |
savedata |
Available if |
yhat_mcmc |
list of length |
v_mcmc |
list of length |
w_mcmc |
list of length |
lp_mcmc |
list of length |
beta_mcmc |
array of size |
tausq_mcmc |
matrix of size |
theta_mcmc |
array of size |
lambda_mcmc |
array of size |
paramsd |
Cholesky factorization of the proposal covariance for adaptive MCMC, after adaptation. |
mcmc |
Total number of MCMC iterations performed. |
mcmc_time |
Time in seconds taken for MCMC (not including preprocessing). |
Michele Peruzzi [email protected]
Peruzzi, M., Banerjee, S., and Finley, A.O. (2022) Highly Scalable Bayesian Geostatistical Modeling via Meshed Gaussian Processes on Partitioned Domains. Journal of the American Statistical Association, 117(538):969-982. doi:10.1080/01621459.2020.1833889
Peruzzi, M., Banerjee, S., Dunson, D.B., and Finley, A.O. (2021) Grid-Parametrize-Split (GriPS) for Improved Scalable Inference in Spatial Big Data Analysis. https://arxiv.org/abs/2101.03579
Peruzzi, M. and Dunson, D.B. (2022) Spatial meshing for general Bayesian multivariate models. https://arxiv.org/abs/2201.10080
# toy example with tiny dataset and short MCMC # on a univariate outcome library(magrittr) library(dplyr) library(ggplot2) library(meshed) set.seed(2021) SS <- 12 n <- SS^2 # total n. locations, including missing ones coords <- expand.grid(xx <- seq(0,1,length.out=SS), xx) %>% as.matrix() # generate data sigmasq <- 2.3 phi <- 6 tausq <- .1 B <- c(-1,.5,1) CC <- sigmasq * exp(-phi * as.matrix(dist(coords))) LC <- t(chol(CC)) w <- LC %*% rnorm(n) p <- length(B) X <- rnorm(n * p) %>% matrix(ncol=p) y_full <- X %*% B + w + tausq^.5 * rnorm(n) set_missing <- rbinom(n, 1, 0.1) simdata <- data.frame(coords, y_full = y_full, w_latent = w) %>% mutate(y_observed = ifelse(set_missing==1, NA, y_full)) # MCMC setup mcmc_keep <- 500 mcmc_burn <- 100 mcmc_thin <- 2 y <- simdata$y_observed ybar <- mean(y, na.rm=TRUE) meshout <- spmeshed(y-ybar, X, coords, axis_partition=c(4,4), n_samples = mcmc_keep, n_burn = mcmc_burn, n_thin = mcmc_thin, prior=list(phi=c(1,15)), verbose = 0, n_threads = 1) # posterior means best_post_mean <- meshout$beta_mcmc %>% apply(1:2, mean) # process means wmesh <- data.frame(w_mgp = meshout$w_mcmc %>% summary_list_mean()) # predictions ymesh <- data.frame(y_mgp = meshout$yhat_mcmc %>% summary_list_mean()) outdf <- meshout$coordsdata %>% cbind(ymesh, wmesh) # plot predictions pred_plot <- outdf %>% ggplot(aes(Var1, Var2, color=y_mgp)) + geom_point() + scale_color_viridis_c() # plot latent process latent_plot <- outdf %>% ggplot(aes(Var1, Var2, color=w_mgp)) + geom_point() + scale_color_viridis_c() # estimation of regression coefficients plot(density(meshout$beta_mcmc[1,1,])) abline(v=B[1], col="red")
# toy example with tiny dataset and short MCMC # on a univariate outcome library(magrittr) library(dplyr) library(ggplot2) library(meshed) set.seed(2021) SS <- 12 n <- SS^2 # total n. locations, including missing ones coords <- expand.grid(xx <- seq(0,1,length.out=SS), xx) %>% as.matrix() # generate data sigmasq <- 2.3 phi <- 6 tausq <- .1 B <- c(-1,.5,1) CC <- sigmasq * exp(-phi * as.matrix(dist(coords))) LC <- t(chol(CC)) w <- LC %*% rnorm(n) p <- length(B) X <- rnorm(n * p) %>% matrix(ncol=p) y_full <- X %*% B + w + tausq^.5 * rnorm(n) set_missing <- rbinom(n, 1, 0.1) simdata <- data.frame(coords, y_full = y_full, w_latent = w) %>% mutate(y_observed = ifelse(set_missing==1, NA, y_full)) # MCMC setup mcmc_keep <- 500 mcmc_burn <- 100 mcmc_thin <- 2 y <- simdata$y_observed ybar <- mean(y, na.rm=TRUE) meshout <- spmeshed(y-ybar, X, coords, axis_partition=c(4,4), n_samples = mcmc_keep, n_burn = mcmc_burn, n_thin = mcmc_thin, prior=list(phi=c(1,15)), verbose = 0, n_threads = 1) # posterior means best_post_mean <- meshout$beta_mcmc %>% apply(1:2, mean) # process means wmesh <- data.frame(w_mgp = meshout$w_mcmc %>% summary_list_mean()) # predictions ymesh <- data.frame(y_mgp = meshout$yhat_mcmc %>% summary_list_mean()) outdf <- meshout$coordsdata %>% cbind(ymesh, wmesh) # plot predictions pred_plot <- outdf %>% ggplot(aes(Var1, Var2, color=y_mgp)) + geom_point() + scale_color_viridis_c() # plot latent process latent_plot <- outdf %>% ggplot(aes(Var1, Var2, color=w_mgp)) + geom_point() + scale_color_viridis_c() # estimation of regression coefficients plot(density(meshout$beta_mcmc[1,1,])) abline(v=B[1], col="red")
For a list of matrices , all of the same dimension, this function computes the matrix
with
entry
. This function does not run any check on the dimensions and uses OpenMP if available.
summary_list_mean(x, n_threads=1)
summary_list_mean(x, n_threads=1)
x |
A list of matrices of the same dimension |
n_threads |
integer number of OpenMP threads. This is ineffective if |
The matrix of mean values.
Michele Peruzzi [email protected]
# make some data into a list set.seed(2021) L <- 200 x <- lapply(1:L, function(i) matrix(runif(300), ncol=3)) mean_done <- summary_list_mean(x)
# make some data into a list set.seed(2021) L <- 200 x <- lapply(1:L, function(i) matrix(runif(300), ncol=3)) mean_done <- summary_list_mean(x)
For a list of matrices , all of the same dimension, this function computes the matrix
with
entry
quantile(
, q)
. This function does not run any check on the dimensions and uses OpenMP if available. This is only a convenience function that is supposed to speed up quantile computation for very large problems. The results may be slightly different from R's quantile
which should be used for small problems.
summary_list_q(x, q, n_threads=1)
summary_list_q(x, q, n_threads=1)
x |
A list of matrices of the same dimension. |
q |
A number between 0 and 1. |
n_threads |
integer number of OpenMP threads. This is ineffective if |
The matrix of quantiles.
Michele Peruzzi [email protected]
# make some data into a list set.seed(2021) L <- 200 x <- lapply(1:L, function(i) matrix(runif(300), ncol=3)) quant_done1 <- summary_list_q(x, .9)
# make some data into a list set.seed(2021) L <- 200 x <- lapply(1:L, function(i) matrix(runif(300), ncol=3)) quant_done1 <- summary_list_q(x, .9)