--- title: "BRM on the bike dataset (regression)" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{BRM on the bike dataset (regression)} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4.5 ) set.seed(1234) ``` ## Overview This vignette demonstrates Blockwise Reduced Modeling (BRM) on the **Capital Bikeshare** dataset, a regression problem where we predict hourly ride counts (`cnt`) from weather and calendar features. Because `bike` is otherwise complete, we induce a realistic blockwise missing pattern with `simulate_blockwise_missing()` to show BRM in its element. The method, and this dataset's role as a benchmark, are described in Srinivasan, Currim, and Ram (2025), *A Reduced Modeling Approach for Making Predictions With Incomplete Data Having Blockwise Missing Patterns*, INFORMS Journal on Data Science. ```{r load} library(blockwise) data(bike) str(bike, list.len = 20) ``` ## Inducing a blockwise missing pattern We mask two groups of columns jointly — mimicking the situation where two independent data sources feeding your pipeline fail on different subsets of rows — plus a small per-column noise rate. ```{r mask} bike_miss <- simulate_blockwise_missing( bike, blocks = list( c("windspeed", "hum", "weekday"), c("hr", "temp", "weathersit") ), prop_missing = 0.30, noise = 0.05 ) round(colMeans(is.na(bike_miss)) * 100, 1) # percent missing per column ``` ## Train / test split ```{r split} set.seed(1234) idx <- sample(nrow(bike_miss), size = floor(0.75 * nrow(bike_miss))) train <- bike_miss[idx, ] test <- bike_miss[-idx, ] X_train <- train[, setdiff(names(train), "cnt")] y_train <- train$cnt X_test <- test[, setdiff(names(test), "cnt")] y_test <- test$cnt ``` ## Fit BRM `brm()` is learner-agnostic: pass any `learner()` specification. Here we try a linear model and a gradient-boosted tree ensemble. The number of blocks is chosen automatically by the elbow heuristic (`choose_num_blocks`). ```{r fit-lm} set.seed(1234) fit_lm <- brm(X_train, y_train, learner = learner_lm()) fit_lm ``` ```{r fit-gbm, eval = requireNamespace("gbm", quietly = TRUE)} fit_gbm <- brm( X_train, y_train, learner = learner_gbm(distribution = "poisson", n.trees = 300), n_blocks = fit_lm$n_blocks # reuse so models are comparable ) fit_gbm ``` ## Predict and score ```{r rmse} rmse <- function(y, yhat) sqrt(mean((y - yhat)^2)) pred_lm <- predict(fit_lm, X_test) cat("BRM (lm) RMSE:", round(rmse(y_test, pred_lm), 2), "\n") ``` ```{r rmse-gbm, eval = requireNamespace("gbm", quietly = TRUE)} pred_gbm <- predict(fit_gbm, X_test) cat("BRM (gbm) RMSE:", round(rmse(y_test, pred_gbm), 2), "\n") ``` ## Comparison to a listwise-deletion baseline The conventional alternative is to drop rows that have any missing value and fit a single model on what remains. This wastes data and gets progressively worse as the missing rate grows. ```{r listwise} complete_train <- na.omit(train) fit_lw <- lm(cnt ~ ., data = complete_train) # For a fair comparison we need to impute the test set's NAs somehow; use # mean/mode from the complete training rows. X_test_imp <- X_test for (j in names(X_test_imp)) { na_idx <- is.na(X_test_imp[[j]]) if (!any(na_idx)) next ref <- complete_train[[j]] if (is.factor(ref)) { X_test_imp[[j]][na_idx] <- names(sort(table(ref), decreasing = TRUE))[1] } else { X_test_imp[[j]][na_idx] <- mean(ref, na.rm = TRUE) } } pred_lw <- predict(fit_lw, newdata = X_test_imp) cat("Listwise-deletion lm RMSE:", round(rmse(y_test, pred_lw), 2), "\n") cat("Training rows used : BRM =", nrow(train), " listwise =", nrow(complete_train), "\n") ``` BRM keeps all training rows (splitting them into per-pattern subsets); the listwise baseline throws away any row with at least one `NA`. ## Citation If you use BRM in your work, please cite: > Srinivasan, K., Currim, F., and Ram, S. (2025). *A Reduced Modeling > Approach for Making Predictions With Incomplete Data Having Blockwise > Missing Patterns.* INFORMS Journal on Data Science. Run `citation("blockwise")` to get a ready-to-paste BibTeX entry.