BRM on the bike dataset (regression)

Overview

This vignette demonstrates Blockwise Reduced Modeling (BRM) on the Capital Bikeshare dataset, a regression problem where we predict hourly ride counts (cnt) from weather and calendar features. Because bike is otherwise complete, we induce a realistic blockwise missing pattern with simulate_blockwise_missing() to show BRM in its element.

The method, and this dataset’s role as a benchmark, are described in Srinivasan, Currim, and Ram (2025), A Reduced Modeling Approach for Making Predictions With Incomplete Data Having Blockwise Missing Patterns, INFORMS Journal on Data Science.

library(blockwise)
data(bike)
str(bike, list.len = 20)
#> 'data.frame':    17379 obs. of  9 variables:
#>  $ season    : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
#>  $ weekday   : Factor w/ 7 levels "0","1","2","3",..: 7 7 7 7 7 7 7 7 7 7 ...
#>  $ weathersit: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 1 1 1 1 ...
#>  $ temp      : num  0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
#>  $ hum       : num  0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
#>  $ windspeed : num  0 0 0 0 0 0.0896 0 0 0 0 ...
#>  $ cnt       : int  16 40 32 13 1 1 2 3 8 14 ...

Inducing a blockwise missing pattern

We mask two groups of columns jointly — mimicking the situation where two independent data sources feeding your pipeline fail on different subsets of rows — plus a small per-column noise rate.

bike_miss <- simulate_blockwise_missing(
  bike,
  blocks = list(
    c("windspeed", "hum", "weekday"),
    c("hr", "temp", "weathersit")
  ),
  prop_missing = 0.30,
  noise        = 0.05
)

round(colMeans(is.na(bike_miss)) * 100, 1)  # percent missing per column
#>     season       mnth         hr    weekday weathersit       temp        hum 
#>        0.0        0.0       33.5       33.4       33.4       33.5       33.5 
#>  windspeed        cnt 
#>       33.5        0.0

Train / test split

set.seed(1234)
idx <- sample(nrow(bike_miss), size = floor(0.75 * nrow(bike_miss)))
train <- bike_miss[idx, ]
test  <- bike_miss[-idx, ]

X_train <- train[, setdiff(names(train), "cnt")]
y_train <- train$cnt

X_test  <- test[, setdiff(names(test), "cnt")]
y_test  <- test$cnt

Fit BRM

brm() is learner-agnostic: pass any learner() specification. Here we try a linear model and a gradient-boosted tree ensemble. The number of blocks is chosen automatically by the elbow heuristic (choose_num_blocks).

set.seed(1234)
fit_lm <- brm(X_train, y_train, learner = learner_lm())
fit_lm
#> Blockwise Reduced Model (BRM)
#>   blocks        : 3 
#>   overlap       : TRUE 
#>   learner type  : regression 
#>   features      : 8 
#>   cols / block  : 5, 8, 5

fit_gbm <- brm(
  X_train, y_train,
  learner  = learner_gbm(distribution = "poisson", n.trees = 300),
  n_blocks = fit_lm$n_blocks     # reuse so models are comparable
)
fit_gbm
#> Blockwise Reduced Model (BRM)
#>   blocks        : 3 
#>   overlap       : TRUE 
#>   learner type  : regression 
#>   features      : 8 
#>   cols / block  : 5, 8, 5

Predict and score

rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))

pred_lm <- predict(fit_lm, X_test)
cat("BRM (lm)   RMSE:", round(rmse(y_test, pred_lm), 2), "\n")
#> BRM (lm)   RMSE: 149.6

pred_gbm <- predict(fit_gbm, X_test)
cat("BRM (gbm)  RMSE:", round(rmse(y_test, pred_gbm), 2), "\n")
#> BRM (gbm)  RMSE: 118.83

Comparison to a listwise-deletion baseline

The conventional alternative is to drop rows that have any missing value and fit a single model on what remains. This wastes data and gets progressively worse as the missing rate grows.

complete_train <- na.omit(train)
fit_lw <- lm(cnt ~ ., data = complete_train)

# For a fair comparison we need to impute the test set's NAs somehow; use
# mean/mode from the complete training rows.
X_test_imp <- X_test
for (j in names(X_test_imp)) {
  na_idx <- is.na(X_test_imp[[j]])
  if (!any(na_idx)) next
  ref <- complete_train[[j]]
  if (is.factor(ref)) {
    X_test_imp[[j]][na_idx] <- names(sort(table(ref), decreasing = TRUE))[1]
  } else {
    X_test_imp[[j]][na_idx] <- mean(ref, na.rm = TRUE)
  }
}
pred_lw <- predict(fit_lw, newdata = X_test_imp)

cat("Listwise-deletion lm  RMSE:", round(rmse(y_test, pred_lw), 2), "\n")
#> Listwise-deletion lm  RMSE: 151.9
cat("Training rows used    : BRM =", nrow(train), " listwise =",
    nrow(complete_train), "\n")
#> Training rows used    : BRM = 13034  listwise = 3320

BRM keeps all training rows (splitting them into per-pattern subsets); the listwise baseline throws away any row with at least one NA.

Citation

If you use BRM in your work, please cite:

Srinivasan, K., Currim, F., and Ram, S. (2025). A Reduced Modeling Approach for Making Predictions With Incomplete Data Having Blockwise Missing Patterns. INFORMS Journal on Data Science.

Run citation("blockwise") to get a ready-to-paste BibTeX entry.