This vignette demonstrates Blockwise Reduced Modeling (BRM) on the
Capital Bikeshare dataset, a regression problem where
we predict hourly ride counts (cnt) from weather and
calendar features. Because bike is otherwise complete, we
induce a realistic blockwise missing pattern with
simulate_blockwise_missing() to show BRM in its
element.
The method, and this dataset’s role as a benchmark, are described in Srinivasan, Currim, and Ram (2025), A Reduced Modeling Approach for Making Predictions With Incomplete Data Having Blockwise Missing Patterns, INFORMS Journal on Data Science.
library(blockwise)
data(bike)
str(bike, list.len = 20)
#> 'data.frame': 17379 obs. of 9 variables:
#> $ season : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
#> $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ hr : int 0 1 2 3 4 5 6 7 8 9 ...
#> $ weekday : Factor w/ 7 levels "0","1","2","3",..: 7 7 7 7 7 7 7 7 7 7 ...
#> $ weathersit: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 1 1 1 1 ...
#> $ temp : num 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
#> $ hum : num 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
#> $ windspeed : num 0 0 0 0 0 0.0896 0 0 0 0 ...
#> $ cnt : int 16 40 32 13 1 1 2 3 8 14 ...We mask two groups of columns jointly — mimicking the situation where two independent data sources feeding your pipeline fail on different subsets of rows — plus a small per-column noise rate.
bike_miss <- simulate_blockwise_missing(
bike,
blocks = list(
c("windspeed", "hum", "weekday"),
c("hr", "temp", "weathersit")
),
prop_missing = 0.30,
noise = 0.05
)
round(colMeans(is.na(bike_miss)) * 100, 1) # percent missing per column
#> season mnth hr weekday weathersit temp hum
#> 0.0 0.0 33.5 33.4 33.4 33.5 33.5
#> windspeed cnt
#> 33.5 0.0brm() is learner-agnostic: pass any
learner() specification. Here we try a linear model and a
gradient-boosted tree ensemble. The number of blocks is chosen
automatically by the elbow heuristic
(choose_num_blocks).
set.seed(1234)
fit_lm <- brm(X_train, y_train, learner = learner_lm())
fit_lm
#> Blockwise Reduced Model (BRM)
#> blocks : 3
#> overlap : TRUE
#> learner type : regression
#> features : 8
#> cols / block : 5, 8, 5fit_gbm <- brm(
X_train, y_train,
learner = learner_gbm(distribution = "poisson", n.trees = 300),
n_blocks = fit_lm$n_blocks # reuse so models are comparable
)
fit_gbm
#> Blockwise Reduced Model (BRM)
#> blocks : 3
#> overlap : TRUE
#> learner type : regression
#> features : 8
#> cols / block : 5, 8, 5The conventional alternative is to drop rows that have any missing value and fit a single model on what remains. This wastes data and gets progressively worse as the missing rate grows.
complete_train <- na.omit(train)
fit_lw <- lm(cnt ~ ., data = complete_train)
# For a fair comparison we need to impute the test set's NAs somehow; use
# mean/mode from the complete training rows.
X_test_imp <- X_test
for (j in names(X_test_imp)) {
na_idx <- is.na(X_test_imp[[j]])
if (!any(na_idx)) next
ref <- complete_train[[j]]
if (is.factor(ref)) {
X_test_imp[[j]][na_idx] <- names(sort(table(ref), decreasing = TRUE))[1]
} else {
X_test_imp[[j]][na_idx] <- mean(ref, na.rm = TRUE)
}
}
pred_lw <- predict(fit_lw, newdata = X_test_imp)
cat("Listwise-deletion lm RMSE:", round(rmse(y_test, pred_lw), 2), "\n")
#> Listwise-deletion lm RMSE: 151.9
cat("Training rows used : BRM =", nrow(train), " listwise =",
nrow(complete_train), "\n")
#> Training rows used : BRM = 13034 listwise = 3320BRM keeps all training rows (splitting them into per-pattern
subsets); the listwise baseline throws away any row with at least one
NA.
If you use BRM in your work, please cite:
Srinivasan, K., Currim, F., and Ram, S. (2025). A Reduced Modeling Approach for Making Predictions With Incomplete Data Having Blockwise Missing Patterns. INFORMS Journal on Data Science.
Run citation("blockwise") to get a ready-to-paste BibTeX
entry.