| Title: | Reduced Modeling for Tabular Data with Blockwise Missingness |
|---|---|
| Description: | Supervised learning on tabular data with blockwise missing patterns, using the Blockwise Reduced Modeling (BRM) method of Srinivasan, Currim, and Ram (2025) <doi:10.1287/ijds.2022.9016>. BRM partitions the training data into overlapping subsets based on per-row feature-missing patterns, fits one user-supplied learner per subset with minimal imputation, and at prediction time routes each test instance to the best-matching subset model. The interface is learner-agnostic: any fit-and-predict pair can be plugged in, and convenience specifications are provided for linear models, tree models, random forests, and gradient boosting. |
| Authors: | Karthik Srinivasan [aut, cre] (ORCID: <https://orcid.org/0000-0002-1608-6190>), Faiz Currim [aut], Sudha Ram [aut] |
| Maintainer: | Karthik Srinivasan <[email protected]> |
| License: | GPL-3 |
| Version: | 0.1.2 |
| Built: | 2026-06-24 10:28:58 UTC |
| Source: | https://github.com/cran/blockwise |
Census-based binary classification dataset: predict whether a person's annual income exceeds $50,000. Used as the classification demonstration in Srinivasan, Currim, and Ram (2025).
adult
A data.frame with roughly 32,561 rows including salary (the
0/1-valued response) and typical demographic/employment predictors.
UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/2/adult
Hourly count of rental bikes between 2011 and 2012 in the Capital Bikeshare system with the corresponding weather and seasonal information. Used as the regression demonstration in Srinivasan, Currim, and Ram (2025).
bike
A data.frame with roughly 17,380 rows and the following columns:
Temporal and weather covariates.
Numeric weather covariates.
Response (cnt): count of total rental bikes for that hour.
UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/275/bike+sharing+dataset
BRM partitions the training data into n_blocks subsets based on
per-row patterns of feature-missingness, fits one instance of the supplied
learner per subset using only the features observed in that subset,
and at prediction time routes each test row to the subset model whose
training-time missingness pattern is closest.
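The partition-and-fit idea described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's internal code: the toy data, the choice of three blocks, the Lloyd variant of k-means, and the use of lm are all picked for the example, and the overlap between subsets is ignored.

```r
# Illustrative sketch only; names and data here are hypothetical.
set.seed(1)
n <- 200
X <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))
y <- rnorm(n)
X[1:80, c("a", "b")] <- NA      # first group of rows lacks a and b
X[121:200, c("c", "d")] <- NA   # last group of rows lacks c and d

# Binary missingness-indicator matrix: 1 where a cell is NA.
M <- is.na(X) * 1

# Cluster rows by missingness pattern (3 blocks assumed for this toy data).
km <- kmeans(M, centers = 3, nstart = 5, algorithm = "Lloyd")
rows_by_block <- split(seq_len(n), km$cluster)

# Per block, keep only columns whose observed fraction exceeds the threshold.
low_threshold <- 0.05
block_features <- lapply(rows_by_block, function(rows) {
  density <- colMeans(!is.na(X[rows, , drop = FALSE]))
  names(density)[density > low_threshold]
})

# Fit one reduced model per block on that block's rows and columns.
models <- Map(function(rows, feats) {
  lm(y ~ ., data = cbind(X[rows, feats, drop = FALSE], y = y[rows]))
}, rows_by_block, block_features)
```

Each element of models uses only the predictors that are actually observed in its block, so no imputation is needed for this toy pattern.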
    brm(
      X,
      y,
      learner = learner_lm(),
      n_blocks = NULL,
      low_threshold = 0.05,
      n_restarts = 5L,
      overlap = TRUE
    )
X |
A data.frame of predictors. May contain |
y |
A numeric vector (regression) or a 0/1 numeric vector
(binary classification) of length |
learner |
A |
n_blocks |
Integer number of blocks; if |
low_threshold |
Column-density threshold for including a predictor in a block's model. Default 0.05. |
n_restarts |
k-means restarts for block assignment. Default 5. |
overlap |
If |
The learner interface is intentionally minimal: any fit /
predict pair can be plugged in via learner(), and
convenience specs are provided for common families
(learner_lm, learner_rpart, etc.).
An object of class "brm".
Srinivasan, K., Currim, F., Ram, S. (2025). A Reduced Modeling Approach for Making Predictions With Incomplete Data Having Blockwise Missing Patterns. INFORMS Journal on Data Science.
    data(bike, package = "blockwise")
    bike_miss <- simulate_blockwise_missing(
      bike,
      blocks = list(c("hum", "windspeed", "weekday"),
                    c("hr", "temp", "weathersit")),
      prop_missing = 0.3
    )
    X <- bike_miss[, setdiff(names(bike_miss), "cnt")]
    y <- bike_miss$cnt
    fit <- brm(X, y, learner = learner_lm())
    preds <- predict(fit, X)
Applies k-means to the binary missingness-indicator matrix for
k = 1, ..., k_max and records, for each k, the fraction of
still-missing cells after dropping columns that are sparse within each
cluster. The curve is monotone-decreasing; BRM picks the k at its elbow.
    choose_num_blocks(X, low_threshold = 0.05, n_restarts = 10, k_max = NULL)
X |
A data.frame of predictors; may contain |
low_threshold |
Fraction below which a column is considered absent in a candidate subset. Default 0.05. |
n_restarts |
Number of k-means restarts per k. Default 10. |
k_max |
Upper bound on k. Default |
Typically called internally by brm when n_blocks = NULL.
A list with n_blocks (the chosen k) and missing_curve
(numeric vector of length k_max).
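The missing-cell curve described above can be approximated in a few lines of base R. The sketch below is an assumption-laden illustration rather than the package's implementation: missing_curve_sketch is a hypothetical helper, the fraction is taken over all cells of the original matrix, and the elbow-selection rule is not reproduced.

```r
# Illustrative sketch only; missing_curve_sketch is a hypothetical helper.
set.seed(1)
n <- 200
X <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))
X[1:80, c("a", "b")] <- NA
X[121:200, c("c", "d")] <- NA

missing_curve_sketch <- function(X, k_max, low_threshold = 0.05,
                                 n_restarts = 10) {
  M <- is.na(X) * 1  # 1 where a cell is missing
  vapply(seq_len(k_max), function(k) {
    km <- kmeans(M, centers = k, nstart = n_restarts, algorithm = "Lloyd")
    still_missing <- 0
    for (rows in split(seq_len(nrow(M)), km$cluster)) {
      block <- M[rows, , drop = FALSE]
      # Keep columns whose observed fraction in this cluster exceeds the
      # threshold; sparser columns are dropped from the cluster.
      keep <- colMeans(block == 0) > low_threshold
      still_missing <- still_missing + sum(block[, keep, drop = FALSE])
    }
    # Assumption: the fraction is relative to all cells of the full matrix.
    still_missing / length(M)
  }, numeric(1))
}

# The toy data has three distinct missingness patterns, so k_max = 3.
curve <- missing_curve_sketch(X, k_max = 3)
```

On this toy data the curve falls as k grows, because each additional cluster lets more fully missing columns be dropped instead of counted.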
Srinivasan, K., Currim, F., Ram, S. (2025). A Reduced Modeling Approach for Making Predictions With Incomplete Data Having Blockwise Missing Patterns. INFORMS Journal on Data Science.
The King County house-sales dataset. Used as a regression demonstration in Srinivasan, Currim, and Ram (2025).
house
A data.frame with roughly 21,600 rows including price
(the response) and typical property covariates such as
bedrooms, bathrooms, sqft_living,
sqft_lot, grade, and yr_built.
Kaggle "House Sales in King County, USA" dataset.
BRM trains one model per overlapping subset. The learner interface makes
that choice user-controlled: supply a fit function that takes
(X, y) and returns a fitted model, and a predict function
that takes (model, X_new) and returns a numeric prediction vector
(or a positive-class probability for binary classification).
    learner(fit, predict, type = c("regression", "classification"))
    learner_lm()
    learner_glm_binomial()
    learner_rpart(method = "anova", ...)
    learner_ranger(...)
    learner_gbm(distribution = "gaussian", n.trees = 500, ...)
fit |
A function of the form |
predict |
A function of the form |
type |
Either |
method |
rpart split method; one of |
... |
Additional arguments passed to the underlying fitter. |
distribution |
gbm distribution (e.g. |
n.trees |
Number of trees. |
An object of class "brm_learner".
    my_learner <- learner(
      fit = function(X, y) lm(y ~ ., data = cbind(X, y = y)),
      predict = function(m, X_new) predict(m, newdata = X_new),
      type = "regression"
    )
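The example above covers regression. For binary classification, a fit/predict pair that returns positive-class probabilities might look like the following sketch. The names fit_logit and predict_logit are hypothetical, and the pair is exercised directly on simulated data so that it runs with base R alone.

```r
# Illustrative sketch only; fit_logit and predict_logit are hypothetical names.

# fit: takes (X, y) and returns a fitted model.
fit_logit <- function(X, y) {
  glm(y ~ ., data = cbind(X, y = y), family = binomial())
}

# predict: takes (model, X_new) and returns positive-class probabilities.
predict_logit <- function(model, X_new) {
  as.numeric(predict(model, newdata = X_new, type = "response"))
}

# Exercise the pair on simulated 0/1 data.
set.seed(42)
X <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
y <- rbinom(100, 1, plogis(X$x1))
model <- fit_logit(X, y)
p <- predict_logit(model, X)

# With the package loaded, the pair would be wrapped following the usage
# signature shown earlier:
# my_clf <- learner(fit = fit_logit, predict = predict_logit,
#                   type = "classification")
```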
Each row of newdata is routed to the block whose training-time
missingness center is closest (Euclidean) to the row's missingness pattern.
The corresponding block model then predicts on that row, using only that
block's feature columns. Any NAs remaining in those columns are
filled by simple mean/mode imputation against the training reference.
    ## S3 method for class 'brm'
    predict(object, newdata, ...)
object |
A fitted |
newdata |
A data.frame of predictors. May contain |
... |
Unused. |
A numeric vector of length nrow(newdata).
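The routing step described above (nearest missingness center by Euclidean distance, then mean imputation for any leftover NAs) can be sketched in base R as follows. The centers, per-block feature sets, and training means are made-up values for illustration; they do not come from a fitted "brm" object, and impute_row is a hypothetical helper.

```r
# Illustrative sketch only; all values below are invented for the example.

# Missingness centers for three blocks (1 = column typically missing).
centers <- rbind(
  block1 = c(a = 1, b = 1, c = 0, d = 0),
  block2 = c(0, 0, 0, 0),
  block3 = c(0, 0, 1, 1)
)
# Features each block's model uses, and training means for imputation.
block_features <- list(c("c", "d"), c("a", "b", "c", "d"), c("a", "b"))
train_means <- c(a = 0.0, b = 0.1, c = 1.0, d = 2.0)

newdata <- data.frame(
  a = c(NA, 0.5, 1.2),
  b = c(NA, 0.1, NA),
  c = c(1, 2, NA),
  d = c(3, 4, NA)
)

# Missingness pattern of each new row.
M_new <- is.na(newdata) * 1

# Squared Euclidean distance from every row to every center
# (rows of d2 = new rows, columns of d2 = blocks).
d2 <- sapply(seq_len(nrow(centers)), function(j) {
  rowSums(sweep(M_new, 2, centers[j, ])^2)
})

# Route each row to its closest block.
route <- max.col(-d2, ties.method = "first")

# Fill NAs that remain in the routed block's columns with training means.
impute_row <- function(row, feats) {
  x <- unlist(row[feats])
  x[is.na(x)] <- train_means[feats][is.na(x)]
  x
}
x3 <- impute_row(newdata[3, ], block_features[[route[3]]])
```

The third row is missing b, c, and d, so it is closest to the block that lacks c and d, and its remaining gap in b is filled from the training mean.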
Jointly masks groups of columns on randomly chosen rows, optionally adding light column-wise random-NA noise on top. Useful for benchmarking BRM on complete datasets; the default arguments reproduce the simulation design used in Srinivasan, Currim, and Ram (2025).
    simulate_blockwise_missing(
      data,
      blocks,
      prop_missing,
      noise = 0.05,
      seed = NULL
    )
data |
A data.frame. |
blocks |
A list of character vectors; each vector names the columns masked jointly for one block of rows. |
prop_missing |
Proportion of rows affected per block. Either a scalar
(applied to each block) or a numeric vector of length
|
noise |
Extra per-column random-NA rate applied on top, restricted to columns named in any block. Default 0.05. Set to 0 to disable. |
seed |
Optional integer base seed for reproducibility. If |
A data.frame of the same shape as data, with NAs
introduced in the specified pattern.
    df <- data.frame(a = 1:100, b = 1:100, c = 1:100, d = 1:100)
    simulate_blockwise_missing(df,
      blocks = list(c("a", "b"), c("c", "d")),
      prop_missing = 0.3,
      seed = 1234L)
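For intuition, the joint-masking step can be mimicked in base R. The helper below, mask_blocks_sketch, is a hypothetical stand-in and not the package function: it assumes rows are drawn independently for each block and it omits the extra per-column noise.

```r
# Illustrative sketch only; mask_blocks_sketch is a hypothetical helper.
mask_blocks_sketch <- function(data, blocks, prop_missing, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # Recycle a scalar proportion across blocks.
  prop_missing <- rep_len(prop_missing, length(blocks))
  n <- nrow(data)
  for (i in seq_along(blocks)) {
    # Assumption: rows are sampled independently for each block.
    rows <- sample.int(n, size = round(prop_missing[i] * n))
    # Mask all of the block's columns together on the chosen rows.
    data[rows, blocks[[i]]] <- NA
  }
  data
}

df <- data.frame(a = 1:100, b = 1:100, c = 1:100, d = 1:100)
out <- mask_blocks_sketch(df,
  blocks = list(c("a", "b"), c("c", "d")),
  prop_missing = 0.3,
  seed = 1234L)

# Columns in the same block are missing on exactly the same rows.
same_rows_ab <- identical(is.na(out$a), is.na(out$b))
```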