The UCI Adult dataset is a binary classification
task: predict whether an adult’s income exceeds $50K/year from census
features. We use it here to show BRM with a probabilistic learner
(learner_glm_binomial()) on data with simulated blockwise
missingness.
library(blockwise)
data(adult)
# Drop `native.country`: it has ~41 levels with a long tail of rare
# countries, which can leave a per-block logistic-regression model
# without an example of a level that later appears at predict time.
# Tree-based learners (`learner_rpart`, `learner_gbm`) tolerate this;
# `learner_glm_binomial` does not. Either drop the column or coarsen
# its levels before fitting.
adult <- adult[, setdiff(names(adult), "native.country")]
str(adult, list.len = 20)
#> 'data.frame': 32561 obs. of 11 variables:
#> $ age : int 49 44 38 38 42 20 49 37 46 36 ...
#> $ workclass : Factor w/ 9 levels " ?"," Federal-gov",..: 5 5 5 6 7 5 5 5 5 6 ...
#> $ education : Factor w/ 16 levels " 10th"," 11th",..: 8 13 12 15 6 12 16 2 12 12 ...
#> $ education.num : num 12 14 NA 15 NA 9 10 7 9 NA ...
#> $ marital.status: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 3 1 1 3 3 5 1 3 3 3 ...
#> $ occupation : Factor w/ 16 levels ""," ?"," Adm-clerical",..: 1 6 1 12 10 8 1 1 5 1 ...
#> $ relationship : Factor w/ 6 levels " Husband"," Not-in-family",..: 6 2 5 1 6 4 3 1 1 1 ...
#> $ race : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 3 2 3 5 5 5 5 5 ...
#> $ sex : Factor w/ 2 levels " Female"," Male": 1 2 1 2 1 2 2 2 2 2 ...
#> $ hours.per.week: int 40 45 32 40 50 15 35 40 40 50 ...
#> $ salary : int 1 1 0 1 0 0 0 0 1 1 ...
table(adult$salary)
#>
#> 0 1
#> 24720 7841Mirroring the design in the paper, we jointly mask two column groups — a “demographics” block and a “work history” block — plus a small column-wise noise rate.
bike_style_groups <- list(
c("age", "workclass", "education"),
c("marital.status", "occupation", "relationship")
)
adult_miss <- simulate_blockwise_missing(
adult,
blocks = bike_style_groups,
prop_missing = 0.30,
noise = 0.02
)
round(colMeans(is.na(adult_miss)) * 100, 1)
#> age workclass education education.num marital.status
#> 31.4 31.4 31.4 1.5 31.4
#> occupation relationship race sex hours.per.week
#> 31.4 31.4 0.0 0.0 0.0
#> salary
#> 0.0prob <- predict(fit, X_test)
pred_class <- as.integer(prob >= 0.5)
acc <- mean(pred_class == y_test)
cat("Accuracy:", round(acc, 3), "\n")
#> Accuracy: 0.828
# Confusion matrix
table(truth = y_test, predicted = pred_class)
#> predicted
#> truth 0 1
#> 0 5695 460
#> 1 939 1047For tree-based classification, swap in
learner_rpart(method = "class") or
learner_gbm(distribution = "bernoulli").
Srinivasan, K., Currim, F., and Ram, S. (2025). A Reduced Modeling Approach for Making Predictions With Incomplete Data Having Blockwise Missing Patterns. INFORMS Journal on Data Science.