BRM on the adult dataset (binary classification)

Overview

The UCI Adult dataset is a binary classification task: predict whether an adult’s income exceeds $50K/year from census features. We use it here to show BRM with a probabilistic learner (learner_glm_binomial()) on data with simulated blockwise missingness.

library(blockwise)
data(adult)

# Drop `native.country`: it has ~41 levels with a long tail of rare
# countries, which can leave a per-block logistic-regression model
# without an example of a level that later appears at predict time.
# Tree-based learners (`learner_rpart`, `learner_gbm`) tolerate this;
# `learner_glm_binomial` does not. Either drop the column or coarsen
# its levels before fitting.
adult <- adult[, setdiff(names(adult), "native.country")]

str(adult, list.len = 20)
#> 'data.frame':    32561 obs. of  11 variables:
#>  $ age           : int  49 44 38 38 42 20 49 37 46 36 ...
#>  $ workclass     : Factor w/ 9 levels " ?"," Federal-gov",..: 5 5 5 6 7 5 5 5 5 6 ...
#>  $ education     : Factor w/ 16 levels " 10th"," 11th",..: 8 13 12 15 6 12 16 2 12 12 ...
#>  $ education.num : num  12 14 NA 15 NA 9 10 7 9 NA ...
#>  $ marital.status: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 3 1 1 3 3 5 1 3 3 3 ...
#>  $ occupation    : Factor w/ 16 levels ""," ?"," Adm-clerical",..: 1 6 1 12 10 8 1 1 5 1 ...
#>  $ relationship  : Factor w/ 6 levels " Husband"," Not-in-family",..: 6 2 5 1 6 4 3 1 1 1 ...
#>  $ race          : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 3 2 3 5 5 5 5 5 ...
#>  $ sex           : Factor w/ 2 levels " Female"," Male": 1 2 1 2 1 2 2 2 2 2 ...
#>  $ hours.per.week: int  40 45 32 40 50 15 35 40 40 50 ...
#>  $ salary        : int  1 1 0 1 0 0 0 0 1 1 ...
table(adult$salary)
#> 
#>     0     1 
#> 24720  7841

Inducing blockwise missingness

Mirroring the design in the paper, we jointly mask two column groups — a “demographics” block and a “work history” block — plus a small column-wise noise rate.

bike_style_groups <- list(
  c("age", "workclass", "education"),
  c("marital.status", "occupation", "relationship")
)

adult_miss <- simulate_blockwise_missing(
  adult,
  blocks       = bike_style_groups,
  prop_missing = 0.30,
  noise        = 0.02
)
round(colMeans(is.na(adult_miss)) * 100, 1)
#>            age      workclass      education  education.num marital.status 
#>           31.4           31.4           31.4            1.5           31.4 
#>     occupation   relationship           race            sex hours.per.week 
#>           31.4           31.4            0.0            0.0            0.0 
#>         salary 
#>            0.0

Train / test split

set.seed(1234)
idx <- sample(nrow(adult_miss), floor(0.75 * nrow(adult_miss)))
train <- adult_miss[idx, ]
test  <- adult_miss[-idx, ]

X_train <- train[, setdiff(names(train), "salary")]
y_train <- train$salary
X_test  <- test[,  setdiff(names(test),  "salary")]
y_test  <- test$salary

Fit BRM with a logistic-regression learner

set.seed(1234)
fit <- brm(X_train, y_train,
           learner = learner_glm_binomial())
fit
#> Blockwise Reduced Model (BRM)
#>   blocks        : 4 
#>   overlap       : TRUE 
#>   learner type  : classification 
#>   features      : 10 
#>   cols / block  : 4, 10, 7, 7

Evaluate

prob <- predict(fit, X_test)
pred_class <- as.integer(prob >= 0.5)

acc <- mean(pred_class == y_test)
cat("Accuracy:", round(acc, 3), "\n")
#> Accuracy: 0.828

# Confusion matrix
table(truth = y_test, predicted = pred_class)
#>      predicted
#> truth    0    1
#>     0 5695  460
#>     1  939 1047

For tree-based classification, swap in learner_rpart(method = "class") or learner_gbm(distribution = "bernoulli").

Citation

Srinivasan, K., Currim, F., and Ram, S. (2025). A Reduced Modeling Approach for Making Predictions With Incomplete Data Having Blockwise Missing Patterns. INFORMS Journal on Data Science.