Getting started with fbrglm

What fbrglm is for

fbrglm is a formula-based front-end for regularized generalized linear models. Internally it delegates the fit to glmnet; the wrapper’s job is to make the user-facing experience look like base R’s glm() — a formula + data.frame, automatic factor handling, complete-case filtering, and the familiar S3 methods (print, summary, coef, predict, nobs, plot).

The MVP described here is infer = "none": it returns regularized point estimates and does not report classical standard errors, z values, p values, or confidence intervals. Honest post-selection inference (via data splitting or selective inference) is on the roadmap; see the package TODO.md.

library(fbrglm)

A small binomial example

n <- 150
dat <- data.frame(
    y  = rbinom(n, 1, 0.5),
    x1 = rnorm(n),
    x2 = rnorm(n),
    x3 = rnorm(n)
)

fit <- fbrglm(y ~ x1 + x2 + x3, data = dat,
              family = "binomial",
              lambda = "cv_min")

print() shows the call and the basics of the fit:

print(fit)
#> Formula-based regularized GLM (fbrglm)
#> 
#> Call:
#> fbrglm(formula = y ~ x1 + x2 + x3, data = dat, family = "binomial", 
#>     lambda = "cv_min")
#> 
#> Family:        binomial
#> Alpha:         1
#> Lambda rule:   cv_min
#> Lambda value:  0.05370831
#> Inference:     none
#> Observations:  total = 150, dropped = 0, used = 150
#> Non-zero terms: 0 / 3

summary() returns a structured object that includes the call, family, chosen λ, complete-case bookkeeping, and the (regularized) coefficient vector with zeros included:

summary(fit)
#> fbrglm summary
#> ==============
#> 
#> Call:
#> fbrglm(formula = y ~ x1 + x2 + x3, data = dat, family = "binomial", 
#>     lambda = "cv_min")
#> 
#> Family:        binomial
#> Lambda method: cv_min
#> Lambda value:  0.05370831
#> Inference:     none
#> 
#> Observations:
#>   total = 150, dropped (missing) = 0, used = 150
#> 
#> Coefficients:
#>             Estimate
#> (Intercept)    -0.08
#> x1              0.00
#> x2              0.00
#> x3              0.00
#> 
#> Non-zero predictors (0): (none)
#> 
#> Note: no standard errors, z-values, or p-values are reported under
#> infer = "none". Classical inference does not account for the
#> shrinkage bias from L1/L2 penalisation or for the data-driven
#> selection of lambda. Use infer = "split" / "selective" (planned)
#> for valid post-selection inference; see vignette("fbrglm").

Coefficients and predictions follow the same shapes you’d expect from glm():

coef(fit)
#> (Intercept)          x1          x2          x3 
#> -0.08004271  0.00000000  0.00000000  0.00000000

head(predict(fit, newdata = dat[1:5, ], type = "response"))
#> [1] 0.48 0.48 0.48 0.48 0.48

A plot() method is registered; it delegates to plot.cv.glmnet() when λ was chosen by cross-validation, and to plot.glmnet() otherwise.

plot(fit)

Choosing lambda

There are three rules, exposed through a single argument:

fit_min <- fbrglm(y ~ x1 + x2 + x3, data = dat,
                  family = "binomial", lambda = "cv_min")
fit_1se <- fbrglm(y ~ x1 + x2 + x3, data = dat,
                  family = "binomial", lambda = "cv_1se")
fit_fix <- fbrglm(y ~ x1 + x2 + x3, data = dat,
                  family = "binomial",
                  lambda = "fix", lambda_value = 0.05)

c(cv_min = fit_min$lambda_value,
  cv_1se = fit_1se$lambda_value,
  fix    = fit_fix$lambda_value)
#>     cv_min     cv_1se        fix 
#> 0.05370831 0.05370831 0.05000000

"cv_min" and "cv_1se" go through cv.glmnet(); "fix" skips CV and goes straight to glmnet() at the supplied lambda_value. The numeric λ actually used is always available at fit$lambda_value.

Factor predictors

Factor columns are auto-dummied via model.matrix(), and the training factor levels are stored on the fit object so predict(newdata = ...) can rebuild a design matrix that matches the training column structure — even when some training levels are missing from newdata.

n_train <- 200
train <- data.frame(
    y  = rbinom(n_train, 1, 0.5),
    x1 = rnorm(n_train),
    g  = factor(sample(c("A", "B", "C"), n_train, replace = TRUE),
                levels = c("A", "B", "C"))
)
fit_f <- fbrglm(y ~ x1 + g, data = train,
                family = "binomial",
                lambda = "fix", lambda_value = 0.05)

## newdata is missing level "C"
test <- data.frame(
    x1 = rnorm(10),
    g  = factor(rep(c("A", "B"), 5), levels = c("A", "B", "C"))
)
head(predict(fit_f, newdata = test, type = "response"))
#> [1] 0.56 0.56 0.56 0.56 0.56 0.56

fbrglm also tolerates the narrower case where newdata’s factor has its levels narrowed (not just its values): missing one-hot columns are padded with zeros before being handed to glmnet.

Missing values

fbrglm() drops rows with any NA from the modelling frame, prints a one-line note, and records the counts on the fit object under fit$nobs_info.

dat_na <- dat
dat_na$y[1:5] <- NA
fit_na <- fbrglm(y ~ x1 + x2 + x3, data = dat_na,
                 family = "binomial",
                 lambda = "fix", lambda_value = 0.05)
#> fbrglm: dropped 5 row(s) with missing values (145 / 150 kept).

fit_na$nobs_info
#> $n_total
#> [1] 150
#> 
#> $n_dropped_missing
#> [1] 5
#> 
#> $n_used
#> [1] 145
nobs(fit_na)
#> [1] 145

Offsets

offset at fit time goes through to glmnet(); at predict time, pass newoffset of matching length. With newdata = NULL the stored training offset is reused; with newdata supplied, an explicit newoffset is required.

n_off <- 80
dat_off <- data.frame(
    y  = rbinom(n_off, 1, 0.5),
    x1 = rnorm(n_off),
    x2 = rnorm(n_off)
)
fit_off <- fbrglm(y ~ x1 + x2, data = dat_off, family = "binomial",
                  offset = rep(0.2, n_off),
                  lambda = "fix", lambda_value = 0.05)

head(predict(fit_off, type = "response"))                  # reuses training offset
#> [1] 0.5625 0.5625 0.5625 0.5625 0.5625 0.5625
head(predict(fit_off, newdata = dat_off[1:5, ],
             newoffset = rep(0.2, 5), type = "response"))
#> [1] 0.5625 0.5625 0.5625 0.5625 0.5625

Reaching the underlying glmnet objects

If you need to use a glmnet-specific tool, two accessors get you out of the wrapper:

class(as_glmnet(fit_min))
#> [1] "lognet" "glmnet"
class(as_cv_glmnet(fit_min))
#> [1] "cv.glmnet"

class(as_glmnet(fit_fix))
#> [1] "lognet" "glmnet"
as_cv_glmnet(fit_fix)        # NULL — no CV was run
#> NULL

as_glmnet() returns the underlying glmnet object (the $glmnet.fit slot when the wrapper used CV). as_cv_glmnet() returns the cv.glmnet object, or NULL for the "fix" λ path.

Limitations (intentional)

The MVP is deliberately narrow:

  • only infer = "none" is implemented; "split" and "selective" are planned but not in this release.
  • families: gaussian, binomial, poisson only. multinomial and cox will land later.
  • the x / y direct-matrix entry point is reserved but not yet supported — supply formula + data instead.
  • classical glm()-style standard errors, z / p values, and confidence intervals are intentionally not shown for infer = "none". Doing so naively for regularized estimators would be misleading; honest inference is the next milestone.

Reproducible benchmarks against raw glmnet, glmnetUtils, and a parsnip / workflows pipeline with the glmnet engine live in a separate repository: https://github.com/dsc-chiba-u/fbrglm-experiments. In the prediction-failure case (narrowed test factor levels), raw glmnet built naively can fail; parsnip / workflows succeeds but with higher runtime overhead than fbrglm in the tested small-data setting. See the experiments repo for the CSVs and figures.