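llmimpute provides missing data imputation through two complementary engines: an LLM engine that requires an Anthropic API key, and an offline engine with 19 built-in methods that makes no API calls.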
The package automatically selects the appropriate engine at runtime.
library(llmimpute)
# Example dataset with missing values
df <- data.frame(
  age = c(45L, NA, 38L, 62L, 29L),
  bp = c(130, 140, 120, 155, NA),
  smoker = c("No", "Yes", "No", NA, "No"),
  stringsAsFactors = FALSE
)
# 1. Diagnose missingness (no API call)
lmi_diagnose(df)
# 2. Impute — offline fallback used automatically when no API key is set
result <- lmi_impute(df)
# 3. Access results
result$data # imputed data frame
result$imputations # audit trail with confidence scores and reasoning
summary(result) # per-column statistics
# 4. Export to disk
lmi_export(result, path = tempdir(), prefix = "my_study")

When operating offline, lmi_impute() delegates to
lmi_impute_offline(). You can also call it directly.
# List all 19 available offline methods
lmi_methods()
# Use a specific method
result_rf <- lmi_impute(df, offline = TRUE, offline_method = "random_forest")
result_si <- lmi_impute(df, offline = TRUE, offline_method = "softimpute")
result_br <- lmi_impute(df, offline = TRUE, offline_method = "bayesian_ridge")
# Let the package choose per column (default)
result_auto <- lmi_impute(df)

The "auto" selector chooses the best algorithm per
column based on:
LLM mode requires a valid Anthropic API key obtained from
https://console.anthropic.com. Store the key in
.Renviron:
ANTHROPIC_API_KEY=sk-ant-api03-...
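Before relying on LLM mode, it can help to confirm that the key is actually visible to the current R session. The following is a minimal base R sketch, not part of llmimpute; the variable names `key` and `has_key` are illustrative only.

```r
# Read the key from the environment; "" means it is not set.
key <- Sys.getenv("ANTHROPIC_API_KEY", unset = "")
has_key <- nzchar(key)

if (has_key) {
  message("API key found: LLM mode is available.")
} else {
  message("No API key found: lmi_impute() will use the offline engine.")
}
```

R reads .Renviron at startup, so after editing the file either restart the session or call readRenviron("~/.Renviron").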
library(llmimpute)
# Set key for this session (reads ANTHROPIC_API_KEY from environment)
lmi_set_api_key()
# Impute with domain context
result <- lmi_impute(df, domain = "healthcare")
# Flag anomalous existing values in addition to imputing
result2 <- lmi_impute(df, domain = "healthcare", flag_suspicious = TRUE)
result2$suspicious # data.frame of flagged cells

The domain argument guides the LLM’s reasoning:
| Value | Use when |
|---|---|
| "general" | Mixed or unknown data |
| "healthcare" | Medical records, clinical data |
| "financial" | Economic indicators, transactions |
| "hr" | Employee records, HR data |
| "survey" | Questionnaire or Likert-scale data |
| "scientific" | Lab measurements, research data |
Every imputed cell is recorded in
result$imputations:
head(result$imputations)
#   row col original imputed confidence reasoning
# 1   2 age       NA      45         72 knn ...
# 2   5  bp       NA     130         68 mean ...

The confidence column (0–100) reflects how reliably the
method can estimate the missing value given the available data. Filter
low-confidence imputations before downstream modelling:
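One way to do this filtering is sketched below in base R. To keep the example self-contained, it uses a mock audit trail with the same columns as the output shown above; in practice you would use result$imputations and result$data instead. The cutoff of 70 and the names `threshold`, `low_conf`, and `imputed_df` are assumptions for illustration, not package defaults.

```r
# Mock audit trail mirroring the columns of result$imputations.
imputations <- data.frame(
  row = c(2L, 5L),
  col = c("age", "bp"),
  original = c(NA, NA),
  imputed = c(45, 130),
  confidence = c(72, 68),
  reasoning = c("knn ...", "mean ..."),
  stringsAsFactors = FALSE
)

# Mock imputed data standing in for result$data.
imputed_df <- data.frame(
  age = c(45, 45, 38, 62, 29),
  bp = c(130, 140, 120, 155, 130)
)

threshold <- 70  # hypothetical cutoff; choose one that suits your analysis

# Imputations whose confidence falls below the cutoff.
low_conf <- imputations[imputations$confidence < threshold, ]

# Reset those cells to NA so they are not treated as observed values.
for (i in seq_len(nrow(low_conf))) {
  imputed_df[low_conf$row[i], low_conf$col[i]] <- NA
}
```

Resetting low-confidence cells to NA is only one option; you might instead keep them and flag the affected rows for a sensitivity analysis.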
For data frames with more than 50 rows, lmi_impute()
automatically chunks the data for LLM calls. Adjust
max_rows to balance API rate limits against context
quality:
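To get a feel for how max_rows affects the number of LLM calls, the chunking idea can be sketched in base R. This is an illustration only, not the package's implementation: `chunk_rows` is a hypothetical helper, and the default of 50 simply mirrors the row threshold mentioned above.

```r
# Split row indices 1..n_rows into consecutive chunks of at most max_rows.
chunk_rows <- function(n_rows, max_rows = 50L) {
  idx <- seq_len(n_rows)
  split(idx, ceiling(idx / max_rows))
}

# A 120-row data frame with max_rows = 50 yields three chunks.
chunks <- chunk_rows(120L, max_rows = 50L)
chunk_sizes <- unname(lengths(chunks))  # 50 50 20

# Halving max_rows roughly doubles the number of chunks (and API calls).
n_small <- length(chunk_rows(120L, max_rows = 25L))  # 5
```

Smaller chunks mean more requests but less data per prompt; larger chunks give the model more context per call at the cost of bigger requests.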
The offline engine processes all rows in a single pass without chunking.
- Run lmi_diagnose() first: it is free, instant, and shows the full missing map.
- Set a seed for reproducibility.
- Save the lmi_result object with lmi_export(..., format = "rds") to preserve the full audit trail alongside the imputed data.
- "softimpute" or "random_forest" offer excellent accuracy without API costs.