Getting started with llmimpute

Overview

llmimpute provides missing data imputation through two complementary engines:

A Large Language Model (LLM) engine that uses the Anthropic Claude API for context-aware, semantically-informed imputation.
A fully offline statistical engine implementing nineteen algorithms entirely in base R — no internet connection or API key required.

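The package selects the engine at runtime: it falls back to the offline engine automatically when no API key is set, and you can force offline mode with offline = TRUE.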

Installation

# Install from CRAN
install.packages("llmimpute")

Quick start

library(llmimpute)

# Example dataset with missing values
df <- data.frame(
  age    = c(45L, NA, 38L, 62L, 29L),
  bp     = c(130, 140, 120, 155, NA),
  smoker = c("No", "Yes", "No", NA, "No"),
  stringsAsFactors = FALSE
)

# 1. Diagnose missingness (no API call)
lmi_diagnose(df)

# 2. Impute — offline fallback used automatically when no API key is set
result <- lmi_impute(df)

# 3. Access results
result$data          # imputed data frame
result$imputations   # audit trail with confidence scores and reasoning
summary(result)      # per-column statistics

# 4. Export to disk
lmi_export(result, path = tempdir(), prefix = "my_study")

Offline imputation: choosing a method

When operating offline, lmi_impute() delegates to lmi_impute_offline(). You can also call it directly.

# List all 19 available offline methods
lmi_methods()

# Use a specific method
result_rf  <- lmi_impute(df, offline = TRUE, offline_method = "random_forest")
result_si  <- lmi_impute(df, offline = TRUE, offline_method = "softimpute")
result_br  <- lmi_impute(df, offline = TRUE, offline_method = "bayesian_ridge")

# Let the package choose per column (the default "auto" selector)
result_auto <- lmi_impute(df, offline = TRUE)

The "auto" selector chooses the best algorithm per column based on:

Data type (numeric vs. categorical)
Skewness (median instead of mean for |skew| > 1)
Missingness rate (k-NN for > 40 % missing)
Available correlated predictors (ridge or PMM when strong correlations exist)
Sample size (random forest when n >= 20 and >= 2 complete predictors)
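
The criteria above can be sketched as a small decision function in base R. This is only an illustration under stated assumptions: pick_method() and sample_skewness() are hypothetical helpers, not part of llmimpute. The order in which the rules are checked, the 0.7 correlation cutoff, and the labels "mode", "ridge" and "median" are guesses made for the example. The package's actual selector may weigh these criteria differently.

```r
# Hypothetical sketch of a per-column method selector (not llmimpute's code).

# Moment-based sample skewness; returns 0 when it cannot be estimated.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) < 3 || sd(x) == 0) return(0)
  mean((x - mean(x))^3) / sd(x)^3
}

pick_method <- function(col, data, cor_cutoff = 0.7) {
  x <- data[[col]]

  # Data type: non-numeric columns get a categorical method.
  if (!is.numeric(x)) return("mode")

  # Missingness rate: more than 40 % missing favours k-NN.
  if (mean(is.na(x)) > 0.40) return("knn")

  # Other numeric columns are the candidate predictors.
  others <- setdiff(names(data), col)
  numeric_others <- others[vapply(others, function(nm) {
    is.numeric(data[[nm]])
  }, logical(1))]
  complete_others <- numeric_others[vapply(numeric_others, function(nm) {
    !anyNA(data[[nm]])
  }, logical(1))]

  # Sample size: random forest needs n >= 20 and >= 2 complete predictors.
  if (nrow(data) >= 20 && length(complete_others) >= 2) {
    return("random_forest")
  }

  # Correlated predictors: a strong correlation favours a regression method.
  # The cutoff (cor_cutoff) is an assumption made for this sketch.
  cors <- vapply(numeric_others, function(nm) {
    r <- suppressWarnings(cor(x, data[[nm]], use = "pairwise.complete.obs"))
    if (is.na(r)) 0 else abs(r)
  }, numeric(1))
  if (length(cors) > 0 && max(cors) >= cor_cutoff) return("ridge")

  # Skewness: median for |skew| > 1, mean otherwise.
  if (abs(sample_skewness(x)) > 1) "median" else "mean"
}

# Usage on the quick-start data frame
df <- data.frame(
  age    = c(45L, NA, 38L, 62L, 29L),
  bp     = c(130, 140, 120, 155, NA),
  smoker = c("No", "Yes", "No", NA, "No"),
  stringsAsFactors = FALSE
)
vapply(names(df), pick_method, character(1), data = df)
```

Checking the cheap, structural rules first (type, missingness rate) keeps the sketch easy to follow. To see what the package itself chose, consult the reasoning column of the audit trail.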

LLM-mode imputation

LLM mode requires a valid Anthropic API key obtained from https://console.anthropic.com. Store the key in .Renviron:

ANTHROPIC_API_KEY=sk-ant-api03-...
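
After editing .Renviron and restarting R, it can be worth confirming that the session actually sees the variable before relying on LLM mode. The check below uses only base R. has_api_key() is a hypothetical helper written for this guide, not a function exported by llmimpute.

```r
# Hypothetical helper (not part of llmimpute):
# TRUE when the environment variable is set and non-empty.
has_api_key <- function(var = "ANTHROPIC_API_KEY") {
  nzchar(Sys.getenv(var, unset = ""))
}

if (has_api_key()) {
  message("API key found: LLM mode is available.")
} else {
  message("No API key found: expect the offline engine to be used.")
}
```

The helper never prints the key itself, so it is safe to run in shared logs or notebooks.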

library(llmimpute)

# Set key for this session (reads ANTHROPIC_API_KEY from environment)
lmi_set_api_key()

# Impute with domain context
result <- lmi_impute(df, domain = "healthcare")

# Flag anomalous existing values in addition to imputing
result2 <- lmi_impute(df, domain = "healthcare", flag_suspicious = TRUE)
result2$suspicious   # data.frame of flagged cells

Domain-specific imputation

The domain argument guides the LLM’s reasoning:

Value	Use when
`"general"`	Mixed or unknown data
`"healthcare"`	Medical records, clinical data
`"financial"`	Economic indicators, transactions
`"hr"`	Employee records, HR data
`"survey"`	Questionnaire or Likert-scale data
`"scientific"`	Lab measurements, research data

Choosing a model

# See available models
lmi_models()

# Higher capability (slower, more expensive)
lmi_set_model("claude-opus-4-20250514")

# Faster and cheaper
lmi_set_model("claude-haiku-4-5-20251001")

Inspecting the audit trail

Every imputed cell is recorded in result$imputations:

head(result$imputations)
# Illustrative output; actual values depend on the method selected
#   row    col original imputed confidence reasoning
# 1   2    age       NA      45         72  knn ...
# 2   5     bp       NA     130         68  mean ...

The confidence column (0–100) reflects how reliably the method can estimate the missing value given the available data. Filter low-confidence imputations before downstream modelling:

high_conf <- result$imputations[result$imputations$confidence >= 70, ]
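
One way to act on that filter, sketched with mock data so that it runs without the package: instead of only subsetting the audit trail, reset the low-confidence cells to NA in a copy of the imputed data. The imputed and audit objects below are hand-made stand-ins for result$data and result$imputations (assuming the row and col columns identify each imputed cell, as in the output above), and the cutoff of 70 is simply the value used in the example.

```r
# Mock stand-ins for result$data and result$imputations (hypothetical values).
imputed <- data.frame(
  age = c(45L, 45L, 38L, 62L, 29L),
  bp  = c(130, 140, 120, 155, 130)
)
audit <- data.frame(
  row        = c(2L, 5L),
  col        = c("age", "bp"),
  confidence = c(72, 68),
  stringsAsFactors = FALSE
)

# Audit-trail rows that fall below the confidence cutoff.
low_conf <- audit[audit$confidence < 70, ]

# Reset those cells to NA in a copy, leaving the imputed data untouched.
masked <- imputed
for (i in seq_len(nrow(low_conf))) {
  masked[low_conf$row[i], low_conf$col[i]] <- NA
}

masked
```

Working on a copy keeps the original imputed data available for comparison, for example in a sensitivity analysis with and without the low-confidence values.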

Large datasets

For data frames with more than 50 rows, lmi_impute() automatically chunks the data for LLM calls. Adjust max_rows to balance API rate limits against context quality:

result <- lmi_impute(big_df, domain = "financial", max_rows = 30L,
                     verbose = TRUE)

The offline engine processes all rows in a single pass without chunking.
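
To get a feel for how a max_rows setting translates into the number of LLM calls, the chunking idea can be sketched in base R. chunk_rows() is a hypothetical helper, and splitting into consecutive row blocks is an assumption; the package's own chunking strategy may differ.

```r
# Hypothetical helper: split row indices 1..n into consecutive chunks
# of at most max_rows rows each.
chunk_rows <- function(n, max_rows = 50L) {
  stopifnot(n >= 0, max_rows >= 1)
  idx <- seq_len(n)
  unname(split(idx, ceiling(idx / max_rows)))
}

chunks <- chunk_rows(120L, max_rows = 50L)
length(chunks)    # number of chunks: 3
lengths(chunks)   # chunk sizes: 50, 50, 20
```

Smaller chunks mean more requests but less data per request, which is the trade-off between rate limits and context quality described above.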

Tips for best results

Run lmi_diagnose() first: it is free, instant, and shows the full missingness map.
Columns with more than 80 % missing values are difficult for any imputation method. Consider dropping them or adding a missingness indicator variable.
For reproducible offline imputation, always set a seed.
Save the complete lmi_result object with lmi_export(..., format = "rds") to preserve the full audit trail alongside the imputed data.
LLM mode tends to outperform statistical methods on semantically rich, heterogeneous data (e.g. healthcare records with meaningful column names). For purely numeric tabular data with clear structure, "softimpute" or "random_forest" offer excellent accuracy without API costs.
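
The tip about sparse columns can be made concrete with a few lines of base R that do not depend on the package. The sketch below computes the share of missing values per column, lists columns above the 80 % threshold, and adds a missingness indicator. The column name bp_missing is a hypothetical choice for this example.

```r
df <- data.frame(
  age    = c(45L, NA, 38L, 62L, 29L),
  bp     = c(130, 140, 120, 155, NA),
  smoker = c("No", "Yes", "No", NA, "No"),
  stringsAsFactors = FALSE
)

# Share of missing values per column.
miss_rate <- colMeans(is.na(df))

# Columns above the 80 % threshold (none in this small example).
too_sparse <- names(miss_rate)[miss_rate > 0.80]

# Missingness indicator for one column (1 = value was missing).
df$bp_missing <- as.integer(is.na(df$bp))
```

Creating the indicator before imputation preserves the information about which values were originally missing, even after the gaps have been filled.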