Getting started with llmimpute

Overview

llmimpute provides missing data imputation through two complementary engines:

  1. A Large Language Model (LLM) engine that uses the Anthropic Claude API for context-aware, semantically-informed imputation.
  2. A fully offline statistical engine implementing nineteen algorithms entirely in base R — no internet connection or API key required.

The package selects the engine at runtime: lmi_impute() uses the LLM engine when an API key is available and falls back to the offline engine when none is set.

Installation

# Install from CRAN
install.packages("llmimpute")

Quick start

library(llmimpute)

# Example dataset with missing values
df <- data.frame(
  age    = c(45L, NA, 38L, 62L, 29L),
  bp     = c(130, 140, 120, 155, NA),
  smoker = c("No", "Yes", "No", NA, "No"),
  stringsAsFactors = FALSE
)

# 1. Diagnose missingness (no API call)
lmi_diagnose(df)

# 2. Impute — offline fallback used automatically when no API key is set
result <- lmi_impute(df)

# 3. Access results
result$data          # imputed data frame
result$imputations   # audit trail with confidence scores and reasoning
summary(result)      # per-column statistics

# 4. Export to disk
lmi_export(result, path = tempdir(), prefix = "my_study")

Offline imputation: choosing a method

When operating offline, lmi_impute() delegates to lmi_impute_offline(). You can also call it directly.

# List all 19 available offline methods
lmi_methods()

# Use a specific method
result_rf  <- lmi_impute(df, offline = TRUE, offline_method = "random_forest")
result_si  <- lmi_impute(df, offline = TRUE, offline_method = "softimpute")
result_br  <- lmi_impute(df, offline = TRUE, offline_method = "bayesian_ridge")

# Let the package choose per column (default)
result_auto <- lmi_impute(df, offline = TRUE)

The "auto" selector chooses the best algorithm per column based on:

  • Data type (numeric vs. categorical)
  • Skewness (median instead of mean for |skew| > 1)
  • Missingness rate (k-NN for > 40 % missing)
  • Available correlated predictors (ridge or PMM when strong correlations exist)
  • Sample size (random forest when n >= 20 and >= 2 complete predictors)
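
The criteria above can be pictured as a small decision function. The following is a hypothetical base-R sketch, not the package's actual selector: the function names (sample_skewness, choose_method), the order in which the rules are checked, the 0.7 cutoff for a "strong" correlation, and the "mode", "ridge", "median" and "mean" labels are all assumptions made for illustration.

```r
# Hypothetical sketch of a per-column method chooser; NOT llmimpute's code.
# Rule order, the 0.7 correlation cutoff, and the labels are assumptions.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) < 3 || sd(x) == 0) return(0)
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2)^1.5)
}

choose_method <- function(data, col, strong_cor = 0.7) {
  x <- data[[col]]
  if (!is.numeric(x)) return("mode")            # categorical column

  if (mean(is.na(x)) > 0.40) return("knn")      # heavy missingness

  others <- setdiff(names(data), col)
  num_others <- others[vapply(data[others], is.numeric, logical(1))]
  complete_preds <- others[vapply(data[others],
                                  function(v) !anyNA(v), logical(1))]

  # Strongest absolute pairwise correlation with another numeric column
  max_cor <- 0
  for (p in num_others) {
    r <- suppressWarnings(cor(x, data[[p]], use = "pairwise.complete.obs"))
    if (!is.na(r)) max_cor <- max(max_cor, abs(r))
  }
  if (max_cor >= strong_cor) return("ridge")

  if (nrow(data) >= 20 && length(complete_preds) >= 2) return("random_forest")

  # Fall back to a simple location estimate, robust if the column is skewed
  if (abs(sample_skewness(x)) > 1) "median" else "mean"
}

# Usage on the quick-start data frame
df <- data.frame(
  age    = c(45L, NA, 38L, 62L, 29L),
  bp     = c(130, 140, 120, 155, NA),
  smoker = c("No", "Yes", "No", NA, "No"),
  stringsAsFactors = FALSE
)
sapply(names(df), function(col) choose_method(df, col))
```

To see what the package itself chose for each cell, inspect the reasoning column of result$imputations rather than relying on a sketch like this one.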

LLM-mode imputation

LLM mode requires a valid Anthropic API key obtained from https://console.anthropic.com. Store the key in .Renviron:

ANTHROPIC_API_KEY=sk-ant-api03-...

Then, in an R session:

library(llmimpute)

# Set key for this session (reads ANTHROPIC_API_KEY from environment)
lmi_set_api_key()

# Impute with domain context
result <- lmi_impute(df, domain = "healthcare")

# Flag anomalous existing values in addition to imputing
result2 <- lmi_impute(df, domain = "healthcare", flag_suspicious = TRUE)
result2$suspicious   # data.frame of flagged cells
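
Before relying on LLM mode, it can help to confirm that the key is actually visible to the R session. The sketch below uses only base R; has_api_key is a hypothetical helper, not part of llmimpute. Note that R reads .Renviron at startup, so after editing the file either restart R or call readRenviron("~/.Renviron").

```r
# Hypothetical helper (not part of llmimpute): TRUE if the environment
# variable is set to a non-empty string in the current R session.
has_api_key <- function(var = "ANTHROPIC_API_KEY") {
  nzchar(Sys.getenv(var, unset = ""))
}

if (has_api_key()) {
  message("API key found: LLM mode is available.")
} else {
  message("No API key found: expect the offline engine to be used.")
}
```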

Domain-specific imputation

The domain argument guides the LLM’s reasoning:

Value          Use when
"general"      Mixed or unknown data
"healthcare"   Medical records, clinical data
"financial"    Economic indicators, transactions
"hr"           Employee records, HR data
"survey"       Questionnaire or Likert-scale data
"scientific"   Lab measurements, research data

Choosing a model

# See available models
lmi_models()

# Higher capability (slower, more expensive)
lmi_set_model("claude-opus-4-20250514")

# Faster and cheaper
lmi_set_model("claude-haiku-4-5-20251001")

Inspecting the audit trail

Every imputed cell is recorded in result$imputations:

head(result$imputations)
#   row    col original imputed confidence reasoning
# 1   2    age       NA      45         72  knn ...
# 2   5     bp       NA     130         68  mean ...

The confidence column (0–100) reflects how reliably the method can estimate the missing value given the available data. Screen out low-confidence imputations before downstream modelling, for example by keeping only the audit-trail rows that score at least 70:

high_conf <- result$imputations[result$imputations$confidence >= 70, ]
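
Filtering the audit trail does not change the imputed data frame itself. One possible follow-up is to set low-confidence cells back to NA so they are not treated as observed values. The sketch below works on mock objects that mirror the columns printed above; it assumes that row is a row index into the data and col is a column name, and mask_low_confidence is a hypothetical helper, not a package function.

```r
# Mock objects mirroring the audit-trail layout shown above.
# Values are illustrative only.
imputed <- data.frame(
  age = c(45L, 45L, 38L, 62L, 29L),
  bp  = c(130, 140, 120, 155, 130)
)
audit <- data.frame(
  row        = c(2L, 5L),
  col        = c("age", "bp"),
  confidence = c(72, 68),
  stringsAsFactors = FALSE
)

# Hypothetical helper: revert imputations below the threshold to NA.
# Assumes audit$row indexes rows of `data` and audit$col names its columns.
mask_low_confidence <- function(data, audit, threshold = 70) {
  low <- audit[audit$confidence < threshold, , drop = FALSE]
  for (i in seq_len(nrow(low))) {
    data[low$row[i], low$col[i]] <- NA
  }
  data
}

masked <- mask_low_confidence(imputed, audit)
masked
```

With real results, the same function would be called as mask_low_confidence(result$data, result$imputations), provided the audit trail has the layout assumed here.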

Large datasets

For data frames with more than 50 rows, lmi_impute() automatically chunks the data for LLM calls. Adjust max_rows to balance API rate limits against context quality:

result <- lmi_impute(big_df, domain = "financial", max_rows = 30L,
                     verbose = TRUE)

The offline engine processes all rows in a single pass without chunking.
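
To make the chunking idea concrete, here is a minimal base-R sketch of splitting a data frame into consecutive blocks of at most max_rows rows. It illustrates the general technique only; how lmi_impute() actually forms and reassembles its chunks is not shown here, and chunk_rows is a hypothetical helper.

```r
# Hypothetical helper: split a data frame into consecutive row blocks.
# This is an illustration, not llmimpute's internal chunking logic.
chunk_rows <- function(data, max_rows) {
  stopifnot(max_rows >= 1L)
  if (nrow(data) == 0L) return(list())
  group <- ceiling(seq_len(nrow(data)) / max_rows)
  split(data, group)
}

set.seed(1)
big_df <- data.frame(id = 1:120, value = rnorm(120))

chunks <- chunk_rows(big_df, max_rows = 30L)
length(chunks)                      # 4 blocks
vapply(chunks, nrow, integer(1))    # 30 rows each
```

Smaller blocks mean more, lighter requests, while larger blocks give each request more surrounding rows to reason from.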

Tips for best results

  • Run lmi_diagnose() first — free, instant, and shows the full missing map.
  • Columns with > 80 % missing are difficult for any imputation method. Consider dropping them or using a missingness indicator variable.
  • For reproducible offline imputation, always set seed.
  • Save the complete lmi_result object with lmi_export(..., format = "rds") to preserve the full audit trail alongside the imputed data.
  • LLM mode outperforms statistical methods on semantically rich heterogeneous data (e.g. healthcare records with meaningful column names). For purely numeric tabular data with clear structure, "softimpute" or "random_forest" offer excellent accuracy without API costs.
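
The tip about very sparse columns can be applied with a few lines of base R before imputing. The sketch below is one possible approach under stated assumptions: the example data frame wide_df and the "_missing" suffix are made up for illustration, and the 80 % threshold is taken from the tip above.

```r
# Illustrative data: `notes` is missing in 9 of 10 rows.
wide_df <- data.frame(
  id    = 1:10,
  score = c(5, NA, 7, 6, NA, 8, 5, 6, 7, 9),
  notes = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, "ok"),
  stringsAsFactors = FALSE
)

# Fraction of missing values per column
miss_frac <- colMeans(is.na(wide_df))
sparse <- names(miss_frac)[miss_frac > 0.80]

# Keep a 0/1 indicator of where each sparse column was missing,
# then drop the sparse column itself.
for (col in sparse) {
  wide_df[[paste0(col, "_missing")]] <- as.integer(is.na(wide_df[[col]]))
  wide_df[[col]] <- NULL
}

names(wide_df)
```

The remaining columns can then be passed to lmi_impute() as usual.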