| Title: | Missing Data Imputation via Language Models and Statistics |
|---|---|
| Description: | Provides missing data imputation through two complementary engines: a large language model engine that communicates with the 'Anthropic' 'Claude' application programming interface for context-aware semantic imputation, and a fully self-contained offline engine implementing nineteen statistical and machine learning algorithms entirely in base R with no additional package dependencies. Offline methods include mean, median, mode, last observation carried forward, next observation carried backward, hot-deck, predictive mean matching, k-nearest neighbours, ordinary least-squares regression, Lasso with coordinate descent, Ridge with closed-form solution, Bayesian Ridge regression with evidence approximation following MacKay (1992), support vector regression with a radial basis function kernel, classification and regression trees, random forests, gradient boosting, iterative random forest imputation, principal component analysis imputation via iterative singular value decomposition, and nuclear-norm minimisation via singular value thresholding. When no API key is available the package automatically falls back to the offline engine, ensuring full operation in environments without internet access. Every imputed value is accompanied by a confidence score and a plain-language reasoning string, producing reproducible audit trails. The automatic method selector chooses the best algorithm per column based on data type, skewness, missingness rate, and inter-column correlations. |
| Authors: | Sadikul Islam [aut, cre] (ORCID: <https://orcid.org/0000-0003-2924-7122>), Rajesh Kaushal [aut] |
| Maintainer: | Sadikul Islam <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-06-23 16:33:10 UTC |
| Source: | https://github.com/cran/llmimpute |
Convenience S3 method allowing as.data.frame(result) to extract
the imputed data frame from an lmi_result object directly.
    ## S3 method for class 'lmi_result'
    as.data.frame(x, ...)
| Argument | Description |
|---|---|
| x | An object of class lmi_result. |
| ... | Currently unused. Included for S3 compatibility. |
The imputed data.frame.
    df <- data.frame(x = c(1, NA, 3))
    result <- lmi_impute_offline(df, verbose = FALSE)
    clean <- as.data.frame(result)
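For readers unfamiliar with S3 dispatch, the following is a minimal sketch of how a method of this kind can be written. The class name demo_result and the constructor new_demo_result are hypothetical and exist only for illustration; this is not the package's source code.

```r
# Hypothetical list-based result class, used only to illustrate S3 dispatch.
new_demo_result <- function(data, imputations) {
  structure(list(data = data, imputations = imputations),
            class = "demo_result")
}

# as.data.frame() dispatches to this method for objects of class "demo_result".
as.data.frame.demo_result <- function(x, ...) {
  x$data
}

res <- new_demo_result(
  data = data.frame(x = c(1, 2, 3)),
  imputations = data.frame(row = 2L, col = "x", imputed = 2)
)
clean <- as.data.frame(res)
```

Because the method simply returns the stored data element, downstream code can treat the result like any other data frame.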
Analyses a data frame and prints a report on missing values, column types,
and the number of unique observed values per column. No API calls are made.
Use this function before lmi_impute to preview what will be
imputed and to choose an appropriate method.
    lmi_diagnose(data, na_strings = NULL)
| Argument | Description |
|---|---|
| data | A data.frame. |
| na_strings | Character vector of additional strings to treat as NA. Default NULL. |
Invisibly returns a data.frame with one row per input column and
the fields column, type, n_missing, pct_missing, and
n_unique.
lmi_impute, lmi_impute_offline
    df <- data.frame(
      age    = c(25L, NA, 35L, 40L),
      income = c(50000, 60000, NA, 80000),
      edu    = c("BSc", NA, "MSc", "BSc"),
      stringsAsFactors = FALSE
    )
    lmi_diagnose(df)
    lmi_diagnose(df, na_strings = "N/A")
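The report columns described above can be approximated in a few lines of base R. The sketch below is an assumption about what such a report might compute, not the package's implementation; diagnose_sketch is a hypothetical name, and the rounding of pct_missing is a choice made here.

```r
# Hypothetical helper: builds a per-column missingness report in base R.
diagnose_sketch <- function(data, na_strings = NULL) {
  # Treat user-supplied placeholder strings as missing (character columns only).
  if (!is.null(na_strings)) {
    for (nm in names(data)) {
      if (is.character(data[[nm]])) {
        data[[nm]][data[[nm]] %in% na_strings] <- NA
      }
    }
  }
  n_missing <- vapply(data, function(col) sum(is.na(col)), integer(1))
  data.frame(
    column      = names(data),
    type        = vapply(data, function(col) class(col)[1], character(1)),
    n_missing   = n_missing,
    pct_missing = round(100 * n_missing / nrow(data), 1),
    n_unique    = vapply(data, function(col) length(unique(col[!is.na(col)])),
                         integer(1)),
    row.names   = NULL,
    stringsAsFactors = FALSE
  )
}

df <- data.frame(
  age = c(25L, NA, 35L, 40L),
  edu = c("BSc", "N/A", "MSc", "BSc"),
  stringsAsFactors = FALSE
)
report <- diagnose_sketch(df, na_strings = "N/A")
```

Here the string "N/A" in edu is counted as missing only because it was passed through na_strings.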
Writes the imputed data frame and the imputation audit trail to disk.
Supports CSV (default) and RDS output formats. When
flag_suspicious = TRUE was used in lmi_impute, the
suspicious-values table is also written.
    lmi_export(
      result,
      path = tempdir(),
      prefix = "llmimpute",
      format = c("csv", "rds"),
      overwrite = FALSE
    )
| Argument | Description |
|---|---|
| result | An object of class lmi_result. |
| path | Character string. Output directory. Created recursively if it does not exist. Default is the session temporary directory, tempdir(), as shown in the usage above. |
| prefix | Character string prepended to output file names. Default "llmimpute". |
| format | Character string. Either "csv" (default) or "rds". |
| overwrite | Logical. Overwrite existing files? Default FALSE. |
Invisibly returns a named character vector of the file paths written.
lmi_impute, lmi_impute_offline
    df <- data.frame(
      age    = c(25L, NA, 35L),
      income = c(50000, 60000, NA),
      stringsAsFactors = FALSE
    )
    result <- lmi_impute_offline(df, verbose = FALSE)
    ## Not run:
    lmi_export(result, path = tempdir(), prefix = "my_study")
    ## End(Not run)
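The export behaviour described above (recursive directory creation, a choice of CSV or RDS, and an overwrite guard) can be sketched in base R as follows. The function export_sketch and the "_data" file-name suffix are hypothetical; the file names the package actually writes are not stated in this manual.

```r
# Hypothetical helper: writes one data frame to disk with an overwrite guard.
export_sketch <- function(df, path = tempdir(), prefix = "demo",
                          format = c("csv", "rds"), overwrite = FALSE) {
  format <- match.arg(format)
  # Create the output directory (and any missing parents) if needed.
  if (!dir.exists(path)) dir.create(path, recursive = TRUE)
  out <- file.path(path, paste0(prefix, "_data.", format))
  if (file.exists(out) && !overwrite) {
    stop("File already exists: ", out, " (set overwrite = TRUE to replace it)")
  }
  if (format == "csv") {
    write.csv(df, out, row.names = FALSE)
  } else {
    saveRDS(df, out)
  }
  # Return the written path invisibly, as a named character vector.
  invisible(c(data = out))
}

out_dir <- file.path(tempdir(), "export_sketch_demo")
written <- export_sketch(data.frame(x = c(1, 2, 3)), path = out_dir)
round_trip <- read.csv(written[["data"]])
```

Calling export_sketch a second time with the same path and prefix raises an error unless overwrite = TRUE is passed.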
Primary entry point for the llmimpute package. When an Anthropic
API key is configured, missing values are filled using the Claude large
language model, which reasons about each missing cell using the semantic
meaning of column names, inter-column relationships, and domain knowledge.
When no API key is available (or when offline = TRUE), the function
transparently delegates to lmi_impute_offline, which runs
entirely in base R without internet access.
    lmi_impute(
      data,
      domain = c("general", "healthcare", "financial", "hr", "survey",
                 "scientific"),
      offline = FALSE,
      offline_fallback = TRUE,
      offline_method = c("auto", "mean", "median", "mode", "locf", "nocb",
                         "hotdeck", "pmm", "knn", "linear", "lasso", "ridge",
                         "bayesian_ridge", "svr", "decision_tree",
                         "random_forest", "gradient_boost", "missforest",
                         "pca_impute", "softimpute"),
      na_strings = NULL,
      cols = NULL,
      confidence = TRUE,
      reasoning = TRUE,
      flag_suspicious = FALSE,
      max_rows = 50L,
      knn_k = 5L,
      seed = 42L,
      verbose = TRUE
    )
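| Argument | Description |
|---|---|
| data | A data.frame. |
| domain | Character string describing the data domain. Guides LLM reasoning. One of "general", "healthcare", "financial", "hr", "survey", "scientific". |
| offline | Logical. If TRUE, the offline engine is always used. Default FALSE. |
| offline_fallback | Logical. If TRUE, the offline engine is used when no API key is available; if FALSE, an error with guidance is raised instead. Default TRUE. |
| offline_method | Character. Offline imputation strategy passed to lmi_impute_offline. |
| na_strings | Character vector of additional strings to treat as NA. Default NULL. |
| cols | Character vector of column names to impute. Default NULL. |
| confidence | Logical. Include a confidence score (0-100) per imputed cell. Default TRUE. |
| reasoning | Logical. Include a one-sentence explanation per imputed cell. Default TRUE. |
| flag_suspicious | Logical. Ask the LLM to flag anomalous existing values (LLM mode only). Default FALSE. |
| max_rows | Integer. Maximum rows per API call chunk (LLM mode only). Default 50L. |
| knn_k | Integer. Neighbours for the "knn" offline method. Default 5L. |
| seed | Integer. Random seed for reproducible offline imputation. Default 42L. |
| verbose | Logical. Print progress messages. Default TRUE. |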
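The max_rows argument caps how many rows go into each API call. How the package splits the data is not shown in this manual; the sketch below is one plausible way to chunk row indices in base R, with chunk_rows as a hypothetical helper.

```r
# Hypothetical helper: split row indices 1..n_rows into chunks of at most
# max_rows rows each.
chunk_rows <- function(n_rows, max_rows = 50L) {
  idx <- seq_len(n_rows)
  split(idx, ceiling(idx / max_rows))
}

chunks <- chunk_rows(120L, max_rows = 50L)
lengths(chunks)  # chunk sizes: 50, 50, 20
```

Each element of chunks could then be used to subset the data frame before a request is built.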
An object of class lmi_result, a named list with:

| Element | Description |
|---|---|
| data | The imputed data.frame. |
| imputations | data.frame audit trail: one row per imputed cell with columns row, col, original, imputed, confidence, reasoning. |
| suspicious | data.frame of flagged existing values (LLM mode, flag_suspicious = TRUE only). NULL otherwise. |
| summary | Named list of imputation statistics. |
| call | The matched call. |
| Situation | Behaviour |
|---|---|
| API key present, offline = FALSE | LLM imputation (Anthropic) |
| No key, offline_fallback = TRUE | Offline engine (silent) |
| No key, offline_fallback = FALSE | Error with guidance |
| offline = TRUE | Offline engine always |
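The fallback rules listed above amount to a small decision function. The sketch below encodes them in base R; choose_engine is a hypothetical name, the error message is invented for illustration, and reading the key from the ANTHROPIC_API_KEY environment variable is an assumption based on the comment in the examples.

```r
# Hypothetical helper mirroring the four situations listed above.
choose_engine <- function(api_key = Sys.getenv("ANTHROPIC_API_KEY"),
                          offline = FALSE, offline_fallback = TRUE) {
  has_key <- nzchar(api_key)           # "" means no key is configured
  if (offline) return("offline")       # offline = TRUE always wins
  if (has_key) return("llm")           # key present, offline = FALSE
  if (offline_fallback) return("offline")
  stop("No API key found. Set one, or call with offline = TRUE.")
}

choose_engine(api_key = "dummy-key")                # "llm"
choose_engine(api_key = "")                         # "offline"
choose_engine(api_key = "dummy-key", offline = TRUE)  # "offline"
```

With api_key = "" and offline_fallback = FALSE the function stops with an error, matching the third row of the table.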
lmi_impute_offline, lmi_diagnose,
lmi_export, print.lmi_result
    # Offline imputation (works with no API key)
    df <- data.frame(
      age    = c(25L, NA, 35L, 40L, NA),
      income = c(50000, 60000, NA, 80000, 55000),
      edu    = c("BSc", NA, "MSc", "BSc", "PhD"),
      stringsAsFactors = FALSE
    )
    result <- lmi_impute(df)
    result$data
    result$imputations
    summary(result)

    # Force a specific offline method
    result2 <- lmi_impute(df, offline = TRUE, offline_method = "random_forest")

    ## Not run:
    # LLM mode requires a valid Anthropic API key
    lmi_set_api_key()  # reads ANTHROPIC_API_KEY from environment
    result3 <- lmi_impute(df, domain = "hr")
    ## End(Not run)
Fully offline imputation with 20 method options (19 algorithms plus the "auto" selector), all implemented from scratch in base R. No API key, internet connection, or third-party package is needed.
    lmi_impute_offline(
      data,
      method = c("auto", "mean", "median", "mode", "locf", "nocb", "hotdeck",
                 "pmm", "knn", "linear", "lasso", "ridge", "bayesian_ridge",
                 "svr", "decision_tree", "random_forest", "gradient_boost",
                 "missforest", "pca_impute", "softimpute"),
      cols = NULL,
      na_strings = NULL,
      knn_k = 5L,
      n_trees = 50L,
      max_depth = 4L,
      shrinkage = 0.1,
      n_iter = 10L,
      lambda = 1,
      rank = 3L,
      seed = 42L,
      verbose = TRUE
    )
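| Argument | Description |
|---|---|
| data | A data.frame. |
| method | One of: "auto", "mean", "median", "mode", "locf", "nocb", "hotdeck", "pmm", "knn", "linear", "lasso", "ridge", "bayesian_ridge", "svr", "decision_tree", "random_forest", "gradient_boost", "missforest", "pca_impute", "softimpute". |
| cols | Column names to impute. Default NULL. |
| na_strings | Extra strings treated as NA. Default NULL. |
| knn_k | Neighbours for knn. Default 5L. |
| n_trees | Trees for random_forest, gradient_boost, and missforest. Default 50L. |
| max_depth | Tree depth. Default 4L. |
| shrinkage | Learning rate for gradient_boost. Default 0.1. |
| n_iter | Iterations for iterative methods. Default 10L. |
| lambda | Regularisation for ridge, lasso, and bayesian_ridge. Default 1. |
| rank | SVD rank for pca_impute and softimpute. Default 3L. |
| seed | Random seed. Default 42L. |
| verbose | Print progress. Default TRUE. |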
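To make the role of lambda concrete, here is a minimal sketch of closed-form ridge regression used to fill one missing value. It illustrates the standard estimator, beta = (X'X + lambda I)^-1 X'y on centred data, and is not taken from the package source; ridge_fit is a hypothetical name, and centring so that the intercept is unpenalised is an assumption made here.

```r
# Hypothetical helper: closed-form ridge fit on centred predictors.
ridge_fit <- function(X, y, lambda = 1) {
  # Centre predictors and response so the intercept is not penalised.
  x_means <- colMeans(X)
  y_mean  <- mean(y)
  Xc <- sweep(X, 2, x_means)
  yc <- y - y_mean
  beta <- solve(crossprod(Xc) + lambda * diag(ncol(Xc)), crossprod(Xc, yc))
  beta <- drop(beta)
  list(beta = beta, intercept = y_mean - sum(x_means * beta))
}

# Toy data: income is exactly 2 * age where observed.
df <- data.frame(age = c(25, 30, 35, 40, 45),
                 income = c(50, 60, 70, 80, NA))
obs <- !is.na(df$income)
X_obs  <- as.matrix(df[obs, "age", drop = FALSE])
X_miss <- as.matrix(df[!obs, "age", drop = FALSE])

fit <- ridge_fit(X_obs, df$income[obs], lambda = 1)
df$income[!obs] <- fit$intercept + drop(X_miss %*% fit$beta)
```

With lambda = 0 the fit reduces to ordinary least squares and the imputed value is exactly 90; with lambda = 1 the slope is shrunk slightly towards zero, so the imputed value falls just below 90.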
An object of class lmi_result.
    df <- data.frame(
      age    = c(25, NA, 35, 40, NA),
      income = c(50000, 60000, NA, 80000, 55000),
      edu    = c("BSc", NA, "MSc", "BSc", "PhD"),
      stringsAsFactors = FALSE
    )
    lmi_impute_offline(df)
    lmi_impute_offline(df, method = "random_forest")
    lmi_impute_offline(df, method = "softimpute")
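For intuition about the simplest methods in the list ("mean", "median", "mode", "locf", "nocb"), the sketch below shows textbook base R versions. These helper functions are hypothetical and written only for illustration; the package's own implementations may handle ties, leading missing values, and types differently.

```r
# Replace NA with the mean or median of the observed values.
impute_mean   <- function(x) { x[is.na(x)] <- mean(x, na.rm = TRUE); x }
impute_median <- function(x) { x[is.na(x)] <- median(x, na.rm = TRUE); x }

# Replace NA in a character vector with the most frequent observed value.
impute_mode <- function(x) {
  tab <- table(x)  # table() ignores NA by default
  x[is.na(x)] <- names(tab)[which.max(tab)]
  x
}

# Last observation carried forward: a leading NA has nothing to copy
# and stays NA.
impute_locf <- function(x) {
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- x[i - 1]
  }
  x
}

# Next observation carried backward: LOCF applied to the reversed vector.
impute_nocb <- function(x) rev(impute_locf(rev(x)))

age    <- c(25, NA, 35, 40, NA)
income <- c(50000, 60000, NA, 80000, 55000)
edu    <- c("BSc", NA, "MSc", "BSc", "PhD")

impute_mean(age)       # NA replaced by 100 / 3
impute_median(income)  # NA replaced by 57500
impute_mode(edu)       # NA replaced by "BSc"
```

LOCF and NOCB only make sense when row order is meaningful, for example in time series.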
Prints all 20 methods grouped by category with usage guidance.
    lmi_methods()
Invisibly returns a data.frame of method metadata.
    lmi_methods()
Prints recommended model identifiers for every provider supported by
lmi_impute, including free-tier options. Use
lmi_set_model to activate a model after choosing a provider
with lmi_set_api_key.
    lmi_models()
Invisibly returns a character vector of all model identifiers across every provider.
lmi_set_model, lmi_providers,
lmi_impute
    lmi_models()
Prints every LLM provider supported by lmi_impute, grouped
into cloud APIs, local servers, and custom endpoints, with free-tier
indicators and required environment variables.
    lmi_providers()
Invisibly returns a character vector of provider names.
    lmi_providers()
Sets the API key and LLM provider used by LLM-mode imputation. Supports
cloud APIs (Anthropic 'Claude', OpenAI, Google Gemini, Groq, Mistral,
Cohere, OpenRouter, Together AI, Fireworks AI, DeepSeek, Perplexity,
xAI Grok, AI21 Labs, Cerebras) and local servers (Ollama, LM Studio,
Jan, llama.cpp, KoboldCpp, Text Generation WebUI). Any other
OpenAI-compatible endpoint can be used via provider = "custom"
with a base_url.
If no API key is configured, lmi_impute automatically falls
back to the offline statistical engine. The package is fully functional
without any API key.
    lmi_set_api_key(
      api_key = NULL,
      provider = "anthropic",
      base_url = NULL,
      .session = TRUE
    )
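| Argument | Description |
|---|---|
| api_key | Character string. Your API key. Not required for local providers (Ollama, LM Studio, Jan, llama.cpp, KoboldCpp, Text Generation WebUI). If NULL (the default), the key is read from the environment, for example ANTHROPIC_API_KEY. |
| provider | Character string. Provider name. Default "anthropic". Identifiers used in the examples below include "anthropic", "openai", "gemini", "groq", "openrouter", "deepseek", "cerebras", "ollama", "lmstudio", "jan", "llamacpp", "koboldcpp", "textgenwebui", and "custom"; lmi_providers() prints the full list. |
| base_url | Character string. Required only for provider = "custom"; it can also override the default endpoint of a local server. Default NULL. |
| .session | Logical. Store for the current session only (default TRUE). |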
Invisibly returns the API key string (or "" for keyless
local providers).
lmi_providers, lmi_set_model,
lmi_impute
    ## Not run:
    ## Cloud APIs
    lmi_set_api_key("sk-ant-...", provider = "anthropic")
    lmi_set_api_key("sk-...", provider = "openai")
    lmi_set_api_key("AIza...", provider = "gemini")       # free tier
    lmi_set_api_key("gsk_...", provider = "groq")         # free tier
    lmi_set_api_key("sk-or-...", provider = "openrouter") # free models
    lmi_set_api_key("...", provider = "deepseek")
    lmi_set_api_key("...", provider = "cerebras")         # free tier

    ## Local servers (no key needed)
    lmi_set_api_key(provider = "ollama")
    lmi_set_api_key(provider = "lmstudio")
    lmi_set_api_key(provider = "jan")
    lmi_set_api_key(provider = "llamacpp")
    lmi_set_api_key(provider = "koboldcpp")
    lmi_set_api_key(provider = "textgenwebui")

    ## Any OpenAI-compatible endpoint
    lmi_set_api_key("mykey", provider = "custom",
                    base_url = "http://my-server:8000/v1/chat/completions")

    ## Override local port
    lmi_set_api_key(provider = "ollama",
                    base_url = "http://localhost:11435/api/chat")
    ## End(Not run)
Sets or retrieves the model identifier used by lmi_impute
in LLM mode. The default model is determined by the active provider.
Use lmi_models to see recommended models per provider.
    lmi_set_model(model = NULL)
    lmi_get_model()
| Argument | Description |
|---|---|
| model | Character string. A valid model identifier for the active provider. Default NULL. |
Invisibly returns the active model string.
    lmi_get_model()
    ## Not run:
    lmi_set_model("gpt-4o-mini")
    lmi_set_model("gemini-1.5-flash")
    lmi_set_model("llama-3.3-70b-versatile")
    lmi_set_model("deepseek-chat")
    lmi_set_model("llama3.2")  # Ollama local
    ## End(Not run)
Displays a formatted summary of an imputation result in the console,
including overall statistics, per-column imputation counts, and the first
n imputed values with their confidence scores and reasoning.
    ## S3 method for class 'lmi_result'
    print(x, n = 10L, ...)
| Argument | Description |
|---|---|
| x | An object of class lmi_result. |
| n | Integer. Number of individual imputation rows to display. Default 10L. |
| ... | Currently unused. Included for S3 compatibility. |
Invisibly returns x.
    df <- data.frame(
      age    = c(25L, NA, 35L),
      income = c(50000, 60000, NA),
      stringsAsFactors = FALSE
    )
    result <- lmi_impute_offline(df, verbose = FALSE)
    print(result)
Returns a data.frame summarising imputation counts and confidence
statistics per column, suitable for further analysis or reporting.
    ## S3 method for class 'lmi_result'
    summary(object, ...)
| Argument | Description |
|---|---|
| object | An object of class lmi_result. |
| ... | Currently unused. Included for S3 compatibility. |
A data.frame with columns column, n_imputed,
mean_confidence, min_confidence, max_confidence.
Returns NULL invisibly when no imputations were performed.
    df <- data.frame(
      age    = c(25L, NA, 35L, 40L),
      income = c(50000, 60000, NA, 80000),
      stringsAsFactors = FALSE
    )
    result <- lmi_impute_offline(df, verbose = FALSE)
    summary(result)
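The per-column statistics described above can be derived from an audit trail with a simple group-by. The sketch below uses a made-up audit table whose col and confidence columns follow the layout documented for the imputations element; summarise_audit is a hypothetical helper and the confidence values are invented for illustration.

```r
# Made-up audit trail: one row per imputed cell.
audit <- data.frame(
  row        = c(2L, 5L, 3L),
  col        = c("age", "age", "income"),
  confidence = c(80, 60, 70),
  stringsAsFactors = FALSE
)

# Hypothetical helper: per-column counts and confidence statistics.
summarise_audit <- function(imputations) {
  # Nothing was imputed: return NULL invisibly.
  if (nrow(imputations) == 0L) return(invisible(NULL))
  groups <- split(imputations$confidence, imputations$col)
  data.frame(
    column          = names(groups),
    n_imputed       = vapply(groups, length, integer(1)),
    mean_confidence = vapply(groups, mean, numeric(1)),
    min_confidence  = vapply(groups, min, numeric(1)),
    max_confidence  = vapply(groups, max, numeric(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

s <- summarise_audit(audit)
```

For this toy input, the age column has two imputed cells with a mean confidence of 70, and income has one.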