---
title: "Getting started with llmimpute"
author: "llmimpute authors"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started with llmimpute}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

## Overview

**llmimpute** provides missing data imputation through two complementary engines:

1. A **Large Language Model (LLM) engine** that uses the Anthropic Claude API for context-aware, semantically informed imputation.
2. A **fully offline statistical engine** implementing nineteen algorithms entirely in base R, with no internet connection or API key required.

The package selects the engine at runtime and falls back to the offline engine when no API key is set.

## Installation

```{r install}
# Install from CRAN
install.packages("llmimpute")
```

## Quick start

```{r quickstart}
library(llmimpute)

# Example dataset with missing values
df <- data.frame(
  age    = c(45L, NA, 38L, 62L, 29L),
  bp     = c(130, 140, 120, 155, NA),
  smoker = c("No", "Yes", "No", NA, "No"),
  stringsAsFactors = FALSE
)

# 1. Diagnose missingness (no API call)
lmi_diagnose(df)

# 2. Impute (the offline fallback is used automatically when no API key is set)
result <- lmi_impute(df)

# 3. Access results
result$data         # imputed data frame
result$imputations  # audit trail with confidence scores and reasoning
summary(result)     # per-column statistics

# 4. Export to disk
lmi_export(result, path = tempdir(), prefix = "my_study")
```

## Offline imputation: choosing a method

When operating offline, `lmi_impute()` delegates to `lmi_impute_offline()`. You can also call `lmi_impute_offline()` directly.

```{r methods}
# List all 19 available offline methods
lmi_methods()

# Use a specific method
result_rf <- lmi_impute(df, offline = TRUE, offline_method = "random_forest")
result_si <- lmi_impute(df, offline = TRUE, offline_method = "softimpute")
result_br <- lmi_impute(df, offline = TRUE, offline_method = "bayesian_ridge")

# Let the package choose per column (default)
result_auto <- lmi_impute(df)
```

The `"auto"` selector chooses the best algorithm per column based on:

- Data type (numeric vs. categorical)
- Skewness (median instead of mean for |skew| > 1)
- Missingness rate (k-NN for > 40 % missing)
- Available correlated predictors (ridge or PMM when strong correlations exist)
- Sample size (random forest when n >= 20 and >= 2 complete predictors)

## LLM-mode imputation

LLM mode requires a valid Anthropic API key obtained from `https://console.anthropic.com`. Store the key in `.Renviron`:

```
ANTHROPIC_API_KEY=sk-ant-api03-...
```

```{r llm}
library(llmimpute)

# Set key for this session (reads ANTHROPIC_API_KEY from environment)
lmi_set_api_key()

# Impute with domain context
result <- lmi_impute(df, domain = "healthcare")

# Flag anomalous existing values in addition to imputing
result2 <- lmi_impute(df, domain = "healthcare", flag_suspicious = TRUE)
result2$suspicious  # data.frame of flagged cells
```

## Domain-specific imputation

The `domain` argument guides the LLM's reasoning:

| Value | Use when |
|---|---|
| `"general"` | Mixed or unknown data |
| `"healthcare"` | Medical records, clinical data |
| `"financial"` | Economic indicators, transactions |
| `"hr"` | Employee records, HR data |
| `"survey"` | Questionnaire or Likert-scale data |
| `"scientific"` | Lab measurements, research data |

## Choosing a model

```{r model}
# See available models
lmi_models()

# Higher capability (slower, more expensive)
lmi_set_model("claude-opus-4-20250514")

# Faster and cheaper
lmi_set_model("claude-haiku-4-5-20251001")
```

## Inspecting the audit trail

Every imputed cell
is recorded in `result$imputations`:

```{r audit}
head(result$imputations)
#   row col original imputed confidence reasoning
# 1   2 age       NA      45         72 knn ...
# 2   5  bp       NA     130         68 mean ...
```

The `confidence` column (0–100) reflects how reliably the method can estimate the missing value given the available data. To exclude low-confidence imputations before downstream modelling, keep only the audit-trail rows at or above a threshold:

```{r filter}
high_conf <- result$imputations[result$imputations$confidence >= 70, ]
```

## Large datasets

For data frames with more than 50 rows, `lmi_impute()` automatically chunks the data for LLM calls. Adjust `max_rows` to balance API rate limits against context quality:

```{r chunks}
# big_df is a placeholder for any data frame with more than 50 rows
result <- lmi_impute(big_df, domain = "financial", max_rows = 30L, verbose = TRUE)
```

The offline engine processes all rows in a single pass without chunking.

## Tips for best results

- Run `lmi_diagnose()` first: it is free, instant, and shows the full map of missing values.
- Columns with > 80 % missing are difficult for any imputation method. Consider dropping them or using a missingness indicator variable.
- For reproducible offline imputation, always set `seed`.
- Save the complete `lmi_result` object with `lmi_export(..., format = "rds")` to preserve the full audit trail alongside the imputed data.
- LLM mode outperforms statistical methods on semantically rich, heterogeneous data (e.g. healthcare records with meaningful column names). For purely numeric tabular data with clear structure, `"softimpute"` or `"random_forest"` offer excellent accuracy without API costs.
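
The tip about sparsely observed columns can be applied with base R before calling `lmi_impute()`. The sketch below is illustrative only and uses no llmimpute function: the helper name `screen_missing()`, its arguments, and the `_missing` suffix are assumptions made for this example, and the 0.8 cut-off simply mirrors the 80 % figure mentioned in the tips.

```{r screen-missing}
# Hypothetical helper (not part of llmimpute): drop columns whose share of
# missing values exceeds `max_missing`, then add 0/1 indicator columns for
# the remaining columns that still contain NAs.
screen_missing <- function(data, max_missing = 0.8, suffix = "_missing") {
  stopifnot(is.data.frame(data))

  miss_rate <- colMeans(is.na(data))  # fraction of NA values per column
  dropped   <- names(miss_rate)[miss_rate > max_missing]
  kept      <- data[, miss_rate <= max_missing, drop = FALSE]

  # Record where values were originally missing, one indicator per column
  partial <- names(kept)[colSums(is.na(kept)) > 0]
  for (col in partial) {
    kept[[paste0(col, suffix)]] <- as.integer(is.na(kept[[col]]))
  }

  list(data = kept, dropped = dropped, miss_rate = miss_rate)
}

# Toy data: `notes` is entirely missing, `age` and `bp` are partly missing
toy <- data.frame(
  age   = c(45L, NA, 38L, 62L, 29L),
  bp    = c(130, 140, 120, 155, NA),
  notes = rep(NA_character_, 5),
  stringsAsFactors = FALSE
)

screened <- screen_missing(toy)
screened$dropped      # columns removed for exceeding the cut-off
names(screened$data)  # remaining columns plus their indicator columns
```

The indicator columns let a downstream model see which values were imputed, which is one way to keep the missingness pattern visible after `screened$data` has been passed to an imputation routine.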