---
title: "Getting started with llmimpute"
author: "llmimpute authors"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started with llmimpute}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse  = TRUE,
  comment   = "#>",
  eval      = FALSE
)
```

## Overview

**llmimpute** imputes missing data through two complementary engines:

1. A **Large Language Model (LLM) engine** that uses the Anthropic Claude
   API for context-aware, semantically informed imputation.
2. A **fully offline statistical engine** implementing nineteen algorithms
   entirely in base R, with no internet connection or API key required.

The package selects the engine at runtime: it uses the LLM engine when an
API key is available and falls back to the offline engine otherwise.

## Installation

```{r install}
# Install from CRAN
install.packages("llmimpute")
```

## Quick start

```{r quickstart}
library(llmimpute)

# Example dataset with missing values
df <- data.frame(
  age    = c(45L, NA, 38L, 62L, 29L),
  bp     = c(130, 140, 120, 155, NA),
  smoker = c("No", "Yes", "No", NA, "No"),
  stringsAsFactors = FALSE
)

# 1. Diagnose missingness (no API call)
lmi_diagnose(df)

# 2. Impute (the offline engine is used automatically when no API key is set)
result <- lmi_impute(df)

# 3. Access results
result$data          # imputed data frame
result$imputations   # audit trail with confidence scores and reasoning
summary(result)      # per-column statistics

# 4. Export to disk
lmi_export(result, path = tempdir(), prefix = "my_study")
```

## Offline imputation: choosing a method

When operating offline, `lmi_impute()` delegates to `lmi_impute_offline()`.
You can also call `lmi_impute_offline()` directly.

```{r methods}
# List all 19 available offline methods
lmi_methods()

# Use a specific method
result_rf  <- lmi_impute(df, offline = TRUE, offline_method = "random_forest")
result_si  <- lmi_impute(df, offline = TRUE, offline_method = "softimpute")
result_br  <- lmi_impute(df, offline = TRUE, offline_method = "bayesian_ridge")

# Let the package choose per column (default method), forcing the offline engine
result_auto <- lmi_impute(df, offline = TRUE)
```

The default `"auto"` selector picks an algorithm for each column based on:

- Data type (numeric vs. categorical)
- Skewness (median instead of mean for |skew| > 1)
- Missingness rate (k-NN for > 40 % missing)
- Available correlated predictors (ridge or PMM when strong correlations exist)
- Sample size (random forest when n >= 20 and >= 2 complete predictors)
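
The rules above can be read as a small decision procedure. The sketch below
is illustrative only and is not the package's implementation: the function
name `pick_method_sketch()`, the order in which the rules are checked, the
0.5 correlation cut-off, and the returned labels (including `"mode"` for
categorical columns) are all assumptions.

```{r auto-sketch}
# Hypothetical per-column selector. Rule order, cut-offs, and labels are
# assumptions made for illustration; they may not match lmi_methods().
pick_method_sketch <- function(x, predictors = data.frame(), cor_cutoff = 0.5) {
  # Categorical column: most frequent value (assumed label)
  if (!is.numeric(x)) return("mode")

  n_obs     <- sum(!is.na(x))
  miss_rate <- mean(is.na(x))

  # High missingness: k-NN
  if (miss_rate > 0.40) return("knn")

  # Keep only numeric predictors without missing values
  complete <- Filter(function(p) is.numeric(p) && !anyNA(p), predictors)

  # Enough observations and complete predictors: random forest
  if (n_obs >= 20 && ncol(complete) >= 2) return("random_forest")

  # A strongly correlated predictor exists: ridge regression
  if (ncol(complete) > 0) {
    r <- vapply(complete,
                function(p) abs(cor(x, p, use = "complete.obs")),
                numeric(1))
    if (any(r >= cor_cutoff, na.rm = TRUE)) return("ridge")
  }

  # Otherwise a location estimate: median when |skew| > 1, else mean
  obs  <- x[!is.na(x)]
  dev  <- obs - mean(obs)
  skew <- mean(dev^3) / mean(dev^2)^1.5
  if (is.finite(skew) && abs(skew) > 1) "median" else "mean"
}

pick_method_sketch(c(1, 2, 3, 4, 100, NA))   # "median" (right-skewed)
pick_method_sketch(c(1, NA, NA, NA, 5))      # "knn" (60 % missing)
```

Checking the missingness rate first is a design choice of this sketch; the
package may weigh the same criteria in a different order.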

## LLM-mode imputation

LLM mode requires a valid Anthropic API key obtained from
`https://console.anthropic.com`. Store the key in `.Renviron`:

```
ANTHROPIC_API_KEY=sk-ant-api03-...
```
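
R reads `.Renviron` at startup, so a key added during a running session is
not visible until R is restarted or the file is reloaded. A quick way to
confirm that the variable is set, using only base R (the helper name
`has_api_key()` is hypothetical and not part of llmimpute):

```{r key-check}
# Hypothetical helper: TRUE when the environment variable is set and non-empty
has_api_key <- function(var = "ANTHROPIC_API_KEY") {
  nzchar(Sys.getenv(var, unset = ""))
}

has_api_key()

# Reload ~/.Renviron without restarting R, if the key was only just added
# readRenviron("~/.Renviron")
```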

```{r llm}
library(llmimpute)

# Set key for this session (reads ANTHROPIC_API_KEY from environment)
lmi_set_api_key()

# Impute with domain context
result <- lmi_impute(df, domain = "healthcare")

# Flag anomalous existing values in addition to imputing
result2 <- lmi_impute(df, domain = "healthcare", flag_suspicious = TRUE)
result2$suspicious   # data.frame of flagged cells
```

## Domain-specific imputation

The `domain` argument guides the LLM's reasoning:

| Value | Use when |
|---|---|
| `"general"` | Mixed or unknown data |
| `"healthcare"` | Medical records, clinical data |
| `"financial"` | Economic indicators, transactions |
| `"hr"` | Employee records, HR data |
| `"survey"` | Questionnaire or Likert-scale data |
| `"scientific"` | Lab measurements, research data |

## Choosing a model

```{r model}
# See available models
lmi_models()

# Higher capability (slower, more expensive)
lmi_set_model("claude-opus-4-20250514")

# Faster and cheaper
lmi_set_model("claude-haiku-4-5-20251001")
```

## Inspecting the audit trail

Every imputed cell is recorded in `result$imputations`:

```{r audit}
head(result$imputations)
#   row    col original imputed confidence reasoning
# 1   2    age       NA      45         72  knn ...
# 2   5     bp       NA     130         68  mean ...
```

The `confidence` column (0–100) reflects how reliably the method can
estimate the missing value given the available data. Filter low-confidence
imputations before downstream modelling:

```{r filter}
high_conf <- result$imputations[result$imputations$confidence >= 70, ]
```
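
The filter above subsets the audit trail only; `result$data` still contains
every imputed value. One way to act on the threshold is to set
low-confidence cells back to `NA` before modelling. The sketch below uses
small stand-in objects (`imputed_df`, `audit`) that mirror the `row`, `col`,
and `confidence` columns shown earlier. With a real result you would pass
`result$data` and `result$imputations` instead, assuming `row` holds row
positions and `col` holds column names. The helper name
`revert_low_confidence()` is hypothetical.

```{r revert-low-conf}
# Stand-in for result$data (illustrative values only)
imputed_df <- data.frame(
  age = c(45, 45, 38, 62, 29),
  bp  = c(130, 140, 120, 155, 130)
)

# Stand-in for result$imputations, reduced to the columns used here
audit <- data.frame(
  row        = c(2L, 5L),
  col        = c("age", "bp"),
  confidence = c(72, 68),
  stringsAsFactors = FALSE
)

# Set imputed cells below the confidence threshold back to NA
revert_low_confidence <- function(data, audit, threshold = 70) {
  low <- audit[audit$confidence < threshold, , drop = FALSE]
  for (i in seq_len(nrow(low))) {
    data[low$row[i], low$col[i]] <- NA
  }
  data
}

cleaned <- revert_low_confidence(imputed_df, audit)
cleaned$bp   # last value is NA again; age keeps its imputed value
```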

## Large datasets

For data frames with more than 50 rows, `lmi_impute()` automatically chunks
the data for LLM calls. Adjust `max_rows` to trade off API rate limits
against context quality: smaller chunks mean more requests, each with fewer
rows of context.

```{r chunks}
result <- lmi_impute(big_df, domain = "financial", max_rows = 30L,
                     verbose = TRUE)
```
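
To make row-wise chunking concrete, the base R sketch below splits a data
frame into consecutive blocks of at most `max_rows` rows. It is not the
package's internal code; the helper name `chunk_rows()` and the `demo` data
are hypothetical.

```{r chunk-sketch}
# Hypothetical helper: split a data frame into consecutive row blocks
chunk_rows <- function(data, max_rows = 50L) {
  stopifnot(max_rows >= 1)
  if (nrow(data) == 0) return(list())
  block <- ceiling(seq_len(nrow(data)) / max_rows)
  split(data, block)
}

demo   <- data.frame(id = 1:120, x = rnorm(120))
chunks <- chunk_rows(demo, max_rows = 50L)
vapply(chunks, nrow, integer(1))   # block sizes: 50, 50, 20
```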

The offline engine processes all rows in a single pass without chunking.

## Tips for best results

- Run `lmi_diagnose()` first: it is free, instant, and shows the full
  missingness map.
- Columns with > 80 % missing are difficult for any imputation method.
  Consider dropping them or using a missingness indicator variable.
- For reproducible offline imputation, always set `seed`.
- Save the complete `lmi_result` object with `lmi_export(..., format = "rds")`
  to preserve the full audit trail alongside the imputed data.
- LLM mode outperforms statistical methods on semantically rich,
  heterogeneous data (e.g., healthcare records with meaningful column
  names). For purely numeric tabular data with clear structure,
  `"softimpute"` or `"random_forest"` offer excellent accuracy without API
  costs.
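
The advice on sparse columns can be applied with base R alone. The sketch
below computes per-column missingness, then replaces each column above the
80 % threshold with a 0/1 missingness indicator. The helper name
`flag_sparse_columns()`, the `_missing` suffix, and the `toy` data are
illustrative choices, not part of llmimpute.

```{r sparse-cols}
# Hypothetical helper: swap very sparse columns for a missingness indicator
flag_sparse_columns <- function(data, threshold = 0.80) {
  miss_rate <- colMeans(is.na(data))
  sparse    <- names(miss_rate)[miss_rate > threshold]
  for (col in sparse) {
    data[[paste0(col, "_missing")]] <- as.integer(is.na(data[[col]]))
    data[[col]] <- NULL
  }
  data
}

toy <- data.frame(
  age   = c(45, NA, 38, 62, 29, 51),
  notes = c(NA, NA, NA, NA, NA, "follow-up"),
  stringsAsFactors = FALSE
)

colMeans(is.na(toy))       # share of missing values per column
flag_sparse_columns(toy)   # `notes` is replaced by `notes_missing`
```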