LLM-Assisted Data Cleaning with llmclean

Introduction

Data frames collected from surveys, administrative records, or manual entry almost always contain semantic inconsistencies that rule-based tools cannot catch. The llmclean package addresses this by sending a compact representation of your data to a large language model (LLM) and requesting structured, human-readable suggestions for each issue found.

The key distinction from existing tools:

Tool What it catches
janitor Column name formatting
validate / pointblank Rule-based type/range checks
llmclean Semantic meaning: typos, abbreviations, case, malformed formats, cross-field contradictions

No existing CRAN package provides the detect → explain → suggest → apply workflow specialised for semantic data frame inconsistencies.

Supported Providers

openai     →  GPT-4o, GPT-4o-mini     (API key required)
anthropic  →  Claude Haiku / Sonnet   (API key required)
google     →  Gemini 2.0 Flash        (free tier available)
groq       →  LLaMA 3.1, Mixtral      (free tier available)
ollama     →  Any local model         (no key, fully offline)
offline    →  Statistical fallback    (no API, no internet)

Quick Start

library(llmclean)
library(dplyr)

Step 1 — Configure the provider

For this vignette we use offline mode, which requires no API key:

set_llm_provider("offline")

With a real LLM provider (not run here):

# Free Groq tier — fastest inference
set_llm_provider("groq",
                 api_key = Sys.getenv("GROQ_API_KEY"),
                 model   = "llama-3.1-8b-instant")

# OpenAI
set_llm_provider("openai",
                 api_key = Sys.getenv("OPENAI_API_KEY"),
                 model   = "gpt-4o-mini")

# Anthropic Claude
set_llm_provider("anthropic",
                 api_key = Sys.getenv("ANTHROPIC_API_KEY"),
                 model   = "claude-haiku-4-5-20251001")

# Local Ollama (no key needed, model must be installed)
set_llm_provider("ollama", model = "llama3")

Step 2 — Inspect the built-in messy datasets

data(messy_employees)
data(messy_survey)

cat("messy_employees:", nrow(messy_employees), "rows x",
    ncol(messy_employees), "cols\n\n")
#> messy_employees: 20 rows x 8 cols

# Peek at known issues
cat("Status variants:\n"); print(table(messy_employees$status))
#> Status variants:
#> 
#>   active   Active   ACTIVE    actve inactive Inactive INACTIVE 
#>        5        5        2        2        4        1        1
cat("\nDepartment variants:\n"); print(table(messy_employees$department))
#> 
#> Department variants:
#> 
#>        Finanace         finance         Finance              hr              HR 
#>               1               1               4               1               3 
#> Human Resources            I.T.              IT       marketing       Marketing 
#>               1               1               5               1               1 
#>       MARKETING 
#>               1
cat("\nAge outliers:", messy_employees$age[messy_employees$age < 0 |
                                             messy_employees$age > 100], "\n")
#> 
#> Age outliers: -5 150

The data contains at least six types of inconsistency across eight columns — typical of manually entered HR records.


Stage 1 — Detect Issues

detect_issues() calls the LLM (or offline fallback) and returns a tidy tibble with one row per detected problem.

issues <- detect_issues(messy_employees,
                         context = "HR employee records. Status values
                                    should be 'active' or 'inactive'.")
cat("Issues found:", nrow(issues), "\n\n")
#> Issues found: 102
print(issues[, c("column","row_index","value","issue_type",
                  "suggestion","confidence")])
#> # A tibble: 102 × 6
#>    column     row_index value   issue_type suggestion confidence
#>    <chr>          <int> <chr>   <chr>      <chr>           <dbl>
#>  1 age               10 150     outlier    REVIEW           0.75
#>  2 department         1 Finance case       finance          0.9 
#>  3 department         1 Finance format     REVIEW           0.8 
#>  4 department         2 finance format     REVIEW           0.8 
#>  5 department         3 Finance case       finance          0.9 
#>  6 department         3 Finance format     REVIEW           0.8 
#>  7 department         4 HR      case       hr               0.9 
#>  8 department         4 HR      typo       it               0.76
#>  9 department         4 HR      format     REVIEW           0.8 
#> 10 department         5 hr      typo       it               0.76
#> # ℹ 92 more rows

Issue type breakdown

# Summary by type
as.data.frame(table(Type = issues$issue_type)) |>
  dplyr::arrange(dplyr::desc(Freq))
#>           Type Freq
#> 1       format   61
#> 2         case   19
#> 3         typo   13
#> 4 abbreviation    7
#> 5      outlier    2

Case inconsistencies

# Show all case inconsistencies found
issues[issues$issue_type == "case",
       c("column","row_index","value","suggestion","confidence")]
#> # A tibble: 19 × 5
#>    column     row_index value         suggestion    confidence
#>    <chr>          <int> <chr>         <chr>              <dbl>
#>  1 department         1 Finance       finance              0.9
#>  2 department         3 Finance       finance              0.9
#>  3 department         4 HR            hr                   0.9
#>  4 department        10 Marketing     marketing            0.9
#>  5 department        12 MARKETING     marketing            0.9
#>  6 department        13 Finance       finance              0.9
#>  7 department        14 Finance       finance              0.9
#>  8 department        19 HR            hr                   0.9
#>  9 department        20 HR            hr                   0.9
#> 10 name               1 Alice Johnson alice johnson        0.9
#> 11 status             1 Active        active               0.9
#> 12 status             3 ACTIVE        active               0.9
#> 13 status             5 Inactive      inactive             0.9
#> 14 status             7 INACTIVE      inactive             0.9
#> 15 status             8 Active        active               0.9
#> 16 status            11 Active        active               0.9
#> 17 status            13 ACTIVE        active               0.9
#> 18 status            15 Active        active               0.9
#> 19 status            18 Active        active               0.9

Typos and near-duplicates (Levenshtein)

The offline detector uses utils::adist() to compute Levenshtein edit distances between all pairs of unique values in each column. Values within 2 edits of a more frequent value are flagged as typos.

issues[issues$issue_type == "typo",
       c("column","row_index","value","suggestion","explanation")]
#> # A tibble: 13 × 5
#>    column     row_index value             suggestion        explanation         
#>    <chr>          <int> <chr>             <chr>             <chr>               
#>  1 department         4 HR                it                'HR' is 2 edit(s) f…
#>  2 department         5 hr                it                'hr' is 2 edit(s) f…
#>  3 department        15 Finanace          finance           'Finanace' is 1 edi…
#>  4 department        19 HR                it                'HR' is 2 edit(s) f…
#>  5 department        20 HR                it                'HR' is 2 edit(s) f…
#>  6 email              7 [email protected] [email protected]  '[email protected]'…
#>  7 email             10 [email protected]   [email protected]   '[email protected]' i…
#>  8 email             13 [email protected] [email protected] '[email protected]'…
#>  9 hire_date          8 2021-01-18        2021-06-14        '2021-01-18' is 2 e…
#> 10 hire_date         13 2011-07-15        2018-03-15        '2011-07-15' is 2 e…
#> 11 hire_date         20 2018-09-25        2018-03-15        '2018-09-25' is 2 e…
#> 12 status             9 actve             active            'actve' is 1 edit(s…
#> 13 status            19 actve             active            'actve' is 1 edit(s…

Malformed email addresses

issues[issues$issue_type == "format",
       c("column","row_index","value","suggestion")]
#> # A tibble: 61 × 4
#>    column     row_index value     suggestion
#>    <chr>          <int> <chr>     <chr>     
#>  1 department         1 Finance   REVIEW    
#>  2 department         2 finance   REVIEW    
#>  3 department         3 Finance   REVIEW    
#>  4 department         4 HR        REVIEW    
#>  5 department         5 hr        REVIEW    
#>  6 department         7 IT        REVIEW    
#>  7 department         8 I.T.      REVIEW    
#>  8 department         9 IT        REVIEW    
#>  9 department        10 Marketing REVIEW    
#> 10 department        11 marketing REVIEW    
#> # ℹ 51 more rows

Numeric outliers (Tukey outer fence)

issues[issues$issue_type == "outlier",
       c("column","row_index","value","explanation")]
#> # A tibble: 2 × 4
#>   column row_index value  explanation                                           
#>   <chr>      <int> <chr>  <chr>                                                 
#> 1 age           10 150    Value 150 is outside the Tukey outer fence [-7, 77] (…
#> 2 salary        10 999999 Value 999999 is outside the Tukey outer fence [2500, …

Stage 2 — Suggest Fixes

suggest_fixes() can enrich low-confidence suggestions by re-querying the LLM with surrounding row context. In offline mode it returns the issues unchanged.

enriched <- suggest_fixes(messy_employees, issues)
cat("Enriched columns:", paste(names(enriched), collapse = ", "), "\n")
#> Enriched columns: column, row_index, value, issue_type, explanation, suggestion, confidence, provider, model, alternatives, confidence_revised

# Show suggestions for status column
enriched[enriched$column == "status",
          c("row_index","value","suggestion","alternatives","confidence_revised")]
#> # A tibble: 37 × 5
#>    row_index value    suggestion alternatives confidence_revised
#>        <int> <chr>    <chr>      <chr>                     <dbl>
#>  1         1 Active   active     active                     0.9 
#>  2         1 Active   REVIEW     REVIEW                     0.8 
#>  3         2 active   REVIEW     REVIEW                     0.8 
#>  4         3 ACTIVE   active     active                     0.9 
#>  5         3 ACTIVE   REVIEW     REVIEW                     0.8 
#>  6         4 active   REVIEW     REVIEW                     0.8 
#>  7         5 Inactive inactive   inactive                   0.9 
#>  8         5 Inactive active     active                     0.76
#>  9         5 Inactive REVIEW     REVIEW                     0.8 
#> 10         6 inactive active     active                     0.76
#> # ℹ 27 more rows

Stage 3 — Apply Fixes

apply_fixes() has two modes:

  • confirm = FALSE: apply all fixes above min_confidence automatically (batch mode)
  • confirm = TRUE: interactive review, one fix at a time
# Non-interactive: apply fixes with confidence >= 0.88
df_clean <- apply_fixes(
  messy_employees,
  enriched,
  confirm        = FALSE,
  min_confidence = 0.88
)

cat("Status before:", paste(sort(unique(messy_employees$status)), collapse=", "), "\n")
#> Status before: active, Active, ACTIVE, actve, inactive, Inactive, INACTIVE
cat("Status after: ", paste(sort(unique(df_clean$status)), collapse=", "), "\n\n")
#> Status after:  active, inactive

cat("Department before:",
    paste(sort(unique(messy_employees$department)), collapse=", "), "\n")
#> Department before: Finanace, finance, Finance, hr, HR, Human Resources, I.T., IT, marketing, Marketing, MARKETING
cat("Department after: ",
    paste(sort(unique(df_clean$department)), collapse=", "), "\n")
#> Department after:  finance, hr, Human Resources, I.T., IT, marketing

Dry run mode

Before applying anything, preview what would change:

plan <- apply_fixes(messy_employees, enriched, dry_run = TRUE)
cat("Planned changes:\n")
#> Planned changes:
print(plan[, c("column","row_index","current_value","suggestion","issue_type")])
#> # A tibble: 102 × 5
#>    column     row_index current_value suggestion issue_type
#>    <chr>          <int> <chr>         <chr>      <chr>     
#>  1 age               10 150           REVIEW     outlier   
#>  2 department         1 Finance       finance    case      
#>  3 department         1 Finance       REVIEW     format    
#>  4 department         2 finance       REVIEW     format    
#>  5 department         3 Finance       finance    case      
#>  6 department         3 Finance       REVIEW     format    
#>  7 department         4 HR            hr         case      
#>  8 department         4 HR            it         typo      
#>  9 department         4 HR            REVIEW     format    
#> 10 department         5 hr            it         typo      
#> # ℹ 92 more rows

Stage 4 — Offline Detection (No API Key)

offline_detect() runs without any LLM using three statistical methods:

  1. Levenshtein distance (Chaudhuri et al., 2003) for typo detection
  2. Regex patterns for email, date, and phone format validation
  3. Tukey outer fence (Tukey, 1977) for numeric outlier detection
# Works completely offline
offline_issues <- offline_detect(
  messy_survey,
  issue_types      = c("case","typo","format","outlier"),
  max_edit_distance = 2L
)

cat("Survey issues found:", nrow(offline_issues), "\n\n")
#> Survey issues found: 57
offline_issues[, c("column","value","issue_type","suggestion","confidence")]
#> # A tibble: 57 × 5
#>    column    value issue_type suggestion confidence
#>    <chr>     <chr> <chr>      <chr>           <dbl>
#>  1 age_group 18-25 format     REVIEW           0.8 
#>  2 age_group 18-25 format     REVIEW           0.8 
#>  3 age_group 18-25 format     REVIEW           0.8 
#>  4 age_group 26-35 typo       36-45            0.76
#>  5 age_group 26-35 format     REVIEW           0.8 
#>  6 age_group 26-35 typo       36-45            0.76
#>  7 age_group 26-35 format     REVIEW           0.8 
#>  8 age_group 26-35 typo       36-45            0.76
#>  9 age_group 26-35 format     REVIEW           0.8 
#> 10 age_group 36-45 typo       46-55            0.76
#> # ℹ 47 more rows

Stage 5 — Summary Report

llmclean_report() produces a structured audit log suitable for reproducible data quality documentation.

rpt <- llmclean_report(messy_employees, df_clean, issues)
#> 
#> === llmclean Data Quality Report ===
#> Provider    : offline (statistical)
#> Data        : 20 rows x 8 columns
#> Issues found: 102
#> Fixes applied: 22 / skipped: 80
#> 
#> Issues by column and type:
#>      column   issue_type n_detected n_applied
#>         age      outlier          1         0
#>  department abbreviation          1         0
#>  department         case          9         9
#>  department       format         19         0
#>  department         typo          5         1
#>       email       format          2         0
#>       email         typo          3         0
#>   hire_date       format         20         0
#>   hire_date         typo          3         0
#>        name         case          1         1
#>      salary      outlier          1         0
#>      status abbreviation          6         0
#>      status         case          9         9
#>      status       format         20         0
#>      status         typo          2         2
#> 
#> Fixes applied:
#> # A tibble: 22 × 4
#>    column     row_index original  corrected
#>    <chr>          <int> <chr>     <chr>    
#>  1 department         1 Finance   finance  
#>  2 department         3 Finance   finance  
#>  3 department         4 HR        hr       
#>  4 department        10 Marketing marketing
#>  5 department        12 MARKETING marketing
#>  6 department        13 Finance   finance  
#>  7 department        14 Finance   finance  
#>  8 department        15 Finanace  finance  
#>  9 department        19 HR        hr       
#> 10 department        20 HR        hr       
#> # ℹ 12 more rows
#> =====================================
cat("Summary by column and type:\n")
#> Summary by column and type:
print(rpt$summary)
#> # A tibble: 15 × 4
#>    column     issue_type   n_detected n_applied
#>    <fct>      <fct>             <int>     <int>
#>  1 age        outlier               1         0
#>  2 department abbreviation          1         0
#>  3 department case                  9         9
#>  4 department format               19         0
#>  5 department typo                  5         1
#>  6 email      format                2         0
#>  7 email      typo                  3         0
#>  8 hire_date  format               20         0
#>  9 hire_date  typo                  3         0
#> 10 name       case                  1         1
#> 11 salary     outlier               1         0
#> 12 status     abbreviation          6         0
#> 13 status     case                  9         9
#> 14 status     format               20         0
#> 15 status     typo                  2         2

cat("\nCell-level changes (first 8):\n")
#> 
#> Cell-level changes (first 8):
print(head(rpt$changes, 8))
#> # A tibble: 8 × 4
#>   column     row_index original  corrected
#>   <chr>          <int> <chr>     <chr>    
#> 1 department         1 Finance   finance  
#> 2 department         3 Finance   finance  
#> 3 department         4 HR        hr       
#> 4 department        10 Marketing marketing
#> 5 department        12 MARKETING marketing
#> 6 department        13 Finance   finance  
#> 7 department        14 Finance   finance  
#> 8 department        15 Finanace  finance

cat("\nMetadata:\n")
#> 
#> Metadata:
cat("  Provider  :", rpt$metadata$provider, "\n")
#>   Provider  : offline
cat("  Model     :", rpt$metadata$model, "\n")
#>   Model     : statistical
cat("  Detected  :", rpt$metadata$n_total, "\n")
#>   Detected  : 102
cat("  Applied   :", rpt$metadata$n_applied, "\n")
#>   Applied   : 22

Complete Workflow (One Pipeline)

library(llmclean)

# 1. Configure provider (use Groq free tier)
set_llm_provider("groq",
                 api_key = Sys.getenv("GROQ_API_KEY"),
                 model   = "llama-3.1-8b-instant")

# 2. Load data
data(messy_employees)

# 3. Detect semantic issues
issues <- detect_issues(
  messy_employees,
  context = "Employee records. Status: active/inactive. Age: 18-70."
)

# 4. Enrich low-confidence suggestions
enriched <- suggest_fixes(messy_employees, issues, n_alternatives = 2L)

# 5. Apply fixes non-interactively
df_clean <- apply_fixes(messy_employees, enriched,
                         confirm = FALSE, min_confidence = 0.80)

# 6. Generate audit report
llmclean_report(messy_employees, df_clean, issues)

API Key Storage Best Practice

Never hardcode API keys in scripts. Store them in .Renviron:

# ~/.Renviron
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GROQ_API_KEY=gsk_...
GOOGLE_API_KEY=AIza...

Then retrieve with Sys.getenv("OPENAI_API_KEY"). The usethis package provides usethis::edit_r_environ() to open this file.


Session Information

sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.2.1    llmclean_0.1.1 rmarkdown_2.31
#> 
#> loaded via a namespace (and not attached):
#>  [1] vctrs_0.7.3      cli_3.6.6        knitr_1.51       rlang_1.2.0     
#>  [5] xfun_0.58        otel_0.2.0       generics_0.1.4   jsonlite_2.0.0  
#>  [9] glue_1.8.1       buildtools_1.0.0 htmltools_0.5.9  maketools_1.3.2 
#> [13] sys_3.4.3        sass_0.4.10      tibble_3.3.1     evaluate_1.0.5  
#> [17] jquerylib_0.1.4  fastmap_1.2.0    yaml_2.3.12      lifecycle_1.0.5 
#> [21] compiler_4.6.0   pkgconfig_2.0.3  digest_0.6.39    R6_2.6.1        
#> [25] utf8_1.2.6       tidyselect_1.2.1 pillar_1.11.1    magrittr_2.0.5  
#> [29] bslib_0.11.0     withr_3.0.2      tools_4.6.0      cachem_1.1.0

References

Chaudhuri, S., Ganjam, K., Ganti, V. and Motwani, R. (2003). Robust and efficient fuzzy match for online data cleaning. Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, 313–324. https://doi.org/10.1145/872757.872796

de Jonge, E. and van der Loo, M. (2013). An introduction to data cleaning with R. Statistics Netherlands Discussion Paper. https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf

Levenshtein, V.I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 707–710.

Müller, H. and Freytag, J.C. (2003). Problems, methods, and challenges in comprehensive data cleansing. Technical Report HUB-IB-164, Humboldt University Berlin.

Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley. ISBN: 978-0-201-07616-5.

van der Loo, M.P.J. and de Jonge, E. (2018). Statistical Data Cleaning with Applications in R. John Wiley & Sons. https://doi.org/10.1002/9781118897126