| Title: | 'LLM'-Assisted Data Cleaning with Multi-Provider Support |
|---|---|
| Description: | Detects and suggests fixes for semantic inconsistencies in data frames by calling large language models (LLMs) through a unified, provider-agnostic interface. Supported providers include 'OpenAI' ('GPT-4o', 'GPT-4o-mini'), 'Anthropic' ('Claude'), 'Google' ('Gemini'), 'Groq' (free-tier 'LLaMA' and 'Mixtral'), and local 'Ollama' models. The package identifies issues that rule-based tools cannot detect: abbreviation variants, typographic errors, case inconsistencies, and malformed values. Results are returned as tidy data frames with column, row index, detected value, issue type, suggested fix, and confidence score. An offline fallback using statistical and fuzzy-matching methods is provided for use without any API key. Interactive fix application with human review is supported via 'apply_fixes()'. Methods follow de Jonge and van der Loo (2013) <https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf> and Chaudhuri et al. (2003) <doi:10.1145/872757.872796>. |
| Authors: | Sadikul Islam [aut, cre] (ORCID: <https://orcid.org/0000-0003-2924-7122>) |
| Maintainer: | Sadikul Islam <[email protected]> |
| License: | GPL-3 |
| Version: | 0.1.0 |
| Built: | 2026-05-22 09:06:43 UTC |
| Source: | https://github.com/cran/llmclean |
Detects and suggests fixes for semantic inconsistencies in data frames by calling large language models through a unified, provider-agnostic interface.
Traditional data cleaning tools such as janitor, validate,
and pointblank are rule-based: they catch type mismatches, range
violations, and schema errors. They cannot understand *meaning*.
An LLM, by contrast, understands that "NYC", "New York",
"new york", and "New Yrok" all refer to the same city.
It recognises "[email protected]" as a malformed email, and
"actve" as a typo for "active". These semantic issues are
the hardest to catch and the most damaging to downstream analyses.
GPT-4o, GPT-4o-mini (API key required). The most widely used commercial LLM with mature tooling.
Claude Sonnet, Claude Haiku (API key required). Large context window; strong instruction following.
Gemini 2.0 Flash, Gemini 1.5 Pro (API key required). Free tier available.
LLaMA 3.1, Mixtral (free-tier API key). Fastest inference; suitable for large data frames.
Any local model (llama3, mistral, phi3). Fully offline; no API key; complete data privacy.
Configure provider: set_llm_provider()
Detect semantic issues: detect_issues()
Get suggested fixes: suggest_fixes()
Apply fixes interactively: apply_fixes()
Summary report: llmclean_report()
offline_detect() provides statistical and fuzzy-matching
detection without any LLM call, using Levenshtein string distances,
frequency analysis, and regex pattern matching. Use this when no API
key is available or for data privacy reasons.
de Jonge, E. and van der Loo, M. (2013). An introduction to data cleaning with R. Statistics Netherlands Discussion Paper, The Hague. https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf
Chaudhuri, S., Ganjam, K., Ganti, V. and Motwani, R. (2003). Robust and efficient fuzzy match for online data cleaning. Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, 313–324. doi:10.1145/872757.872796
van der Loo, M.P.J. and de Jonge, E. (2018). Statistical Data Cleaning with Applications in R. John Wiley & Sons, Chichester. doi:10.1002/9781118897126
OpenAI (2024). GPT-4 Technical Report. arXiv preprint arXiv:2303.08774. https://arxiv.org/abs/2303.08774
Maintainer: Sadikul Islam [email protected] (ORCID)
Applies the suggestions from detect_issues() or
suggest_fixes() to the original data frame, either
automatically or with interactive human review of each fix.
apply_fixes(df, issues, confirm = TRUE, min_confidence = 0.7, dry_run = FALSE)apply_fixes(df, issues, confirm = TRUE, min_confidence = 0.7, dry_run = FALSE)
df |
A |
issues |
A |
confirm |
Logical. If |
min_confidence |
Numeric (0–1). Only apply fixes with confidence
at or above this threshold. Default |
dry_run |
Logical. If |
The function creates a copy of df, then iterates over each row of
issues in order of descending confidence. For each issue:
If confirm = FALSE and confidence >= min_confidence
and suggestion != "REVIEW": the value at
df[row_index, column] is replaced with suggestion.
If confirm = TRUE: the user is presented with the current
value, the suggested fix, and the context, then prompted to
accept (y), skip (n), or type a custom replacement.
A "_applied" attribute is attached to the returned data frame
recording which fixes were applied, for use by
llmclean_report().
A data.frame with the same structure as df but
with accepted fixes applied. The original df is not modified.
van der Loo, M.P.J. and de Jonge, E. (2018). Statistical Data Cleaning with Applications in R. John Wiley & Sons. doi:10.1002/9781118897126
detect_issues, suggest_fixes,
llmclean_report
set_llm_provider("offline", verbose = FALSE) data(messy_employees) issues <- offline_detect(messy_employees) # Non-interactive: apply high-confidence fixes automatically df_clean <- apply_fixes(messy_employees, issues, confirm = FALSE, min_confidence = 0.85) head(df_clean) # Dry run: see what would change plan <- apply_fixes(messy_employees, issues, dry_run = TRUE) planset_llm_provider("offline", verbose = FALSE) data(messy_employees) issues <- offline_detect(messy_employees) # Non-interactive: apply high-confidence fixes automatically df_clean <- apply_fixes(messy_employees, issues, confirm = FALSE, min_confidence = 0.85) head(df_clean) # Dry run: see what would change plan <- apply_fixes(messy_employees, issues, dry_run = TRUE) plan
Sends a compact representation of the data frame to the configured LLM provider and requests detection of semantic inconsistencies: typographic errors, abbreviation variants, case inconsistencies, malformed values, cross-field contradictions, and implausible entries. Returns a tidy tibble of detected issues with location, type, and confidence.
detect_issues( df, columns = NULL, sample_n = NULL, issue_types = c("typo", "case", "abbreviation", "format", "outlier", "contradiction", "duplicate"), context = NULL, max_tokens = 2000L, verbose = TRUE )detect_issues( df, columns = NULL, sample_n = NULL, issue_types = c("typo", "case", "abbreviation", "format", "outlier", "contradiction", "duplicate"), context = NULL, max_tokens = 2000L, verbose = TRUE )
df |
A |
columns |
Character vector or |
sample_n |
Integer or |
issue_types |
Character vector. Types of issues to look for. Any
subset of: |
context |
Character or |
max_tokens |
Integer. Maximum tokens in the LLM response.
Default |
verbose |
Logical. Print progress messages. Default |
The function constructs a structured prompt containing:
Column names and inferred types.
The first sample_n rows in CSV-like format.
Instructions requesting JSON output with one object per issue.
The LLM is instructed to return only a JSON array with fields:
column, row_index (1-based), value, issue_type,
explanation, suggestion, and confidence (0–1).
If the LLM response cannot be parsed as valid JSON, the raw response is
returned with a warning and offline_detect() is called as fallback.
Token usage note: Sending 100 rows of a 10-column data frame uses approximately 1,500–3,000 tokens. At gpt-4o-mini pricing (~$0.15/M input tokens as of 2025), this costs less than US$0.001.
A tibble with one row per detected issue and columns:
columnColumn name where the issue was found.
row_indexRow number (1-based).
valueThe problematic value as a character string.
issue_typeOne of the issue_types categories.
explanationHuman-readable explanation of the problem.
suggestionSuggested corrected value.
confidenceNumeric 0–1. LLM-assigned confidence.
providerThe LLM provider used.
modelThe specific model used.
Returns a zero-row tibble with the same columns if no issues are found.
de Jonge, E. and van der Loo, M. (2013). An introduction to data cleaning with R. Statistics Netherlands Discussion Paper. https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf
Chaudhuri, S., Ganjam, K., Ganti, V. and Motwani, R. (2003). Robust and efficient fuzzy match for online data cleaning. Proc. 2003 ACM SIGMOD, 313–324. doi:10.1145/872757.872796
set_llm_provider, suggest_fixes,
apply_fixes, offline_detect
# Offline detection (no API key needed) set_llm_provider("offline", verbose = FALSE) data(messy_employees) issues <- detect_issues(messy_employees) issues # With LLM (requires valid API key) ## Not run: set_llm_provider("openai", model = "gpt-4o-mini") issues <- detect_issues(messy_employees, context = "Employee records. Status: active/inactive.") issues # Groq free tier set_llm_provider("groq", model = "llama-3.1-8b-instant") issues <- detect_issues(messy_employees, sample_n = 50L) ## End(Not run)# Offline detection (no API key needed) set_llm_provider("offline", verbose = FALSE) data(messy_employees) issues <- detect_issues(messy_employees) issues # With LLM (requires valid API key) ## Not run: set_llm_provider("openai", model = "gpt-4o-mini") issues <- detect_issues(messy_employees, context = "Employee records. Status: active/inactive.") issues # Groq free tier set_llm_provider("groq", model = "llama-3.1-8b-instant") issues <- detect_issues(messy_employees, sample_n = 50L) ## End(Not run)
Returns the current LLM provider, model, and base URL configured by
set_llm_provider().
get_llm_provider(show_key = FALSE)get_llm_provider(show_key = FALSE)
show_key |
Logical. If |
A named list with elements provider, model,
base_url, and optionally api_key.
set_llm_provider("offline", verbose = FALSE) get_llm_provider()set_llm_provider("offline", verbose = FALSE) get_llm_provider()
Produces a tidy summary of detected issues and applied fixes, grouped by column and issue type. Suitable for inclusion in a data-quality audit log or reproducible report. Optionally prints a formatted table to the console.
llmclean_report(df_original, df_cleaned, issues, print = TRUE)llmclean_report(df_original, df_cleaned, issues, print = TRUE)
df_original |
A |
df_cleaned |
A |
issues |
A |
print |
Logical. If |
The report computes:
Number of issues detected per column and issue type.
Number of fixes applied vs skipped.
Cell-level change summary (original value -> corrected value).
Provider and model used.
Data dimensions before and after cleaning.
The applied-fix provenance is read from the "_applied" attribute
set by apply_fixes(). If the attribute is absent (e.g.
when df_cleaned was not produced by apply_fixes()), the
report uses the full issues tibble.
Invisibly returns a named list with three elements:
summary (tibble: column / issue_type / n_detected / n_applied),
changes (tibble: column / row_index / original / corrected),
metadata (list: provider, model, n_total, n_applied, n_skipped).
de Jonge, E. and van der Loo, M. (2013). An introduction to data cleaning with R. Statistics Netherlands Discussion Paper. https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf
set_llm_provider("offline", verbose = FALSE) data(messy_employees) issues <- offline_detect(messy_employees) df_clean <- apply_fixes(messy_employees, issues, confirm = FALSE, min_confidence = 0.85) rpt <- llmclean_report(messy_employees, df_clean, issues) rpt$summaryset_llm_provider("offline", verbose = FALSE) data(messy_employees) issues <- offline_detect(messy_employees) df_clean <- apply_fixes(messy_employees, issues, confirm = FALSE, min_confidence = 0.85) rpt <- llmclean_report(messy_employees, df_clean, issues) rpt$summary
A hypothetical data frame of 20 employee records containing deliberate
data quality issues across all common inconsistency types: capitalisation
variants, typos, duplicate entries, malformed email addresses, out-of-range
numeric values, and cross-field redundancies. Designed to illustrate the
full range of issues detectable by detect_issues() and
offline_detect().
messy_employeesmessy_employees
A data.frame with 20 rows and 8 variables:
Integer. Unique employee identifier (1–20).
Character. Employee full name. Contains mixed-case
inconsistencies (e.g. "Alice Johnson" vs "alice johnson"
vs "ALICE JOHNSON").
Character. Department name. Contains case variants
("Finance" vs "finance"), abbreviations ("IT"
vs "I.T."), synonym variants ("HR" vs
"Human Resources"), and a typo ("Finanace").
Character. Email address. Contains malformed values:
double @, missing TLD, and leading dot.
Integer. Age in years. Contains two impossible values:
-5 and 150.
Numeric. Annual salary in USD. Contains a data entry
error (999999) as an implausible outlier.
Character. Employment status (active /
inactive). Contains case variants and two typos
("actve").
Character. Hire date. Contains a mixed-format date
("2015/07/22" vs "2018-03-15" ISO-8601 format).
All data are entirely hypothetical and generated for illustrative purposes. The inconsistency types are based on the taxonomy in de Jonge and van der Loo (2013) and reflect common real-world data entry errors documented in Muller and Freytag (2003).
Hypothetical data generated for illustration. See
data-raw/generate_datasets.R.
de Jonge, E. and van der Loo, M. (2013). An introduction to data cleaning with R. Statistics Netherlands Discussion Paper. https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf
Muller, H. and Freytag, J.C. (2003). Problems, methods, and challenges in comprehensive data cleansing. Technical Report, Humboldt University Berlin, HUB-IB-164.
messy_survey, detect_issues,
offline_detect
data(messy_employees) str(messy_employees) # Quick overview of known issues table(messy_employees$status) # case inconsistency table(messy_employees$department) # variant forms messy_employees$age[messy_employees$age < 0 | messy_employees$age > 100]data(messy_employees) str(messy_employees) # Quick overview of known issues table(messy_employees$status) # case inconsistency table(messy_employees$department) # variant forms messy_employees$age[messy_employees$age < 0 | messy_employees$age > 100]
A hypothetical survey data frame of 15 respondents from 5 countries
containing systematic data quality issues typical of free-text survey
responses: country name variants ("USA" vs "United States"),
satisfaction rating inconsistencies, spelling errors, and numeric outliers.
messy_surveymessy_survey
A data.frame with 15 rows and 5 variables:
Integer. Unique respondent identifier (1–15).
Character. Respondent country. Contains abbreviations
("USA", "UK"), synonyms ("Deutschland" for
"Germany"), case variants, and a typo ("Japn").
Character. Satisfaction rating (5-point Likert
scale). Contains case inconsistencies and typos
("Satisified", "Nutral", "Very Dissatified").
Character. Age group bracket. Clean column for comparison.
Numeric. Reported annual income in USD. Contains a
negative value (-500) and an implausible outlier
(999999).
All data are entirely hypothetical. Survey inconsistency patterns are inspired by common free-text standardisation challenges described in Chaudhuri et al. (2003).
Hypothetical data generated for illustration.
See data-raw/generate_datasets.R.
Chaudhuri, S., Ganjam, K., Ganti, V. and Motwani, R. (2003). Robust and efficient fuzzy match for online data cleaning. Proc. 2003 ACM SIGMOD, 313–324. doi:10.1145/872757.872796
messy_employees, detect_issues
data(messy_survey) str(messy_survey) table(messy_survey$country) # variant forms table(messy_survey$satisfaction) # case + typosdata(messy_survey) str(messy_survey) table(messy_survey$country) # variant forms table(messy_survey$satisfaction) # case + typos
Detects data quality issues using statistical and fuzzy-matching methods, requiring no LLM API call or internet connection. Uses Levenshtein string distances for near-duplicate category detection, frequency analysis for rare/singleton values, regex patterns for format validation, and IQR-based outlier detection for numeric variables.
offline_detect( df, columns = NULL, issue_types = c("typo", "case", "abbreviation", "format", "outlier", "duplicate"), max_edit_distance = 2L, min_freq = 2L, outlier_iqr_mult = 3 )offline_detect( df, columns = NULL, issue_types = c("typo", "case", "abbreviation", "format", "outlier", "duplicate"), max_edit_distance = 2L, min_freq = 2L, outlier_iqr_mult = 3 )
df |
A |
columns |
Character vector or |
issue_types |
Character vector. Subset of |
max_edit_distance |
Integer. Maximum Levenshtein edit distance
between two values to flag as potential near-duplicates (typos or
abbreviations). Default |
min_freq |
Integer. Values appearing fewer than this many times in
a column are flagged as potentially erroneous singletons. Default
|
outlier_iqr_mult |
Numeric. Multiplier for the IQR-based outlier
fence: values outside |
Methods used:
Levenshtein edit distance
(Levenshtein, 1966) between all pairs of unique values in each column.
Pairs within max_edit_distance edits but with different
capitalisation-normalised forms are flagged as potential typos.
Implemented via utils::adist().
Detects columns where the same word appears
in multiple capitalisation forms (e.g. "active", "Active",
"ACTIVE").
Regex patterns for email addresses, phone numbers (international), URLs, and ISO dates. Any value not matching the majority pattern is flagged.
Modified Tukey outer fence: flags values beyond
Q1 - k*IQR or Q3 + k*IQR where =
outlier_iqr_mult.
Values appearing only once in a column with fewer than 10 unique values are flagged (likely erroneous entries).
The offline detector cannot catch semantic issues (e.g. knowing
that "NYC" and "New York" are the same city) — that requires
an LLM. Use detect_issues() with a configured LLM provider
for full semantic detection.
A tibble with the same structure as detect_issues():
columns column, row_index, value, issue_type,
explanation, suggestion, confidence.
The provider and model columns are set to
"offline" and "statistical" respectively.
Levenshtein, V.I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 707–710.
Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley, Reading, MA. ISBN: 978-0-201-07616-5.
Chaudhuri, S., et al. (2003). Robust and efficient fuzzy match for online data cleaning. Proc. 2003 ACM SIGMOD, 313–324. doi:10.1145/872757.872796
detect_issues, set_llm_provider
data(messy_employees) issues <- offline_detect(messy_employees) issues # Only look for case and format issues offline_detect(messy_employees, issue_types = c("case", "format"))data(messy_employees) issues <- offline_detect(messy_employees) issues # Only look for case and format issues offline_detect(messy_employees, issue_types = c("case", "format"))
Sets the LLM provider, API key, and model to use for all subsequent
detect_issues() and suggest_fixes() calls.
Configuration is stored in a package-level environment and persists for
the R session. Supports OpenAI, Anthropic (Claude), Google (Gemini),
Groq, and local Ollama models.
set_llm_provider( provider = "offline", api_key = NULL, model = NULL, base_url = NULL, verbose = TRUE )set_llm_provider( provider = "offline", api_key = NULL, model = NULL, base_url = NULL, verbose = TRUE )
provider |
Character. LLM provider name. One of:
|
api_key |
Character or |
model |
Character or
|
base_url |
Character or |
verbose |
Logical. If |
Getting API keys (all free tiers available):
https://ollama.com/ — run ollama pull llama3
Best practice: Store API keys in environment variables rather
than in scripts. Add to your .Renviron file:
OPENAI_API_KEY=sk-... ANTHROPIC_API_KEY=sk-ant-... GROQ_API_KEY=gsk_... GOOGLE_API_KEY=AIza...
Invisibly returns a list with the configured provider settings.
OpenAI (2024). GPT-4 Technical Report. arXiv preprint arXiv:2303.08774. https://arxiv.org/abs/2303.08774
Anthropic (2024). Claude Model Overview. https://platform.claude.com/docs/en/docs/about-claude/models/overview
get_llm_provider, detect_issues
# Use offline mode (no API key needed) set_llm_provider("offline") # Configure OpenAI (reads key from env var OPENAI_API_KEY) ## Not run: set_llm_provider("openai", model = "gpt-4o-mini") # Configure Anthropic explicitly set_llm_provider("anthropic", api_key = Sys.getenv("ANTHROPIC_API_KEY"), model = "claude-haiku-4-5-20251001") # Configure free Groq set_llm_provider("groq", model = "llama-3.1-8b-instant") # Local Ollama (no key needed) set_llm_provider("ollama", model = "llama3") ## End(Not run)# Use offline mode (no API key needed) set_llm_provider("offline") # Configure OpenAI (reads key from env var OPENAI_API_KEY) ## Not run: set_llm_provider("openai", model = "gpt-4o-mini") # Configure Anthropic explicitly set_llm_provider("anthropic", api_key = Sys.getenv("ANTHROPIC_API_KEY"), model = "claude-haiku-4-5-20251001") # Configure free Groq set_llm_provider("groq", model = "llama-3.1-8b-instant") # Local Ollama (no key needed) set_llm_provider("ollama", model = "llama3") ## End(Not run)
Takes the issues tibble from detect_issues() or
offline_detect() and optionally sends it back to the LLM
with the full data context to obtain higher-quality, rank-ordered
suggestions for each issue. Useful when the initial detection pass
returned suggestion = "REVIEW" or low-confidence suggestions.
suggest_fixes( df, issues, n_alternatives = 2L, filter_confidence = 0.8, verbose = TRUE )suggest_fixes( df, issues, n_alternatives = 2L, filter_confidence = 0.8, verbose = TRUE )
df |
A |
issues |
A |
n_alternatives |
Integer. Number of alternative suggestions per
issue (in addition to the primary). Default |
filter_confidence |
Numeric (0–1). Only re-query issues with
confidence below this threshold. Default |
verbose |
Logical. Print progress messages. Default |
If the current provider is "offline", this function returns the
input issues tibble unchanged with a message, since no LLM is
available to enrich suggestions.
The function sends each low-confidence issue to the LLM along with the
surrounding rows for context, and requests ranked alternatives. Results
are merged back into the issues tibble, replacing the original
suggestion with the top-ranked alternative and adding
alternatives as a comma-separated string column.
The input issues tibble with two additional columns:
alternatives (comma-separated list of alternative suggestions)
and confidence_revised (updated confidence after re-querying).
de Jonge, E. and van der Loo, M. (2013). An introduction to data cleaning with R. Statistics Netherlands Discussion Paper. https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf
set_llm_provider("offline", verbose = FALSE) data(messy_employees) issues <- offline_detect(messy_employees) # Offline: suggest_fixes returns issues unchanged enriched <- suggest_fixes(messy_employees, issues) enriched ## Not run: set_llm_provider("openai", model = "gpt-4o-mini") issues <- detect_issues(messy_employees) enriched <- suggest_fixes(messy_employees, issues, n_alternatives = 3L) ## End(Not run)set_llm_provider("offline", verbose = FALSE) data(messy_employees) issues <- offline_detect(messy_employees) # Offline: suggest_fixes returns issues unchanged enriched <- suggest_fixes(messy_employees, issues) enriched ## Not run: set_llm_provider("openai", model = "gpt-4o-mini") issues <- detect_issues(messy_employees) enriched <- suggest_fixes(messy_employees, issues, n_alternatives = 3L) ## End(Not run)