--- title: "Quick Start" author: "Gilles Colling" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Quick Start} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 4 ) library(joinspy) ``` An R join returns a well-formed data frame even when half the keys fail to match. The row count is wrong, nothing warns, and the problem surfaces later in a plot or a summary table that doesn't add up. The causes are usually mundane: a trailing space in a key column, a case difference between two source systems, a duplicate key nobody expected. None of these show up in `str()` or `head()`, and all of them silently change what a join returns. joinspy works in three stages. First, diagnose: `join_spy()` inspects the key columns of two tables before they are joined, measures the overlap, predicts the row count of every join type, and reports the string-level problems that block matches (whitespace, case, encoding, typos). Second, repair: `join_repair()` fixes the mechanical problems in one call, with a dry-run mode to preview the changes. Third, enforce: `join_strict()` lets us declare the cardinality we believe the data has (`1:1`, `1:n`, `n:1`, `n:m`) and errors if the data violates it, so the output row count is predictable by construction. Around these three sit smaller tools: a quick pass/fail gate (`key_check()`), post-hoc forensics for joins that already happened (`join_explain()`, `join_diff()`), a row-explosion warning (`check_cartesian()`), multi-step chain analysis (`analyze_join_chain()`), and logging for pipelines. joinspy performs the actual join through the engine that matches the data: plain data frames go through base `merge()`, tibbles through dplyr, and data.tables through data.table. The diagnostic layer runs in front of whichever engine is used, so the join semantics we already rely on stay in place and adopting joinspy amounts to changing a function name. The `*_join_spy()` wrappers drop in wherever a join already lives. This vignette is the tour: a taste of every feature, with pointers to the dedicated vignettes for depth. `vignette("why-keys-dont-match")` covers string forensics, `vignette("common-issues")` catalogues problems and fixes, `vignette("backends")` covers the three join engines, and `vignette("production")` covers logging and pipeline patterns. ## String Diagnostics The keys *look* identical but differ at the byte level: a trailing space, a case mismatch, an invisible Unicode character. These differences don't show up in `str()` or `summary()` output, but they determine whether a join matches. `join_spy()` checks every key column on both sides and reports what it finds, grouped by issue type and tagged with a severity. This section walks through the categories one at a time; `vignette("why-keys-dont-match")` treats each in depth. ### Whitespace Trailing and leading whitespace is the most common reason keys fail to match. A column exported from Excel might contain `"London "` alongside `"London"`, and R treats those as distinct values. ```{r} sales <- data.frame( city = c("London", "Paris ", " Berlin", "Tokyo"), revenue = c(500, 300, 450, 600), stringsAsFactors = FALSE ) cities <- data.frame( city = c("London", "Paris", "Berlin", "Tokyo", "Madrid"), country = c("UK", "France", "Germany", "Japan", "Spain"), stringsAsFactors = FALSE ) join_spy(sales, cities, by = "city") ``` The report flags the whitespace problems and shows which values are affected. `"Paris "` and `" Berlin"` both fail to match, dropping the match rate from 100% to 50%. The match rate is computed on the left table's unique keys: of the four distinct cities in `sales`, two find a counterpart in `cities`. The damaged keys also surface a second way, as near matches: `"Paris "` sits one edit away from `"Paris"`, so the report lists the pair as a possible typo even before we know the cause is whitespace. Whitespace problems compound with multi-column keys: ```{r} sales2 <- data.frame( city = c("London", "Paris ", "Berlin"), district = c("West ", "Central", " Mitte"), revenue = c(500, 300, 450), stringsAsFactors = FALSE ) districts <- data.frame( city = c("London", "Paris", "Berlin"), district = c("West", "Central", "Mitte"), pop = c(200000, 350000, 180000), stringsAsFactors = FALSE ) join_spy(sales2, districts, by = c("city", "district")) ``` London fails on the district column, Paris on the city column. A row needs all key columns to match, so a single trailing space in any column breaks the join. For composite keys the report adds a per-column breakdown: each key column gets its own match rate, and the column with the lowest rate is called out as the likely culprit. Here the district column matches worse than the city column, which tells us where to look first. That breakdown lives in the report as `multicolumn_analysis`, with the weakest column named in `problem_column`. ### A Quick Gate: key_check() For a pass/fail check without the full report, we can use `key_check()`: ```{r} key_check(sales, cities, by = "city") ``` `key_check()` runs the duplicate, NA, whitespace, and case checks (a subset of what `join_spy()` examines) and condenses the answer to a single logical, returned invisibly. With `warn = TRUE` (the default) it also prints which issues it found, one line per issue. Setting `warn = FALSE` silences the printing entirely, which is the mode we want inside scripts: ```{r} ok <- key_check(sales, cities, by = "city", warn = FALSE) ok ``` Because the return value is a plain logical, it slots into `stopifnot()` or conditional logic. A line like `stopifnot(key_check(x, y, by = "id", warn = FALSE))` turns key quality into a hard precondition: the script halts at the join site instead of producing a silently wrong result. ### Keys with Different Names Often the key columns carry different names in the two tables: the orders table calls it `id`, the CRM calls it `customer_id`. Renaming a column just to run a diagnostic is busywork, so every joinspy function that takes `by` also accepts the named-vector form familiar from dplyr: ```{r} sales_reps <- data.frame( id = c("C1", "C2", "C3"), total = c(100, 250, 175), stringsAsFactors = FALSE ) crm <- data.frame( customer_id = c("C1", "C2", "C4"), owner = c("Dana", "Eli", "Fay"), stringsAsFactors = FALSE ) join_spy(sales_reps, crm, by = c("id" = "customer_id")) ``` In the report header, the mapping shows as `id = customer_id`, and every check runs on `id` in the left table against `customer_id` in the right. The same spelling works in `key_check()`, `join_repair()`, `join_strict()`, and the wrappers, so a pipeline that joins on mismatched names never needs an intermediate rename step just to get diagnostics. ### Case Mismatches A CRM might store `"ACME"` while the billing system stores `"Acme"`, and R treats these as distinct strings. ```{r} invoices <- data.frame( company = c("ACME", "globex", "Initech", "UMBRELLA"), amount = c(1200, 800, 950, 1500), stringsAsFactors = FALSE ) vendors <- data.frame( company = c("Acme", "Globex", "Initech", "Umbrella"), sector = c("Manufacturing", "Logistics", "Software", "Biotech"), stringsAsFactors = FALSE ) join_spy(invoices, vendors, by = "company") ``` Only `"Initech"` matches exactly; the other three would be lost in a standard join. The case-mismatch check compares the two key sets after lowercasing and reports every pair that would match if case were ignored, with an example pair quoted in the message. Fixing this requires choosing a direction (lowercase everything, or uppercase everything), which is exactly what the `standardize_case` argument of `join_repair()` does in the next section. ### Encoding and Invisible Characters Two strings can render identically in the console but differ at the byte level: a non-breaking space (`\u00A0`) versus a regular space, a zero-width space from PDF copy-paste. ```{r} # Simulate invisible character contamination raw_ids <- data.frame( product_id = c("SKU-001", "SKU-002", paste0("SKU-003", "\u200B")), batch = c("A", "B", "C"), stringsAsFactors = FALSE ) clean_ids <- data.frame( product_id = c("SKU-001", "SKU-002", "SKU-003"), warehouse = c("East", "West", "North"), stringsAsFactors = FALSE ) join_spy(raw_ids, clean_ids, by = "product_id") ``` The zero-width space appended to `"SKU-003"` is invisible when printed but prevents the match. joinspy scans for the usual suspects: zero-width space (U+200B), zero-width non-joiner (U+200C), zero-width joiner (U+200D), byte order mark (U+FEFF), and non-breaking space (U+00A0). Columns that mix declared encodings are flagged as well. These characters typically arrive via copy-paste from PDFs or web pages, or from files saved with a byte order mark, so an encoding flag is often a hint about an upstream pipeline step worth checking. ### Near Matches When an unmatched key sits within an edit distance of two from a key on the other side, the report lists the pair as a possible typo. Keys shorter than three characters are skipped to avoid false positives. ```{r} batches_x <- data.frame( batch = c("LOT-A100", "LOT-B200", "LOT-C300"), yield = c(0.92, 0.88, 0.95), stringsAsFactors = FALSE ) batches_y <- data.frame( batch = c("LOT-A100", "LOT-B200", "LOT-C30O"), # final character is a letter O supplier = c("North", "East", "West"), stringsAsFactors = FALSE ) join_spy(batches_x, batches_y, by = "batch") ``` `"LOT-C300"` and `"LOT-C30O"` differ by a single character, a digit zero against a letter O, which is the kind of error manual data entry produces. Near matches carry severity `"info"` rather than `"warning"` because the package cannot know whether two similar keys are a typo or two genuinely distinct values. They are surfaced for human review; `join_repair()` deliberately leaves them alone. ### Type Mismatches Key columns can also disagree about type. An identifier stored as numeric on one side and character on the other is common when one table comes from a database and the other from a CSV reader. ```{r} shipments <- data.frame(zip = c(1010, 1020, 1030), n_parcels = c(5, 7, 2)) zones <- data.frame( zip = c("1010", "1020", "1030"), zone = c("Inner", "Mid", "Outer"), stringsAsFactors = FALSE ) join_spy(shipments, zones, by = "zip") ``` The match analysis still predicts full overlap because R coerces numeric to character when comparing the key sets, and base `merge()` does the same when joining. dplyr's join verbs refuse to join a double column to a character column, so the same two tables join under one backend and error under another. The type-mismatch warning catches the disagreement before the backend choice decides the outcome. The report also warns about character-versus-factor pairs, mismatched factor levels, and numeric keys that risk floating-point or large-integer precision loss. ### Combining Multiple Issues Real data rarely has just one problem. `join_spy()` checks for whitespace, case, and encoding issues in one call and reports them together, grouped by type. ```{r} employees <- data.frame( dept = c("Sales ", "ENGINEERING", "marketing", paste0("HR", "\u00A0")), name = c("Alice", "Bob", "Carol", "David"), stringsAsFactors = FALSE ) departments <- data.frame( dept = c("Sales", "Engineering", "Marketing", "HR"), budget = c(50000, 80000, 45000, 30000), stringsAsFactors = FALSE ) join_spy(employees, departments, by = "dept") ``` Zero matches out of four rows, with the problems grouped by type. The fix differs per category: whitespace is safe to trim automatically, case standardization requires choosing a direction, and encoding issues may indicate deeper pipeline problems. The severity tags help with triage. A `"warning"` (duplicates, whitespace, NA keys, numeric/character type clashes) changes join results in ways that usually matter. Entries tagged `"info"`, like empty strings, factor level differences, and near matches, deserve a look but are sometimes intentional. ### Duplicate Keys Duplicates don't prevent matches, but they multiply rows. If `"London"` appears three times in the left table and twice in the right, we get six rows for London after an inner join. `key_duplicates()` returns the offending rows along with a count: ```{r} transactions <- data.frame( store_id = c("S1", "S1", "S2", "S3", "S3", "S3"), day = c("Mon", "Tue", "Mon", "Mon", "Tue", "Wed"), amount = c(100, 200, 150, 300, 250, 400), stringsAsFactors = FALSE ) key_duplicates(transactions, by = "store_id") ``` The `.n_duplicates` column maps directly to the row multiplication factor in a join: a key with `.n_duplicates` of 3 contributes three rows for every matching row on the other side. The `keep` argument accepts `"first"`, `"last"`, or the default `"all"`. With `"first"` or `"last"`, only one representative row per duplicated key is returned, which is handy when we want one row per key to inspect rather than the full set: ```{r} key_duplicates(transactions, by = "store_id", keep = "first") ``` `key_duplicates()` also takes composite keys (`by = c("store_id", "day")`), in which case a row counts as a duplicate only when all key columns repeat together. A row whose key contains an NA in any component is treated as having an NA key, matching how joins treat missing key components. What to do about duplicates depends on what they mean. When the duplicates are legitimate detail rows (multiple transactions per store), the usual move is to aggregate to the key level before joining, or to keep the detail and declare the relationship with `join_strict(..., expect = "n:1")` so the expansion is explicit. When the duplicates are accidental (a double-loaded file, a botched append), deduplicate first; `keep = "first"` gives a quick way to pick a representative row, though deciding which row to trust is a data question, and the answer belongs in the script rather than in a silent default. ## The JoinReport Object Every diagnostic in the previous section came from printing a `JoinReport`, the object that `join_spy()` returns. The printed form is a summary; the object underneath holds everything the analysis found, which is what we want when reacting to diagnostics programmatically. ```{r} report <- join_spy(sales, cities, by = "city") names(report) ``` The components mirror the printed sections. `x_summary` and `y_summary` describe each table's keys: row count (`n_rows`), distinct keys (`n_unique`), duplicated keys (`n_duplicates`) and the rows they occupy (`n_duplicate_rows`), missing keys (`n_na`), and the duplicated key values themselves (`duplicate_keys`). `match_analysis` holds the overlap: counts of matched, left-only, and right-only keys, the match rate, and the actual key vectors so we can see which values fell on each side. `expected_rows` predicts the result size for all four join types, `by` records the join columns, `multicolumn_analysis` carries the per-column breakdown for composite keys, and `memory_estimate` sizes the expected result. ```{r} report$match_analysis$match_rate report$match_analysis$left_only_keys ``` The two left-only keys are the whitespace-damaged values. Pulling them out of the report like this is how a pipeline can branch on diagnostics: quarantine the unmatched keys, write them to a review file, or fail the run if the match rate drops below an agreed floor. The summaries are useful on their own, before any matching question arises. `x_summary$duplicate_keys` names the keys that repeat, and comparing `n_rows` to `n_unique` is the one-line answer to whether a table can serve as a lookup. These are the same numbers the printed report displays; the components just make them addressable. `issues` is a list with one entry per detected problem. Each entry carries a `type`, a `severity` (`"warning"` or `"info"`), a human-readable `message`, and usually a `details` element with the affected values. ```{r} vapply(report$issues, function(i) i$type, character(1)) report$issues[[1]]$message ``` ### print(), summary(), and plot() Three S3 methods cover the common ways of consuming a report. `print()` produces the full console report we have been reading all along. `summary()` flattens the headline numbers into a data frame with one metric per row, which is the form to keep when collecting diagnostics across many joins: ```{r} summary(report) ``` `summary()` also takes a `format` argument: `"text"` prints aligned key-value pairs, and `"markdown"` prints a table ready to paste into a report or an issue tracker. ```{r} summary(report, format = "markdown") ``` `plot()` draws the key overlap as a Venn diagram, with the left-only, shared, and right-only key counts in the three regions and the match rate in the subtitle: ```{r} plot(report) ``` The picture makes the diagnosis immediate: two keys live only in `sales` (the damaged Paris and Berlin), two are shared, and three live only in `cities`. The figure earns its place when the audience is wider than the analyst: a Venn diagram in a data-quality writeup communicates "half our keys have no counterpart" faster than a table of counts does. The `file` argument saves the same figure to disk (PNG, SVG, or PDF, chosen by the file extension) and `colors` controls the two circle colours. The method invisibly returns the three counts, so the numbers behind the figure stay available. For checking report objects in defensive code, `is_join_report()` tests the class: ```{r} is_join_report(report) ``` This matters in pipelines that call `last_report()` (introduced below), which returns `NULL` when no join has run yet. ## Auto-Repair `join_repair()` handles whitespace trimming, case standardization, invisible character removal, and empty-to-NA conversion in a single call. Each fix is controlled by its own argument: `trim_whitespace` (default `TRUE`), `standardize_case` (`"lower"`, `"upper"`, or the default `NULL` for no change), `remove_invisible` (default `TRUE`), and `empty_to_na` (default `FALSE`). Only the key columns named in `by` are touched, and only when they are character; numeric and factor key columns pass through unchanged. ### Dry Run We can preview what `join_repair()` would change with `dry_run = TRUE`: ```{r} messy_left <- data.frame( id = c(" A-101", "B-202 ", "c-303", paste0("D-404", "\u200B")), score = c(88, 92, 76, 95), stringsAsFactors = FALSE ) messy_right <- data.frame( id = c("A-101", "B-202", "C-303", "D-404"), label = c("Alpha", "Beta", "Gamma", "Delta"), stringsAsFactors = FALSE ) join_repair(messy_left, messy_right, by = "id", dry_run = TRUE) ``` The dry run shows how many values will change and what kind of change each column needs, without touching the original data. Running the preview first is worth the extra line whenever the keys carry meaning: case standardization in particular rewrites values, and a dry run confirms it will only rewrite the ones we expect. ### Applying Repairs Drop `dry_run` (or set it to `FALSE`) to get back the repaired data frames. When we pass both `x` and `y`, the return value is a list with both: ```{r} repaired <- join_repair( messy_left, messy_right, by = "id", trim_whitespace = TRUE, standardize_case = "upper", remove_invisible = TRUE ) repaired$x$id repaired$y$id ``` Both sides now have clean, uppercase, whitespace-free keys. We can confirm with `key_check()`: ```{r} key_check(repaired$x, repaired$y, by = "id") ``` The diagnose-repair-verify loop is the core workflow: `join_spy()` names the problems, `join_repair()` fixes the mechanical ones, and `key_check()` confirms the keys are clean before the join runs. The non-key columns (`score` and `label` here) are never modified, so repairs cannot corrupt the data the join is meant to carry. ### Repairing a Single Table When `y` is omitted, `join_repair()` returns the repaired `x` directly rather than a list. This is the form to use when cleaning a table once, before joining it against several others: ```{r} codes <- data.frame( code = c(" X1", "X2 ", ""), n = 1:3, stringsAsFactors = FALSE ) join_repair(codes, by = "code", empty_to_na = TRUE) ``` The `empty_to_na` flag converts empty strings to `NA`. The distinction matters for joins: two empty strings match each other, so a stray `""` on both sides quietly joins rows that have no business being paired, while `NA` keys never match anything. Converting makes the missingness explicit. ### What Repair Does Not Fix `join_repair()` fixes mechanical string problems: whitespace, case, invisible characters, empty strings. It does not fix typos, semantic mismatches, or different key formats (`"USA"` vs. `"United States"`). Typos surface in the report as near matches and need human judgment, since an automated fix could merge keys that are genuinely distinct. Format mismatches need a recoding table. `vignette("common-issues")` walks through both, alongside the rest of the problem catalogue. ### Repair Suggestions from a Report If we already have a `JoinReport` from `join_spy()`, `suggest_repairs()` generates ready-to-run code that fixes the detected issues: ```{r} report <- join_spy(messy_left, messy_right, by = "id") suggest_repairs(report) ``` The generated code calls base R functions (`trimws()`, `tolower()`, `gsub()`) keyed to what the report detected, covering whitespace, case, empty-string, and encoding issues. The snippets are also returned invisibly as a character vector, so they can be written to a script file. This path suits the case where we want the fix recorded as explicit code in the analysis script rather than hidden inside a `join_repair()` call. ## Row Count Predictions Every `JoinReport` includes an `expected_rows` component that predicts how many rows each join type will produce, calculated from key overlap and duplicate structure. ```{r} orders <- data.frame( product_id = c("P1", "P1", "P2", "P3", "P4"), quantity = c(10, 5, 20, 15, 8), stringsAsFactors = FALSE ) products <- data.frame( product_id = c("P1", "P2", "P3", "P5"), name = c("Widget", "Gadget", "Gizmo", "Doohickey"), stringsAsFactors = FALSE ) report <- join_spy(orders, products, by = "product_id") report$expected_rows ``` A left join here retains all five order rows (with `NA` for the unmatched `"P4"`), while an inner join drops it. For each join type, the calculation counts overlapping keys and multiplies by the number of duplicates on each side. A key appearing twice in `x` and three times in `y` contributes 2 * 3 = 6 rows to an inner join. The other join types build on that base: the left prediction adds the unmatched left rows, the right prediction adds the unmatched right rows, and the full prediction adds both. The calculation is exact when all keys are non-NA and no string issues remain. NA keys deserve caution: R's `merge()` and dplyr both skip NAs by default, so the predicted count may overestimate. The report warns when NAs are present. Checking the prediction before running the join is the cheapest correctness test joins offer. If we believe the join is a simple lookup and `expected_rows$left` exceeds `nrow(x)`, the belief is wrong, and we have learned it before any rows were produced. The wrappers in the next section automate exactly this comparison. Beyond row counts, the predictions also translate into a size estimate: ```{r} report$memory_estimate ``` `memory_estimate` multiplies the predicted row count by the average bytes per row in the inputs, scaled up for the extra columns the result carries. It is a heuristic, useful for the order of magnitude: when a join we expect to return megabytes is predicted in gigabytes, something is wrong with the keys, and we have found out without allocating the result. `summary()` returns the same information as a compact data frame: ```{r} summary(report) ``` ### Sampling Large Tables On tables with millions of rows, the string scans take noticeable time. The `sample` argument caps how many rows each table contributes to the analysis: ```{r} set.seed(7) big_orders <- data.frame( product_id = sample(sprintf("P%04d", 1:5000), 20000, replace = TRUE), qty = sample(1:10, 20000, replace = TRUE) ) catalog <- data.frame( product_id = sprintf("P%04d", 1:5000), price = round(runif(5000, 1, 100), 2) ) report_big <- join_spy(big_orders, catalog, by = "product_id", sample = 2000) report_big$sampling ``` When sampling kicks in, the report records the original row counts alongside the sample size, and the printed report carries a notice. The trade-off is exactness: match rates and row predictions then describe the sample, so treat them as estimates of the full-table values. Issue detection still works well under sampling, since a systematic problem like trailing whitespace shows up in any reasonably sized sample. A practical division of labour is to sample during interactive exploration, where we mostly want to know whether the keys are clean and roughly how they overlap, and to run the full analysis once before the join that matters, where the exact row prediction is the payoff. ## Post-Join Diagnostics Sometimes the join has already happened and we need to understand the result. Maybe the join sits in someone else's code, maybe the inputs are gone from memory and only the output survived a long pipeline run, or maybe we are reviewing a script where the row count quietly doubles between two steps. Two functions cover this direction: `join_explain()` reconstructs why a result has the size it has, and `join_diff()` compares a table to its post-join self. ### Explaining Row Count Changes `join_explain()` takes the result, the two input tables, and the join column, then explains why the row count changed: ```{r} tickets <- data.frame( event_id = c(1, 2, 2, 3), seat = c("A1", "B2", "B3", "C1"), stringsAsFactors = FALSE ) events <- data.frame( event_id = c(1, 2, 4), name = c("Concert", "Play", "Opera"), stringsAsFactors = FALSE ) result <- merge(tickets, events, by = "event_id") expl <- join_explain(result, tickets, events, by = "event_id", type = "inner") ``` The output has two parts. The row-count section states the before and after sizes and flags the direction of the change. The explanation section then decomposes the change into its causes: duplicate keys multiplying rows, unmatched keys being dropped or padded, and NA keys that never match. Usually when a table grows unexpectedly, the answer is a handful of keys with duplicates on one side. Passing `type` tells the function which join was performed so it attributes dropped rows correctly. The explanation is also returned invisibly as a list, with the row counts, the difference, per-side duplicate counts, and the matched/left-only/right-only key counts: ```{r} expl$diff expl$left_only ``` The result lost one row relative to `tickets` and one left key had no match: the inner join dropped event 3. Capturing the list lets a pipeline assert on the decomposition, for example failing the run whenever `left_only` is nonzero. ### Before/After Comparison `join_diff()` compares a data frame before and after a join to show which rows were added, dropped, or preserved: ```{r} before <- data.frame( id = 1:4, value = c("a", "b", "c", "d"), stringsAsFactors = FALSE ) lookup <- data.frame( id = c(2, 3, 4, 5), extra = c("X", "Y", "Z", "W"), stringsAsFactors = FALSE ) after <- merge(before, lookup, by = "id", all = TRUE) join_diff(before, after, by = "id") ``` With `all = TRUE` nothing is dropped: the row for `id = 1` survives with `NA` in `extra`, and `id = 5` enters from the lookup side. The diff reports one added row, one added column, and one more unique key after the join than before. The `by` argument is optional; without it the comparison covers dimensions and columns only, and with it `join_diff()` adds the key analysis. Here is a second example that highlights row addition from duplicates rather than new keys: ```{r} before2 <- data.frame( id = c(1, 2, 3), value = c("a", "b", "c"), stringsAsFactors = FALSE ) dup_lookup <- data.frame( id = c(1, 1, 2, 3), tag = c("X", "Y", "Z", "W"), stringsAsFactors = FALSE ) after2 <- merge(before2, dup_lookup, by = "id") d <- join_diff(before2, after2, by = "id") ``` The row count grew from three to four because `id = 1` matched two rows in the lookup. No new keys were introduced; the growth came entirely from duplicate expansion, and the key analysis shows it as duplicate rows appearing where there were none. `join_diff()` distinguishes these two sources of row addition (new keys vs. duplicate expansion). Like `join_explain()`, it returns its findings invisibly, including the added and removed column names: ```{r} d$columns_added ``` ## Safe Join Wrappers The `*_join_spy()` family wraps standard joins with automatic diagnostics. They print a report before joining and return the joined data frame: ```{r} patients <- data.frame( patient_id = c("P01", "P02", "P03"), age = c(34, 28, 45), stringsAsFactors = FALSE ) labs <- data.frame( patient_id = c("P01", "P02", "P04"), result = c(5.2, 3.8, 6.1), stringsAsFactors = FALSE ) result <- left_join_spy(patients, labs, by = "patient_id") head(result) ``` After the report, the wrapper announces the join, performs it, and compares the actual row count to the prediction, flagging any discrepancy. That final check closes the loop: a mismatch between predicted and actual rows means the prediction's assumptions (usually around NA keys) did not hold, and the message says so at the join site. All four wrappers (`left_join_spy`, `right_join_spy`, `inner_join_spy`, `full_join_spy`) share the same interface. The wrappers are convenient for exploratory work; in production, we might prefer calling `join_spy()` first and inspecting the report programmatically before proceeding, or `join_strict()` when there is a cardinality worth enforcing. The wrappers also accept a `backend` parameter (`"base"`, `"dplyr"`, or `"data.table"`) that overrides automatic backend detection. Every wrapper attaches the full report to the result as an attribute, so the diagnostics travel with the data: ```{r} rpt <- attr(result, "join_report") rpt$expected_rows$left ``` A function several layers downstream can recover the report from the data frame it received and check what happened at the join, with no extra plumbing. ### Quiet Mode and Deferred Reports In pipelines, printed output can be distracting. With `.quiet = TRUE`, the report is suppressed and can be retrieved afterward via `last_report()`: ```{r} result <- inner_join_spy(patients, labs, by = "patient_id", .quiet = TRUE) # Later, inspect what happened rpt <- last_report() rpt$match_analysis$match_rate ``` `last_report()` returns the report from the most recent `*_join_spy()` call and persists for the duration of the R session. Printing is governed by two arguments: the report prints when `verbose = TRUE` (the default) and `.quiet = FALSE`, and either `verbose = FALSE` or `.quiet = TRUE` silences it. The diagnostics themselves always run: quiet mode changes what is shown, and the report is still computed, stored for `last_report()`, attached as an attribute, and written to the log file when logging is enabled. ## Cardinality Enforcement Every join has an intended cardinality: a lookup table is `1:n`, a one-to-one mapping is `1:1`. `join_strict()` lets you declare that relationship. If the data violates it, the join errors instead of producing a wrong-sized result. The output row count is determined by the constraint you specify. ```{r} sensors <- data.frame( sensor_id = c("T1", "T2", "T3"), location = c("Roof", "Basement", "Lobby"), stringsAsFactors = FALSE ) readings <- data.frame( sensor_id = c("T1", "T2", "T3"), value = c(22.1, 18.5, 21.0), stringsAsFactors = FALSE ) # Succeeds: one reading per sensor join_strict(sensors, readings, by = "sensor_id", expect = "1:1") ``` ```{r error = TRUE} # Fails: T1 has two readings, violating 1:1 readings_dup <- data.frame( sensor_id = c("T1", "T1", "T2", "T3"), value = c(22.1, 23.4, 18.5, 21.0), stringsAsFactors = FALSE ) join_strict(sensors, readings_dup, by = "sensor_id", expect = "1:1") ``` The error names both the expected and the observed cardinality and reports the duplicate counts on each side, so the message alone usually identifies which table broke the contract. The `expect` argument accepts four cardinality levels: - `"1:1"`: both sides have unique keys. Any duplicate on either side is an error. - `"1:n"`: the left table has unique keys, the right may repeat. The classic lookup-table join. - `"n:1"`: the mirror of `"1:n"`. The right table is the lookup. - `"n:m"`: duplicates allowed on both sides. Rarely what we want; exists as an explicit opt-in for intentional Cartesian products per key. The long spellings `"1:many"`, `"many:1"`, and `"many:many"` are accepted as synonyms. The `type` argument selects the join itself (`"left"` by default, or `"right"`, `"inner"`, `"full"`), and `backend` passes through to the same engine dispatch the wrappers use, so a strict join runs on dplyr or data.table just as readily as on base R. The same duplicated readings that broke the `1:1` contract pass cleanly once we declare the relationship we actually have, one sensor to many readings: ```{r} join_strict(sensors, readings_dup, by = "sensor_id", expect = "1:n") ``` The result has four rows, one per reading, and that growth is now expected rather than accidental. This is the working pattern for `join_strict()`: pick the tightest constraint the data model supports, and let the function complain when reality disagrees. The constraint is evaluated on the duplicate structure of the whole key columns, before any joining happens. That makes `join_strict()` a contract on the tables themselves: declaring `expect = "1:n"` in a script documents the assumed data model, and the script stops the day an upstream source starts shipping duplicate keys. To detect the actual cardinality without enforcing anything: ```{r} detect_cardinality(sensors, readings_dup, by = "sensor_id") ``` The classification follows from where the duplicates sit: duplicates on both sides give `"n:m"`, duplicates only on the left give `"n:1"`, only on the right `"1:n"`, and none `"1:1"`. Alongside the printed summary, the string is returned invisibly, so we can use it programmatically: to select the right `expect` value, or to log the cardinality of a feed over time and notice when it drifts. ## Advanced Features ### Cartesian Product Detection When both tables have duplicates on the same key, the join produces the Cartesian product of those duplicates. Two keys with 100 rows each produce 10,000 rows for that key alone. `check_cartesian()` warns before this happens: ```{r} left <- data.frame( group = c("A", "A", "A", "B"), x = 1:4, stringsAsFactors = FALSE ) right <- data.frame( group = c("A", "A", "B", "B"), y = 5:8, stringsAsFactors = FALSE ) cart <- check_cartesian(left, right, by = "group") ``` The function computes, per matched key, the product of its occurrence counts on the two sides, and sums these into the predicted inner-join size; the expansion factor is that total divided by the larger of the two input row counts. Here key A contributes 3 * 2 = 6 rows and key B contributes 1 * 2 = 2, so the four-row inputs produce eight rows, an expansion factor of 2: ```{r} cart$expansion_factor cart$worst_keys ``` The `threshold` argument (default 10) sets how large the expansion factor must be before the function reports a risk. A clean lookup join keeps the factor at 1 or below, and pushing it past 10 essentially requires duplicates on both sides of the same key, which is why 10 makes a sensible default boundary between intended expansion and an accident. For a pipeline where every join should be near 1:1, a tighter threshold catches trouble earlier: ```{r} check_cartesian(left, right, by = "group", threshold = 1.5) ``` The warning ranks the worst offending keys by their row product, which is the right place to start: in real data, Cartesian blowups usually trace back to a few catch-all keys (an `"UNKNOWN"` code, an empty-string id) that are heavily duplicated on both sides. ### Multi-Table Join Chains `analyze_join_chain()` runs diagnostics across a sequence of joins. We provide a named list of tables and a list of join specifications: ```{r} orders <- data.frame(order_id = 1:4, customer_id = c(1, 2, 2, 3)) customers <- data.frame(customer_id = 1:3, region_id = c(1, 1, 2)) regions <- data.frame(region_id = 1:2, name = c("North", "South"), stringsAsFactors = FALSE) ``` ```{r} chain <- analyze_join_chain( tables = list(orders = orders, customers = customers, regions = regions), joins = list( list(left = "orders", right = "customers", by = "customer_id"), list(left = "result", right = "regions", by = "region_id") ) ) ``` Each step prints its input sizes, match rate, expected left-join row count, and issue count, followed by a chain-level summary. The special name `"result"` refers to the output of the previous join, so we can chain as many tables as needed. Because `"result"` resolves to the actual intermediate data (built with left joins), key diagnostics at step two account for any NAs or key distribution changes introduced by the first join. This matters in star-schema assembly, where a problem in the first join (a duplicate dimension key, say) only becomes visible as row growth two joins later. The return value is a list with one element per step, each holding the step's full `JoinReport`, so any step can be drilled into afterward: ```{r} chain[[2]]$report$match_analysis$match_rate ``` ### Backend Support All join wrappers and `join_strict()` accept a `backend` argument. By default (`backend = NULL`), joinspy detects the backend from the input class: tibbles use dplyr, data.tables use data.table, plain data frames use base R. ```{r eval = FALSE} # Force dplyr backend even for plain data frames left_join_spy(x, y, by = "id", backend = "dplyr") # Force base R for tibbles left_join_spy(x, y, by = "id", backend = "base") ``` The diagnostic layer is backend-agnostic; all string analysis and key checking happens before the join engine is invoked, so reports are identical regardless of backend. The result keeps the class the backend produces (a tibble from dplyr, a data.table from data.table), and extra arguments in `...` pass through to the underlying join function, so backend-specific options remain available. When a requested package is missing, joinspy falls back to base R with a warning rather than failing. `vignette("backends")` covers the dispatch rules, the differences between the engines, and mixed-class inputs. ## Logging and Audit Trails At the console we read reports as they print. A scheduled pipeline has no one watching, so the reports need to go somewhere durable instead. joinspy offers two routes: log a single report explicitly, or log every wrapper join automatically. `vignette("production")` builds full pipeline patterns on top of these. `log_report()` writes one `JoinReport` to a file, with the format chosen by the extension: `.txt` or `.log` for human-readable text, `.json` for machine-readable output, and `.rds` for the complete R object. ```{r} report <- join_spy(patients, labs, by = "patient_id") tmp <- tempfile(fileext = ".log") log_report(report, tmp, timestamp = FALSE) cat(readLines(tmp)[1:8], sep = "\n") ``` The text format mirrors the printed report: table summaries, match analysis, expected rows, and an issue count by type. With `append = TRUE`, successive reports accumulate in the same file, and the default `timestamp = TRUE` stamps each entry so the log doubles as a history. The JSON format carries the same numbers as structured fields: ```{r} tmp_json <- tempfile(fileext = ".json") log_report(report, tmp_json, timestamp = FALSE) cat(readLines(tmp_json)[1:8], sep = "\n") unlink(c(tmp, tmp_json)) ``` JSON suits log aggregation: each report is one object, so downstream tooling can chart match rates over time or alert on issue counts. The `.rds` route preserves the entire object, including the detail vectors inside each issue, and `readRDS()` restores a report that `print()`, `summary()`, and `plot()` work on as if it were fresh. For pipelines with many joins, `set_log_file()` switches on automatic logging: every `*_join_spy()` call appends its report without further code. `get_log_file()` returns the currently configured path, or `NULL` when logging is off. ```{r} log_path <- tempfile(fileext = ".log") set_log_file(log_path) identical(get_log_file(), log_path) ``` ```{r} r1 <- inner_join_spy(patients, labs, by = "patient_id", .quiet = TRUE) r2 <- left_join_spy(patients, labs, by = "patient_id", .quiet = TRUE) set_log_file(NULL) length(readLines(log_path)) > 0 unlink(log_path) ``` Both joins ran quietly and both still landed in the log, because auto-logging happens regardless of `.quiet`. That combination of silent console and complete log is the production default worth aiming for: the run output stays readable while every join leaves an auditable record. `set_log_file()` also accepts `format = "json"` for line-oriented JSON logs, and returns the previous setting invisibly so a function can set, use, and restore the log path without disturbing the caller's configuration. Choosing among the formats comes down to who reads the log. Text logs suit a human scanning yesterday's run for the join that misbehaved. JSON suits monitoring, where a script parses each entry and compares the match rate against a floor. RDS suits forensics on a single problematic join, since the reloaded report keeps the affected key values inside each issue and supports the full set of S3 methods. Mixing them is normal: an auto-logged text file for the run as a whole, plus an explicit `log_report(..., "incident.rds")` when something needs a closer look later. ## Quick Reference | Function | Purpose | Returns | |---|---|---| | `join_spy()` | Full pre-join diagnostic | `JoinReport` (print/summary/plot) | | `key_check()` | Quick pass/fail | Logical | | `key_duplicates()` | Find duplicate keys | Data frame with `.n_duplicates` | | `join_repair()` | Fix whitespace, case, encoding | Repaired data frame(s) | | `suggest_repairs()` | Print repair code snippets | Printed output (invisible) | | `left_join_spy()` | Left join + diagnostics | Joined data frame | | `right_join_spy()` | Right join + diagnostics | Joined data frame | | `inner_join_spy()` | Inner join + diagnostics | Joined data frame | | `full_join_spy()` | Full join + diagnostics | Joined data frame | | `last_report()` | Retrieve last quiet report | `JoinReport` or `NULL` | | `is_join_report()` | Test for report objects | Logical | | `join_strict()` | Cardinality-enforcing join | Joined data frame or error | | `join_explain()` | Explain row count change | Printed summary | | `join_diff()` | Before/after comparison | Printed summary | | `detect_cardinality()` | Detect key relationship | Character (`"1:1"`, etc.) | | `check_cartesian()` | Warn about row explosion | List with explosion analysis | | `analyze_join_chain()` | Multi-step join diagnostics | Printed chain analysis | | `log_report()` | Write report to file | File output (invisible) | | `set_log_file()` | Enable auto-logging | Sets global log path | | `get_log_file()` | Show current log path | Path or `NULL` | ## Further Reading - `vignette("why-keys-dont-match")` for the string-level failure modes in depth: whitespace, case, encoding, types, and how to hunt each one down - `vignette("common-issues")` for a catalogue of join problems and solutions - `vignette("production")` for logging, gating, and pipeline patterns built on `join_strict()` and the quiet wrappers - `vignette("backends")` for the base, dplyr, and data.table engines and the dispatch rules between them - Function help pages: `?join_spy`, `?join_repair`, `?join_strict`