---
title: "Common Join Problems"
author: "Gilles Colling"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Common Join Problems}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 4
)
library(joinspy)
```

Trailing spaces, flipped case, and zero-width Unicode characters make keys
that look identical on screen compare as different during a join.

This vignette covers the issues joinspy detects, ordered roughly by frequency.
String-level issues come first, then structural ones (duplicates, NAs, type
mismatches, Cartesian explosions). Each section follows the same arc: the
symptom, a small dataset that reproduces it, the diagnosis with joinspy, and
the fix. The closing section collects the individual tools into a numbered
workflow we can follow when a join goes wrong and the cause is unknown.

## Trailing and leading whitespace

The classic. Someone exports a CSV from Excel, and now half the keys carry a
trailing space. Everything looks fine when we print the data frame. Nothing
matches when we join.

```{r}
sales <- data.frame(
  product = c("Widget", "Gadget ", " Gizmo"),
  units = c(10, 20, 30),
  stringsAsFactors = FALSE
)

inventory <- data.frame(
  product = c("Widget", "Gadget", "Gizmo"),
  stock = c(100, 200, 300),
  stringsAsFactors = FALSE
)

join_spy(sales, inventory, by = "product")
```

Two of three keys carry whitespace that prevents matching. `join_repair()`
strips it:

```{r}
sales_clean <- join_repair(sales, by = "product")
key_check(sales_clean, inventory, by = "product")
```

When we want to see what a repair would do before committing to it,
`dry_run = TRUE` prints the planned changes and leaves the data untouched:

```{r}
join_repair(sales, by = "product", dry_run = TRUE)
```

Whitespace usually enters at the import boundary. Base `read.csv()` keeps it
unless we set `strip.white = TRUE`, while `readr::read_csv()` trims by
default (`trim_ws = TRUE`). Fixed-width exports and hand-typed spreadsheet
cells are the other common sources; a stray space typed after a product name
survives every visual inspection. Trimming key columns right after import,
before any join, keeps the problem from spreading into derived tables.

The problem compounds with composite keys -- whitespace in *any* column is
enough to break the match:

```{r}
shipments <- data.frame(
  warehouse = c("East ", "West", "East "),
  product   = c("Widget", "Gadget ", "Gizmo"),
  shipped   = c(50, 80, 35),
  stringsAsFactors = FALSE
)

stock <- data.frame(
  warehouse = c("East", "West", "East"),
  product   = c("Widget", "Gadget", "Gizmo"),
  on_hand   = c(200, 150, 90),
  stringsAsFactors = FALSE
)

join_spy(shipments, stock, by = c("warehouse", "product"))
```

Both `warehouse` and `product` carry trailing spaces in `shipments`, so all
three rows fail to match. The per-column breakdown in the report shows which
component of the composite key drags the match rate down. A single
`join_repair()` call cleans every key column at once:

```{r}
shipments_clean <- join_repair(shipments, by = c("warehouse", "product"))
key_check(shipments_clean, stock, by = c("warehouse", "product"))
```

Composite keys deserve this check even when each column was cleaned at some
point, because new joins often combine columns that were never used as keys
before.

## Case mismatches

Databases are often case-insensitive; R is not. When we pull tables from two
different systems, one might store `"ABC"` and the other `"abc"`.

```{r}
sensors <- data.frame(
  station = c("AWS-01", "aws-02", "Aws-03"),
  temp = c(22.1, 18.4, 25.7),
  stringsAsFactors = FALSE
)

metadata <- data.frame(
  station = c("aws-01", "AWS-02", "AWS-03"),
  region = c("North", "South", "East"),
  stringsAsFactors = FALSE
)
```

None of these keys match as-is:

```{r}
join_spy(sensors, metadata, by = "station")
```

We can repair both sides to a common case:

```{r}
repaired <- join_repair(sensors, metadata, by = "station", standardize_case = "lower")
key_check(repaired$x, repaired$y, by = "station")
```

`standardize_case` accepts `"lower"` or `"upper"`. One thing to watch for:
`join_repair()` only modifies character columns. A factor key passes through
untouched, with no message, so we convert factors with `as.character()`
before repairing; the factor section below shows this in detail. A second
caution applies when case carries meaning. If sample codes `a1` and `A1` are
genuinely different things, folding the case merges them, so we run
`key_duplicates()` on the folded column before trusting the repair.

Case drift between tables usually means they came from systems with
different conventions. Many SQL collations compare case-insensitively, so
`AWS-01` and `aws-01` are the same row there and different rows in R. Mixed
case inside a single column, as in `sensors` above, more often points to
hand entry. Standardizing to one case at import, and writing that convention
down, prevents the next data pull from reintroducing the mismatch.

## Encoding and invisible characters

A key contains a non-breaking space (U+00A0) instead of a regular space, or
a zero-width joiner crept in during a copy-paste from a PDF. The strings
print identically but do not match.

```{r}
# Simulate a non-breaking space in one key
left <- data.frame(
  city = c("New York", "Los\u00a0Angeles", "Chicago"),
  pop = c(8.3, 3.9, 2.7),
  stringsAsFactors = FALSE
)

right <- data.frame(
  city = c("New York", "Los Angeles", "Chicago"),
  area = c(302, 469, 227),
  stringsAsFactors = FALSE
)

join_spy(left, right, by = "city")
```

The `"Los\u00a0Angeles"` key in `left` looks like `"Los Angeles"` in `right`,
but the non-breaking space makes them different byte sequences.
`join_repair()` with `remove_invisible = TRUE` (the default) strips these out:

```{r}
left_fixed <- join_repair(left, by = "city")
key_check(left_fixed, right, by = "city")
```

When the repair needs to live in a script that does not load joinspy,
`suggest_repairs()` turns the report into plain base R that we can paste
anywhere:

```{r}
report <- join_spy(left, right, by = "city")
suggest_repairs(report)
```

Common sources include PDF extraction, web scraping, and cross-platform file
transfers. `join_repair()` handles the most common offenders: non-breaking
spaces, zero-width joiners, BOM markers, and soft hyphens. It does not attempt
full Unicode normalization (NFC vs. NFD); for that we would reach for
`stringi::stri_trans_nfc()`. When the same feed delivers non-breaking spaces
every week, we move the `gsub()` line above into the import script and raise
the issue with whoever owns the producing system.

## Empty strings masquerading as data

Empty strings (`""`) are valid character values in R. They will match other
empty strings in a join, which is almost never what we want: two rows with
missing identifiers get joined as though they refer to the same entity.

```{r}
patients <- data.frame(
  mrn = c("P001", "", "P003"),
  age = c(34, 56, 29),
  stringsAsFactors = FALSE
)

visits <- data.frame(
  mrn = c("P001", "P002", ""),
  date = c("2024-01-10", "2024-02-15", "2024-03-20"),
  stringsAsFactors = FALSE
)

join_spy(patients, visits, by = "mrn")
```

`join_spy()` lists empty strings as an informational issue: two empty keys
do match each other, so whether that counts as a bug depends on what the
rows mean. Here it would attach an anonymous visit to an anonymous patient.
Converting empties to `NA` before joining fixes this, since NAs never match
in R:

```{r}
patients_fixed <- join_repair(patients, by = "mrn", empty_to_na = TRUE)
patients_fixed$mrn
```

Passing `y` repairs both tables in one call; the return value is then a list
with elements `x` and `y`:

```{r}
both <- join_repair(patients, visits, by = "mrn", empty_to_na = TRUE)
both$y$mrn
```

Empty strings are what base `read.csv()` produces for blank cells in
character columns; only the literal string `"NA"` becomes missing by
default. Passing `na.strings = c("NA", "")` at import keeps blanks out of
the key column entirely, which is cheaper than repairing after the fact.
Note that `data.table` treats `""` and `NA_character_` as distinct in keyed
joins, so when using a data.table backend we need to convert empty strings to
`NA` on both sides.

## Factor keys

Legacy code written under `stringsAsFactors = TRUE`, modeling pipelines, and
some file readers hand us keys stored as factors. A factor key holding the
same labels as a character key joins fine, since R coerces during the merge,
and `join_spy()` notes the type difference:

```{r}
surveys <- data.frame(
  site = factor(c("North", "South", "East")),
  count = c(12, 7, 30)
)

habitats <- data.frame(
  site = c("North", "South", "East"),
  habitat = c("bog", "meadow", "forest"),
  stringsAsFactors = FALSE
)

join_spy(surveys, habitats, by = "site")
```

All 3 keys match, and the report carries an informational note that the
factor will be coerced. The trap is what factors hide. The string checks in
`join_spy()` run on character columns only, so whitespace buried inside
factor labels goes unreported:

```{r}
plots <- data.frame(site = factor(c("North ", "South")), richness = c(14, 9))
join_spy(plots, habitats, by = "site")
```

The match analysis still does its job (1 of 2 keys matches), but nothing
in the issue list says why. Converting to character first surfaces the
cause:

```{r}
plots$site <- as.character(plots$site)
join_spy(plots, habitats, by = "site")
```

Now the whitespace warning appears, along with a near-match pairing
`'North '` with `'North'`. `join_repair()` follows the same character-only
rule: called on the factor version it returns the data unchanged, with no
message. After the conversion it repairs the key as usual:

```{r}
plots_clean <- join_repair(plots, by = "site")
key_check(plots_clean, habitats, by = "site")
```

When both keys are factors, `join_spy()` also compares their level sets.
Levels that exist on only one side are reported even when no data row uses
them, which catches lookup tables built against a stale set of categories:

```{r}
surveys_f <- data.frame(
  site = factor(c("North", "South"), levels = c("North", "South", "West"))
)
habitats_f <- data.frame(site = factor(c("North", "South", "East")))

join_spy(surveys_f, habitats_f, by = "site")
```

Here `"West"` is declared on the left and never observed, while `"East"`
exists on the right, so the report counts 1 level unique to each side. The
case section mentioned the level-versus-label trap; here it is concretely.
`as.numeric()` on a factor returns the internal level codes:

```{r}
plot_ids <- factor(c("10", "20", "30"))
as.numeric(plot_ids)
as.numeric(as.character(plot_ids))
```

The direct conversion returns the codes 1, 2, 3, while the route
through `as.character()` recovers the labels 10, 20, 30. A key column converted the
first way joins against the wrong rows with no warning, since the codes are
perfectly valid numbers. Any time a numeric-looking key passes through a
factor, the double conversion is the safe route.

## Near-matches and typos

Sometimes keys are close but not identical. These are genuine mismatches,
untouched by whitespace and case repairs, that `join_spy()` flags when it
finds keys in one table with no counterpart in the other.

```{r}
orders <- data.frame(
  sku = c("WDG-100", "GDG-200", "GZM-300"),
  qty = c(5, 12, 8),
  stringsAsFactors = FALSE
)

catalog <- data.frame(
  sku = c("WDG-100", "GDG-200", "GZM-301"),
  price = c(9.99, 14.99, 7.50),
  stringsAsFactors = FALSE
)

report <- join_spy(orders, catalog, by = "sku")
```

Internally, `join_spy()` computes Levenshtein distances between unmatched keys.
When two keys differ by only one or two characters, the report flags them as
near-matches; `GZM-300` and `GZM-301` sit at edit distance 1. Here is a
clearer example with multiple near-matches:

```{r}
employees <- data.frame(
  name = c("Johnson", "Smithe", "O'Brian", "Williams"),
  dept = c("Sales", "R&D", "Ops", "HR"),
  stringsAsFactors = FALSE
)

payroll <- data.frame(
  name = c("Jonhson", "Smith", "O'Brien", "Williams"),
  salary = c(55000, 62000, 48000, 71000),
  stringsAsFactors = FALSE
)

report <- join_spy(employees, payroll, by = "name")
```

`"Johnson"` vs. `"Jonhson"` (transposition), `"Smithe"` vs. `"Smith"` (extra
character), and `"O'Brian"` vs. `"O'Brien"` (vowel swap) are all within edit
distance 2. `"Williams"` matches exactly. The search has deliberate limits:
it considers pairs within edit distance 2, skips keys shorter than 3
characters, and scans at most the first 50 unmatched keys against 100
candidates. On large tables the near-match list is therefore a sample of the
problem, a prompt to look further along the same lines.

There is no automated fix here since joinspy cannot know which side is
correct, but the near-match list gives a concrete starting point for
building a lookup table. Once we have decided which side is authoritative,
the corrections belong in a small recode table stored with the pipeline, so
the same typo never needs re-diagnosing. Typos like these usually trace back
to hand-entered data; where the key is supposed to be machine-generated, a
near-match is worth treating as a symptom of two systems generating IDs
independently.

## Duplicate keys

Duplicate keys cause row multiplication. A left join on a key that appears
twice in the right table doubles the corresponding rows from the left.

```{r}
orders <- data.frame(
  customer_id = c(1, 2, 3),
  amount = c(100, 250, 75)
)

addresses <- data.frame(
  customer_id = c(1, 2, 2, 3),
  address = c("NYC", "LA", "SF", "Chicago"),
  stringsAsFactors = FALSE
)

join_spy(orders, addresses, by = "customer_id")
```

`key_duplicates()` shows which rows are responsible:

```{r}
key_duplicates(addresses, by = "customer_id")
```

Every offending row comes back with a `.n_duplicates` count attached.
`keep = "first"` or `keep = "last"` reduce the output to one row per key,
which doubles as a quick deduplication candidate:

```{r}
key_duplicates(addresses, by = "customer_id", keep = "first")
```

If each customer should have one address, we deduplicate first. If we
genuinely need all combinations, the multiplication is correct -- we just
need to know it will happen. Duplicates often mean the right table is a
different entity than assumed: an address *history* where we expected a
current-address table. The fix is then a data-model decision, picking the
latest row, aggregating, or accepting the multiplication deliberately.
Whichever we choose, encoding it as a `join_strict()` expectation (shown
below) catches the silent regression when next month's extract grows a
second row per customer.

When *both* sides have duplicates, each key group produces a Cartesian
product:

```{r}
orders_dup <- data.frame(
  product = c("A", "A", "B", "B"),
  qty     = c(10, 20, 5, 15)
)

prices_dup <- data.frame(
  product = c("A", "A", "A", "B", "B"),
  price   = c(1.0, 1.1, 1.2, 2.0, 2.5)
)

join_spy(orders_dup, prices_dup, by = "product")
```

Product `"A"` has 2 rows on the left and 3 on the right, so a join produces
2 x 3 = 6 rows for that key alone. `check_cartesian()` quantifies the total
expansion before we run the join:

```{r}
check_cartesian(orders_dup, prices_dup, by = "product")
```

By default it raises the alarm when the result would exceed 10 times the
larger input; the `threshold` argument adjusts that cut-off.

## NA keys

`NA` never equals `NA` in R. This is by design, but it surprises people who
expect two missing values to match.

```{r}
orders <- data.frame(
  customer_id = c(1, NA, 3, NA),
  amount = c(100, 200, 300, 400)
)

customers <- data.frame(
  customer_id = c(1, 2, 3, NA),
  name = c("Alice", "Bob", "Carol", "Unknown"),
  stringsAsFactors = FALSE
)

join_spy(orders, customers, by = "customer_id")
```

We can either remove rows with NA keys before joining, or replace NAs with a
sentinel value if we actually want them to match:

```{r}
# Remove
orders_clean <- orders[!is.na(orders$customer_id), ]
key_check(orders_clean, customers, by = "customer_id")
```

Removal is right when missing IDs are noise. When they carry meaning, say
unattributed orders that should collect under a single placeholder customer,
the sentinel route makes them joinable. We replace the NA with an impossible
ID on both sides:

```{r}
orders_s <- orders
customers_s <- customers
orders_s$customer_id[is.na(orders_s$customer_id)] <- -1
customers_s$customer_id[is.na(customers_s$customer_id)] <- -1

join_spy(orders_s, customers_s, by = "customer_id")
```

The NA warnings are gone and the match rate is 100%. The report now warns
about something new: the sentinel appears twice on the left, so it is a
duplicate key, and both formerly missing orders will attach to the same
`"Unknown"` row:

```{r}
merge(orders_s, customers_s, by = "customer_id", all.x = TRUE)
```

A sentinel makes missing keys equal to each other, which is the behavior we
asked for, so the duplication here is expected. The value must be impossible
as a real ID: `-1` works for positive integer keys, something like
`"__missing__"` for character keys. Both sides need the replacement, since a
sentinel on one side and an NA on the other still never match. NA keys
usually arrive from earlier outer joins or from incomplete entry, and
`join_explain()` lists them as one of its standard explanations, so a
post-join row-count surprise often traces back to this section.

## Type mismatches

One table stores IDs as integers, the other as character strings.
`merge()` coerces silently; `dplyr::left_join()` refuses. Either way,
we want to know about it before the join.

```{r}
invoices <- data.frame(
  product_id = c(1, 2, 3),
  total = c(500, 300, 150)
)

products <- data.frame(
  product_id = c("1", "2", "3"),
  name = c("Widget", "Gadget", "Gizmo"),
  stringsAsFactors = FALSE
)

join_spy(invoices, products, by = "product_id")
```

A subtler variant occurs with `Date` vs. character, or `POSIXct` vs. `Date`,
where the join either fails or coerces through numeric intermediaries.
`join_spy()` flags the type mismatch regardless of the types involved.

```{r}
invoices$product_id <- as.character(invoices$product_id)
key_check(invoices, products, by = "product_id")
```

The repair direction matters: converting the numeric side to character, as
above, is lossless. Going the other way destroys any key with leading
zeros:

```{r}
ids <- c("007", "042")
as.character(as.numeric(ids))
```

`"007"` comes back as `"7"`, a different key, and any non-numeric ID in the
column becomes NA outright. Type drift between extracts is common with
type-sniffing readers: a column of all-digit IDs imports as numeric until
the first alphanumeric ID appears, at which point the same column imports as
character. Pinning the type at import (`colClasses` in base R, `col_types`
in readr) removes the drift at its source.

## Numeric keys with floating-point noise

Numeric keys produced by arithmetic carry floating-point noise. Three depths
built by accumulating 0.1 look identical to hand-typed values when printed,
and the third one is different:

```{r}
readings <- data.frame(depth = cumsum(rep(0.1, 3)), oxygen = c(8.1, 7.4, 6.9))
layers <- data.frame(
  depth = c(0.1, 0.2, 0.3),
  layer = c("surface", "mid", "bottom")
)

print(readings$depth, digits = 17)
readings$depth == layers$depth
```

The accumulated third value is 0.30000000000000004, so the comparison with
0.3 fails. `join_spy()` warns about the key type and the match analysis
shows the damage:

```{r}
join_spy(readings, layers, by = "depth")
```

The match analysis reports 2 keys in both tables, with one orphan on each
side: the two versions of 0.3 that refuse to be equal. The floating-point
warning fires whenever a key column holds non-integer doubles, on both
tables here, because any such key can fail this way. Pinpointing which
values differ by an epsilon is outside joinspy's string checks; that part is
on us, and the standard fix is to remove the noise before joining:

```{r}
readings$depth <- round(readings$depth, 6)
all(readings$depth %in% layers$depth)
```

Rounding to a precision coarser than the noise and finer than the data
restores exact equality. A sturdier design avoids fractional keys entirely.
Store depth in centimeters as an integer, or format it to a fixed-width
string, and this failure mode does not come back. Fractional keys usually
appear when a measured quantity gets promoted into an identifier; an
explicit ID column upstream removes the temptation.

## Many-to-many explosions

When both tables have duplicate keys, we get a Cartesian product within each
key group. With real data this can turn a 10,000-row join into a million-row
table.

```{r}
items <- data.frame(
  order_id = c(1, 1, 2, 2, 2),
  item = c("A", "B", "C", "D", "E"),
  stringsAsFactors = FALSE
)

payments <- data.frame(
  order_id = c(1, 1, 2, 2),
  method = c("Card", "Cash", "Card", "Wire"),
  stringsAsFactors = FALSE
)

check_cartesian(items, payments, by = "order_id")
```

`detect_cardinality()` tells us the relationship type:

```{r}
detect_cardinality(items, payments, by = "order_id")
```

If we expected a one-to-many relationship, `join_strict()` will stop us
before the explosion happens:

```{r error = TRUE}
join_strict(items, payments, by = "order_id", type = "left", expect = "1:n")
```

The error arrives before any rows are produced, which matters when the
explosion would have been the million-row kind. Many-to-many joins that are
intentional, such as enumerating all item-payment pairs for reconciliation,
are better written with the expectation stated: `expect = "n:m"` passes
every cardinality and documents that the expansion is deliberate. The
`*_join_spy()` wrappers report the predicted row count for the same reason,
so a join expected to preserve row counts announces itself when it triples
them instead. Explosions almost always enter a pipeline through a table
that quietly gained a second granularity, an `order_id` table that became
an `order_id` x `payment_attempt` table, and the cardinality check is the
cheapest way to notice.

## No matches at all

An inner join returns zero rows, and downstream code may not check for an
empty data frame.

```{r}
system_a <- data.frame(
  user_id = c("USR-001", "USR-002", "USR-003"),
  score = c(85, 90, 78),
  stringsAsFactors = FALSE
)

system_b <- data.frame(
  user_id = c("1", "2", "3"),
  dept = c("Sales", "R&D", "Ops"),
  stringsAsFactors = FALSE
)

join_spy(system_a, system_b, by = "user_id")
```

Zero overlap -- the keys use completely different formats, and no amount of
trimming or case-folding will help. We need a mapping table or a
transformation that extracts the numeric part:

```{r}
system_a$user_num <- gsub("^USR-0*", "", system_a$user_id)
key_check(system_a, system_b, by = c("user_num" = "user_id"))
```

Format mismatches like this are structural: the two systems never shared an
ID scheme, so joinspy can report the zero overlap and the absence of string
issues, and that combination is itself the diagnosis. The `gsub()`
extraction works when one format embeds the other. Failing that, somebody
owns a mapping table, and the join goes through it. The named `by` in the
final `key_check()` call joins our derived `user_num` column against system
B's `user_id` without renaming anything; the next section covers that
syntax.

## Differently named key columns

Tables rarely agree on what the key column is called: `patient_id` in the
admissions extract is `mrn` in the registry. A named `by` vector maps left
names to right names, and every joinspy function accepts it:

```{r}
admissions <- data.frame(
  patient_id = c("P-01 ", "P-02", "P-03"),
  ward = c("A", "B", "B"),
  stringsAsFactors = FALSE
)

registry <- data.frame(
  mrn = c("P-01", "P-02", "P-04"),
  dob = c("1980-03-02", "1975-11-19", "1990-07-30"),
  stringsAsFactors = FALSE
)

join_spy(admissions, registry, by = c("patient_id" = "mrn"))
```

The header line shows the mapping (`patient_id = mrn`), and the diagnostics
run as usual: the whitespace warning points at `patient_id`, and the
near-match list pairs `'P-01 '` with `'P-01'`. Repairs and joins take the
same vector, with `join_repair()` fixing `patient_id` on the left and `mrn`
on the right:

```{r}
fixed <- join_repair(admissions, registry, by = c("patient_id" = "mrn"))
left_join_spy(fixed$x, fixed$y, by = c("patient_id" = "mrn"), verbose = FALSE)
```

The result keeps the left table's column name, `patient_id`, and the
unmatched `P-03` carries an NA date of birth. Renaming columns before a join
is the common workaround, and it tends to decay as scripts grow, with the
rename and the join drifting apart until one of them changes alone. Passing
the mapping straight to `by` keeps the two halves of the decision in one
place.

## Troubleshooting workflow

The sections above each handle one failure in isolation. On a real join we
usually do not know which failure we have, so this is the order we check in,
walked through on a pair of tables that carries several problems at once:

```{r}
shipments <- data.frame(
  order_ref = c("ORD-1 ", "ORD-2", "ORD-2", "ORD-3", NA),
  qty = c(10, 25, 5, 12, 7),
  stringsAsFactors = FALSE
)

invoices <- data.frame(
  order_ref = c("ORD-1", "ORD-2", "ORD-4"),
  total = c(99, 250, 80),
  stringsAsFactors = FALSE
)
```

**Step 1: run `join_spy()` and read it top to bottom.** The match rate and
the issue list classify the problem before we attempt any fix. For large
tables, `sample = 1000` runs the same diagnostics on a random subset first.

```{r}
report <- join_spy(shipments, invoices, by = "order_ref")
report
```

One report, four findings: a duplicate key, an NA key, a whitespace
problem, and a clutch of near-matches. Each outcome routes to a step below:

1. Whitespace, case, encoding, or empty-string issues: repair the strings
   (step 2).
2. Duplicate keys, or expected row counts above the left table's row count:
   inspect the duplicates (step 3).
3. NA keys: decide what missing means (step 4).

4. A type mismatch: align the types (step 5).

5. Near-matches on otherwise clean keys: build a recode table (the
   near-match section above).
6. A 0% match rate with no string issues: a format mismatch; extract a
   common key or find the mapping table (the no-matches section above).

**Step 2: repair the strings.** `join_repair()` covers whitespace, case,
invisible characters, and empty strings in one call; `suggest_repairs(report)`
prints the equivalent base R when the fix has to live elsewhere.

```{r}
shipments_repaired <- join_repair(shipments, by = "order_ref")
```

**Step 3: inspect the duplicates.** `key_duplicates()` shows the rows,
`detect_cardinality()` names the relationship, and `check_cartesian()`
bounds the blow-up for the worst keys.

```{r}
key_duplicates(shipments_repaired, by = "order_ref")
detect_cardinality(shipments_repaired, invoices, by = "order_ref")
```

`ORD-2` appears twice on the left, so the relationship is `n:1`. If two
shipments per order is the real shape of the data, we keep it and state the
expectation in step 6. If it is an accident, we deduplicate or aggregate
here.

**Step 4: decide what NA keys mean.** Dropping loses the 7-unit shipment
with no reference; the sentinel route from the NA section keeps it under a
placeholder. Here we drop:

```{r}
shipments_repaired <- shipments_repaired[!is.na(shipments_repaired$order_ref), ]
```

**Step 5: align types.** Nothing to fix in this example; when the report
shows a numeric column joining a character column, we convert toward
character (the lossless direction) or pin the types at import.

**Step 6: join with the expectation enforced.** `join_strict()` performs
the join only if the cardinality matches what we declared, so the data
model decision from step 3 becomes executable:

```{r}
result <- join_strict(shipments_repaired, invoices, by = "order_ref",
                      type = "left", expect = "n:1")
result
```

**Step 7: audit the result.** `join_explain()` accounts for the difference
between input and output row counts after the fact:

```{r}
join_explain(result, shipments_repaired, invoices,
             by = "order_ref", type = "left")
```

The row count is unchanged at 4, and the explanation still lists the two
forces that could have moved it: the duplicate `ORD-2` and the unmatched
`ORD-3`. On a larger join those same lines say where unexpected rows came
from. `join_diff()` offers the same comparison oriented around column
changes.

**Step 8: leave a trail.** In production pipelines, `set_log_file()` routes
every subsequent `*_join_spy()` report to a file, which is how we debug a
join that went wrong last Tuesday:

```{r}
log_file <- tempfile(fileext = ".log")
set_log_file(log_file)
audited <- left_join_spy(shipments_repaired, invoices,
                         by = "order_ref", .quiet = TRUE)
set_log_file(NULL)
readLines(log_file)[2:8]
```

With `.quiet = TRUE` the join runs silently and the report still lands in
the log; `last_report()` retrieves it in-session. For one-off snapshots,
`log_report()` writes a single report, and a `.json` or `.rds` extension
switches the format for machine consumption.

**Step 9: for multi-join pipelines, check the whole chain.**
`analyze_join_chain()` runs the step 1 diagnostic at every link and reports
where the first problem enters:

```{r}
orders <- data.frame(order_id = 1:3, customer_id = c(1, 2, 2))
customers <- data.frame(customer_id = 1:3, region_id = c(1, 1, 2))
regions <- data.frame(region_id = 1:2, name = c("North", "South"))

analyze_join_chain(
  tables = list(orders = orders, customers = customers, regions = regions),
  joins = list(
    list(left = "orders", right = "customers", by = "customer_id"),
    list(left = "result", right = "regions", by = "region_id")
  )
)
```

Each step gets its own match rate and issue count, with `"result"` referring
to the output of the previous join, so the first bad link in a five-table
pipeline is visible without bisecting by hand.

## See Also

- `vignette("quickstart")` for a quick introduction to joinspy

- `?join_spy`, `?join_repair`, `?key_check`, `?join_strict`

- `?check_cartesian`, `?detect_cardinality`, `?join_explain`, `?suggest_repairs`