A join runs without error but the row count is wrong – fewer rows than expected, or more. The columns look fine. The key values look identical in the console.
R’s merge() and dplyr’s *_join() compare
key values byte-for-byte. When keys fail to match, they are genuinely
different at the byte level: a trailing space, a case mismatch, or a
zero-width Unicode character that occupies no screen width.
This vignette walks through five scenarios where joins fail for
string-level reasons that resist casual inspection. The data is
synthetic; the patterns come from real pipelines. The first two sections
cover the mechanics: what R compares when it joins two key columns, and
which checks join_spy() runs against them. The scenarios
then put both to work.
Every join reduces to the same primitive: for each key in one table,
find the positions of equal keys in the other. merge()
builds on match(); dplyr and data.table run their own join
engines and apply the same standard of equality. Two character keys are
equal when their byte sequences are equal, after R translates any
declared encodings to a common representation. There is no trimming, no
case folding, no Unicode normalization along the way. A single byte of
difference makes two keys strangers.
identical() applies the same standard, which makes it a
convenient way to test what a join will see:
The strings differ by one character, and charToRaw()
shows which byte is responsible:
The final 20 is an ordinary space. This one happens to
be easy to spot once printed as bytes. The characters behind longer
debugging sessions render as nothing at all, or as something
indistinguishable from a space. The non-breaking space is the classic
example:
space <- "\u0020"
nbsp <- "\u00A0"
space == nbsp
#> [1] FALSE
c(utf8ToInt(space), utf8ToInt(nbsp))
#> [1] 32 160utf8ToInt() returns the Unicode code point of each
character, which makes it the most direct inspection tool for a
suspicious key: 32 is the space on your keyboard, 160 is the
non-breaking space from a web page or a PDF. Both render as a gap of the
same width.
Two strings can also be visually identical with no invisible characters anywhere in them. Unicode allows an accented letter to be written two ways: as one precomposed code point, or as a base letter followed by a combining accent.
Both print as the same accented e. The first is a single code point, the second is two, and R compares the sequences without normalizing them:
c(nchar(e_one), nchar(e_two))
#> [1] 1 2
utf8ToInt(e_one)
#> [1] 233
utf8ToInt(e_two)
#> [1] 101 769A reference table typed by hand usually carries the precomposed form. Data that passed through certain text processors or through macOS file paths can arrive decomposed, and a column of city or person names will then fail to match a column that looks identical character for character.
Encoding adds one more layer. The same character can be stored as different bytes depending on the encoding. The umlauted u is one byte in latin1 and two bytes in UTF-8:
city_utf8 <- "Z\u00FCrich"
city_latin1 <- iconv(city_utf8, from = "UTF-8", to = "latin1")
c(Encoding(city_utf8), Encoding(city_latin1))
#> [1] "UTF-8" "latin1"
charToRaw(city_utf8)
#> [1] 5a c3 bc 72 69 63 68
charToRaw(city_latin1)
#> [1] 5a fc 72 69 63 68Different bytes, same declared meaning. As long as the declaration is correct, R translates before comparing and the keys still match:
Trouble starts when bytes and declaration disagree. A latin1 file
read as if it were UTF-8, or read with no declaration on a system that
assumes one, produces strings whose bytes no longer mean what R thinks
they mean. Accented keys then differ from their clean counterparts, and
nothing in the comparison machinery reconciles them.
join_spy() flags key columns that mix declared encodings,
since mixed declarations within a single column usually mean some values
entered the session by a different route than the rest.
None of the failures above produce a warning, because none of them
are errors from R’s point of view. String equality is exact and well
defined. A join that matches three keys out of six did what it was told,
and a zero-row inner join is a legitimate result (an anti-join hopes for
exactly that). merge() has no way to know that six matches
were expected. The evidence that something went wrong lives in the
difference between the rows you expected and the rows you got, which is
why diagnosing it takes a separate step.
Every scenario below ends with a join_spy() call, so it
is worth laying out what that call checks. Everything happens inside the
one function and comes back in the printed JoinReport. The
chunks in this section reproduce individual checks with base R so the
mechanics are visible; none of them are needed in normal use.
Leading and trailing whitespace is detected with anchored regular expressions over each character key column:
ids <- c("CUST-1001", "CUST-1002 ", " CUST-1004")
grepl("^\\s+", ids)
#> [1] FALSE FALSE TRUE
grepl("\\s+$", ids)
#> [1] FALSE TRUE FALSEValues failing either test are collected and reported with their
column and table. Interior whitespace is left alone:
"New York" and "New York" differ by an
interior space and fall to the near-match check further down.
A case mismatch is defined relationally. A key counts only when it has no exact match in the other table and would gain one if both sides were lowercased:
crm <- c("[email protected]", "[email protected]")
clicks <- c("[email protected]", "[email protected]")
crm[tolower(crm) %in% tolower(clicks)]
#> [1] "[email protected]"Eve appears in neither form on the right, so she is a genuinely unmatched key; Alice is a case mismatch. The report keeps example pairs, which is how the output in Scenario 2 can name the specific addresses involved.
The encoding check looks for two conditions: mixed declared encodings
within one column, read off Encoding(), and the presence of
any of five code points that render as nothing or as an ordinary-looking
space:
hidden <- "[\u200B\u200C\u200D\uFEFF\u00A0]"
grepl(hidden, c("India\u200B", "India"), perl = TRUE)
#> [1] TRUE FALSEThe watch list is U+200B (zero-width space), U+200C (zero-width
non-joiner), U+200D (zero-width joiner), U+FEFF (byte order mark), and
U+00A0 (non-breaking space). U+00A0 earns its place on the list because
regex whitespace classes and trimws() treat it
inconsistently across engines and locales, so a whitespace scan alone
cannot be trusted to catch it.
Empty strings sit in a blind spot: they read as missing data to a human and as a valid key to a join.
An empty string equals another empty string, so two tables that both
use "" as a placeholder will match those rows to each
other. The report flags empty keys as informational, and
join_repair(empty_to_na = TRUE) converts them when that
match behavior is unwanted.
Key columns can also be factors, integers, or doubles, and the type checks cover the seams between column types. Character against factor is reported as informational, since the join will coerce. Character against numeric is reported as a warning, because that coercion is a recurring source of surprises. When both sides are factors, their level sets are compared and any levels unique to one side are listed. Numeric keys get two checks of their own: keys with decimal parts, and keys large enough to exhaust double precision.
Both lines are standard floating-point behavior, and both translate
directly into joins that quietly drop or merge rows: a key computed as
0.1 + 0.2 will never find a key stored as 0.3,
and above 2^53 distinct integer IDs stop being
distinguishable as doubles.
Separately from string quality, join_spy() summarizes
each table’s keys: rows, unique keys, duplicated keys, NA keys.
Duplicates matter because each match against a duplicated key multiplies
rows; NA keys matter because they will not match. From the per-key
counts the report predicts the result size of all four join types. The
inner-join figure is the sum, over matched keys, of the product of each
side’s counts, which is exactly the arithmetic that turns a
harmless-looking duplicate into a memory problem. The report object also
carries a heuristic memory estimate per join type for the same
reason.
Whatever remains unmatched after the exact analysis goes through a
typo hunt. Unmatched left keys are compared against right keys by
Levenshtein edit distance, the minimum number of single-character
insertions, deletions, and substitutions needed to turn one string into
the other, computed with utils::adist():
utils::adist("WDG102", "WDG-102")
#> [,1]
#> [1,] 1
utils::adist("Asia Pacific ", "Asia Pacific")
#> [,1]
#> [1,] 1A pair is reported as a near match when its distance is 1 or 2. Keys
shorter than three characters are skipped, since nearly everything sits
within two edits of a string like "A7". A length pre-check
discards pairs whose lengths already differ by more than two, and to
keep the scan cheap the comparison covers at most 50 unmatched left keys
against 100 right keys, reporting the 10 closest pairs. Near matches
stay informational: "WDG102" sitting one edit from
"WDG-102" is strong evidence of a typo, and the report
points it out without changing anything.
For multi-column keys, an extra pass analyzes each key column on its own: unique values per side, matches between them, and a per-column match rate. The column with the lowest match rate is called out as the likely problem column. Scenario 5 shows what this buys: an overall match rate says something is wrong, and the per-column breakdown says where.
A retail analytics team receives monthly sales data from a
distribution partner as a CSV exported from Excel. They join it against
their internal customer database on customer_id. For two
quarters, everything works. Then one month, 30% of the sales records
stop matching. Nobody changed the code or the customer database. The
partner’s IDs are all present in the internal system – or so it
appears.
partner_sales <- data.frame(
customer_id = c("CUST-1001", "CUST-1002 ", "CUST-1003",
" CUST-1004", "CUST-1005 ", "CUST-1006"),
amount = c(2500, 1800, 3200, 950, 4100, 1600),
stringsAsFactors = FALSE
)
internal_db <- data.frame(
customer_id = c("CUST-1001", "CUST-1002", "CUST-1003",
"CUST-1004", "CUST-1005", "CUST-1006", "CUST-1007"),
region = c("West", "East", "West", "South", "East", "North", "West"),
stringsAsFactors = FALSE
)Nothing in str() or print() reveals the
issue – trailing spaces are invisible in console output.
join_spy() catches it:
report <- join_spy(partner_sales, internal_db, by = "customer_id")
report
#>
#> ── Join Diagnostic Report ──────────────────────────────────────────────────────
#> Join columns: customer_id
#>
#> ── Table Summary ──
#>
#> Left table: Rows: 6 Unique keys: 6 Duplicate keys: 0 NA keys: 0
#> Right table: Rows: 7 Unique keys: 7 Duplicate keys: 0 NA keys: 0
#>
#> ── Match Analysis ──
#>
#> Keys in both: 3
#> Keys only in left: 3
#> Keys only in right: 4
#> Match rate (left): "50%"
#>
#> ── Issues Detected ──
#>
#> ! Left column 'customer_id' has 3 value(s) with leading/trailing whitespace
#> ℹ 10 near-match(es) found (e.g., 'CUST-1002 ' ~ 'CUST-1002', ' CUST-1004' ~ 'CUST-1004', 'CUST-1002 ' ~ 'CUST-1001') - possible typos?
#>
#> ── Expected Row Counts ──
#>
#> inner_join: 3
#> left_join: 6
#> right_join: 7
#> full_join: 10Three of the six partner IDs carry whitespace.
"CUST-1002 " is a different string from
"CUST-1002" as far as R is concerned.
join_repair() trims both tables at once:
repaired <- join_repair(partner_sales, internal_db, by = "customer_id")
#> ✔ Repaired 3 value(s)
partner_fixed <- repaired$x
internal_fixed <- repaired$yWe can verify the repair worked:
key_check(partner_fixed, internal_fixed, by = "customer_id")
#> ✔ Key check passed: no issues detectedAnd now the join gives us what we expected:
The root cause was an Excel CONCATENATE formula that
preserved trailing spaces from a variable-width source column. Excel
renders "CUST-1002" and "CUST-1002 "
identically, so nobody noticed. Trailing whitespace is the single most
common join failure we see in practice.
Excel is one of several routes for this defect. SQL
CHAR(n) columns pad values with spaces to a fixed width,
and some database drivers deliver the padding to R intact. Fixed-width
flat files do it by construction. Manual entry adds a trailing space
whenever someone types one before tabbing out of a cell. Every route
passes through an ingestion point, so the cheapest prevention is to trim
key columns once, where external data enters the pipeline, and to put a
key_check() call at that boundary. It would have turned
this quarter’s silent 30% match drop into a loud failure on the first
run.
A SaaS company wants to join CRM profiles to clickstream events for a churn analysis. The CRM stores email addresses in uppercase (a database migration decision from the late 1990s). The web app stores them in lowercase. Both systems are internally consistent.
crm_profiles <- data.frame(
email = c("[email protected]", "[email protected]", "[email protected]",
"[email protected]", "[email protected]"),
plan = c("enterprise", "starter", "pro", "enterprise", "starter"),
stringsAsFactors = FALSE
)
click_events <- data.frame(
email = c("[email protected]", "[email protected]", "[email protected]",
"[email protected]", "[email protected]"),
page_views = c(47, 12, 89, 33, 5),
stringsAsFactors = FALSE
)An inner join returns zero rows. R’s string comparison is
case-sensitive, so every key pair fails. join_spy() flags
the situation before the join:
report <- join_spy(crm_profiles, click_events, by = "email")
report
#>
#> ── Join Diagnostic Report ──────────────────────────────────────────────────────
#> Join columns: email
#>
#> ── Table Summary ──
#>
#> Left table: Rows: 5 Unique keys: 5 Duplicate keys: 0 NA keys: 0
#> Right table: Rows: 5 Unique keys: 5 Duplicate keys: 0 NA keys: 0
#>
#> ── Match Analysis ──
#>
#> Keys in both: 0
#> Keys only in left: 5
#> Keys only in right: 5
#> Match rate (left): "0%"
#>
#> ── Issues Detected ──
#>
#> ! 4 key(s) would match if case-insensitive (e.g., '[email protected]' vs '[email protected]')
#>
#> ── Expected Row Counts ──
#>
#> inner_join: 0
#> left_join: 5
#> right_join: 5
#> full_join: 10suggest_repairs() generates the fix:
suggest_repairs(report)
#>
#> ── Suggested Repairs ───────────────────────────────────────────────────────────
#> # Standardize case:
#> x[["email"]] <- tolower(x[["email"]])
#> y[["email"]] <- tolower(y[["email"]])Or we can use join_repair() directly, specifying case
standardization:
repaired <- join_repair(
crm_profiles, click_events,
by = "email",
standardize_case = "lower"
)
#> ✔ Repaired 5 value(s)After lowercasing both sides, the inner join returns four matched rows (everyone except Eve, who has no click data, and Frank, who is not in the CRM):
result <- merge(repaired$x, repaired$y, by = "email")
nrow(result)
#> [1] 4
result
#> email plan page_views
#> 1 [email protected] enterprise 47
#> 2 [email protected] starter 12
#> 3 [email protected] pro 89
#> 4 [email protected] enterprise 33Email addresses are case-insensitive by RFC 5321, so lowercasing is
the right normalization here. For other identifier types (product codes,
country abbreviations), "upper" may be more
appropriate.
Conventions like these are system-wide and invisible from inside the
system that holds them. Many database collations compare strings
case-insensitively, so the CRM’s uppercase migration never broke an
internal query; the web stack lowercased on write and was just as
self-consistent. The mismatch exists only at the seam where the two
systems meet, which is where an R join usually sits. The first join
between two new sources is the right moment to run
join_spy() and settle the convention question, and spelling
the choice out through standardize_case documents it in
code for the next person who touches the pipeline.
A public health researcher compiles data from multiple sources for a systematic review. A few studies published supplementary tables only as PDF, so she copies the table from the PDF viewer, pastes into a spreadsheet, cleans up the columns, and reads the CSV into R. The data looks perfect – every country name is spelled correctly. But half the countries fail to match a reference population table.
# Simulating PDF copy-paste artifacts:
# \u00A0 is non-breaking space, \u200B is zero-width space
pdf_data <- data.frame(
country = c("Brazil", "India\u200B", "Germany",
"Japan\u00A0", "Canada", "France\u200B"),
prevalence = c(12.3, 8.7, 5.1, 3.9, 6.2, 4.8),
stringsAsFactors = FALSE
)
reference <- data.frame(
country = c("Brazil", "India", "Germany", "Japan",
"Canada", "France", "Italy"),
population_m = c(214, 1408, 84, 125, 38, 68, 59),
stringsAsFactors = FALSE
)Printing the PDF data shows nothing wrong:
The zero-width space after “India” and “France” occupies zero pixels.
The non-breaking space after “Japan” renders like a regular space but is
U+00A0, not U+0020 – trimws() will not always remove it.
The merge reflects this:
Three of six countries match. join_spy() detects the
invisible characters:
report <- join_spy(pdf_data, reference, by = "country")
report
#>
#> ── Join Diagnostic Report ──────────────────────────────────────────────────────
#> Join columns: country
#>
#> ── Table Summary ──
#>
#> Left table: Rows: 6 Unique keys: 6 Duplicate keys: 0 NA keys: 0
#> Right table: Rows: 7 Unique keys: 7 Duplicate keys: 0 NA keys: 0
#>
#> ── Match Analysis ──
#>
#> Keys in both: 3
#> Keys only in left: 3
#> Keys only in right: 4
#> Match rate (left): "50%"
#>
#> ── Issues Detected ──
#>
#> ! Left column 'country' has encoding issues (invisible chars or mixed encoding)
#> ℹ 3 near-match(es) found (e.g., 'India' ~ 'India', 'Japan ' ~ 'Japan', 'France' ~ 'France') - possible typos?
#>
#> ── Expected Row Counts ──
#>
#> inner_join: 3
#> left_join: 6
#> right_join: 7
#> full_join: 10Alongside the encoding warning, the near-match check reaches the same conclusion from a different direction: each contaminated name sits one edit away from its clean counterpart in the reference table.
repaired <- join_repair(pdf_data, reference, by = "country")
#> ✔ Repaired 3 value(s)
nrow(merge(repaired$x, repaired$y, by = "country"))
#> [1] 6Six matches. PDF copy-paste is the most common source of these
artifacts, but web scraping, OCR output, and legacy mainframe exports
can produce them too. One useful debugging trick outside of joinspy:
nchar("India\u200B") returns 6, not 5. But that requires
already suspecting the problem.
Each watched code point has a typical entry route. The non-breaking
space comes from typesetting: PDF layout engines and HTML
( ) use it to keep words on one line, and the
clipboard preserves it faithfully. Zero-width spaces are inserted by
some content-management systems and word processors as line-break hints.
The byte order mark arrives at the front of files saved as “UTF-8 with
BOM” by some Windows editors, where it fuses with the first field name
or the first key of a CSV. All of these survive visual inspection, and
all of them disappear under join_repair()’s
invisible-character pass. Stripping them right after
read.csv() keeps them out of every downstream join at
once.
An e-commerce pipeline joins transaction records to a product catalogue. The pipeline ran cleanly for months, then match rates started drifting: 99% in January, 97% in February, 94% in March. Nobody noticed until finance flagged a margin discrepancy in April.
The code had not changed. A new data entry clerk had joined the
warehouse team in December. The canonical product code format was
"WDG-100" – uppercase prefix, dash, three-digit suffix. The
new clerk sometimes omitted the dash, sometimes typed lowercase. The
warehouse system did fuzzy matching internally, so it accepted codes
that the ETL join’s exact comparison rejected.
# Product catalogue (canonical format)
catalogue <- data.frame(
product_code = c("WDG-100", "WDG-101", "WDG-102",
"WDG-103", "WDG-104", "WDG-105"),
product_name = c("Widget A", "Widget B", "Widget C",
"Widget D", "Widget E", "Widget F"),
margin = c(0.35, 0.28, 0.42, 0.31, 0.39, 0.25),
stringsAsFactors = FALSE
)
# Recent transactions (mix of old and new clerk entries)
transactions <- data.frame(
product_code = c("WDG-100", "WDG-101", "WDG102",
"wdg-103", "WDG-104", "wdg105",
"WDG-100", "WDG103"),
quantity = c(5, 3, 7, 2, 4, 6, 1, 8),
stringsAsFactors = FALSE
)Some codes match and some do not, which makes partial failures harder to spot than complete ones.
report <- join_spy(transactions, catalogue, by = "product_code")
report
#>
#> ── Join Diagnostic Report ──────────────────────────────────────────────────────
#> Join columns: product_code
#>
#> ── Table Summary ──
#>
#> Left table: Rows: 8 Unique keys: 7 Duplicate keys: 1 NA keys: 0
#> Right table: Rows: 6 Unique keys: 6 Duplicate keys: 0 NA keys: 0
#>
#> ── Match Analysis ──
#>
#> Keys in both: 3
#> Keys only in left: 4
#> Keys only in right: 3
#> Match rate (left): "42.9%"
#>
#> ── Issues Detected ──
#>
#> ! Left table has 1 duplicate key(s) affecting 2 rows - may cause row multiplication
#> ! 1 key(s) would match if case-insensitive (e.g., 'wdg-103' vs 'WDG-103')
#> ℹ 10 near-match(es) found (e.g., 'WDG102' ~ 'WDG-102', 'WDG103' ~ 'WDG-103', 'WDG102' ~ 'WDG-100') - possible typos?
#>
#> ── Expected Row Counts ──
#>
#> inner_join: 4
#> left_join: 8
#> right_join: 7
#> full_join: 11The report stacks several findings: the case check catches
"wdg-103", and the near-match check lines the dashless
codes up against catalogue entries one or two edits away.
Here is where this scenario differs from the previous ones.
join_repair() can fix the case issue, but it cannot insert
the missing dashes – that requires domain knowledge about the code
format.
We can do a dry run to see what join_repair() would
fix:
join_repair(transactions, catalogue,
by = "product_code",
standardize_case = "upper",
dry_run = TRUE)
#>
#> ── Repair Preview (Dry Run) ────────────────────────────────────────────────────
#>
#> ── Left table (x) ──
#>
#> ℹ product_code: upper case (2)After applying the mechanical fixes:
repaired <- join_repair(transactions, catalogue,
by = "product_code",
standardize_case = "upper")
#> ✔ Repaired 2 value(s)The case issues are resolved, but the missing dashes remain. A manual transformation handles those:
# Manual fix: insert dash if missing in product codes matching the pattern
fix_codes <- function(codes) {
gsub("^([A-Z]{3})(\\d)", "\\1-\\2", codes)
}
repaired$x$product_code <- fix_codes(repaired$x$product_code)All eight transactions match. join_repair() handles
context-free transformations (trimming, case normalization, stripping
invisible characters). Inserting a dash into "WDG102"
requires knowing the canonical format – that fix has to come from
someone who understands the data.
The upstream story here is the most general of the five: two systems
with different matching strictness, connected by people. The warehouse
software tolerated sloppy codes because its internal lookup was fuzzy;
the ETL join was exact; the clerk had no way to see the difference. Keys
typed by humans drift like this wherever entry goes unvalidated. A rule
rejecting anything that fails ^[A-Z]{3}-\d{3}$ at the
warehouse boundary would have caught every malformed code in December.
The drift is also visible from the R side: a match rate sliding from 99%
to 94% over three months is a measurable signal, and logging a report
per pipeline run (vignette("production") covers
log_report() and set_log_file()) turns that
slide into an alert long before finance notices the margins.
Two government datasets need to be linked: regional economic indicators and regional population estimates, keyed on region and year. The year column is numeric and matches without trouble. The region column has a whitespace problem that affects only some records.
economics <- data.frame(
region = c("North America", "Europe", "Asia Pacific ",
"North America", "Europe", "Asia Pacific "),
year = c(2022, 2022, 2022, 2023, 2023, 2023),
gdp_growth = c(2.1, 1.8, 4.2, 1.9, 0.9, 3.8),
stringsAsFactors = FALSE
)
population <- data.frame(
region = c("North America", "Europe", "Asia Pacific",
"North America", "Europe", "Asia Pacific"),
year = c(2022, 2022, 2022, 2023, 2023, 2023),
pop_millions = c(580, 450, 4300, 585, 448, 4350),
stringsAsFactors = FALSE
)In a compound key join, both columns must match. The year column is
fine everywhere, but "Asia Pacific " with a trailing space
will not match "Asia Pacific".
Four rows instead of six. North America and Europe match; Asia Pacific does not, because of the trailing space.
join_spy() pinpoints which column in the compound key
has the problem:
report <- join_spy(economics, population, by = c("region", "year"))
report
#>
#> ── Join Diagnostic Report ──────────────────────────────────────────────────────
#> Join columns: region, year
#>
#> ── Table Summary ──
#>
#> Left table: Rows: 6 Unique keys: 6 Duplicate keys: 0 NA keys: 0
#> Right table: Rows: 6 Unique keys: 6 Duplicate keys: 0 NA keys: 0
#>
#> ── Match Analysis ──
#>
#> Keys in both: 4
#> Keys only in left: 2
#> Keys only in right: 2
#> Match rate (left): "66.7%"
#>
#> ── Issues Detected ──
#>
#> ! Left column 'region' has 1 value(s) with leading/trailing whitespace
#> ℹ 1 near-match(es) found (e.g., 'Asia Pacific ' ~ 'Asia Pacific') - possible typos?
#>
#> ── Per-Column Breakdown ──
#>
#> region: "66.7%" match rate (2/3)
#> year: "100%" match rate (2/2)
#> ℹ Lowest match rate: region
#>
#> ── Expected Row Counts ──
#>
#> inner_join: 4
#> left_join: 6
#> right_join: 6
#> full_join: 8The per-column breakdown shows the year column matching fully while region falls short, and names region as the column with the lowest match rate. From there the whitespace warning above it identifies the exact values at fault.
repaired <- join_repair(economics, population, by = c("region", "year"))
#> ✔ Repaired 2 value(s)
result <- merge(repaired$x, repaired$y, by = c("region", "year"))
nrow(result)
#> [1] 6Six rows. With compound keys, a string issue in any single column is enough to break the match. The more columns in the key, the more places a byte-level discrepancy can occur.
Compound keys generalize everything in this vignette: each column
added to by is one more surface where any of these defects
can land. Region-by-year panels, site-by-date field records,
patient-by-visit tables: wherever observations are keyed by a
combination of identifiers, the component columns tend to come from
different upstream systems with different hygiene. An aggregate match
rate cannot say which component is at fault. The per-column breakdown
can, and checking it first rules out most candidate explanations before
any bytes get inspected.
These five scenarios share three properties. The data looks correct
to standard inspection tools – str(),
summary(), print() all render the values
identically. R returns fewer (or more) rows without a warning, because
the key values genuinely differ at the byte level. And the fix is
mechanical once the cause is known – trimming, lowercasing, or stripping
invisible Unicode are all one-line operations.
join_spy() surfaces the cause directly, which is
especially useful with data from external sources, manual entry, PDF
extraction, or cross-system integrations.
Worked the other way around, the shared structure becomes a checklist. Each question in a join post-mortem maps to one call:
join_spy(x, y, by) gives the
full report: key summaries, match rates, detected issues, and expected
row counts for all four join types.key_check(x, y, by)
prints a short status and invisibly returns a logical;
warn = FALSE makes it silent.key_duplicates(data, by)
returns the duplicated rows with a .n_duplicates count
column.join_repair() with
dry_run = TRUE previews the repairs without touching the
data.join_repair() trims
whitespace, standardizes case, strips the invisible characters, and
optionally converts empty strings to NA.suggest_repairs(report) prints
copy-pasteable snippets matched to the issues a report found.join_explain(result, x, y, by) reconstructs where the rows
went.join_diff(before, after, by) compares dimensions, columns,
and key statistics.join_strict()
performs the join and errors when an expectation like
expect = "1:1" is violated.The split between diagnosis and repair is deliberate.
join_spy() never modifies data, and
join_repair() only applies transformations that are correct
regardless of what the keys mean: trimming, casing, stripping invisible
characters. Anything that requires knowing the data, like the missing
dashes of Scenario 4, stays a human decision, with the diagnostics
pointing at it.
For the wider catalogue of join failures beyond strings (duplicate
keys multiplying rows, NA keys, accidental cartesian products), see
vignette("common-issues"). For a tour of the full API on
one worked example, see vignette("quickstart"). The
string-level defects collected here are the ones that resist inspection
longest; one diagnostic call before joining data you did not generate
yourself covers every one of them.