---
title: "Working with Backends"
author: "Gilles Colling"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Working with Backends}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 4
)
library(joinspy)
```

joinspy works with base R data frames, tibbles, and data.tables. The join
wrappers (`left_join_spy()`, `join_strict()`, etc.) detect the input class and
dispatch to the right engine automatically. The diagnostic layer (`join_spy()`,
`key_check()`, `join_explain()`, and friends) is backend-agnostic: it runs
the same analysis regardless of what class the inputs are.

We walk through detection, the exact engine calls behind each wrapper,
explicit overrides, and class preservation below. We then take the engine
differences that survive the wrapper layer (column-name collisions, row
ordering, NA-key matching) and demonstrate each one with output, so that
switching backends mid-project holds no surprises. The vignette closes with
how backends interact with `join_strict()` and the report flow, and with
guidance on when an override earns its keep.


## Auto-detection

When we call `left_join_spy()` or `join_strict()` without specifying a
backend, joinspy inspects the class of `x` and `y` and picks the backend
according to a fixed priority: data.table > tibble > base R.

The check itself is two `inherits()` calls. A `"data.table"` class on either
input selects the data.table engine. When neither side is a data.table, an
input inheriting from `"tbl_df"` hands the join to dplyr, and plain data
frames fall through to base R, which needs nothing beyond what ships with R.
Both dplyr and data.table sit in Suggests, so the base path works on an
installation that has neither.

data.table takes priority because its merge implementation depends on key
handling, indexing, and reference semantics that a dplyr join would discard.
dplyr, on the other hand, handles a coerced data.table without issues. Both
inputs are checked -- if one side is a tibble and the other a plain data frame,
dplyr is selected. If a mixed-class call selects a backend whose package is
not installed, joinspy falls back to base R with a warning.

Here is the detection in action with each input type:

```{r}
# Base R data frames: auto-detects "base"
orders_df <- data.frame(
  id = c(1, 2, 3),
  amount = c(100, 250, 75),
  stringsAsFactors = FALSE
)

customers_df <- data.frame(
  id = c(1, 2, 4),
  name = c("Alice", "Bob", "Diana"),
  stringsAsFactors = FALSE
)

result_base <- left_join_spy(orders_df, customers_df, by = "id", .quiet = TRUE)
class(result_base)
```

```{r eval = requireNamespace("dplyr", quietly = TRUE)}
# Tibbles: auto-detects "dplyr"
orders_tbl <- dplyr::tibble(
  id = c(1, 2, 3),
  amount = c(100, 250, 75)
)

customers_tbl <- dplyr::tibble(
  id = c(1, 2, 4),
  name = c("Alice", "Bob", "Diana")
)

result_dplyr <- left_join_spy(orders_tbl, customers_tbl, by = "id", .quiet = TRUE)
class(result_dplyr)
```

```{r eval = requireNamespace("data.table", quietly = TRUE)}
# data.tables: auto-detects "data.table"
orders_dt <- data.table::data.table(
  id = c(1, 2, 3),
  amount = c(100, 250, 75)
)

customers_dt <- data.table::data.table(
  id = c(1, 2, 4),
  name = c("Alice", "Bob", "Diana")
)

result_dt <- left_join_spy(orders_dt, customers_dt, by = "id", .quiet = TRUE)
class(result_dt)
```

When the two inputs have different classes, the higher-priority class wins:

```{r eval = requireNamespace("data.table", quietly = TRUE) && requireNamespace("dplyr", quietly = TRUE)}
# data.table + tibble: data.table wins
mixed_result <- left_join_spy(orders_dt, customers_tbl, by = "id", .quiet = TRUE)
class(mixed_result)
```

A single data.table anywhere in the call decides the engine, even when the
other input is a tibble.


## From wrapper to engine call

All five joining functions route through one internal dispatch point, so
backend behavior is identical whether we call a `*_join_spy()` wrapper or
`join_strict()`. The dispatch maps each join type onto the engine's own
interface:

| joinspy call | `"base"` | `"dplyr"` | `"data.table"` |
|---|---|---|---|
| `left_join_spy()`  | `merge(x, y, all.x = TRUE)` | `dplyr::left_join(x, y, by)` | `merge(xdt, ydt, all.x = TRUE)` |
| `right_join_spy()` | `merge(x, y, all.y = TRUE)` | `dplyr::right_join(x, y, by)` | `merge(xdt, ydt, all.y = TRUE)` |
| `inner_join_spy()` | `merge(x, y)` | `dplyr::inner_join(x, y, by)` | `merge(xdt, ydt, all = FALSE)` |
| `full_join_spy()`  | `merge(x, y, all.x = TRUE, all.y = TRUE)` | `dplyr::full_join(x, y, by)` | `merge(xdt, ydt, all = TRUE)` |
| `join_strict(type = ...)` | as above, for `type` | as above | as above |

In the data.table column, `xdt` and `ydt` are the inputs after
`data.table::as.data.table()`, which passes an existing data.table through
unchanged. The data.table engine therefore goes through
`merge.data.table()`, the method data.table registers for base R's `merge()`
generic. It never touches the `y[x]` bracket-join syntax. Several behaviors
people associate with data.table joins (the `i.` column prefix, joining on
the table's stored key) belong to the bracket syntax and do not appear here.

A named `by` vector such as `c("id" = "customer_id")` works on every backend.
Base R receives it as `by.x` and `by.y`, dplyr takes the named vector
natively, and for data.table the right table's key columns are renamed to the
left table's names before merging. In all three cases the result carries the
left table's key names; the right table's name (`account_key` below) never
surfaces in the output, whichever engine runs:

```{r eval = requireNamespace("data.table", quietly = TRUE)}
accounts <- data.frame(id = c(1, 2), balance = c(50, 80))
holders <- data.frame(account_key = c(1, 2), holder = c("Ana", "Ben"))

names(left_join_spy(accounts, holders, by = c("id" = "account_key"),
                    backend = "data.table", .quiet = TRUE))
```

Anything passed through `...` lands in the engine call from the table above.
That is the door for engine-specific arguments, several of which appear in
the sections that follow: `sort`, `suffixes`, and `incomparables` for the two
`merge()`-based engines; `na_matches`, `suffix`, and `relationship` for
dplyr.


## Explicit override

All join wrappers and `join_strict()` accept a `backend` argument that
overrides auto-detection. The three valid values are `"base"`, `"dplyr"`, and
`"data.table"`.

We can force dplyr on plain data frames to get tibble output:

```{r eval = requireNamespace("dplyr", quietly = TRUE)}
result <- left_join_spy(orders_df, customers_df, by = "id",
                        backend = "dplyr", .quiet = TRUE)
class(result)
```

Many-to-many keys are a place where the engines look like they should
diverge. Called at the console, `dplyr::left_join()` warns when it detects an
unexpected many-to-many relationship. Through joinspy no warning appears on
any backend, because dplyr runs that check only for joins called directly
from the user's environment; a call that originates inside another package
skips it. The joinspy report still flags the expansion: it counts the
duplicate keys and predicts the multiplied row count before the engine runs.

```{r eval = requireNamespace("dplyr", quietly = TRUE)}
# These have a legitimate many-to-many relationship
tags <- dplyr::tibble(
  item_id = c(1, 1, 2),
  tag = c("red", "large", "small")
)

prices <- dplyr::tibble(
  item_id = c(1, 2, 2),
  currency = c("USD", "USD", "EUR")
)

result <- left_join_spy(tags, prices, by = "item_id", .quiet = TRUE)
nrow(result)
```

Three input rows became four, silently as far as dplyr is concerned;
`last_report()` has the full story. To turn an expectation about key
relationships into a hard requirement, dplyr's `relationship` argument passes
through `...` and keeps its teeth, since constraint violations are errors and
errors fire no matter who calls the join:

```{r error = TRUE, eval = requireNamespace("dplyr", quietly = TRUE)}
left_join_spy(tags, prices, by = "item_id", .quiet = TRUE,
              relationship = "one-to-one")
```

`join_strict(expect = "1:1")` does the same job with a single syntax across
all three backends; we return to it below.

Or force data.table on plain data frames for speed on large inputs:

```{r eval = requireNamespace("data.table", quietly = TRUE)}
result <- left_join_spy(orders_df, customers_df, by = "id",
                        backend = "data.table", .quiet = TRUE)
class(result)
```

An explicit backend must be installed. Requesting `backend = "dplyr"` without
dplyr will error, not silently fall back -- auto-detection is a convenience,
but an explicit override is a contract.

Setting `backend = "base"` is also a way to guarantee reproducibility across
environments where dplyr may or may not be installed.


## Class preservation

joinspy preserves input class through the full diagnostic-repair-join cycle:

- **Diagnostics** (`join_spy()`, `key_check()`, etc.) accept any data frame
  subclass and return report objects without modifying the input.
- **Repair** (`join_repair()`) operates on key columns in place and returns
  the same class it received.
- **Join wrappers** dispatch to the native join engine for the detected class.

Here is a full cycle with base R data frames:

```{r}
messy_df <- data.frame(
  code = c("A-1 ", "B-2", " C-3"),
  value = c(10, 20, 30),
  stringsAsFactors = FALSE
)

lookup_df <- data.frame(
  code = c("A-1", "B-2", "C-3"),
  label = c("Alpha", "Beta", "Gamma"),
  stringsAsFactors = FALSE
)

# 1. Diagnose
report <- join_spy(messy_df, lookup_df, by = "code")

# 2. Repair
repaired_df <- join_repair(messy_df, by = "code")
class(repaired_df)  # still data.frame

# 3. Join
joined_df <- left_join_spy(repaired_df, lookup_df, by = "code", .quiet = TRUE)
class(joined_df)  # still data.frame
joined_df
```

The same cycle with tibbles:

```{r eval = requireNamespace("dplyr", quietly = TRUE)}
messy_tbl <- dplyr::tibble(
  code = c("A-1 ", "B-2", " C-3"),
  value = c(10, 20, 30)
)

lookup_tbl <- dplyr::tibble(
  code = c("A-1", "B-2", "C-3"),
  label = c("Alpha", "Beta", "Gamma")
)

repaired_tbl <- join_repair(messy_tbl, by = "code")
class(repaired_tbl)  # still tbl_df

joined_tbl <- left_join_spy(repaired_tbl, lookup_tbl, by = "code", .quiet = TRUE)
class(joined_tbl)  # still tbl_df
joined_tbl
```

And with data.tables:

```{r eval = requireNamespace("data.table", quietly = TRUE)}
messy_dt <- data.table::data.table(
  code = c("A-1 ", "B-2", " C-3"),
  value = c(10, 20, 30)
)

lookup_dt <- data.table::data.table(
  code = c("A-1", "B-2", "C-3"),
  label = c("Alpha", "Beta", "Gamma")
)

repaired_dt <- join_repair(messy_dt, by = "code")
class(repaired_dt)  # still data.table

joined_dt <- left_join_spy(repaired_dt, lookup_dt, by = "code", .quiet = TRUE)
class(joined_dt)  # still data.table
joined_dt
```

When `join_repair()` receives both `x` and `y`, it returns a list with `$x`
and `$y`, each preserving the class of the corresponding input.

`join_strict()` also preserves class -- the cardinality check runs before the
join, so a satisfied constraint returns the native class and a violated one
errors before any output is produced.

The one exception is an explicit backend override that does not match the input
class. Passing `backend = "data.table"` on a tibble returns a data.table,
because that is what the data.table engine produces.


## Backend differences, demonstrated

The wrapper layer hides the engine interfaces and normalizes the column
naming for named keys. The engines' own conventions for the result pass
through untouched: how name collisions are resolved, what order rows come
back in, and whether NA keys match. Each of these is small enough to go
unnoticed in a code review and large enough to change downstream results,
so we show all three with output rather than assert them.

### Column-name collisions

When a non-key column exists in both tables, every engine has to rename one
of them. All three backends land on the same convention, the `.x`/`.y`
suffix pair:

```{r}
dup_left <- data.frame(id = c(1, 2), status = c("open", "closed"))
dup_right <- data.frame(id = c(1, 2), status = c("new", "old"))

names(left_join_spy(dup_left, dup_right, by = "id", .quiet = TRUE))
```

```{r eval = requireNamespace("dplyr", quietly = TRUE)}
names(left_join_spy(dup_left, dup_right, by = "id",
                    backend = "dplyr", .quiet = TRUE))
```

```{r eval = requireNamespace("data.table", quietly = TRUE)}
names(left_join_spy(dup_left, dup_right, by = "id",
                    backend = "data.table", .quiet = TRUE))
```

data.table users may expect an `i.` prefix on right-table columns. That
prefix comes from the `y[x]` bracket join; `merge.data.table()` follows the
suffix convention of base R's `merge()`, and the data.table backend calls
`merge()`. Column names in joinspy output are stable across backends.

The suffix pair is configurable through `...`, with one spelling difference
to keep in mind when migrating code: both `merge()` methods call the argument
`suffixes`, dplyr calls it `suffix`.

```{r eval = requireNamespace("dplyr", quietly = TRUE)}
names(left_join_spy(dup_left, dup_right, by = "id", backend = "dplyr",
                    .quiet = TRUE, suffix = c("_current", "_incoming")))
```

### Row ordering

Row order is the difference most likely to break code silently. Base R's
`merge()` sorts the result by the key columns. So does `merge.data.table()`,
whose `sort` argument defaults to `TRUE`. dplyr keeps the left table's row
order. We join a left table whose ids arrive in the order 3, 1, 2:

```{r}
lhs <- data.frame(id = c(3, 1, 2), amount = c(30, 10, 20))
rhs <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"))

left_join_spy(lhs, rhs, by = "id", .quiet = TRUE)$id
```

```{r eval = requireNamespace("dplyr", quietly = TRUE)}
left_join_spy(lhs, rhs, by = "id", backend = "dplyr", .quiet = TRUE)$id
```

```{r eval = requireNamespace("data.table", quietly = TRUE)}
left_join_spy(lhs, rhs, by = "id", backend = "data.table", .quiet = TRUE)$id
```

Only dplyr returns the rows the way they arrived. For the merge-based
engines, `sort = FALSE` passes through `...`:

```{r eval = requireNamespace("data.table", quietly = TRUE)}
left_join_spy(lhs, rhs, by = "id", backend = "data.table",
              .quiet = TRUE, sort = FALSE)$id
```

data.table documents the unsorted result as keeping `x`'s row order. Base R
documents it as unspecified, so unsorted `merge()` output should be treated
as arbitrary even when it happens to look right. Any pipeline that indexes
join output by row position will give different answers under base or
data.table than under dplyr; an explicit `order()` after the join, or joining
on a key downstream instead of relying on position, makes the assumption
visible and removes the hazard.

### NA keys match each other

All three engines treat NA as a matchable value by default. An NA key on the
left joins to an NA key on the right:

```{r}
na_left <- data.frame(code = c("A", NA), v = c(1, 2))
na_right <- data.frame(code = c("A", NA), label = c("alpha", "missing"))

nrow(inner_join_spy(na_left, na_right, by = "code", .quiet = TRUE))
```

```{r eval = requireNamespace("dplyr", quietly = TRUE)}
nrow(inner_join_spy(na_left, na_right, by = "code",
                    backend = "dplyr", .quiet = TRUE))
```

```{r eval = requireNamespace("data.table", quietly = TRUE)}
inner_join_spy(na_left, na_right, by = "code",
               backend = "data.table", .quiet = TRUE)
```

Two rows on every backend; the NA-to-NA pair survives the inner join. The
data.table output also shows its key sort placing the NA row first.

joinspy's row predictions take the opposite stance. `join_spy()` counts NA
keys separately, reports them as an issue, and predicts row counts as if they
match nothing:

```{r}
report <- join_spy(na_left, na_right, by = "code")
report$expected_rows$inner
```

The predicted inner join has 1 row; the engines return 2. In a verbose
wrapper call this surfaces as a `Result: 2 rows (expected 1)` flag, which is
exactly the discrepancy the wrapper exists to catch, since NA-on-NA matches
are a classic source of phantom rows. To make the engines agree with the
prediction, two pass-through arguments drop NA matches: `na_matches =
"never"` for dplyr, and `incomparables = NA` for both `merge()`-based
engines (single-key joins only):

```{r eval = requireNamespace("dplyr", quietly = TRUE)}
nrow(inner_join_spy(na_left, na_right, by = "code", backend = "dplyr",
                    .quiet = TRUE, na_matches = "never"))
```

```{r}
nrow(inner_join_spy(na_left, na_right, by = "code", backend = "base",
                    .quiet = TRUE, incomparables = NA))
```

For composite keys, where `incomparables` no longer applies, filtering NA
keys before the join with `complete.cases()` on the key columns is the
portable route. The report's NA-key counts say up front how many rows such a
filter would touch.


## join_strict(), .quiet, and the report flow

Every wrapper computes the diagnostic report before any engine runs. All
four `*_join_spy()` functions share one implementation that calls
`join_spy()` first, stores the result, and only then dispatches the join.
The `backend` argument affects the second step alone, so the report is
identical whichever engine performs the join:

```{r eval = requireNamespace("dplyr", quietly = TRUE)}
invisible(left_join_spy(orders_df, customers_df, by = "id",
                        backend = "base", .quiet = TRUE))
report_base <- last_report()

invisible(left_join_spy(orders_df, customers_df, by = "id",
                        backend = "dplyr", .quiet = TRUE))
report_dplyr <- last_report()

identical(report_base$expected_rows, report_dplyr$expected_rows)
```

`.quiet = TRUE` suppresses the printing and nothing else. The report is still
computed. It remains available through `last_report()`, travels with the
result as the `"join_report"` attribute, and goes to the log whenever
`set_log_file()` is active. A quiet data.table join leaves the same audit
trail as a verbose dplyr one.

```{r}
result <- left_join_spy(orders_df, customers_df, by = "id", .quiet = TRUE)
is_join_report(attr(result, "join_report"))
```

In verbose mode the wrapper finishes by comparing the engine's actual row
count against the report's prediction for the chosen join type. Because the
prediction is backend-independent, this check doubles as a guard against the
engine differences from the previous section: an NA key that matched, or a
duplicate that multiplied, shows up as a predicted-versus-actual gap whatever
the backend.

`join_strict()` slots into the same flow one step earlier. It summarizes the
keys, classifies the actual cardinality, and raises an error before any
engine is chosen when the `expect` contract is violated. The error message is
the same on every backend because no backend has run yet. Here we request
data.table and still fail at the cardinality gate:

```{r error = TRUE}
products <- data.frame(id = 1:3, product = c("A", "B", "C"))
suppliers_dup <- data.frame(id = c(1, 1, 2), name = c("S1", "S2", "S3"))

join_strict(products, suppliers_dup, by = "id", expect = "1:1",
            backend = "data.table")
```

When the contract holds, the join proceeds through the same dispatch as the
wrappers and returns the native class:

```{r eval = requireNamespace("data.table", quietly = TRUE)}
products_dt <- data.table::data.table(id = 1:3, product = c("A", "B", "C"))
suppliers_dt <- data.table::data.table(id = 1:3, name = c("S1", "S2", "S3"))

class(join_strict(products_dt, suppliers_dt, by = "id", expect = "1:1"))
```


## Diagnostics are backend-agnostic

The diagnostic functions (`join_spy()`, `key_check()`, `key_duplicates()`,
`join_explain()`, `detect_cardinality()`, `check_cartesian()`) operate purely
on column values and never call a join engine. They produce identical results
regardless of input class.

This means we can diagnose on data.tables and join with dplyr, or diagnose in
a base-R script and pass the data to a Shiny app that uses dplyr internally.

```{r eval = requireNamespace("data.table", quietly = TRUE) && requireNamespace("dplyr", quietly = TRUE)}
# Diagnose on data.tables
orders_dt <- data.table::data.table(
  id = c(1, 2, 3),
  amount = c(100, 250, 75)
)

customers_dt <- data.table::data.table(
  id = c(1, 2, 4),
  name = c("Alice", "Bob", "Diana")
)

report <- join_spy(orders_dt, customers_dt, by = "id")

# Join with dplyr (convert first)
orders_tbl <- dplyr::as_tibble(orders_dt)
customers_tbl <- dplyr::as_tibble(customers_dt)
result <- left_join_spy(orders_tbl, customers_tbl, by = "id", .quiet = TRUE)
class(result)
```

The report object is structurally identical across backends -- `$issues`,
`$expected_rows`, and `$match_analysis` contain the same values. This also
means we can write unit tests for key quality using plain data frames even when
production code uses data.table.


## Choosing a backend

Auto-detection is the right default. It keeps the output class aligned with
the input class, it never errors over a missing package, and the same script
behaves sensibly whether a colleague feeds it tibbles or data.tables. The
override is for cases where the engine itself matters.

`backend = "base"` removes dplyr and data.table from the execution path. A
script that must run on a minimal R installation, a package that wants
joinspy without inheriting heavier dependencies, and a frozen reproducibility
archive are all served by base R's `merge()`, whose join semantics have been
stable for decades. Forcing base is also one way to reach `merge()`-only
arguments on inputs that would otherwise auto-detect to dplyr.

`backend = "dplyr"` buys the `relationship` and `na_matches` controls shown
above, plus tibble output for pipelines that expect it. When the surrounding
code is dplyr anyway, matching the backend keeps a single join grammar in
play.

`backend = "data.table"` matters when the output feeds by-reference
operations (`:=`, `data.table::setkey()`) that require a genuine data.table,
or when the tables are large enough for engine speed to register at all.
Inside joinspy that point arrives later than expected, because the wrapper
spends most of its time on diagnostics rather than on the join. We can
measure this directly: the chunk times the same 5,000-row join on each
backend, then times the diagnostic pass alone.

```{r eval = requireNamespace("dplyr", quietly = TRUE) && requireNamespace("data.table", quietly = TRUE)}
set.seed(42)
n <- 5000
big_x <- data.frame(id = sample(n), x = rnorm(n))
big_y <- data.frame(id = sample(n), y = rnorm(n))

t_base <- system.time(left_join_spy(big_x, big_y, by = "id",
                                    backend = "base", .quiet = TRUE))
t_dplyr <- system.time(left_join_spy(big_x, big_y, by = "id",
                                     backend = "dplyr", .quiet = TRUE))
t_dt <- system.time(left_join_spy(big_x, big_y, by = "id",
                                  backend = "data.table", .quiet = TRUE))
t_spy <- system.time(report <- join_spy(big_x, big_y, by = "id"))

round(c(base = t_base[["elapsed"]], dplyr = t_dplyr[["elapsed"]],
        data.table = t_dt[["elapsed"]], diagnostics = t_spy[["elapsed"]]), 3)
```

The three wrapper timings land close together, and the last entry shows why:
most of the elapsed time is `join_spy()` scanning keys, with the engine merge
accounting for the remainder. Two consequences follow. Switching backends to
speed up a `*_join_spy()` call addresses the smaller term. And on genuinely
large tables the lever sits on the diagnostic side: `join_spy()` accepts a
`sample` argument for approximate diagnostics on a row subset, after which
the engine of choice can be called directly on the full data.

Two cautions before overriding. First, an explicit backend determines the
output class regardless of input class: `backend = "data.table"` on tibbles
returns a data.table, and downstream tibble-flavored code may misbehave.
Second, `backend = "base"` selects base R's `merge()` generic, and S3
dispatch inside `merge()` still routes data.table inputs to data.table's own
method:

```{r eval = requireNamespace("data.table", quietly = TRUE)}
class(left_join_spy(orders_dt, customers_dt, by = "id",
                    backend = "base", .quiet = TRUE))
```

A data.table input runs `merge.data.table()` even under `backend = "base"`.
Stripping the class first with `as.data.frame()` is the way to reach the
data.frame method. The same dispatch rule applies to tibbles, harmlessly:
tibbles register no `merge()` method of their own, so the data.frame method
runs. Auto-detection already picks the engine that respects the
richest input class, which is why leaving the `backend` argument alone is the
recommendation for everything except the deliberate cases above.


## See Also

- `vignette("quickstart")` for a quick introduction to joinspy

- `vignette("common-issues")` for a catalogue of join problems and solutions

- `?left_join_spy`, `?join_strict` for backend parameter documentation