---
title: "A Zoomerjoin Guided Tour"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{A Zoomerjoin Guided Tour}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Introduction:

This vignette gives a basic overview of the core functionality of the
zoomerjoin package. Zoomerjoin lets you fuzzily match datasets with
millions of rows in seconds while staying light on memory usage. This makes it
feasible to perform fuzzy joins on datasets with hundreds of millions of
observations in a matter of minutes.

## How Does it Work?

Zoomerjoin's blazingly fast string-distance joins are made possible by
an optimized, performant implementation of the
[MinHash](https://en.wikipedia.org/wiki/MinHash) algorithm written in Rust.

While most conventional joining packages compare all pairs of records in
the two datasets you wish to join, the MinHash algorithm manages to compare
only similar records to each other. This results in matching that is orders of
magnitude faster than other matching software packages: `zoomerjoin` takes
minutes or hours to join datasets that would have taken centuries to join using
other matching methods.
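
To make the idea behind MinHash concrete, here is a minimal base-R sketch. It is not zoomerjoin's Rust implementation, and the helper `char_shingles()` and the example strings are hypothetical names chosen for illustration. It relies on the standard MinHash property: under a random permutation of all n-grams, two sets share the same minimum rank with probability equal to their Jaccard similarity.

```{r}
# Illustrative sketch only: base R, not zoomerjoin's Rust implementation.
# All helper names below are hypothetical.

# Split a string into its set of overlapping character n-grams ("shingles").
char_shingles <- function(x, n = 2) {
  starts <- seq_len(max(nchar(x) - n + 1, 1))
  unique(substring(x, starts, starts + n - 1))
}

a <- char_shingles("national rifle association")
b <- char_shingles("national rifle assn")

# Exact Jaccard similarity: shared shingles / all distinct shingles.
exact <- length(intersect(a, b)) / length(union(a, b))

# MinHash: draw many random permutations of the shingle universe and record
# how often both sets have the same minimum rank.
set.seed(1)
universe <- union(a, b)
n_hashes <- 500
signature_matches <- replicate(n_hashes, {
  ranks <- sample(length(universe))
  min(ranks[match(a, universe)]) == min(ranks[match(b, universe)])
})
estimate <- mean(signature_matches)

c(exact = exact, minhash_estimate = estimate)
```

The share of matching minima approximates the exact similarity, and the approximation tightens as `n_hashes` grows. Because only short signatures need to be compared, records can be bucketed by signature instead of being compared pair by pair.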

## Basic Syntax:

If you're familiar with the logical-join syntax from `dplyr`, then you already
know how to use a fuzzy join to combine two datasets. Zoomerjoin provides
`jaccard_inner_join()` and `jaccard_full_join()` (among others), which are the
fuzzy-joining analogues of the corresponding `dplyr` functions.

I demonstrate the syntax by using the package to join two corpora, which are
formed from entries in the [Database on Ideology, Money in Politics, and Elections
(DIME)](https://data.stanford.edu/dime) (Bonica 2016).

The first corpus looks as follows:

```{r}
library(tidyverse)
library(microbenchmark)
library(fuzzyjoin)
library(zoomerjoin)

corpus_1 <- dime_data %>% # dime data is packaged with zoomerjoin
  head(500)
names(corpus_1) <- c("a", "field")
corpus_1
```

And the second looks as follows:

```{r}
corpus_2 <- dime_data %>% # dime data is packaged with zoomerjoin
  tail(500)
names(corpus_2) <- c("b", "field")
corpus_2
```

The two corpora can't be directly joined because of misspellings. This means
we must use the fuzzy-matching capabilities of zoomerjoin:

```{r}
set.seed(1)
start_time <- Sys.time()
join_out <- jaccard_inner_join(corpus_1, corpus_2,
  by = "field", n_gram_width = 6,
  n_bands = 20, band_width = 6, threshold = .8
)
print(Sys.time() - start_time)
print(join_out)
```

The first two arguments, `a` and `b`, are direct analogues of the `dplyr`
arguments, and are the two data frames you want to join. The `by` argument also
acts the same as it does in `dplyr` (it tells the function which columns you
want to match on).

The `n_gram_width` parameter determines how wide the n-grams that are used in
the similarity evaluation should be, while the `threshold` argument determines
how similar a pair of strings has to be (in Jaccard similarity) to be
considered a match. Users of the `stringdist` or `fuzzyjoin` packages will be
familiar with both of these arguments, but should bear in mind that those
packages measure *string distance* (where a distance of 0 indicates complete
similarity), while this package operates on *string similarity,* so a threshold
of .8 will keep matches above 80% Jaccard similarity.
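
The relationship between the two scales can be sketched in base R. The helper `jaccard_sim()` below is hypothetical and is not part of zoomerjoin's API; it assumes the usual definition of Jaccard similarity over character n-grams (shared n-grams divided by all distinct n-grams), with the distance taken as one minus the similarity.

```{r}
# Hypothetical helper for illustration; not part of zoomerjoin's API.
jaccard_sim <- function(x, y, n_gram_width = 2) {
  # Set of overlapping character n-grams of a single string.
  grams <- function(s) {
    starts <- seq_len(max(nchar(s) - n_gram_width + 1, 1))
    unique(substring(s, starts, starts + n_gram_width - 1))
  }
  gx <- grams(x)
  gy <- grams(y)
  length(intersect(gx, gy)) / length(union(gx, gy))
}

similarity <- jaccard_sim("planned parenthood", "planed parenthood")
distance <- 1 - similarity # the scale used by distance-based packages

c(similarity = similarity, distance = distance)

# A similarity threshold of .8 corresponds to a maximum distance of .2.
similarity >= .8
```

Wider n-grams make the comparison stricter, since a single typo then disturbs more of the n-grams in each string.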

The `n_bands` and `band_width` parameters govern the performance of the
locality-sensitive hashing (LSH) step. The default parameters should perform
well for medium-size (n < 10^7) datasets where matches are somewhat similar
(similarity > .8), but may require tuning in other settings. The
`jaccard_hyper_grid_search()` and `jaccard_curve()` functions can help select
these parameters for you given the properties of the LSH you desire.

As an example, you can use the `jaccard_curve()` function to plot the probability
that a pair of records is compared at each possible Jaccard similarity
between zero and one:

```{r}
jaccard_curve(20, 6)
```

By looking at the plot produced, we can see that with these hyperparameters,
comparisons will almost never be made between pairs of records that have a
Jaccard similarity of less than .2 (saving time), while pairs of records that
have a Jaccard similarity of greater than .8 are almost always compared (giving
a low false-negative rate).
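
The shape of this curve follows from the standard LSH banding analysis: a pair with similarity $s$ agrees on all hashes in one band with probability $s^r$, so it shares at least one of $b$ bands with probability $1 - (1 - s^r)^b$. The sketch below assumes that `n_bands` and `band_width` play the roles of $b$ and $r$; the helper `lsh_compare_prob()` is a hypothetical name, not a zoomerjoin function.

```{r}
# Hypothetical helper implementing the standard LSH banding formula.
# Assumes n_bands = number of bands and band_width = hashes per band.
lsh_compare_prob <- function(similarity, n_bands, band_width) {
  1 - (1 - similarity^band_width)^n_bands
}

s <- c(.2, .5, .8)
round(lsh_compare_prob(s, n_bands = 20, band_width = 6), 4)
```

Increasing `band_width` pushes the curve to the right (fewer spurious comparisons), while increasing `n_bands` pushes it to the left (fewer missed matches).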

For more details about the hyperparameters, the `textreuse` package has an
excellent vignette, and zoomerjoin provides a re-implementation of its
profiling tools, `jaccard_probability()` and `jaccard_bandwidth()` (although the
implementations differ slightly as the hyperparameters in each package are
different).


## Standardizing String Names After A Merge

Often after merging, it can help to standardize the names or fields that have
been joined on. This way, you can assign a unique label or identifying key to
all observations that have a similar value of the merging variable. The
`jaccard_string_group()` function makes this possible. It first performs
locality-sensitive hashing to identify similar pairs of observations within the
dataset, and then runs a community detection algorithm to identify clusters of
similar observations, which are each assigned a label. The community-detection
algorithm, `fastgreedy.community()` from the `igraph` package, runs in
log-linear time, so the entire algorithm completes in linearithmic time.

Here's a short snippet showing how you can use `jaccard_string_group()` to
standardize a set of organization names.

```{r}
organization_names <- c(
  "American Civil Liberties Union",
  "American Civil Liberties Union (ACLU)",
  "NRA National Rifle Association",
  "National Rifle Association NRA",
  "National Rifle Association",
  "Planned Parenthood",
  "Blue Cross"
)
standardized_organization_names <- jaccard_string_group(
  organization_names,
  threshold = .5, band_width = 3
)
print(standardized_organization_names)
```
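
The grouping step can be illustrated with a simplified stand-in. The base-R sketch below uses connected components rather than the community detection algorithm described above, and `similar_pairs` is made-up input standing in for the pairs that the hashing step would find among seven items; it is not output from zoomerjoin.

```{r}
# Simplified stand-in: connected components instead of community detection.
# `similar_pairs` is made-up input standing in for the pairs found by LSH.
similar_pairs <- rbind(c(1, 2), c(3, 4), c(4, 5))
n_items <- 7

# Start with every item in its own group, then repeatedly merge the groups
# of each similar pair until no label changes.
group <- seq_len(n_items)
repeat {
  previous <- group
  for (i in seq_len(nrow(similar_pairs))) {
    members <- group %in% group[similar_pairs[i, ]]
    group[members] <- min(group[members])
  }
  if (identical(group, previous)) break
}
group
```

Items that are linked directly or through a chain of similar pairs end up with the same label, and items with no similar partner keep a label of their own. A community detection algorithm is more conservative than this, because it can split apart groups that are joined only by a few weak links.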


## References:

Bonica, Adam. 2016. Database on Ideology, Money in Politics, and Elections: Public version 2.0 [Computer file]. Stanford, CA: Stanford University Libraries.