Package 'fastLink'

Title: Fast Probabilistic Record Linkage with Missing Data
Description: Implements a Fellegi-Sunter probabilistic record linkage model that allows for missing data and the inclusion of auxiliary information. This includes functionalities to conduct a merge of two datasets under the Fellegi-Sunter model using the Expectation-Maximization algorithm. In addition, tools for preparing, adjusting, and summarizing data merges are included. The package implements methods described in Enamorado, Fifield, and Imai (2019) "Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records" <doi:10.1017/S0003055418000783> and is available at <https://imai.fas.harvard.edu/research/linkage.html>.
Authors: Ted Enamorado [aut, cre], Ben Fifield [aut], Kosuke Imai [aut]
Maintainer: Ted Enamorado <[email protected]>
License: GPL (>= 3)
Version: 0.6.1
Built: 2024-12-12 07:06:23 UTC
Source: CRAN


aggconfusion

Description

Aggregate confusion tables from separate runs of fastLink() (UNDER DEVELOPMENT)

Usage

aggconfusion(object)

Arguments

object

A list of confusion tables.

Value

'aggconfusion()' returns two tables: the aggregated confusion table and a table of additional summary statistics.

Author(s)

Ted Enamorado <[email protected]> and Ben Fifield <[email protected]>


Aggregate EM objects for use in 'summary.fastLink()'

Description

aggregateEM aggregates EM objects for easy processing by 'summary.fastLink()'

Usage

aggregateEM(em.list, within.geo)

Arguments

em.list

A list of 'fastLink' or 'fastLink.EM' objects that should be aggregated in 'summary.fastLink()'.

within.geo

A vector of booleans corresponding to whether each object in 'em.list' is a within-geography match or an across-geography match. Should be of equal length to 'em.list'. Default is NULL (assumes all are within-geography matches).


blockData

Description

Contains functionalities for blocking two data sets on one or more variables prior to conducting a merge.

Usage

blockData(dfA, dfB, varnames, window.block, window.size,
kmeans.block, nclusters, iter.max, n.cores)

Arguments

dfA

Dataset A - to be matched to Dataset B

dfB

Dataset B - to be matched to Dataset A

varnames

A vector of variable names to use for blocking. Must be present in both dfA and dfB

window.block

A vector of variable names indicating that the variable should be blocked using windowing blocking. Must be present in varnames.

window.size

The size of the window for window blocking. Default is 1 (observations +/- 1 on the specified variable will be blocked together).

kmeans.block

A vector of variable names indicating that the variable should be blocked using k-means blocking. Must be present in varnames.

nclusters

Number of clusters to create with k-means. Default value is the number of clusters where the average cluster size is 100,000 observations.

iter.max

Maximum number of iterations for the k-means algorithm to run. Default is 5000

n.cores

Number of cores to parallelize over. Default is NULL.

Value

A list with an entry for each block. Each list entry contains two vectors — one with the indices indicating the block members in dataset A, and another containing the indices indicating the block members in dataset B.

Examples

## Not run: 
block_out <- blockData(dfA, dfB, varnames = c("city", "birthyear"))

## End(Not run)
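
The shape of the return value described above (one list entry per block, each holding two index vectors) can be illustrated with a base-R sketch of exact blocking on a single variable. This is not the package's implementation: the toy data frames and the element names a_idx and b_idx are hypothetical, and window and k-means blocking are not covered.

```r
## Toy data; the column name "city" mirrors the blockData example above.
dfA_toy <- data.frame(city = c("x", "y", "x", "z"), stringsAsFactors = FALSE)
dfB_toy <- data.frame(city = c("y", "x", "y"), stringsAsFactors = FALSE)

## Only values present in both datasets can form a block.
shared <- intersect(dfA_toy$city, dfB_toy$city)

## One list entry per block, each with the row indices from A and from B.
## The element names a_idx and b_idx are hypothetical.
blocks <- lapply(shared, function(v) {
  list(a_idx = which(dfA_toy$city == v),
       b_idx = which(dfB_toy$city == v))
})
names(blocks) <- paste0("block.", seq_along(blocks))

str(blocks)
```

Each block can then be merged separately by subsetting both datasets to the stored indices.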

calcMoversPriors

Description

calcMoversPriors calculates prior estimates of in-state and cross-state movers rates from the IRS SOI Migration data, which can be used to improve the accuracy of the EM algorithm.

Usage

calcMoversPriors(geo.a, geo.b, year.start, year.end,
county, state.a, state.b, matchrate.lambda, remove.instate)

Arguments

geo.a

The state code (if county = FALSE) or county name (if county = TRUE) for the earlier of the two voter files.

geo.b

The state code (if county = FALSE) or county name (if county = TRUE) for the later of the two voter files.

year.start

The year of the voter file for geography A.

year.end

The year of the voter file for geography B.

county

Whether prior is being calculated on the county or state level. Default is FALSE (for a state-level calculation).

state.a

If county = TRUE (indicating a county-level match), the state code of geo.a. Default is NULL.

state.b

If county = TRUE (indicating a county-level match), the state code of geo.b. Default is NULL.

matchrate.lambda

If TRUE, then returns the match rate for lambda (the expected share of observations in dataset A that can be found in dataset B). If FALSE, then returns the expected share of matches across all pairwise comparisons of datasets A and B. Default is FALSE

remove.instate

If TRUE, then for calculating cross-state movers rates assumes that successful matches have been subsetted out. The interpretation of the prior is then the match rate conditional on being an out-of-state or county mover. Default is TRUE.

Value

calcMoversPriors returns a list with estimates of the expected match rate, and of the expected in-state movers rate when matching within-state.

Author(s)

Ben Fifield <[email protected]>

Examples

calcMoversPriors(geo.a = "CA", geo.b = "CA", year.start = 2014, year.end = 2015)
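
The two interpretations of the prior selected by matchrate.lambda differ only in their denominator. The arithmetic below is a sketch under the assumption that each record in dataset A matches at most one record in dataset B. The sizes 500 and 350 are those of the sample datasets dfA and dfB; the 10% share is a made-up value, not an estimate from the IRS data, and all variable names are hypothetical.

```r
n_a <- 500            # observations in dataset A
n_b <- 350            # observations in dataset B
share_a_in_b <- 0.10  # assumed share of A that can be found in B

## Expected number of true matches under the one-to-one assumption.
expected_matches <- share_a_in_b * n_a

## The same prior expressed as a share of all pairwise comparisons.
share_of_pairs <- expected_matches / (n_a * n_b)

c(share_a_in_b = share_a_in_b, share_of_pairs = share_of_pairs)
```

Because the number of pairwise comparisons grows with the product of the dataset sizes, the second quantity is typically far smaller than the first.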

clusterMatch

Description

Creates properly sized clusters for matching, using either alphabetical or word embedding clustering. If using word embedding, the function first creates a word embedding out of the provided vectors, and then runs PCA on the matrix. It then takes the first k dimensions (where k is provided by the user) and k-means is run on that matrix to get the clusters.

Usage

clusterMatch(vecA, vecB, nclusters, max.n, word.embed, min.var, iter.max)

Arguments

vecA

The character vector from dataset A

vecB

The character vector from dataset B

nclusters

The number of clusters to create from the provided data. Either nclusters = NULL or max.n = NULL.

max.n

The maximum size of either dataset A or dataset B in the largest cluster. Either nclusters = NULL or max.n = NULL

word.embed

Whether to use word embedding clustering. Default is FALSE.

min.var

The minimum amount of explained variance (maximum = 1) a PCA dimension can provide in order to be included in k-means clustering when using word embedding. Default is .20.

iter.max

Maximum number of iterations for the k-means algorithm.

Value

clusterMatch returns a list with the following elements:

clusterA

The cluster assignments for dataset A

clusterB

The cluster assignments for dataset B

n.clusters

The number of clusters created

kmeans

The k-means object output.

pca

The PCA object output.

dims.pca

The number of dimensions from PCA used for the k-means clustering.

Author(s)

Ben Fifield <[email protected]>

Examples

data(samplematch)
cl <- clusterMatch(dfA$firstname, dfB$firstname, nclusters = 3)

Get confusion table for fastLink objects

Description

Calculate confusion table after running fastLink().

Usage

confusion(object, threshold)

Arguments

object

A 'fastLink' object or list of 'fastLink' objects. Can only be run if 'return.all = TRUE' in 'fastLink()'.

threshold

The matching threshold above which a pair is counted as a match. Default is .85.

Value

'confusion()' returns two tables: the confusion table itself and a table of additional summary statistics.

Author(s)

Ted Enamorado <[email protected]> and Ben Fifield <[email protected]>

Examples

## Not run: 
## "streetname" is dropped from partial.match because it is not among varnames
out <- fastLink(
  dfA = dfA, dfB = dfB,
  varnames = c("firstname", "middlename", "lastname"),
  stringdist.match = c("firstname", "middlename", "lastname"),
  partial.match = c("firstname", "lastname"),
  return.all = TRUE)

ct <- confusion(out)

## End(Not run)

County-level FIPS Codes

Description

This data maps county names to FIPS codes for use in calculating prior movers rates.

Usage

countyfips

Format

A dataframe containing 3235 observations.


County-level inflow rates by state

Description

This data compiles and cleans county-level movers inflow rates by county, from the IRS Statistics on Income dataset.

Usage

countyinflow

Format

A dataframe containing 423752 observations.


County-level outflow rates by state

Description

This data compiles and cleans county-level movers outflow rates by county, from the IRS Statistics on Income dataset.

Usage

countyoutflow

Format

A dataframe containing 424475 observations.


dedupeMatches

Description

Dedupe matched dataframes.

Usage

dedupeMatches(matchesA, matchesB, EM,
matchesLink, patterns, linprog)

Arguments

matchesA

A dataframe of the matched observations in dataset A, with all variables used to inform the match.

matchesB

A dataframe of the matched observations in dataset B, with all variables used to inform the match.

EM

The EM object from emlinkMARmov()

matchesLink

The output from matchesLink()

patterns

The output from getPatterns().

linprog

Whether to implement Winkler's linear programming solution to the deduplication problem. Default is FALSE.

Value

dedupeMatches() returns a list containing the following elements:

matchesA

A deduped version of matchesA

matchesB

A deduped version of matchesB

EM

A deduped version of the EM object

Author(s)

Ted Enamorado <[email protected]> and Ben Fifield <[email protected]>


Sample dataset A

Description

This data is a randomized and anonymized sample dataset to display features of fastLink.

Usage

dfA

Format

A dataframe containing 500 observations.


Sample dataset B

Description

This data is a randomized and anonymized sample dataset to display features of fastLink.

Usage

dfB

Format

A dataframe containing 350 observations.


emlinklog

Description

Expectation-Maximization algorithm for Record Linkage allowing for dependencies across linkage fields

Usage

emlinklog(patterns, nobs.a, nobs.b, p.m, p.gamma.j.m, p.gamma.j.u,
iter.max, tol, varnames)

Arguments

patterns

Table holding the counts for each unique agreement pattern, as produced by tableCounts.

nobs.a

Number of observations in dataset A

nobs.b

Number of observations in dataset B

p.m

probability of finding a match. Default is 0.1

p.gamma.j.m

Probability of observing a specific agreement pattern, conditional on being in the matched set.

p.gamma.j.u

Probability of observing a specific agreement pattern, conditional on being in the non-matched set.

iter.max

Max number of iterations. Default is 5000

tol

Convergence tolerance. Default is 1e-05

varnames

The vector of variable names used for matching. Automatically provided if using fastLink() wrapper. Used for clean visualization of EM results in summary functions.

Value

emlinklog returns a list with the following components:

zeta.j

The posterior match probabilities for each unique pattern.

p.m

The probability of finding a match.

p.u

The probability of finding a non-match.

p.gamma.j.m

The probability of observing a particular agreement pattern conditional on being in the set of matches.

p.gamma.j.u

The probability of observing a particular agreement pattern conditional on being in the set of non-matches.

patterns.w

Counts of the agreement patterns observed, along with the Fellegi-Sunter weights.

iter.converge

The number of iterations it took the EM algorithm to converge.

nobs.a

The number of observations in dataset A.

nobs.b

The number of observations in dataset B.

Author(s)

Ted Enamorado <[email protected]> and Benjamin Fifield

Examples

## Not run: 
## Calculate gammas
g1 <- gammaCKpar(dfA$firstname, dfB$firstname)
g2 <- gammaCKpar(dfA$middlename, dfB$middlename)
g3 <- gammaCKpar(dfA$lastname, dfB$lastname)
g4 <- gammaKpar(dfA$birthyear, dfB$birthyear)

## Run tableCounts
tc <- tableCounts(list(g1, g2, g3, g4), nobs.a = nrow(dfA), nobs.b = nrow(dfB))

## Run EM
em.log <- emlinklog(tc, nobs.a = nrow(dfA), nobs.b = nrow(dfB))

## End(Not run)
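
How the returned quantities relate to one another can be sketched with Bayes' rule in base R. The block below is an illustration under stated assumptions, not the package's code: the three agreement patterns and their probabilities are hypothetical, the weight shown is the classical Fellegi-Sunter log-ratio, and the exact form stored in patterns.w is not confirmed here.

```r
## Hypothetical inputs for three agreement patterns.
p_m <- 0.1                           # probability of a match
p_gamma_j_m <- c(0.02, 0.18, 0.80)   # P(pattern | match)
p_gamma_j_u <- c(0.90, 0.09, 0.01)   # P(pattern | non-match)

## Posterior match probability for each pattern (Bayes' rule).
num <- p_m * p_gamma_j_m
zeta_j <- num / (num + (1 - p_m) * p_gamma_j_u)

## Classical Fellegi-Sunter weight: log-likelihood ratio per pattern.
fs_weight <- log(p_gamma_j_m / p_gamma_j_u)

## Patterns that would be declared matches at a 0.85 threshold.
matched_patterns <- which(zeta_j >= 0.85)

data.frame(zeta_j = round(zeta_j, 3), fs_weight = round(fs_weight, 3))
```

Patterns that are much more likely among matches than among non-matches receive a large positive weight and a posterior close to 1.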

emlinkMARmov

Description

Expectation-Maximization algorithm for Record Linkage under the Missing at Random (MAR) assumption.

Usage

emlinkMARmov(patterns, nobs.a, nobs.b, p.m, iter.max,
tol, p.gamma.k.m, p.gamma.k.u, prior.lambda, w.lambda,
prior.pi, w.pi, address.field, gender.field, varnames)

Arguments

patterns

Table holding the counts for each unique agreement pattern, as produced by tableCounts.

nobs.a

Number of observations in dataset A

nobs.b

Number of observations in dataset B

p.m

probability of finding a match. Default is 0.1

iter.max

Max number of iterations. Default is 5000

tol

Convergence tolerance. Default is 1e-05

p.gamma.k.m

Probability of observing a specific agreement value for field k, conditional on being in the matched set.

p.gamma.k.u

Probability of observing a specific agreement value for field k, conditional on being in the non-matched set.

prior.lambda

The prior probability of finding a match, derived from auxiliary data.

w.lambda

How much weight to give the prior on lambda versus the data. Must range between 0 (no weight on prior) and 1 (weight fully on prior)

prior.pi

The prior probability of the address field not matching, conditional on being in the matched set. To be used when the share of movers in the population is known with some certainty.

w.pi

How much weight to give the prior on pi versus the data. Must range between 0 (no weight on prior) and 1 (weight fully on prior)

address.field

Boolean indicators for whether a given field is an address field. Default is NULL (FALSE for all fields). Address fields should be set to TRUE while non-address fields are set to FALSE if provided.

gender.field

Boolean indicators for whether a given field is for gender. If so, exact match is conducted on gender. Default is NULL (FALSE for all fields). The one gender field should be set to TRUE while all other fields are set to FALSE if provided.

varnames

The vector of variable names used for matching. Automatically provided if using fastLink() wrapper. Used for clean visualization of EM results in summary functions.

Value

emlinkMARmov returns a list with the following components:

zeta.j

The posterior match probabilities for each unique pattern.

p.m

The probability of a pair matching.

p.u

The probability of a pair not matching.

p.gamma.k.m

The matching probability for a specific matching field.

p.gamma.k.u

The non-matching probability for a specific matching field.

p.gamma.j.m

The probability of observing a particular agreement pattern conditional on being in the set of matches.

p.gamma.j.u

The probability of observing a particular agreement pattern conditional on being in the set of non-matches.

patterns.w

Counts of the agreement patterns observed, along with the Fellegi-Sunter weights.

iter.converge

The number of iterations it took the EM algorithm to converge.

nobs.a

The number of observations in dataset A.

nobs.b

The number of observations in dataset B.

Author(s)

Ted Enamorado <[email protected]> and Kosuke Imai

Examples

## Not run: 
## Calculate gammas
g1 <- gammaCKpar(dfA$firstname, dfB$firstname)
g2 <- gammaCKpar(dfA$middlename, dfB$middlename)
g3 <- gammaCKpar(dfA$lastname, dfB$lastname)
g4 <- gammaKpar(dfA$birthyear, dfB$birthyear)

## Run tableCounts
tc <- tableCounts(list(g1, g2, g3, g4), nobs.a = nrow(dfA), nobs.b = nrow(dfB))

## Run EM
em <- emlinkMARmov(tc, nobs.a = nrow(dfA), nobs.b = nrow(dfB))

## End(Not run)

emlinkRS

Description

Calculates Fellegi-Sunter weights and posterior zeta probabilities for matching patterns that are observed in a larger population but are not present in the sub-sample used to estimate the EM.

Usage

emlinkRS(patterns.out, em.out, nobs.a, nobs.b)

Arguments

patterns.out

The output from 'tableCounts()' or 'emlinkMARmov()' (run on full dataset), containing all observed matching patterns in the full sample and the number of times that pattern is observed.

em.out

The output from 'emlinkMARmov()', an EM object estimated on a smaller random sample to apply to counts from a larger sample

nobs.a

Total number of observations in dataset A

nobs.b

Total number of observations in dataset B

Value

emlinkRS returns a list with the following components:

zeta.j

The posterior match probabilities for each unique pattern.

p.m

The posterior probability of a pair matching.

p.u

The posterior probability of a pair not matching.

p.gamma.k.m

The posterior of the matching probability for a specific matching field.

p.gamma.k.u

The posterior of the non-matching probability for a specific matching field.

p.gamma.j.m

The posterior probability of observing a particular agreement pattern conditional on being in the matched set.

p.gamma.j.u

The posterior probability of observing a particular agreement pattern conditional on being in the unmatched set.

patterns.w

Counts of the agreement patterns observed, along with the Fellegi-Sunter weights.

iter.converge

The number of iterations it took the EM algorithm to converge.

nobs.a

The number of observations in dataset A.

nobs.b

The number of observations in dataset B.

Author(s)

Ted Enamorado <[email protected]> and Ben Fifield <[email protected]>

Examples

## Not run: 
## -------------
## Run on subset
## -------------
dfA.s <- dfA[sample(1:nrow(dfA), 50),]; dfB.s <- dfB[sample(1:nrow(dfB), 50),]

## Calculate gammas
g1 <- gammaCKpar(dfA.s$firstname, dfB.s$firstname)
g2 <- gammaCKpar(dfA.s$middlename, dfB.s$middlename)
g3 <- gammaCKpar(dfA.s$lastname, dfB.s$lastname)
g4 <- gammaKpar(dfA.s$birthyear, dfB.s$birthyear)

## Run tableCounts
tc <- tableCounts(list(g1, g2, g3, g4), nobs.a = nrow(dfA.s), nobs.b = nrow(dfB.s))

## Run EM
em <- emlinkMARmov(tc, nobs.a = nrow(dfA.s), nobs.b = nrow(dfB.s))

## ------------------
## Apply to full data
## ------------------

## Calculate gammas
g1 <- gammaCKpar(dfA$firstname, dfB$firstname)
g2 <- gammaCKpar(dfA$middlename, dfB$middlename)
g3 <- gammaCKpar(dfA$lastname, dfB$lastname)
g4 <- gammaKpar(dfA$birthyear, dfB$birthyear)

## Run tableCounts
tc <- tableCounts(list(g1, g2, g3, g4), nobs.a = nrow(dfA), nobs.b = nrow(dfB))

em.full <- emlinkRS(tc, em, nrow(dfA), nrow(dfB))

## End(Not run)

gammaCK2par

Description

Field comparisons for string variables. Two possible agreement patterns are considered: 0 total disagreement, 2 agreement. The distance between strings is calculated using a Jaro-Winkler distance.

Usage

gammaCK2par(matAp, matBp, n.cores, cut.a, method, w)

Arguments

matAp

vector storing the comparison field in data set 1

matBp

vector storing the comparison field in data set 2

n.cores

Number of cores to parallelize over. Default is NULL.

cut.a

Lower bound for full match, ranging between 0 and 1. Default is 0.92

method

String distance method, options are: "jw" Jaro-Winkler (Default), "dl" Damerau-Levenshtein, "jaro" Jaro, and "lv" Edit

w

Parameter that describes the importance of the first characters of a string (only needed if method = "jw"). Default is .10

Value

gammaCK2par returns a list with the indices corresponding to each matching pattern, which can be fed directly into tableCounts and matchesLink.

Author(s)

Ted Enamorado <[email protected]>, Ben Fifield <[email protected]>, and Kosuke Imai

Examples

## Not run: 
g1 <- gammaCK2par(dfA$firstname, dfB$firstname)

## End(Not run)

gammaCKpar

Description

Field comparisons for string variables. Three possible agreement patterns are considered: 0 total disagreement, 1 partial agreement, 2 agreement. The distance between strings is calculated using a Jaro-Winkler distance.

Usage

gammaCKpar(matAp, matBp, n.cores, cut.a, cut.p, method, w)

Arguments

matAp

vector storing the comparison field in data set 1

matBp

vector storing the comparison field in data set 2

n.cores

Number of cores to parallelize over. Default is NULL.

cut.a

Lower bound for full match, ranging between 0 and 1. Default is 0.92

cut.p

Lower bound for partial match, ranging between 0 and 1. Default is 0.88

method

String distance method, options are: "jw" Jaro-Winkler (Default), "dl" Damerau-Levenshtein, "jaro" Jaro, and "lv" Edit

w

Parameter that describes the importance of the first characters of a string (only needed if method = "jw"). Default is .10

Value

gammaCKpar returns a list with the indices corresponding to each matching pattern, which can be fed directly into tableCounts and matchesLink.

Author(s)

Ted Enamorado <[email protected]>, Ben Fifield <[email protected]>, and Kosuke Imai

Examples

## Not run: 
g1 <- gammaCKpar(dfA$firstname, dfB$firstname)

## End(Not run)
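
How the cut-offs cut.a and cut.p turn a similarity score into the three agreement levels can be sketched in base R. This is an illustration, not the package's implementation: it uses a normalized Levenshtein similarity from utils::adist() in place of Jaro-Winkler, and it assumes that both cut-offs are inclusive lower bounds. The toy name vectors are hypothetical.

```r
## Hypothetical name vectors from two datasets.
vecA <- c("maria", "alexander", "john")
vecB <- c("maria", "alexandar", "peter")

cut.a <- 0.92  # lower bound for a full match
cut.p <- 0.88  # lower bound for a partial match

## Normalized Levenshtein similarity for every pair (rows = A, columns = B).
d <- utils::adist(vecA, vecB)
max_len <- outer(nchar(vecA), nchar(vecB), pmax)
sim <- 1 - d / max_len

## Agreement levels: 0 disagreement, 1 partial agreement, 2 agreement.
gamma <- matrix(0L, nrow = length(vecA), ncol = length(vecB))
gamma[sim >= cut.p] <- 1L
gamma[sim >= cut.a] <- 2L

gamma
```

In this toy input the pair "alexander"/"alexandar" has similarity 8/9, which falls between the two cut-offs and is therefore coded as a partial agreement.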

gammaKpar

Description

Field comparisons: 0 disagreement, 2 total agreement.

Usage

gammaKpar(matAp, matBp, gender, n.cores)

Arguments

matAp

vector storing the comparison field in data set 1

matBp

vector storing the comparison field in data set 2

gender

Whether the matching variable is gender. Will override standard warnings of missingness/nonvariability. Default is FALSE.

n.cores

Number of cores to parallelize over. Default is NULL.

Value

gammaKpar returns a list with the indices corresponding to each matching pattern, which can be fed directly into tableCounts and matchesLink.

Author(s)

Ted Enamorado <[email protected]>, Ben Fifield <[email protected]>, and Kosuke Imai

Examples

## Not run: 
g1 <- gammaKpar(dfA$birthyear, dfB$birthyear)

## End(Not run)

gammaNUMCK2par

Description

Field comparisons for numeric variables. Two possible agreement patterns are considered: 0 total disagreement, 2 agreement. The distance between numbers is calculated using their absolute distance.

Usage

gammaNUMCK2par(matAp, matBp, n.cores, cut.a)

Arguments

matAp

vector storing the comparison field in data set 1

matBp

vector storing the comparison field in data set 2

n.cores

Number of cores to parallelize over. Default is NULL.

cut.a

Lower bound for full match. Default is 1

Value

gammaNUMCK2par returns a list with the indices corresponding to each matching pattern, which can be fed directly into tableCounts and matchesLink.

Author(s)

Ted Enamorado <[email protected]>, Ben Fifield <[email protected]>, and Kosuke Imai

Examples

## Not run: 
g1 <- gammaNUMCK2par(dfA$birthyear, dfB$birthyear)

## End(Not run)

gammaNUMCKpar

Description

Field comparisons for numeric variables. Three possible agreement patterns are considered: 0 total disagreement, 1 partial agreement, 2 agreement. The distance between numbers is calculated using their absolute distance.

Usage

gammaNUMCKpar(matAp, matBp, n.cores, cut.a, cut.p)

Arguments

matAp

vector storing the comparison field in data set 1

matBp

vector storing the comparison field in data set 2

n.cores

Number of cores to parallelize over. Default is NULL.

cut.a

Lower bound for full match. Default is 1

cut.p

Lower bound for partial match. Default is 2

Value

gammaNUMCKpar returns a list with the indices corresponding to each matching pattern, which can be fed directly into tableCounts and matchesLink.

Author(s)

Ted Enamorado <[email protected]>, Ben Fifield <[email protected]>, and Kosuke Imai

Examples

## Not run: 
g1 <- gammaNUMCKpar(dfA$birthyear, dfB$birthyear)

## End(Not run)

getMatches

Description

Subset two data frames to the matches returned by fastLink() or matchesLink(). Can also return a single deduped data frame if dfA and dfB are identical and fl.out is of class 'fastLink.dedupe'.

Usage

getMatches(dfA, dfB, fl.out, threshold.match, combine.dfs)

Arguments

dfA

Dataset A - matched to Dataset B by fastLink().

dfB

Dataset B - matched to Dataset A by fastLink().

fl.out

Either the output from fastLink() or matchesLink().

threshold.match

A number between 0 and 1 indicating the lower bound that the user wants to declare a match. For instance, threshold.match = .85 will return all pairs with posterior probability greater than .85 as matches. Default is 0.85.

combine.dfs

Whether to combine the two data frames being merged into a single data frame. If FALSE, two data frames are returned in a list. Default is TRUE.

Value

getMatches() returns a single combined data frame if combine.dfs = TRUE (the default). If combine.dfs = FALSE, it returns a list of two data frames:

dfA.match

The subset of dfA containing the successful matches.

dfB.match

The subset of dfB containing the successful matches.

Author(s)

Ben Fifield <[email protected]>

Examples

## Not run: 
fl.out <- fastLink(dfA, dfB,
varnames = c("firstname", "lastname", "streetname", "birthyear"),
n.cores = 1)
ret <- getMatches(dfA, dfB, fl.out)

## End(Not run)

getPatterns

Description

Get the full matching patterns for all matched pairs in dataset A and dataset B

Usage

getPatterns(
  matchesA,
  matchesB,
  varnames,
  stringdist.match,
  numeric.match,
  partial.match,
  stringdist.method = "jw",
  cut.a = 0.92,
  cut.p = 0.88,
  jw.weight = 0.1,
  cut.a.num = 1,
  cut.p.num = 2.5
)

Arguments

matchesA

A dataframe of the matched observations in dataset A, with all variables used to inform the match.

matchesB

A dataframe of the matched observations in dataset B, with all variables used to inform the match.

varnames

A vector of variable names to use for matching. Must be present in both matchesA and matchesB.

stringdist.match

A vector of booleans, indicating whether to use string distance matching when determining matching patterns on each variable. Must be same length as varnames.

numeric.match

A vector of booleans, indicating whether to use numeric pairwise distance matching when determining matching patterns on each variable. Must be same length as varnames.

partial.match

A vector of booleans, indicating whether to include a partial matching category for the string distances. Must be same length as varnames. Default is FALSE for all variables.

stringdist.method

String distance method for calculating similarity, options are: "jw" Jaro-Winkler (Default), "jaro" Jaro, and "lv" Edit

cut.a

Lower bound for full string-distance match, ranging between 0 and 1. Default is 0.92

cut.p

Lower bound for partial string-distance match, ranging between 0 and 1. Default is 0.88

jw.weight

Parameter that describes the importance of the first characters of a string (only needed if stringdist.method = "jw"). Default is .10

cut.a.num

Lower bound for full numeric match. Default is 1

cut.p.num

Lower bound for partial numeric match. Default is 2.5

Value

getPatterns() returns a dataframe with a row for each matched pair, where each column indicates the matching pattern for each matching variable.

Author(s)

Ted Enamorado <[email protected]> and Ben Fifield <[email protected]>


getPosterior

Description

Get the posterior probability of a match for each matched pair of observations

Usage

getPosterior(matchesA, matchesB, EM, patterns)

Arguments

matchesA

A dataframe of the matched observations in dataset A, with all variables used to inform the match.

matchesB

A dataframe of the matched observations in dataset B, with all variables used to inform the match.

EM

The EM object from emlinkMARmov()

patterns

The output from getPatterns().

Value

getPosterior returns the posterior probability of a match for each matched pair of observations in matchesA and matchesB

Author(s)

Ben Fifield <[email protected]>


inspectEM

Description

Inspect EM objects to analyze successfully and unsuccessfully matched patterns.

Usage

inspectEM(object, posterior.range, digits)

Arguments

object

The output from either fastLink or emlinkMARmov.

posterior.range

The range of posterior probabilities to display. Default is c(0.85, 1).

digits

How many digits to include in inspectEM dataframe. Default is 3.

Value

inspectEM returns a data frame with information about patterns around the provided threshold.

Author(s)

Ben Fifield <[email protected]>


nameReweight

Description

Reweights posterior probabilities to account for observed frequency of names. Downweights posterior probability of match if first name is common, upweights if first name is uncommon.

Usage

nameReweight(dfA, dfB, EM, gammalist, matchesLink,
varnames, firstname.field, patterns, threshold.match, n.cores)

Arguments

dfA

The full version of dataset A that is being matched.

dfB

The full version of dataset B that is being matched.

EM

The EM object from emlinkMARmov()

gammalist

The list of gamma objects calculated on the full dataset that indicate matching patterns, which is fed into tableCounts() and matchesLink().

matchesLink

The output from matchesLink().

varnames

A vector of variable names to use for matching. Must be present in both dfA and dfB.

firstname.field

A vector of booleans, indicating whether each field indicates first name. TRUE if so, otherwise FALSE.

patterns

The output from getPatterns().

threshold.match

A number between 0 and 1 indicating either the lower bound (if only one number provided) or the range of certainty that the user wants to declare a match. For instance, threshold.match = .85 will return all pairs with posterior probability greater than .85 as matches, while threshold.match = c(.85, .95) will return all pairs with posterior probability between .85 and .95 as matches.

n.cores

Number of cores to parallelize over. Default is NULL.

Value

nameReweight() returns a list containing the following elements:

zetaA

The reweighted zeta estimates for each matched element in dataset A.

zetaB

The reweighted zeta estimates for each matched element in dataset B.

Author(s)

Ted Enamorado <[email protected]> and Ben Fifield <[email protected]>


preprocText

Description

Preprocess text data such as names and addresses.

Usage

preprocText(text, convert_text, tolower, soundex,
usps_address, remove_whitespace, remove_punctuation, convert_text_to)

Arguments

text

A vector of text data to convert.

convert_text

Whether to convert text to the desired encoding, where the encoding is specified in the 'convert_text_to' argument. Default is TRUE

tolower

Whether to normalize the text to be all lowercase. Default is TRUE.

soundex

Whether to convert the field to the Census's soundex encoding. Default is FALSE.

usps_address

Whether to use USPS address standardization rules to clean address fields. Default is FALSE.

remove_whitespace

Whether to remove leading and trailing whitespace, and to convert multiple spaces to a single space. Default is TRUE.

remove_punctuation

Whether to remove punctuation from a string. Default is TRUE.

convert_text_to

Which encoding to use when converting text. Default is 'Latin-ASCII'. Full list of encodings in the stri_trans_list() function in the stringi package.

Value

preprocText() returns the preprocessed vector of text.

Author(s)

Ben Fifield <[email protected]>
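
The default cleaning steps described above (lowercasing, punctuation removal, whitespace normalization) can be approximated in base R. The helper below is a hypothetical sketch, not preprocText() itself: it omits encoding conversion, soundex, and USPS address standardization, and its name clean_text is made up for illustration.

```r
## Hypothetical helper mirroring three of the documented options.
clean_text <- function(text) {
  out <- tolower(text)                   # tolower = TRUE
  out <- gsub("[[:punct:]]", "", out)    # remove_punctuation = TRUE
  out <- gsub("\\s+", " ", out)          # collapse runs of whitespace
  trimws(out)                            # drop leading/trailing whitespace
}

cleaned <- clean_text(c("  O'Neil,  John ", "SMITH-JONES"))
cleaned
```

Applying the same cleaning to both datasets before matching keeps trivial formatting differences from being counted as disagreements.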


print.inspectEM

Description

Print information from the EM algorithm to console.

Usage

## S3 method for class 'inspectEM'
print(x, ...)

Arguments

x

An inspectEM object

...

Further arguments to be passed to print.fastLink().


State-level FIPS Codes

Description

This data maps state names to FIPS codes for use in calculating prior movers rates.

Usage

statefips

Format

A dataframe containing 54 observations.


State-level inflow rates by state

Description

This data compiles and cleans state-level movers inflow rates by state, from the IRS Statistics on Income dataset.

Usage

stateinflow

Format

A dataframe containing 11321 observations.


In-state movers rates by state

Description

This data collects in-state movers rates by state, for imputation where within-county movers rates are not available.

Usage

statemove

Format

A dataframe containing 51 observations.


State-level outflow rates by state

Description

This data compiles and cleans state-level movers outflow rates by state, from the IRS Statistics on Income dataset.

Usage

stateoutflow

Format

A dataframe containing 11320 observations.


stringSubset

Description

Removes as candidate matches any observations with no close matches on string-distance measures.

Usage

stringSubset(vecA, vecB, similarity.threshold, stringdist.method,
jw.weight, n.cores)

Arguments

vecA

A character or factor vector from dataset A

vecB

A character or factor vector from dataset B

similarity.threshold

Lower bound on string-distance measure for being considered a possible match. If an observation has no possible matches above this threshold, it is discarded from the match. Default is 0.8.

stringdist.method

The method to use for calculating string-distance similarity. Possible values are 'jaro' (Jaro Distance), 'jw' (Jaro-Winkler), and 'lv' (Levenshtein). Default is 'jw'.

jw.weight

Parameter that describes the importance of the first characters of a string (only needed if stringdist.method = "jw"). Default is .10.

n.cores

Number of cores to parallelize over. Default is NULL.

Value

A list of length two, where the first entry is a vector of indices from dataset A and the second entry a vector of indices from dataset B to be included in the match.

Examples

## Not run: 
subset_out <- stringSubset(dfA$firstname, dfB$firstname, n.cores = 1)
fl_out <- fastLink(dfA[subset_out$dfA.block == 1,], dfB[subset_out$dfB.block == 1,],
varnames = c("firstname", "lastname", "streetname", "birthyear"), n.cores = 1)

## End(Not run)
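
The filtering rule described above (discard an observation when no candidate in the other dataset reaches the similarity threshold) can be sketched in base R. This is an illustration only: it substitutes a normalized Levenshtein similarity from utils::adist() for the package's string-distance measures, and the toy vectors and the names keep_a and keep_b are hypothetical.

```r
## Hypothetical name vectors from two datasets.
vecA <- c("maria", "alexander", "john")
vecB <- c("maria", "alexandar", "zed")
similarity.threshold <- 0.8

## Pairwise similarity (rows = A, columns = B).
d <- utils::adist(vecA, vecB)
sim <- 1 - d / outer(nchar(vecA), nchar(vecB), pmax)

## Keep an observation only if its best candidate reaches the threshold.
keep_a <- which(apply(sim, 1, max) >= similarity.threshold)
keep_b <- which(apply(sim, 2, max) >= similarity.threshold)

list(keep_a = keep_a, keep_b = keep_b)
```

Here "john" and "zed" have no close counterpart and would be dropped before the merge.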

tableCounts

Description

Count pairs with the same pattern in the cross product between two datasets.

Usage

tableCounts(gammalist, nobs.a, nobs.b, n.cores)

Arguments

gammalist

A list of objects produced by gammaKpar, gammaCK2par, or gammaCKpar.

nobs.a

number of observations in dataset 1

nobs.b

number of observations in dataset 2

n.cores

Number of cores to parallelize over. Default is NULL.

Value

tableCounts returns counts of all unique matching patterns, which can be fed directly into emlinkMARmov to get posterior matching probabilities for each unique pattern.

Author(s)

Ted Enamorado <[email protected]>, Ben Fifield <[email protected]>, and Kosuke Imai

Examples

## Not run: 
## Calculate gammas
g1 <- gammaCKpar(dfA$firstname, dfB$firstname)
g2 <- gammaCKpar(dfA$middlename, dfB$middlename)
g3 <- gammaCKpar(dfA$lastname, dfB$lastname)
g4 <- gammaKpar(dfA$birthyear, dfB$birthyear)

## Run tableCounts
tc <- tableCounts(list(g1, g2, g3, g4), nobs.a = nrow(dfA), nobs.b = nrow(dfB))

## End(Not run)
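
What counting agreement patterns over the cross product means can be shown on a toy input in base R. The sketch below is not the package's algorithm: it materializes every pair, which only works for small data, uses exact comparisons coded 0/2 for both fields, and all object names are hypothetical.

```r
## Toy data: 3 records in A, 2 records in B.
first_a <- c("ann", "bob", "cat"); first_b <- c("ann", "bob")
year_a  <- c(1980, 1990, 1980);    year_b  <- c(1980, 1991)

## Agreement matrices (rows = A, columns = B): 0 disagreement, 2 agreement.
g_first <- 2L * outer(first_a, first_b, "==")
g_year  <- 2L * outer(year_a, year_b, "==")

## One row per pair in the cross product, then count each unique pattern.
pairs <- data.frame(first = as.vector(g_first),
                    year  = as.vector(g_year),
                    n     = 1L)
counts <- aggregate(n ~ first + year, data = pairs, FUN = sum)

counts
```

The counts sum to nobs.a times nobs.b (here 3 x 2 = 6), since every pair falls into exactly one pattern.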