In this example we use the same dataset as in the "Blocking records for record linkage" vignette.
reclin2 package

The package contains the function pair_ann(), which is designed for
integration with the reclin2 package. The function works as
follows.
pair_ann(x = census[1:1000],
         y = cis[1:1000],
         on = c("pername1", "pername2", "sex", "dob_day", "dob_mon", "dob_year", "enumcap", "enumpc"),
         deduplication = FALSE) |>
  head()
#> First data set: 1 000 records
#> Second data set: 1 000 records
#> Total number of pairs: 6 pairs
#> Blocking on: 'pername1', 'pername2', 'sex', 'dob_day',
#> 'dob_mon', 'dob_year', 'enumcap', 'enumpc'
#>
#> .x .y block
#> <int> <int> <num>
#> 1: 204 1 1
#> 2: 204 375 1
#> 3: 204 391 1
#> 4: 204 405 1
#> 5: 204 424 1
#> 6: 204 484 1

The printed summary reports the total number of pairs. The result can
then be passed on to a reclin2 pipeline
(note that we use a different ANN algorithm this time).
pair_ann(x = census[1:1000],
         y = cis[1:1000],
         on = c("pername1", "pername2", "sex", "dob_day", "dob_mon", "dob_year", "enumcap", "enumpc"),
         deduplication = FALSE,
         ann = "hnsw") |>
  compare_pairs(on = c("pername1", "pername2", "sex", "dob_day", "dob_mon", "dob_year", "enumcap", "enumpc"),
                comparators = list(cmp_jarowinkler())) |>
  score_simple("score",
               on = c("pername1", "pername2", "sex", "dob_day", "dob_mon", "dob_year", "enumcap", "enumpc")) |>
  select_threshold("threshold", score = "score", threshold = 6) |>
  link(selection = "threshold") |>
  head()
#> Total number of pairs: 6 pairs
#>
#> Key: <.y>
#> .y .x person_id.x pername1.x pername2.x sex.x dob_day.x dob_mon.x
#> <int> <int> <char> <char> <char> <char> <char> <char>
#> 1: 11 945 DE256NG039003 HARRIET THOMSON F 12 1
#> 2: 71 427 DE159QA062001 LEWIS GREEN M 23 3
#> 3: 83 720 DE237GG025002 IMOGEN DARIS F 6 4
#> 4: 99 136 DE125LU022001 DANIEC MICCER M 21 4
#> 5: 154 949 DE256NG040002 CHLOE WILSON F 5 7
#> 6: 156 549 DE159QY035002 AVA KING F 7 7
#> dob_year.x hse_num enumcap.x enumpc.x str_nam
#> <char> <num> <char> <char> <char>
#> 1: 1995 39 39 SPRINGFIELD ROAD DE256NG Springfield Road
#> 2: 1973 62 62 CHURCH ROAD DE159QA Church Road
#> 3: 1968 25 25 WOODLANDS ROAD DE237GG Woodlands Road
#> 4: 1947 22 22 PARK LANE DE125LU Park Lane
#> 5: 1978 40 40 SPRINGFIELD ROAD DE256NG Springfield Road
#> 6: 1969 35 35 CHURCH ROAD DE159QY Church Road
#> cap_add census_id x person_id.y pername1.y
#> <char> <char> <int> <char> <char>
#> 1: 39, Springfield Road CENSDE256NG039003 945 DE256NG039003 HARRIET
#> 2: 62, Church Road CENSDE159QA062001 427 DE159QA062001 LEWIS
#> 3: 25, Woodlands Road CENSDE237GG025002 720 DE237GG025002 IMOGEW
#> 4: 22, Park Lane CENSDE125LU022001 136 DE125LU022001 DAMIEL
#> 5: 40, Springfield Road CENSDE256NG040002 949 DE256NG040002 CHLOE
#> 6: 35, Church Road CENSDE159QY035002 549 DE159QY035002 AVA
#> pername2.y sex.y dob_day.y dob_mon.y dob_year.y enumcap.y
#> <char> <char> <char> <char> <char> <char>
#> 1: THOMSON F 12 1 39 SPRINGFIELD ROAD
#> 2: GREEN M 23 3 62 CHURCH ROAD
#> 3: DAVIS F 6 4 25 WOODLANDS ROAD
#> 4: HILLER M 21 4 22 PARK LANE
#> 5: WILSOM F 5 7 40 SPRINGFIELD ROAD
#> 6: KING F 7 7 35 CHURCH ROAD
#> enumpc.y cis_id y
#> <char> <char> <int>
#> 1: DE256NG CISDE256NG039003 11
#> 2: DE159QA CISDE159QA062001 71
#> 3: DE237GG CISDE237GG025002 83
#> 4: DE125LU CISDE125LU022001 99
#> 5: DE256NG CISDE256NG040002 154
#> 6: DE159QY CISDE159QY035002 156

fastLink package

Use the block column as the blocking variable in the function
fastLink::blockData(). The result is a list
of blocked records for further processing.
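
The step above can be sketched as follows. This is only an illustration under stated assumptions: the data frames df_a and df_b and the pair table pair_tab are hypothetical toy objects, and .x and .y are assumed to be row indices into the two files, as in the pair_ann() output shown earlier. The call to fastLink::blockData() uses its dfA, dfB, varnames and n.cores arguments and runs only if fastLink is installed.

```r
# Toy data frames standing in for the two files (hypothetical values).
df_a <- data.frame(name = c("ANNA", "JOHN", "MARIA", "PETER", "EVA"),
                   stringsAsFactors = FALSE)
df_b <- data.frame(name = c("ANA", "JON", "MARYA", "PETR", "EWA"),
                   stringsAsFactors = FALSE)

# Hypothetical pair-level result: row index in df_a (.x),
# row index in df_b (.y) and the block identifier.
pair_tab <- data.frame(.x = 1:5, .y = 1:5, block = c(1, 1, 1, 2, 2))

# Copy the block identifier onto the records of each file.
df_a$block <- NA_real_
df_a$block[pair_tab$.x] <- pair_tab$block
df_b$block <- NA_real_
df_b$block[pair_tab$.y] <- pair_tab$block

# Exact blocking on the precomputed column (skipped if fastLink is missing).
if (requireNamespace("fastLink", quietly = TRUE)) {
  blocks <- fastLink::blockData(dfA = df_a, dfB = df_b,
                                varnames = "block", n.cores = 1)
  str(blocks[[1]])  # members of the first block in each file
}
```

Records that do not appear in any pair keep a missing block value in this sketch; how to treat them is left to the user.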
RecordLinkage package

Supply the block column through the argument
blockfld of the compare.dedup() or
compare.linkage() function. Please note that
for the RecordLinkage package the block column
should be stored as a character vector, not a
numeric or integer one.
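
A minimal sketch of this conversion, assuming two hypothetical toy data frames (df_a, df_b) that already carry a numeric block column. The blocking field is passed to compare.linkage() by column position here, which is one of the forms blockfld accepts; the call runs only if RecordLinkage is installed.

```r
# Toy data frames with a numeric block column (hypothetical values).
df_a <- data.frame(name = c("ANNA", "JOHN", "MARIA"),
                   block = c(1, 2, 2), stringsAsFactors = FALSE)
df_b <- data.frame(name = c("ANA", "JON", "MARYA"),
                   block = c(1, 2, 2), stringsAsFactors = FALSE)

# Store the block identifier as character, as noted above.
df_a$block <- as.character(df_a$block)
df_b$block <- as.character(df_b$block)

# Position of the block column, used as the blocking field.
block_col <- which(names(df_a) == "block")

# Compare only record pairs that share a block
# (skipped if RecordLinkage is missing).
if (requireNamespace("RecordLinkage", quietly = TRUE)) {
  rl_pairs <- RecordLinkage::compare.linkage(df_a, df_b,
                                             blockfld = block_col)
  print(rl_pairs$pairs)
}
```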