NB: the following is an advanced usage of deident. If you are just getting
started, we recommend looking at the other vignettes first.
While the deident package implements several methods for deidentification,
one of its key advantages is the ability to re-use and share methods across
data sets, thanks to the ‘stateful’ nature of its design.
If you wish to share a unit between different pipelines, the cleanest approach is to initialize the method of interest and then pass it into the first pipeline:
library(deident)
psu <- Pseudonymizer$new()
name_pipe <- starwars |>
  deident(psu, name)
apply_deident(starwars, name_pipe)
#> # A tibble: 87 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 v3MUe 172 77 blond fair blue 19 male mascu…
#> 2 7rHIx 167 75 <NA> gold yellow 112 none mascu…
#> 3 q5Vhs 96 32 <NA> white, bl… red 33 none mascu…
#> 4 KQz8x 202 136 none white yellow 41.9 male mascu…
#> 5 50zEr 150 49 brown light brown 19 fema… femin…
#> 6 PxvnO 178 120 brown, grey light blue 52 male mascu…
#> 7 riJWk 165 75 brown light blue 47 fema… femin…
#> 8 vpMZA 97 32 <NA> white, red red NA none mascu…
#> 9 4YeYM 183 84 black light brown 24 male mascu…
#> 10 OCtXW 182 77 auburn, white fair blue-gray 57 male mascu…
#> # ℹ 77 more rows
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
Having called apply_deident, the Pseudonymizer psu has learned encodings for
each string in starwars$name. If these strings appear a second time, they
will be replaced in the same way, and we can build a second pipeline using psu:
combined.frm <- data.frame(
  ID = c(head(starwars$name, 5), head(ShiftsWorked$Employee, 5))
)
reused_pipe <- combined.frm |>
  deident(psu, ID)
apply_deident(combined.frm, reused_pipe)
#> ID
#> 1 v3MUe
#> 2 7rHIx
#> 3 q5Vhs
#> 4 KQz8x
#> 5 50zEr
#> 6 2vEoX
#> 7 beMKE
#> 8 rpSge
#> 9 Zq1ja
#> 10 4Eo42
Since the first 5 rows of combined.frm$ID are the same as the first 5 rows of
starwars$name, the first 5 rows of each transformed data set are also the same.
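To make the idea of a ‘stateful’ method more concrete, the sketch below mimics the behaviour in base R. It is an illustration only and is not how deident implements Pseudonymizer: the helper make_encoder() and its code_length argument are hypothetical names introduced here. The returned function keeps a lookup table between calls, so a value it has already seen is always mapped to the same random code.

```r
# Toy stateful encoder (illustrative only; not deident's implementation).
# make_encoder() and code_length are hypothetical names for this sketch.
make_encoder <- function(code_length = 5) {
  lookup <- character(0)  # named vector: original value -> pseudonym

  function(x) {
    x <- as.character(x)
    # Only values that have not been encoded before need a new code.
    unseen <- setdiff(unique(x[!is.na(x)]), names(lookup))
    for (value in unseen) {
      repeat {
        code <- paste(
          sample(c(letters, LETTERS, 0:9), code_length, replace = TRUE),
          collapse = ""
        )
        if (!code %in% lookup) break  # retry if the code is already in use
      }
      lookup[[value]] <<- code  # update the table shared across calls
    }
    unname(lookup[x])
  }
}

encode <- make_encoder()
first  <- encode(c("Luke Skywalker", "C-3PO", "R2-D2"))
second <- encode(c("R2-D2", "Luke Skywalker", "Leia Organa"))

# Values seen in both calls receive the same pseudonym.
identical(first[c(3, 1)], second[1:2])
#> [1] TRUE
```

Because the lookup table lives inside encode rather than inside any one data set, the same object can be applied to as many inputs as needed, which is the property the pipelines above rely on when they share psu.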