transformations

library(deident)

Transformations

Out of the box, deident features a set of transformations to aid in the de-identification of data sets. Each transformation is implemented via R6Class and extends BaseDeident. User defined transformations can be implemented in a similar manner.

To demonstrate the different transformation we supply a toy data set, df, comprising 26 observations of three variables:

  • A: character, a to z
  • B: numeric, 1 to 26
  • C: character, X if B <= 13, Y if B > 13

Psudonymizer

Apply a cached random replacement cipher. Re-occurrence of the same key will receive the same hash.

Implemented deident options:

deident(df, "psudonymize", A)
deident(df, "Pseudonymizer", A)
deident(df, Pseudonymizer, A)
deident(df, Pseudonymizer$new(), A)

psu <- Pseudonymizer$new()
deident(df, psu, A)

Options

By default Pseudonymizer replaces values in variables with a random alpha-numeric string of 5 characters. This can be replaced via calling set_method on an instantiated Pseudonymizer with the desired function:

psu <- Pseudonymizer$new()

new_method <- function(key, ...){
  paste(sample(letters, 12, T), collapse="")
}

psu$set_method(new_method)

deident(df, psu, A)
#> DeidentList
#>    1 step(s) implemented 
#>    Step 1 : 'Pseudonymizer' on variable(s) A 
#> For data:
#>    columns: A, B, C

The first argument to the method receives the key to be transformed.

Shuffler

Implemented deident options:

deident(df, "shuffle", A)
deident(df, "Shuffler", A)
deident(df, Shuffler, A)
deident(df, Shuffler$new(), A)

shuffle <- Shuffler$new()
deident(df, shuffle, A)

Encrypter

Apply cryptographic hashing to a variable.

Implemented deident options:

deident(df, "encrypt", A)
deident(df, "Encrypter", A)
deident(df, Encrypter, A)
deident(df, Encrypter$new(), A)

encrypt <- Encrypter$new()
deident(df, encrypt, A)

Options

At initialization, Encrypter can be given hash_key and seed values to control the cryptographic encryption. It is recommended users set these values and do not disclose them.

encrypt <- Encrypter$new(hash_key="deident_hash_key_123", seed=202)
deident(df, encrypt, A)
#> DeidentList
#>    1 step(s) implemented 
#>    Step 1 : 'Encrypter' on variable(s) A 
#> For data:
#>    columns: A, B, C

Perturber

Apply Gaussian white noise to a numeric variable.

Implemented deident options:

deident(df, "perturb", A)
deident(df, "Perturber", A)
deident(df, Perturber, A)
deident(df, Perturber$new(), A)

perturb <- Perturber$new()
deident(df, perturb, A)

Options

At initialization, Perturber can be given a scale for the white noise via the sd argument.

# perturb <- Perturber$new(noise=adaptive_noise(0.2))
# deident(df, perturb, B)

Blurer

Aggregate categorical values dependent on a user supplied list. the list must be supplied to Blur at initialization.

Implemented deident options:

letter_blur <- c(rep("Early", 13), rep("Late", 13))
names(letter_blur) <- letters

blur <- Blurer$new(blur = letter_blur)
deident(df, blur, A)

NumericBlurer

Aggregate numeric values dependent on a user supplied vector of breaks/ cuts. If no vector is supplied NumericBlurer defaults to a binary classification about 0.

Implemented deident options:

deident(df, "numeric_blur", B)
deident(df, "NumericBlurer", B)
deident(df, NumericBlurer, B)
deident(df, NumericBlurer$new(), B)

numeric_blur <- NumericBlurer$new()
deident(df, numeric_blur, B)

Options

At initialization NumericBlurer takes an argument cuts to define the limits of each interval.

numeric_blur <- NumericBlurer$new(cuts=c(5, 10, 15, 20))
deident(df, numeric_blur, B)
#> DeidentList
#>    1 step(s) implemented 
#>    Step 1 : 'NumericBlurer' on variable(s) B 
#> For data:
#>    columns: A, B, C

GroupedShuffler

Apply Shuffler to a data set having first grouped the data on column(s). The grouping needs to be defined at initialization.

Implemented deident options:

grouped_shuffle <- GroupedShuffler$new(C)
deident(df, grouped_shuffle, B)

Options

At initialization GroupedShuffler takes an argument limit such that if any aggregated sub group has fewer than limit observations all values are dropped.

numeric_blur <- GroupedShuffler$new(C, limit=1)
deident(df, numeric_blur, B)
#> DeidentList
#>    1 step(s) implemented 
#>    Step 1 : 'GroupedShuffler(group_on=C)' on variable(s) B 
#> For data:
#>    columns: A, B, C

Drop

Define a column to be removed from the pipeline.

Implemented deident options:


deident(df, Drop, B)

drop <- deident:::Drop$new()
deident(df, drop, B)