Title: | Persistent Data Anonymization Pipeline |
---|---|
Description: | A framework for the replicable removal of personally identifiable data (PID) in data sets. The package implements a suite of methods to suit different data types based on the suggestions of Garfinkel (2015) <doi:10.6028/NIST.IR.8053> and the ICO "Guidelines on Anonymization" (2012) <https://ico.org.uk/media/1061/anonymisation-code.pdf>. |
Authors: | Robert Cook [aut, cre] , Md Assaduzaman [aut] , Sarahjane Jones [aut] |
Maintainer: | Robert Cook <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.0 |
Built: | 2024-11-25 14:56:45 UTC |
Source: | CRAN |
Function factory to apply white noise to a vector proportional to the spread of the data
adaptive_noise(sd.ratio = 1/10)
sd.ratio |
the level of noise to apply relative to the vector's standard deviation. |
a function
f <- adaptive_noise(0.2)
f(1:10)
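To make the idea of a noise "function factory" concrete, here is a minimal base-R sketch. It assumes additive Gaussian noise whose standard deviation is sd.ratio times the standard deviation of the input; the name adaptive_noise_sketch is hypothetical and this is not the package source.

```r
# Hypothetical re-implementation for illustration; not the package source.
# Assumption: noise is additive and Gaussian.
adaptive_noise_sketch <- function(sd.ratio = 1/10) {
  # Return a single-argument function that can be applied to any vector.
  function(x) {
    # Scale the noise to the spread of the input vector.
    x + rnorm(length(x), mean = 0, sd = sd.ratio * sd(x))
  }
}

set.seed(42)
f <- adaptive_noise_sketch(0.2)
noisy <- f(1:10)
noisy
```

Because the noise scales with sd(x), the same factory can be reused on variables measured in very different units.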
add_blur() adds a blurring step to a transformation pipeline (NB: intended for categorical data). When run as a transformation, values are recoded to a lower cardinality as defined by blur.
add_blur(object, ..., blur = c())
object |
Either a |
... |
variables to be transformed. |
blur |
a key-value pair such that 'key' is replaced by 'value' on transformation. |
A 'DeidentList' representing the untrained transformation pipeline. The object contains fields:
deident_methods: a list of each step in the pipeline (consisting of variables and method)
and methods:
mutate: apply the pipeline to a new data set
to_yaml: serialize the pipeline to a '.yml' file
category_blur() is provided to aid in defining the blur.
.blur <- category_blur(ShiftsWorked$Shift, `Working` = "Day|Night")
pipe.blur <- add_blur(ShiftsWorked, `Shift`, blur = .blur)
pipe.blur$mutate(ShiftsWorked)
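The key-value recoding described for blur can be sketched in base R as follows. The helper name blur_sketch and the toy data are hypothetical, and leaving unmatched values unchanged is an assumption rather than documented package behaviour.

```r
# Hypothetical helper; not part of the package.
# Assumption: values without an entry in `blur` are left unchanged.
blur_sketch <- function(keys, blur) {
  keys <- as.character(keys)
  idx <- match(keys, names(blur))
  out <- keys
  hit <- !is.na(idx)
  # Replace each matched key with its mapped (lower-cardinality) value.
  out[hit] <- as.character(unlist(blur[idx[hit]], use.names = FALSE))
  out
}

shifts <- c("Day", "Night", "Rest", "Day")
blur <- list(Day = "Working", Night = "Working")
blurred <- blur_sketch(shifts, blur)
blurred
```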
add_encrypt() adds an encryption step to a transformation pipeline. When run as a transformation, each specified variable is replaced by the output of a cryptographic hashing function that depends on the hash_key and seed set.
add_encrypt(object, ..., hash_key = "", seed = NA)
object |
Either a |
... |
variables to be transformed. |
hash_key |
a random alphanumeric key to control encryption |
seed |
a random alphanumeric string to concatenate to the value being encrypted |
A 'DeidentList' representing the untrained transformation pipeline. The object contains fields:
deident_methods: a list of each step in the pipeline (consisting of variables and method)
and methods:
mutate: apply the pipeline to a new data set
to_yaml: serialize the pipeline to a '.yml' file
# Basic usage; without setting a `hash_key` or `seed` encryption is poor.
pipe.encrypt <- add_encrypt(ShiftsWorked, Employee)
pipe.encrypt$mutate(ShiftsWorked)
# Once set the encryption is more secure assuming `hash_key` and `seed` are
# not exposed.
pipe.encrypt.secure <- add_encrypt(ShiftsWorked, Employee, hash_key="hash1", seed="Seed2")
pipe.encrypt.secure$mutate(ShiftsWorked)
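The sketch below illustrates why a secret key and seed matter: the same input always hashes to the same output, but the output changes when the secret material changes. It uses only base R (tools::md5sum on a temporary file). The helper name hash_sketch, the use of MD5, and the simple concatenation of hash_key, value, and seed are all assumptions made for illustration and may differ from what the package does.

```r
# Hypothetical helper; not part of the package.
# Assumptions: MD5 as the hash, and plain concatenation of key, value and seed.
hash_sketch <- function(keys, hash_key = "", seed = "") {
  vapply(as.character(keys), function(key) {
    tmp <- tempfile()
    on.exit(unlink(tmp))
    # Write the salted value as raw bytes so no newline is appended.
    writeBin(charToRaw(paste0(hash_key, key, seed)), tmp)
    unname(tools::md5sum(tmp))
  }, character(1), USE.NAMES = FALSE)
}

plain  <- hash_sketch(c("Ann", "Bob", "Ann"))
salted <- hash_sketch(c("Ann", "Bob", "Ann"), hash_key = "hash1", seed = "Seed2")
plain[1] == plain[3]    # repeated values map to the same hash
plain[1] == salted[1]   # the secret material changes the output
```

Without a key or seed, anyone can hash a list of candidate names and compare, which is why the unsalted form offers little protection.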
add_group() allows for the injection of aggregation into the transformation pipeline. Should you need to apply a transformation under aggregation (e.g. add_shuffle), this helper creates a grouped data.frame as would be done with dplyr::group_by().
The function add_ungroup() is supplied to perform the inverse operation.
add_group(object, ...)
add_ungroup(object, ...)
object |
Either a |
... |
Variables on which data is to be grouped. |
A 'DeidentList' representing the untrained transformation pipeline. The object contains fields:
deident_methods: a list of each step in the pipeline (consisting of variables and method)
and methods:
mutate: apply the pipeline to a new data set
to_yaml: serialize the pipeline to a '.yml' file
pipe.grouped <- add_group(ShiftsWorked, Date, Shift)
pipe.grouped_shuffle <- add_shuffle(pipe.grouped, `Daily Pay`)
add_ungroup(pipe.grouped_shuffle, `Daily Pay`)
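What "shuffling under aggregation" does to the data can be sketched in base R with ave(), which applies a function within each group. The toy data frame and its column names are made up for this sketch, and it does not use the package itself.

```r
# Illustrative data; column names are made up for this sketch.
df <- data.frame(
  shift = c("Day", "Day", "Day", "Night", "Night", "Rest"),
  pay   = c(100, 120, 140, 200, 220, 0)
)

set.seed(1)
# Permute `pay` within each `shift` group. Index-based sampling avoids the
# sample(x) pitfall when a group holds a single number.
df$pay_shuffled <- ave(df$pay, df$shift,
                       FUN = function(v) v[sample.int(length(v))])

# Group-level summaries are unchanged by the within-group permutation.
tapply(df$pay, df$shift, mean)
tapply(df$pay_shuffled, df$shift, mean)
```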
add_numeric_blur() adds a blurring step to a transformation pipeline (NB: intended for numeric data). When run as a transformation, the data is split into the intervals [-Inf, cut.1), [cut.1, cut.2), ..., [cut.n, Inf], where cuts = c(cut.1, cut.2, ..., cut.n).
add_numeric_blur(object, ..., cuts = 0)
object |
Either a |
... |
variables to be transformed. |
cuts |
The positions at which the data is to be divided. |
A 'DeidentList' representing the untrained transformation pipeline. The object contains fields:
deident_methods: a list of each step in the pipeline (consisting of variables and method)
and methods:
mutate: apply the pipeline to a new data set
to_yaml: serialize the pipeline to a '.yml' file
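The interval scheme described for cuts can be reproduced in base R with cut(). This is a sketch under the assumption that the intervals are left-closed, as written in the description; numeric_blur_sketch is a hypothetical name and the labels the package produces may differ.

```r
# Hypothetical helper; not part of the package.
numeric_blur_sketch <- function(x, cuts = 0) {
  # right = FALSE gives left-closed intervals [a, b);
  # include.lowest = TRUE closes the final interval at Inf.
  cut(x, breaks = c(-Inf, sort(cuts), Inf), right = FALSE, include.lowest = TRUE)
}

pay <- c(5, 30, 45, 60, 100)
blurred <- numeric_blur_sketch(pay, cuts = c(30, 60))
blurred
as.integer(blurred)  # index of the interval each value falls into
```

A value equal to a cut point falls into the interval that starts at that cut, so 30 lands in [30, 60) rather than [-Inf, 30).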
add_perturb() adds a perturbation step to a transformation pipeline (NB: intended for numeric data). When run as a transformation, each specified variable is transformed by the noise function.
add_perturb(object, ..., noise = adaptive_noise(0.1))
object |
Either a |
... |
variables to be transformed. |
noise |
a single-argument function that applies randomness. |
A 'DeidentList' representing the untrained transformation pipeline. The object contains fields:
deident_methods: a list of each step in the pipeline (consisting of variables and method)
and methods:
mutate: apply the pipeline to a new data set
to_yaml: serialize the pipeline to a '.yml' file
adaptive_noise(), white_noise(), and lognorm_noise()
pipe.perturb <- add_perturb(ShiftsWorked, `Daily Pay`)
pipe.perturb$mutate(ShiftsWorked)
pipe.perturb.white_noise <- add_perturb(ShiftsWorked, `Daily Pay`, noise=white_noise(0.1))
pipe.perturb.white_noise$mutate(ShiftsWorked)
pipe.perturb.noisy_adaptive <- add_perturb(ShiftsWorked, `Daily Pay`, noise=adaptive_noise(1))
pipe.perturb.noisy_adaptive$mutate(ShiftsWorked)
add_pseudonymize() adds a pseudonymization step to a transformation pipeline. When run as a transformation, terms that have not been seen before are given a new random alphanumeric string, while terms that have been previously transformed reuse the same string.
add_pseudonymize(object, ..., lookup = list())
object |
Either a |
... |
variables to be transformed. |
lookup |
a pre-existing name-value pair to define intended pseudonymizations. Instances of 'name' will be replaced with 'value' on transformation. |
A 'DeidentList' representing the untrained transformation pipeline. The object contains fields:
deident_methods: a list of each step in the pipeline (consisting of variables and method)
and methods:
mutate: apply the pipeline to a new data set
to_yaml: serialize the pipeline to a '.yml' file
# Basic usage;
pipe.pseudonymize <- add_pseudonymize(ShiftsWorked, Employee)
pipe.pseudonymize$mutate(ShiftsWorked)
pipe.pseudonymize2 <- add_pseudonymize(ShiftsWorked, Employee, lookup=list("Kyle Wilson" = "Kyle"))
pipe.pseudonymize2$mutate(ShiftsWorked)
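The "new string for unseen terms, same string for seen terms" behaviour can be sketched in base R with a growing lookup list. The helper pseudonymize_sketch, its n_char argument, and the choice to return the updated lookup alongside the values are hypothetical; the package keeps its lookup inside the pipeline object instead.

```r
# Hypothetical helper; not part of the package.
pseudonymize_sketch <- function(keys, lookup = list(), n_char = 5) {
  keys <- as.character(keys)
  for (key in unique(keys)) {
    if (is.null(lookup[[key]])) {
      # Unseen term: draw a fresh random alphanumeric string.
      # (A sketch only: collisions between random strings are not checked.)
      lookup[[key]] <- paste(sample(c(letters, LETTERS, 0:9), n_char, replace = TRUE),
                             collapse = "")
    }
  }
  list(values = unlist(lookup[keys], use.names = FALSE), lookup = lookup)
}

set.seed(7)
first  <- pseudonymize_sketch(c("Ann", "Bob", "Ann"), lookup = list(Bob = "B1"))
second <- pseudonymize_sketch(c("Ann", "Bob", "Ann"), lookup = first$lookup)
first$values
identical(first$values, second$values)  # the transformation is repeatable
```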
add_shuffle() adds a shuffling step to a transformation pipeline. When run as a transformation, each specified variable is randomly sampled without replacement so that summary metrics on a single variable are unchanged, but inter-variable metrics are rendered spurious.
add_shuffle(object, ..., limit = 0)
object |
Either a |
... |
variables to be transformed. |
limit |
integer - the minimum number of observations a variable needs to have for shuffling to be performed. If the variable has length less than limit, the values are replaced with NAs. |
A 'DeidentList' representing the untrained transformation pipeline. The object contains fields:
deident_methods: a list of each step in the pipeline (consisting of variables and method)
and methods:
mutate: apply the pipeline to a new data set
to_yaml: serialize the pipeline to a '.yml' file
add_group() for usage under aggregation
# Basic usage;
pipe.shuffle <- add_shuffle(ShiftsWorked, Employee)
pipe.shuffle$mutate(ShiftsWorked)
pipe.shuffle.limit <- add_shuffle(ShiftsWorked, Employee, limit=1)
pipe.shuffle.limit$mutate(ShiftsWorked)
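A minimal base-R sketch of shuffling with a minimum-length guard follows. The name shuffle_sketch is hypothetical; replacing short vectors with NA follows the description of the limit field elsewhere in this manual, but the details of the package's behaviour may differ.

```r
# Hypothetical helper; not part of the package.
shuffle_sketch <- function(x, limit = 0) {
  if (length(x) < limit) {
    # Too few observations for shuffling to obscure anything: blank the values.
    return(rep(NA, length(x)))
  }
  # A random permutation, i.e. sampling without replacement.
  x[sample.int(length(x))]
}

set.seed(3)
shuffled <- shuffle_sketch(c("A", "B", "C", "D"))
too_short <- shuffle_sketch(c("A", "B"), limit = 5)
shuffled
too_short
```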
Applies a pipeline as defined by deident to a data frame, tibble, or file.
apply_deident(object, deident, ...)
object |
The data to be deidentified |
deident |
A deidentification pipeline to be used. |
... |
Terms to be passed to other methods |
Apply a 'deident' pipeline to a new data frame
apply_to_data_frame(data, transformer, ...)
data |
The data set to be converted |
transformer |
The pipeline to be used |
... |
To be passed on to other methods |
Create new Deidentifier object
Setter for 'method' field
Save 'Deidentifier' to serialized object.
Apply 'method' to a vector of values
Apply 'method' to variables in a data frame
Apply 'mutate' method to an aggregated data frame.
Aggregate a data frame and apply 'mutate' to each.
Convert self to a list
String representation of self
Check if parameters are in allowed fields
method |
New function to be used as the method. |
location |
File path to save to. |
keys |
Vector of values to be processed |
force |
Perform transformation on all variables even if some given are not in the data. |
grouped_data |
a 'grouped_df' object |
data |
A data frame to be manipulated |
grp_cols |
Vector of variables in 'data' to group on. |
mutate_cols |
Vector of variables in 'data' to transform. |
type |
character vector describing the object. Defaults to class. |
... |
Options to check exist |
method: Function to call for data transform.
Convert self to a list.
blur |
Look-up list to define aggregation. |
keys |
Vector of values to be processed |
... |
Values to be concatenated to keys |
'Blurring' refers to the aggregation of data, e.g. converting city to country, or post code to IMD. The level of blurring is defined by the list given at initialization, which maps key to value, e.g. list(London = "England", Paris = "France").
Blurer
Apply blur to a vector of values
blur
List of aggregations to be applied.
Create new Blurer object
Utility for producing 'blur'
category_blur(vec, ...)
vec |
The vector of values to be used |
... |
|
Create a deident pipeline
create_deident(method, ...)
method |
A deidentifier to initialize. |
... |
list of variables to be deidentified. NB: keyword arguments will be passed to method at initialization. |
deident() creates a transformation pipeline of 'deidentifiers' for the repeated application of anonymization transformations.
deident(data, deidentifier, ...)
data |
A data frame, existing pipeline, or a 'deidentifier' (as either initialized object, class generator, or character string) |
deidentifier |
A 'deidentifier' (as either initialized object, class generator, or character string) to be appended to the current pipeline |
... |
Positional arguments are variables of 'data' to be transformed and keyword arguments are passed to 'deidentifier' at creation |
A 'DeidentList' representing the untrained transformation pipeline. The object contains fields:
deident_methods: a list of each step in the pipeline (consisting of variables and method)
and methods:
mutate: apply the pipeline to a new data set
to_yaml: serialize the pipeline to a '.yml' file
pipe <- deident(ShiftsWorked, Pseudonymizer, Employee)
print(pipe)
apply_deident(ShiftsWorked, pipe)
Apply a deident pipeline to a set of files and save them back to disk
deident_job_from_folder(deident_pipeline, data_dir, result_dir = "Deident_results")
deident_pipeline |
The deident list to be used. |
data_dir |
a path to the files to be transformed. |
result_dir |
a path to where files are to be saved. |
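The general shape of such a folder job (read each file, transform it, write the result under the same name elsewhere) can be sketched in base R. This is not the package implementation: job_from_folder_sketch is a hypothetical name, CSV input is an assumption, and a plain function stands in for the deident pipeline.

```r
# Hypothetical stand-in; not the package implementation.
# Assumptions: input files are CSV, and `transform` is any function that
# takes a data frame and returns a data frame.
job_from_folder_sketch <- function(transform, data_dir,
                                   result_dir = "Deident_results") {
  dir.create(result_dir, showWarnings = FALSE, recursive = TRUE)
  files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
  for (path in files) {
    data <- read.csv(path, stringsAsFactors = FALSE)
    # Save each transformed file under its original name in result_dir.
    write.csv(transform(data), file.path(result_dir, basename(path)),
              row.names = FALSE)
  }
  invisible(file.path(result_dir, basename(files)))
}

# Toy run in temporary folders: drop the identifying column from every file.
in_dir  <- file.path(tempdir(), "raw")
out_dir <- file.path(tempdir(), "deidentified")
dir.create(in_dir, showWarnings = FALSE)
write.csv(data.frame(Employee = c("Ann", "Bob"), Pay = c(100, 120)),
          file.path(in_dir, "shifts.csv"), row.names = FALSE)
written <- job_from_folder_sketch(function(d) d[setdiff(names(d), "Employee")],
                                  in_dir, out_dir)
read.csv(written[1])
```

With a real pipeline, the transform could plausibly be a wrapper such as function(d) pipe$mutate(d), using the mutate method documented above.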
A Deident class dealing with the exclusion of variables.
Create new Encrypter object
Convert self to a list.
hash_key |
An alphanumeric key for use in encryption |
seed |
An alphanumeric string which is concatenated to the value to hinder brute-force attacks |
keys |
Vector of values to be processed |
... |
Values to be concatenated to keys |
'Encrypting' refers to the cryptographic hashing of data e.g. md5 checksum. Encryption is more powerful if a random hash and seed are supplied and kept secret.
Encrypter
Apply encryption to a vector of values
hash_key
Alpha-numeric secret key for encryption
seed
String for concatenation to raw value
Restore a serialized deident from file
from_yaml(path)
path |
Path to serialized deident. |
deident <- deident(ShiftsWorked, Pseudonymizer, Employee)
.tempfile <- tempfile(fileext = ".yml")
deident$to_yaml(.tempfile)
deident.yaml <- from_yaml(.tempfile)
deident.yaml$mutate(ShiftsWorked)
Convert self to a list.
Character representation of the class
limit |
Minimum number of rows required to shuffle data |
data |
A data frame to be manipulated |
... |
Vector of variables in 'data' to transform. |
'Shuffling' refers to the random sampling of a variable without replacement, e.g. [A, B, C] becoming [B, A, C] but not [A, A, B]. "Grouped shuffling" refers to aggregating the data by another feature before applying the shuffling process. Grouped shuffling will preserve aggregate-level metrics (e.g. mean, median, mode) but removes ordinal properties, i.e. correlations and auto-correlations.
group_on
Symbolic representation of grouping variables
limit
Minimum number of rows required to shuffle data
Create new GroupedShuffler object
Function factory to apply log-normal noise to a vector
lognorm_noise(sd = 0.1)
sd |
the standard deviation of noise to apply. |
a function
f <- lognorm_noise(1)
f(1:10)
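One common way to apply log-normal noise is multiplicatively, which keeps positive values positive. The sketch below assumes that form, with sd as the standard deviation on the log scale; lognorm_noise_sketch is a hypothetical name and the package may define its noise differently.

```r
# Hypothetical re-implementation for illustration; not the package source.
# Assumption: noise is multiplicative, with `sd` on the log scale.
lognorm_noise_sketch <- function(sd = 0.1) {
  function(x) {
    # Each value is scaled by a factor drawn from a log-normal distribution.
    x * rlnorm(length(x), meanlog = 0, sdlog = sd)
  }
}

set.seed(11)
g <- lognorm_noise_sketch(1)
scaled <- g(1:10)
scaled
```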
A Deident class dealing with the addition of random noise to a numeric variable.
Create new Perturber object
Apply noise to a vector of values
Convert self to a list.
Character representation of the class
noise |
a single-argument function that applies randomness. |
keys |
Vector of values to be processed |
... |
Values to be concatenated to keys |
noise.str
character representation of noise
method
random noise function
pert <- Perturber$new()
pert$transform(1:10)
A Deident class dealing with the (repeatable) random replacement of strings for deidentification.
Create new Pseudonymizer object
Check if a key exists in lookup
Check if a key exists in lookup
Retrieve a value from lookup
Returns self$lookup formatted as a tibble
Convert self to a list.
Apply the deidentification method to the supplied keys
lookup |
a pre-existing name-value pair to define intended pseudonymizations. Instances of 'name' will be replaced with 'value' on transformation. |
keys |
value to be checked |
... |
values to concatenate to keys |
parse_numerics |
True: Force columns to characters. NB: only character vectors will be parsed. |
lookup
list of mapping from key-value on transform.
A synthetic data set intended to demonstrate the design and application of a deidentification pipeline. Employee names are entirely fictitious and constructed from the "FiveThirtyEight Most Common Name Dataset".
ShiftsWorked
A data frame with 3,100 rows and 6 columns:
Table primary key (integer)
Name of listed employee
The date being considered
The shift-type done by employee on date. One of 'Day', 'Night' or 'Rest'.
Shift start time (missing if on 'Rest' shift)
Shift end time (missing if on 'Rest' shift)
Shift end time (missing if on 'Rest' shift)
Create new Shuffler object
Update minimum vector size for shuffling
Apply the deidentification method to the supplied keys
Convert self to a list.
method |
[optional] A function representing the method of re-sampling to be used. By default uses exhaustive sampling without replacement. |
keys |
Value(s) to be transformed. |
... |
Value(s) to concatenate to keys |
limit |
integer - the minimum number of observations a variable needs to have for shuffling to be performed. If the variable has length less than limit, the values are replaced with NAs. |
'Shuffling' refers to the random sampling of a variable without replacement, e.g. [A, B, C] becoming [B, A, C] but not [A, A, B]. Shuffling will preserve top-level metrics (e.g. mean, median, mode) but removes ordinal properties, i.e. correlations and auto-correlations.
limit
minimum vector length to be shuffled. If vector to be transformed has length < limit, the data is replaced with NAs
The original data, from SWAPI, the Star Wars API, https://swapi.py4e.com/, has been revised
to reflect additional research into gender and sex determinations of characters. NB: taken from dplyr
starwars
A tibble with 87 rows and 14 variables:
Name of the character
Height (cm)
Weight (kg)
Hair, skin, and eye colors
Year born (BBY = Before Battle of Yavin)
The biological sex of the character, namely male, female, hermaphroditic, or none (as in the case for Droids).
The gender role or gender identity of the character as determined by their personality or the way they were programmed (as in the case for Droids).
Name of homeworld
Name of species
List of films the character appeared in
List of vehicles the character has piloted
List of starships the character has piloted
starwars
Function factory to apply white noise to a vector
white_noise(sd = 0.1)
sd |
the standard deviation of noise to apply. |
a function
f <- white_noise(1)
f(1:10)