This vignette introduces the versioning package, which aims to simplify management of project settings and file input/output by combining them in a single R object.
R data pipelines commonly require reading and writing data to
versioned directories. Each directory might correspond to one step of a
multi-step process, where that version corresponds to particular
settings for that step and a chain of previous steps that each have
their own respective versions. This package provides a Config
(configuration) object that makes it easy to read and write versioned
data, based on YAML configuration files loaded and saved to each
versioned folder.
To get started, install and load the versioning package.
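A minimal setup sketch, assuming the package is installed from CRAN under the name versioning (adjust the install step if you obtained it from another source):

```r
# Install the package once if it is not already available (assumes a CRAN release)
if (!requireNamespace('versioning', quietly = TRUE)) {
  install.packages('versioning')
}
# Attach the package for the rest of the session
library(versioning)
```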
YAML is a natural format for storing project settings, since it can represent numeric, character, and logical settings as well as hierarchically-nested settings. We will use the ‘example_config.yaml’ file that comes with the versioning package for this example. The following code block prints the contents of the YAML file to screen:
example_config_fp <- system.file('extdata', 'example_config.yaml', package = 'versioning')
# Print the contents of the input YAML file (readLines works on any platform)
file_contents <- readLines(example_config_fp)
message(paste(file_contents, collapse = '\n'))
#> a: 'foo'
#> b: ['bar', 'baz']
#> group_c:
#>   d: 1e5
#>   e: false
#> directories:
#>   raw_data:
#>     versioned: FALSE
#>     path: '~/versioning_test/raw_data'
#>     files:
#>       a: 'example_input_file.csv'
#>   prepared_data:
#>     versioned: TRUE
#>     path: '~/versioning_test/prepared_data'
#>     files:
#>       prepared_table: 'example_prepared_table.csv'
#>       summary_text: 'summary_of_rows.txt'
#> versions:
#>   prepared_data: 'v1'
We can load this YAML file by creating a new Config object. The only
required argument when creating a Config object is config_list, which
is either a nested R list of settings or (in our case) a filepath to a
YAML file containing those settings. The Config object stores all those
settings internally in the config$config_list attribute. The full list
of settings can always be viewed using print(config) or
str(config$config_list).
# Load YAML file as a Config object
config <- versioning::Config$new(config_list = example_config_fp)
# Print the config file contents
print(config)
#> List of 5
#> $ a : chr "foo"
#> $ b : chr [1:2] "bar" "baz"
#> $ group_c :List of 2
#> ..$ d: chr "1e5"
#> ..$ e: logi FALSE
#> $ directories:List of 2
#> ..$ raw_data :List of 3
#> .. ..$ versioned: logi FALSE
#> .. ..$ path : chr "~/versioning_test/raw_data"
#> .. ..$ files :List of 1
#> .. .. ..$ a: chr "example_input_file.csv"
#> ..$ prepared_data:List of 3
#> .. ..$ versioned: logi TRUE
#> .. ..$ path : chr "~/versioning_test/prepared_data"
#> .. ..$ files :List of 2
#> .. .. ..$ prepared_table: chr "example_prepared_table.csv"
#> .. .. ..$ summary_text : chr "summary_of_rows.txt"
#> $ versions :List of 1
#> ..$ prepared_data: chr "v1"
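Because config_list may also be a nested R list, a Config can be built without any YAML file. The sketch below mirrors part of the example settings; the values and the temporary path are illustrative assumptions, not part of the packaged example.

```r
library(versioning)

# Illustrative settings mirroring part of the example YAML file
settings <- list(
  a = 'foo',
  group_c = list(d = 1e5, e = FALSE),
  directories = list(
    prepared_data = list(
      versioned = TRUE,
      path = file.path(tempdir(), 'prepared_data'),
      files = list(prepared_table = 'example_prepared_table.csv')
    )
  ),
  versions = list(prepared_data = 'v1')
)

# Pass the list directly instead of a YAML file path
config_from_list <- Config$new(config_list = settings)
str(config_from_list$config_list$group_c)
```

Note that in the printed output above, the YAML value 1e5 was loaded as the character string "1e5". A value written as 1e5 in an R list is an ordinary numeric, so the two routes can differ in type for settings like this one.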
You can always access the list of settings directly by subsetting
config$config_list like a normal list, but the Config$get() method is
sometimes preferable. For example, if you want to retrieve the setting
listed under “a”, config$get('a') is equivalent to
config$config_list[['a']], but will throw an error if the setting “a”
does not exist. You can also use the config$get() method for nested
settings, as shown below:
# Retrieve some example settings from the config file
message("config$get('a') yields: ", config$get('a'))
#> config$get('a') yields: foo
message("config$get('b') yields: ", config$get('b'))
#> config$get('b') yields: barbaz
message("config$get('group_c', 'd') yields: ", config$get('group_c', 'd'))
#> config$get('group_c', 'd') yields: 1e5
# Update a setting
config$config_list$a <- 12345
message("config$get('a') has been updated and now yields: ", config$get('a'))
#> config$get('a') has been updated and now yields: 12345
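To illustrate the difference in how missing settings are handled, the following sketch requests a setting that does not exist. The name 'not_a_setting' is a made-up example, and the exact wording of the error message is not assumed here.

```r
library(versioning)

config <- Config$new(
  config_list = system.file('extdata', 'example_config.yaml', package = 'versioning')
)

# Direct subsetting silently returns NULL for a setting that does not exist
missing_direct <- config$config_list[['not_a_setting']]
is.null(missing_direct)

# get() raises an error instead; capture it here so the script can continue
missing_get <- tryCatch(
  config$get('not_a_setting'),
  error = function(e) paste('Error caught:', conditionMessage(e))
)
message(missing_get)
```

Failing early in this way can make a misspelled setting name easier to spot than a NULL that only causes problems later in the pipeline.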
There are two special sub-lists of the config_list, titled directories
and versions, that can be handy for versioned R workflows with multiple
steps. Each item in directories is structured with the following
information:
- versioned (logical): Does the directory have versioned sub-directories?
- path (character): Path to the directory
- files (list): Named list of files within the directory
In the example below, we’ll show a very simple workflow where data is originally placed in a “raw_data” directory, which is not versioned, and then some summaries are written to a “prepared_data” directory, which is versioned. This mimics some data science workflows where differences between data preparation methods and model results need to be tracked over time. For this example, we will use temporary directories for both:
# Update the raw_data and prepared_data directories to temporary directories for this
# example
config$config_list$directories$raw_data$path <- tempdir(check = TRUE)
config$config_list$directories$prepared_data$path <- tempdir(check = TRUE)
# Create directories
message(
  "Creating raw_data directory, which is not versioned: ",
  config$get_dir_path('raw_data')
)
#> Creating raw_data directory, which is not versioned: /tmp/RtmpTWLUYm
dir.create(config$get_dir_path('raw_data'), showWarnings = FALSE)
message(
  "Creating prepared_data directory, which is versioned: ",
  config$get_dir_path('prepared_data')
)
#> Creating prepared_data directory, which is versioned: /tmp/RtmpTWLUYm/v1
dir.create(config$get_dir_path('prepared_data'), showWarnings = FALSE)
# Copy the example input file to the raw data folder
file.copy(
  from = system.file('extdata', 'example_input_file.csv', package = 'versioning'),
  to = config$get_file_path(dir_name = 'raw_data', file_name = 'a')
)
#> [1] TRUE
As seen above, we can use config$get_dir_path() to access directory
paths and config$get_file_path() to access files within a directory.
Note also that the path for the “prepared_data” folder ends with “v1”:
this is because the “prepared_data” entry under versions, which you can
retrieve with config$get('versions', 'prepared_data'), is currently set
to “v1”. In a future run of this workflow, we could change the folder
version by updating this setting.
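The path behavior described above can be sketched in base R. The helper resolve_dir_path() below is hypothetical and is not the package's implementation; it only mimics the observed rule that a versioned directory resolves to its base path plus the matching entry in versions.

```r
# Hypothetical helper (not part of the versioning package) that mimics how a
# directory path is resolved from the 'directories' and 'versions' settings
resolve_dir_path <- function(config_list, dir_name) {
  dir_info <- config_list$directories[[dir_name]]
  if (is.null(dir_info)) stop("Unknown directory: ", dir_name)
  if (isTRUE(dir_info$versioned)) {
    version <- config_list$versions[[dir_name]]
    if (is.null(version)) stop("No version set for directory: ", dir_name)
    return(file.path(dir_info$path, version))
  }
  dir_info$path
}

settings <- list(
  directories = list(
    raw_data = list(versioned = FALSE, path = '~/versioning_test/raw_data'),
    prepared_data = list(versioned = TRUE, path = '~/versioning_test/prepared_data')
  ),
  versions = list(prepared_data = 'v1')
)

raw_path <- resolve_dir_path(settings, 'raw_data')
v1_path <- resolve_dir_path(settings, 'prepared_data')

# Changing the version setting changes the resolved folder
settings$versions$prepared_data <- 'v2'
v2_path <- resolve_dir_path(settings, 'prepared_data')
print(c(raw_path, v1_path, v2_path))
```

Keeping the version in one setting means that every read and write for a step follows a version change without any path being edited by hand.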
We can also use the config$read() and config$write() methods to read
and write files within these directories.
# Read that same table from file
df <- config$read(dir_name = 'raw_data', file_name = 'a')
# Write a prepared table and a summary to file
config$write(df, dir_name = 'prepared_data', file_name = 'prepared_table')
config$write(
  paste("The prepared table has", nrow(df), "rows and", ncol(df), "columns."),
  dir_name = 'prepared_data',
  file_name = 'summary_text'
)
# Both files should now appear in the "prepared_data" directory
list.files(config$get_dir_path('prepared_data'))
#> [1] "example_prepared_table.csv" "summary_of_rows.txt"
These methods use the autoread() and autowrite() functions behind the
scenes, and support any file extensions listed in
get_file_reading_functions() / get_file_writing_functions().
message(
  "Supported file types for reading: ",
  paste(sort(names(versioning::get_file_reading_functions())), collapse = ', ')
)
#> Supported file types for reading: csv, dbf, dta, geojson, geotiff, rda, rdata, rds, shp, tif, txt, yaml, yml
message(
  "Supported file types for writing: ",
  paste(sort(names(versioning::get_file_writing_functions())), collapse = ', ')
)
#> Supported file types for writing: csv, geojson, geotiff, rda, rdata, rds, shp, tif, txt, yaml, yml
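The general idea of choosing a reader from the file extension can be sketched in base R as follows. The readers list and read_by_extension() are hypothetical stand-ins written for illustration; they are not the package's internals, which are exposed through get_file_reading_functions().

```r
# Hypothetical registry mapping file extensions to reader functions; the real
# package exposes its own list via get_file_reading_functions()
readers <- list(
  csv = function(file, ...) utils::read.csv(file, ...),
  txt = function(file, ...) readLines(file, ...),
  rds = function(file, ...) readRDS(file, ...)
)

# Choose a reader based on the lower-cased file extension
read_by_extension <- function(file, ...) {
  ext <- tolower(tools::file_ext(file))
  reader <- readers[[ext]]
  if (is.null(reader)) stop("Unsupported file extension: '", ext, "'")
  reader(file, ...)
}

# Write a small table to a temporary CSV file, then read it back
example_fp <- tempfile(fileext = '.csv')
utils::write.csv(
  data.frame(x = 1:3, y = c('a', 'b', 'c')),
  example_fp,
  row.names = FALSE
)
example_df <- read_by_extension(example_fp)
dim(example_df)
```

With this design, supporting a new file type only requires adding one named entry to the registry.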
There is also a helper function, config$write_self(), that will write
the current config to a specified directory as a config.yaml file. For
example, the following code block writes the current config to the
versioned “prepared_data” directory:
# Write the config object to the "prepared_data" directory
config$write_self(dir_name = 'prepared_data')
# The "prepared_data" directory should now include "config.yaml"
list.files(config$get_dir_path('prepared_data'))
#> [1] "config.yaml" "example_prepared_table.csv"
#> [3] "summary_of_rows.txt"
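Since Config$new() accepts a path to a YAML file, the saved config.yaml can in principle be loaded again later to recover the settings used for a given version. The following is a sketch under that assumption, using a temporary directory in place of a real project folder:

```r
library(versioning)

config <- Config$new(
  config_list = system.file('extdata', 'example_config.yaml', package = 'versioning')
)
# Point the versioned directory at a temporary location for this sketch
config$config_list$directories$prepared_data$path <- tempdir(check = TRUE)
dir.create(config$get_dir_path('prepared_data'), showWarnings = FALSE)
config$write_self(dir_name = 'prepared_data')

# Reload the saved settings from the versioned folder
saved_config_fp <- file.path(config$get_dir_path('prepared_data'), 'config.yaml')
config_reloaded <- Config$new(config_list = saved_config_fp)
config_reloaded$get('versions', 'prepared_data')
```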
While you can always update settings, versions, and file paths by
changing the input YAML file, it is sometimes more convenient to update
versions in code or through command line arguments passed to a script.
In these cases, you can specify the versions argument when
creating a new Config object. This argument will set or overwrite the
particular versions listed, while keeping other versions unchanged. For
example, the following code block loads the config, but changes (only)
the “prepared_data” version to “v2”.
# Load a new custom config where the "prepared_data" version has been updated to "v2"
custom_versions <- list(prepared_data = 'v2')
config_v2 <- versioning::Config$new(
  config_list = example_config_fp,
  versions = custom_versions
)
print(config_v2$get_dir_path('prepared_data')) # Should now end in ".../v2"
#> [1] "~/versioning_test/prepared_data/v2"
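For the command line case, one possible approach is to pass versions as name=version pairs and convert them into the named list expected by the versions argument. The helper parse_version_args() and the name=version convention are assumptions made for this sketch; the package does not prescribe an argument format.

```r
# Hypothetical helper: turn arguments such as "prepared_data=v2" into a named list
parse_version_args <- function(args) {
  parts <- strsplit(args, '=', fixed = TRUE)
  is_pair <- lengths(parts) == 2
  if (!all(is_pair)) {
    stop(
      "Arguments must look like name=version: ",
      paste(args[!is_pair], collapse = ', ')
    )
  }
  versions <- lapply(parts, function(p) p[2])
  names(versions) <- vapply(parts, function(p) p[1], character(1))
  versions
}

# In a real script, use: cli_args <- commandArgs(trailingOnly = TRUE)
cli_args <- c('prepared_data=v2')  # stand-in for real command line arguments
custom_versions <- parse_version_args(cli_args)
str(custom_versions)

# The result can then be passed on, for example:
# config <- versioning::Config$new(
#   config_list = example_config_fp,
#   versions = custom_versions
# )
```

A script written this way could be run as Rscript my_script.R prepared_data=v2, where my_script.R is a placeholder name.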
For more information about using this package, see the documentation
on the Config object: help(Config, package = 'versioning').