Using the versioning package

This vignette introduces the versioning package, which aims to simplify management of project settings and file input/output by combining them in a single R object.

R data pipelines commonly require reading and writing data to versioned directories. Each directory might correspond to one step of a multi-step process, where that version corresponds to particular settings for that step and a chain of previous steps that each have their own respective versions. This package describes a Config (configuration) object that makes it easy to read and write versioned data, based on YAML configuration files loaded and saved to each versioned folder.

To get started, install and load the versioning package.

# install.packages('versioning')
library(versioning)

YAML is a natural format for storing project settings, since it can represent numeric, character, and logical settings as well as hierarchically-nested settings. We will use the ‘example_config.yaml’ file that comes with the versioning package for this example. The following code block prints the contents of the YAML file to screen:

example_config_fp <- system.file('extdata', 'example_config.yaml', package = 'versioning')

# Print the contents of the input YAML file
file_contents <- system(paste('cat', example_config_fp), intern = T)
message(paste(file_contents, collapse ='\n'))
#> a: 'foo'
#> b: ['bar', 'baz']
#> group_c:
#>   d: 1e5
#>   e: false
#> directories:
#>   raw_data:
#>     versioned: FALSE
#>     path: '~/versioning_test/raw_data'
#>     files:
#>       a: 'example_input_file.csv'
#>   prepared_data:
#>     versioned: TRUE
#>     path: '~/versioning_test/prepared_data'
#>     files:
#>       prepared_table: 'example_prepared_table.csv'
#>       summary_text: 'summary_of_rows.txt'
#> versions:
#>   prepared_data: 'v1'

We can load this YAML file by creating a new Config object. The only required argument when creating a Config object is config_list, which is either a nested R list of settings or (in our case) a filepath to a YAML file containing those settings.

The Config object stores all those settings internally in the config$config_list attribute. The full list of settings can always be viewed using print(config) or str(config$config_list).

# Load YAML file as a Config object
config <- versioning::Config$new(config_list = example_config_fp)

# Print the config file contents
print(config)
#> List of 5
#>  $ a          : chr "foo"
#>  $ b          : chr [1:2] "bar" "baz"
#>  $ group_c    :List of 2
#>   ..$ d: chr "1e5"
#>   ..$ e: logi FALSE
#>  $ directories:List of 2
#>   ..$ raw_data     :List of 3
#>   .. ..$ versioned: logi FALSE
#>   .. ..$ path     : chr "~/versioning_test/raw_data"
#>   .. ..$ files    :List of 1
#>   .. .. ..$ a: chr "example_input_file.csv"
#>   ..$ prepared_data:List of 3
#>   .. ..$ versioned: logi TRUE
#>   .. ..$ path     : chr "~/versioning_test/prepared_data"
#>   .. ..$ files    :List of 2
#>   .. .. ..$ prepared_table: chr "example_prepared_table.csv"
#>   .. .. ..$ summary_text  : chr "summary_of_rows.txt"
#>  $ versions   :List of 1
#>   ..$ prepared_data: chr "v1"

You can always access the list of settings directly by subsetting config$config_list like a normal list, but the Config$get() method is sometimes preferable. For example, if you want to retrieve the setting listed under “a”, config$get('a') is equivalent to config$config_list[['a']], but will throw an error if the setting “a” does not exist. You can also use the config$get() method for nested settings, as shown below:

# Retrieve some example settings from the config file
message("config$get('a') yields: ", config$get('a'))
#> config$get('a') yields: foo
message("config$get('b') yields: ", config$get('b'))
#> config$get('b') yields: barbaz
message("config$get('group_c', 'd') yields: ", config$get('group_c', 'd'))
#> config$get('group_c', 'd') yields: 1e5

# Update a setting
config$config_list$a <- 12345
message("config$get('a') has been updated and now yields: ", config$get('a'))
#> config$get('a') has been updated and now yields: 12345

There are two special sub-lists of the config_list, titled directories and versions, that can be handy for versioned R workflows with multiple steps. Each item in directories is structured with the following information:

Name of the sublist: how the directory is accessed from the config (in our example, “raw_data” or “prepared_data”)
versioned (logical): Does the directory have versioned sub-directories?
path (character): Path to the directory
files (list): Named list of files within the directory

In the example below, we’ll show a very simple workflow where data is originally placed in a “raw_data” directory, which is not versioned, and then some summaries are written to a “prepared_data” directory, which is versioned. This mimics some data science workflows where differences between data preparation methods and model results need to be tracked over time. For this example, we will use temporary directories for both:

# Update the raw_data and prepared_data directories to temporary directories for this
# example
config$config_list$directories$raw_data$path <- tempdir(check = T)
config$config_list$directories$prepared_data$path <- tempdir(check = T)

# Create directories
message(
  "Creating raw_data directory, which is not versioned: ",
  config$get_dir_path('raw_data')
)
#> Creating raw_data directory, which is not versioned: /tmp/RtmpjIwKuw
dir.create(config$get_dir_path('raw_data'), showWarnings = FALSE)

message(
  "Creating prepared_data directory, which is versioned: ",
  config$get_dir_path('prepared_data')
)
#> Creating prepared_data directory, which is versioned: /tmp/RtmpjIwKuw/v1
dir.create(config$get_dir_path('prepared_data'), showWarnings = FALSE)

# Copy the example input file to the raw data folder
file.copy(
  from = system.file('extdata', 'example_input_file.csv', package = 'versioning'),
  to = config$get_file_path(dir_name = 'raw_data', file_name = 'a')
)
#> [1] TRUE

As seen above, we can use the config$get_dir_path() to access directory paths and config$get_file_path() to access files within a directory. Note also that the path for the “prepared_data” folder ends with “v1”: this is because config$versions$prepared_data is currently set to “v1”. In a future run of this workflow, we could change the folder version by updating this setting.

We can also use the config$read() and config$write() functions to read and write files within these directories.

# Read that same table from file
df <- config$read(dir_name = 'raw_data', file_name = 'a')

# Write a prepared table and a summary to file
config$write(df, dir_name = 'prepared_data', file_name = 'prepared_table')
config$write(
  paste("The prepared table has", nrow(df), "rows and", ncol(df), "columns."),
  dir_name = 'prepared_data',
  file_name = 'summary_text'
)

# Both files should now appear in the "prepared_data" directory
list.files(config$get_dir_path('prepared_data')) 
#> [1] "example_prepared_table.csv" "summary_of_rows.txt"

These use the autoread() and autowrite() functions behind the scenes, and support any file extensions listed in get_file_reading_functions()/get_file_writing_functions().

message(
  "Supported file types for reading: ",
  paste(sort(names(versioning::get_file_reading_functions())), collapse = ', ')
)
#> Supported file types for reading: bz2, csv, dbf, dta, e00, fgb, gdb, geojson, geojsonseq, geotiff, gml, gpkg, gps, gpx, gtm, gxt, gz, jml, kml, map, mdb, nc, ods, osm, pbf, rda, rdata, rds, shp, sqlite, tif, tsv, txt, vdv, xls, xlsx, yaml, yml
message(
  "Supported file types for writing: ",
  paste(sort(names(versioning::get_file_writing_functions())), collapse = ', ')
)
#> Supported file types for writing: csv, fgb, geojson, geotiff, gml, gpkg, gps, gpx, gtm, gxt, jml, kml, map, nc, ods, rda, rdata, rds, shp, sqlite, tif, txt, vdv, yaml, yml

There is also a helper function, config$write_self(), that will write the current config to a specified directory as a config.yaml file. For example, the following code block writes the current config to the versioned “prepared_data” directory:

# Write the config object to the "prepared_data" directory
config$write_self(dir_name = 'prepared_data')
# The "prepared_data" directory should now include "config.yaml"
list.files(config$get_dir_path('prepared_data'))
#> [1] "config.yaml"                "example_prepared_table.csv"
#> [3] "summary_of_rows.txt"

While you can always update settings, versions, and file paths by changing the input YAML file, it is sometimes more convenient to update versions in code or through command line arguments passed to a script. In these cases, you can specify the versions argument when creating a new Config object. This argument will set or overwrite the particular versions listed, while keeping other versions unchanged. For example, the following code block loads the config, but changes (only) the “prepared_data” version to “v2”.

# Load a new custom config where the "prepared_data" version has been updated to "v2"
custom_versions <- list(prepared_data = 'v2')
config_v2 <- versioning::Config$new(
  config_list = example_config_fp,
  versions = custom_versions
)
print(config_v2$get_dir_path('prepared_data')) # Should now end in ".../v2"
#> [1] "~/versioning_test/prepared_data/v2"

For more information about using this package, see the documentation on the Config object: help(Config, package = 'versioning').