---
title: "Using the versioning package"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using the versioning package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This vignette introduces the **versioning** package, which aims to simplify management of project settings and file input/output by combining them in a single R object.

R data pipelines commonly require reading and writing data to versioned directories. Each directory might correspond to one step of a multi-step process, where each version corresponds to particular settings for that step and a chain of previous steps that each have their own respective versions. This package provides a `Config` (configuration) object that makes it easy to read and write versioned data, based on YAML configuration files loaded and saved to each versioned folder.

To get started, install and load the **versioning** package.

```{r setup}
# install.packages('versioning')
library(versioning)
```

YAML is a natural format for storing project settings, since it can represent numeric, character, and logical settings as well as hierarchically-nested settings. We will use the 'example_config.yaml' file that comes with the **versioning** package for this example. The following code block prints the contents of the YAML file to screen:

```{r show-config}
example_config_fp <- system.file('extdata', 'example_config.yaml', package = 'versioning')

# Print the contents of the input YAML file
# (readLines() is used instead of a system call to `cat` so this runs on any platform)
file_contents <- readLines(example_config_fp)
message(paste(file_contents, collapse = '\n'))
```

We can load this YAML file by creating a new `Config` object. The only required argument when creating a Config object is `config_list`, which is either a nested R list of settings or (in our case) a filepath to a YAML file containing those settings.
The Config object stores all those settings internally in the `config$config_list` attribute. The full list of settings can always be viewed using `print(config)` or `str(config$config_list)`.

```{r load-config}
# Load YAML file as a Config object
config <- versioning::Config$new(config_list = example_config_fp)

# Print the config file contents
print(config)
```

You can always access the list of settings directly by subsetting `config$config_list` like a normal list, but the `config$get()` method is sometimes preferable. For example, if you want to retrieve the setting listed under "a", `config$get('a')` is equivalent to `config$config_list[['a']]`, but will throw an error if the setting "a" does not exist. You can also use the `config$get()` method for nested settings, as shown below:

```{r retrieve-settings}
# Retrieve some example settings from the config file
message("config$get('a') yields: ", config$get('a'))
message("config$get('b') yields: ", config$get('b'))
message("config$get('group_c', 'd') yields: ", config$get('group_c', 'd'))

# Update a setting
config$config_list$a <- 12345
message("config$get('a') has been updated and now yields: ", config$get('a'))
```

There are two special sub-lists of the `config_list`, titled `directories` and `versions`, that can be handy for versioned R workflows with multiple steps. Each item in `directories` is structured with the following information:

1. Name of the sublist: how the directory is accessed from the config (in our example, "raw_data" or "prepared_data")
2. `versioned` (logical): Does the directory have versioned sub-directories?
3. `path` (character): Path to the directory
4. `files` (list): Named list of files within the directory

In the example below, we'll show a very simple workflow where data is originally placed in a "raw_data" directory, which is not versioned, and then some summaries are written to a "prepared_data" directory, which is versioned.
This mimics some data science workflows where differences between data preparation methods and model results need to be tracked over time. For this example, we will use temporary directories for both:

```{r get-directories}
# Update the raw_data and prepared_data directories to temporary directories for
# this example
config$config_list$directories$raw_data$path <- tempdir(check = TRUE)
config$config_list$directories$prepared_data$path <- tempdir(check = TRUE)

# Create directories
message(
  "Creating raw_data directory, which is not versioned: ",
  config$get_dir_path('raw_data')
)
dir.create(config$get_dir_path('raw_data'), showWarnings = FALSE)

message(
  "Creating prepared_data directory, which is versioned: ",
  config$get_dir_path('prepared_data')
)
dir.create(config$get_dir_path('prepared_data'), showWarnings = FALSE)

# Copy the example input file to the raw data folder
file.copy(
  from = system.file('extdata', 'example_input_file.csv', package = 'versioning'),
  to = config$get_file_path(dir_name = 'raw_data', file_name = 'a')
)
```

As seen above, we can use `config$get_dir_path()` to access directory paths and `config$get_file_path()` to access files within a directory. Note also that the path for the "prepared_data" folder ends with "v1": this is because `config$config_list$versions$prepared_data` is currently set to "v1". In a future run of this workflow, we could change the folder version by updating this setting.

We can also use the `config$read()` and `config$write()` functions to read and write files within these directories.

```{r read-write-files}
# Read that same table from file
df <- config$read(dir_name = 'raw_data', file_name = 'a')

# Write a prepared table and a summary to file
config$write(df, dir_name = 'prepared_data', file_name = 'prepared_table')
config$write(
  paste("The prepared table has", nrow(df), "rows and", ncol(df), "columns."),
  dir_name = 'prepared_data',
  file_name = 'summary_text'
)

# Both files should now appear in the "prepared_data" directory
list.files(config$get_dir_path('prepared_data'))
```

These use the `autoread()` and `autowrite()` functions behind the scenes, and support any file extensions listed in `get_file_reading_functions()`/`get_file_writing_functions()`.

```{r get-supported-extensions}
message(
  "Supported file types for reading: ",
  paste(sort(names(versioning::get_file_reading_functions())), collapse = ', ')
)
message(
  "Supported file types for writing: ",
  paste(sort(names(versioning::get_file_writing_functions())), collapse = ', ')
)
```

There is also a helper function, `config$write_self()`, that will write the current config to a specified directory as a `config.yaml` file. For example, the following code block writes the current config to the versioned "prepared_data" directory:

```{r write-self}
# Write the config object to the "prepared_data" directory
config$write_self(dir_name = 'prepared_data')

# The "prepared_data" directory should now include "config.yaml"
list.files(config$get_dir_path('prepared_data'))
```

While you can always update settings, versions, and file paths by changing the input YAML file, it is sometimes more convenient to update versions in code or through command line arguments passed to a script. In these cases, you can specify the `versions` argument when creating a new Config object. This argument will set or overwrite the particular versions listed, while keeping other versions unchanged. For example, the following code block loads the config, but changes (only) the "prepared_data" version to "v2".

```{r update-versions}
# Load a new custom config where the "prepared_data" version has been updated to "v2"
custom_versions <- list(prepared_data = 'v2')
config_v2 <- versioning::Config$new(
  config_list = example_config_fp,
  versions = custom_versions
)
print(config_v2$get_dir_path('prepared_data')) # Should now end in ".../v2"
```

For more information about using this package, see the documentation on the `Config` object: `help(Config, package = 'versioning')`.
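
As a rough illustration of the command line case mentioned earlier, the sketch below turns arguments of the form `name=version` into a list suitable for the `versions` argument. The helper `parse_version_args()` and the `name=version` argument format are assumptions made for this example and are not part of the **versioning** package; in a real script, the arguments would typically come from `commandArgs(trailingOnly = TRUE)`.

```{r versions-from-args}
# Hypothetical helper (not part of the versioning package): convert command line
# arguments of the form "name=version" into a named list of versions
parse_version_args <- function(args) {
  # Keep only arguments that look like "name=version"
  args <- grep('^[^=]+=.+$', args, value = TRUE)
  if (length(args) == 0) return(list())
  arg_names <- sub('=.*$', '', args)
  arg_values <- sub('^[^=]+=', '', args)
  as.list(stats::setNames(arg_values, arg_names))
}

# Simulate a call such as: Rscript my_script.R prepared_data=v3
cli_versions <- parse_version_args(c('prepared_data=v3'))
str(cli_versions)

# Pass the parsed list to the `versions` argument, as in the "v2" example
if (requireNamespace('versioning', quietly = TRUE)) {
  config_cli <- versioning::Config$new(
    config_list = system.file('extdata', 'example_config.yaml', package = 'versioning'),
    versions = cli_versions
  )
  print(config_cli$get_dir_path('prepared_data'))
}
```

Arguments that do not match the `name=version` pattern are ignored, so other flags passed to the script would not affect the versions list under this scheme.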