--- title: "Documenting Datasets" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Documenting Datasets} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup, message=FALSE, warning=FALSE} library(qtkit) library(fs) library(tibble) library(dplyr) library(glue) library(readr) ``` # Introduction Proper dataset documentation is crucial for reproducible research and effective data sharing. The {qtkit} package provides two main functions to help standardize and automate the documentation process: - `create_data_origin()`: Creates standardized metadata about data(set) sources - `create_data_dictionary()`: Generates the scaffolding for a detailed variable-level documentation or can use AI to generate descriptions to be reviewed and updated as necessary # Creating Dataset Origin Documentation ## Basic Usage Let's start with documenting the built-in `mtcars` dataset: ```{r} # Create a temporary file for our documentation origin_file <- file_temp(ext = "csv") # Create the origin documentation template origin_doc <- create_data_origin( file_path = origin_file, return = TRUE ) # View the template origin_doc |> glimpse() ``` The template provides fields for essential metadata. You can either open the CSV file in a spreadsheet editor or fill it out programmatically, as shown below. Here's how you might fill it out for `mtcars`: ```{r} origin_doc |> mutate(description = c( "Motor Trend Car Road Tests", "Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.", "US automobile market, passenger vehicles", "1973-74", "Built-in R dataset (.rda)", "Single data frame with 32 observations of 11 variables", "Public Domain", "Citation: Henderson and Velleman (1981)" )) |> write_csv(origin_file) ``` ## Customizing Origin Documentation You can force overwrite existing documentation: ```{r} create_data_origin( file_path = origin_file, force = TRUE ) ``` # Creating Data Dictionaries ## Basic Dictionary Creation Create a basic data dictionary without AI assistance: ```{r} # Create a temporary file for our dictionary dict_file <- file_temp(ext = "csv") # Generate dictionary for iris dataset iris_dict <- create_data_dictionary( data = iris, file_path = dict_file ) # View the results iris_dict |> glimpse() ``` ## AI-Enhanced Data Dictionaries If you have an OpenAI API key, you can generate more detailed descriptions: ```{r, eval=FALSE} # Not run - requires API key Sys.setenv(OPENAI_API_KEY = "your-api-key") iris_dict_ai <- create_data_dictionary( data = iris, file_path = dict_file, model = "gpt-4", sample_n = 5 ) ``` Example output might look like: ```{r echo=FALSE} # Simulated AI output tibble( variable = c("Sepal.Length", "Sepal.Width"), name = c("Sepal Length", "Sepal Width"), type = c("numeric", "numeric"), description = c( "Length of the sepal in centimeters", "Width of the sepal in centimeters" ) ) ``` ## Working with Larger Datasets For larger datasets, you can use sampling and grouping: ```{r, eval=FALSE} diamonds_dict <- diamonds |> create_data_dictionary( file_path = "diamonds_dict.csv", model = "gpt-4", sample_n = 3, grouping = "cut" # Sample across different cut categories ) ``` ## Best Practices 1. Create documentation when first obtaining/creating a dataset 2. Update documentation when: - Adding new variables - Modifying data structure - Changing data sources 3. Store documentation alongside data in version control 4. Include documentation paths in your project README # Conclusion The {qtkit} package provides flexible tools for standardizing dataset documentation. By combining `create_data_origin()` and `create_data_dictionary()`, you can create comprehensive documentation that enhances reproducibility and data sharing. ## Additional Resources - Package documentation: `help(package = "qtkit")` - Related packages: {dataMaid}, {codebook} - [Project homepage](https://github.com/qtalr/qtkit)