| Title: | Data Checker for Validating Data Frames Against Defined Schema |
|---|---|
| Description: | Validates data frames against a defined schema. Produces a report of the checks performed and any issues found, with index and entry value where appropriate. Backend checks are performed using pointblank (Richard Iannone et al., 2025) <doi:10.32614/CRAN.package.pointblank>. |
| Authors: | Crown Copyright [cph], Analysis Standards and Pipelines Team (ONS) [cre, aut] |
| Maintainer: | Analysis Standards and Pipelines Team (ONS) <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 2.0.0 |
| Built: | 2026-06-08 22:30:59 UTC |
| Source: | https://github.com/cran/data.checker |

This function adds a custom check to the Validator object.

add_check(validator, description, condition)

| Argument | Description |
|---|---|
| validator | A Validator object. |
| description | A description of the custom check. |
| condition | An expression or logical condition that defines the custom check. Optional if outcome is set. |

The updated Validator object with the custom check added.

This function adds custom check outcomes to the validator log. The outcomes must be a logical vector.

add_check_custom(validator, description, outcome, type = c("error", "warning", "note"))

| Argument | Description |
|---|---|
| validator | A Validator object. |
| description | A description of the custom check. |
| outcome | Logical vector indicating the result of the check (TRUE/FALSE). |
| type | The type of the check: "error", "warning", or "note". |

The updated Validator object with the custom check added.

This function adds a new entry to the validator's QA log with details such as a description, type of entry, timestamp, pass status, and failing IDs.

add_qa_entry(validator, description, failing_ids, outcome = NA, entry_type = c("info", "warning", "error"))

| Argument | Description |
|---|---|
| validator | A Validator object. |
| description | A character string describing the QA entry. |
| failing_ids | Optional: A vector of IDs that failed the QA check. If more than 10 IDs are provided, only the first 10 are stored, with a note indicating the additional count. |
| outcome | Optional: A logical value indicating whether the QA check passed. If not provided or invalid, defaults to NA. |
| entry_type | Optional: A character string specifying the type of entry. Must be one of "info", "warning", or "error". Defaults to "info". |

The updated validator object with the new entry appended to its QA log.
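
The truncation rule for failing_ids can be illustrated with a small base R sketch. The helper name `summarise_failing_ids` and the wording of the note are assumptions made for illustration; the package's own implementation may differ.

```r
# Hypothetical helper: keep at most `max_ids` failing IDs and
# append a note giving the number of IDs that were dropped.
summarise_failing_ids <- function(failing_ids, max_ids = 10) {
  n <- length(failing_ids)
  if (n <= max_ids) {
    return(as.character(failing_ids))
  }
  c(
    as.character(failing_ids[seq_len(max_ids)]),
    paste0("and ", n - max_ids, " more")  # assumed wording of the note
  )
}

stored <- summarise_failing_ids(1:25)
length(stored)  # 11: ten IDs plus one note
```

Storing a capped list keeps the log readable when a check fails on many rows, while the note preserves the total count.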

This function runs the full suite of validation checks on a Validator object.

check(validator, ...)

| Argument | Description |
|---|---|
| validator | An object of class Validator. |
| ... | Additional arguments (currently unused). |

The validated Validator object if all checks pass. If any check fails, an error is thrown.

```r
library(data.checker)

# create schema
schema <- list(
  check_duplicates = FALSE,
  check_completeness = FALSE,
  columns = list(
    age = list(type = "double", optional = FALSE),
    sex = list(type = "character", optional = FALSE)
  )
)

# create dataframe
df <- data.frame(
  age = c(10, 11, 13, 15, 22, 34, 80),
  sex = c("M", "F", "M", "F", "M", "F", "M")
)

# create validator object
validator <- new_validator(
  data = df,
  schema = schema
)

# validate the data
validator <- check(validator)
```

This function validates data against a given schema, performs checks, and exports the validation results to a specified file in a given format.

check_and_export(data, schema, file, format, hard_check = FALSE, backseries = NULL, name = deparse(substitute(data)))

| Argument | Description |
|---|---|
| data | The data to be validated. |
| schema | The schema to validate against. |
| file | The file path where the validation results will be exported. |
| format | The format in which the validation results will be exported. |
| hard_check | Logical. Optional, FALSE by default. If TRUE, raises an error if there are any failed checks; otherwise, raises a warning. |
| backseries | Optional: A previous version of the data to check against. |
| name | Optional: The validator name. Defaults to the name of the data frame object supplied to data. Must be a single character string. |

The exported validation results.

```r
library(data.checker)

# create schema
schema <- list(
  check_duplicates = FALSE,
  check_completeness = FALSE,
  columns = list(
    age = list(type = "double", optional = FALSE),
    sex = list(type = "character", optional = FALSE)
  )
)

# create dataframe
df <- data.frame(
  age = c(10, 11, 13, 15, 22, 34, 80),
  sex = c("M", "F", "M", "F", "M", "F", "M")
)

# validate and export log
# file.path() builds a path that works on any operating system
check_and_export(
  data = df,
  schema = schema,
  file = file.path(tempdir(), "validation_results_example.html"),
  format = "html",
  hard_check = TRUE
)
```

Checks if the latest data is consistent with previous data.

check_backseries(validator)

| Argument | Description |
|---|---|
| validator | A Validator object containing the schema and agent. |

The updated Validator object with outcomes logged.

This function performs checks on the column names of a Validator object to ensure they follow specific naming conventions and meet schema conditions.

check_colnames(validator)

| Argument | Description |
|---|---|
| validator | A Validator object. |

The function performs the following checks on the column names:

- Ensures column names do not contain spaces.
- Ensures column names do not contain symbols other than underscores.
- Ensures column names do not contain uppercase letters.

For each check, a QA entry is added to the Validator object with details about the check, whether it passed, and the IDs of failing columns (if any).

The updated Validator object with QA entries for each check.
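
The three naming checks listed above can be sketched with base R regular expressions. This is only an illustration of the rules as described; the patterns and the example column names are assumptions, not the package's internal code.

```r
col_names <- c("age", "Sex", "post code", "income_gbp", "rate%")

# Names containing at least one whitespace character
has_space <- grepl("[[:space:]]", col_names)

# Names containing a character that is not a letter, digit,
# underscore, or whitespace (whitespace is reported by the check above)
has_symbol <- grepl("[^[:alnum:]_[:space:]]", col_names)

# Names containing at least one uppercase letter
has_upper <- grepl("[[:upper:]]", col_names)

col_names[has_space]   # "post code"
col_names[has_symbol]  # "rate%"
col_names[has_upper]   # "Sex"
```

The failing names returned by each subset correspond to the failing column IDs that would be recorded in a QA entry.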

This function performs checks on the columns of Validator$data to ensure they meet the specified schema conditions and checks.

check_column_contents(validator)

| Argument | Description |
|---|---|
| validator | A Validator object. |

The updated Validator object with QA entries for each check.

Checks the dataset for missing columns.

check_completeness(validator)

| Argument | Description |
|---|---|
| validator | A Validator object. |

The updated validator object with new log entries appended.

Checks for duplicate rows. A subset of columns can be used for the duplicate check if duplicates_cols is specified in the schema; otherwise, all columns are used.

check_duplicates(validator)

| Argument | Description |
|---|---|
| validator | A Validator object. |

The updated validator object with new log entries appended.
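
The difference between checking all columns and checking a subset can be shown with base R's `duplicated()`. The helper `duplicate_rows` and the example data are hypothetical and only illustrate the idea behind duplicates_cols.

```r
df <- data.frame(
  id  = c(1, 2, 2, 3),
  sex = c("M", "F", "F", "M"),
  age = c(10, 11, 12, 13),
  stringsAsFactors = FALSE
)

# Hypothetical helper: indices of rows that repeat an earlier row,
# compared on `cols` if given, otherwise on all columns.
duplicate_rows <- function(data, cols = NULL) {
  if (is.null(cols)) cols <- names(data)
  which(duplicated(data[, cols, drop = FALSE]))
}

duplicate_rows(df)                         # integer(0): no fully identical rows
duplicate_rows(df, cols = c("id", "sex"))  # 3: row 3 repeats row 2 on id and sex
```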

This function checks that the contents of the schema are consistent with the data frame provided. It checks for unused schema entries, incompatible schema entries, and that any columns specified in the schema are present in the data frame.

check_schema_contents_against_df(validator)

| Argument | Description |
|---|---|
| validator | A Validator object. |

The updated Validator object with QA entries added for any issues found in the schema.
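
One part of this comparison, finding schema columns that are absent from the data, can be sketched with `setdiff()`. The treatment of the optional flag shown here is an assumption made for illustration, and the `region` column is invented for the example.

```r
schema <- list(columns = list(
  age    = list(type = "double", optional = FALSE),
  sex    = list(type = "character", optional = FALSE),
  region = list(type = "character", optional = TRUE)
))
df <- data.frame(age = c(10, 11), sex = c("M", "F"), stringsAsFactors = FALSE)

# Columns named in the schema but not present in the data frame
absent <- setdiff(names(schema$columns), names(df))

# Assumption: an absent column is only a problem when it is not optional
is_optional <- vapply(
  schema$columns[absent],
  function(col) isTRUE(col$optional),
  logical(1)
)
absent_required <- absent[!is_optional]

absent           # "region"
absent_required  # character(0)
```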

This function checks the types and classes of the columns in the data against the schema defined in the Validator object.

check_types(validator)

| Argument | Description |
|---|---|
| validator | A Validator object. |

The updated Validator object with quality assurance (QA) entries added for type and class checks. Each QA entry includes a description, pass/fail status, and any failing column IDs.
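
A type comparison of this kind can be sketched in base R with `typeof()`. The `expected` list below is a simplified stand-in for the type entries of a schema, and the comparison logic is an assumption rather than the package's implementation.

```r
df <- data.frame(
  age = c(10, 11, 13),
  sex = c("M", "F", "M"),
  stringsAsFactors = FALSE
)

# Simplified stand-in for the `type` entries of a schema
expected <- list(age = "integer", sex = "character")

# Storage type of each column named in the schema
actual <- vapply(df[names(expected)], typeof, character(1))

# Columns whose actual type differs from the expected type
mismatched <- names(expected)[actual != unlist(expected)]

actual      # age is "double", sex is "character"
mismatched  # "age"
```

Here `age` fails because `c(10, 11, 13)` is stored as double, not integer; the name of the failing column is what a QA entry would record.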

This function exists for ease of use; see export.Validator() for details.

export(object, ...)

| Argument | Description |
|---|---|
| object | The object to be exported. |
| ... | Additional arguments passed to specific methods. |

The result of the export operation, specific to the object type.

This function exports the log of a Validator object to a file in the specified format.

## S3 method for class 'Validator'
export(object, file, format = c("yaml", "json", "html", "csv"), ...)

| Argument | Description |
|---|---|
| object | A Validator object. |
| file | A string specifying the file path where the log will be exported. The file extension must match the specified format. |
| format | A string specifying the format of the output file. Supported formats are "yaml", "json", "html", and "csv". |
| ... | Additional arguments passed to specific methods. |

Writes the log to the specified file. No value is returned.

This function raises errors or warnings if any checks flagged as errors or warnings fail.

hard_checks_status(validator, hard_check)

| Argument | Description |
|---|---|
| validator | A Validator object. |
| hard_check | A logical value indicating whether to perform hard checks (default is TRUE). |

A warning if there are any warnings or errors in the log and hard_check is FALSE. An error if there are any errors and hard_check is TRUE.
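
The warning and error behaviour described above can be sketched as follows. The helper `raise_on_failures` and its vector arguments are hypothetical; in particular, issuing a warning when hard_check is TRUE and only warning-level checks fail is an assumption, since the description does not cover that case.

```r
# Hypothetical helper: `entry_types` and `passed` hold one element per log entry.
raise_on_failures <- function(entry_types, passed, hard_check) {
  failed <- entry_types[!passed]
  if (hard_check && any(failed == "error")) {
    stop("One or more checks flagged as errors failed.")
  }
  if (any(failed %in% c("error", "warning"))) {
    warning("One or more checks failed; see the log for details.")
  }
  invisible(NULL)
}

entry_types <- c("info", "warning", "error")
passed <- c(TRUE, TRUE, FALSE)

# Capture which kind of condition is raised in each mode
soft <- tryCatch(
  raise_on_failures(entry_types, passed, hard_check = FALSE),
  warning = function(w) "warning raised",
  error = function(e) "error raised"
)
hard <- tryCatch(
  raise_on_failures(entry_types, passed, hard_check = TRUE),
  warning = function(w) "warning raised",
  error = function(e) "error raised"
)

soft  # "warning raised"
hard  # "error raised"
```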

Flags outliers based on the interquartile range (IQR). Values are flagged as outliers if they are below Q1 - (multiplier * IQR) or above Q3 + (multiplier * IQR).

iqr_bounds(x, multiplier = 1.5)

| Argument | Description |
|---|---|
| x | A numeric vector. |
| multiplier | A numeric value to multiply the IQR by (default is 1.5). |

A logical vector the same length as x, with TRUE for values that are outliers and FALSE otherwise.
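
The IQR rule can be written out in a few lines of base R. This is a minimal sketch, assuming the default quantile definition of `stats::quantile()`; the name `flag_iqr_outliers` is illustrative and the package function may compute quartiles differently.

```r
# Sketch of IQR-based outlier flagging
flag_iqr_outliers <- function(x, multiplier = 1.5) {
  q <- stats::quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - multiplier * iqr
  upper <- q[2] + multiplier * iqr
  x < lower | x > upper
}

ages <- c(10, 11, 13, 15, 22, 34, 80)
# Q1 = 12, Q3 = 28, IQR = 16, so the bounds are -12 and 52
flag_iqr_outliers(ages)  # FALSE for every value except 80
```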

This wrapper calls is_valid_column_values for each column in the schema.

is_column_contents_valid(schema)

| Argument | Description |
|---|---|
| schema | The validator schema. |

TRUE if all column values are valid, otherwise an error is raised.

Checks that the type of each column in the schema is valid.

is_type_valid(schema)

| Argument | Description |
|---|---|
| schema | The validator schema. |

TRUE if all column types are valid, otherwise an error is raised.

This function checks that for any column schema, the max values (e.g., max_string_length, max_date) are not less than the corresponding min values (e.g., min_string_length, min_date). If any such inconsistency is found, an error is raised with a descriptive message.

is_valid_column_values(column_schema, col_name)

| Argument | Description |
|---|---|
| column_schema | A list representing the schema for a specific column, which may contain max and min value specifications. |
| col_name | The name of the column being checked, used for error messages. |

TRUE if all max values are greater than or equal to their corresponding min values, otherwise an error is raised.
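
A min/max consistency check of this kind might look like the sketch below. The helper `check_min_max` is hypothetical and pairs keys purely by their max_ and min_ prefixes, which is an assumption about how the schema keys are named.

```r
# Hypothetical check for one column schema:
# every max_* entry must be >= its min_* partner, if that partner exists.
check_min_max <- function(column_schema, col_name) {
  max_keys <- grep("^max_", names(column_schema), value = TRUE)
  for (max_key in max_keys) {
    min_key <- sub("^max_", "min_", max_key)
    min_value <- column_schema[[min_key]]
    if (!is.null(min_value) && column_schema[[max_key]] < min_value) {
      stop(sprintf("Column '%s': %s is less than %s.", col_name, max_key, min_key))
    }
  }
  TRUE
}

ok <- check_min_max(
  list(type = "character", min_string_length = 1, max_string_length = 5),
  col_name = "sex"
)
ok  # TRUE

# An inconsistent schema raises an error, captured here with try()
bad <- try(
  check_min_max(list(min_string_length = 5, max_string_length = 1), "sex"),
  silent = TRUE
)
inherits(bad, "try-error")  # TRUE
```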

Checks whether the schema is valid.

is_valid_schema(schema)

| Argument | Description |
|---|---|
| schema | A list to validate. |

TRUE if the schema is a valid named list, otherwise FALSE.
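
What counts as a "valid named list" is not spelled out here. A minimal sketch, assuming it means a list in which every element has a non-empty name, could be:

```r
# Hypothetical predicate: TRUE only for lists whose elements are all named
is_named_list <- function(x) {
  is.list(x) && !is.null(names(x)) && all(nzchar(names(x)))
}

is_named_list(list(check_duplicates = FALSE, columns = list()))  # TRUE
is_named_list(list(1, 2))                                        # FALSE
is_named_list(list(a = 1, 2))                                    # FALSE
is_named_list("not a list")                                      # FALSE
```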

Generates an HTML representation of a log.

log_html(validator)

| Argument | Description |
|---|---|
| validator | A Validator object. |

A string containing the HTML representation of the log.

This function extracts validation results from a pointblank agent and appends them to the validator's log.

log_pointblank_outcomes(validator)

| Argument | Description |
|---|---|
| validator | A list containing a pointblank agent and a log. The agent should have a validation_set from a pointblank interrogation. |

Each entry in the log will contain the timestamp, description, outcome, failing row indices, number of failures, and entry type for each validation step.

The updated validator list with new log entries appended.

This function converts a validator log into a formatted data frame (table) for exports.

log_to_table(log)

| Argument | Description |
|---|---|
| log | A list representing the validator log, where each element is a log entry. |

A data frame containing the formatted log entries.
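
Converting a list of entries into a table can be sketched in base R by turning each entry into a one-row data frame and stacking the rows. The entry fields used here (description, outcome, entry_type) are a simplified assumption about the log structure.

```r
# Simplified log: each element is one entry with scalar fields
log <- list(
  list(description = "No spaces in column names", outcome = TRUE, entry_type = "info"),
  list(description = "age is non-negative", outcome = FALSE, entry_type = "error")
)

# One-row data frame per entry, then bind the rows together
log_table <- do.call(
  rbind,
  lapply(log, function(entry) as.data.frame(entry, stringsAsFactors = FALSE))
)

nrow(log_table)   # 2
names(log_table)  # "description" "outcome" "entry_type"
```

This approach only works when every entry has the same scalar fields; entries with vector fields, such as multiple failing IDs, would first need collapsing to a single string.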

Creates a Validator object to validate data against a given schema.

new_validator(data, schema, backseries = NULL, name = deparse(substitute(data)))

| Argument | Description |
|---|---|
| data | A data frame to validate against the schema. |
| schema | A schema object that defines the validation rules. See the vignette for more details on schema structure. This can also be a file path to a JSON, YAML, or TOML file containing the schema. |
| backseries | Optional: A previous version of the data to check against. |
| name | Optional: The validator name. Defaults to the name of the data frame object supplied to data. Must be a single character string. |

An object of class Validator.

```r
library(data.checker)

# create schema
schema <- list(
  check_duplicates = FALSE,
  check_completeness = FALSE,
  columns = list(
    age = list(type = "double", optional = FALSE),
    sex = list(type = "character", optional = FALSE)
  )
)

# create dataframe
df <- data.frame(
  age = c(10, 11, 13, 15, 22, 34, 80),
  sex = c("M", "F", "M", "F", "M", "F", "M")
)

# create validator object
validator <- new_validator(
  data = df,
  schema = schema
)
```

This function prints the log of a Validator object in a markdown table format.

## S3 method for class 'Validator'
print(x, ...)

| Argument | Description |
|---|---|
| x | A Validator object. |
| ... | Additional arguments passed to specific methods. |

A markdown-formatted table of the validator log.

To be used by check_column_contents; not intended to be run separately.

run_checks(validator, i_col)

| Argument | Description |
|---|---|
| validator | A Validator object. |
| i_col | The column index. |

The validator object.

This function modifies a schema by converting column types to their corresponding R classes.

types_to_classes(schema)

| Argument | Description |
|---|---|
| schema | A list containing a columns element (the validator schema). |

The modified schema with updated type and class fields for each column.

Validates date formats in the schema. This function checks that any date formats specified in the schema are valid and can be parsed correctly.

validate_and_convert_date_formats(schema)

| Argument | Description |
|---|---|
| schema | A list containing a columns element (the validator schema). |

The original schema if all date formats are valid. If any date format is invalid, an error is thrown with a message indicating the issue.

This function calculates the z-score for each element of a numeric vector.

z_score(x)

| Argument | Description |
|---|---|
| x | A numeric vector. |

A vector of the same length as x, indicating the z-score for each element.
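
A z-score expresses each value as a number of standard deviations from the mean. The sketch below uses the sample standard deviation from `stats::sd()`; whether the package uses the sample or population form is not stated here, so treat that choice as an assumption.

```r
# Sketch: z-score of each element, using the sample standard deviation
z_scores <- function(x) {
  (x - mean(x, na.rm = TRUE)) / stats::sd(x, na.rm = TRUE)
}

z <- z_scores(c(2, 4, 6))
z  # -1 0 1 (mean is 4, sample standard deviation is 2)
```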