| Title: | Lossless CDISC-Native Input and Output for Clinical Datasets |
|---|---|
| Description: | Reads and writes clinical-trial datasets losslessly across 'SAS' XPORT (XPT), Clinical Data Interchange Standards Consortium (CDISC) Dataset-JSON, and 'Apache Parquet', applying a specification to produce submission-ready Study Data Tabulation Model (SDTM) and Analysis Data Model (ADaM) datasets. A single canonical metadata model carries labels, CDISC data types, lengths, 'SAS' display formats, controlled-terminology references, and sort keys identically across every format, so conversion between any two formats is lossless by construction. Pure 'R' and lightweight, with no external 'SAS' or 'Java' runtime. Implements the published format specifications for CDISC Dataset-JSON (<https://cdisc-org.github.io/DataExchange-DatasetJson/doc/dataset-json1-1.html>) and 'SAS' XPORT (<https://www.loc.gov/preservation/digital/formats/fdd/fdd000466.shtml>). |
| Authors: | Vignesh Thanikachalam [aut, cre, cph] |
| Maintainer: | Vignesh Thanikachalam <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.1 |
| Built: | 2026-06-24 13:23:06 UTC |
| Source: | https://github.com/cran/artoo |
Run the ordered, transactional artoo pipeline that turns a raw analysis
data frame into one conformed to its specification and carrying
artoo_meta. This is the middle of the workflow (spec -> apply_spec ->
read_/write_): the conformed frame is ready for any write_*() codec, and
the metadata it now carries makes that write lossless. The input is never
mutated; if any step aborts, the call leaves your data untouched.
    apply_spec(
      x,
      spec,
      dataset,
      conformance = c("warn", "abort", "off"),
      na_position = c("first", "last"),
      extra = c("keep", "drop"),
      on_coercion_loss = c("error", "keep")
    )
| x | The raw data frame to conform. |
|---|---|
| spec | The specification to conform to. |
| dataset | The dataset whose rules apply. |
| conformance | What to do with conformance findings. Note: this governs only the findings disposition (what is reported). Pipeline errors are a different category and abort under every setting, including |
| na_position | Where missing key values sort. |
| extra | What happens to undeclared columns. Interaction: the drop runs before the check, so Note: |
| on_coercion_loss | What to do when coercion would lose data. Interaction: independent of Tip: |
Ordered pipeline. Four fixed steps run in order: coerce each column
to its CDISC dataType, reorder columns to the spec, sort rows by the
dataset keys, then stamp the metadata. A spec variable the data lacks is
never fabricated as an empty column: artoo is a lossless carrier, not a
deriver. It is reported instead, an informational heads-up at apply time
plus a missing_variable finding (when mandatory) or missing_permissible
(when not), and left absent, so the conformed frame carries only the
columns the data actually had.
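The first three steps can be sketched in base R. This is a minimal illustration under stated assumptions, not artoo's implementation: `spec_vars`, `spec_types`, and `spec_keys` are hypothetical stand-ins for what a real spec carries, and the metadata stamp (step four) is omitted.

```r
# Hypothetical stand-ins for a spec: declared order, target types, sort keys.
spec_vars  <- c("USUBJID", "AGE", "SEX")
spec_types <- c(USUBJID = "string", AGE = "integer", SEX = "string")
spec_keys  <- "USUBJID"

raw <- data.frame(
  SEX     = c("M", "F", "F"),
  AGE     = c(64, 71, 58),             # integer-valued doubles
  USUBJID = c("01-003", NA, "01-001"),
  stringsAsFactors = FALSE
)

conformed <- raw                        # work on a copy; `raw` is not modified

# 1. Coerce each declared column that is actually present.
for (v in intersect(spec_vars, names(conformed))) {
  conformed[[v]] <- switch(
    spec_types[[v]],
    integer = as.integer(conformed[[v]]),
    string  = as.character(conformed[[v]])
  )
}

# 2. Reorder: declared columns first (only those present), then any others.
declared  <- intersect(spec_vars, names(conformed))
conformed <- conformed[c(declared, setdiff(names(conformed), declared))]

# 3. Sort rows by the keys, with missing key values first.
key_cols  <- unname(as.list(conformed[spec_keys]))
ord       <- do.call(order, c(key_cols, list(na.last = FALSE)))
conformed <- conformed[ord, , drop = FALSE]
rownames(conformed) <- NULL
```

The `na.last = FALSE` argument of `order()` plays the role that `na_position = "first"` is described as playing above; a declared variable that the data lacks is simply skipped by `intersect()`, so no empty column is created.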
Extras are kept by default. A column the spec does not declare
survives the pipeline (ordered after the declared ones), is reported
by the extra_variable conformance finding, and round-trips through
every write_*() codec with metadata inferred from its R class —
membership reported, never enforced by silent destruction. Keeping is
the default because artoo is lossless by construction: a
metadata-application step that silently discarded columns would break
that contract, so trimming data is always an explicit, announced choice
rather than a default side effect.
extra = "drop" opts in to trim-to-spec (the returned frame carries
exactly the spec's columns): the undeclared columns are removed before
the check, so the findings describe exactly the returned frame (a dropped
column is never reported as extra_variable), and the drop itself is
always announced (artoo_message_apply) as the audit trail of what was
removed — even under conformance = "off".
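The keep-versus-drop distinction comes down to two set differences between the spec's variables and the data's columns. The sketch below uses base R only; `spec_vars` is a hypothetical vector standing in for the declared variables, and the `message()` call merely imitates the announcement described above.

```r
spec_vars <- c("STUDYID", "USUBJID", "AGE")
raw <- data.frame(
  USUBJID = "01-001",
  DERIVED = 1L,
  STUDYID = "PILOT01",
  stringsAsFactors = FALSE
)

extras  <- setdiff(names(raw), spec_vars)   # in the data, not in the spec
missing <- setdiff(spec_vars, names(raw))   # in the spec, not in the data
present <- intersect(spec_vars, names(raw)) # declared and present, spec order

# extra = "keep": declared columns first, undeclared ones ride along after.
kept <- raw[c(present, extras)]

# extra = "drop": trim to the declared columns and announce what was removed.
trimmed <- raw[present]
if (length(extras) > 0) {
  message("Dropped undeclared columns: ", paste(extras, collapse = ", "))
}
```

Neither branch invents the missing `AGE` column; it stays absent in both results.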
Lossless or abort, your call. A coercion that would damage values —
an integer dataType truncating fractions or overflowing R's 32-bit
range — aborts with artoo_error_type before any value is touched, under
the default on_coercion_loss = "error". This gate is independent of
conformance: conformance = "off" does not bypass it. When the data
(not the spec) is right, set on_coercion_loss = "keep": the column
keeps its wider source type and the divergence is reported as an
integer_fraction / integer_overflow finding, never silently
truncated. When the spec is wrong, retype it with set_type() (or
repair_spec() from the findings). The error abort carries the offending
rows as data: cnd$variables is a data frame with columns
variable, data_type, n, and reason ("truncated" /
"overflowed"), so a pipeline can collect every mismatch in one
tryCatch(..., artoo_error_type = function(cnd) cnd$variables) pass.
The NA-introduction warning (artoo_warning_coercion) carries the same
frame with reason = "na_introduced", and a conformance = "abort"
failure carries the complete findings frame as cnd$findings.
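The pattern of detecting loss before coercing, then carrying the offenders on the condition object, can be imitated in base R. Everything here is an assumption made for illustration: `integer_loss()`, `check_integer_columns()`, and the condition class `sketch_error_type` are hypothetical names, not artoo's own.

```r
# Count values an integer type could not hold without loss.
integer_loss <- function(x) {
  x <- x[!is.na(x)]
  c(
    truncated  = sum(x != trunc(x)),
    overflowed = sum(abs(x) > .Machine$integer.max)
  )
}

# Inspect every integer-typed column first; abort once, carrying all offenders.
check_integer_columns <- function(df, integer_vars) {
  rows <- lapply(integer_vars, function(v) {
    loss <- integer_loss(df[[v]])
    loss <- loss[loss > 0]
    if (length(loss) == 0) return(NULL)
    data.frame(variable = v, n = unname(loss), reason = names(loss),
               stringsAsFactors = FALSE)
  })
  variables <- do.call(rbind, rows)
  if (!is.null(variables)) {
    stop(structure(
      class = c("sketch_error_type", "error", "condition"),
      list(message = "integer coercion would lose data",
           call = NULL, variables = variables)
    ))
  }
  invisible(df)
}

raw <- data.frame(AGE = c(64, 70.5, 58), SUBJN = c(1, 2, 3e9))

# One pass collects every mismatch instead of stopping at the first.
offenders <- tryCatch(
  check_integer_columns(raw, c("AGE", "SUBJN")),
  sketch_error_type = function(cnd) cnd$variables
)
```

Because the check runs over all columns before anything is coerced, `raw` is untouched when the abort happens, which mirrors the "before any value is touched" guarantee described above.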
Values are never translated. Coded variables keep their submission
values (SEX stays "M"); codelist translation is its own verb,
decode_column().
A conformed <data.frame> carrying artoo_meta (read it with
get_meta()) and, unless conformance = "off", the findings frame
conformance() reads back. Hand it to any write_*() codec.
Check: check_spec() for the findings; conformance() to read them
back.
Fix the spec: set_type() to retype a variable the data disagrees
with, repair_spec() to apply every integer fix from a findings frame.
Translate: decode_column() for codelist value mapping.
Metadata: get_meta() / set_meta() for what the stamp attaches.
    # ---- Example 1: conform ADSL, then read its metadata ----
    #
    # The bundled adam_spec describes ADSL; the raw frame is coerced,
    # ordered, sorted, and stamped with the CDISC metadata get_meta() reads
    # back. Variables the spec declares but this extract never derived are
    # reported (not added), readable via conformance().
    adsl <- apply_spec(cdisc_adsl, adam_spec, "ADSL")
    get_meta(adsl)@dataset$records

    # ---- Example 2: extras are kept and reported, or dropped on request ----
    #
    # By default a column outside the spec rides along (reported by the
    # extra_variable finding) and still writes losslessly; extra = "drop"
    # trims to the spec, announced and still reported. DM is SDTM, so it
    # conforms against the bundled sdtm_spec.
    raw <- cdisc_dm
    raw$DERIVED <- seq_len(nrow(raw))
    dm <- apply_spec(raw, sdtm_spec, "DM")
    findings <- conformance(dm)
    findings[findings$check == "extra_variable", c("variable", "message")]
    trimmed <- apply_spec(raw, sdtm_spec, "DM", extra = "drop")
    "DERIVED" %in% names(trimmed)
Build a reusable control that selects which dimensions check_spec()
evaluates. Construct one per study and thread it through every
check_spec() call so the conformance surface is consistent. Each toggle is validated at construction, so a
mistyped name or value aborts early rather than being silently ignored.
    artoo_checks(
      missing_variable = TRUE,
      missing_permissible = TRUE,
      extra_variable = TRUE,
      type_mismatch = TRUE,
      length_overflow = TRUE,
      char_length_limit = TRUE,
      codelist_membership = TRUE,
      codelist_membership_extensible = TRUE,
      label_match = TRUE,
      key_uniqueness = TRUE,
      display_format = TRUE,
      variable_name = TRUE,
      dataset_name = TRUE,
      label_length = TRUE,
      integer_overflow = TRUE,
      integer_fraction = TRUE,
      iso8601_format = TRUE
    )
| missing_variable | Flag mandatory spec variables absent from the data. |
|---|---|
| missing_permissible | Flag permissible (non-mandatory) spec variables absent from the data. |
| extra_variable | Flag data columns the spec does not declare. |
| type_mismatch | Flag columns whose storage differs from the spec dataType. |
| length_overflow | Flag character values longer than the spec length. |
| char_length_limit | Flag character values longer than the SAS XPORT v5 / FDA 200-byte limit. |
| codelist_membership | Flag values outside their closed codelist. |
| codelist_membership_extensible | Flag values outside an extensible codelist's enumerated terms. |
| label_match | Flag a column whose label attribute differs from the spec label. |
| key_uniqueness | Flag a dataset whose spec key variables do not uniquely identify its rows. |
| display_format | Flag a date/datetime/time variable whose displayFormat is not a recognized SAS format of that family. |
| variable_name | Flag a data column name that violates the XPORT naming rules. |
| dataset_name | Flag a dataset name that violates the XPORT naming rules. |
| label_length | Flag a column label attribute over the 40-byte XPORT v5 / FDA limit. |
| integer_overflow | Flag an integer-typed variable holding values beyond R's 32-bit integer range. |
| integer_fraction | Flag an integer-typed variable holding fractional values. |
| iso8601_format | Flag a character date/datetime/time variable whose values are not valid ISO 8601 text. |
Selection, not severity. This control decides which findings are
produced; apply_spec()'s conformance argument (warn, abort, off)
decides what to do with the findings its full-default check raises. A
disabled dimension is skipped entirely, so the findings frame stays
clean.
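To make "selection" concrete, here is a base-R sketch of two of the dimensions listed above (key uniqueness and the 200-byte character limit) gated by toggles. The `enabled` vector is a hypothetical stand-in for the control object, and the predicates are simplified assumptions about what each dimension tests, not artoo's actual rules.

```r
dm <- data.frame(
  USUBJID = c("01-001", "01-002", "01-002"),       # duplicate key on purpose
  COMMENT = c("ok", strrep("x", 201), "ok"),       # one value over 200 bytes
  stringsAsFactors = FALSE
)
keys <- "USUBJID"

# Toggles: only enabled dimensions are evaluated at all.
enabled <- c(key_uniqueness = TRUE, char_length_limit = FALSE)

results <- logical(0)
if (enabled[["key_uniqueness"]]) {
  results["key_uniqueness"] <- anyDuplicated(dm[keys]) > 0
}
if (enabled[["char_length_limit"]]) {
  results["char_length_limit"] <- any(nchar(dm$COMMENT, type = "bytes") > 200)
}

flagged <- names(results)[results]
```

The over-long `COMMENT` value never appears in `flagged` because its dimension was switched off before evaluation, rather than being computed and then filtered out.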
An <artoo_checks> control object. Pass it as the checks
argument to check_spec().
check_spec(), which consumes it; apply_spec() for the
findings disposition.
    # ---- Example 1: the default runs every conformance dimension ----
    #
    # With no arguments, every conformance dimension is enabled.
    artoo_checks()

    # ---- Example 2: silence one dimension for a whole study ----
    #
    # Turn off the length check (e.g. while a spec's lengths are provisional)
    # and reuse the control across every dataset.
    spec <- artoo_spec(cdisc_sdtm_datasets, cdisc_sdtm_variables,
                       codelists = cdisc_codelists)
    ck <- artoo_checks(length_overflow = FALSE)
    nrow(check_spec(cdisc_dm, spec, "DM", checks = ck))
List the character encodings clinical data actually travels in, with
the name each ecosystem uses for the same thing: the R name (the
standard IANA name, which iconv() and the wider R ecosystem use),
the SAS session-encoding name, and the Python codec. Any spelling
from the r or sas column works as the encoding argument of
every artoo reader and writer.
    artoo_encodings()
What an encoding is. Text is stored as bytes; an encoding is the rule that maps those bytes to characters. Plain A-Z letters, digits, and punctuation are the same bytes in every encoding listed here; the differences show only in accented letters (a-umlaut, e-acute), special symbols (micro, degree), and non-Latin scripts. Reading bytes with the wrong rule is what turns a degree sign into garbage.
Which one do I have? In SAS, run PROC OPTIONS OPTION=ENCODING; RUN;
and look up the reported name in the sas column. Most US/EU Windows
SAS installs report WLATIN1 — that is windows-1252 here.
Which one should I write? Usually none: write_*(encoding = NULL)
(the default) inherits the encoding recorded when the data was read, so
a round-trip is byte-faithful. The regulatory defaults artoo applies
when nothing is recorded: SAS XPORT writes US-ASCII (the FDA Study
Data Technical Conformance Guide expectation) and Dataset-JSON / NDJSON
write UTF-8 (required by CDISC and RFC 8259). A value that cannot be
represented in the target encoding aborts loudly — see on_invalid on
the writers.
Note: in memory, artoo text is always UTF-8 (NFC-normalised) — encodings only matter at the file boundary, exactly as in Python 3.
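The byte-level difference is easy to see with base R's `iconv()`. This sketch uses `"latin1"` as the target because R supports that name on every platform; the degree sign has the same single byte there as in windows-1252. It does not call any artoo function.

```r
degree <- enc2utf8("\u00b0")   # the degree sign, held as UTF-8 in memory

# The same character under two byte rules.
utf8_bytes   <- charToRaw(degree)                                    # c2 b0
latin1_bytes <- iconv(degree, "UTF-8", "latin1", toRaw = TRUE)[[1]]  # b0

# Plain ASCII is byte-identical under both rules.
ascii_same <- identical(
  charToRaw("A1."),
  iconv("A1.", "UTF-8", "latin1", toRaw = TRUE)[[1]]
)

# Reading the latin1 byte as if it were UTF-8 fails validation ...
misread_valid <- validUTF8(rawToChar(latin1_bytes))

# ... while declaring the right source encoding recovers the character.
recovered <- iconv(rawToChar(latin1_bytes), "latin1", "UTF-8")
```

This is why the encoding matters only at the file boundary: once `recovered` is in memory as UTF-8, it is the same string regardless of how the file stored it.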
A <data.frame> with one row per encoding and columns
r (the R name — the standard IANA name iconv() uses, and what
artoo records in the metadata), sas (the SAS session-encoding
name), python (the Python codec name), and description.
Use it: the encoding argument of read_xpt(), write_xpt(),
read_json(), and the other readers/writers.
Formats: artoo_formats() for the codec registry.
    # ---- Example 1: the full cross-ecosystem table ----
    #
    # One row per encoding; the same byte rule under each ecosystem's name.
    artoo_encodings()

    # ---- Example 2: look up a SAS session encoding ----
    #
    # PROC OPTIONS reported WLATIN1: find the R and Python names for the
    # same bytes (the sas and r spellings both work as encoding=).
    enc <- artoo_encodings()
    enc[enc$sas == "WLATIN1", ]
List every registered codec and whether it can read and write in this
session. The pure-R formats (xpt, json, rds) are always available;
optional-engine formats (parquet) report FALSE until their package is
installed. Purely informational, modelled on the diagnostic helpers in the
wider ecosystem; it never aborts.
    artoo_formats()
A <data.frame> with one row per format and columns format,
read, write (logical), and extensions.
read_dataset() and write_dataset() which use the registry.
    # ---- Example 1: see what this session can read and write ----
    #
    # rds is always available; the table shows the extensions each codec claims.
    artoo_formats()
Build and validate an artoo_spec from dataset, variable, and codelist
tables. Each table is coerced to a plain data frame, missing optional
columns are filled with typed NAs, every variable type is canonicalised
to the CDISC dataType vocabulary, and cross-slot integrity (dataset and
codelist references) is checked before the object is returned. The spec
is the lingua franca the rest of artoo reads, applies, and serialises.
    artoo_spec(
      datasets = NULL,
      variables = NULL,
      codelists = NULL,
      study = NULL,
      values = NULL,
      methods = NULL,
      comments = NULL,
      documents = NULL,
      standard = NULL
    )
| datasets | Dataset-level metadata table. |
|---|---|
| variables | Variable-level metadata table. Requirement: every |
| codelists | Controlled-terminology terms. Interaction: every |
| study | Study-level metadata. |
| values | Value-level (VLM) metadata. |
| methods | Derivation methods. |
| comments | Comment definitions. |
| documents | Document references. |
| standard | The CDISC standard the spec implements. Restriction: all sources must agree on one value; conflicting standards abort with |
Coerce, then validate. Each table is first coerced to a plain data
frame (a tibble is accepted and demoted); known columns are cast to
their storage mode and absent optional columns are added as typed NA,
so every downstream reader can trust the schema. Validation runs only
after coercion, on the completed slots.
Type canonicalisation. variables$data_type is mapped through the
closed CDISC dataType vocabulary (string, integer, decimal,
float, double, boolean, date, datetime, time, URI). Common
SAS / P21 spellings resolve automatically ("text", "Char",
"integer (8)", ...); an unrecognised token aborts with
artoo_error_type.
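A lookup of this kind can be sketched in a few lines of base R. The vocabulary below is the one listed above; the alias table and the rule for stripping a parenthesised width are assumptions made for illustration, and `canonical_type()` is a hypothetical helper, not an artoo function.

```r
cdisc_types <- c("string", "integer", "decimal", "float", "double",
                 "boolean", "date", "datetime", "time", "URI")

# Assumed aliases: the actual set artoo resolves is not reproduced here.
aliases <- c(text = "string", char = "string")

canonical_type <- function(x) {
  # Drop a trailing "(8)"-style width, then compare case-insensitively.
  key <- tolower(trimws(sub("\\s*\\(.*\\)\\s*$", "", x)))
  hit <- match(key, tolower(cdisc_types))
  out <- ifelse(is.na(hit), unname(aliases[key]), cdisc_types[hit])
  if (anyNA(out)) {
    stop("unrecognised type: ", paste(x[is.na(out)], collapse = ", "))
  }
  out
}

resolved <- canonical_type(c("text", "Char", "integer (8)", "URI"))
```

Failing on an unknown token, rather than passing it through, keeps the vocabulary closed: every value that survives is one of the ten canonical names.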
Cross-slot integrity. Construction fails (artoo_error_spec) if a
variable names a dataset absent from datasets, or references a
codelist_id absent from codelists.
One spec, one standard. An artoo_spec carries exactly one CDISC
standard, stored as the scalar @standard property. The constructor
resolves it from the standard argument, a standard column in
datasets (the P21 workbook shape), and a standard field in study
(the Define-XML shape) — those columns are consumed, so @standard is
the single home. More than one distinct value aborts with
artoo_error_spec; scope the source to one standard (e.g.
read_spec(path, datasets = ...)) instead of mixing.
One study vocabulary. Well-known study fields are canonicalised to
the CDISC ODM GlobalVariables names, snake_cased: study_name,
study_description, protocol_name. Source spellings resolve
automatically (StudyName, studyid, ...); fields the vocabulary does
not know pass through verbatim. Aliases that disagree on a value abort
with artoo_error_spec.
A validated artoo_spec object. Inspect it with
spec_datasets() / spec_variables(), or check it with
validate_spec().
Inspect: spec_datasets(), spec_variables(), spec_codelists(),
spec_keys(), spec_study().
Check: validate_spec(). Predicate: is_artoo_spec().
    # ---- Example 1: build a spec from the bundled CDISC-pilot tables ----
    #
    # `cdisc_sdtm_datasets` and `cdisc_sdtm_variables` hold the CDISC pilot SDTM
    # metadata in the shape artoo_spec() expects; the constructor
    # canonicalises every type and checks cross-slot integrity.
    spec <- artoo_spec(cdisc_sdtm_datasets, cdisc_sdtm_variables,
                       codelists = cdisc_codelists)
    spec_datasets(spec)

    # ---- Example 2: a focused spec for a single dataset ----
    #
    # Slice the bundled tables to one dataset (DM) to build a smaller spec.
    dm_ds <- cdisc_sdtm_datasets[cdisc_sdtm_datasets$dataset == "DM", ]
    dm_var <- cdisc_sdtm_variables[cdisc_sdtm_variables$dataset == "DM", ]
    dm_spec <- artoo_spec(dm_ds, dm_var, codelists = cdisc_codelists)
    head(spec_variables(dm_spec, "DM")[, c("variable", "label", "data_type")])
A 60-row sample of the CDISC pilot ADaM adverse events analysis dataset (ADAE): one row per reported event, with treatment-emergent flags, severity, and coding variables (labels preserved as attributes).
    cdisc_adae
A data frame with 60 rows (STUDYID, USUBJID, AETERM,
AESEV, TRTEMFL, ASTDT, ...).
First 60 rows of the CDISC pilot adam/cdisc/adae.xpt from the
PHUSE Test Data Factory (phuse-org/phuse-scripts).
A 60-subject sample of the CDISC pilot ADaM subject-level analysis dataset (ADSL): one row per subject, with treatment, demographic, baseline, and disposition variables (labels preserved as column attributes).
    cdisc_adsl
A data frame with 60 rows and 48 variables (STUDYID, USUBJID,
TRT01P, AGE, SEX, RACE, SAFFL, TRTSDT, ...).
First 60 subjects of the CDISC pilot adam/cdisc/adsl.xpt from
the PHUSE Test Data Factory (phuse-org/phuse-scripts).
A 60-subject sample of the CDISC pilot SDTM demographics domain (DM): one row per subject, with the standard DM variables (labels preserved as attributes).
    cdisc_dm
A data frame with 60 rows and 25 variables (STUDYID, DOMAIN,
USUBJID, AGE, SEX, RACE, ARM, COUNTRY, ...).
First 60 subjects of the CDISC pilot sdtm/TDF_SDTM_v1.0/dm.xpt
from the PHUSE Test Data Factory (phuse-org/phuse-scripts).
The constructor-shaped metadata tables for the bundled demo data, split
by CDISC standard because an artoo_spec carries exactly one:
cdisc_adam_datasets + cdisc_adam_variables describe ADSL
(ADaMIG 1.1), and cdisc_sdtm_datasets + cdisc_sdtm_variables
describe DM (SDTMIG 3.1.2). Each variables table is derived from the
data (names, labels, inferred CDISC types, byte lengths) by
data-raw/. Pass one standard's pair to artoo_spec(); passing both
pairs together aborts with artoo_error_spec — mixing standards in
one spec is the mistake the split exists to prevent.
    cdisc_adam_datasets
    cdisc_adam_variables
    cdisc_sdtm_datasets
    cdisc_sdtm_variables
    cdisc_codelists
Each *_datasets table is a data frame with one row per dataset:
Dataset name ("ADSL" or "DM").
Dataset label.
The CDISC standard, consumed into the spec's
spec_standard().
Each *_variables table is a data frame with one row per variable:
Owning dataset name.
Variable name.
Variable label (from the data's label attribute).
CDISC dataType inferred from the column's class.
Storage length (max byte width for character, 8 for numeric).
Variable order within the dataset.
NCI codelist reference ("C66731" on SEX).
cdisc_codelists is a data frame of controlled-terminology terms (the
real NCI codelist C66731 for SEX):
Codelist identifier ("C66731").
Submission value ("M", "F", ...).
Decoded value ("Male", "Female", ...).
Term order.
Derived from the CDISC pilot .xpt files in the public PHUSE
Test Data Factory (phuse-org/phuse-scripts) by
data-raw/bundle-demo.R.
Ready-made artoo_spec objects built from the official CDISC Define-XML
2.1 release examples: adam_spec (ADaMIG 1.1; datasets ADSL, ADAE) and
sdtm_spec (SDTMIG 3.1.2; datasets TS, DM, VS, SUPPDM). Every bundled
demo dataset conforms to its spec under
apply_spec(conformance = "abort") — the pairing is gated at build
time. The same specs ship as P21 workbooks in
system.file("extdata", "adam-spec.xlsx", package = "artoo") and
"sdtm-spec.xlsx", written by write_spec().
    adam_spec
    sdtm_spec
A validated artoo_spec() object; inspect it with
spec_datasets(), spec_variables(), and spec_standard().
Demo adaptations (each an ADR in data-raw/bundle-spec.R): the
sponsor-defined codelists (CL.ARM, CL.ARMCD, CL.BMICAT, and the
extensible NCI VS codelists) are marked extended; VISITNUM is typed
float (the pilot data has fractional visit numbers); VS declares the
SDTMIG timepoint variables VSTPT/VSTPTNUM, with VSTPTNUM in the VS
key.
The CDISC Define-XML 2.1 release example defines (ADaM + SDTM),
pinned by sha256 in data-raw/bundle-spec.R; data from the PHUSE Test
Data Factory.
A 60-row sample of the CDISC pilot SDTM supplemental qualifiers for DM (SUPPDM): the non-standard qualifier values that ride alongside the DM domain.
    cdisc_suppdm
A data frame with 60 rows (STUDYID, RDOMAIN, USUBJID,
QNAM, QVAL, ...).
First 60 rows of the CDISC pilot sdtm/cdiscpilot01/suppdm.xpt
from the PHUSE Test Data Factory (phuse-org/phuse-scripts).
The CDISC pilot SDTM trial summary domain (TS): one row per trial characteristic (33 rows in the pilot), the study-design parameters a submission carries.
    cdisc_ts
A data frame with 33 rows (STUDYID, TSPARMCD, TSPARM,
TSVAL, ...).
The CDISC pilot sdtm/cdiscpilot01/ts.xpt from the PHUSE Test
Data Factory (phuse-org/phuse-scripts).
A 60-row sample of the CDISC pilot SDTM vital signs domain (VS): repeated measurements per subject across visits, positions, and planned timepoints.
    cdisc_vs
A data frame with 60 rows (STUDYID, USUBJID, VSTESTCD,
VSORRES, VISITNUM, VSPOS, VSTPTNUM, ...).
First 60 rows of the CDISC pilot sdtm/cdiscpilot01/vs.xpt from
the PHUSE Test Data Factory (phuse-org/phuse-scripts).
Compare a data frame to one dataset's specification and report where they
diverge. This is the data-conformance check at the end of the artoo
workflow (spec -> apply_spec -> check_spec): it reuses the metadata the
spec already carries (variables, types, lengths, codelists, keys). It is
distinct from validate_spec(), which checks the spec's own internal
integrity rather than the data. Both report findings keyed to the same open
rule catalog.
    check_spec(
      x,
      spec,
      dataset,
      decode = c("none", "to_decode", "to_code"),
      checks = NULL
    )
| x | The data frame to check. |
|---|---|
| spec | The specification to check against. |
| dataset | The dataset whose rules apply. Restriction: must name a dataset in |
| decode | Which codelist column membership is checked against. |
| checks | Which conformance dimensions to evaluate. |
Findings, not enforcement. check_spec() never modifies data; it
returns every divergence it finds. apply_spec() runs it and decides what
to do via its conformance argument (warn, abort, off). The dimensions
checked are: missing variables (split into mandatory, an error, and
permissible, a warning), extra variables (data column the spec does not
declare), type mismatch, ISO 8601 validity of character date/datetime/time
values (CDISC partials pass; "12NOV2019" does not), fractional values
and 32-bit overflow under an integer dataType (both would corrupt data
at coercion), character length overflow, the hard 200-byte XPORT v5 / FDA
character limit, codelist membership, label drift against the spec, key
uniqueness, and displayFormat validity.
Decode-aware membership. decode selects which codelist column the
data is checked against, matching apply_spec()'s decode step:
"none"/"to_code" check against the codelist terms, "to_decode"
against the decodes. apply_spec() threads its own decode through, so
a decoded column is not wrongly flagged.
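The effect of `decode` on membership can be shown with a mock codelist. This is a base-R sketch under assumptions: the column names `term` and `decode`, the two-term codelist, and the helper `outside_codelist()` are all hypothetical and chosen only to mirror the description above.

```r
# Mock subset of a SEX codelist: submission values and their decodes.
codelist <- data.frame(
  term   = c("M", "F"),
  decode = c("Male", "Female"),
  stringsAsFactors = FALSE
)

outside_codelist <- function(values, codelist,
                             decode = c("none", "to_decode", "to_code")) {
  decode <- match.arg(decode)
  # "none" and "to_code" compare against terms; "to_decode" against decodes.
  allowed <- if (decode == "to_decode") codelist$decode else codelist$term
  unique(values[!is.na(values) & !(values %in% allowed)])
}

coded_check   <- outside_codelist(c("M", "F", "Male"), codelist)
decoded_check <- outside_codelist(c("Male", "Female"), codelist, "to_decode")
```

A decoded value such as `"Male"` is a finding when the data is expected to hold submission values, and passes once the check is told the column was decoded.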
Fatal vs informational coercion checks. Only integer_fraction and
integer_overflow carry error severity: they mark data an integer
dataType cannot hold without loss, which apply_spec() refuses to coerce
(its on_coercion_loss governs that gate). type_mismatch is a note: a
column stored more widely than the spec declares (an integer-valued
double, for instance) coerces cleanly, so it is informational, not a
blocker.
A findings data frame with columns check, dimension,
severity ("error", "warning", or "note"), dataset, variable,
and message, one row per divergence. Zero rows means the data conforms.
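Because the result is an ordinary data frame, the usual base-R subsetting applies. The frame below is a hand-built mock with the documented column names; its rows, variable names, and messages are invented for illustration, and `dimension` is left as `NA` because its values are not shown here.

```r
findings <- data.frame(
  check     = c("integer_fraction", "missing_permissible", "type_mismatch"),
  dimension = NA_character_,
  severity  = c("error", "warning", "note"),
  dataset   = "ADSL",
  variable  = c("AGE", "VAR_A", "VAR_B"),   # mock variable names
  message   = c("fractional values under integer",
                "permissible variable absent",
                "stored as double, declared integer"),
  stringsAsFactors = FALSE
)

# Zero rows would mean the data conforms.
conforms <- nrow(findings) == 0

# Isolate the rows that would block coercion.
blockers <- findings[findings$severity == "error",
                     c("check", "variable", "message")]

# Count findings per severity, keeping empty levels visible.
by_severity <- table(factor(findings$severity,
                            levels = c("error", "warning", "note")))
```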
apply_spec() which runs this; check_study() for the same
check across a whole study; artoo_checks() to select dimensions;
validate_spec() for spec integrity.
    # ---- Example 1: a conformed frame surfaces only the genuine gaps ----
    #
    # apply_spec() coerces and orders to spec but never fabricates a variable
    # the data lacks; checking the result reports the permissible variables
    # this extract never derived (here, six) instead of hiding them as empty
    # columns.
    adsl <- apply_spec(cdisc_adsl, adam_spec, "ADSL", conformance = "off")
    nrow(check_spec(adsl, adam_spec, "ADSL"))

    # ---- Example 2: raw data surfaces divergences ----
    #
    # Checking a raw frame with an undeclared column flags the extras.
    raw <- cdisc_adsl
    raw$NOTASPEC <- 1
    head(check_spec(raw, adam_spec, "ADSL")[, c("check", "variable", "severity")])
Run check_spec() over every dataset in a study and return one stacked
findings frame. Where check_spec() answers "does this dataset conform?",
check_study() answers "is my whole study submittable?" in a single pass,
surfacing every dataset's divergences at once instead of one abort at a
time. The result is an ordinary findings frame underneath, so filter it by
severity or hand it straight to repair_spec().
    check_study(
      spec,
      data,
      decode = c("none", "to_decode", "to_code"),
      checks = NULL
    )
| spec | The specification to check against. |
|---|---|
| data | The study's datasets. |
| decode | Which codelist column to check against. |
| checks | Which conformance dimensions to run. |
One row per divergence, every dataset stacked. Each dataset's findings
carry its name in the dataset column, so the frame is the union of the
per-dataset check_spec() results. Printing renders the dataset-by-check
count matrix (the study-level summary); the underlying frame is unchanged.
Data-requiring, like check_spec(). check_study() checks data
against the spec, so it needs the data frames. For the spec's own
structural integrity (no data), use validate_spec().
An <artoo_study_findings> data frame with the same columns as
check_spec() (check, dimension, severity, dataset, variable,
message), one row per divergence across all datasets. Zero rows means
the whole study conforms. Print it for the count matrix; treat it as an
ordinary data frame otherwise.
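Stacking per-dataset findings and counting them by dataset and check needs nothing beyond `rbind()` and `table()`. The sketch below is a base-R approximation of that idea with mock findings; it is not how artoo builds or prints its own matrix, and the variable names are invented.

```r
# Mock per-dataset findings (only two of the documented columns, for brevity).
per_dataset <- list(
  ADSL = data.frame(check = c("integer_fraction", "missing_permissible"),
                    variable = c("AGE", "VAR_A"), stringsAsFactors = FALSE),
  ADAE = data.frame(check = "missing_permissible",
                    variable = "VAR_B", stringsAsFactors = FALSE)
)

# Tag each frame with its dataset name, then stack them.
tagged <- Map(function(f, nm) cbind(dataset = nm, f, stringsAsFactors = FALSE),
              per_dataset, names(per_dataset))
stacked <- do.call(rbind, tagged)
rownames(stacked) <- NULL

# Dataset-by-check count matrix.
counts <- table(stacked$dataset, stacked$check)
```

The stacked frame stays the source of truth; the matrix is only a summary view, so filtering by severity or dataset still happens on `stacked`.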
One dataset: check_spec(). Spec structure only: validate_spec().
Repair: repair_spec() to apply the integer fixes the matrix surfaces.
    # ---- Example 1: scan a whole study in one pass ----
    #
    # Loop the conformance check over every dataset's data. A fractional AGE
    # (the spec types it integer) surfaces as an integer_fraction finding; the
    # print is a dataset-by-check count matrix.
    adsl <- cdisc_adsl
    adsl$AGE <- adsl$AGE + 0.5
    check_study(adam_spec, list(ADSL = adsl, ADAE = cdisc_adae))

    # ---- Example 2: feed the findings straight into repair_spec() ----
    #
    # The result is an ordinary findings frame, so repair_spec() consumes it to
    # flip every integer_fraction / integer_overflow variable across the study.
    findings <- check_study(adam_spec, list(ADSL = adsl))
    fixed <- repair_spec(adam_spec, findings)
    spec_variables(fixed, "ADSL")$data_type[
      spec_variables(fixed, "ADSL")$variable == "AGE"
    ]
Return a one-row-per-variable attribute table — the pane a SAS
programmer reads in PROC CONTENTS or the Universal Viewer: position,
name, Char/Num type, length, format, informat, label, and the CDISC key
sequence. This is the quick look after apply_spec() stamps a frame, or
on any dataset file artoo can read.
columns(x, member = NULL)
x |
What to describe. |
member |
XPORT member to describe. |
Every real column shows. The table covers the frame's columns: a
column the spec never declared (which apply_spec() keeps, never drops)
still appears, its attributes inferred from the R class. A plain,
never-stamped data frame works the same way — every attribute is
inferred.
Len is physical storage. The pane mirrors what PROC CONTENTS
shows and what the writers store, not a spec digit-width. A Char column
always carries a byte Len (the declared length, else inferred from the
data); a numeric Len is blank, because a numeric stores as an 8-byte
IEEE double with no character width (a Define-XML numeric Length is a
digit-width, kept in the metadata for the Define / P21 surface). Format
and informat names render uppercase; the metadata keeps the source
spelling.
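A minimal base-R sketch of the Char versus numeric distinction, assuming that an inferred Char Len is the widest value measured in bytes. The text says only that the length is inferred from the data, so the exact rule is an assumption, and `infer_len()` is a hypothetical helper rather than part of artoo.

```r
# Hypothetical helper: byte width for character columns, NA for numerics.
infer_len <- function(col) {
  if (is.character(col)) {
    # Bytes, not characters: a multi-byte UTF-8 value needs more storage.
    widths <- nchar(col[!is.na(col)], type = "bytes")
    if (length(widths) == 0L) 1L else max(1L, max(widths))
  } else {
    NA_integer_  # numerics store as 8-byte doubles, so no character width
  }
}

df <- data.frame(
  USUBJID = c("01-701-1015", "01-701-1023"),
  SITE    = c("Z\u00fcrich", "Basel"),
  AGE     = c(63, 64),
  stringsAsFactors = FALSE
)

lens <- vapply(df, infer_len, integer(1))
print(lens)
```

Measuring in bytes matters for non-ASCII text: the six-character site name above occupies seven bytes in UTF-8.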
A path reads through the codec. A file path is dispatched by
extension through the same registry as read_dataset(), so the
attributes come from the one lossless reader (an unknown extension
aborts with the registry's known-extensions message).
Tip: a multi-member XPORT file needs member =; without one the
xpt reader aborts and points at xpt_members() for the listing.
Note: an .xpt path shows a blank Key: the XPORT byte layout
stores only name, label, length, and formats, so keySequence (like
codelist and origin) cannot ride in the file. The metadata-carrying
formats (.json, .ndjson, .parquet, .rds) and the in-session
conformed frame show it; re-apply the spec after an xpt read to
restore it.
An <artoo_columns> data frame with columns #, Variable,
Type, Len, Format, Label, Key, printed left-aligned. The
Informat column appears only when at least one variable carries an
informat (most clinical panes have none). It is an ordinary data frame
underneath — filter or inspect it like one.
Members: xpt_members() lists a multi-member XPORT file.
Metadata: get_meta() for the full artoo_meta; apply_spec()
which stamps it.
spec <- artoo_spec(cdisc_adam_datasets, cdisc_adam_variables,
  codelists = cdisc_codelists)

# ---- Example 1: the column pane of a conformed frame ----
#
# apply_spec() stamps ADSL with its metadata; columns() reads it back as
# the SAS-style attribute table.
adsl <- apply_spec(cdisc_adsl, spec, "ADSL", conformance = "off")
columns(adsl)

# ---- Example 2: straight off a file ----
#
# Write the conformed frame to any format and point columns() at the
# path; the codec reads it back and the attributes are identical.
p <- tempfile(fileext = ".json")
write_json(adsl, p)
columns(p)
Pull the conformance findings apply_spec() attached to a conformed
data frame — the readable answer to "what did the check find?". The
result is the same findings frame check_spec() returns (one row per
divergence), with a print method that renders a sectioned report, so
conformance(adsl) at the console is the inspection step the
artoo_warning_conformance warning points you at.
conformance(x)
x |
A data frame produced by Requirement: the conformance check must have run: a frame from
|
An <artoo_findings> data frame with columns check,
dimension, severity ("error", "warning", or "note"),
dataset, variable, and message. Zero rows means the data
conformed. Print it for the sectioned report; treat it as an ordinary
data frame for programmatic use.
apply_spec() which attaches the findings; check_spec() for
the same check on demand; artoo_checks() to select dimensions.
spec <- artoo_spec(
  cdisc_sdtm_datasets, cdisc_sdtm_variables,
  codelists = cdisc_codelists
)

# ---- Example 1: inspect what the conform step found ----
#
# Conforming raw DM records the findings on the result; conformance()
# renders them as a report instead of a raw attribute.
dm <- suppressWarnings(apply_spec(cdisc_dm, spec, "DM"))
conformance(dm)

# ---- Example 2: gate a pipeline on error-severity findings ----
#
# The findings frame is an ordinary data frame: filter by severity to
# drive your own logic.
f <- conformance(dm)
nrow(f[f$severity == "error", ])
Map one column's values through a spec codelist — code to decode or
decode to code — writing the result to a new variable or in place. This
is the everyday companion to apply_spec()'s whole-dataset decode
step: deriving RACEN from RACE, recovering submission codes from
decoded values, or decoding a single variable for display, without
re-running the pipeline. When the target variable is declared in the
spec, the result is also coerced to its dataType and labelled, so the
new column lands conformed.
decode_column( x, spec, dataset, from, to = from, direction = c("to_decode", "to_code"), no_match = c("error", "keep", "na"), trim = TRUE, ignore_case = FALSE )
x |
The data frame to extend. |
spec |
The specification carrying the codelists. |
dataset |
The dataset whose variables apply. |
from |
The source column. |
to |
The destination variable. |
direction |
Which way to map.
|
no_match |
Policy for values absent from the codelist.
|
trim |
Match after trimming whitespace. |
ignore_case |
Match case-insensitively. |
Which codelist applies. The codelist attached to to in the spec
wins (the natural direction for RACEN-style derivations, where the
numeric variable owns the code/decode pairs); when to declares none,
from's codelist is used. If neither variable references a codelist
the call aborts — there is nothing to map through.
Mismatched surfaces chain. A single call maps through ONE codelist,
so the winning codelist's terms (or decodes) must line up with the
from values — the CDISC *N convention guarantees this for
RACEN-style pairs, whose decodes are the character variable's
submission values. When the two codelists share no value surface (say
SEXN's decodes are "Female"/"Male" but SEX holds "F"/"M"),
the unmatched values hit the no_match policy; translate in two hops
instead — decode through from's codelist first, then to_code
through the destination's:
dm |>
decode_column(spec, "DM", from = "SEX", to = "SEXDECD") |>
decode_column(spec, "DM", from = "SEXDECD", to = "SEXN",
direction = "to_code")
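To make the two-hop translation concrete, here is a base-R sketch of what mapping through a codelist amounts to: a lookup with match(). The two lookup tables and the `map_through()` helper are hypothetical stand-ins for spec codelists, not artoo internals, and unmatched values simply become NA here instead of following a no_match policy.

```r
# Hypothetical codelists: SEX maps codes to decodes, SEXN maps numeric
# codes to the same decodes.
sex_cl  <- data.frame(term = c("F", "M"), decode = c("Female", "Male"),
                      stringsAsFactors = FALSE)
sexn_cl <- data.frame(term = c("1", "2"), decode = c("Female", "Male"),
                      stringsAsFactors = FALSE)

# Hypothetical helper: map values through one codelist in either direction.
map_through <- function(x, cl, direction = c("to_decode", "to_code")) {
  direction <- match.arg(direction)
  if (direction == "to_decode") {
    cl$decode[match(x, cl$term)]
  } else {
    cl$term[match(x, cl$decode)]
  }
}

sex <- c("F", "M", "M", "F")

# One hop through SEXN's codelist fails: "F"/"M" are not its decodes.
one_hop <- map_through(sex, sexn_cl, "to_code")

# Two hops line the value surfaces up: code -> decode -> numeric code.
decoded <- map_through(sex, sex_cl, "to_decode")
sexn    <- as.integer(map_through(decoded, sexn_cl, "to_code"))
print(sexn)
```

The single hop yields only missing values, which is the situation the no_match policy handles in decode_column().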
Soft matches are reported, never silent. Values that match only
after trimming whitespace (or case-folding, when ignore_case = TRUE)
still map, with an artoo_warning_codelist naming the variants. Because
check_spec() always compares exactly, clean the source before
submission.
The data frame x with the to column added (at the end) or
replaced (in place), ready for the next pipeline step.
Whole-dataset decode: apply_spec() with decode =.
Inspect the terms: spec_codelists(). Check membership:
check_spec().
spec <- artoo_spec(cdisc_sdtm_datasets, cdisc_sdtm_variables,
  codelists = cdisc_codelists)

# ---- Example 1: decode a coded variable into a display column ----
#
# SEX is coded against C66731; map the codes to their decodes in a new
# column, leaving the submission values untouched.
dm <- decode_column(cdisc_dm, spec, "DM", from = "SEX", to = "SEXDECD")
table(dm$SEX, dm$SEXDECD)

# ---- Example 2: the RACEN pattern, a coded numeric from its decode ----
#
# Declare SEXN as an integer variable owning a numeric codelist, then
# derive it from SEX's decoded values: to_code maps each decode to its
# submission code, and the spec dataType makes the result integer.
vars <- rbind(
  cdisc_sdtm_variables,
  data.frame(
    dataset = "DM", variable = "SEXN", label = "Sex (N)",
    data_type = "integer", length = 8L, order = NA_integer_,
    codelist_id = "SEXN"
  )
)
cls <- rbind(
  cdisc_codelists,
  data.frame(
    codelist_id = "SEXN", term = c("1", "2"),
    decode = c("F", "M"), order = 1:2
  )
)
spec_n <- artoo_spec(cdisc_sdtm_datasets, vars, codelists = cls)
dm_n <- decode_column(cdisc_dm, spec_n, "DM",
  from = "SEX", to = "SEXN", direction = "to_code"
)
str(dm_n$SEXN)
Pull the artoo_meta off a data frame produced by apply_spec() or read
back by any read_*() codec. The metadata travels as a single
Dataset-JSON string in the frame's metadata_json attribute; get_meta()
parses it to the S7 object, the form every codec writes from. This is the
read half of the lossless round-trip.
get_meta(x)
x |
A data frame carrying artoo metadata. Requirement: |
An <artoo_meta> with two properties. @dataset is a named
list of dataset-level attributes: itemGroupOID, name, label,
records, studyOID, metaDataVersionOID, encoding, and keys.
@columns is a named list with one entry per variable, each carrying
itemOID, name, label, dataType, targetDataType, length,
displayFormat, informat, keySequence, codelist,
significantDigits, and origin (absent values are NULL). Pass it
to set_meta() to re-attach, or index it directly
(meta@columns$AGE$label).
set_meta() for the write half; apply_spec() which stamps it.
# ---- Example 1: read metadata off a conformed dataset ----
#
# apply_spec() stamps the metadata; get_meta() reads it back as the S7
# object whose @columns holds one CDISC attribute set per variable.
spec <- artoo_spec(cdisc_adam_datasets, cdisc_adam_variables,
  codelists = cdisc_codelists)
adsl <- apply_spec(cdisc_adsl, spec, "ADSL")
meta <- get_meta(adsl)
meta@columns$STUDYID

# ---- Example 2: round-trip metadata across two frames ----
#
# The metadata is a portable object: read it off one frame and stamp it
# onto another with set_meta().
bare <- as.data.frame(adsl)
attr(bare, "metadata_json") <- NULL
restamped <- set_meta(bare, meta)
identical(get_meta(restamped)@columns, meta@columns)
Report whether an object is an artoo_checks control built by
artoo_checks(). Use it to guard a checks argument before threading it
into check_spec() or apply_spec().
is_artoo_checks(x)
x |
Object to test. |
A <logical(1)>: TRUE when x is an artoo_checks.
artoo_checks() to build one.
# ---- Example 1: confirm a control before reusing it ----
#
# is_artoo_checks() distinguishes a real control from a bare list of flags.
is_artoo_checks(artoo_checks())
is_artoo_checks(list(missing_variable = TRUE))
Report whether an object is an artoo_meta: the CDISC-shaped metadata a
conformed dataset carries through the artoo workflow (spec -> apply_spec ->
read_/write_). get_meta() returns one; this is the type guard before you
inspect its @dataset and @columns slots.
is_artoo_meta(x)
x |
Object to test. |
A <logical(1)>: TRUE when x is an artoo_meta, else FALSE.
get_meta() and set_meta() to read and attach metadata.
# ---- Example 1: guard before inspecting metadata ----
#
# get_meta() yields an artoo_meta; is_artoo_meta() confirms the type before
# you reach into its slots.
spec <- artoo_spec(cdisc_adam_datasets, cdisc_adam_variables,
  codelists = cdisc_codelists)
adsl <- apply_spec(cdisc_adsl, spec, "ADSL")
meta <- get_meta(adsl)
is_artoo_meta(meta)

# ---- Example 2: a bare data frame carries no meta object ----
#
# The raw frame itself is not an artoo_meta; only the object get_meta()
# returns is.
is_artoo_meta(cdisc_adsl)
Report whether an object is an artoo_spec: the validated CDISC
specification that drives the artoo workflow (spec -> apply_spec ->
read_/write_). artoo_spec() builds one; this is the type guard before you
pass it to apply_spec() or reach into it with the spec accessors.
is_artoo_spec(x)
x |
Object to test. |
A <logical(1)>: TRUE when x is an artoo_spec, else FALSE.
artoo_spec() to build one; is_artoo_meta() for the metadata
guard.
# ---- Example 1: guard a built specification ----
#
# artoo_spec() assembles and validates a spec; is_artoo_spec() confirms the
# type before you drive apply_spec() with it.
spec <- artoo_spec(cdisc_sdtm_datasets, cdisc_sdtm_variables,
  codelists = cdisc_codelists)
is_artoo_spec(spec)

# ---- Example 2: an ordinary object is not a spec ----
#
# Any non-artoo_spec value (a bare data frame, say) returns FALSE.
is_artoo_spec(cdisc_dm)
Inventory the dataset(s) a path contains, one row per dataset, dispatched
by extension through the same codec registry as read_dataset(). A SAS
XPORT library lists every member; a single-dataset file (.json,
.ndjson, .parquet, .rds) reports one row; a directory inventories
each dataset file it holds. The format-neutral companion to the
xpt-specific xpt_members().
members(path)
path |
A dataset file or a directory. |
One dataset per file, except XPORT. XPORT is the only multi-dataset
container artoo handles, so only an .xpt path can return more than one
row. Every other format is one dataset per file.
A directory is inventoried, not descended. Only the files directly in the directory are listed (no recursion); files whose extension no codec claims are skipped, and a directory with no dataset files returns an empty inventory rather than aborting. A dataset file that fails to read aborts with its codec's error, naming the file.
Note: counting records reads the file through its codec (the one
lossless reader), so members() is an honest count, not a header guess; for
a large directory it reads every dataset.
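The listing rule (direct children only, unclaimed extensions skipped) can be sketched with base R. The `known_ext` vector below is a hypothetical stand-in for the codec registry, filled with the extensions this page names; how artoo actually enumerates a directory is not shown in this manual.

```r
# Hypothetical set of claimed extensions, taken from the formats named above.
known_ext <- c("xpt", "json", "ndjson", "parquet", "rds")

dir <- tempfile("datasets")
dir.create(file.path(dir, "nested"), recursive = TRUE)
file.create(file.path(dir, c("dm.json", "ae.rds", "notes.txt")))
file.create(file.path(dir, "nested", "lb.json"))

# recursive = FALSE lists only the directory's own entries.
entries <- list.files(dir, full.names = TRUE, recursive = FALSE)
entries <- entries[!dir.exists(entries)]          # drop sub-directories
claimed <- entries[tolower(tools::file_ext(entries)) %in% known_ext]

print(basename(claimed))
```

The nested lb.json and the unclaimed notes.txt are both left out, and a directory with no matching files would give a zero-length result instead of an error.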
An <artoo_members> data frame, one row per dataset, with columns
file (source basename), member (dataset name), label, records
(row count), variables (column count), and format (the codec
format). Empty when a directory holds no dataset files. It is an ordinary
data frame underneath.
Members of one XPORT file: xpt_members().
Per-variable attributes: columns() for one dataset's variable pane.
dm <- apply_spec(cdisc_dm, sdtm_spec, "DM", conformance = "off")

# ---- Example 1: one dataset in a file ----
#
# A single-dataset format reports exactly one member.
p <- tempfile(fileext = ".json")
write_json(dm, p)
members(p)

# ---- Example 2: every dataset in a directory ----
#
# Point members() at a folder to inventory each dataset file it holds, one
# row per dataset, dispatched by extension.
dir <- tempfile("datasets")
dir.create(dir)
write_json(dm, file.path(dir, "dm.json"))
write_rds(dm, file.path(dir, "dm.rds"))
members(dir)
Read a clinical file back to a data frame, restoring its artoo_meta. The
codec is chosen from the file extension (or an explicit format), and the
metadata the file carries is re-attached, so a value written by
write_dataset() round-trips losslessly. This is the ingest end of the
I/O layer; the per-format wrappers like read_rds() call it.
read_dataset(path, format = NULL, col_select = NULL, n_max = Inf, ...)
path |
Source file path. |
format |
Force a codec instead of inferring from the extension.
|
col_select |
Variables to read. Note: an unknown name is a |
n_max |
Maximum records to read. |
... |
Codec-specific arguments passed through to the decoder (see
the per-format wrappers, e.g. |
A <data.frame> carrying artoo_meta when the file recorded it
(read it with get_meta()). A file whose payload is not a data frame is
an artoo_error_codec.
write_dataset() for the inverse; read_rds() for the
per-format wrapper.
spec <- artoo_spec(cdisc_adam_datasets, cdisc_adam_variables,
  codelists = cdisc_codelists)

# ---- Example 1: round-trip a dataset through rds ----
#
# Write a conformed dataset, then read it back; the metadata survives.
adsl <- apply_spec(cdisc_adsl, spec, "ADSL", conformance = "off")
path <- tempfile(fileext = ".rds")
write_dataset(adsl, path)
back <- read_dataset(path)
identical(get_meta(back)@columns, get_meta(adsl)@columns)

# ---- Example 2: the metadata names the dataset and row count ----
#
# The restored artoo_meta exposes the dataset-level attributes.
get_meta(back)@dataset$records
Read a CDISC Dataset-JSON v1.1 (.json) file back to a data frame,
restoring the full artoo_meta it carries and realizing SAS
date/datetime/time variables to R Date / POSIXct / hms::hms. Column
types are reconstructed from the recorded metadata, not guessed from the
JSON tokens, so the round-trip is lossless. The ingest end of the I/O layer;
a thin wrapper over read_dataset() with format = "json".
read_json(path, col_select = NULL, n_max = Inf, encoding = NULL)
path |
Source |
col_select |
Variables to read. Note: an unknown name is a |
n_max |
Maximum records to read. |
encoding |
Source charset of the file bytes. Tip: any SAS or IANA spelling listed by |
A <data.frame> carrying artoo_meta (read it with
get_meta()).
write_json() for the inverse; read_dataset() for the generic
dispatcher.
spec <- artoo_spec(cdisc_adam_datasets, cdisc_adam_variables,
  codelists = cdisc_codelists)

# ---- Example 1: round-trip a conformed dataset through Dataset-JSON ----
#
# The variable labels, types, and keys survive the round-trip.
adsl <- apply_spec(cdisc_adsl, spec, "ADSL", conformance = "off")
path <- tempfile(fileext = ".json")
write_json(adsl, path)
back <- read_json(path)
identical(get_meta(back)@columns, get_meta(adsl)@columns)

# ---- Example 2: the metadata names the dataset and row count ----
#
# The restored artoo_meta exposes the dataset-level attributes.
get_meta(back)@dataset$records
Read a newline-delimited CDISC Dataset-JSON v1.1 (.ndjson) file back to
a data frame, restoring the full artoo_meta from its metadata line and
realizing SAS date/datetime/time variables to R Date / POSIXct /
hms::hms. Rows are parsed in bounded slabs, and n_max stops the
line loop early, so a partial read of a huge file never parses the tail.
A thin wrapper over read_dataset() with format = "ndjson".
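The early-stop behaviour can be pictured with base R alone: read from an open connection in fixed-size slabs and stop once n_max rows are in. This sketch treats lines as opaque strings and does no JSON parsing; `read_head_lines()`, the slab size, and the sample file content are hypothetical and say nothing about how artoo's reader is implemented.

```r
# Hypothetical helper: read at most n_max data lines in slabs of `slab`.
read_head_lines <- function(path, n_max, slab = 2L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1L)      # first line holds the metadata
  rows <- character(0)
  while (length(rows) < n_max) {
    want  <- min(slab, n_max - length(rows))
    chunk <- readLines(con, n = want)
    if (length(chunk) == 0L) break      # end of file
    rows <- c(rows, chunk)
  }
  list(header = header, rows = rows)
}

path <- tempfile(fileext = ".ndjson")
writeLines(c('{"meta": true}', sprintf('["S%02d", %d]', 1:10, 1:10)), path)

out <- read_head_lines(path, n_max = 5)
length(out$rows)
```

Lines past the fifth row are never requested from the connection, which is the property that keeps a partial read of a large file cheap.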
read_ndjson(path, col_select = NULL, n_max = Inf, encoding = NULL)
path |
Source |
col_select |
Variables to read. Note: an unknown name is a |
n_max |
Maximum records to read. |
encoding |
Source charset of the file bytes. |
A <data.frame> carrying artoo_meta (read it with
get_meta()).
write_ndjson() for the inverse; read_json() for the
array-form file; read_dataset() for the generic dispatcher.
spec <- artoo_spec(cdisc_adam_datasets, cdisc_adam_variables,
  codelists = cdisc_codelists)

# ---- Example 1: round-trip a conformed dataset through NDJSON ----
#
# The variable labels, types, and keys survive the round-trip.
adsl <- apply_spec(cdisc_adsl, spec, "ADSL", conformance = "off")
path <- tempfile(fileext = ".ndjson")
write_ndjson(adsl, path)
back <- read_ndjson(path)
identical(get_meta(back)@columns, get_meta(adsl)@columns)

# ---- Example 2: a bounded partial read of the first rows ----
#
# n_max stops the line loop as soon as enough rows are in.
head_rows <- read_ndjson(path, n_max = 5)
get_meta(head_rows)@dataset$records
Read an Apache Parquet (.parquet) file back to a data frame, restoring the
artoo_meta from its metadata_json sidecar and realizing SAS
date/datetime/time variables to R Date / POSIXct / hms::hms. A
parquet written by another tool (with no artoo sidecar) reads back as a
bare frame. A thin wrapper over read_dataset() with format = "parquet".
Requires the lightweight nanoparquet package.
read_parquet(path, col_select = NULL, n_max = Inf, encoding = NULL)
path |
Source |
col_select |
Variables to read. Note: an unknown name is a |
n_max |
Maximum records to read. |
encoding |
Source charset of the string columns. Tip: any SAS or IANA spelling listed by |
A <data.frame> carrying artoo_meta when the file recorded it
(read it with get_meta()); otherwise a plain data frame.
write_parquet() for the inverse; read_dataset() for the
generic dispatcher.
spec <- artoo_spec(cdisc_adam_datasets, cdisc_adam_variables,
  codelists = cdisc_codelists)

# ---- Example 1: round-trip a conformed dataset through Parquet ----
#
# The variable labels, types, and keys survive the round-trip.
adsl <- apply_spec(cdisc_adsl, spec, "ADSL", conformance = "off")
path <- tempfile(fileext = ".parquet")
write_parquet(adsl, path)
back <- read_parquet(path)
get_meta(back)@columns$STUDYID$label

# ---- Example 2: the metadata names the dataset and row count ----
#
# The restored artoo_meta exposes the dataset-level attributes.
get_meta(back)@dataset$records
Read an R .rds file written by write_rds() (or any rds carrying a
metadata_json attribute) back to a data frame with its artoo_meta
restored. A thin wrapper over read_dataset() with format = "rds".
read_rds(path, col_select = NULL, n_max = Inf, encoding = NULL)
path |
Source |
col_select |
Variables to read. Note: an unknown name is a |
n_max |
Maximum records to read. |
encoding |
Source charset of the string columns. Tip: any SAS or IANA spelling listed by |
A <data.frame> carrying artoo_meta when the file recorded
it. An rds holding anything other than a data frame is an
artoo_error_codec; use readRDS() for arbitrary objects.
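An rds file can carry the metadata because saveRDS() serialises an object's attributes along with its data. A small base-R sketch of that mechanism follows; the attribute value is an invented placeholder string, not real Dataset-JSON metadata, and the final check only mimics the spirit of the data-frame requirement.

```r
df <- data.frame(USUBJID = c("01", "02"), AGE = c(63L, 64L))

# Placeholder payload; the real attribute holds Dataset-JSON metadata.
attr(df, "metadata_json") <- '{"name": "DM"}'

path <- tempfile(fileext = ".rds")
saveRDS(df, path)
back <- readRDS(path)

# The attribute rides along with the data frame.
identical(attr(back, "metadata_json"), attr(df, "metadata_json"))

# Guard in the spirit of the codec error: only data frames are datasets.
is.data.frame(back)
```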
write_rds() for the inverse; read_dataset() for the generic
dispatcher.
spec <- artoo_spec(
  cdisc_adam_datasets, cdisc_adam_variables,
  codelists = cdisc_codelists
)

# ---- Example 1: read a dataset written by write_rds() ----
#
# The restored frame carries the same metadata it was written with.
adsl <- apply_spec(cdisc_adsl, spec, "ADSL", conformance = "off")
path <- tempfile(fileext = ".rds")
write_rds(adsl, path)
back <- read_rds(path)
get_meta(back)@dataset$records

# ---- Example 2: a plain rds still reads as a data frame ----
#
# An rds without artoo metadata reads back as an ordinary frame.
bare <- tempfile(fileext = ".rds")
saveRDS(cdisc_dm, bare)
nrow(read_rds(bare))
Read a clinical-dataset specification into a validated artoo_spec,
dispatching on the file extension: artoo's native JSON (the inverse of
write_spec()), a Pinnacle 21 (P21) Excel workbook, or a native
Define-XML 2.0/2.1 document. The returned spec is the lingua franca the
rest of artoo applies and serialises.
read_spec(path, datasets = NULL, on_duplicate = c("error", "first", "warn"))
path |
The specification file to read. Requirement: reading a P21 workbook needs the |
datasets |
Read only these datasets. |
on_duplicate |
Policy for a variable defined more than once.
|
Three formats, one validator. A .json file is read as artoo native
JSON; a .xlsx / .xls file is read as a P21 workbook; a .xml file is
read as Define-XML 2.x. In every case the result is built through
artoo_spec(), so type canonicalisation and cross-slot integrity checks
are identical regardless of source.
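Extension dispatch of this kind is commonly written as a case-insensitive switch. The sketch below shows the pattern only: `spec_reader_for()` and its return labels are hypothetical, and whether artoo matches extensions case-insensitively is an assumption.

```r
# Hypothetical dispatcher: returns the name of the reader a path would use.
spec_reader_for <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    json = "native JSON",
    xlsx = ,
    xls  = "P21 workbook",
    xml  = "Define-XML",
    stop("unsupported specification extension: '", ext, "'")
  )
}

vapply(c("spec.json", "study_spec.XLSX", "define.xml"), spec_reader_for, "")
```

Whatever branch is taken, the parsed pieces would then go through one shared constructor, which is what keeps validation identical across sources.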
Define-XML ingestion (needs the xml2 package). ItemGroupDefs become
datasets (keys derived from the ItemRef KeySequence), ItemRef + ItemDef
pairs become variables, CodeLists become codelists
(def:ExtendedValue = "Yes" marks an extended term), MethodDefs /
CommentDefs / leaves become the supporting slots, and ValueListDefs land
in the value-level slot with their where-clauses rendered as readable
text.
Note: an ExternalCodeList (MedDRA, ISO-3166) names a dictionary,
not an enumerable membership list; it is dropped, and variables that
referenced it carry no codelist. Define-XML v1.0 (the 2005 model) is
refused with guidance.
P21 ingestion. Sheets are located by a tolerant alias match
(case-, space-, and spelling-variant insensitive). Datasets and
Variables are required; Codelists and ValueLevel are optional (the
latter becomes the spec's value-level slot). Every cell is read as
text, then the dataset and codelist foreign keys are forward-filled to
recover merged cells (which the Excel reader returns as NA on
continuation rows). A key that cannot be resolved aborts with
artoo_error_spec rather than being silently dropped.
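Forward-filling is easy to picture in base R. The sketch below carries the last non-missing key down a column, as a reader would need to do for merged cells; `fill_down()` and the sample sheet are hypothetical, and artoo's own implementation may differ.

```r
# Hypothetical helper: carry the last non-NA value forward.
fill_down <- function(x) {
  seen <- !is.na(x)
  idx  <- cumsum(seen)          # index of the latest non-NA value so far
  out  <- rep(NA_character_, length(x))
  out[idx > 0L] <- x[seen][idx[idx > 0L]]
  out
}

# A Variables sheet as an Excel reader might return it: the merged
# Dataset cell is populated on its first row only.
sheet <- data.frame(
  Dataset  = c("DM", NA, NA, "AE", NA),
  Variable = c("STUDYID", "USUBJID", "SEX", "STUDYID", "AETERM"),
  stringsAsFactors = FALSE
)
sheet$Dataset <- fill_down(sheet$Dataset)
print(sheet)

# A key with nothing above it cannot be resolved and stays NA.
fill_down(c(NA, "DM", NA))
```

A leading NA has no value to inherit, which corresponds to the unresolvable-key case that the text says aborts instead of being dropped.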
A validated artoo_spec. Inspect it with spec_datasets() /
spec_variables(), check it with validate_spec(), or persist it
with write_spec().
Inverse: write_spec() serialises a spec to native JSON.
Build / inspect: artoo_spec(), spec_datasets(),
spec_variables(), validate_spec().
# ---- Example 1: round-trip a spec through native JSON ----
#
# write_spec() and read_spec() are inverses on the JSON path: the spec
# that comes back is identical to the one written.
spec <- artoo_spec(cdisc_sdtm_datasets, cdisc_sdtm_variables,
  codelists = cdisc_codelists)
path <- tempfile(fileext = ".json")
write_spec(spec, path)
back <- read_spec(path)
identical(back, spec)

# ---- Example 2: scope the read to one dataset ----
#
# `datasets =` reads just the domain you are working on; validation is
# scoped with it, so a problem elsewhere in the workbook cannot block
# this dataset.
dm_spec <- read_spec(path, datasets = "DM")
spec_datasets(dm_spec)
head(spec_variables(dm_spec, "DM")[, c("variable", "label", "data_type")])
Read a SAS Transport (.xpt) file (v5 or v8) back to a data frame,
restoring the artoo_meta its NAMESTR records carry and realizing SAS
date/datetime/time variables to R Date / POSIXct / hms::hms. The
ingest end of the I/O layer; a thin wrapper over read_dataset() with
format = "xpt".
read_xpt(path, encoding = NULL, col_select = NULL, n_max = Inf, member = NULL)
path |
Source |
encoding |
Force a source charset. Tip: any SAS or IANA spelling listed by |
col_select |
Variables to read. Note: an unknown name is a |
n_max |
Maximum records to read. |
member |
Which member of a multi-member transport file to read.
Tip: |
The character encoding is auto-detected (UTF-8 if every character value is
valid UTF-8, else Windows-1252) and recorded on the returned
artoo_meta, so a later write_xpt() reproduces it; pass encoding to
override. XPORT cannot record its own encoding, so this detection is a
heuristic. See write_xpt() for what XPORT can and cannot preserve.
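The detection rule above can be approximated in a few lines of base R. This is a sketch of the stated heuristic only, not artoo's implementation: `guess_encoding()` is a hypothetical name, and it relies on base R's `validUTF8()`.

```r
# Hypothetical helper: "UTF-8" if every character value is valid UTF-8,
# otherwise fall back to Windows-1252.
guess_encoding <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  values <- unlist(df[is_chr], use.names = FALSE)
  values <- values[!is.na(values)]
  if (length(values) == 0L) return("UTF-8")
  if (all(validUTF8(values))) "UTF-8" else "windows-1252"
}

# "caf" followed by a lone 0xE9 byte: valid Windows-1252, invalid UTF-8
legacy_value <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

utf8_df   <- data.frame(TERM = c("abc", "caf\u00e9"), stringsAsFactors = FALSE)
legacy_df <- data.frame(TERM = c("abc", legacy_value), stringsAsFactors = FALSE)

print(guess_encoding(utf8_df))
print(guess_encoding(legacy_df))
```

Pure-ASCII data is valid under both charsets, which is one reason the result is only a heuristic and an explicit `encoding` is the safer choice when the source charset is known.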
A <data.frame> carrying artoo_meta (read it with
get_meta()).
xpt_members() to list a file's members; write_xpt() for the
inverse; read_dataset() for the generic dispatcher.
spec <- artoo_spec(
  cdisc_adam_datasets, cdisc_adam_variables,
  codelists = cdisc_codelists
)

# ---- Example 1: round-trip a conformed dataset through xpt ----
# Write ADSL, read it back; the variable labels and lengths survive.
adsl <- apply_spec(cdisc_adsl, spec, "ADSL", conformance = "off")
path <- tempfile(fileext = ".xpt")
write_xpt(adsl, path)
back <- read_xpt(path)
get_meta(back)@columns$STUDYID$label

# ---- Example 2: pick one member of a multi-member transport file ----
# Build a two-member file by concatenating two single-member files (every
# member section is 80-byte padded), then read one dataset out of it.
dm <- apply_spec(cdisc_dm, sdtm_spec, "DM", conformance = "off")
p_dm <- tempfile(fileext = ".xpt")
write_xpt(dm, p_dm)
multi <- tempfile(fileext = ".xpt")
writeBin(
  c(
    readBin(path, "raw", file.size(path)),
    readBin(p_dm, "raw", file.size(p_dm))[-(1:240)]
  ),
  multi
)
xpt_members(multi)$name
nrow(read_xpt(multi, member = "DM"))
Take the integer_fraction and integer_overflow findings a
check_spec() or check_study() run reports and return a new spec with
every offending variable retyped to "float", so a frame that the
original spec would refuse to coerce now conforms. This closes the loop on
the spec-side fix: inspect the findings, then apply them all at once
instead of editing the source workbook variable by variable. Persist the
result with write_spec().
repair_spec(spec, findings)
spec |
The specification to repair. |
findings |
A findings data frame. |
Scope. Only the two lossy-integer findings are repaired
(integer_fraction, integer_overflow) — both mean "the spec says
integer but the data is not", and "float" is the loss-free fix. Other
findings are ignored; this is not a general spec rewriter. When no
repairable finding is present the spec is returned unchanged, with a note.
Built on set_type(). Each (dataset, variable) pair is applied
through the same validated override set_type() uses, so the result is a
fully re-validated artoo_spec, never a hand-edited internal.
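The scoping rule can be illustrated on a toy findings frame. This is a sketch under stated assumptions: the column names `check`, `dataset`, and `variable` are assumed to match the findings layout, and `label_length` is a made-up check name standing in for any non-repairable finding.

```r
# Toy findings frame (assumed columns; `label_length` is invented)
findings <- data.frame(
  check    = c("integer_fraction", "integer_fraction",
               "label_length", "integer_overflow"),
  dataset  = c("ADSL", "ADSL", "ADSL", "ADAE"),
  variable = c("AGE", "AGE", "TRT01P", "AESEQ"),
  stringsAsFactors = FALSE
)

# Only the two lossy-integer findings are in scope
repairable <- c("integer_fraction", "integer_overflow")
in_scope <- findings$check %in% repairable

# One row per (dataset, variable) pair to retype to "float"
targets <- unique(findings[in_scope, c("dataset", "variable")])
rownames(targets) <- NULL
print(targets)
```

Each remaining pair corresponds to one `set_type(spec, dataset, VARIABLE = "float")` call; an empty `targets` frame is the "nothing to repair" case.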
A new <artoo_spec> with the flagged variables retyped to
"float", or spec unchanged when there is nothing to repair. The input
is never mutated.
Primitive: set_type() to retype a chosen variable directly.
Findings: check_spec() for one dataset, check_study() across a
study. Persist: write_spec().
# ---- Example 1: auto-repair an integer/fractional mismatch ----
# adam_spec types ADSL.AGE as integer. Give it fractional ages and
# check_spec() raises an integer_fraction error; repair_spec() flips AGE
# (and only AGE) to float, and the corrected spec then applies cleanly.
dat <- cdisc_adsl
dat$AGE <- dat$AGE + 0.5
findings <- check_spec(dat, adam_spec, "ADSL")
fixed <- repair_spec(adam_spec, findings)
spec_variables(fixed, "ADSL")$data_type[
  spec_variables(fixed, "ADSL")$variable == "AGE"
]

# ---- Example 2: nothing to repair is a no-op ----
# The bundled data conforms, so its findings carry no integer_fraction or
# integer_overflow rows and the spec is returned unchanged.
clean <- check_spec(cdisc_adsl, adam_spec, "ADSL")
identical(repair_spec(adam_spec, clean), adam_spec)
Stamp an artoo_meta onto a data frame as a single Dataset-JSON string in
its metadata_json attribute. Every write_*() codec reads that string
back with get_meta() and embeds it verbatim, so the metadata survives
the trip to any format. Use it to attach metadata to a bare frame before a
write, or to re-stamp after a tidyverse verb has dropped attributes.
set_meta(x, meta)
x |
The data frame to stamp. |
meta |
The metadata to attach. |
The data frame x, with its metadata_json attribute set. Pass
it on to a write_*() codec or back through get_meta().
get_meta() for the read half; apply_spec() which stamps it.
# ---- Example 1: re-stamp metadata a dplyr verb would drop ----
# Conform a dataset, capture its metadata, then re-attach after an
# attribute-dropping transform so the write stays lossless.
spec <- artoo_spec(
  cdisc_adam_datasets, cdisc_adam_variables,
  codelists = cdisc_codelists
)
adsl <- apply_spec(cdisc_adsl, spec, "ADSL")
meta <- get_meta(adsl)
trimmed <- head(as.data.frame(adsl), 5)
attr(trimmed, "metadata_json") <- NULL
set_meta(trimmed, meta)

# ---- Example 2: borrow metadata from a conformed dataset ----
# A writer with a raw frame can lift metadata off a conformed dataset and
# stamp it onto the bare frame (DM is SDTM, so its spec is sdtm-shaped).
sdtm <- artoo_spec(
  cdisc_sdtm_datasets, cdisc_sdtm_variables,
  codelists = cdisc_codelists
)
meta_dm <- get_meta(apply_spec(cdisc_dm, sdtm, "DM"))
dm <- set_meta(cdisc_dm, meta_dm)
is_artoo_meta(get_meta(dm))
Return a new artoo_spec with one or more variables retyped. This is the
supported, in-R way to correct a spec when the data disagrees with its
declared dataType (e.g. a variable typed integer whose extract holds
fractional values): fix it here rather than editing the source workbook,
then drive apply_spec() with the corrected spec. The spec is immutable,
so the original is never changed.
set_type(spec, dataset, ...)
spec |
The specification to amend. |
dataset |
The dataset whose variables to retype. |
... |
Named Tip: to undo an |
Per-dataset scope. A type is set only on the named dataset's row.
A variable that appears in several datasets keeps its other rows' types;
call set_type() once per dataset to change them all. Spec-wide
consequences (a variable typed inconsistently across datasets) are a
validate_spec() concern, not a construction error here.
Canonicalised, then validated. Each supplied type is mapped through
the closed CDISC dataType vocabulary, so "Float", "decimal", and
"text" all resolve; an unrecognised token aborts with
artoo_error_type. The rebuilt spec is re-validated, so an override that
would break the spec aborts with artoo_error_spec.
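The canonicalise-then-reject behaviour can be sketched with a lookup table. The canonical vocabulary below is the dataType list documented under spec_variables(); the alias table and the helper name `canonicalise_type()` are hypothetical, so artoo's real set of accepted spellings may differ.

```r
# Canonical CDISC dataTypes, as listed for the data_type column
canonical_types <- c("string", "integer", "decimal", "float", "double",
                     "boolean", "date", "datetime", "time", "URI")

# Hypothetical alias table, for illustration only
aliases <- c(text = "string", character = "string", numeric = "float")

canonicalise_type <- function(token) {
  key <- tolower(trimws(token))
  # case-insensitive match against the closed vocabulary
  hit <- match(key, tolower(canonical_types))
  if (!is.na(hit)) return(canonical_types[hit])
  if (key %in% names(aliases)) return(aliases[[key]])
  # artoo signals artoo_error_type here; a plain error stands in for it
  stop("unrecognised data type: ", token, call. = FALSE)
}

resolved <- vapply(c("Float", "decimal", "text"), canonicalise_type,
                   character(1))
print(resolved)
```

Mapping through a closed vocabulary means a typo fails loudly at the call site instead of surfacing later as a coercion problem in apply_spec().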
A new <artoo_spec> with the named variables retyped, ready for
apply_spec() or write_spec(). The input spec is unchanged.
Auto-repair: repair_spec() to apply every integer_fraction /
integer_overflow fix from a findings frame at once.
Workflow: apply_spec() to conform with the corrected spec;
write_spec() to persist it; check_spec() to find the mismatches.
# ---- Example 1: retype one variable the data disagrees with ----
# The bundled adam_spec types ADSL.AGE as integer. If an extract stored it
# with fractional values, retype it to float so apply_spec() coerces
# without loss. set_type() returns a new spec; the original is untouched.
fixed <- set_type(adam_spec, "ADSL", AGE = "float")
v <- spec_variables(fixed, "ADSL")
v[v$variable == "AGE", c("variable", "data_type")]

# ---- Example 2: retype several at once, original left intact ----
# Pass any number of variable = type pairs; canonical dataTypes and common
# spellings both resolve. The source spec is immutable, so adam_spec still
# reports AGE as its original type.
patched <- set_type(adam_spec, "ADSL", AGE = "decimal", TRTSDT = "date")
spec_variables(adam_spec, "ADSL")$data_type[
  spec_variables(adam_spec, "ADSL")$variable == "AGE"
]
Return the controlled-terminology terms and decodes a spec carries: one
codelist's terms when codelist_id names it, or the full codelists slot
when codelist_id is NULL. Use it to inspect the values a coded variable
is allowed to take before applying the spec. Mirrors the
spec_variables() filter pattern.
spec_codelists(spec, codelist_id = NULL)
spec |
The specification to read. |
codelist_id |
The codelist to return. Restriction: a non- |
A data frame of codelist terms, one row per term: every term
when codelist_id is NULL, else the named codelist's terms. Columns:
codelist_id — the codelist identifier variables reference.
term — the submission value (what conformed data carries).
decode — the human-readable decoded value.
order — display order within the codelist.
extended — TRUE marks an extensible codelist (sponsor terms
allowed; non-members downgrade to notes in check_spec()).
comment_id — reference into the comments slot.
spec_variables() for which variables reference a codelist.
# ---- Example 1: the terms behind a coded variable ----
# SEX is coded against C66731; spec_codelists() returns the terms and their
# decodes that apply_spec() will enforce or decode.
spec <- artoo_spec(
  cdisc_adam_datasets, cdisc_adam_variables,
  codelists = cdisc_codelists
)
spec_codelists(spec, "C66731")

# ---- Example 2: the whole codelists table ----
# Called with no id, it returns every term across every codelist.
head(spec_codelists(spec))
Return the comment definitions a specification carries. Datasets,
variables, value-level rows, and codelists reference these by
comment_id; validate_spec() checks the references resolve and each
referenced comment has a body.
spec_comments(spec)
spec |
The specification to read. |
A data frame of comment metadata, one row per comment, with all
four columns: comment_id, description, document_id, pages. Empty
when the spec defines no comments.
spec_methods(), spec_documents(), validate_spec().
# ---- Example 1: the comments a spec defines ----
# Build a spec with one comment and read it back.
spec <- artoo_spec(
  data.frame(dataset = "ADSL"),
  data.frame(dataset = "ADSL", variable = "AGE", data_type = "integer"),
  comments = data.frame(
    comment_id = "C.AGE",
    description = "Age in years at informed consent.",
    stringsAsFactors = FALSE
  )
)
spec_comments(spec)
List the datasets a specification defines. The result is the set of
names you pass as the dataset argument to the other accessors and to
apply_spec().
spec_datasets(spec)
spec |
The specification to read. |
A character vector of dataset names, de-duplicated and with
NAs dropped. Empty when the spec has no datasets.
spec_variables() for one dataset's variables; spec_keys()
for its sort keys.
# ---- Example 1: the datasets the pilot ADaM spec defines ----
# Build the spec from the bundled CDISC-pilot tables and list its
# datasets, the names you pass to the other accessors.
spec <- artoo_spec(
  cdisc_adam_datasets, cdisc_adam_variables,
  codelists = cdisc_codelists
)
spec_datasets(spec)
Return the document references a specification carries. Methods and
comments point to these by document_id.
spec_documents(spec)
spec |
The specification to read. |
A data frame of document metadata (document_id, title,
href), one row per document. Empty when the spec defines none.
spec_methods(), spec_comments(), validate_spec().
# ---- Example 1: the documents a spec defines ----
# Build a spec with one document reference and read it back.
spec <- artoo_spec(
  data.frame(dataset = "ADSL"),
  data.frame(dataset = "ADSL", variable = "AGE", data_type = "integer"),
  documents = data.frame(
    document_id = "SAP",
    title = "Statistical Analysis Plan",
    stringsAsFactors = FALSE
  )
)
spec_documents(spec)
Parse a dataset's sort keys into a character vector of variable names.
These keys drive the sort step of apply_spec() and the keySequence
written to each output format.
spec_keys(spec, dataset)
spec |
The specification to read. |
dataset |
The dataset whose keys to parse. Restriction: must name a dataset in the spec. |
A character vector of key variable names, split from the
dataset's keys cell (whitespace- or comma-separated). Empty when no
keys are declared.
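The splitting rule can be sketched with base R's `strsplit()`. `parse_keys()` is a hypothetical stand-in for the parsing described above, not the function artoo uses internally.

```r
# Hypothetical helper: split a keys cell on commas and/or whitespace.
parse_keys <- function(cell) {
  if (is.na(cell) || !nzchar(trimws(cell))) return(character(0))
  parts <- strsplit(trimws(cell), "[,[:space:]]+")[[1]]
  # drop empty pieces left by a stray leading separator
  parts[nzchar(parts)]
}

print(parse_keys("STUDYID USUBJID"))
print(parse_keys("STUDYID, USUBJID"))
print(parse_keys(NA_character_))
```

Both spellings yield the same ordered vector, and a missing or blank cell yields an empty one, matching the "Empty when no keys are declared" case.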
spec_datasets() for the dataset names; spec_variables() for
the variables a key must reference.
# ---- Example 1: parse a dataset's sort keys ----
# Declare DM's keys, then read them back as the ordered vector apply_spec()
# sorts by. (STUDYID and USUBJID are real DM variables in the demo data.)
ds <- cdisc_sdtm_datasets
ds$keys[ds$dataset == "DM"] <- "STUDYID USUBJID"
spec <- artoo_spec(ds, cdisc_sdtm_variables, codelists = cdisc_codelists)
spec_keys(spec, "DM")
Return the method definitions a specification carries. Variables and
value-level rows reference these by method_id; validate_spec() checks
that every reference resolves and that each referenced method is
complete (has a description).
spec_methods(spec)
spec |
The specification to read. |
A data frame of method metadata, one row per method, with all
eight columns: method_id, description, name, type,
expression_context, expression_code, document_id, pages. Empty
when the spec defines no methods.
spec_comments(), spec_documents(), validate_spec().
# ---- Example 1: the methods a spec defines ----
# Build a spec with one derivation method and read it back.
spec <- artoo_spec(
  data.frame(dataset = "ADSL"),
  data.frame(dataset = "ADSL", variable = "AGEGR1", data_type = "string"),
  methods = data.frame(
    method_id = "MT.AGEGR1",
    description = "Age group from AGE.",
    stringsAsFactors = FALSE
  )
)
spec_methods(spec)
Return the one CDISC standard the specification carries (e.g.
"ADaMIG 1.1", "SDTMIG 3.2"). An artoo_spec is single-standard by
construction (artoo_spec() aborts when its sources mix standards),
so this is always a scalar; it is NA when no source named one.
spec_standard(spec)
spec |
The specification to read. |
A <character(1)>: the standard, or NA when unspecified.
spec_study() for the rest of the study-level metadata;
artoo_spec() for how the standard is resolved.
# ---- Example 1: the standard set at construction ----
# Pass the standard explicitly (or let it resolve from a P21 workbook's
# Standard column / a Define-XML study block) and read it back.
spec <- artoo_spec(
  cdisc_adam_datasets, cdisc_adam_variables,
  codelists = cdisc_codelists,
  standard = "ADaMIG 1.1"
)
spec_standard(spec)

# ---- Example 2: unspecified resolves to NA ----
# A spec built without any standard source carries NA.
bare <- artoo_spec(
  cdisc_adam_datasets, cdisc_adam_variables,
  codelists = cdisc_codelists
)
spec_standard(bare)
Return the study-level metadata row, or a single field from it. The row
holds the canonical study fields (study_name, study_description,
protocol_name; artoo_spec() canonicalises every source spelling to
these) plus any other study-scoped fields a source provides. The CDISC
standard lives on its own property; see spec_standard().
spec_study(spec, field = NULL)
spec |
The specification to read. |
field |
Return one field instead of the row. Restriction: a non- |
The study data frame (one row), or the value of one field.
The canonical fields are study_name, study_description, and
protocol_name — every source spelling is canonicalised to these by
artoo_spec() — plus any other field the source carried verbatim
(e.g. define_version from a Define-XML read).
spec_datasets() for the datasets the study scopes;
spec_standard() for the spec's CDISC standard.
# ---- Example 1: the whole study row, then one field ----
# spec_study() with no field returns the study-level table; pass a field
# name to pull a single value such as the study name.
spec <- artoo_spec(
  cdisc_adam_datasets, cdisc_adam_variables,
  codelists = cdisc_codelists,
  study = data.frame(study_name = "CDISCPILOT01")
)
spec_study(spec)
spec_study(spec, "study_name")
Return the variable-metadata table for one dataset, or for the whole
spec. Each row carries the variable's CDISC data_type, label, length,
display format, key sequence, and codelist reference.
spec_variables(spec, dataset = NULL)
spec |
The specification to read. |
dataset |
Restrict to one dataset. Restriction: a non- |
A data frame of variable metadata, one row per variable, with
22 columns (absent ones are filled with typed NA at construction):
dataset, variable — the identifying pair (unique within a spec).
itemoid — the Define-XML / Dataset-JSON item OID, when recorded.
label — the variable label (<= 40 bytes for XPORT v5).
data_type — canonical CDISC dataType (string, integer,
decimal, float, double, boolean, date, datetime, time,
URI).
target_data_type — integer/decimal when a temporal variable
stores as a SAS-epoch numeric; NA means ISO 8601 text (--DTC).
length — declared storage length (bytes for character).
display_format, informat — SAS format / informat strings.
key_sequence — 1-based position in the dataset sort key.
order — column position in the dataset.
codelist_id, method_id, comment_id — references into the
codelists / methods / comments slots.
mandatory — logical obligation flag (NA is treated as
mandatory by check_spec()).
significant_digits — for decimal variables.
origin, source, predecessor, assigned_value, pages,
role — Define-XML provenance fields, carried as-is.
Filter or arrange it with ordinary base / dplyr verbs.
spec_datasets() for the dataset names; spec_codelists() for a
variable's controlled terminology.
spec <- artoo_spec(cdisc_sdtm_datasets, cdisc_sdtm_variables,
  codelists = cdisc_codelists)

# ---- Example 1: one dataset's variables ----
# Pass a dataset name to get just that domain's variables, already
# canonicalised to CDISC dataTypes.
head(spec_variables(spec, "DM")[, c("variable", "label", "data_type")])

# ---- Example 2: every variable across the spec ----
# Omit `dataset` to get the full table, e.g. to count variables per domain.
table(spec_variables(spec)$dataset)
Re-attach and reconcile an artoo_meta after a transformation that dropped
or reshaped it: the metadata's columns are narrowed and reordered to the
frame's current columns, the record count is refreshed, the keys are
recomputed, and a column the metadata does not describe gets an entry
synthesized from its class and attributes. The one-liner to run after a
dplyr (or base) pipeline, before handing the frame to a write_*() codec.
sync_meta(x, meta = NULL)
x |
The transformed data frame. |
meta |
The metadata to reconcile against. Requirement: when the transform dropped the attribute (base |
Why it exists. Base row subsetting (x[i, ]) drops the frame's
metadata_json attribute, and many tidyverse verbs rebuild the frame.
sync_meta() takes the last-known metadata (the frame's own attribute
when it survived, or an explicit meta) and makes it agree with the data
again, so the round trip stays lossless without hand-editing.
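The reconciliation idea can be shown on plain lists. This is a simplified sketch, not artoo's data model: `reconcile_columns()` is hypothetical, the metadata is reduced to a named list of `name` / `dataType` entries, and key recomputation is left out.

```r
# Hypothetical helper: make column metadata agree with the frame again.
reconcile_columns <- function(df, columns) {
  # Profile an undescribed column from its class (simplified mapping)
  synthesize <- function(nm) {
    col <- df[[nm]]
    if (is.integer(col)) {
      data_type <- "integer"
    } else if (is.numeric(col)) {
      data_type <- "float"
    } else {
      data_type <- "string"
    }
    list(name = nm, dataType = data_type)
  }
  # Walk the frame's columns so the result is narrowed and reordered to them
  kept <- lapply(names(df), function(nm) {
    if (is.null(columns[[nm]])) synthesize(nm) else columns[[nm]]
  })
  names(kept) <- names(df)
  list(records = nrow(df), columns = kept)
}

meta_columns <- list(
  STUDYID = list(name = "STUDYID", dataType = "string"),
  USUBJID = list(name = "USUBJID", dataType = "string"),
  AGE     = list(name = "AGE", dataType = "integer")
)

# A transformed frame: STUDYID dropped, columns reordered, AGEGR9 derived
df <- data.frame(AGE = c(70L, 81L), USUBJID = c("01", "02"),
                 stringsAsFactors = FALSE)
df$AGEGR9 <- ifelse(df$AGE > 65, ">65", "<=65")

synced <- reconcile_columns(df, meta_columns)
print(names(synced$columns))
print(synced$records)
```

Driving the loop from the frame's column names rather than from the metadata is what makes dropped columns disappear and new ones gain an entry in a single pass.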
A <data.frame>: x re-stamped with the reconciled
artoo_meta. Hand it to any write_*() codec.
Read / attach: get_meta(), set_meta().
Produce conformed frames: apply_spec().
spec <- artoo_spec(cdisc_adam_datasets, cdisc_adam_variables,
  codelists = cdisc_codelists)

# ---- Example 1: re-attach after an attribute-dropping subset ----
# Base subsetting drops the metadata; capture it first, transform, then
# sync. The metadata narrows to the kept columns and the new row count.
adsl <- apply_spec(cdisc_adsl, spec, "ADSL", conformance = "off")
meta <- get_meta(adsl)
elderly <- adsl[adsl$AGE > 65, c("STUDYID", "USUBJID", "AGE")]
synced <- sync_meta(elderly, meta)
get_meta(synced)@dataset$records

# ---- Example 2: a derived column gains a synthesized entry ----
# A new column the metadata does not describe is profiled from its class,
# so the frame still writes losslessly.
adsl$AGEGR9 <- ifelse(adsl$AGE > 65, ">65", "<=65")
synced2 <- sync_meta(adsl)
get_meta(synced2)@columns$AGEGR9$dataType
Run artoo's bundled, self-contained checks over an artoo_spec,
scoped to the dataset(s) you are working on, and return an
artoo_check that prints a sectioned report. Every finding is keyed to
an open rule in the shipped catalog (see spec_rules.json); the result
object keeps the findings as a plain data frame in @findings for
programmatic use.
validate_spec(spec, data = NULL, dataset = NULL, on_error = c("off", "warn", "abort"))
spec |
The specification to validate. |
data |
Optional input data for controlled-terminology checks.
|
dataset |
Restrict to one or more datasets. Restriction: each name must be a dataset in the spec. |
on_error |
What to do with an error-severity finding.
|
Dataset-scoped. A spec workbook carries many datasets. Pass
dataset to validate only the one(s) you are working on — the
methods, comments, and codelists those datasets reference are checked
for completeness, but unrelated datasets are not. dataset = NULL
validates the whole spec.
Collect, do not stop. Every finding is collected and returned;
validate_spec() does not abort on an error-severity finding unless
on_error = "abort".
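The collect-first, act-later policy can be sketched as follows. `act_on_findings()` is a hypothetical helper, and the assumption that error-severity rows carry `severity == "error"` is mine; the point is only that the findings are fully gathered before `on_error` is consulted.

```r
# Hypothetical helper: apply the on_error policy to collected findings.
act_on_findings <- function(findings, on_error = c("off", "warn", "abort")) {
  on_error <- match.arg(on_error)
  n_err <- sum(findings$severity == "error")
  if (n_err > 0L) {
    msg <- sprintf("%d error-severity finding(s)", n_err)
    if (on_error == "abort") stop(msg, call. = FALSE)
    if (on_error == "warn") warning(msg, call. = FALSE)
  }
  # With "off" (the default) the findings are simply returned
  invisible(findings)
}

# Toy findings frame (assumed `severity` values)
findings <- data.frame(
  severity = c("error", "note"),
  message  = c("key references a missing variable", "label is long"),
  stringsAsFactors = FALSE
)

returned <- act_on_findings(findings)  # default "off": no condition raised
outcome <- tryCatch(
  act_on_findings(findings, on_error = "abort"),
  error = function(e) conditionMessage(e)
)
print(outcome)
```

Because the default is the first choice, "off", a plain call never interrupts an interactive session; a pipeline gate opts in with `on_error = "abort"`.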
An artoo_check object. Its @findings data frame has columns
check, dimension, severity, dataset, variable, message.
Print it for the sectioned report.
artoo_spec() to build a spec; spec_methods() /
spec_comments() for the metadata checked.
# ---- Example 1: validate one dataset ----
# Build a spec from the bundled ADaM tables and validate it; the
# result prints a sectioned report and keeps the findings table.
spec <- artoo_spec(
  cdisc_adam_datasets, cdisc_adam_variables,
  codelists = cdisc_codelists
)
chk <- validate_spec(spec, dataset = "ADSL")
chk@findings

# ---- Example 2: gate on errors with on_error = "abort" ----
# Point a key at a missing variable, then validate with on_error = "abort"
# and catch the resulting error.
bad_ds <- cdisc_sdtm_datasets
bad_ds$keys <- "NOTAVAR"
bad <- artoo_spec(bad_ds, cdisc_sdtm_variables, codelists = cdisc_codelists)
tryCatch(
  validate_spec(bad, dataset = "DM", on_error = "abort"),
  artoo_error_validation = function(e) conditionMessage(e)[1]
)
Serialize a data frame to a clinical file format, preserving its
artoo_meta losslessly. The codec is chosen from the file extension (or an
explicit format), so one call covers xpt, Dataset-JSON, Parquet, and rds.
This is the emit end of the artoo workflow; the per-format wrappers like
write_rds() are thin sugar over it.
write_dataset(x, path, format = NULL, ...)
x |
The dataset to write. |
path |
Destination file path. |
format |
Force a codec instead of inferring from the extension.
|
... |
Codec-specific arguments passed through to the encoder (see
the per-format wrappers, e.g. |
The input x, invisibly, so a write can sit mid-pipeline.
Called for the side effect of writing path.
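Choosing the codec from the extension, with an explicit `format` taking precedence, can be sketched as below. `infer_format()` is a hypothetical helper; the four format names come from the description above, and treating the extension as identical to the format name (for example `.parquet`) is an assumption.

```r
# Hypothetical helper: resolve the codec name for a path.
infer_format <- function(path, format = NULL) {
  known <- c("xpt", "json", "parquet", "rds")
  # an explicit format wins; otherwise fall back to the file extension
  fmt <- if (is.null(format)) tools::file_ext(path) else format
  fmt <- tolower(fmt)
  if (!fmt %in% known) {
    stop("cannot infer a format for '", basename(path),
         "'; pass `format` explicitly", call. = FALSE)
  }
  fmt
}

print(infer_format("adsl.xpt"))
print(infer_format("adsl.data", format = "rds"))
```

An unrecognised extension with no `format` fails early in this sketch, which is the situation the second example below avoids by passing `format = "rds"`.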
read_dataset() for the inverse; write_rds() for the
per-format wrapper; artoo_formats() for what is available.
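The extension-based dispatch described above can be pictured with a small base-R sketch. `infer_format()` and its extension table are hypothetical illustrations, not artoo's internals; stripping a trailing `.gz` for every codec is also an assumption, and the authoritative list of codecs is whatever artoo_formats() reports.

```r
# Hypothetical helper: map a path to a codec name, unless `format` forces one.
infer_format <- function(path, format = NULL) {
  known <- c(xpt = "xpt", json = "json", ndjson = "ndjson",
             parquet = "parquet", rds = "rds")
  if (!is.null(format)) {
    # An explicit format always wins over the extension.
    return(match.arg(format, unname(known)))
  }
  # Drop a trailing .gz so "x.json.gz" resolves like "x.json" (assumption).
  ext <- tolower(tools::file_ext(sub("\\.gz$", "", path, ignore.case = TRUE)))
  if (!ext %in% names(known)) {
    stop("cannot infer a format from '", path, "'; pass `format` explicitly")
  }
  unname(known[ext])
}

infer_format("adsl.parquet")
infer_format("adsl.ndjson.gz")
infer_format("adsl.data", format = "rds")
```

An unknown extension with no `format` stops with an error rather than guessing, which mirrors the reason Example 2 passes `format = "rds"` for a `.data` path.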
    spec <- artoo_spec(cdisc_adam_datasets, cdisc_adam_variables,
      codelists = cdisc_codelists)

    # ---- Example 1: write a conformed dataset, inferring rds from the path ----
    #
    # apply_spec() attaches the metadata; write_dataset() carries it into the
    # file so a later read is lossless.
    adsl <- apply_spec(cdisc_adsl, spec, "ADSL", conformance = "off")
    path <- tempfile(fileext = ".rds")
    write_dataset(adsl, path)

    # ---- Example 2: force the format for an unconventional extension ----
    #
    # When the extension does not name the format, pass it explicitly.
    alt <- tempfile(fileext = ".data")
    write_dataset(adsl, alt, format = "rds")
Serialize a data frame to a CDISC Dataset-JSON v1.1 (.json) file.
Dataset-JSON is the native home of the artoo_meta shape: the file is
the metadata block plus a flat rows array. This is the emit end of the artoo
workflow (spec -> apply_spec -> write_json) and a thin wrapper over
write_dataset() with format = "json".
    write_json(
      x,
      path,
      on_invalid = c("error", "replace", "ignore"),
      created = NULL,
      strict = FALSE
    )
| x | The dataset to write. |
|---|---|
| path | Destination |
| on_invalid | Policy for values that are not valid UTF-8. |
| created | Creation timestamp. |
| strict | Suppress the Note: |
Full metadata, no loss. Unlike .xpt, a .json file records the
complete artoo_meta: keySequence, codelist, origin, targetDataType, and
significantDigits all survive. Dates, datetimes, and times are exchanged as
ISO 8601 strings, or as SAS-epoch numbers when their targetDataType is
"integer" (the ADaM numeric-date convention); decimal rides as a string
so exact precision is preserved. The file is always UTF-8 (RFC 8259 / CDISC
v1.1). NaN and infinite values are not valid CDISC numerics and abort the
write.
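The value encodings named above can be illustrated in base R. This is a minimal sketch that assumes "SAS-epoch numbers" follow the usual SAS convention (days since 1960-01-01 for dates, seconds since 1960-01-01T00:00:00 for datetimes); the variable names are illustrative and none of this is artoo's own code.

```r
sas_epoch <- as.Date("1960-01-01")

# A Date as an ISO 8601 string and as a SAS date number (days since the epoch).
d <- as.Date("2020-01-01")
iso <- format(d, "%Y-%m-%d")
sas_days <- as.numeric(d - sas_epoch)

# A datetime as a SAS datetime number (seconds since the epoch, in UTC).
dt <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC")
sas_secs <- as.numeric(
  difftime(dt, as.POSIXct("1960-01-01", tz = "UTC"), units = "secs")
)

# A decimal kept as text survives exactly; a trip through double does not
# preserve the written form (the trailing zeros are dropped).
decimal_txt <- "123.4500"
via_double <- as.character(as.numeric(decimal_txt))

# Non-finite values are the ones a write would have to reject; NA is not one.
vals <- c(1.5, NaN, Inf, NA)
non_finite <- is.nan(vals) | is.infinite(vals)

list(iso = iso, sas_days = sas_days, sas_secs = sas_secs,
     via_double = via_double, non_finite = non_finite)
```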
Streaming write, whole-file read. The writer streams the rows array
in bounded slabs (a .json.gz path gzips the stream transparently), but
read_json() must parse the whole array at once. For multi-million-row
datasets prefer the NDJSON variant (write_ndjson() / read_ndjson()),
which bounds memory in both directions.
The input x, invisibly, so a write can sit mid-pipeline.
read_json() for the inverse; write_dataset() for the generic
dispatcher.
    # ---- Example 1: write a conformed dataset as Dataset-JSON ----
    #
    # apply_spec() attaches the metadata; write_json() serializes the full
    # itemGroup plus the data rows.
    adsl <- apply_spec(cdisc_adsl, adam_spec, "ADSL", conformance = "off")
    path <- tempfile(fileext = ".json")
    write_json(adsl, path)

    # ---- Example 2: a frozen timestamp for reproducible bytes ----
    #
    # Fixing `created` makes two writes byte-identical; the columns() pane on
    # the written file shows the full metadata the file carries (DM is SDTM,
    # so it conforms against the bundled sdtm_spec).
    dm <- apply_spec(cdisc_dm, sdtm_spec, "DM", conformance = "off")
    path2 <- tempfile(fileext = ".json")
    write_json(dm, path2, created = as.POSIXct("2020-01-01", tz = "UTC"))
    columns(path2)
Serialize a data frame to the newline-delimited variant of CDISC
Dataset-JSON v1.1 (.ndjson): line 1 carries the complete metadata block,
every following line one row array. The streaming end of the artoo
workflow (spec -> apply_spec -> write_ndjson) for datasets too large for
the array-form .json file; a thin wrapper over write_dataset() with
format = "ndjson".
    write_ndjson(
      x,
      path,
      on_invalid = c("error", "replace", "ignore"),
      created = NULL,
      strict = FALSE
    )
| x | The dataset to write. |
|---|---|
| path | Destination |
| on_invalid | Policy for values that are not valid UTF-8. |
| created | Creation timestamp. |
| strict | Suppress the |
Bounded memory, both directions. The writer streams slabs of
per-column JSON literals and read_ndjson() parses slab-sized line
batches, so a multi-million-row dataset never materializes a whole rows
array the way the .json codec must. A .ndjson.gz path gzips the stream
transparently.
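The line-oriented layout is what makes bounded memory possible in both directions, and a base-R sketch shows the shape. The encoder `json_row()` is hypothetical and minimal (it assumes strings need no escaping), the metadata line is heavily abbreviated, and the slab size of 2 is arbitrary; artoo's real writer and reader are not shown here.

```r
df <- data.frame(
  USUBJID = c("01-001", "01-002", "01-003"),
  AGE = c(63, 71, 58),
  stringsAsFactors = FALSE
)

# Hypothetical encoder: one data-frame row to one JSON array literal.
json_row <- function(row) {
  cells <- vapply(row, function(v) {
    if (is.na(v)) "null" else if (is.character(v)) paste0('"', v, '"') else format(v)
  }, character(1))
  paste0("[", paste(cells, collapse = ","), "]")
}

path <- tempfile(fileext = ".ndjson.gz")
con <- gzfile(path, "w")                        # gzip the stream as it is written
writeLines('{"name":"ADSL","records":3}', con)  # line 1: metadata (abbreviated)
for (i in seq_len(nrow(df))) {
  writeLines(json_row(df[i, ]), con)            # every later line: one row array
}
close(con)

# Read back in slabs: only `n` lines are held in memory at a time.
con <- gzfile(path, "r")
meta_line <- readLines(con, n = 1)
n_rows <- 0
repeat {
  slab <- readLines(con, n = 2)
  if (length(slab) == 0) break
  n_rows <- n_rows + length(slab)
}
close(con)
n_rows
```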
The input x, invisibly, so a write can sit mid-pipeline.
read_ndjson() for the inverse; write_json() for the
array-form file; write_dataset() for the generic dispatcher.
    spec <- artoo_spec(cdisc_adam_datasets, cdisc_adam_variables,
      codelists = cdisc_codelists)

    # ---- Example 1: write a conformed dataset as NDJSON ----
    #
    # apply_spec() attaches the metadata; write_ndjson() streams the metadata
    # line and one row per line.
    adsl <- apply_spec(cdisc_adsl, spec, "ADSL", conformance = "off")
    path <- tempfile(fileext = ".ndjson")
    write_ndjson(adsl, path)
    readLines(path, n = 2)[2]

    # ---- Example 2: gzip the stream via the file extension ----
    #
    # A .ndjson.gz path compresses transparently; read_ndjson() inflates it.
    gz <- tempfile(fileext = ".ndjson.gz")
    write_ndjson(adsl, gz)
    nrow(read_ndjson(gz))
Serialize a data frame to an Apache Parquet (.parquet) file, storing the
data natively while preserving the full artoo_meta as a CDISC-shaped
sidecar in the file's key-value metadata. The emit end of the artoo
workflow (spec -> apply_spec -> write_parquet); a thin wrapper over
write_dataset() with format = "parquet". Requires the lightweight
nanoparquet package.
    write_parquet(
      x,
      path,
      encoding = NULL,
      on_invalid = c("error", "replace", "ignore"),
      compression = "snappy"
    )
| x | The dataset to write. |
|---|---|
| path | Destination |
| encoding | Source charset to record. Tip: any SAS or IANA spelling listed by |
| on_invalid | Policy for values that are not valid UTF-8. |
| compression | Column compression codec. |
Metadata where plain Parquet has none. A bare nanoparquet/arrow file
drops labels, formats, and codelists; write_parquet() embeds the complete
artoo_meta as a single Dataset-JSON-shaped string under the
metadata_json key, so read_parquet() restores every CDISC attribute.
The same string is what a .json file or an rds carries, so conversion
between any two formats stays lossless. A reader without artoo still opens
the data and can see the metadata_json block.
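The sidecar idea can be sketched directly against nanoparquet, assuming its `write_parquet()` accepts a named character vector through `metadata` and that `read_parquet_metadata()` exposes the file's key-value pairs under `file_meta_data$key_value_metadata`. The JSON payload below is a stand-in, not the real artoo_meta serialization, and the block is skipped when nanoparquet is not installed.

```r
if (requireNamespace("nanoparquet", quietly = TRUE)) {
  df <- data.frame(AGE = c(63, 71))
  path <- tempfile(fileext = ".parquet")

  # Store the data natively and attach one string under the metadata_json key.
  nanoparquet::write_parquet(
    df, path,
    metadata = c(metadata_json = '{"name":"ADSL"}')
  )

  # Any Parquet reader can list the key-value metadata; look the key up by
  # name, since the writer may add entries of its own.
  kv <- nanoparquet::read_parquet_metadata(path)$file_meta_data$key_value_metadata[[1]]
  sidecar <- kv$value[kv$key == "metadata_json"]
  sidecar
}
```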
The input x, invisibly, so a write can sit mid-pipeline.
read_parquet() for the inverse; write_dataset() for the
generic dispatcher.
    spec <- artoo_spec(cdisc_adam_datasets, cdisc_adam_variables,
      codelists = cdisc_codelists)

    # ---- Example 1: write a conformed dataset to Parquet ----
    #
    # apply_spec() attaches the metadata; write_parquet() stores the data
    # natively and the metadata as a CDISC-shaped sidecar.
    adsl <- apply_spec(cdisc_adsl, spec, "ADSL", conformance = "off")
    path <- tempfile(fileext = ".parquet")
    write_parquet(adsl, path)

    # ---- Example 2: round-trip and confirm the metadata survived ----
    #
    # Reading it back yields an identical artoo_meta.
    back <- read_parquet(path)
    identical(get_meta(back)@columns, get_meta(adsl)@columns)
Write a data frame to an R .rds file, preserving its artoo_meta. A thin
wrapper over write_dataset() with format = "rds"; the rds carries the
metadata both as live R attributes and as the language-agnostic
metadata_json string, so read_rds() restores it exactly.
    write_rds(x, path, encoding = NULL)
| x | The dataset to write. |
|---|---|
| path | Destination |
| encoding | Source charset to record. Tip: any SAS or IANA spelling listed by |
The input x, invisibly, so a write can sit mid-pipeline.
read_rds() for the inverse; write_dataset() for the generic
dispatcher.
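Carrying the metadata both as live attributes and as a JSON string relies on an rds file preserving R attributes exactly, which a base-R sketch can show. The attribute names (`label`, `metadata_json`) and the JSON payload are assumptions made for illustration; artoo's actual attribute layout may differ.

```r
df <- data.frame(AGE = c(63, 71))

# Hypothetical attribute names, standing in for the metadata artoo attaches.
attr(df$AGE, "label") <- "Age"
attr(df, "metadata_json") <-
  '{"name":"ADSL","columns":[{"name":"AGE","label":"Age"}]}'

path <- tempfile(fileext = ".rds")
saveRDS(df, path)
back <- readRDS(path)

identical(back, df)               # data and every attribute round-trip
attr(back$AGE, "label")           # live R attribute, usable immediately
attr(back, "metadata_json")       # language-agnostic copy of the same facts
```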
    spec <- artoo_spec(cdisc_adam_datasets, cdisc_adam_variables,
      codelists = cdisc_codelists)

    # ---- Example 1: write a conformed dataset to rds ----
    #
    # apply_spec() attaches the metadata; write_rds() carries it into the file.
    adsl <- apply_spec(cdisc_adsl, spec, "ADSL", conformance = "off")
    path <- tempfile(fileext = ".rds")
    write_rds(adsl, path)

    # ---- Example 2: round-trip and confirm the metadata survived ----
    #
    # Reading it back yields an identical artoo_meta.
    back <- read_rds(path)
    identical(get_meta(back)@columns, get_meta(adsl)@columns)
Serialise an artoo_spec, dispatching on the file extension: a .json
path writes artoo's native, lossless JSON; a .xlsx path writes a
Pinnacle 21 (P21) style Excel workbook. Both are inverses of
read_spec() on their format, which makes the spec converters free
compositions: read_spec("define.xml") |> write_spec("spec.xlsx") is a
Define-XML to P21 bridge in one line.
    write_spec(spec, path)
| spec | The specification to serialise. |
|---|---|
| path | Destination file. |
Native JSON is the lossless format. Each slot is written as an array
of row objects, with NA encoded as JSON null and numbers at full
precision, so read_spec() rebuilds an identical artoo_spec through
artoo_spec(). Object keys are emitted in a fixed order, so writing the
same spec twice yields byte-identical output.
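Why a fixed key order yields byte-identical output can be seen in a small base-R sketch. The encoder below is hypothetical and deliberately minimal (no string escaping, scalar values only); it is not artoo's serializer, and the key names are illustrative.

```r
# Hypothetical scalar encoder: NA becomes JSON null, numbers keep 17 digits.
to_json_value <- function(v) {
  if (is.na(v)) return("null")
  if (is.numeric(v)) return(format(v, digits = 17))
  paste0('"', v, '"')  # assumes no characters that need escaping
}

# Emit keys in `key_order`, regardless of the order they appear in `row`.
row_to_json <- function(row, key_order) {
  pairs <- vapply(key_order, function(k) {
    paste0('"', k, '":', to_json_value(row[[k]]))
  }, character(1))
  paste0("{", paste(pairs, collapse = ","), "}")
}

key_order <- c("dataset", "variable", "length")
a <- list(variable = "AGE", length = NA, dataset = "ADSL")
b <- list(dataset = "ADSL", variable = "AGE", length = NA)

row_to_json(a, key_order)
identical(row_to_json(a, key_order), row_to_json(b, key_order))
```

The two inputs list their fields in different orders, yet serialize to the same string because the emitter, not the input, decides the order.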
P21 xlsx is the interchange format. Sheets are emitted with the
headers the P21 reader recognises (Define, Datasets, Variables,
ValueLevel, Codelists, Methods, Comments, Documents; empty optional
sheets are omitted), foreign keys repeated on every row (no merged
cells), and the spec's spec_standard() as the Datasets sheet's
Standard column. The study row writes back as the Define sheet's
Attribute/Value pairs (StudyName, StudyDescription,
ProtocolName). The Data Type column is written in the Define-XML /
ODM vocabulary the workbook expects: a character variable is text
(not the Dataset-JSON string), and decimal / double collapse to
float, boolean / URI to text.
Columns the P21 vocabulary does not model are not lost: a foreign column carried on a slot is re-emitted verbatim under its own header, so an xlsx round-trip keeps user columns.
Note: fields with no P21 column (itemoid, target_data_type,
per-variable key_sequence) do not survive an xlsx round-trip;
persist to JSON when you need the spec back exactly. The Data Type
re-encoding is also non-injective: decimal, double, boolean, and
URI fold to float or text on a read-back. A Define-XML
partialDate / partialDatetime (and the other partial / incomplete
subtypes) is read as the base date / datetime – CDISC Dataset-JSON
v1.1 has no partial dataType – so it is written back as the base type.
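The non-injective Data Type folding described above can be made concrete with a lookup table. This sketch encodes only the collapses the text names (string to text, decimal and double to float, boolean and URI to text) and assumes every other type passes through unchanged; `fold_type()` is a hypothetical helper, not part of artoo.

```r
# Fold Dataset-JSON dataTypes onto the Define-XML / ODM vocabulary.
# Types not named in the table are assumed to pass through unchanged.
fold_type <- function(x) {
  folds <- c(string = "text", decimal = "float", double = "float",
             boolean = "text", URI = "text")
  out <- unname(folds[x])
  out[is.na(out)] <- x[is.na(out)]
  out
}

types <- c("string", "decimal", "double", "boolean", "URI", "integer", "date")
written <- fold_type(types)
data.frame(spec = types, xlsx = written)

# Several spec types share one workbook type, so a read-back cannot tell
# them apart: the mapping has no inverse.
split(types, written)
```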
The output path, invisibly. Read it back with read_spec().
Inverse: read_spec() reads native JSON, a P21 Excel workbook, or
Define-XML back into an artoo_spec.
Build / inspect: artoo_spec(), spec_datasets(),
spec_variables(), spec_standard().
    # ---- Example 1: persist a spec to JSON, then read it back ----
    #
    # Build a spec from the bundled CDISC-pilot tables, write it to a temp
    # JSON file, and confirm read_spec() reconstructs it intact.
    spec <- artoo_spec(
      cdisc_adam_datasets, cdisc_adam_variables,
      codelists = cdisc_codelists
    )
    path <- tempfile(fileext = ".json")
    write_spec(spec, path)
    identical(read_spec(path), spec)

    # ---- Example 2: the same spec as a P21 workbook ----
    #
    # The .xlsx path emits P21-shaped sheets; reading the workbook back
    # recovers the P21-representable surface (here: the dataset names).
    if (requireNamespace("writexl", quietly = TRUE)) {
      xlsx <- tempfile(fileext = ".xlsx")
      write_spec(spec, xlsx)
      spec_datasets(read_spec(xlsx))
    }
Serialize a data frame to a SAS Transport (.xpt) file in v5 (the FDA
submission standard) or v8 (extended names and labels), preserving the
artoo_meta a column can hold. The emit end of the artoo workflow
(spec -> apply_spec -> write_xpt); a thin wrapper over write_dataset()
with format = "xpt".
    write_xpt(
      x,
      path,
      version = 5,
      encoding = NULL,
      on_invalid = c("error", "replace", "ignore"),
      created = NULL
    )
| x | The dataset to write. |
|---|---|
| path | Destination |
| version | XPORT transport version. |
| encoding | Target charset. Tip: any SAS or IANA spelling listed by |
| on_invalid | Policy for values not representable in |
| created | Header timestamp. |
What XPORT can carry. An .xpt file's NAMESTR stores only variable
name, label, length, and SAS format. CDISC metadata beyond that
(keySequence, codelist, origin, targetDataType, ...) and the source
encoding are not representable in the bytes; they ride the in-session
artoo_meta and the sidecar in self-describing formats (Dataset-JSON,
Parquet, rds). XPORT also cannot distinguish an empty string from NA
(both store as blanks) and drops trailing spaces.
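The empty-string and trailing-space losses follow from blank-padded fixed-width storage, which a base-R sketch can reproduce. `encode_field()` and `decode_field()` are hypothetical stand-ins for a fixed-width character field, not artoo's XPORT codec.

```r
# Hypothetical fixed-width character field: pad with blanks on write,
# strip trailing blanks on read. NA has no marker of its own.
encode_field <- function(x, width) {
  x[is.na(x)] <- ""
  sprintf(paste0("%-", width, "s"), x)
}
decode_field <- function(bytes) sub(" +$", "", bytes)

values <- c("M", "", NA, "F  ")
stored <- encode_field(values, width = 4)
stored                  # every field is exactly 4 characters wide
decode_field(stored)    # "" and NA are now the same; "F  " lost its spaces
```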
Character ISO dates (--DTC) write as text. A character column whose
dataType is date/datetime/time with no numeric targetDataType is
the CDISC ISO 8601 text form — the SDTM --DTC convention — and stores
as a character variable, partial dates ("1951", "1951-12") included,
byte for byte. The SAS-numeric encoding (with DATE9.-style formats) is
used for columns that are R Date/POSIXct/hms or whose
metadata records targetDataType = "integer" (the ADaM numeric-date
convention). A character column under targetDataType = "integer"
aborts loudly — a partial date can never become a SAS numeric silently.
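The reason a character column cannot be forced into the numeric encoding is that a partial ISO 8601 date has no day number. A minimal base-R sketch, with a hypothetical completeness check that only recognises plain YYYY-MM-DD strings:

```r
# Only a complete YYYY-MM-DD string can become a SAS date number
# (days since 1960-01-01); partial dates have no numeric equivalent.
dtc <- c("1951-12-05", "1951-12", "1951")
complete <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", dtc)

sas_num <- rep(NA_real_, length(dtc))
sas_num[complete] <- as.numeric(as.Date(dtc[complete]) - as.Date("1960-01-01"))

data.frame(dtc, complete, sas_num)
```

Stored as text, all three values survive unchanged; converted to numbers, the two partial dates would silently become missing, which is the outcome the abort prevents.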
The input x, invisibly, so a write can sit mid-pipeline.
read_xpt() for the inverse; write_dataset() for the generic
dispatcher.
    spec <- artoo_spec(
      cdisc_adam_datasets, cdisc_adam_variables,
      codelists = cdisc_codelists
    )

    # ---- Example 1: write a conformed dataset as v5 (FDA standard) ----
    #
    # apply_spec() attaches the metadata; write_xpt() carries the label, length,
    # and SAS format for each variable into the transport file.
    adsl <- apply_spec(cdisc_adsl, spec, "ADSL", conformance = "off")
    path <- tempfile(fileext = ".xpt")
    write_xpt(adsl, path)

    # ---- Example 2: v8 for long names, with a frozen timestamp ----
    #
    # Version 8 keeps names over 8 characters; a fixed `created` makes the bytes
    # reproducible. Reading it back shows the labels, types, and record count
    # survived the transport. DM is SDTM, so it conforms against the bundled
    # sdtm_spec.
    dm <- apply_spec(cdisc_dm, sdtm_spec, "DM", conformance = "off")
    path8 <- tempfile(fileext = ".xpt")
    write_xpt(dm, path8, version = 8, created = as.POSIXct("2020-01-01", tz = "UTC"))
    get_meta(read_xpt(path8))@dataset$records
Report every dataset (member) that a SAS Transport (.xpt) file holds, with
its label, variable count, and row count. This is the survey step to run
before read_xpt() with member = picks one. A single-member file (the FDA
submission convention) returns one row.
    xpt_members(path)
| path | Source |
|---|---|
v5 has no recorded row count. A v8 member records its rows; a v5
member's count is derived from the byte span up to the next member (or end
of file) minus trailing padding, so an all-character v5 member whose last
row is entirely blank reports one row fewer (the documented v5 ambiguity,
see write_xpt()).
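The v5 ambiguity is easy to reproduce with a toy simulation. The sketch below assumes an all-character member whose fixed-width rows are concatenated and blank-padded to an 80-byte record boundary; `write_obs()` and `derive_nobs()` are hypothetical helpers that illustrate the arithmetic, not artoo's reader.

```r
# Simulate the observation bytes of an all-character member: fixed-width
# rows, concatenated, then blank-padded to an 80-byte record boundary.
write_obs <- function(rows, row_len) {
  body <- paste(sprintf(paste0("%-", row_len, "s"), rows), collapse = "")
  pad <- (80 - nchar(body) %% 80) %% 80
  paste0(body, strrep(" ", pad))
}

# Derive the count: drop trailing blanks, then count the rows that remain.
derive_nobs <- function(span, row_len) {
  used <- nchar(sub(" +$", "", span))
  ceiling(used / row_len)
}

full  <- write_obs(c("AAAA", "BBBB", "CCCC"), row_len = 16)
blank <- write_obs(c("AAAA", "BBBB", ""), row_len = 16)

derive_nobs(full, 16)   # three rows written, three derived
derive_nobs(blank, 16)  # three rows written, but the blank last row
                        # is indistinguishable from padding
```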
A <data.frame> with one row per member and columns member
(1-based index), name, label, nvars, and nobs. Pass member or
name to read_xpt().
read_xpt() with member = to read one of them.
    spec <- artoo_spec(
      cdisc_adam_datasets, cdisc_adam_variables,
      codelists = cdisc_codelists
    )

    # ---- Example 1: a single-member file reports one row ----
    #
    # The FDA convention is one dataset per transport file.
    dm <- apply_spec(cdisc_dm, sdtm_spec, "DM", conformance = "off")
    p <- tempfile(fileext = ".xpt")
    write_xpt(dm, p)
    xpt_members(p)

    # ---- Example 2: survey a multi-member file, then read one member ----
    #
    # Concatenate two single-member files into one library and list it.
    adsl <- apply_spec(cdisc_adsl, spec, "ADSL", conformance = "off")
    p2 <- tempfile(fileext = ".xpt")
    write_xpt(adsl, p2)
    multi <- tempfile(fileext = ".xpt")
    writeBin(
      c(
        readBin(p, "raw", file.size(p)),
        readBin(p2, "raw", file.size(p2))[-(1:240)]
      ),
      multi
    )
    xpt_members(multi)