---
title: "Split-and-Recombine Diagrams"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Split-and-Recombine Diagrams}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
## Use ragg for better font rendering if available
if (requireNamespace("ragg", quietly = TRUE)) {
  knitr::opts_chunk$set(
    dev = "ragg_png",
    fig.retina = 1,
    collapse = TRUE,
    comment = "#>",
    message = FALSE,
    warning = FALSE,
    out.width = "100%",
    dpi = 150
  )
} else {
  knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    message = FALSE,
    warning = FALSE,
    out.width = "100%",
    dpi = 150
  )
}

## Dynamic figure sizing (see enrollment_diagrams vignette for details)
.flow_dims <- new.env(parent = emptyenv())
.flow_dims$width <- NULL
.flow_dims$height <- NULL

knitr::opts_hooks$set(use_rec_dims = function(options) {
  if (isTRUE(options$use_rec_dims)) {
    if (!is.null(.flow_dims$width)) options$fig.width <- .flow_dims$width
    if (!is.null(.flow_dims$height)) options$fig.height <- .flow_dims$height
    .flow_dims$width <- NULL
    .flow_dims$height <- NULL
  }
  options
})

queue_flow <- function(flow, ...) {
  ## Measure on the same device family that renders the figures (ragg, set
  ## via dev = "ragg_png" above) so that non-default fonts---whose metrics
  ## differ between devices---are sized consistently and the canvas is not
  ## cropped. Falls back to recdims()'s default pdf measurement otherwise.
  md <- if (requireNamespace("ragg", quietly = TRUE)) {
    function() {
      tf <- tempfile(fileext = ".png")
      ragg::agg_png(tf, width = 10, height = 10, units = "in", res = 150)
      tf
    }
  } else NULL
  dims <- selecta::recdims(flow, ..., .measure_dev = md)
  .flow_dims$width <- dims["width"]
  .flow_dims$height <- dims["height"]
  invisible(flow)
}
```

Many clinical studies divide a population into strata for independent characterization, then recombine those strata into a single cohort for downstream analysis.
This split-and-recombine pattern arises in screening validation studies, exposure-stratified observational cohorts, and adaptive trial designs that classify patients before randomization. It represents a third flow topology in `selecta`, distinct from both permanent parallel arms (*e.g.*, CONSORT/STROBE/STARD diagrams) and top-level source convergence (*e.g.*, PRISMA/MOOSE diagrams).

In `selecta`, split-and-recombine diagrams are built around the following core functions:

| Function | Purpose |
|:---------|:--------|
| `enroll()` | Establish the starting cohort from data or a manual count |
| `stratify()` | Divide the flow into parallel strata |
| `combine()` | Merge strata back into a single downstream flow |

Thus, the split-and-recombine pipeline adheres to the following basic structure:

```{r, eval = FALSE}
enroll(...) |>
  exclude(...) |>
  stratify(labels, n, label) |>
  exclude(...) |>
  combine(label, sublabel) |>
  exclude(...) |>
  endpoint(label) |>
  flowchart()
```

where `stratify()` fans out to parallel arms and `combine()` converges arms back together. Between the split and the recombination, `exclude()` calls apply independently within each stratum, producing per-stratum side boxes.

> *n.b.:* To ensure correct font rendering and figure sizing, the diagrams below are displayed using a vignette-only helper function (`queue_flow()`) that applies recommended dimensions from `recdims()` via the [`ragg`](https://ragg.r-lib.org/) graphics device, with the standard output function applied afterwards (`flowchart()`). In practice, replace this `queue_flow()`/`flowchart()` workflow with a call to `flowsave()` for equivalent printed results:
>
> ```{r, eval = FALSE}
> flowsave(flow, "consort.pdf")
> flowsave(flow, "consort.png", dpi = 300)
> ```
>
> Using `flowsave()` ensures that the figure dimensions are always large enough to accommodate the diagram content, and it is the preferred method for saving flow diagram outputs in `selecta`.

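
The count arithmetic implied by the pipeline structure described earlier can be sketched in base R. This is only an illustration of the bookkeeping, not part of `selecta`: the helper name `recombined_n()` is hypothetical, and the counts are the ones used in Example 1 below.

```{r}
## Hypothetical helper (not a selecta function): count bookkeeping for one
## split-and-recombine span. `strata_n` and `excluded_n` are parallel
## vectors with one element per stratum.
recombined_n <- function(strata_n, excluded_n) {
  stopifnot(length(strata_n) == length(excluded_n),
            all(excluded_n <= strata_n))
  ## Exclusions apply independently within each stratum
  remaining <- strata_n - excluded_n
  ## Recombination sums what is left in every stratum
  list(per_stratum = remaining, combined = sum(remaining))
}

## Counts taken from Example 1
span <- recombined_n(
  strata_n   = c(Unscreened = 82, Screened = 76),
  excluded_n = c(44, 66)
)
span$per_stratum
span$combined
```

Checking the arithmetic this way before drawing a manual-entry diagram can catch counts that do not reconcile across the split.
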
---

# Preliminaries

```{r setup}
library(selecta)
library(data.table)

data(selectaex2)
```

---

# Manual Entry

## **Example 1:** Screening Validation Study

In screening-validation studies, a high-risk population is stratified by whether participants received an annual screening protocol. The strata are then characterized independently with respect to outcomes of interest, after which they are recombined into a single confirmed cohort for downstream analysis:

```{r}
example1 <- enroll(n = 160, label = "High-risk participants") |>
  phase("Enrollment") |>
  exclude("Concurrent enrollment in another study", n = 2,
          included_label = "Total cohort") |>
  phase("Screening Status") |>
  stratify(
    labels = c("Unscreened", "Screened"),
    n = c(82, 76),
    label = "Annual screening status"
  ) |>
  exclude("Without confirmed outcome", n = c(44, 66)) |>
  combine("Outcome cohort",
          sublabel = "Participants with confirmed outcome") |>
  phase("Outcome Verification") |>
  exclude("Without available adjudication", n = 7) |>
  exclude("Without available imaging", n = 23) |>
  endpoint("Participants with available imaging")
```

```{r, echo = FALSE}
queue_flow(example1)
```

```{r, use_rec_dims = TRUE, echo = TRUE}
flowchart(example1)
```

The `stratify()` function creates the downward split, and `combine()` draws converging arrows from each stratum back to a single node. Between the two, `exclude()` is called once with a vector of per-stratum counts (`n = c(44, 66)`), producing one side box per column. In `combine()`, the `sublabel` parameter writes a descriptive second line below the main heading inside the recombined node, and the flow continues as a single stream with standard exclusion steps.

## **Example 2:** Per-Stratum Exclusion Reasons

When per-stratum attrition has distinct causes, the `reasons` argument accepts a list of named vectors (one per stratum).
Reason ordering is harmonized across strata using global totals, consistent with the behavior of per-arm reasons after `allocate()`:

```{r}
example2 <- enroll(n = 5000, label = "Patients in registry") |>
  phase("Enrollment") |>
  exclude("Ineligible", n = 800,
          reasons = c("Age < 18" = 200,
                      "Prior diagnosis" = 350,
                      "Missing baseline data" = 250),
          included_label = "Eligible cohort") |>
  phase("Exposure Classification") |>
  stratify(
    labels = c("Statin users", "Non-users"),
    n = c(1800, 2400),
    label = "Classified by statin exposure"
  ) |>
  exclude("Lost to follow-up", n = c(120, 180),
          reasons = list(
            c("Moved" = 50, "Withdrew consent" = 30,
              "Deceased" = 20, "Inconsistent usage" = 20),
            c("Moved" = 80, "Withdrew consent" = 60, "Deceased" = 40)
          )) |>
  combine("Analysis cohort",
          sublabel = "Patients with complete follow-up") |>
  phase("Analysis") |>
  endpoint("Included in primary analysis")
```

```{r, echo = FALSE}
queue_flow(example2)
```

```{r, use_rec_dims = TRUE, echo = TRUE}
flowchart(example2, count_first = TRUE)
```

---

# Data-Driven Flow

In data mode, `stratify()` accepts a column name rather than explicit labels and counts. The `combine()` function recombines the per-stratum datasets internally, and `cohort()` returns the unified post-recombination dataset.

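
As a conceptual analogy only, the data-mode split, per-stratum filter, and recombination can be pictured with base R's `split()`, `lapply()`, and `rbind()`. This sketch does not reproduce how `selecta` implements these steps internally, and the toy data frame and its column names (`arm`, `discontinued`) are hypothetical.

```{r}
## Conceptual analogy in base R; not selecta's internal implementation.
## Toy data with hypothetical column names.
toy <- data.frame(
  id           = 1:8,
  arm          = rep(c("A", "B"), each = 4),
  discontinued = c(FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

## Split on a column: one data frame per stratum
by_arm <- split(toy, toy$arm)

## Apply the same exclusion criterion within each stratum
kept <- lapply(by_arm, function(d) d[!d$discontinued, , drop = FALSE])

## Recombine the per-stratum remainders into one dataset
combined <- do.call(rbind, kept)
rownames(combined) <- NULL

vapply(kept, nrow, integer(1L))
nrow(combined)
```

The per-stratum row counts correspond to what the diagram reports in each column, and the recombined row count corresponds to the node drawn at the combine step.
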
## **Example 3:** Data-Driven Split and Recombine

The following example uses the `selectaex2` dataset, stratifying by treatment assignment and recombining after documenting per-arm discontinuation:

```{r}
example3 <- enroll(selectaex2, id = "patient_id") |>
  phase("Screening") |>
  exclude("Duplicate records",
          criterion = is_duplicate == TRUE,
          included_label = "Unique records") |>
  exclude("Failed eligibility",
          criterion = eligible == FALSE,
          reasons = "exclusion_reason",
          included_label = "Eligible cohort") |>
  phase("Allocation") |>
  stratify("treatment", label = "Treatment assignment") |>
  phase("Follow-up") |>
  exclude("Discontinued",
          criterion = discontinued == TRUE,
          reasons = "discontinuation_reason") |>
  combine("Completers") |>
  phase("Analysis") |>
  endpoint("Analysis cohort")
```

```{r, echo = FALSE}
queue_flow(example3)
```

```{r, use_rec_dims = TRUE, echo = TRUE}
flowchart(example3)
```

---

# Cohort Extraction

The `cohort()` and `cohorts()` functions work with split-and-recombine flows. After a `combine()` step, `cohort()` returns the unified recombined dataset rather than a per-arm list:

```{r}
final <- cohort(example3)
dim(final)
```

The `cohorts()` function captures snapshots at every stage, including the combine point. Each snapshot records the remaining and excluded datasets:

```{r}
stages <- cohorts(example3)
names(stages)
```

The combine snapshot contains the recombined dataset:

```{r}
nrow(stages[["Completers"]]$included)
```

Per-arm snapshots from the stratified region are available at the exclusion step labels. These contain named lists (one element per arm) rather than single datasets:

```{r}
disc <- stages[["Discontinued"]]
vapply(disc$included, nrow, integer(1L))
vapply(disc$excluded, nrow, integer(1L))
```

This supports a complete analytical workflow: define the enrollment flow, render the diagram, and extract any intermediate or final cohort for downstream analysis.

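
One sanity check that may be useful here is confirming that the rows kept in each arm add up to the rows in the recombined dataset. The sketch below uses a hypothetical helper, `check_recombination()`, which is not part of `selecta`, and runs it on mock objects that only mirror the shapes described above (a named list of per-arm data frames and a single recombined data frame).

```{r}
## Hypothetical helper (not part of selecta): TRUE when the per-arm row
## counts sum to the row count of the recombined dataset.
check_recombination <- function(per_arm, combined) {
  stopifnot(is.list(per_arm), is.data.frame(combined))
  arm_rows <- vapply(per_arm, nrow, integer(1L))
  sum(arm_rows) == nrow(combined)
}

## Mock objects mirroring the snapshot shapes described above
mock_per_arm <- list(
  A = data.frame(id = c(1L, 3L, 4L)),
  B = data.frame(id = c(7L, 8L))
)
mock_combined <- do.call(rbind, mock_per_arm)

check_recombination(mock_per_arm, mock_combined)
```

With a real flow, the same helper could presumably be applied as `check_recombination(disc$included, stages[["Completers"]]$included)`, on the assumption that no step removes records between the last per-arm exclusion and the `combine()` call.
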
---

# Re-Splitting after Recombination

A flow may be split, recombined, and then split again. This arises in adaptive designs where patients are first characterized by a baseline variable, recombined, and then randomized. The `stratify()` function permits a second split after `combine()` has closed the first:

## **Example 4:** Risk Stratification Followed by Randomization

```{r}
example4 <- enroll(n = 2000, label = "Screened") |>
  phase("Screening") |>
  exclude("Ineligible", n = 400,
          reasons = c("No consent" = 180,
                      "Prior treatment" = 120,
                      "ECOG >= 3" = 100)) |>
  phase("Risk Stratification") |>
  stratify(
    labels = c("High risk", "Low risk"),
    n = c(700, 900),
    label = "Risk classification"
  ) |>
  exclude("Declined participation", n = c(50, 80)) |>
  combine("Eligible cohort") |>
  phase("Allocation") |>
  allocate(labels = c("Intervention", "Control"), n = c(735, 735)) |>
  phase("Follow-up") |>
  exclude("Lost to follow-up", n = c(30, 35),
          reasons = list(
            c("Withdrew consent" = 18, "Relocated" = 12),
            c("Withdrew consent" = 20, "Relocated" = 15)
          )) |>
  phase("Analysis") |>
  endpoint("Analyzed")
```

```{r, echo = FALSE}
queue_flow(example4)
```

```{r, use_rec_dims = TRUE, echo = TRUE}
flowchart(example4)
```

The layout engine scopes each split-combine span independently, so the converge arrows from the first split do not interfere with the second split's arm positions. The second split may use either `stratify()` (for observational grouping) or `allocate()` (for randomization); both are permitted after a prior `combine()`.

---

# Design Considerations

The split-and-recombine topology works well for two-stratum splits with or without per-stratum side boxes. For three or more strata, flowcharts will similarly render without collisions or overlap, but any per-stratum side boxes may produce asymmetry due to the geometric limitations of the split-and-recombine flow.
In such cases, consider simplifying the per-stratum detail or using external graphics editing software for full control over the layout.

---

# Saving to File

The `flowsave()` function saves the diagram to a file (PDF, PNG, SVG, or TIFF) with auto-computed dimensions:

```{r, eval = FALSE}
flowsave(example1, "screening_validation.pdf")
flowsave(example1, "screening_validation.png", dpi = 300)
```

Explicit dimensions override the automatic calculation:

```{r, eval = FALSE}
flowsave(example1, "screening_validation.pdf", width = 10, height = 12)
```

All visual parameters accepted by `flowchart()` are also accepted by `flowsave()`:

```{r, eval = FALSE}
flowsave(example1, "screening_validation_cf.pdf",
         count_first = TRUE, cex = 1.0, cex_side = 0.8)
```

---

# Further Reading

- [Enrollment Diagrams](enrollment_diagrams.html): CONSORT, STROBE, and STARD diagrams with permanent parallel arms
- [Systematic Reviews](systematic_reviews.html): PRISMA and MOOSE diagrams with top-level source convergence
- [Advanced Workflows](advanced_workflows.html): Factorial (nested-split) designs and hierarchical exclusion reasons
- [Graphviz Export](graphviz_export.html): DOT output for Graphviz/DiagrammeR rendering