--- title: "Getting started with dqcheckrGUI" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting started with dqcheckrGUI} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, eval = FALSE) ``` `dqcheckrGUI` is a point-and-click Shiny interface for [dqcheckr](https://github.com/mickmioduszewski/dqcheckr). It lets you configure dataset quality checks, run them against incoming file deliveries, and browse historical results — without writing any R code. The app runs entirely on your local machine. No internet connection, no server, and no cloud services are required. ## Prerequisites Install `dqcheckr` before installing `dqcheckrGUI`: ```{r} install.packages("dqcheckr") devtools::install_github("mickmioduszewski/dqcheckrGUI") ``` ## Launching the app ```{r} dqcheckrGUI::run_app() ``` The app opens in your default browser at `http://localhost:4321`. It reads and writes files relative to the working directory you launched from, so set that to your project folder before calling `run_app()`. The terminal window must remain open while the app is running — closing it stops the app. Alternatively, double-click `launch.command` (macOS), `launch.sh` (Linux), or `launch.bat` (Windows) from the app directory. ### First-run setup On first launch, if no `dqcheckr.yml` global config file exists in the working directory, the app opens directly to **Global Config** and prompts you to set two paths before doing anything else: - **Snapshot database** — where run summaries are stored (e.g. `data/snapshots.sqlite`) - **Report output directory** — where HTML reports are written (e.g. `reports/`) Click **Save global config** once these are set, then proceed to configure your first dataset. ### Example configuration files Example YAML configuration files using the Star Wars dataset are included with the package. Copy them to get started quickly: ```{r} file.copy( system.file("extdata/example_config", package = "dqcheckrGUI"), "config", recursive = TRUE ) ``` --- ## The interface The app has a fixed sidebar on the left and a main panel that changes based on your selection. ``` ┌──────────────────┬────────────────────────────────┐ │ SIDEBAR │ MAIN PANEL │ │ │ │ │ Datasets │ [content changes here] │ │ customers ✓ │ │ │ suppliers ⚠ │ │ │ [+ New] │ │ │ │ │ │ ▶ Run │ │ │ ⏱ History │ │ │ ⚙ Global Config │ │ └──────────────────┴────────────────────────────────┘ ``` ### Status badges Status badges appear in the sidebar, dataset panel, and history table: | Badge | Meaning | |-------|---------| | ✓ PASS | All checks passed | | ⚠ WARN | One or more warnings; no failures | | ✗ FAIL | One or more checks failed | | ● RUNNING | A check is currently in progress | | — not run | No runs recorded yet | --- ## Dataset panel Clicking a dataset name in the sidebar opens its **Dataset panel** in the main area: ``` customer_accounts ────────────────────────────────────────── Format: CSV Location: data/incoming/ Config: config/customer_accounts.yml [Edit config] [Run check ▶] Recent runs (last 5): 2026-05-30 20260530.csv ✓ PASS 0 failures 2026-05-23 20260523.csv ⚠ WARN 0 failures 2026-05-16 20260516.csv ✓ PASS 0 failures [View all in History →] Compare drift: ☐ run 1 ☐ run 2 [Compare ▶] ``` - **Edit config** opens the 8-step wizard pre-populated with the existing settings. - **Run check ▶** navigates to the Run panel with this dataset pre-selected. - Clicking any row in the recent runs table opens that run's HTML report in a new browser tab. - **Compare drift** — tick two checkboxes from the recent runs table and click **Compare ▶** to produce a drift report comparing those two deliveries. --- ## Setting up a dataset — the 8-step wizard Click **+ New dataset** in the sidebar to open the configuration wizard. Navigate between steps using the **Back** and **Next** buttons, or click the numbered breadcrumb bar at the top. All values are preserved as you move back and forth. If you navigate away from the wizard with unsaved changes, the app will warn you before discarding them. ### Step 1 — Dataset identity Enter a short machine-readable name for the dataset. Names must start with a letter and contain only letters, numbers, and underscores (e.g. `customer_accounts`). This name is used as the config filename and passed directly to `dqcheckr::run_dq_check()`. An optional free-text description can also be added. ### Step 2 — File location Choose how files are identified for each run: **Folder scan (recommended)** — point to a folder; the app picks the two most recently modified files automatically on each run. Use this for delivery processes that drop files into a fixed directory. A **Preview most recent file** button shows the names and sizes of the two most recent files in the selected folder: *"Current: 20260530.csv (2.4 MB) | Previous: 20260523.csv (2.3 MB)"*. **Explicit file paths** — list a current file and optionally a previous file by path. Useful when files are versioned by name. If no previous file is given, comparison checks (CP series) are skipped. ### Step 3 — Format and structure This step identifies the file's format and column layout. The top of the screen shows a raw text preview of the first 50 lines of the file — the full file is never loaded. **Auto-detection**: when a file is loaded, the app uses `readr` to detect the delimiter, encoding, quote character, and whether the first row is a header. Results are shown as an editable confirmation panel — you always make the final call, nothing is silently committed. #### CSV files Confirm or adjust: - **Format** (CSV or Fixed-Width) - **Delimiter** — comma, tab, semicolon, pipe, space, colon, or custom - **Encoding** — detected candidates are shown with confidence percentages - **Quote character** - **Header row** — whether the first row contains column names A parsed preview updates live as you change these settings. If there is no header row, a column-naming panel appears alongside the preview where you can enter names for each column. Names must be valid R identifiers; a suggestion is offered if you enter something invalid. #### Fixed-width (FWF) files When FWF is selected, a **visual ruler** activates above the text preview. - **Click** anywhere in the text to place a column boundary at that character position. - **Drag** a boundary line left or right to adjust it — it snaps to character boundaries. - **Double-click** a boundary line to remove it. - An **Auto-detect** button attempts automatic boundary detection using `readr::fwf_empty()`. Below the ruler, a table shows the resulting column definitions (start position, width, name, type). Edit column names and types here. A validation badge confirms whether the column widths account for the full record length: - ✅ green — widths match the line length exactly - ⚠ amber — widths cover fewer characters than the line length (trailing columns unaccounted for) - ✗ red — widths exceed the line length (Next is disabled until fixed) If the file has header rows to skip before the data begins, set **Header rows to skip**. Column names are pre-populated from the skipped header row if one is present. ### Step 4 — Column classification Review every detected column. For each column you can: - **Set a type override** — force a column to be treated as `character` even if it looks numeric. Use this for postcodes, phone numbers, account codes, BSB numbers, or any identifier that happens to contain only digits. Click the inferred type to open the override dropdown. - **Mark as key column** — that column will be checked for uniqueness (QC-12). - **Mark as expected** — columns not present in a delivery will raise a schema warning (SC-01). All columns are expected by default; use **Select none** to clear all and tick only the columns that must always be present. ### Step 5 — Column rules Each column is shown as a collapsible card. Expand a column to add optional per-column validation rules. **Standard rules** (always visible): | Rule | Applies to | Check | |------|------------|-------| | Allowed values | character columns | QC-09: flag any value not in the list | | Min value | numeric columns | QC-10: flag values below the minimum | | Max value | numeric columns | QC-10: flag values above the maximum | **Advanced rules** (click **Advanced ▼** to reveal): | Rule | Description | |------|-------------| | Regex pattern | Flag values that do not match the pattern (QC-13). Click **Test against sample** to verify the pattern against the actual file before saving. | | Max missing rate | Override the dataset-level threshold for this column only | | Max non-numeric rate | Override for numeric columns only | | Max missing rate change | Override the comparison threshold for this column | | Max mean shift | Override the mean shift threshold for this column | A regex syntax error disables **Next**; a pattern that fails against sample values shows a warning but does not block you from proceeding. ### Step 6 — Rule overrides Override the global default thresholds for this dataset only. Each field is pre-filled with the current global default. Only fields whose value differs from the global default are written to the dataset config; fields left at their default are omitted so that a later change to the global config is automatically inherited. | Threshold | Default | What it controls | |-----------|---------|-----------------| | Max missing rate | 0.05 | Flag a column if > 5 % of values are blank | | Max non-numeric rate | 0.01 | Flag a numeric column if > 1 % of values cannot be parsed | | Min row count | 0 (off) | Fail if the delivery has fewer rows than this | | Max row count change | 10 % | Warn if row count changes by > 10 % vs previous delivery | | Max mean shift | 20 % | Warn if a numeric column mean shifts by > 20 % | | Max missing change | 2 pp | Warn if missing rate changes by > 2 percentage points | | Max non-numeric change | 1 pp | Warn if non-numeric rate changes by > 1 percentage point | | Type inference threshold | 0.90 | A column is typed numeric if ≥ 90 % of values parse as numbers | Schema change flags control whether warnings are raised when columns are added, dropped, change type, or change order between deliveries. ### Step 7 — Custom checks (optional) Point to a plain R file that defines a `custom_checks(df)` function. The app validates the file immediately: it checks that the file exists, parses without syntax errors, and defines the expected function. A green badge confirms a valid file; a red badge shows the specific problem. Leave this field blank to skip custom checks. See `vignette("dqcheckr", package = "dqcheckr")` for the custom checks function signature and return value. ### Step 8 — Review and save A summary of all settings is shown alongside a YAML preview of the configuration that will be written. Click **Save config ✓** to write the file. A success notification confirms the path. The app then navigates to the dataset panel for the saved dataset. **For analysts who hand-edit YAML**: the app preserves any keys you have added to the YAML file outside the wizard. On the next edit, those keys appear in the Step 8 preview under `# preserved from original file` and are written back unchanged. The wizard never silently drops hand-added config keys. --- ## Running a quality check Click **▶ Run** in the sidebar (or **Run check ▶** from a dataset panel). 1. Select a dataset from the dropdown. 2. The app validates the config before enabling the run button — it checks that the config file is readable, the data folder or files exist, and the output paths are writable. Any problem is shown as a red badge with the specific issue. 3. Click **▶ Run check**. The check runs in a background process so the UI stays responsive. Progress is streamed to the log area in real time. When complete: ``` Status: ✓ PASS [Open report ↗] [View log] 0 failures 0 warnings 22 passed Report: reports/customer_accounts_20260531_143022.html ``` - **Open report ↗** opens the HTML report in the system browser. - **View log** shows the full run log in a dialog box. - **■ Stop** (shown during a run) cancels the background process. A confirmation is shown first, noting that the snapshot will not be written if the run has not completed. --- ## Browsing history Click **⏱ History** in the sidebar to see all past runs across all datasets. The table can be filtered by dataset name, status, or date using the filter row at the top of each column. Click **Load more** to page through older runs. **To open a report**: click the **Open** link in the Report column to open that run's HTML report in a new browser tab. **To compare two deliveries**: 1. Tick the checkbox next to exactly two runs **from the same dataset**. 2. Click **Compare drift ▶**. The drift report opens in a new tab, showing column-by-column changes between the two snapshots. The button is disabled with a tooltip if the selected rows are from different datasets. --- ## Global configuration Click **⚙ Global Config** to set defaults that apply to all datasets unless overridden at the dataset level (Step 6). **Infrastructure paths** — set the snapshot database path and report output directory. The snapshot database is created automatically on first run if it does not exist. If either path's parent directory does not exist, a red validation message is shown — create the directory on disk first, then save. **Default rule thresholds** — the same thresholds described in Step 6 above, applied globally. Dataset-level overrides take precedence. Click **Save global config** to write changes. ### Team use via a shared network drive Multiple users can share the same configuration and run history by pointing their `config_dir` and snapshot database paths at a shared network folder. Each user runs the app locally; the shared drive holds the YAML configs and the SQLite snapshot database. Set these paths in Global Config on each user's machine. SQLite handles concurrent reads safely; two simultaneous writes (unlikely in practice) are serialised by file locking. --- ## Learn more See `vignette("dqcheckr", package = "dqcheckr")` for a full description of every quality check (QC-01 to QC-14, SC-01/02, CP-01 to CP-08), per-column configuration options, custom checks, snapshot database schema, and the `compare_snapshots()` function used for drift reports.