| Title: | Data Quality Checks and Statistical Assumption Testing for Agricultural Experiments |
|---|---|
| Description: | Provides a comprehensive pipeline for data quality checks and statistical assumption diagnostics in agricultural experimental data. Functions cover outlier detection using Interquartile Range (IQR) fence, Z-score, modified Z-score (Hampel identifier), Grubbs test and Dixon Q-test with consensus flagging; missing data pattern analysis and mechanism classification (Missing Completely At Random/Missing At Random/Missing Not At Random (MCAR/MAR/MNAR)) via Little's test; normality testing using Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov, Lilliefors, Pearson chi-square and Jarque-Bera tests; homogeneity of variance via Bartlett, Levene and Fligner-Killeen tests; independence of errors via Durbin-Watson, Breusch-Godfrey and Wald-Wolfowitz runs tests; experimental design validation for Completely Randomised Design (CRD), Randomised Complete Block Design (RCBD), Latin Square Design (LSD) and factorial designs; qualitative variable consistency checks; and automated HyperText Markup Language (HTML) report generation. Designed to align with Findable, Accessible, Interoperable and Reusable (FAIR) data principles. Methods follow Gomez and Gomez (1984, ISBN:978-0471870920) and Montgomery (2017, ISBN:978-1119492443). |
| Authors: | Sadikul Islam [aut, cre] (ORCID: <https://orcid.org/0000-0003-2924-7122>) |
| Maintainer: | Sadikul Islam <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.1.3 |
| Built: | 2026-05-21 09:58:05 UTC |
| Source: | https://github.com/cran/agriDQ |
A simulated Randomised Complete Block Design (RCBD) dataset for a wheat variety trial with 4 treatments and 5 blocks (20 plots total). The dataset contains one intentional high outlier (plot P03, yield = 8.9 t/ha) and one missing value (plot P17) for demonstration of the agriDQ quality-check functions.
agri_trial
A data frame with 20 rows and 7 variables:
plot_id: Character. Unique plot identifier (P01–P20).
block: Factor. Block identifier (B1–B5).
treatment: Factor. Treatment/variety label (T1–T4).
variety: Character. Wheat variety name corresponding to each treatment (HD2967, GW322, PBW343, WH1105).
yield: Numeric. Grain yield in tonnes per hectare (t/ha). Contains one outlier (~8.9 t/ha) and one NA.
plant_height: Numeric. Mean plant height in cm.
tillers: Numeric. Mean effective tiller count per plant.
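Data were generated with set.seed(2025) using an additive RCBD model:
$$y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij}$$
where $\mu$ is the grand mean (t/ha), the treatment effects $\tau_i$ are
T1 = 0, T2 = +0.4, T3 = +0.8, T4 = 0.2 t/ha, $\beta_j$ are the block
effects, and $\varepsilon_{ij}$ are the random errors.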
Two observations were manually perturbed: plot P03 set to 8.9 t/ha
(high outlier) and plot P17 set to NA (missing plot).
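A minimal base R sketch of how data following this additive model could be simulated. Only the treatment effects and the two manual perturbations come from the description above; the grand mean (4.5), the block-effect standard deviation (0.3), and the error standard deviation (0.25) are illustrative assumptions, so the result will not reproduce the shipped agri_trial dataset.

```r
set.seed(2025)

# Layout: 4 treatments x 5 blocks = 20 plots
sim <- expand.grid(treatment = paste0("T", 1:4),
                   block = paste0("B", 1:5),
                   KEEP.OUT.ATTRS = FALSE,
                   stringsAsFactors = TRUE)
sim$plot_id <- sprintf("P%02d", seq_len(nrow(sim)))

mu   <- 4.5                                           # assumed grand mean (t/ha)
tau  <- c(T1 = 0, T2 = 0.4, T3 = 0.8, T4 = 0.2)       # treatment effects as listed
beta <- setNames(rnorm(5, 0, 0.3), paste0("B", 1:5))  # assumed block effects
eps  <- rnorm(nrow(sim), 0, 0.25)                     # assumed error SD

# Additive model: grand mean + treatment effect + block effect + error
sim$yield <- unname(mu + tau[as.character(sim$treatment)] +
                      beta[as.character(sim$block)] + eps)

# Manual perturbations described in the text
sim$yield[sim$plot_id == "P03"] <- 8.9   # high outlier
sim$yield[sim$plot_id == "P17"] <- NA    # missing plot
```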
Simulated data generated for package demonstration purposes.
data(agri_trial)
str(agri_trial)
summary(agri_trial)
Checks the structural integrity of agricultural experimental data against a declared experimental design. Verifies treatment completeness, replication balance, block structure, missing treatment combinations, degrees of freedom for error, and minimum sample size.
check_design(
  df,
  treatment = NULL,
  block = NULL,
  response = NULL,
  design = c("RCBD", "CRD", "LSD", "factorial"),
  factors = NULL,
  expected_reps = NULL,
  alpha = 0.05
)
df |
A data frame containing the experimental data. |
treatment |
Character. Name of the treatment factor column. |
block |
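Character or NULL. Name of the block factor column. |
response |
Character. Name of the numeric response column. |
design |
Character. One of "RCBD", "CRD", "LSD", or "factorial". |
factors |
Character vector. Additional factor column names for factorial designs. |
expected_reps |
Integer or NULL. Expected number of replications per treatment. |
alpha |
Numeric. Significance level. Default 0.05. |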
Checks performed:
Response variable is numeric.
Missing values in response column.
Replication balance (equal n per treatment).
Expected replications match (if expected_reps supplied).
RCBD: each treatment appears exactly once per block.
Error degrees of freedom (Gomez & Gomez, 1984).
Factorial: all factor-level combinations present.
Minimum sample size guideline.
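A few of these checks can be sketched in base R for the RCBD case. This is an illustration of the underlying logic on a hypothetical 4 x 3 layout, not the internal implementation of check_design; the object names are made up for the example.

```r
# Hypothetical RCBD layout: 4 treatments x 3 blocks
d <- expand.grid(treatment = paste0("T", 1:4),
                 block = paste0("B", 1:3),
                 stringsAsFactors = FALSE)
set.seed(1)
d$yield <- rnorm(nrow(d), 4.5, 0.5)

cells <- table(d$treatment, d$block)   # treatment x block plot counts
reps  <- table(d$treatment)            # replications per treatment

is_numeric <- is.numeric(d$yield)                     # response is numeric
n_missing  <- sum(is.na(d$yield))                     # missing responses
balanced   <- length(unique(as.vector(reps))) == 1    # equal n per treatment
once_each  <- all(cells == 1)                         # one plot per treatment per block

# RCBD error degrees of freedom: (t - 1)(b - 1)
error_df <- (nrow(cells) - 1) * (ncol(cells) - 1)
```

With 4 treatments and 3 blocks the error degrees of freedom are (4 - 1)(3 - 1) = 6.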
An object of class "agriDQ_design" with per-check
results, treatment levels, and a pass/warn/fail summary.
Gomez, K.A. and Gomez, A.A. (1984). Statistical Procedures for Agricultural Research, 2nd ed. Wiley, ISBN:978-0471870920. pp. 8–55.
df <- expand.grid(
  treatment = paste0("T", 1:4),
  block = paste0("B", 1:3),
  KEEP.OUT.ATTRS = FALSE,
  stringsAsFactors = FALSE
)
df$yield <- rnorm(nrow(df), 4.5, 0.5)
result <- check_design(df, treatment = "treatment", block = "block",
                       response = "yield", design = "RCBD")
print(result)
Tests the equal-variance assumption required for ANOVA using three complementary tests: Bartlett, Levene (Brown-Forsythe), and Fligner-Killeen. Reports a consensus and a practical variance ratio.
check_homogeneity(x, group, alpha = 0.05)
x |
Numeric vector of the response variable. |
group |
Factor or character vector of group labels. |
alpha |
Numeric. Significance level. Default 0.05. |
Test choice:
Bartlett: Most powerful when data are truly normal; sensitive to departures from normality.
Levene (Brown-Forsythe): Robust to non-normality; uses group medians rather than means. Recommended for most agricultural data where mild skewness is common.
Fligner-Killeen: Fully nonparametric; most robust option for clearly non-normal data.
The variance ratio (max/min across groups) is also reported. A ratio exceeding 3 is a practical warning for ANOVA robustness (Montgomery, 2017).
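The three tests and the variance ratio can be approximated with the stats package alone. The sketch below is an illustration rather than the code behind check_homogeneity; in particular, the Brown-Forsythe test is computed by hand as a one-way ANOVA on absolute deviations from group medians, which is an assumption about the form used.

```r
set.seed(3)
yield <- c(rnorm(10, 4, 0.5), rnorm(10, 4, 1.5), rnorm(10, 4, 0.8))
trt   <- factor(rep(c("T1", "T2", "T3"), each = 10))

p_bartlett <- bartlett.test(yield, trt)$p.value
p_fligner  <- fligner.test(yield, trt)$p.value

# Brown-Forsythe form of Levene's test: one-way ANOVA on the
# absolute deviations from each group's median
abs_dev  <- abs(yield - ave(yield, trt, FUN = median))
p_levene <- anova(lm(abs_dev ~ trt))[["Pr(>F)"]][1]

# Practical check: ratio of largest to smallest group variance
group_var <- tapply(yield, trt, var)
var_ratio <- max(group_var) / min(group_var)
```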
An object of class "agriDQ_homogeneity" containing
results (list of agriDQ_result), var_by_group,
var_ratio, consensus, and n.
Levene, H. (1960). Robust tests for equality of variances. In Contributions to Probability and Statistics, ed. I. Olkin, pp. 278–292. Stanford University Press.
Montgomery, D.C. (2017). Design and Analysis of Experiments, 9th ed. Wiley, ISBN:978-1119492443.
set.seed(3)
yield <- c(rnorm(10, 4, 0.5), rnorm(10, 4, 1.5), rnorm(10, 4, 0.8))
trt <- rep(c("T1", "T2", "T3"), each = 10)
result <- check_homogeneity(yield, trt)
print(result)
Tests whether residuals from a fitted model (or a raw sequential vector) are independent — a core assumption for ANOVA and regression in agricultural field trials. Applies three complementary tests.
check_independence(residuals, alpha = 0.05, plot = TRUE)
residuals |
Numeric vector of model residuals or raw sequential observations. |
alpha |
Numeric. Significance level. Default 0.05. |
plot |
Logical. Produce residuals-vs-order and ACF plots. Default TRUE. |
Tests applied:
Durbin-Watson: Tests for lag-1 autocorrelation. DW ≈ 2 indicates no
autocorrelation; DW well below 2 suggests positive autocorrelation
(common in field trials with spatial trends).
Breusch-Godfrey: Tests for higher-order serial correlation (lags 1 and 2).
Wald-Wolfowitz runs test: Nonparametric test for randomness of the residual sequence.
For all three tests, pass the residuals from residuals(fit) after fitting an
ANOVA or regression model, with observations in field-plot order.
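To make the Durbin-Watson statistic concrete, it can be computed by hand from a residual vector. This sketch uses simulated data and hypothetical object names; it yields only the statistic, not the p-value or the other two tests.

```r
set.seed(5)
x   <- rep(1:3, 10)
y   <- rnorm(30)
fit <- lm(y ~ x)
e   <- residuals(fit)   # residuals in observation order

# Durbin-Watson statistic: squared successive differences of the
# residuals divided by the residual sum of squares
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 sample autocorrelation; dw is roughly 2 * (1 - r1)
r1 <- sum(e[-1] * e[-length(e)]) / sum(e^2)
```

The statistic always lies between 0 and 4, and values near 2 correspond to a lag-1 autocorrelation near zero.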
An object of class "agriDQ_independence" containing
results (list of agriDQ_result), consensus,
and n.
Durbin, J. and Watson, G.S. (1950). Testing for serial correlation in least squares regression. Biometrika, 37(3/4), 409–428. doi:10.1093/biomet/37.3-4.409
set.seed(5)
fit <- lm(rnorm(30) ~ rep(1:3, 10))
result <- check_independence(residuals(fit), plot = FALSE)
print(result)
Provides comprehensive missing data analysis: per-column and per-row missingness rates, pattern matrix, Little's MCAR test, and an inferred missingness mechanism with imputation recommendation.
check_missing(df, alpha = 0.05, plot = TRUE)
df |
A data frame (numeric and/or factor/character columns). |
alpha |
Numeric. Significance level for Little's MCAR test. Default 0.05. |
plot |
Logical. Produce a missingness pattern heatmap. Default TRUE. |
Missingness mechanisms:
MCAR (Missing Completely At Random): independent of observed and unobserved values. Complete-case analysis is valid.
MAR (Missing At Random): depends only on observed values. Multiple imputation is appropriate.
MNAR (Missing Not At Random): depends on the missing value itself. Requires sensitivity analysis.
Little's (1988) MCAR test is applied to numeric columns. A significant chi-squared statistic rejects MCAR, suggesting MAR or MNAR.
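The descriptive part of this analysis (missingness rates and the pattern matrix) needs only base R. The sketch below illustrates those summaries on a small made-up data frame; it does not implement Little's test, and it is not the package's own code.

```r
set.seed(1)
df <- data.frame(
  yield     = c(rnorm(18, 4.5), NA, NA),
  height    = c(NA, rnorm(19, 80)),
  treatment = rep(c("T1", "T2"), 10)
)

miss <- is.na(df)                      # logical matrix, TRUE = missing

col_summary <- data.frame(
  n_missing   = colSums(miss),
  pct_missing = 100 * colMeans(miss)
)
row_missing    <- rowSums(miss)        # missing cells per row
pattern_matrix <- 1L * miss            # binary matrix, 1 = missing

# Distinct missingness patterns and how often each occurs
patterns <- table(apply(pattern_matrix, 1, paste, collapse = ""))
```

Here yield has 2 missing values (10%), height has 1 (5%), and three distinct row patterns occur.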
An object of class "agriDQ_missing" containing:
col_summary: Per-column missing count and percentage.
row_summary: Per-row missing count.
pattern_matrix: Binary matrix (1 = missing).
little_test: Named list: statistic, df, p_value.
mechanism: Character: "MCAR", "MAR", or "undetermined".
recommendation: Character: suggested next step.
Little, R.J.A. (1988). A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association, 83(404), 1198–1202. doi:10.1080/01621459.1988.10478722
set.seed(1)
df <- data.frame(
  yield = c(rnorm(18, 4.5), NA, NA),
  height = c(NA, rnorm(19, 80)),
  treatment = rep(c("T1", "T2"), 10)
)
result <- check_missing(df, plot = FALSE)
print(result)
Applies a battery of normality tests selected by sample size, together with skewness, excess kurtosis, and a Q-Q plot. Returns a consensus recommendation for ANOVA/regression suitability.
check_normality(
  x,
  alpha = 0.05,
  tests = c("shapiro", "anderson", "ks", "lilliefors", "pearson", "jarque"),
  plot = TRUE,
  varname = "variable"
)
x |
Numeric vector of observations. |
alpha |
Numeric. Significance level. Default 0.05. |
tests |
Character vector of tests to apply. Any subset of "shapiro", "anderson", "ks", "lilliefors", "pearson", "jarque". |
plot |
Logical. Produce Q-Q and histogram plots. Default TRUE. |
varname |
Character. Label for plot titles and output. |
Test selection guidance for agricultural data:
: Shapiro-Wilk is most powerful (Razali & Wah, 2011).
: Anderson-Darling is preferred.
: Lilliefors or Kolmogorov-Smirnov.
Jarque-Bera assesses skewness and kurtosis directly.
Consensus is "pass" when the majority of applicable tests
do not reject normality.
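As an illustration of two of these tests, the sketch below runs Shapiro-Wilk from the stats package and computes the Jarque-Bera statistic by hand from the sample skewness and excess kurtosis. The simulated data, the seed, and the object names are assumptions for the example; the package may compute these quantities differently.

```r
set.seed(10)
x <- rnorm(30, mean = 4.2, sd = 0.6)
n <- length(x)

sw <- shapiro.test(x)                  # Shapiro-Wilk test

# Moment-based skewness and excess kurtosis
m2     <- mean((x - mean(x))^2)
skew   <- mean((x - mean(x))^3) / m2^1.5
exkurt <- mean((x - mean(x))^4) / m2^2 - 3

# Jarque-Bera statistic, asymptotically chi-squared with 2 df
jb   <- n / 6 * (skew^2 + exkurt^2 / 4)
p_jb <- pchisq(jb, df = 2, lower.tail = FALSE)

# Which tests reject normality at alpha = 0.05
rejects <- c(shapiro = sw$p.value < 0.05, jarque = p_jb < 0.05)
```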
An object of class "agriDQ_normality" with:
varname: Variable label.
n: Sample size (non-missing).
descriptives: List: mean, median, SD, CV, skewness, excess kurtosis, min, max.
results: Named list of agriDQ_result objects.
consensus: Character: "pass", "warning", or "fail".
consensus_msg: Character: actionable recommendation.
Razali, N.M. and Wah, Y.B. (2011). Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests. Journal of Statistical Modeling and Analytics, 2(1), 21–33.
yield <- rnorm(30, mean = 4.2, sd = 0.6)
result <- check_normality(yield, varname = "Wheat yield (t/ha)", plot = FALSE)
print(result)
Applies five complementary outlier detection methods and combines them into a consensus flag. A consensus flag is raised when at least two methods independently flag the same observation, which substantially reduces false positives compared to any single method.
check_outliers(
  x,
  method = c("iqr", "zscore", "hampel", "grubbs", "dixon"),
  alpha = 0.05,
  iqr_k = 1.5,
  z_threshold = 3,
  hampel_k = 3.5,
  labels = NULL
)
x |
Numeric vector of observations (e.g., yield, plant height). |
method |
Character vector. One or more of "iqr", "zscore", "hampel", "grubbs", "dixon". |
alpha |
Numeric. Significance level for formal tests. Default 0.05. |
iqr_k |
Numeric. IQR multiplier for the fence method. Default 1.5. |
z_threshold |
Numeric. Z-score threshold. Default 3. |
hampel_k |
Numeric. Hampel identifier threshold in MAD units. Default 3.5. |
labels |
Optional character vector of observation labels (e.g., plot IDs) of the same length as x. |
Methods applied:
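IQR fence: flags values outside $[Q_1 - k \cdot \mathrm{IQR},\ Q_3 + k \cdot \mathrm{IQR}]$, where $k$ is iqr_k.
Z-score: flags observations where $|x_i - \bar{x}| / s$ exceeds z_threshold.
Hampel identifier (modified Z-score): robust to masking. Uses the median and the median absolute deviation (MAD) in place of the mean and standard deviation. Recommended for small agricultural trial datasets where the classical Z-score is distorted by the very outliers being sought.
Grubbs test: formal test for a single extreme outlier under normality (Grubbs, 1950). Iterates if an outlier is found.
Dixon Q-test: suitable for small samples (Dixon, 1950).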
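The first three methods and the consensus rule can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the modified Z-score uses the common 0.6745 scaling constant, which the description above does not confirm, and the Grubbs and Dixon tests are omitted.

```r
set.seed(42)
x <- c(rnorm(20, mean = 4.5, sd = 0.5), 9.8, 0.2)

# IQR fence with multiplier k = 1.5
q   <- quantile(x, c(0.25, 0.75), names = FALSE)
k   <- 1.5
iqr <- q[2] - q[1]
flag_iqr <- x < q[1] - k * iqr | x > q[2] + k * iqr

# Classical Z-score with threshold 3
flag_z <- abs(x - mean(x)) / sd(x) > 3

# Modified Z-score with threshold 3.5 (0.6745 scaling is an assumption)
mad_raw     <- mad(x, constant = 1)    # unscaled median absolute deviation
flag_hampel <- abs(0.6745 * (x - median(x)) / mad_raw) > 3.5

# Consensus: at least two methods flag the same observation
votes     <- flag_iqr + flag_z + flag_hampel
consensus <- votes >= 2
```

The two planted values (9.8 and 0.2, positions 21 and 22) are the only observations that reach consensus here.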
An object of class "agriDQ_outlier" — a list containing:
flags: Data frame with flag status from each method and a consensus column.
summary: Named integer vector: outlier count per method.
n_flagged: Integer: observations flagged by consensus.
n_total: Integer: total observations.
n_valid: Integer: non-missing observations.
Grubbs, F.E. (1950). Sample criteria for testing outlying observations. Annals of Mathematical Statistics, 21(1), 27–58. doi:10.1214/aoms/1177729885
Dixon, W.J. (1950). Analysis of extreme values. Annals of Mathematical Statistics, 21(4), 488–506. doi:10.1214/aoms/1177729747
set.seed(42)
yield <- c(rnorm(20, mean = 4.5, sd = 0.5), 9.8, 0.2)
result <- check_outliers(yield, method = c("iqr", "zscore", "hampel"))
print(result)
Detects multivariate outliers using the squared Mahalanobis distance with a chi-squared critical value. Useful for observations that are not extreme on any single variable but are unusual in combination (e.g., very high yield paired with very low plant height).
check_outliers_mv(df, alpha = 0.05, robust = FALSE)
df |
A numeric data frame or matrix. Rows are observations, columns are variables. |
alpha |
Numeric. Significance level for the chi-squared critical value. Default 0.05. |
robust |
Logical. If |
An object of class "agriDQ_mout" containing Mahalanobis
distances (distances), critical value (critical),
logical flag vector (flags), count of flagged observations
(n_flagged), and a summary.
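The classical (non-robust) version of this rule can be written with stats::mahalanobis and stats::qchisq. The sketch below is an illustration with made-up data and hypothetical object names; it does not cover the robust option.

```r
set.seed(7)
df <- data.frame(
  yield    = c(rnorm(20, 4.5, 0.5), 9.0),
  plant_ht = c(rnorm(20, 80, 5), 30.0)
)

X <- as.matrix(df)

# Squared Mahalanobis distance of each row from the sample centroid
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))

# Chi-squared critical value with df = number of variables
alpha    <- 0.05
critical <- qchisq(1 - alpha, df = ncol(X))

flags     <- d2 > critical
n_flagged <- sum(flags)
```

The planted observation in row 21 (high yield with very low plant height) exceeds the critical value.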
set.seed(7)
df <- data.frame(
  yield = c(rnorm(20, 4.5, 0.5), 9.0),
  plant_ht = c(rnorm(20, 80, 5), 30.0)
)
result <- check_outliers_mv(df)
print(result)
Detects common data quality issues in categorical variables: inconsistent capitalisation, whitespace errors, near-duplicate labels (fuzzy matching), unexpected factor levels, and rare categories.
check_qualitative(
  df,
  cols = NULL,
  expected_levels = NULL,
  fuzzy_threshold = 2L,
  rare_threshold = 0.02
)
df |
A data frame. |
cols |
Character vector of columns to check. If |
expected_levels |
Named list mapping column names to character vectors of valid levels. E.g. list(season = c("Kharif", "Rabi")). |
fuzzy_threshold |
Integer. Levenshtein distance threshold for near-duplicate detection. Applied only when minimum label length exceeds 3 characters (to avoid false positives on short codes). Default 2. |
rare_threshold |
Numeric. Proportion below which a category is flagged as rare. Applied only to large samples. Default 0.02. |
Issues detected per column:
Missing values — count and percentage.
Case inconsistency — e.g., "Kharif" vs
"kharif" vs "KHARIF".
Whitespace — leading/trailing spaces or double spaces.
Near-duplicates — label pairs within fuzzy_threshold
Levenshtein distance (long labels only).
Unexpected levels — values not in expected_levels.
Rare categories — frequency below rare_threshold
(large samples only).
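Three of these checks (whitespace, case inconsistency, and near-duplicates) can be sketched with base R string functions and utils::adist. The label vector and object names below are hypothetical, and the logic is an illustration rather than the code used by check_qualitative.

```r
labels <- c("Kharif", "kharif", "KHARIF", " Rabi", "Rabi", "Rabbi")

# Whitespace: leading/trailing spaces or double spaces
has_ws <- labels != trimws(labels) | grepl("  ", labels, fixed = TRUE)

# Case inconsistency: distinct labels that collapse when lower-cased
clean       <- trimws(labels)
uniq        <- unique(clean)
case_groups <- split(uniq, tolower(uniq))
case_issues <- Filter(function(g) length(g) > 1, case_groups)

# Near-duplicates: pairs within Levenshtein distance 2 (upper triangle only)
u    <- unique(tolower(clean))
d    <- adist(u)
near <- which(d > 0 & d <= 2 & upper.tri(d), arr.ind = TRUE)
near_pairs <- data.frame(a = u[near[, 1]], b = u[near[, 2]])
```

In this example one label has stray whitespace, the three spellings of "Kharif" form one case-inconsistency group, and "rabi"/"rabbi" is the only near-duplicate pair.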
An object of class "agriDQ_qualitative" with per-column
results (col_results), a consolidated issue table
(issue_table), and n_issues.
df <- data.frame(
  treatment = c("T1", "T1", "t1", "T2", "T2"),
  season = c("Kharif", "Kharif", "kharif", "Rabi", "Rabi"),
  stringsAsFactors = FALSE
)
result <- check_qualitative(df, expected_levels = list(season = c("Kharif", "Rabi")))
print(result)
For each variable with missing values, fits a logistic regression of the missingness indicator on all other observed variables to assess whether the MAR assumption is plausible.
classify_missing(df, alpha = 0.05)
df |
A data frame. |
alpha |
Numeric. Significance level. Default 0.05. |
A data frame with columns variable, pct_missing,
lr_pvalue, and mechanism.
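The idea of regressing a missingness indicator on the observed variables can be sketched with stats::glm. The data-generating mechanism below (taller plots more likely to lose their yield value) is a made-up MAR scenario, and using a likelihood-ratio test of the full model against an intercept-only model is an assumption about how an lr_pvalue might be obtained.

```r
set.seed(2)
n      <- 60
height <- rnorm(n, 80, 5)
trt    <- rep(c("T1", "T2"), length.out = n)
yield  <- rnorm(n, 4.5, 0.5)

# Hypothetical MAR mechanism: missingness in yield depends on height
yield[runif(n) < plogis((height - 82) / 4)] <- NA
df <- data.frame(yield, height, trt)

miss <- as.integer(is.na(df$yield))    # 1 = yield missing

full <- glm(miss ~ height + trt, data = df, family = binomial)
null <- glm(miss ~ 1, data = df, family = binomial)

# Likelihood-ratio test of the observed predictors as a group
lr        <- anova(null, full, test = "Chisq")
lr_pvalue <- lr[["Pr(>Chi)"]][2]
```

A small lr_pvalue indicates that missingness is related to observed variables, which is consistent with MAR rather than MCAR.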
set.seed(2)
df <- data.frame(
  yield = c(NA, rnorm(9, 4.5, 0.5)),
  trt = rep(c("T1", "T2"), 5)
)
classify_missing(df)
Produces a self-contained HTML report from a
run_dq_pipeline result. The report includes a
colour-coded scorecard (green / amber / red), a detailed results table,
and an interpretation guide.
generate_dq_report(
  pipeline,
  output_file,
  title = "agriDQ Data Quality Report",
  author = "agriDQ"
)
pipeline |
An object of class "agriDQ_pipeline". |
output_file |
Character. Path for the HTML output file. |
title |
Character. Report title. |
author |
Character. Author name for the report header. |
Invisibly returns the path to the generated HTML file.
data(agri_trial)
pl <- run_dq_pipeline(agri_trial, response = "yield", treatment = "treatment",
                      block = "block", plot = FALSE)
tmp <- tempfile(fileext = ".html")
generate_dq_report(pl, output_file = tmp, author = "Researcher")
Print an agriDQ_result object
## S3 method for class 'agriDQ_result'
print(x, ...)
x |
An object of class "agriDQ_result". |
... |
Ignored. |
Invisibly returns x.
Runs all six data quality modules in sequence on a numeric response variable within an agricultural experimental data frame and returns a unified result with a master summary table.
run_dq_pipeline(
  df,
  response = NULL,
  treatment = NULL,
  block = NULL,
  design = "RCBD",
  alpha = 0.05,
  plot = TRUE,
  outlier_method = c("iqr", "zscore", "hampel")
)
df |
A data frame. |
response |
Character. Name of the numeric response variable. |
treatment |
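Character or NULL. Name of the treatment column. |
block |
Character or NULL. Name of the block column. |
design |
Character. Experimental design type passed to check_design. Default "RCBD". |
alpha |
Numeric. Significance level. Default 0.05. |
plot |
Logical. Produce diagnostic plots from sub-modules. Default TRUE. |
outlier_method |
Character vector. Methods for check_outliers. Default c("iqr", "zscore", "hampel"). |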
An object of class "agriDQ_pipeline" containing:
steps: Named list of sub-module results.
summary: Data frame: module, test, statistic, p-value, status.
response, treatment, block, design: Input parameters.
n, alpha, timestamp: Metadata.
data(agri_trial)
result <- run_dq_pipeline(agri_trial, response = "yield", treatment = "treatment",
                          block = "block", design = "RCBD", plot = FALSE)
print(result)
Applies automatic label standardisation: trims whitespace, collapses multiple spaces, and optionally converts case or applies a lookup-table replacement.
standardise_labels(
  df,
  cols = NULL,
  case = c("none", "lower", "upper", "title"),
  lookup = NULL
)
df |
A data frame. |
cols |
Character vector of column names to standardise. Defaults to all character/factor columns. |
case |
Character. One of "none", "lower", "upper", "title". |
lookup |
Named list of replacement maps, e.g.
|
A data frame with standardised labels.
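The standardisation steps described above map onto a few base R string operations. The sketch below is an illustration on a single hypothetical column; the lookup vector and its "T 2" entry are invented for the example and do not reflect the package's lookup format.

```r
df <- data.frame(trt = c(" T1 ", "T1", "t1", "T  2"), stringsAsFactors = FALSE)

clean <- trimws(df$trt)                # trim leading/trailing whitespace
clean <- gsub("\\s+", " ", clean)      # collapse repeated spaces
clean <- toupper(clean)                # equivalent of case = "upper"

# Hypothetical lookup table mapping a variant to its canonical label
lookup <- c("T 2" = "T2")
hit <- clean %in% names(lookup)
clean[hit] <- unname(lookup[clean[hit]])

df$trt <- clean
```

After these steps the column contains "T1", "T1", "T1", "T2".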
df <- data.frame(trt = c(" T1 ", "T1", "t1", "T2"), stringsAsFactors = FALSE)
standardise_labels(df, case = "upper")