| Title: | Automated Exploratory Data Analysis and Dataset Profiling |
|---|---|
| Description: | Profiles a data frame with minimal input: column type inference, missing-value analysis, distributional summary statistics (including skewness and kurtosis), normality tests, outlier detection, correlation and categorical-association analysis, date-column profiling, grouped comparisons and an overall data-quality score, alongside a set of 'ggplot2' visualisations. A single entry point, profile_data(), returns a structured S3 object holding metadata, statistics, diagnostics and plots, with print(), summary() and plot() methods, and report() renders the whole profile to a self-contained HTML file. Statistical methods include the Shapiro-Wilk normality test as implemented by Royston (1995) <doi:10.2307/2986146> and the Anderson-Darling test following Stephens (1974) <doi:10.1080/01621459.1974.10480196>, with power comparisons of these tests in Yap and Sim (2011) <doi:10.1080/00949655.2010.520163>, and the categorical association measure of Cramer (1946, ISBN:9780691080048). |
| Authors: | Muhammad Farooqi [aut, cre] |
| Maintainer: | Muhammad Farooqi <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.2.1 |
| Built: | 2026-06-24 13:24:01 UTC |
| Source: | https://github.com/cran/dataProfilerR |
For each Date/POSIXct column, reports the count, missingness, range, and
the largest gap between consecutive (sorted, unique) timestamps – a quick way
to spot coverage holes in a time series.
analyze_dates(df, types = NULL)
df |
A data frame. |
types |
Optional named character vector of column types; computed if not supplied. |
A data frame with one row per date column (column, n,
n_missing, min, max, range_days, n_unique, max_gap_days), or
NULL if there are no date columns.
df <- data.frame(d = as.Date("2026-01-01") + c(0, 1, 2, 10))
analyze_dates(df)
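The "largest gap" statistic described above can be sketched in base R. This is only an illustration of the idea (sort, de-duplicate, difference, take the maximum); `max_gap_days()` is a hypothetical helper name, not a function exported by the package, and the package's own implementation may differ.

```r
# Hypothetical helper (not part of the package): largest gap, in days,
# between consecutive sorted, unique, non-missing dates.
max_gap_days <- function(x) {
  u <- sort(unique(x[!is.na(x)]))
  if (length(u) < 2) return(NA_real_)      # a gap needs at least two timestamps
  max(as.numeric(diff(u), units = "days")) # diff() yields a difftime
}

d <- as.Date("2026-01-01") + c(0, 1, 2, 10)
max_gap_days(d)  # 8: the jump from 2026-01-03 to 2026-01-11
```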
Reports missingness per column and overall, including how many rows are
fully complete. Only NA is counted as missing (blank strings are not).
analyze_missing(df)
df |
A data frame. |
A list with per_column (a data frame of column, n_missing,
pct_missing) and overall (a list with total/missing cell counts,
pct_missing, complete_rows and pct_complete_rows).
analyze_missing(data.frame(a = c(1, NA, 3), b = c("x", "y", NA)))
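For orientation, the same quantities can be approximated with base R alone. This is a sketch rather than the package's code, and it assumes that pct_missing is expressed on a 0-100 scale, which the text above does not state.

```r
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

na_mask <- is.na(df)  # logical matrix, TRUE where a cell is NA

per_column <- data.frame(
  column      = names(df),
  n_missing   = unname(colSums(na_mask)),
  pct_missing = unname(100 * colMeans(na_mask))  # assumed 0-100 scale
)

complete_rows <- sum(stats::complete.cases(df))  # rows with no NA at all

per_column
complete_rows  # 1: only the first row is fully complete
is.na("")      # FALSE: a blank string is not counted as missing
```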
Computes Cramer's V between every pair of categorical/logical columns. V
ranges from 0 (no association) to 1 (perfect association) and is the
categorical analogue of a correlation matrix. It is derived from the
chi-squared statistic: V = sqrt(X^2 / (n * (k - 1))), where k is the
smaller of the two factors' level counts.
categorical_association(df, types = NULL, max_levels = 50)
df |
A data frame. |
types |
Optional named character vector of column types (from
|
max_levels |
Categorical columns with more than this many levels are skipped (a high-cardinality column makes the chi-squared test unreliable and the table huge). Default 50. |
A symmetric numeric matrix of Cramer's V with a unit diagonal, or
NULL if fewer than two eligible categorical columns are present.
df <- data.frame(a = c("x", "x", "y", "y"),
                 b = c("p", "p", "q", "q"),
                 c = c("m", "n", "m", "n"))
categorical_association(df)
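The formula above can be checked by hand for a single pair of columns. The sketch below uses `stats::chisq.test()` without continuity correction; `cramers_v()` is a hypothetical helper, and whether the package disables the correction in the same way is an assumption. Degenerate inputs (a column with a single level) are not handled.

```r
# Hypothetical helper (not part of the package): Cramer's V for two vectors.
cramers_v <- function(x, y) {
  tab  <- table(x, y)                       # NA values are dropped by table()
  chi2 <- suppressWarnings(                 # small counts trigger a warning
    stats::chisq.test(tab, correct = FALSE)$statistic
  )
  k <- min(nrow(tab), ncol(tab))            # smaller of the two level counts
  sqrt(unname(chi2) / (sum(tab) * (k - 1)))
}

a  <- c("x", "x", "y", "y")
b  <- c("p", "p", "q", "q")
c3 <- c("m", "n", "m", "n")

cramers_v(a, b)   # 1: b is fully determined by a
cramers_v(a, c3)  # 0: every combination occurs equally often
```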
Grouped profiling: split the data by a categorical column and summarise each numeric column within each group (count, mean, sd, median, missingness). This is the quickest way to see whether a metric differs by segment.
compare_groups(df, group, max_groups = 50)
df |
A data frame. |
group |
Name of the grouping column. Should be categorical/logical (or a low-cardinality column); a warning is issued if it has many levels. |
max_groups |
Maximum number of groups before erroring (guards against accidentally grouping on a near-unique column). Default 50. |
A list with group_sizes (a data frame of group, n) and
numeric_by_group (a long data frame of group, column, n,
n_missing, mean, sd, median), or NULL if there are no numeric
columns to compare.
compare_groups(iris, "Species")
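As a rough picture of the per-group statistics listed above, here is a base-R sketch for one numeric column. It mirrors the documented column names but is not the package's implementation.

```r
# Split one numeric column by the grouping column.
groups <- split(iris$Sepal.Length, iris$Species)

by_group <- data.frame(
  group     = names(groups),
  n         = vapply(groups, function(v) sum(!is.na(v)), integer(1)),
  n_missing = vapply(groups, function(v) sum(is.na(v)), integer(1)),
  mean      = vapply(groups, mean, numeric(1), na.rm = TRUE),
  sd        = vapply(groups, stats::sd, numeric(1), na.rm = TRUE),
  median    = vapply(groups, stats::median, numeric(1), na.rm = TRUE),
  row.names = NULL
)

by_group
```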
Correlation matrices over the numeric columns, using pairwise-complete observations.
correlation_analysis(df, types = NULL, method = c("pearson", "spearman"))
df |
A data frame. |
types |
Optional named character vector of column types. |
method |
Character vector; any of |
A named list of correlation matrices (one per requested method), or
NULL if there are fewer than two numeric columns.
correlation_analysis(iris)
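"Pairwise-complete observations" corresponds to the `use = "pairwise.complete.obs"` option of `stats::cor()`. A minimal sketch of building one matrix per method with base R follows; it is an illustration, not the package's code.

```r
# Keep only the numeric columns.
num <- iris[vapply(iris, is.numeric, logical(1))]

# One correlation matrix per method; each pair of columns uses the rows
# where both values are present.
cors <- lapply(
  c(pearson = "pearson", spearman = "spearman"),
  function(m) stats::cor(num, use = "pairwise.complete.obs", method = m)
)

names(cors)
round(cors$pearson, 2)
```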
Rolls several signals into a single 0-100 score and a letter grade. The
components are completeness (share of non-missing cells), row uniqueness
(penalises duplicate rows), and column variability (penalises constant,
single-value columns). If an outlier_rate is supplied it adds a cleanliness
component. Components are averaged with the supplied weights.
data_quality_score(
  df,
  missing = NULL,
  outlier_rate = NULL,
  weights = c(completeness = 0.4, uniqueness = 0.2, variability = 0.2, cleanliness = 0.2)
)
df |
A data frame. |
missing |
Optional result of |
outlier_rate |
Optional fraction (0-1) of numeric cells flagged as outliers; if supplied, a cleanliness component is included. |
weights |
Optional named numeric vector of component weights. Missing components are dropped and the rest renormalised. |
A list with score (0-100), grade (a letter), and components
(a named numeric vector of the component scores).
data_quality_score(iris)
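The renormalisation of weights when a component is absent can be illustrated as follows. `weighted_score()` is a hypothetical helper, the component values are made up for the example (they are not the scores for iris), and it is assumed that components lie on a 0-1 scale before being scaled to 0-100.

```r
# Hypothetical helper (not part of the package): weighted average of the
# available components, with weights renormalised to sum to 1.
weighted_score <- function(components, weights) {
  w <- weights[names(components)]  # drop weights for absent components
  w <- w / sum(w)                  # renormalise the remaining weights
  100 * sum(components * w)
}

# Illustrative component scores; no cleanliness component is supplied.
components <- c(completeness = 1, uniqueness = 0.9, variability = 1)
weights <- c(completeness = 0.4, uniqueness = 0.2,
             variability = 0.2, cleanliness = 0.2)

weighted_score(components, weights)  # 97.5, using weights 0.5 / 0.25 / 0.25
```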
Three standard rules:
"iqr": below Q1 - k*IQR or above Q3 + k*IQR (Tukey's rule, k = 1.5).
"zscore": absolute z-score above threshold (default 3).
"robust": absolute modified z-score using the median and MAD above
threshold (default 3.5); resistant to the outliers it is detecting.
detect_outliers(x, method = c("iqr", "zscore", "robust"), threshold = NULL)
x |
A numeric vector. |
method |
One of |
threshold |
Cutoff for |
A list: method, n (non-missing count), n_outliers, pct,
is_outlier (a logical vector aligned to x, FALSE for NA), and
bounds (lower/upper, where applicable).
detect_outliers(c(1, 2, 3, 4, 100), method = "iqr")
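The three rules can be compared side by side with small base-R helpers. These are hypothetical functions written for illustration only; in particular, the 0.6745 scaling of the modified z-score is the common Iglewicz-Hoaglin convention and is assumed here, not confirmed for the package. Degenerate inputs (zero standard deviation or zero MAD) are not handled.

```r
# Hypothetical helpers (not part of the package). Each returns a logical
# vector aligned to x, with FALSE for NA.
iqr_flags <- function(x, k = 1.5) {
  q <- stats::quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}

zscore_flags <- function(x, threshold = 3) {
  z <- (x - mean(x, na.rm = TRUE)) / stats::sd(x, na.rm = TRUE)
  !is.na(x) & abs(z) > threshold
}

robust_flags <- function(x, threshold = 3.5) {
  med     <- stats::median(x, na.rm = TRUE)
  mad_raw <- stats::mad(x, constant = 1, na.rm = TRUE)  # unscaled MAD
  m <- 0.6745 * (x - med) / mad_raw                     # assumed scaling
  !is.na(x) & abs(m) > threshold
}

x <- c(1, 2, 3, 4, 100, NA)
iqr_flags(x)     # flags 100
zscore_flags(x)  # flags nothing: 100 inflates the mean and sd it is judged by
robust_flags(x)  # flags 100
```

The z-score result shows why the robust rule is described as resistant: a single extreme value can mask itself under the mean and standard deviation, but not under the median and MAD.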
Maps each column to one of "numeric", "integer", "date",
"logical", "categorical", "text" or "other". Character columns are
split into "categorical" and "text" heuristically: long strings, or
high-cardinality columns where most values are unique, are treated as free
text; everything else is categorical.
infer_column_types(df, text_min_avg_chars = 50, text_unique_ratio = 0.8)
df |
A data frame. |
text_min_avg_chars |
Average character length above which a character column is considered free text. Default 50. |
text_unique_ratio |
Fraction of unique values above which a character column (with enough rows) is considered free text. Default 0.8. |
A named character vector of inferred types, one per column.
infer_column_types(data.frame(a = 1:3, b = c("x", "y", "z"), d = Sys.Date() + 0:2))
Is an object a data_profile?
is_data_profile(x)
x |
Any object. |
TRUE if x has class data_profile.
is_data_profile(profile_data(iris))
Moment-based kurtosis minus 3, so a normal distribution scores near 0.
kurtosis(x)
x |
A numeric vector. |
A single numeric value, or NA_real_ if there are fewer than four
non-missing values or the variance is zero.
kurtosis(rnorm(100))
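A sketch of moment-based excess kurtosis in base R, assuming population (1/n) central moments; `excess_kurtosis()` is a hypothetical name, and the package's exact estimator may differ.

```r
# Hypothetical helper (not part of the package): m4 / m2^2 - 3, using
# central moments with a 1/n divisor.
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) < 4) return(NA_real_)  # too few values
  d  <- x - mean(x)
  m2 <- mean(d^2)
  if (m2 == 0) return(NA_real_)        # zero variance
  mean(d^4) / m2^2 - 3
}

excess_kurtosis(c(1, 2, 3, 4, 5))  # -1.3: flatter than a normal distribution
```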
Runs the Shapiro-Wilk test on each numeric/integer column, and the Anderson-Darling test as well if the suggested nortest package is installed. Shapiro-Wilk requires 3 to 5000 observations; larger columns are reduced to an evenly-spaced subsample of 5000. The subsample is deterministic and does not touch the session's random-number state.
normality_tests(df, types = NULL, alpha = 0.05)
df |
A data frame. |
types |
Optional named character vector of column types. |
alpha |
Significance level for the |
A data frame with one row per numeric column: column, n_used,
shapiro_W, shapiro_p, ad_A and ad_p (the Anderson-Darling columns
are NA if nortest is absent), and a logical normal. Returns
NULL if there are no numeric columns.
normality_tests(iris)
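One way to obtain a deterministic, evenly spaced subsample before calling `stats::shapiro.test()` is sketched below. `even_subsample()` is a hypothetical helper; the package's exact index scheme is not documented here and may differ.

```r
# Hypothetical helper (not part of the package): keep at most max_n values,
# taken at evenly spaced positions, without drawing random numbers.
even_subsample <- function(x, max_n = 5000) {
  x <- x[!is.na(x)]
  if (length(x) <= max_n) return(x)
  idx <- round(seq(1, length(x), length.out = max_n))
  x[idx]
}

x <- sin(seq_len(20000))         # deterministic, clearly non-normal input
s <- even_subsample(x)
length(s)                        # 5000, the upper limit of shapiro.test()
res <- stats::shapiro.test(s)
res$statistic
```

Because no random numbers are drawn, repeated calls return the same subsample and the session's `.Random.seed` is left unchanged.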
Applies detect_outliers() to every numeric column and tabulates the result.
outlier_summary(df, types = NULL, method = "iqr")
df |
A data frame. |
types |
Optional named character vector of column types. |
method |
Outlier method passed to |
A list with per_column (a data frame of column, n_outliers,
pct) and overall_rate (fraction of numeric cells flagged, 0-1), or
NULL if there are no numeric columns.
outlier_summary(iris)
Heatmap of the Cramer's V matrix from categorical_association().
plot_association(df, max_levels = 50)
df |
A data frame. |
max_levels |
Passed to |
A ggplot2 object, or NULL (with a warning) if there are fewer
than two eligible categorical columns.
plot_association(
  data.frame(a = c("x", "x", "y", "y"), b = c("p", "p", "q", "q"))
)
One boxplot per numeric column, faceted with free y-scales so columns on different scales are still readable. Useful as a quick outlier scan.
plot_boxplots(df)
df |
A data frame. |
A ggplot2 object, or NULL (with a warning) if there are no
numeric columns.
plot_boxplots(iris)
A heatmap of the correlation matrix over the numeric columns, annotated with the rounded coefficients.
plot_correlation(df, method = c("pearson", "spearman"))
df |
A data frame. |
method |
Correlation method: |
A ggplot2 object, or NULL (with a warning) if there are fewer
than two numeric columns.
plot_correlation(iris)
Histogram with a density overlay for numeric columns; a bar chart of the most frequent levels for categorical/text/logical columns.
plot_distribution(df, column, bins = 30, max_levels = 20)
df |
A data frame. |
column |
Name of the column to plot. |
bins |
Histogram bins for numeric columns. Default 30. |
max_levels |
Maximum categories to show for categorical columns. Default 20. |
A ggplot2 object.
plot_distribution(iris, "Sepal.Length")
plot_distribution(iris, "Species")
A tile plot of where NAs fall: columns on the x-axis, rows on the y-axis,
shaded by whether each cell is missing. For tall data the rows are
subsampled to max_rows so the plot stays legible.
plot_missing(df, max_rows = 500)
df |
A data frame. |
max_rows |
Maximum rows to display (subsampled if exceeded). Default 500. |
A ggplot2 object.
df <- data.frame(a = c(1, NA, 3), b = c(NA, "y", "z"))
plot_missing(df)
A scatterplot matrix over selected numeric columns, drawn with facets. Capped at a handful of columns because the number of panels grows quadratically.
plot_pairs(df, columns = NULL, max_cols = 5)
df |
A data frame. |
columns |
Optional character vector of numeric columns to include. If
|
max_cols |
Maximum number of columns to include. Default 5. |
A ggplot2 object, or NULL (with a warning) if fewer than two
numeric columns are available.
plot_pairs(iris, c("Sepal.Length", "Sepal.Width", "Petal.Length"))
Returns one of the figures built by profile_data().
## S3 method for class 'data_profile'
plot(
  x,
  which = c("missing", "correlation", "association", "boxplots", "pairs", "distribution"),
  column = NULL,
  ...
)
x |
A |
which |
Which figure: |
column |
Column name, required when |
... |
Ignored. |
A ggplot2 object (also drawn when called at the console).
p <- profile_data(iris)
plot(p, which = "missing")
plot(p, which = "distribution", column = "Sepal.Length")
Print a concise overview of a data profile.
## S3 method for class 'data_profile'
print(x, ...)
x |
A |
... |
Ignored. |
x, invisibly.
print(profile_data(iris))
The package's single entry point. It runs type inference, missing-value
analysis, summary statistics, normality tests, outlier detection, correlation
analysis and a data-quality score, and (optionally) builds a set of
ggplot2 visualisations. The result is a data_profile S3 object with
print(), summary() and plot() methods.
profile_data(
  df,
  dataset_name = NULL,
  build_plots = TRUE,
  distributions = TRUE,
  normality = TRUE,
  outlier_method = "iqr",
  cor_method = c("pearson", "spearman"),
  group_by = NULL,
  verbose = FALSE
)
df |
A data frame with at least one row and one column and unique, non-empty column names. |
dataset_name |
Optional label stored in the metadata; defaults to the
deparsed name of |
build_plots |
Whether to build the ggplot2 objects. Set |
distributions |
Whether to build a per-column distribution plot (the
eager, heaviest part of plotting). Set |
normality |
Whether to run normality tests. Default |
outlier_method |
Method passed to |
cor_method |
Correlation methods: any of |
group_by |
Optional name of a categorical column. If supplied, a grouped
comparison of the numeric columns is added to the diagnostics (see
|
verbose |
Print progress messages. Default |
An object of class data_profile: a list with elements metadata,
statistics, diagnostics, plots and call.
print.data_profile(), summary.data_profile(),
plot.data_profile()
p <- profile_data(iris)
p
summary(p)
plot(p, which = "correlation")
Turns a data_profile into a standalone HTML file containing the metadata,
quality score, statistical tables and every figure. The report is built with
rmarkdown, so a working pandoc installation is required (R Markdown's
usual dependency); report() errors clearly if pandoc is unavailable.
report(
  x,
  output_file = "dataProfilerR_report.html",
  title = NULL,
  quiet = TRUE
)
x |
A |
output_file |
Path to write. A bare file name lands in the working
directory. Default |
title |
Report title. Defaults to the dataset name. |
quiet |
Passed to |
The path to the written file, invisibly.
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  p <- profile_data(iris)
  f <- report(p, file.path(tempdir(), "iris_report.html"))
}
Moment-based skewness, computed as m3 / m2^(3/2) on the non-missing values.
skewness(x)
x |
A numeric vector. |
A single numeric value, or NA_real_ if there are fewer than three
non-missing values or the variance is zero.
skewness(c(1, 2, 2, 3, 10))
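The m3 / m2^(3/2) formula can be written out directly in base R. The sketch assumes central moments with a 1/n divisor; `moment_skewness()` is a hypothetical name used only for this illustration.

```r
# Hypothetical helper (not part of the package): m3 / m2^(3/2) on the
# non-missing values, with central moments using a 1/n divisor.
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) < 3) return(NA_real_)  # too few values
  d  <- x - mean(x)
  m2 <- mean(d^2)
  if (m2 == 0) return(NA_real_)        # zero variance
  mean(d^3) / m2^(3 / 2)
}

moment_skewness(c(1, 2, 3, 4, 5))   # 0: symmetric around the mean
moment_skewness(c(1, 2, 2, 3, 10))  # positive: long right tail
```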
Produces a numeric summary data frame (count, missingness, mean, sd, variance, quartiles, IQR, skewness, kurtosis) for numeric and integer columns, and a categorical summary (cardinality and most frequent level) for factor, logical, categorical and text columns.
summarize_columns(df, types = NULL)
df |
A data frame. |
types |
Optional named character vector of column types (as returned by
|
A list with numeric (a data frame, or NULL if no numeric
columns) and categorical (a named list, possibly empty).
summarize_columns(iris)
Prints the numeric summary, the columns with the most missingness, normality verdicts, outlier counts, and the strongest correlations, and returns the same pieces invisibly as a list.
## S3 method for class 'data_profile'
summary(object, max_rows = 10, ...)
object |
A |
max_rows |
Maximum rows to print per table. Default 10. |
... |
Ignored. |
A list of the printed tables, invisibly.
summary(profile_data(iris))