Title: | Extensible Data Pattern Searching Framework |
---|---|
Description: | Extensible framework for subgroup discovery (Atzmueller (2015) <doi:10.1002/widm.1144>), contrast patterns (Chen (2022) <doi:10.48550/arXiv.2209.13556>), emerging patterns (Dong (1999) <doi:10.1145/312129.312191>), association rules (Agrawal (1994) <https://www.vldb.org/conf/1994/P487.PDF>) and conditional correlations (Hájek (1978) <doi:10.1007/978-3-642-66943-9>). Both crisp (Boolean, binary) and fuzzy data are supported. It generates conditions in the form of elementary conjunctions, evaluates them on a dataset and checks the induced sub-data for interesting statistical properties. A user-defined function may be defined to evaluate on each generated condition to search for custom patterns. |
Authors: | Michal Burda [aut, cre] |
Maintainer: | Michal Burda <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.4.0 |
Built: | 2025-02-07 07:22:51 UTC |
Source: | CRAN |
A general function for searching for patterns of custom type. The function
allows for the selection of columns of x to be used as condition predicates.
It enumerates all possible conditions in the form of elementary conjunctions
of the selected predicates and, for each condition, executes a user-defined
callback function f. The callback function is intended to perform some
analysis and return an object representing a pattern or patterns related to
the condition. dig() returns a list of these returned objects.
The callback function f may have some of the arguments listed in the
description of the f argument. Based on the arguments that are present, the
algorithm provides information about the generated condition.
In addition to condition predicates, the function allows for the selection of
the so-called focus predicates. The focus predicates, a.k.a. foci, are
predicates that are evaluated within each condition, and some additional
information about them is provided to the callback function.
dig() allows specifying some restrictions on the generated conditions,
such as:
- the minimum and maximum length of the condition (min_length and
  max_length arguments),
- the minimum support of the condition (min_support argument); the support
  of a condition is the relative frequency of the condition in the
  dataset x,
- the minimum support of the focus (min_focus_support argument); the support
  of a focus is the relative frequency of rows for which all condition
  predicates AND the focus are TRUE. Foci with support lower than
  min_focus_support are filtered out.
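The support quantities above can be illustrated by hand. The following base-R sketch is independent of dig() (the data are made up for illustration) and computes the support of a condition and of a focus on a small logical matrix:

```r
# A tiny logical dataset: rows are cases, columns are predicates
x <- cbind(a = c(TRUE,  TRUE,  TRUE, FALSE),
           b = c(TRUE,  TRUE, FALSE, FALSE),
           f = c(TRUE, FALSE, FALSE, FALSE))

# support of the condition {a, b}: relative frequency of rows
# where both a and b are TRUE
cond <- x[, "a"] & x[, "b"]
support <- mean(cond)                            # 2/4 = 0.5

# support of focus f: relative frequency of rows where
# a, b AND f are all TRUE
focus_support <- mean(cond & x[, "f"])           # 1/4 = 0.25

# conditional focus support: frequency of f among the rows
# that satisfy the condition
conditional_focus_support <- mean(x[cond, "f"])  # 1/2 = 0.5
```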
dig(
  x,
  f,
  condition = everything(),
  focus = NULL,
  disjoint = var_names(colnames(x)),
  min_length = 0,
  max_length = Inf,
  min_support = 0,
  min_focus_support = min_support,
  min_conditional_focus_support = 0,
  max_support = 1,
  filter_empty_foci = FALSE,
  t_norm = "goguen",
  max_results = Inf,
  verbose = FALSE,
  threads = 1L,
  error_context = list(
    arg_x = "x",
    arg_f = "f",
    arg_condition = "condition",
    arg_focus = "focus",
    arg_disjoint = "disjoint",
    arg_min_length = "min_length",
    arg_max_length = "max_length",
    arg_min_support = "min_support",
    arg_min_focus_support = "min_focus_support",
    arg_min_conditional_focus_support = "min_conditional_focus_support",
    arg_max_support = "max_support",
    arg_filter_empty_foci = "filter_empty_foci",
    arg_t_norm = "t_norm",
    arg_max_results = "max_results",
    arg_verbose = "verbose",
    arg_threads = "threads",
    call = current_env()
  )
)
x |
a matrix or data frame. The matrix must be numeric (double) or logical.
If |
f |
the callback function executed for each generated condition. This function may have some of the following arguments. Based on the present arguments, the algorithm would provide information about the generated condition:
|
condition |
a tidyselect expression (see tidyselect syntax) specifying the columns to use as condition predicates |
focus |
a tidyselect expression (see tidyselect syntax) specifying the columns to use as focus predicates |
disjoint |
an atomic vector of size equal to the number of columns of |
min_length |
the minimum size (the minimum number of predicates) of the condition to be generated (must be greater or equal to 0). If 0, the empty condition is generated in the first place. |
max_length |
The maximum size (the maximum number of predicates) of the condition to be generated. If equal to Inf, the maximum length of conditions is limited only by the number of available predicates. |
min_support |
the minimum support of a condition to trigger the callback
function for it. The support of the condition is the relative frequency
of the condition in the dataset |
min_focus_support |
the minimum support of a focus, for the focus to be passed to the callback function. The support of the focus is the relative frequency of rows such that all condition predicates AND the focus are TRUE on it. For numerical (double) input, the support is computed as the mean (over all rows) of multiplications of predicate values. |
min_conditional_focus_support |
the minimum relative support of a focus within a condition. The conditional support of the focus is the relative frequency of rows with focus being TRUE within rows where the condition is TRUE. |
max_support |
the maximum support of a condition to trigger the callback |
filter_empty_foci |
a logical scalar indicating whether to skip conditions,
for which no focus remains available after filtering by |
t_norm |
a t-norm used to compute conjunction of weights. It must be one of
|
max_results |
the maximum number of generated conditions to execute the
callback function on. If the number of found conditions exceeds
|
verbose |
a logical scalar indicating whether to print progress messages. |
threads |
the number of threads to use for parallel computation. |
error_context |
a list of details to be used in error messages.
This argument is useful when
|
A list of results provided by the callback function f
.
Michal Burda
partition()
, var_names()
, dig_grid()
library(tibble)

# Prepare iris data for use with dig()
d <- partition(iris, .breaks = 2)

# Call f() for each condition with support >= 0.5. The result is a list
# of strings representing the conditions.
dig(x = d,
    f = function(condition) {
      format_condition(names(condition))
    },
    min_support = 0.5)

# Create a more complex pattern object - a list with some statistics
res <- dig(x = d,
           f = function(condition, support) {
             list(condition = format_condition(names(condition)),
                  support = support)
           },
           min_support = 0.5)
print(res)

# Format the result as a data frame
do.call(rbind, lapply(res, as_tibble))

# Within each condition, evaluate also supports of columns starting with
# "Species"
res <- dig(x = d,
           f = function(condition, support, pp) {
             c(list(condition = format_condition(names(condition))),
               list(condition_support = support),
               as.list(pp / nrow(d)))
           },
           condition = !starts_with("Species"),
           focus = starts_with("Species"),
           min_support = 0.5,
           min_focus_support = 0)

# Format the result as a tibble
do.call(rbind, lapply(res, as_tibble))

# For each condition, create multiple patterns based on the focus columns
res <- dig(x = d,
           f = function(condition, support, pp) {
             lapply(seq_along(pp), function(i) {
               list(condition = format_condition(names(condition)),
                    condition_support = support,
                    focus = names(pp)[i],
                    focus_support = pp[[i]] / nrow(d))
             })
           },
           condition = !starts_with("Species"),
           focus = starts_with("Species"),
           min_support = 0.5,
           min_focus_support = 0)

# As res is now a list of lists, we need to flatten it before converting to
# a tibble
res <- unlist(res, recursive = FALSE)

# Format the result as a tibble
do.call(rbind, lapply(res, as_tibble))
Association rules identify conditions (antecedents) under which a specific feature (consequent) is present very often.
A => C
If condition A
is satisfied, then the feature C
is present very often.
university_edu & middle_age & IT_industry => high_income
People in middle age with university education working in the IT industry
very likely have a high income.
Antecedent A
is usually a set of predicates, and consequent C
is a single
predicate.
For the following explanations we need a mathematical function supp(I), which
is defined for a set I of predicates as the relative frequency of rows
satisfying all predicates from I. For logical data, supp(I) equals the
relative frequency of rows for which all predicates from I are TRUE.
For numerical (double) input, supp(I) is computed as the mean (over all rows)
of truth degrees of the formula i_1 AND i_2 AND ... AND i_n, where
I = {i_1, i_2, ..., i_n} and AND is a triangular norm selected by the t_norm
argument.
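As an illustration of how the fuzzy conjunction works, the sketch below evaluates the conjunction of two membership-degree vectors under the common t-norms. The names mirror the t_norm argument values, but the definitions used here are the standard textbook ones, not taken from this package's internals:

```r
# membership degrees of two fuzzy predicates over 4 rows
p1 <- c(1.0, 0.8, 0.4, 0.0)
p2 <- c(1.0, 0.5, 0.9, 0.3)

goguen      <- p1 * p2               # product t-norm (the default)
goedel      <- pmin(p1, p2)          # minimum t-norm
lukasiewicz <- pmax(0, p1 + p2 - 1)  # Lukasiewicz t-norm

# fuzzy support of the condition {p1, p2} under the Goguen t-norm:
# the mean of the row-wise conjunctions
mean(goguen)  # (1 + 0.4 + 0.36 + 0) / 4 = 0.44
```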
Association rules are characterized with the following quality measures:
- Length of a rule is the number of elements in the antecedent.
- Coverage of a rule is equal to supp(A).
- Consequent support of a rule is equal to supp({C}).
- Support of a rule is equal to supp(A ∪ {C}).
- Confidence of a rule is the fraction supp(A ∪ {C}) / supp(A).
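These measures can be computed by hand on crisp data. The base-R sketch below (with made-up data, independent of dig_associations()) evaluates coverage, support, and confidence for a rule a1 & a2 => cons:

```r
# logical dataset: antecedent predicates a1, a2 and consequent cons
a1   <- c(TRUE, TRUE, TRUE,  TRUE, FALSE)
a2   <- c(TRUE, TRUE, TRUE, FALSE, FALSE)
cons <- c(TRUE, TRUE, FALSE, TRUE,  TRUE)

A <- a1 & a2                       # rows satisfying the whole antecedent

coverage   <- mean(A)              # supp(A)       = 3/5 = 0.6
support    <- mean(A & cons)       # supp(A u {C}) = 2/5 = 0.4
confidence <- support / coverage   #               = 2/3
```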
dig_associations(
  x,
  antecedent = everything(),
  consequent = everything(),
  disjoint = var_names(colnames(x)),
  min_length = 0L,
  max_length = Inf,
  min_coverage = 0,
  min_support = 0,
  min_confidence = 0,
  contingency_table = FALSE,
  measures = NULL,
  t_norm = "goguen",
  max_results = Inf,
  verbose = FALSE,
  threads = 1
)
x |
a matrix or data frame with data to search in. The matrix must be
numeric (double) or logical. If |
antecedent |
a tidyselect expression (see tidyselect syntax) specifying the columns to use in the antecedent (left) part of the rules |
consequent |
a tidyselect expression (see tidyselect syntax) specifying the columns to use in the consequent (right) part of the rules |
disjoint |
an atomic vector of size equal to the number of columns of |
min_length |
the minimum length, i.e., the minimum number of predicates in the antecedent, of a rule to be generated. Value must be greater or equal to 0. If 0, rules with empty antecedent are generated in the first place. |
max_length |
The maximum length, i.e., the maximum number of predicates in the antecedent, of a rule to be generated. If equal to Inf, the maximum length is limited only by the number of available predicates. |
min_coverage |
the minimum coverage of a rule in the dataset |
min_support |
the minimum support of a rule in the dataset |
min_confidence |
the minimum confidence of a rule in the dataset |
contingency_table |
a logical value indicating whether to provide a contingency
table for each rule. If |
measures |
a character vector specifying the additional quality measures to compute.
If |
t_norm |
a t-norm used to compute conjunction of weights. It must be one of
|
max_results |
the maximum number of generated conditions to execute the
callback function on. If the number of found conditions exceeds
|
verbose |
a logical value indicating whether to print progress messages. |
threads |
the number of threads to use for parallel computation. |
A tibble with found patterns and computed quality measures.
Michal Burda
partition()
, var_names()
, dig()
d <- partition(mtcars, .breaks = 2)

dig_associations(d,
                 antecedent = !starts_with("mpg"),
                 consequent = starts_with("mpg"),
                 min_support = 0.3,
                 min_confidence = 0.8,
                 measures = c("lift", "conviction"))
Baseline contrast patterns identify conditions under which a specific feature is significantly different from a given value by performing a one-sample statistical test.
var != 0 | C
Variable var is (on average) significantly different from 0 under the
condition C.
measure_error != 0 | measure_tool_A
If measuring with measure tool A, the average measure error is
significantly different from 0.
The baseline contrast is computed using a one-sample statistical test, which
is specified by the method
argument. The function computes the contrast
between all variables specified by the vars
argument. Baseline contrasts
are computed in sub-data corresponding to conditions generated from the
condition
columns. Function dig_baseline_contrasts()
supports crisp
conditions only, i.e., the condition columns in x
must be logical.
dig_baseline_contrasts(
  x,
  condition = where(is.logical),
  vars = where(is.numeric),
  disjoint = var_names(colnames(x)),
  min_length = 0L,
  max_length = Inf,
  min_support = 0,
  max_support = 1,
  method = "t",
  alternative = "two.sided",
  h0 = 0,
  conf_level = 0.95,
  max_p_value = 0.05,
  wilcox_exact = FALSE,
  wilcox_correct = TRUE,
  wilcox_tol_root = 1e-04,
  wilcox_digits_rank = Inf,
  max_results = Inf,
  verbose = FALSE,
  threads = 1
)
x |
a matrix or data frame with data to search the patterns in. |
condition |
a tidyselect expression (see tidyselect syntax) specifying the columns to use as condition predicates |
vars |
a tidyselect expression (see tidyselect syntax) specifying the columns to use for computation of contrasts |
disjoint |
an atomic vector of size equal to the number of columns of |
min_length |
the minimum size (the minimum number of predicates) of the condition to be generated (must be greater or equal to 0). If 0, the empty condition is generated in the first place. |
max_length |
The maximum size (the maximum number of predicates) of the condition to be generated. If equal to Inf, the maximum length of conditions is limited only by the number of available predicates. |
min_support |
the minimum support of a condition to trigger the callback
function for it. The support of the condition is the relative frequency
of the condition in the dataset |
max_support |
the maximum support of a condition to trigger the callback
function for it. See argument |
method |
a character string indicating which contrast to compute.
One of |
alternative |
indicates the alternative hypothesis and must be one of
|
h0 |
a numeric value specifying the null hypothesis for the test. For
the |
conf_level |
a numeric value specifying the level of the confidence interval. The default value is 0.95. |
max_p_value |
the maximum p-value of a test for the pattern to be considered
significant. If the p-value of the test is greater than |
wilcox_exact |
(used for the |
wilcox_correct |
(used for the |
wilcox_tol_root |
(used for the |
wilcox_digits_rank |
(used for the |
max_results |
the maximum number of generated conditions to execute the
callback function on. If the number of found conditions exceeds
|
verbose |
a logical scalar indicating whether to print progress messages. |
threads |
the number of threads to use for parallel computation. |
A tibble with found patterns in rows. The following columns are always present:
condition |
the condition of the pattern as a character string
in the form |
support |
the support of the condition, i.e., the relative
frequency of the condition in the dataset |
var |
the name of the contrast variable. |
estimate |
the estimated mean or median of variable |
statistic |
the statistic of the selected test. |
p_value |
the p-value of the underlying test. |
n |
the number of rows in the sub-data corresponding to the condition. |
conf_int_lo |
the lower bound of the confidence interval of the estimate. |
conf_int_hi |
the upper bound of the confidence interval of the estimate. |
alternative |
a character string indicating the alternative
hypothesis. The value must be one of |
method |
a character string indicating the method used for the test. |
comment |
a character string with additional information about the test (mainly error messages on failure). |
For the "t"
method, the following additional columns are also
present (see also t.test()
):
df |
the degrees of freedom of the t test. |
stderr |
the standard error of the mean. |
Michal Burda
dig_paired_baseline_contrasts()
, dig_complement_contrasts()
,
dig()
, dig_grid()
,
stats::t.test()
, stats::wilcox.test()
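This page carries no examples section. The following sketch is a hypothetical invocation based solely on the signature above; the choice of the iris data and of h0 = 6 is purely illustrative:

```r
# dichotomize iris$Species into logical condition columns
d <- partition(iris, Species)

# test, within each generated condition, whether the mean of each
# numeric variable differs significantly from 6
dig_baseline_contrasts(d,
                       condition = where(is.logical),
                       vars = where(is.numeric),
                       method = "t",
                       h0 = 6,
                       max_p_value = 0.05)
```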
Complement contrast patterns identify conditions under which there is a significant difference in some numerical variable between elements that satisfy the identified condition and the rest of the data table.
(var | C) != (var | not C)
There is a statistically significant difference in variable var
between
group of elements that satisfy condition C
and a group of elements that
do not satisfy condition C
.
(life_expectancy | smoker) < (life_expectancy | non-smoker)
The life expectancy of people who smoke cigarettes is on average
significantly lower than of people who do not smoke.
The complement contrast is computed using a two-sample statistical test,
which is specified by the method
argument. The function computes the
complement contrast in all variables specified by the vars
argument.
Complement contrasts are computed based on sub-data corresponding
to conditions generated from the condition
columns and the rest of the
data table. Function dig_complement_contrasts()
supports crisp
conditions only, i.e., the condition columns in x
must be logical.
dig_complement_contrasts(
  x,
  condition = where(is.logical),
  vars = where(is.numeric),
  disjoint = var_names(colnames(x)),
  min_length = 0L,
  max_length = Inf,
  min_support = 0,
  max_support = 1 - min_support,
  method = "t",
  alternative = "two.sided",
  h0 = if (method == "var") 1 else 0,
  conf_level = 0.95,
  max_p_value = 0.05,
  t_var_equal = FALSE,
  wilcox_exact = FALSE,
  wilcox_correct = TRUE,
  wilcox_tol_root = 1e-04,
  wilcox_digits_rank = Inf,
  max_results = Inf,
  verbose = FALSE,
  threads = 1L
)
x |
a matrix or data frame with data to search the patterns in. |
condition |
a tidyselect expression (see tidyselect syntax) specifying the columns to use as condition predicates |
vars |
a tidyselect expression (see tidyselect syntax) specifying the columns to use for computation of contrasts |
disjoint |
an atomic vector of size equal to the number of columns of |
min_length |
the minimum size (the minimum number of predicates) of the condition to be generated (must be greater or equal to 0). If 0, the empty condition is generated in the first place. |
max_length |
The maximum size (the maximum number of predicates) of the condition to be generated. If equal to Inf, the maximum length of conditions is limited only by the number of available predicates. |
min_support |
the minimum support of a condition to trigger the callback
function for it. The support of the condition is the relative frequency
of the condition in the dataset |
max_support |
the maximum support of a condition to trigger the callback
function for it. See argument |
method |
a character string indicating which contrast to compute.
One of |
alternative |
indicates the alternative hypothesis and must be one of
|
h0 |
a numeric value specifying the null hypothesis for the test. For
the |
conf_level |
a numeric value specifying the level of the confidence interval. The default value is 0.95. |
max_p_value |
the maximum p-value of a test for the pattern to be considered
significant. If the p-value of the test is greater than |
t_var_equal |
(used for the |
wilcox_exact |
(used for the |
wilcox_correct |
(used for the |
wilcox_tol_root |
(used for the |
wilcox_digits_rank |
(used for the |
max_results |
the maximum number of generated conditions to execute the
callback function on. If the number of found conditions exceeds
|
verbose |
a logical scalar indicating whether to print progress messages. |
threads |
the number of threads to use for parallel computation. |
A tibble with found patterns in rows. The following columns are always present:
condition |
the condition of the pattern as a character string
in the form |
support |
the support of the condition, i.e., the relative
frequency of the condition in the dataset |
var |
the name of the contrast variable. |
estimate |
the estimated value (see the documentation of the underlying test). |
statistic |
the statistic of the selected test. |
p_value |
the p-value of the underlying test. |
n_x |
the number of rows in the sub-data corresponding to the condition. |
n_y |
the number of rows in the sub-data corresponding to the negation of the condition. |
conf_int_lo |
the lower bound of the confidence interval of the estimate. |
conf_int_hi |
the upper bound of the confidence interval of the estimate. |
alternative |
a character string indicating the alternative
hypothesis. The value must be one of |
method |
a character string indicating the method used for the test. |
comment |
a character string with additional information about the test (mainly error messages on failure). |
For the "t"
method, the following additional columns are also
present (see also t.test()
):
df |
the degrees of freedom of the t test. |
stderr |
the standard error of the mean difference. |
Michal Burda
dig_baseline_contrasts()
, dig_paired_baseline_contrasts()
,
dig()
, dig_grid()
,
stats::t.test()
, stats::wilcox.test()
, stats::var.test()
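This page carries no examples section. The following sketch is a hypothetical invocation based solely on the signature above; the choice of the iris data is purely illustrative:

```r
# dichotomize iris$Species into logical condition columns
d <- partition(iris, Species)

# compare each numeric variable inside every generated condition against
# the rest of the data table using a two-sample t test
dig_complement_contrasts(d,
                         condition = where(is.logical),
                         vars = where(is.numeric),
                         method = "t",
                         max_p_value = 0.05)
```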
Conditional correlations are patterns that identify strong relationships between pairs of numeric variables under specific conditions.
xvar ~ yvar | C
xvar and yvar are highly correlated in data that satisfy the condition C.
study_time ~ test_score | hard_exam
For hard exams, the amount of study time is highly correlated with
the obtained test score.
The function computes correlations between all combinations of xvars
and
yvars
columns of x
in multiple sub-data corresponding to conditions
generated from condition
columns.
dig_correlations(
  x,
  condition = where(is.logical),
  xvars = where(is.numeric),
  yvars = where(is.numeric),
  disjoint = var_names(colnames(x)),
  method = "pearson",
  alternative = "two.sided",
  exact = NULL,
  min_length = 0L,
  max_length = Inf,
  min_support = 0,
  max_support = 1,
  max_results = Inf,
  verbose = FALSE,
  threads = 1
)
x |
a matrix or data frame with data to search in. |
condition |
a tidyselect expression (see tidyselect syntax) specifying the columns to use as condition predicates |
xvars |
a tidyselect expression (see tidyselect syntax) specifying the columns to use for computation of correlations |
yvars |
a tidyselect expression (see tidyselect syntax) specifying the columns to use for computation of correlations |
disjoint |
an atomic vector of size equal to the number of columns of |
method |
a character string indicating which correlation coefficient is
to be used for the test. One of |
alternative |
indicates the alternative hypothesis and must be one of
|
exact |
a logical indicating whether an exact p-value should be computed.
Used for Kendall's tau and Spearman's rho. See |
min_length |
the minimum size (the minimum number of predicates) of the condition to be generated (must be greater or equal to 0). If 0, the empty condition is generated in the first place. |
max_length |
The maximum size (the maximum number of predicates) of the condition to be generated. If equal to Inf, the maximum length of conditions is limited only by the number of available predicates. |
min_support |
the minimum support of a condition to trigger the callback
function for it. The support of the condition is the relative frequency
of the condition in the dataset |
max_support |
the maximum support of a condition to trigger the callback
function for it. See argument |
max_results |
the maximum number of generated conditions to execute the
callback function on. If the number of found conditions exceeds
|
verbose |
a logical scalar indicating whether to print progress messages. |
threads |
the number of threads to use for parallel computation. |
A tibble with found patterns.
Michal Burda
# convert iris$Species into dummy logical variables
d <- partition(iris, Species)

# find conditional correlations between all pairs of numeric variables
dig_correlations(d,
                 condition = where(is.logical),
                 xvars = Sepal.Length:Petal.Width,
                 yvars = Sepal.Length:Petal.Width)

# With `condition = NULL`, dig_correlations() computes correlations between
# all pairs of numeric variables on the whole dataset only, which is an
# alternative way of computing the correlation matrix
dig_correlations(iris,
                 condition = NULL,
                 xvars = Sepal.Length:Petal.Width,
                 yvars = Sepal.Length:Petal.Width)
This function creates a grid of combinations of column names specified by
xvars and yvars (see var_grid()). After that, it enumerates all conditions
created from data in x (by calling dig()) and, for each such condition and
for each row of the grid of combinations, executes a user-defined function f
on the sub-data created from x by selecting all rows of x that satisfy the
generated condition and by selecting the columns in the grid's row.
The function is useful for searching for patterns that are based on the
relationships between pairs of columns, such as in dig_correlations().
dig_grid(
  x,
  f,
  condition = where(is.logical),
  xvars = where(is.numeric),
  yvars = where(is.numeric),
  disjoint = var_names(colnames(x)),
  allow = "all",
  na_rm = FALSE,
  type = "crisp",
  min_length = 0L,
  max_length = Inf,
  min_support = 0,
  max_support = 1,
  max_results = Inf,
  verbose = FALSE,
  threads = 1L,
  error_context = list(
    arg_x = "x",
    arg_f = "f",
    arg_condition = "condition",
    arg_xvars = "xvars",
    arg_yvars = "yvars",
    arg_disjoint = "disjoint",
    arg_allow = "allow",
    arg_na_rm = "na_rm",
    arg_type = "type",
    arg_min_length = "min_length",
    arg_max_length = "max_length",
    arg_min_support = "min_support",
    arg_max_support = "max_support",
    arg_max_results = "max_results",
    arg_verbose = "verbose",
    arg_threads = "threads",
    call = current_env()
  )
)
x |
a matrix or data frame with data to search in. |
f |
the callback function to be executed for each generated condition.
The arguments of the callback function differ based on the value of the
In all cases, the function must return a list of scalar values, which will be converted into a single row of result of final tibble. |
condition |
a tidyselect expression (see tidyselect syntax) specifying the columns to use as condition predicates. The selected columns must be logical or numeric. If numeric, fuzzy conditions are considered. |
xvars |
a tidyselect expression (see
tidyselect syntax)
specifying the columns of |
yvars |
|
disjoint |
an atomic vector of size equal to the number of columns of |
allow |
a character string specifying which columns are allowed to be
selected by
|
na_rm |
a logical value indicating whether to remove rows with missing
values from sub-data before the callback function |
type |
a character string specifying the type of conditions to be processed.
The |
min_length |
the minimum size (the minimum number of predicates) of the condition to be generated (must be greater or equal to 0). If 0, the empty condition is generated in the first place. |
max_length |
the maximum size (the maximum number of predicates) of the condition to be generated. If equal to Inf, the maximum length of conditions is limited only by the number of available predicates. |
min_support |
the minimum support of a condition to trigger the callback
function for it. The support of the condition is the relative frequency
of the condition in the dataset x. |
max_support |
the maximum support of a condition to trigger the callback
function for it. See argument min_support for the definition of support. |
max_results |
the maximum number of generated conditions to execute the
callback function on. If the number of found conditions exceeds
max_results, the remaining conditions are not processed. |
verbose |
a logical scalar indicating whether to print progress messages. |
threads |
the number of threads to use for parallel computation. |
error_context |
a list of details to be used in error messages.
This argument is useful when dig_grid() is called from another function:
error messages may then refer to the argument names of the calling function. |
A tibble with found patterns. Each row represents a single call of the callback function f.
Michal Burda
dig(), var_grid(); see also dig_correlations() and dig_paired_baseline_contrasts(), which use this function internally.
# *** Example of crisp (boolean) patterns:

# dichotomize iris$Species
crispIris <- partition(iris, Species)

# a simple callback function that computes mean difference of `xvar` and `yvar`
f <- function(pd) {
    list(m = mean(pd[[1]] - pd[[2]]),
         n = nrow(pd))
}

# call f() for each condition created from column `Species`
dig_grid(crispIris, f,
         condition = starts_with("Species"),
         xvars = starts_with("Sepal"),
         yvars = starts_with("Petal"),
         type = "crisp")

# *** Example of fuzzy patterns:

# create fuzzy sets from Sepal columns
fuzzyIris <- partition(iris, starts_with("Sepal"),
                       .method = "triangle", .breaks = 3)

# a simple callback function that computes a weighted mean of a difference of
# `xvar` and `yvar`
f <- function(d, weights) {
    list(m = weighted.mean(d[[1]] - d[[2]], w = weights),
         w = sum(weights))
}

# call f() for each fuzzy condition created from column fuzzy sets whose
# names start with "Sepal"
dig_grid(fuzzyIris, f,
         condition = starts_with("Sepal"),
         xvars = Petal.Length,
         yvars = Petal.Width,
         type = "fuzzy")
Paired baseline contrast patterns identify conditions under which there is a significant difference in some statistical feature between two paired numeric variables.
(xvar - yvar) != 0 | C
There is a statistically significant difference between paired variables xvar and yvar under the condition C.
(daily_ice_cream_income - daily_tea_income) > 0 | sunny
Under the condition of sunny weather, the paired test shows that daily ice-cream income is significantly higher than the daily tea income.
The paired baseline contrast is computed using a paired version of a statistical test, which is specified by the method argument. The function computes the paired contrast between all pairs of variables, where the first variable is specified by the xvars argument and the second variable is specified by the yvars argument. Paired baseline contrasts are computed in sub-data corresponding to conditions generated from the condition columns. The function dig_paired_baseline_contrasts() supports crisp conditions only, i.e., the condition columns in x must be logical.
dig_paired_baseline_contrasts(
  x,
  condition = where(is.logical),
  xvars = where(is.numeric),
  yvars = where(is.numeric),
  disjoint = var_names(colnames(x)),
  min_length = 0L,
  max_length = Inf,
  min_support = 0,
  max_support = 1,
  method = "t",
  alternative = "two.sided",
  h0 = 0,
  conf_level = 0.95,
  max_p_value = 1,
  t_var_equal = FALSE,
  wilcox_exact = FALSE,
  wilcox_correct = TRUE,
  wilcox_tol_root = 1e-04,
  wilcox_digits_rank = Inf,
  max_results = Inf,
  verbose = FALSE,
  threads = 1
)
x |
a matrix or data frame with data to search the patterns in. |
condition |
a tidyselect expression (see tidyselect syntax) specifying the columns to use as condition predicates |
xvars |
a tidyselect expression (see tidyselect syntax) specifying the columns to use for computation of contrasts |
yvars |
a tidyselect expression (see tidyselect syntax) specifying the columns to use for computation of contrasts |
disjoint |
an atomic vector of size equal to the number of columns of x that specifies groups of predicates: columns with equal values of disjoint will never appear together in a single condition. |
min_length |
the minimum size (the minimum number of predicates) of the condition to be generated (must be greater or equal to 0). If 0, the empty condition is generated in the first place. |
max_length |
the maximum size (the maximum number of predicates) of the condition to be generated. If equal to Inf, the maximum length of conditions is limited only by the number of available predicates. |
min_support |
the minimum support of a condition to trigger the callback
function for it. The support of the condition is the relative frequency
of the condition in the dataset x. |
max_support |
the maximum support of a condition to trigger the callback
function for it. See argument min_support for the definition of support. |
method |
a character string indicating which contrast to compute.
One of "t" (the paired t-test) or "wilcox" (the paired Wilcoxon signed-rank test). |
alternative |
indicates the alternative hypothesis and must be one of
"two.sided", "greater", or "less". |
h0 |
a numeric value specifying the null hypothesis for the test, i.e.,
the assumed difference between the paired variables; the default is 0. |
conf_level |
a numeric value specifying the level of the confidence interval. The default value is 0.95. |
max_p_value |
the maximum p-value of a test for the pattern to be considered
significant. If the p-value of the test is greater than max_p_value, the pattern is not included in the result. |
t_var_equal |
(used for the "t" method only) a logical value passed to stats::t.test() as the var.equal argument. |
wilcox_exact |
(used for the "wilcox" method only) a value passed to stats::wilcox.test() as the exact argument. |
wilcox_correct |
(used for the "wilcox" method only) a value passed to stats::wilcox.test() as the correct argument. |
wilcox_tol_root |
(used for the "wilcox" method only) a value passed to stats::wilcox.test() as the tol.root argument. |
wilcox_digits_rank |
(used for the "wilcox" method only) a value passed to stats::wilcox.test() as the digits.rank argument. |
max_results |
the maximum number of generated conditions to execute the
callback function on. If the number of found conditions exceeds
max_results, the remaining conditions are not processed. |
verbose |
a logical scalar indicating whether to print progress messages. |
threads |
the number of threads to use for parallel computation. |
A tibble with found patterns in rows. The following columns are always present:
condition |
the condition of the pattern as a character string
in the form {p1,p2,...,pn}, where p1, ..., pn are predicates (see format_condition()). |
support |
the support of the condition, i.e., the relative
frequency of the condition in the dataset x. |
xvar |
the name of the first variable in the contrast. |
yvar |
the name of the second variable in the contrast. |
estimate |
the estimated difference between variables xvar and yvar. |
statistic |
the statistic of the selected test. |
p_value |
the p-value of the underlying test. |
n |
the number of rows in the sub-data corresponding to the condition. |
conf_int_lo |
the lower bound of the confidence interval of the estimate. |
conf_int_hi |
the upper bound of the confidence interval of the estimate. |
alternative |
a character string indicating the alternative
hypothesis. The value must be one of "two.sided", "greater", or "less". |
method |
a character string indicating the method used for the test. |
comment |
a character string with additional information about the test (mainly error messages on failure). |
For the "t" method, the following additional columns are also present (see also t.test()):
df |
the degrees of freedom of the t test. |
stderr |
the standard error of the mean difference. |
Michal Burda
dig_baseline_contrasts(), dig_complement_contrasts(), dig(), dig_grid(), stats::t.test(), stats::wilcox.test()
# Compute ratio of sepal and petal length and width for iris dataset
crispIris <- iris
crispIris$Sepal.Ratio <- iris$Sepal.Length / iris$Sepal.Width
crispIris$Petal.Ratio <- iris$Petal.Length / iris$Petal.Width

# Create predicates from the Species column
crispIris <- partition(crispIris, Species)

# Compute paired contrasts for ratios of sepal and petal length and width
dig_paired_baseline_contrasts(crispIris,
                              condition = where(is.logical),
                              xvars = Sepal.Ratio,
                              yvars = Petal.Ratio,
                              method = "t",
                              min_support = 0.1)
The function takes a character vector of predicates and returns a formatted condition: a string with the predicates separated by commas and enclosed in curly braces.
format_condition(condition)
condition |
a character vector of predicates to be formatted |
a character scalar with a formatted condition
Michal Burda
format_condition(NULL)             # returns {}
format_condition(c("a", "b", "c")) # returns {a,b,c}
Tests whether the given argument is a numeric value from the interval [0, 1].
is_degree(x, na_rm = FALSE)
x |
the value to be tested |
na_rm |
whether to ignore NA values |
TRUE if x is a numeric vector, matrix or array with values between 0 and 1; otherwise, FALSE is returned. If na_rm is TRUE, NA values are treated as valid values. If na_rm is FALSE and x contains NA values, FALSE is returned.
Michal Burda
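This help page carries no Examples section; the following sketch illustrates the behavior documented above (the commented return values follow directly from that description):

```r
# values within [0, 1] form a valid vector of membership degrees
is_degree(c(0, 0.5, 1))              # TRUE
# a value outside of [0, 1] invalidates the whole vector
is_degree(c(-0.1, 0.5))              # FALSE
# with the default na_rm = FALSE, any NA yields FALSE
is_degree(c(0.5, NA))                # FALSE
# with na_rm = TRUE, NA values are treated as valid
is_degree(c(0.5, NA), na_rm = TRUE)  # TRUE
```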
Determine whether the first vector is a subset of the second vector
is_subset(x, y)
x |
the first vector |
y |
the second vector |
TRUE if x is a subset of y, or FALSE otherwise. x is considered a subset of y if all elements of x are also in y, i.e., if setdiff(x, y) is a vector of length 0.
Michal Burda
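No Examples section accompanies this page; a brief sketch of the set semantics described above:

```r
# every element of x appears in y, so x is a subset of y
is_subset(c("a", "b"), c("a", "b", "c"))  # TRUE
# "d" does not appear in y
is_subset(c("a", "d"), c("a", "b", "c"))  # FALSE
# the empty vector is a subset of any vector, since setdiff() is empty
is_subset(character(0), "a")              # TRUE
```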
Convert the selected columns of the data frame into either dummy logical columns or into membership degrees of fuzzy sets, while leaving the remaining columns untouched. Each column selected for transformation typically yields multiple columns in the output.
partition(
  .data,
  .what = everything(),
  ...,
  .breaks = NULL,
  .labels = NULL,
  .na = TRUE,
  .keep = FALSE,
  .method = "crisp",
  .right = TRUE
)
.data |
the data frame to be processed |
.what |
a tidyselect expression (see tidyselect syntax) specifying the columns to be transformed |
... |
optional other tidyselect expressions selecting additional columns to be processed |
.breaks |
for numeric columns, this has to be either an integer scalar
or a numeric vector. If an integer scalar, it specifies the number of intervals (or fuzzy sets) to create; if a numeric vector, it specifies the break points directly. |
.labels |
character vector specifying the names used to construct
the newly created column names. If NULL, the labels are generated automatically. |
.na |
if TRUE, an extra column indicating NA values is created for each transformed column that contains missing values. |
.keep |
if TRUE, the original (source) columns are kept in the result alongside the newly created columns. |
.method |
The method of transformation for numeric columns. Either
"crisp", "triangle", or "raisedcos". |
.right |
If TRUE, the intervals created by the "crisp" method are right-closed (as in base::cut()). |
Transformations performed by this function are typically useful as a preprocessing step before using the dig() function or some of its derivatives (dig_correlations(), dig_paired_baseline_contrasts(), dig_associations()).
The transformation of selected columns differs based on the column type. Concretely:

- a logical column x is transformed into a pair of logical columns, x=TRUE and x=FALSE;
- a factor column x, which has levels l1, l2, and l3, is transformed into three logical columns named x=l1, x=l2, and x=l3;
- a numeric column x is transformed according to the .method argument:
  - if .method="crisp", the column is first transformed into a factor with intervals as factor levels and then it is processed as a factor (see above);
  - for the other methods (triangle or raisedcos), several new columns are created, where each column has numeric values from the interval [0, 1] and represents a certain fuzzy set (either triangular or raised-cosine).

Details of the transformation of numeric columns can be specified with additional arguments (.breaks, .labels, .right).
A tibble created by transforming .data.
Michal Burda
# transform logical columns and factors
d <- data.frame(a = c(TRUE, TRUE, FALSE),
                b = factor(c("A", "B", "A")),
                c = c(1, 2, 3))
partition(d, a, b)

# transform numeric columns to logical columns (crisp transformation)
partition(CO2, conc:uptake, .method = "crisp", .breaks = 3)

# transform numeric columns to fuzzy sets (triangle transformation)
partition(CO2, conc:uptake, .method = "triangle", .breaks = 3)

# complex transformation with different settings for each column
CO2 |>
    partition(Plant:Treatment) |>
    partition(conc,
              .method = "raisedcos",
              .breaks = c(-Inf, 95, 175, 350, 675, 1000, Inf)) |>
    partition(uptake,
              .method = "triangle",
              .breaks = c(-Inf, 7.7, 28.3, 45.5, Inf),
              .labels = c("low", "medium", "high"))
xvars and yvars arguments are tidyselect expressions (see tidyselect syntax) that specify the columns of x whose names will be used as a domain for combinations.

If yvars is NULL, the function creates a tibble with one column var enumerating all column names specified by the xvars argument.

If yvars is not NULL, the function creates a tibble with two columns, xvar and yvar, whose rows enumerate all combinations of column names specified by the xvars and yvars arguments.

It is allowed to specify the same column in both xvars and yvars arguments. In such a case, the combinations of the same column with itself are removed from the result.

In other words, the function creates a grid of all possible pairs (xvar, yvar) where xvar is selected by xvars, yvar is selected by yvars, and xvar != yvar.
var_grid(
  x,
  xvars = everything(),
  yvars = everything(),
  allow = "all",
  xvar_name = if (quo_is_null(enquo(yvars))) "var" else "xvar",
  yvar_name = "yvar",
  error_context = list(
    arg_x = "x",
    arg_xvars = "xvars",
    arg_yvars = "yvars",
    arg_allow = "allow",
    arg_xvar_name = "xvar_name",
    arg_yvar_name = "yvar_name",
    call = current_env()
  )
)
x |
either a data frame or a matrix |
xvars |
a tidyselect expression (see tidyselect syntax) specifying the columns of x whose names will be used as a domain for combinations. |
yvars |
NULL or a tidyselect expression (see tidyselect syntax) specifying the columns of x whose names will be used as the second element of the combinations; if NULL, no pairs are created and only a single var column is produced. |
allow |
a character string specifying which columns are allowed to be
selected by the xvars and yvars arguments. Possible values are
"all" (any column) and "numeric" (numeric columns only). |
xvar_name |
the name of the first column in the resulting tibble. |
yvar_name |
the name of the second column in the resulting tibble.
The column does not exist if yvars is NULL. |
error_context |
A list of details to be used in error messages.
This argument is useful when var_grid() is called from another function:
error messages may then refer to the argument names of the calling function. |
If yvars is NULL, the function returns a tibble with a single column (var). If yvars is a non-NULL expression, the function returns a tibble with two columns (xvar and yvar) whose rows enumerate all combinations of column names specified by the tidyselect expressions in the xvars and yvars arguments.
Michal Burda
# Create a grid of combinations of all pairs of columns in the CO2 dataset:
var_grid(CO2)

# Create a grid of combinations of all pairs of columns in the CO2 dataset
# such that the first, i.e., `xvar` column is `Plant`, `Type`, or
# `Treatment`, and the second, i.e., `yvar` column is `conc` or `uptake`:
var_grid(CO2, xvars = Plant:Treatment, yvars = conc:uptake)
The function assumes that x is a vector of predicate names, i.e., a character vector with elements compatible with the pattern <varname>=<value>. The function returns the <varname> part of these elements. If a string does not correspond to the pattern <varname>=<value>, i.e., if the equal sign (=) is missing in the string, the whole string is returned.
var_names(x)
x |
A character vector of predicate names. |
A <varname> part of predicate names in x.
Michal Burda
var_names(c("a=1", "a=2", "b=x", "b=y")) # returns c("a", "a", "b", "b")
The function returns the indices of elements from the given list x that are incomparable (i.e., neither a subset nor a superset) with any previously selected element. The first element is always selected. Each subsequent element is selected only if it is incomparable with all previously selected elements.
which_antichain(x, distance = 0)
x |
a list of integerish vectors |
distance |
a non-negative integer, which specifies the allowed discrepancy between compared sets |
an integer vector of indices of selected (incomparable) elements.
Michal Burda
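No usage example is given on this page; the following sketch traces the selection rule described above (with the default distance = 0):

```r
sets <- list(c(1L, 2L),     # kept: the first element is always selected
             c(1L, 2L, 3L), # dropped: a superset of the first selected element
             c(1L),         # dropped: a subset of the first selected element
             c(3L, 4L))     # kept: incomparable with c(1, 2)
which_antichain(sets)       # expected to return c(1, 4) per the rule above
```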