Title: Extensible Data Pattern Searching Framework
Description: Extensible framework for subgroup discovery (Atzmueller (2015) <doi:10.1002/widm.1144>), contrast patterns (Chen (2022) <doi:10.48550/arXiv.2209.13556>), emerging patterns (Dong (1999) <doi:10.1145/312129.312191>), association rules (Agrawal (1994) <https://www.vldb.org/conf/1994/P487.PDF>) and conditional correlations (Hájek (1978) <doi:10.1007/978-3-642-66943-9>). Both crisp (Boolean, binary) and fuzzy data are supported. The framework generates conditions in the form of elementary conjunctions, evaluates them on a dataset, and checks the induced sub-data for interesting statistical properties. A user-defined function may be evaluated on each generated condition to search for custom patterns.
Authors: Michal Burda [aut, cre]
Maintainer: Michal Burda <[email protected]>
License: GPL (>= 3)
Version: 1.3.0
Built: 2024-11-25 16:18:24 UTC
Source: CRAN
Create dummy logical columns from selected columns of a data frame. Dummy columns may be created for logical or factor columns, as described below.
dichotomize(.data, what = everything(), ..., .keep = FALSE, .other = FALSE)
.data: a data frame to be processed
what: a tidyselect expression (see tidyselect syntax) selecting the columns to be processed
...: further tidyselect expressions selecting the columns to be processed
.keep: whether to keep the original columns. If FALSE, the original columns are removed from the result.
.other: whether to put into the result the rest of the columns that were not selected for dichotomization
- for a logical column col, a pair of columns named col=T and col=F is created, where the former (resp. latter) equals the original column (resp. its negation);
- for a factor column col, a new logical column named col=l is created for each level l of the factor col, with values set to TRUE wherever the original column equals l.
A tibble with selected columns replaced with dummy columns.
Michal Burda
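A minimal usage sketch (not part of the original documentation), applied to the built-in CO2 data set; it assumes that .other = TRUE carries the non-selected numeric columns over to the result:
# Create dummy logical columns from the factor columns of CO2;
# conc and uptake are kept unchanged thanks to .other = TRUE.
d <- dichotomize(CO2, Plant:Treatment, .other = TRUE)
head(colnames(d))   # e.g. "Plant=Qn1", "Type=Quebec", "Treatment=nonchilled", ...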
This is a general function that enumerates all conditions created from data in x and calls the callback function f on each of them.
dig(x, f, condition = everything(), focus = NULL, disjoint = NULL, min_length = 0, max_length = Inf, min_support = 0, min_focus_support = min_support, filter_empty_foci = FALSE, t_norm = "goguen", threads = 1, ...)
x: a matrix or data frame. The matrix must be numeric (double) or logical.
f: the callback function executed for each generated condition. The function may declare some of a set of supported arguments; based on the arguments that are present, the algorithm provides the corresponding information about the generated condition.
condition: a tidyselect expression (see tidyselect syntax) specifying the columns to use as condition predicates
focus: a tidyselect expression (see tidyselect syntax) specifying the columns to use as focus predicates
disjoint: an atomic vector of size equal to the number of columns of x
min_length: the minimum size (the minimum number of predicates) of the condition to be generated (must be greater or equal to 0). If 0, the empty condition is generated in the first place.
max_length: the maximum size (the maximum number of predicates) of the condition to be generated. If equal to Inf, the maximum length of conditions is limited only by the number of available predicates.
min_support: the minimum support of a condition to trigger the callback function for it. The support of the condition is the relative frequency of the condition in the dataset x.
min_focus_support: the minimum support of a focus, for the focus to be passed to the callback function. The support of the focus is the relative frequency of rows such that all condition predicates AND the focus are TRUE. For numerical (double) input, the support is computed as the mean (over all rows) of multiplications of predicate values.
filter_empty_foci: a logical scalar indicating whether to skip conditions for which no focus remains available after filtering by min_focus_support
t_norm: a t-norm used to compute the conjunction of weights
threads: the number of threads to use for parallel computation
...: further arguments, currently unused
A list of results provided by the callback function f.
Michal Burda
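A minimal sketch of a custom callback (not part of the original documentation). The callback argument names condition and support are assumptions about the information dig() can pass to f; the data preparation is purely illustrative:
# Dichotomize the factor columns of CO2 into logical predicate columns
crisp <- partition(CO2[c("Plant", "Type", "Treatment")])

# For each generated condition, record its length and support
res <- dig(crisp,
           f = function(condition, support) {
             list(n_predicates = length(condition), support = support)
           },
           min_support = 0.1,
           max_length = 2)
length(res)   # number of conditions that passed the thresholds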
Search for contrast patterns
dig_contrasts(x, condition = where(is.logical), xvars = where(is.numeric), yvars = where(is.numeric), method = "t", alternative = "two.sided", min_length = 0L, max_length = Inf, min_support = 0, threads = 1, ...)
x: a matrix or data frame with data to search in
condition: a tidyselect expression (see tidyselect syntax) specifying the columns to use as condition predicates
xvars: a tidyselect expression (see tidyselect syntax) specifying the columns to use for computation of contrasts
yvars: a tidyselect expression (see tidyselect syntax) specifying the columns to use for computation of contrasts
method: a character string indicating which contrast to compute
alternative: indicates the alternative hypothesis and must be one of "two.sided", "greater" or "less"
min_length: the minimum size (the minimum number of predicates) of the condition to be generated (must be greater or equal to 0). If 0, the empty condition is generated in the first place.
max_length: the maximum size (the maximum number of predicates) of the condition to be generated. If equal to Inf, the maximum length of conditions is limited only by the number of available predicates.
min_support: the minimum support of a condition to trigger the callback function for it. The support of the condition is the relative frequency of the condition in the dataset x.
threads: the number of threads to use for parallel computation
...: further arguments passed to the underlying test function (stats::t.test(), stats::wilcox.test() or stats::var.test(), depending on method)
A tibble with found rules.
Michal Burda
See also: dig(), dig_grid(), stats::t.test(), stats::wilcox.test(), stats::var.test()
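A hypothetical invocation (not taken from the package examples); the choice of data and thresholds is illustrative. Conditions are built from the dichotomized factors, and t-test contrasts between conc and uptake are computed within each sub-data:
crisp <- partition(CO2, Plant:Treatment)
dig_contrasts(crisp,
              condition = where(is.logical),
              xvars = conc,
              yvars = uptake,
              method = "t",
              min_support = 0.1)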
Compute correlation between all combinations of xvars and yvars columns of x in sub-data corresponding to conditions generated from the condition columns.
dig_correlations(x, condition = where(is.logical), xvars = where(is.numeric), yvars = where(is.numeric), method = "pearson", alternative = "two.sided", exact = NULL, min_length = 0L, max_length = Inf, min_support = 0, threads = 1, ...)
x: a matrix or data frame with data to search in
condition: a tidyselect expression (see tidyselect syntax) specifying the columns to use as condition predicates
xvars: a tidyselect expression (see tidyselect syntax) specifying the columns to use for computation of correlations
yvars: a tidyselect expression (see tidyselect syntax) specifying the columns to use for computation of correlations
method: a character string indicating which correlation coefficient is to be used for the test: one of "pearson", "kendall" or "spearman"
alternative: indicates the alternative hypothesis and must be one of "two.sided", "greater" or "less"
exact: a logical indicating whether an exact p-value should be computed. Used for Kendall's tau and Spearman's rho. See stats::cor.test().
min_length: the minimum size (the minimum number of predicates) of the condition to be generated (must be greater or equal to 0). If 0, the empty condition is generated in the first place.
max_length: the maximum size (the maximum number of predicates) of the condition to be generated. If equal to Inf, the maximum length of conditions is limited only by the number of available predicates.
min_support: the minimum support of a condition to trigger the callback function for it. The support of the condition is the relative frequency of the condition in the dataset x.
threads: the number of threads to use for parallel computation
...: further arguments, currently unused
A tibble with found rules.
Michal Burda
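A hypothetical call (not taken from the package examples), computing the Spearman correlation between conc and uptake within sub-data defined by conditions over the dichotomized factor columns:
crisp <- partition(CO2, Plant:Treatment)
dig_correlations(crisp,
                 condition = where(is.logical),
                 xvars = conc,
                 yvars = uptake,
                 method = "spearman",
                 min_support = 0.1)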
This function creates a grid of combinations of pairs of columns specified by xvars and yvars (see also var_grid()). After that, it enumerates all conditions created from data in x (by calling dig()) and, for each such condition and each row of the grid of combinations, executes a user-defined function f on the sub-data created from x by selecting all rows of x that satisfy the generated condition and by selecting the columns in the grid's row.
dig_grid(x, f, condition = where(is.logical), xvars = where(is.numeric), yvars = where(is.numeric), na_rm = FALSE, type = "bool", min_length = 0L, max_length = Inf, min_support = 0, threads = 1, ...)
x: a matrix or data frame with data to search in
f: the callback function to be executed for each generated condition. The arguments of the callback function differ based on the value of the type argument.
condition: a tidyselect expression (see tidyselect syntax) specifying the columns to use as condition predicates. The selected columns must be logical or numeric. If numeric, fuzzy conditions are considered.
xvars: a tidyselect expression (see tidyselect syntax) specifying the columns of x whose names are combined into the xvar part of the variable grid (see var_grid())
yvars: a tidyselect expression (see tidyselect syntax) specifying the columns of x whose names are combined into the yvar part of the variable grid (see var_grid())
na_rm: a logical value indicating whether to remove rows with missing values from sub-data before the callback function f is called
type: a character string specifying the type of conditions to be processed
min_length: the minimum size (the minimum number of predicates) of the condition to be generated (must be greater or equal to 0). If 0, the empty condition is generated in the first place.
max_length: the maximum size (the maximum number of predicates) of the condition to be generated. If equal to Inf, the maximum length of conditions is limited only by the number of available predicates.
min_support: the minimum support of a condition to trigger the callback function for it. The support of the condition is the relative frequency of the condition in the dataset x.
threads: the number of threads to use for parallel computation
...: further arguments, currently unused
A tibble with found rules. Each row represents a single call of the callback function f.
Michal Burda
See also: dig(), var_grid(), and dig_correlations(), which uses this function internally
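A sketch of a custom per-cell computation (not part of the original documentation). It assumes that, with type = "bool", the callback receives a data frame holding the xvar/yvar pair of columns restricted to the rows matching the current condition:
crisp <- partition(CO2, Plant:Treatment)
dig_grid(crisp,
         f = function(d) {
           # d is assumed to be a two-column data frame (the xvar and yvar columns)
           list(rows = nrow(d), correlation = cor(d[[1]], d[[2]]))
         },
         condition = where(is.logical),
         xvars = conc,
         yvars = uptake,
         type = "bool",
         min_support = 0.1)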
An implicative rule is a rule of the form A => c, where A (the antecedent) is a set of predicates and c (the consequent) is a predicate.
dig_implications(x, antecedent = everything(), consequent = everything(), disjoint = NULL, min_length = 0L, max_length = Inf, min_coverage = 0, min_support = 0, min_confidence = 0, contingency_table = FALSE, measures = NULL, t_norm = "goguen", threads = 1, ...)
x: a matrix or data frame with data to search in. The matrix must be numeric (double) or logical.
antecedent: a tidyselect expression (see tidyselect syntax) specifying the columns to use in the antecedent (left) part of the rules
consequent: a tidyselect expression (see tidyselect syntax) specifying the columns to use in the consequent (right) part of the rules
disjoint: an atomic vector of size equal to the number of columns of x
min_length: the minimum length, i.e., the minimum number of predicates in the antecedent, of a rule to be generated. The value must be greater or equal to 0. If 0, rules with an empty antecedent are generated in the first place.
max_length: the maximum length, i.e., the maximum number of predicates in the antecedent, of a rule to be generated. If equal to Inf, the maximum length is limited only by the number of available predicates.
min_coverage: the minimum coverage of a rule in the dataset x
min_support: the minimum support of a rule in the dataset x
min_confidence: the minimum confidence of a rule
contingency_table: a logical value indicating whether to provide a contingency table for each rule
measures: a character vector specifying the additional quality measures to compute
t_norm: a t-norm used to compute the conjunction of weights
threads: the number of threads to use for parallel computation
...: further arguments, currently unused
For the following explanations we need a function supp(I), defined for a set I of predicates as the relative frequency of rows satisfying all predicates from I. For logical data, supp(I) equals the relative frequency of rows for which all predicates from I are TRUE. For numerical (double) input, supp(I) is computed as the mean (over all rows) of truth degrees of the formula i_1 AND i_2 AND ... AND i_n, where AND is a triangular norm selected by the t_norm argument.
Implicative rules are characterized by the following quality measures:
- Length of a rule is the number of elements in the antecedent.
- Coverage of a rule is equal to supp(antecedent).
- Consequent support of a rule is equal to supp(consequent).
- Support of a rule is equal to supp(antecedent ∪ {consequent}).
- Confidence of a rule is the fraction supp(antecedent ∪ {consequent}) / supp(antecedent).
A tibble with found rules and computed quality measures.
Michal Burda
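A minimal association-rule style sketch (not part of the original documentation); the thresholds are illustrative:
# Search for implicative rules among the logical dummy columns created
# from the factor columns of CO2.
crisp <- partition(CO2[c("Plant", "Type", "Treatment")])
rules <- dig_implications(crisp,
                          antecedent = everything(),
                          consequent = everything(),
                          min_support = 0.05,
                          min_confidence = 0.9,
                          max_length = 3)
head(rules)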
The function takes a character vector of predicates and returns a formatted condition.
format_condition(condition)
condition: a character vector
a character scalar
Michal Burda
format_condition(NULL)             # returns {}
format_condition(c("a", "b", "c")) # returns {a,b,c}
Tests whether the given argument is a numeric value from the interval [0, 1].
is_degree(x, na_rm = FALSE)
x: the value to be tested
na_rm: whether to ignore NA values
TRUE if x is a numeric vector or matrix with values between 0 and 1.
Michal Burda
Determine whether the first vector is a subset of the second vector
is_subset(x, y)
x: the first vector
y: the second vector
TRUE if x is a subset of y, FALSE otherwise.
Michal Burda
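For instance (an illustrative sketch, not part of the original documentation):
is_subset(c(1, 2), c(1, 2, 3))   # TRUE: every element of x appears in y
is_subset(c(1, 4), c(1, 2, 3))   # FALSE: 4 is not an element of y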
Convert the selected columns of the data frame into either dummy logical columns (for logicals and factors), or into membership degrees of fuzzy sets (for numeric columns), while leaving the remaining columns untouched. Each column selected for transformation typically yields multiple columns in the output.
partition(.data, .what = everything(), ..., .breaks = NULL, .labels = NULL, .na = TRUE, .keep = FALSE, .method = "crisp", .right = TRUE)
.data: the data frame to be processed
.what: a tidyselect expression (see tidyselect syntax) specifying the columns to be transformed
...: optional other tidyselect expressions selecting additional columns to be processed
.breaks: for numeric columns, this has to be either an integer scalar or a numeric vector
.labels: character vector specifying the names used to construct the newly created column names
.na: if
.keep: whether to keep the original columns in the result
.method: the method of transformation for numeric columns; either "crisp", "triangle", or "raisedcos"
.right: if
Concretely, the transformation of each selected column is performed as follows:
- a logical column x is transformed into a pair of logical columns, x=TRUE and x=FALSE;
- a factor column x with levels l1, l2, and l3 is transformed into three logical columns named x=l1, x=l2, and x=l3;
- a numeric column x is transformed according to the .method argument:
  - if .method="crisp", the column is first transformed into a factor with intervals as factor levels and then processed as a factor (see above);
  - for the other methods (triangle or raisedcos), several new columns are created, where each column has numeric values from the interval [0, 1] and represents a certain fuzzy set (either triangular or raised-cosine).
Details of the transformation of numeric columns can be specified with additional arguments (.breaks, .labels, .right).
A tibble created by transforming .data.
Michal Burda
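A short sketch (not part of the original documentation); the break points used for the fuzzy partition are arbitrary values chosen only for illustration:
# Crisp dichotomization of the factor columns
partition(CO2, Plant:Treatment)

# Triangular fuzzy partition of the numeric column conc
partition(CO2, conc,
          .method = "triangle",
          .breaks = c(95, 175, 350, 675, 1000))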
The function creates a tibble with two columns, xvar and yvar, whose rows enumerate all combinations of column names specified in the xvars and yvars arguments. The column names to create the combinations from are selected using tidyselect expressions (see tidyselect syntax).
var_grid(x, xvars = everything(), yvars = everything())
x: either a data frame or a matrix
xvars: a tidyselect expression (see tidyselect syntax) specifying the columns of x whose names are used to fill the xvar column of the result
yvars: a tidyselect expression (see tidyselect syntax) specifying the columns of x whose names are used to fill the yvar column of the result
A tibble with two columns (xvar and yvar) with rows enumerating all combinations of column names specified by the tidyselect expressions in the xvars and yvars arguments.
Michal Burda
var_grid(CO2)
var_grid(CO2, xvars = Plant:Treatment, yvars = conc:uptake)
The function returns the indices of elements of the given list x that are incomparable (i.e., neither a subset nor a superset) with every preceding selected element. The first element is always selected. Each subsequent element is selected only if it is incomparable with all previously selected elements.
which_antichain(x, distance = 0)
x: a list of integerish vectors
distance: a non-negative integer specifying the allowed discrepancy between compared sets
an integer vector of indices of selected (incomparable) elements.
Michal Burda
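An illustrative sketch based on the description above (with the default distance = 0); the expected result is an assumption derived from that description:
# The first element is always kept. c(1L) is a subset and c(1L, 2L, 3L) is a
# superset of the kept c(1L, 2L), so only the incomparable c(3L, 4L) is
# additionally selected.
which_antichain(list(c(1L, 2L), c(1L), c(3L, 4L), c(1L, 2L, 3L)))
# expected: 1 3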