A general function for searching for patterns of a custom type. The function allows selecting columns of x to be used as condition predicates. It enumerates all possible conditions in the form of elementary conjunctions of the selected predicates and, for each condition, executes a user-defined callback function f. The callback function is intended to perform some analysis and return an object representing a pattern or patterns related to the condition. dig() returns a list of these returned objects.
The callback function f may have some of the arguments listed in the description of the f argument. The algorithm provides information about the generated condition based on which of these arguments are present.
In addition to condition, the function allows selecting the so-called focus predicates. The focus predicates, a.k.a. foci, are predicates that are evaluated within each condition, and some additional information about them is provided to the callback function.
dig() allows specifying some restrictions on the generated conditions, such as:
- the minimum and maximum length of the condition (min_length and max_length arguments).
- the minimum support of the condition (min_support argument). Support of the condition is the relative frequency of the condition in the dataset x.
- the minimum support of the focus (min_focus_support argument). Support of the focus is the relative frequency of rows such that all condition predicates AND the focus are TRUE on it. Foci with support lower than min_focus_support are filtered out.
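The two notions of support above can be illustrated with a small base-R sketch. It does not call dig(); the toy matrix and its column names (a, b, y) are assumptions made only for illustration, with a and b playing the role of condition predicates and y the role of a focus. The threshold value is likewise arbitrary.

```r
# Toy logical data (hypothetical): a and b act as condition predicates,
# y acts as a focus predicate.
x <- cbind(
  a = c(TRUE, TRUE, FALSE, TRUE),
  b = c(TRUE, FALSE, FALSE, TRUE),
  y = c(TRUE, FALSE, TRUE, FALSE)
)

# Rows on which all predicates of the condition {a, b} are TRUE
cond_rows <- x[, "a"] & x[, "b"]

# Support of the condition: relative frequency of such rows
cond_support <- mean(cond_rows)

# Support of the focus: rows where the condition AND the focus are TRUE
focus_support <- mean(cond_rows & x[, "y"])

# A focus is dropped when its support falls below the threshold
# (the value 0.3 is chosen arbitrarily for this sketch)
min_focus_support <- 0.3
keep_focus <- focus_support >= min_focus_support
```

In this toy data the condition {a, b} holds on two of four rows (support 0.5), while condition and focus hold together on one row (support 0.25), so the focus y would be filtered out under the chosen threshold.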
Usage
dig(
  x,
  f,
  condition = everything(),
  focus = NULL,
  disjoint = var_names(colnames(x)),
  min_length = 0,
  max_length = Inf,
  min_support = 0,
  min_focus_support = min_support,
  min_conditional_focus_support = 0,
  max_support = 1,
  filter_empty_foci = FALSE,
  t_norm = "goguen",
  max_results = Inf,
  verbose = FALSE,
  threads = 1L,
  error_context = list(arg_x = "x", arg_f = "f", arg_condition = "condition",
    arg_focus = "focus", arg_disjoint = "disjoint", arg_min_length = "min_length",
    arg_max_length = "max_length", arg_min_support = "min_support",
    arg_min_focus_support = "min_focus_support",
    arg_min_conditional_focus_support = "min_conditional_focus_support",
    arg_max_support = "max_support", arg_filter_empty_foci = "filter_empty_foci",
    arg_t_norm = "t_norm", arg_max_results = "max_results", arg_verbose = "verbose",
    arg_threads = "threads", call = current_env())
)
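The default disjoint = var_names(colnames(x)) in the signature above groups predicates so that columns from the same group never meet in one condition. The base-R sketch below imitates that idea without calling any package function. The grouping rule (taking the text before "=" as the group label), the helper is_admissible(), and the example column names are all assumptions for illustration and may differ from what var_names() and dig() actually do.

```r
# Hypothetical column names in the "variable=interval" style
cols <- c("Sepal.Width=(-Inf;3.2]", "Sepal.Width=(3.2;Inf]",
          "Petal.Width=(-Inf;1.3]", "Petal.Width=(1.3;Inf]")

# Assumed grouping rule: the text before "=" names the original variable
groups <- sub("=.*$", "", cols)

# A condition (given by column indices) is admissible when no two of its
# predicates share a group label (illustrative helper, not a package function)
is_admissible <- function(idx, groups) !anyDuplicated(groups[idx])

ok  <- is_admissible(c(1, 3), groups)  # one Sepal.Width and one Petal.Width
bad <- is_admissible(c(1, 2), groups)  # two Sepal.Width intervals together
```

Under this rule, ok is TRUE and bad is FALSE: two intervals of the same variable are never combined, which avoids generating conditions that no row could satisfy.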
Arguments
- x
a matrix or data frame. The matrix must be numeric (double) or logical. If x is a data frame, then each column must be either numeric (double) or logical.
- f
the callback function executed for each generated condition. This function may have some of the following arguments. Based on the arguments present, the algorithm provides information about the generated condition:
  - condition: a named integer vector of column indices that represent the predicates of the condition. Names of the vector correspond to column names;
  - support: a numeric scalar value of the current condition's support;
  - indices: a logical vector indicating the rows satisfying the condition;
  - weights: (similar to indices) weights of rows, i.e., the degrees to which they satisfy the current condition;
  - pp: a value of a contingency table, condition & focus. pp is a named numeric vector where each value is a support of the conjunction of the condition with a foci column (see the focus argument to specify which columns are foci). Names of the vector are foci column names;
  - pn: a value of a contingency table, condition & neg focus. pn is a named numeric vector where each value is a support of the conjunction of the condition with a negated foci column (see the focus argument to specify which columns are foci). Names of the vector are foci column names;
  - np: a value of a contingency table, neg condition & focus. np is a named numeric vector where each value is a support of the conjunction of the negated condition with a foci column (see the focus argument to specify which columns are foci). Names of the vector are foci column names;
  - nn: a value of a contingency table, neg condition & neg focus. nn is a named numeric vector where each value is a support of the conjunction of the negated condition with a negated foci column (see the focus argument to specify which columns are foci). Names of the vector are foci column names;
  - foci_supports: (deprecated, use pp instead) a named numeric vector of supports of foci columns (see the focus argument to specify which columns are foci). Names of the vector are foci column names.
- condition
a tidyselect expression (see tidyselect syntax) specifying the columns to use as condition predicates
- focus
a tidyselect expression (see tidyselect syntax) specifying the columns to use as focus predicates
- disjoint
an atomic vector of size equal to the number of columns of x that specifies the groups of predicates: if some elements of the disjoint vector are equal, then the corresponding columns of x will NOT be present together in a single condition. If x is prepared with partition(), using the var_names() function on x's column names is a convenient way to create the disjoint vector.
- min_length
the minimum size (the minimum number of predicates) of the condition to be generated (must be greater than or equal to 0). If 0, the empty condition is generated first.
- max_length
the maximum size (the maximum number of predicates) of the condition to be generated. If equal to Inf, the maximum length of conditions is limited only by the number of available predicates.
- min_support
the minimum support of a condition to trigger the callback function for it. The support of the condition is the relative frequency of the condition in the dataset x. For logical data, it equals the relative frequency of rows such that all condition predicates are TRUE on it. For numerical (double) input, the support is computed as the mean (over all rows) of the products of predicate values.
- min_focus_support
the minimum support of a focus, for the focus to be passed to the callback function. The support of the focus is the relative frequency of rows such that all condition predicates AND the focus are TRUE on it. For numerical (double) input, the support is computed as the mean (over all rows) of the products of predicate values.
- min_conditional_focus_support
the minimum relative support of a focus within a condition. The conditional support of the focus is the relative frequency of rows with focus being TRUE within rows where the condition is TRUE.
- max_support
the maximum support of a condition to trigger the callback
- filter_empty_foci
a logical scalar indicating whether to skip conditions for which no focus remains available after filtering by min_focus_support. If TRUE, the condition is passed to the callback function only if at least one focus remains after filtering. If FALSE, the condition is passed to the callback function regardless of the number of remaining foci.
- t_norm
a t-norm used to compute conjunction of weights. It must be one of "goedel" (minimum t-norm), "goguen" (product t-norm), or "lukas" (Lukasiewicz t-norm).
- max_results
the maximum number of generated conditions to execute the callback function on. If the number of found conditions exceeds max_results, the function stops generating new conditions and returns the results. To avoid long computations during the search, it is recommended to set max_results to a reasonable positive value. Setting max_results to Inf will generate all possible conditions.
- verbose
a logical scalar indicating whether to print progress messages.
- threads
the number of threads to use for parallel computation.
- error_context
a list of details to be used in error messages. This argument is useful when dig() is called from another function, so that error messages refer to the arguments of the calling function. The list must contain the following elements:
  - arg_x: the name of the argument x as a character string
  - arg_f: the name of the argument f as a character string
  - arg_condition: the name of the argument condition as a character string
  - arg_focus: the name of the argument focus as a character string
  - arg_disjoint: the name of the argument disjoint as a character string
  - arg_min_length: the name of the argument min_length as a character string
  - arg_max_length: the name of the argument max_length as a character string
  - arg_min_support: the name of the argument min_support as a character string
  - arg_min_focus_support: the name of the argument min_focus_support as a character string
  - arg_max_support: the name of the argument max_support as a character string
  - arg_filter_empty_foci: the name of the argument filter_empty_foci as a character string
  - arg_t_norm: the name of the argument t_norm as a character string
  - arg_threads: the name of the argument threads as a character string
  - call: an environment in which to evaluate the error messages.
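For numeric (fuzzy) input, the t_norm argument determines how predicate values are combined into row weights. The following base-R sketch applies the three standard t-norms named above by hand to two invented vectors of membership degrees. The helper names (tnorm_goedel and so on) are hypothetical and are not part of dig()'s interface. Taking the support as the mean of the row weights follows the product-based description given for min_support; extending it to the other two t-norms is an assumption of this sketch.

```r
# Membership degrees of two hypothetical fuzzy predicates over four rows
a <- c(1.0, 0.8, 0.3, 0.0)
b <- c(0.5, 0.9, 0.6, 1.0)

# Standard t-norms (helper names are illustrative only)
tnorm_goedel <- function(u, v) pmin(u, v)          # minimum t-norm
tnorm_goguen <- function(u, v) u * v               # product t-norm
tnorm_lukas  <- function(u, v) pmax(0, u + v - 1)  # Lukasiewicz t-norm

# Row weights of the condition {a, b} under each t-norm
w_goedel <- tnorm_goedel(a, b)
w_goguen <- tnorm_goguen(a, b)
w_lukas  <- tnorm_lukas(a, b)

# Support taken as the mean of the row weights
supports <- c(
  goedel = mean(w_goedel),
  goguen = mean(w_goguen),
  lukas  = mean(w_lukas)
)
```

For values in [0, 1], the Lukasiewicz result never exceeds the product, which never exceeds the minimum, so the same condition can pass a min_support threshold under one t-norm and fail it under another.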
Examples
# dig(), partition() and format_condition() come from the nuggets package
library(nuggets)
library(tibble)
# Prepare iris data for use with dig()
d <- partition(iris, .breaks = 2)
# Call f() for each condition with support >= 0.5. The result is a list
# of strings representing the conditions.
dig(x = d,
    f = function(condition) {
      format_condition(names(condition))
    },
    min_support = 0.5)
#> [[1]]
#> [1] "{}"
#>
#> [[2]]
#> [1] "{Sepal.Width=(-Inf;3.2]}"
#>
#> [[3]]
#> [1] "{Sepal.Length=(-Inf;6.1]}"
#>
#> [[4]]
#> [1] "{Petal.Length=(3.95;Inf]}"
#>
#> [[5]]
#> [1] "{Petal.Width=(-Inf;1.3]}"
#>
#> [[6]]
#> [1] "{Petal.Length=(3.95;Inf],Sepal.Width=(-Inf;3.2]}"
#>
# Create a more complex pattern object - a list with some statistics
res <- dig(x = d,
           f = function(condition, support) {
             list(condition = format_condition(names(condition)),
                  support = support)
           },
           min_support = 0.5)
print(res)
#> [[1]]
#> [[1]]$condition
#> [1] "{}"
#>
#> [[1]]$support
#> [1] 1
#>
#>
#> [[2]]
#> [[2]]$condition
#> [1] "{Sepal.Width=(-Inf;3.2]}"
#>
#> [[2]]$support
#> [1] 0.7133333
#>
#>
#> [[3]]
#> [[3]]$condition
#> [1] "{Sepal.Length=(-Inf;6.1]}"
#>
#> [[3]]$support
#> [1] 0.6333333
#>
#>
#> [[4]]
#> [[4]]$condition
#> [1] "{Petal.Length=(3.95;Inf]}"
#>
#> [[4]]$support
#> [1] 0.5933333
#>
#>
#> [[5]]
#> [[5]]$condition
#> [1] "{Petal.Width=(-Inf;1.3]}"
#>
#> [[5]]$support
#> [1] 0.52
#>
#>
#> [[6]]
#> [[6]]$condition
#> [1] "{Petal.Length=(3.95;Inf],Sepal.Width=(-Inf;3.2]}"
#>
#> [[6]]$support
#> [1] 0.5266666
#>
#>
# Format the result as a data frame
do.call(rbind, lapply(res, as_tibble))
#> # A tibble: 6 × 2
#>   condition                                        support
#>   <chr>                                              <dbl>
#> 1 {}                                                 1
#> 2 {Sepal.Width=(-Inf;3.2]}                           0.713
#> 3 {Sepal.Length=(-Inf;6.1]}                          0.633
#> 4 {Petal.Length=(3.95;Inf]}                          0.593
#> 5 {Petal.Width=(-Inf;1.3]}                           0.520
#> 6 {Petal.Length=(3.95;Inf],Sepal.Width=(-Inf;3.2]}   0.527
# Within each condition, evaluate also supports of columns starting with
# "Species"
res <- dig(x = d,
           f = function(condition, support, pp) {
             c(list(condition = format_condition(names(condition))),
               list(condition_support = support),
               as.list(pp / nrow(d)))
           },
           condition = !starts_with("Species"),
           focus = starts_with("Species"),
           min_support = 0.5,
           min_focus_support = 0)
# Format the result as a tibble
do.call(rbind, lapply(res, as_tibble))
#> # A tibble: 6 × 5
#>   condition              condition_support `Species=setosa` `Species=versicolor`
#>   <chr>                              <dbl>            <dbl>                <dbl>
#> 1 {}                                 1                0.333                0.333
#> 2 {Sepal.Width=(-Inf;3.…             0.713            0.113                0.32
#> 3 {Sepal.Length=(-Inf;6…             0.633            0.333                0.227
#> 4 {Petal.Length=(3.95;I…             0.593            0                    0.26
#> 5 {Petal.Width=(-Inf;1.…             0.520            0.333                0.187
#> 6 {Petal.Length=(3.95;I…             0.527            0                    0.247
#> # ℹ 1 more variable: `Species=virginica` <dbl>
# For each condition, create multiple patterns based on the focus columns
res <- dig(x = d,
           f = function(condition, support, pp) {
             lapply(seq_along(pp), function(i) {
               list(condition = format_condition(names(condition)),
                    condition_support = support,
                    focus = names(pp)[i],
                    focus_support = pp[[i]] / nrow(d))
             })
           },
           condition = !starts_with("Species"),
           focus = starts_with("Species"),
           min_support = 0.5,
           min_focus_support = 0)
# As res is now a list of lists, we need to flatten it before converting to
# a tibble
res <- unlist(res, recursive = FALSE)
# Format the result as a tibble
do.call(rbind, lapply(res, as_tibble))
#> # A tibble: 18 × 4
#>    condition                               condition_support focus focus_support
#>    <chr>                                               <dbl> <chr>         <dbl>
#>  1 {}                                                  1     Spec…        0.333
#>  2 {}                                                  1     Spec…        0.333
#>  3 {}                                                  1     Spec…        0.333
#>  4 {Sepal.Width=(-Inf;3.2]}                            0.713 Spec…        0.113
#>  5 {Sepal.Width=(-Inf;3.2]}                            0.713 Spec…        0.32
#>  6 {Sepal.Width=(-Inf;3.2]}                            0.713 Spec…        0.28
#>  7 {Sepal.Length=(-Inf;6.1]}                           0.633 Spec…        0.333
#>  8 {Sepal.Length=(-Inf;6.1]}                           0.633 Spec…        0.227
#>  9 {Sepal.Length=(-Inf;6.1]}                           0.633 Spec…        0.0733
#> 10 {Petal.Length=(3.95;Inf]}                           0.593 Spec…        0
#> 11 {Petal.Length=(3.95;Inf]}                           0.593 Spec…        0.26
#> 12 {Petal.Length=(3.95;Inf]}                           0.593 Spec…        0.333
#> 13 {Petal.Width=(-Inf;1.3]}                            0.520 Spec…        0.333
#> 14 {Petal.Width=(-Inf;1.3]}                            0.520 Spec…        0.187
#> 15 {Petal.Width=(-Inf;1.3]}                            0.520 Spec…        0
#> 16 {Petal.Length=(3.95;Inf],Sepal.Width=(…             0.527 Spec…        0
#> 17 {Petal.Length=(3.95;Inf],Sepal.Width=(…             0.527 Spec…        0.247
#> 18 {Petal.Length=(3.95;Inf],Sepal.Width=(…             0.527 Spec…        0.28
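The pp values used in the last two examples are one cell of a 2x2 contingency table between a condition and a focus; the description of the f argument also lists pn, np and nn. The base-R sketch below computes all four cells by hand for two invented logical vectors, without calling dig(). The cells are computed here as row counts, which matches the examples above dividing pp by nrow(d), but that scale is an assumption about dig()'s behavior rather than something the argument description states.

```r
# Hypothetical logical vectors over six rows
cond  <- c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE)   # rows satisfying a condition
focus <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)  # rows satisfying a focus

# Four cells of the contingency table, as row counts
cells <- c(
  pp = sum(cond & focus),    # condition & focus
  pn = sum(cond & !focus),   # condition & neg focus
  np = sum(!cond & focus),   # neg condition & focus
  nn = sum(!cond & !focus)   # neg condition & neg focus
)

# Dividing by the number of rows turns counts into relative frequencies
rel <- cells / length(cond)

# Conditional focus support: share of focus-TRUE rows among condition rows
cond_focus_support <- cells[["pp"]] / (cells[["pp"]] + cells[["pn"]])
```

The four cells always sum to the number of rows, and pp + pn gives the number of rows satisfying the condition, which is the denominator used for the conditional focus support described under min_conditional_focus_support.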