Search for patterns of custom type

A general function for searching for patterns of a custom type. Selected columns of x serve as condition predicates. The function enumerates all possible conditions in the form of elementary conjunctions of the selected predicates and executes a user-defined callback function f for each condition. The callback function is intended to perform some analysis and return an object representing a pattern or patterns related to the condition. dig() returns a list of these returned objects.
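The enumeration described above can be pictured with a naive base R sketch. The helper enumerate_conditions() and the toy matrix are hypothetical and only illustrate the idea of calling a callback once per conjunction; they are not how dig() is implemented and they ignore all of its filtering arguments.

```r
# Toy logical data: three predicates (columns), four rows.
x <- cbind(a = c(TRUE, TRUE, FALSE, TRUE),
           b = c(TRUE, FALSE, FALSE, TRUE),
           c = c(FALSE, TRUE, TRUE, TRUE))

# Hypothetical helper, not part of any package: visit every conjunction of
# columns (the empty one included) and call f() on it.
enumerate_conditions <- function(x, f) {
    sizes <- seq_len(ncol(x))
    subsets <- c(list(integer(0)),
                 unlist(lapply(sizes, function(k) {
                     combn(ncol(x), k, simplify = FALSE)
                 }), recursive = FALSE))
    lapply(subsets, function(idx) {
        # Named integer vector of column indices, as described for f.
        condition <- setNames(idx, colnames(x)[idx])
        # A row satisfies the condition if all selected predicates are TRUE;
        # the empty condition is satisfied by every row.
        indices <- Reduce(`&`, lapply(idx, function(j) x[, j]),
                          rep(TRUE, nrow(x)))
        f(condition, mean(indices))
    })
}

res <- enumerate_conditions(x, function(condition, support) {
    list(condition = paste(names(condition), collapse = " & "),
         support = support)
})
```

With three predicates the sketch visits eight conditions, which shows why the length and support restrictions described below matter for larger inputs.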

The callback function f may declare any of the arguments listed in the description of the f argument. The algorithm checks which of them are present and passes the corresponding information about the generated condition.

In addition to the condition, the function allows for the selection of so-called focus predicates. The focus predicates, a.k.a. foci, are predicates that are evaluated within each condition; additional information about them is passed to the callback function.

dig() allows restrictions to be placed on the generated conditions, such as:

the minimum and maximum length of the condition (min_length and max_length arguments).
the minimum support of the condition (min_support argument). Support of the condition is the relative frequency of the condition in the dataset x.
the minimum support of the focus (min_focus_support argument). Support of the focus is the relative frequency of rows on which all condition predicates AND the focus are TRUE. Foci with support lower than min_focus_support are filtered out.
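For logical data, the two kinds of support in the list above reduce to simple means. The following base R sketch uses a made-up data frame and a made-up threshold; it mirrors the definitions only and does not call dig().

```r
# Toy logical data: two condition predicates (a, b) and two foci (f1, f2).
x <- data.frame(a  = c(TRUE, TRUE, TRUE, FALSE),
                b  = c(TRUE, TRUE, FALSE, FALSE),
                f1 = c(TRUE, FALSE, FALSE, TRUE),
                f2 = c(TRUE, TRUE, FALSE, TRUE))

# Rows on which all predicates of the condition {a, b} are TRUE.
cond <- x$a & x$b

# Support of the condition: relative frequency of such rows.
cond_support <- mean(cond)

# Support of each focus: relative frequency of rows on which the condition
# AND the focus are TRUE.
focus_support <- sapply(x[c("f1", "f2")], function(f) mean(cond & f))

# Foci below the (illustrative) threshold are filtered out.
min_focus_support <- 0.3
kept <- focus_support[focus_support >= min_focus_support]
```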

Usage

dig(
  x,
  f,
  condition = everything(),
  focus = NULL,
  disjoint = var_names(colnames(x)),
  excluded = NULL,
  min_length = 0,
  max_length = Inf,
  min_support = 0,
  min_focus_support = 0,
  min_conditional_focus_support = 0,
  max_support = 1,
  filter_empty_foci = FALSE,
  t_norm = "goguen",
  max_results = Inf,
  verbose = FALSE,
  threads = 1L,
  error_context = list(
    arg_x = "x",
    arg_f = "f",
    arg_condition = "condition",
    arg_focus = "focus",
    arg_disjoint = "disjoint",
    arg_excluded = "excluded",
    arg_min_length = "min_length",
    arg_max_length = "max_length",
    arg_min_support = "min_support",
    arg_min_focus_support = "min_focus_support",
    arg_min_conditional_focus_support = "min_conditional_focus_support",
    arg_max_support = "max_support",
    arg_filter_empty_foci = "filter_empty_foci",
    arg_t_norm = "t_norm",
    arg_max_results = "max_results",
    arg_verbose = "verbose",
    arg_threads = "threads",
    call = current_env()
  )
)

Arguments

x

a matrix or data frame. The matrix must be numeric (double) or logical. If x is a data frame then each column must be either numeric (double) or logical.

f

the callback function executed for each generated condition. This function may have some of the following arguments. Based on which arguments are present, the algorithm provides the corresponding information about the generated condition:

condition - a named integer vector of column indices that represent the predicates of the condition. Names of the vector correspond to column names;
support - a numeric scalar value of the current condition's support;
indices - a logical vector indicating the rows satisfying the condition;
weights - (similar to indices) a numeric vector of weights, i.e. the degrees to which the rows satisfy the current condition;
pp - a value of a contingency table, condition & focus. pp is a named numeric vector where each value is a support of conjunction of the condition with a foci column (see the focus argument to specify, which columns are foci) - names of the vector are foci column names.
pn - a value of a contingency table, condition & neg focus. pn is a named numeric vector where each value is a support of conjunction of the condition with a negated foci column (see the focus argument to specify, which columns are foci) - names of the vector are foci column names.
np - a value of a contingency table, neg condition & focus. np is a named numeric vector where each value is a support of conjunction of the negated condition with a foci column (see the focus argument to specify, which columns are foci) - names of the vector are foci column names.
nn - a value of a contingency table, neg condition & neg focus. nn is a named numeric vector where each value is a support of conjunction of the negated condition with a negated foci column (see the focus argument to specify, which columns are foci) - names of the vector are foci column names.
foci_supports - (deprecated, use pp instead) a named numeric vector of supports of foci columns (see focus argument to specify, which columns are foci) - names of the vector are foci column names.
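The four contingency-table arguments pp, pn, np and nn can be illustrated for a single focus on logical data. This is a base R sketch with made-up vectors, not output of dig(). It uses relative frequencies; the examples at the end of this page divide pp by nrow(d), which suggests the values may arrive on a different scale, so check the scale before relying on it.

```r
# Made-up logical vectors: which rows satisfy the condition and the focus.
condition <- c(TRUE, TRUE, TRUE, FALSE, FALSE)
focus     <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# The four cells of the 2x2 contingency table, as relative frequencies.
pp <- mean(condition & focus)     # condition & focus
pn <- mean(condition & !focus)    # condition & neg focus
np <- mean(!condition & focus)    # neg condition & focus
nn <- mean(!condition & !focus)   # neg condition & neg focus

# A typical derived measure a callback might compute: the share of rows
# satisfying the condition that also satisfy the focus.
confidence <- pp / (pp + pn)
```

For logical data the four cells always sum to 1 on this scale, which is a quick sanity check inside a callback.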

condition

a tidyselect expression (see tidyselect syntax) specifying the columns to use as condition predicates

focus

a tidyselect expression (see tidyselect syntax) specifying the columns to use as focus predicates

disjoint

an atomic vector of size equal to the number of columns of x that specifies the groups of predicates: if some elements of the disjoint vector are equal, then the corresponding columns of x will NOT be present together in a single condition. If x is prepared with partition(), using the var_names() function on x's column names is a convenient way to create the disjoint vector.
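The effect of disjoint can be sketched in base R. The column names below are illustrative, and stripping everything from the "=" sign onward is only an assumption about what a var_names()-like grouping looks like; the helper is_allowed() is hypothetical.

```r
# Illustrative column names in the "variable=value" style.
cols <- c("Sepal.Width=(-Inf;3.2]", "Sepal.Width=(3.2;Inf]",
          "Petal.Length=(-Inf;3.95]", "Petal.Length=(3.95;Inf]")

# Assumed grouping: columns derived from the same variable share a label.
disjoint <- sub("=.*$", "", cols)

# Hypothetical check: a condition (vector of column indices) is allowed
# only if no two of its columns carry the same disjoint label.
is_allowed <- function(idx) anyDuplicated(disjoint[idx]) == 0

is_allowed(c(1, 2))  # two intervals of Sepal.Width: rejected
is_allowed(c(1, 4))  # columns from different variables: allowed
```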

excluded

NULL or a list of character vectors, where each character vector represents a formula in the form of an implication: all but the last element form the antecedent and the last element is the consequent. These formulae are treated as tautologies and serve to filter out generated conditions. If the generated condition contains both the antecedent and the consequent of any of the formulae, the condition is not passed to the callback function f. Similarly, if the generated condition contains the antecedent of any of the formulae, the focus that is the consequent of that formula is not passed to the callback function f.
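The two filtering rules for excluded can be written down as a small base R sketch. The helpers condition_allowed() and focus_allowed() and the predicate names "a", "b", "c" are hypothetical; they restate the rules above on character vectors of predicate names and are not the internal implementation.

```r
# One formula: a & b => c (the last element is the consequent).
excluded <- list(c("a", "b", "c"))

# Hypothetical helper: a condition is dropped if it contains both the
# antecedent and the consequent of some formula.
condition_allowed <- function(condition, excluded) {
    for (formula in excluded) {
        antecedent <- head(formula, -1)
        consequent <- tail(formula, 1)
        if (all(antecedent %in% condition) && consequent %in% condition) {
            return(FALSE)
        }
    }
    TRUE
}

# Hypothetical helper: a focus is dropped if it is the consequent of a
# formula whose antecedent is contained in the condition.
focus_allowed <- function(condition, focus, excluded) {
    for (formula in excluded) {
        antecedent <- head(formula, -1)
        consequent <- tail(formula, 1)
        if (all(antecedent %in% condition) && identical(consequent, focus)) {
            return(FALSE)
        }
    }
    TRUE
}

condition_allowed(c("a", "b", "c"), excluded)  # antecedent and consequent present
focus_allowed(c("a", "b"), "c", excluded)      # focus "c" follows from a & b
```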

min_length

the minimum size (the minimum number of predicates) of the condition to trigger the callback function f. The value of this argument must be greater than or equal to 0. If 0, the empty condition also triggers the callback.

max_length

the maximum allowed size (the maximum number of predicates) of the condition. Conditions longer than max_length are not generated. If equal to Inf, the maximum length of conditions is limited only by the number of available predicates. The value of this argument must be greater than or equal to 0 and also greater than or equal to min_length. This argument strongly affects both the speed of the search and the number of calls of the callback function f.

min_support

the minimum support of a condition to trigger the callback function f. The support of the condition is the relative frequency of the condition in the dataset x. For logical data, it equals the relative frequency of rows on which all condition predicates are TRUE. For numerical (double) input, the support is computed as the mean (over all rows) of the products of predicate values. The value of this argument must be in the range \([0, 1]\). If the support of the condition is lower than min_support, the recursive search for conditions containing the current condition is stopped. Therefore, min_support strongly affects both the speed of the search and the number of calls of the callback function f.
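Stopping the recursion is safe because multiplying by further values from \([0, 1]\) can only lower the support. The base R sketch below shows this kind of pruning on made-up membership degrees. The function search_conditions() is hypothetical, skips the empty condition, and is not the actual search procedure of dig().

```r
# Made-up membership degrees for three predicates.
x <- cbind(a = c(1, 0.8, 0.6, 0),
           b = c(1, 1, 0.5, 0.1),
           c = c(0.2, 0.2, 0, 0))

# Hypothetical depth-first search: extend the current condition by one
# column at a time and abandon a branch once its support drops below
# min_support.
search_conditions <- function(x, min_support, start = 1,
                              current = integer(0),
                              weights = rep(1, nrow(x))) {
    found <- list()
    for (j in seq_len(ncol(x))) {
        if (j < start) next
        w <- weights * x[, j]        # row-wise product of predicate values
        s <- mean(w)                 # support of the extended condition
        if (s < min_support) next    # prune: no superset is explored
        cond <- c(current, j)
        found <- c(found,
                   list(list(condition = colnames(x)[cond], support = s)),
                   search_conditions(x, min_support, j + 1, cond, w))
    }
    found
}

res <- search_conditions(x, min_support = 0.5)
found_names <- vapply(res, function(r) paste(r$condition, collapse = " & "),
                      character(1))
```

In this toy input only the conditions a, a & b and b survive; every condition involving c is never extended because c alone already falls below the threshold.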

min_focus_support

the minimum required support of a focus for it to be passed to the callback function f. For logical data, the support of the focus is the relative frequency of rows on which all condition predicates AND the focus are TRUE. Numerical (double) input is treated as membership degrees to fuzzy sets, and the support is computed as the mean (over all rows) of a t-norm of predicate values. (The applied t-norm is selected by the t_norm argument, see below.) The value of this argument must be in the range \([0, 1]\). If the support of the focus is lower than min_focus_support, the focus is not passed to the callback function f. See also the filter_empty_foci argument, which together with min_focus_support affects the speed of the search and the number of calls of the callback function f.

min_conditional_focus_support

the minimum relative support of a focus within a condition. The conditional support of the focus is the relative frequency of rows with the focus being TRUE within rows where the condition is TRUE. If \(s(C)\) represents the relative frequency of the condition being TRUE within the dataset and \(s(C \cup F)\) represents the relative frequency of the condition and the focus both being TRUE within the dataset (computed with a t-norm if the input is numerical), then the conditional support of the focus is \(s(C \cup F) / s(C)\). The value of this argument must be in the range \([0, 1]\). If the conditional support of the focus is lower than min_conditional_focus_support, the focus is not passed to the callback function f. See also the filter_empty_foci argument, which together with min_conditional_focus_support affects the speed of the search and the number of calls of the callback function f.
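A small worked instance of the ratio \(s(C \cup F) / s(C)\) on logical data, written in base R with made-up vectors:

```r
# Made-up logical vectors: rows satisfying the condition C and the focus F.
condition <- c(TRUE, TRUE, TRUE, TRUE, FALSE)
focus     <- c(TRUE, TRUE, TRUE, FALSE, TRUE)

s_c  <- mean(condition)          # s(C): 4 of 5 rows
s_cf <- mean(condition & focus)  # s(C u F): 3 of 5 rows

# Conditional support of the focus within the condition.
conditional_focus_support <- s_cf / s_c
```

Three of the four rows that satisfy the condition also satisfy the focus, so the conditional support is 0.75 even though the plain focus support is only 0.6.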

max_support

the maximum support of a condition to trigger the callback function f. If the support of the condition is greater than max_support, the condition is not passed to the callback function. max_support does not stop the recursive generation of conditions containing the current condition, but only the execution of the callback function. The value of this argument must be in the range \([0, 1]\).

filter_empty_foci

a logical scalar indicating whether to skip triggering the callback function f on conditions for which no focus remains available after filtering by min_focus_support or min_conditional_focus_support. If TRUE, the callback function f is triggered only if at least one focus remains after filtering. If FALSE, the callback function f is triggered regardless of the number of remaining foci.
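The decision described above can be restated as a one-line rule. The helper should_call_f() below is hypothetical and considers only min_focus_support for brevity; it is a sketch of the documented behaviour, not the package's code.

```r
# Hypothetical helper: decide whether the callback would be triggered for
# a condition, given the supports of its foci.
should_call_f <- function(focus_supports, min_focus_support,
                          filter_empty_foci) {
    remaining <- focus_supports[focus_supports >= min_focus_support]
    !filter_empty_foci || length(remaining) > 0
}

supports <- c(f1 = 0.05, f2 = 0.10)

# All foci fall below 0.2, so the callback is skipped only when
# filter_empty_foci is TRUE.
should_call_f(supports, 0.2, filter_empty_foci = TRUE)
should_call_f(supports, 0.2, filter_empty_foci = FALSE)
```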

t_norm

a t-norm used to compute conjunction of weights. It must be one of "goedel" (minimum t-norm), "goguen" (product t-norm), or "lukas" (Lukasiewicz t-norm).
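The three t-norms have standard definitions, shown here in base R on made-up membership degrees. The list t_norms is only an illustration keyed by the same names as the t_norm argument; it does not reproduce the package's internal code.

```r
# Standard definitions of the three t-norms for degrees in [0, 1].
t_norms <- list(
    goedel = function(a, b) pmin(a, b),          # minimum
    goguen = function(a, b) a * b,               # product
    lukas  = function(a, b) pmax(0, a + b - 1)   # Lukasiewicz
)

# Made-up membership degrees of two predicates over three rows.
a <- c(0.9, 0.5, 0.2)
b <- c(0.8, 0.5, 0.3)

# Support of the conjunction under each t-norm: mean over rows.
supports <- sapply(t_norms, function(tn) mean(tn(a, b)))
```

On the same data the minimum t-norm gives the highest support and the Lukasiewicz t-norm the lowest, so the choice of t_norm interacts with the min_support and min_focus_support thresholds.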

max_results

the maximum number of generated conditions to execute the callback function on. If the number of found conditions exceeds max_results, the function stops generating new conditions and returns the results. To avoid long computations during the search, it is recommended to set max_results to a reasonable positive value. Setting max_results to Inf will generate all possible conditions.

verbose

a logical scalar indicating whether to print progress messages.

threads

the number of threads to use for parallel computation.

error_context

a list of details to be used in error messages. This argument is useful when dig() is called from another function, so that error messages refer to the arguments of the calling function. The list must contain the following elements:

arg_x - the name of the argument x as a character string
arg_f - the name of the argument f as a character string
arg_condition - the name of the argument condition as a character string
arg_focus - the name of the argument focus as a character string
arg_disjoint - the name of the argument disjoint as a character string
arg_excluded - the name of the argument excluded as a character string
arg_min_length - the name of the argument min_length as a character string
arg_max_length - the name of the argument max_length as a character string
arg_min_support - the name of the argument min_support as a character string
arg_min_focus_support - the name of the argument min_focus_support as a character string
arg_min_conditional_focus_support - the name of the argument min_conditional_focus_support as a character string
arg_max_support - the name of the argument max_support as a character string
arg_filter_empty_foci - the name of the argument filter_empty_foci as a character string
arg_t_norm - the name of the argument t_norm as a character string
arg_max_results - the name of the argument max_results as a character string
arg_verbose - the name of the argument verbose as a character string
arg_threads - the name of the argument threads as a character string
call - an environment in which to evaluate the error messages.

Value

A list of results provided by the callback function f.

Author

Michal Burda

Examples

library(nuggets)  # provides dig(), partition() and format_condition()
library(tibble)

# Prepare iris data for use with dig()
d <- partition(iris, .breaks = 2)

# Call f() for each condition with support >= 0.5. The result is a list
# of strings representing the conditions.
dig(x = d,
    f = function(condition) {
        format_condition(names(condition))
    },
    min_support = 0.5)
#> [[1]]
#> [1] "{}"
#> 
#> [[2]]
#> [1] "{Sepal.Width=(-Inf;3.2]}"
#> 
#> [[3]]
#> [1] "{Petal.Length=(3.95;Inf],Sepal.Width=(-Inf;3.2]}"
#> 
#> [[4]]
#> [1] "{Sepal.Length=(-Inf;6.1]}"
#> 
#> [[5]]
#> [1] "{Petal.Length=(3.95;Inf]}"
#> 
#> [[6]]
#> [1] "{Petal.Width=(-Inf;1.3]}"
#> 

# Create a more complex pattern object - a list with some statistics
res <- dig(x = d,
           f = function(condition, support) {
               list(condition = format_condition(names(condition)),
                    support = support)
           },
           min_support = 0.5)
print(res)
#> [[1]]
#> [[1]]$condition
#> [1] "{}"
#> 
#> [[1]]$support
#> [1] 1
#> 
#> 
#> [[2]]
#> [[2]]$condition
#> [1] "{Sepal.Width=(-Inf;3.2]}"
#> 
#> [[2]]$support
#> [1] 0.7133333
#> 
#> 
#> [[3]]
#> [[3]]$condition
#> [1] "{Petal.Length=(3.95;Inf],Sepal.Width=(-Inf;3.2]}"
#> 
#> [[3]]$support
#> [1] 0.5266666
#> 
#> 
#> [[4]]
#> [[4]]$condition
#> [1] "{Sepal.Length=(-Inf;6.1]}"
#> 
#> [[4]]$support
#> [1] 0.6333333
#> 
#> 
#> [[5]]
#> [[5]]$condition
#> [1] "{Petal.Length=(3.95;Inf]}"
#> 
#> [[5]]$support
#> [1] 0.5933333
#> 
#> 
#> [[6]]
#> [[6]]$condition
#> [1] "{Petal.Width=(-Inf;1.3]}"
#> 
#> [[6]]$support
#> [1] 0.52
#> 
#> 

# Format the result as a data frame
do.call(rbind, lapply(res, as_tibble))
#> # A tibble: 6 × 2
#>   condition                                        support
#>   <chr>                                              <dbl>
#> 1 {}                                                 1    
#> 2 {Sepal.Width=(-Inf;3.2]}                           0.713
#> 3 {Petal.Length=(3.95;Inf],Sepal.Width=(-Inf;3.2]}   0.527
#> 4 {Sepal.Length=(-Inf;6.1]}                          0.633
#> 5 {Petal.Length=(3.95;Inf]}                          0.593
#> 6 {Petal.Width=(-Inf;1.3]}                           0.520

# Within each condition, also evaluate the supports of columns starting
# with "Species"
res <- dig(x = d,
           f = function(condition, support, pp) {
               c(list(condition = format_condition(names(condition))),
                 list(condition_support = support),
                 as.list(pp / nrow(d)))
           },
           condition = !starts_with("Species"),
           focus = starts_with("Species"),
           min_support = 0.5,
           min_focus_support = 0)

# Format the result as a tibble
do.call(rbind, lapply(res, as_tibble))
#> # A tibble: 6 × 5
#>   condition              condition_support `Species=setosa` `Species=versicolor`
#>   <chr>                              <dbl>            <dbl>                <dbl>
#> 1 {}                                 1                0.333                0.333
#> 2 {Sepal.Width=(-Inf;3.…             0.713            0.113                0.32 
#> 3 {Petal.Length=(3.95;I…             0.527            0                    0.247
#> 4 {Sepal.Length=(-Inf;6…             0.633            0.333                0.227
#> 5 {Petal.Length=(3.95;I…             0.593            0                    0.26 
#> 6 {Petal.Width=(-Inf;1.…             0.520            0.333                0.187
#> # ℹ 1 more variable: `Species=virginica` <dbl>

# For each condition, create multiple patterns based on the focus columns
res <- dig(x = d,
           f = function(condition, support, pp) {
               lapply(seq_along(pp), function(i) {
                   list(condition = format_condition(names(condition)),
                        condition_support = support,
                        focus = names(pp)[i],
                        focus_support = pp[[i]] / nrow(d))
               })
           },
           condition = !starts_with("Species"),
           focus = starts_with("Species"),
           min_support = 0.5,
           min_focus_support = 0)

# As res is now a list of lists, we need to flatten it before converting to
# a tibble
res <- unlist(res, recursive = FALSE)

# Format the result as a tibble
do.call(rbind, lapply(res, as_tibble))
#> # A tibble: 18 × 4
#>    condition                               condition_support focus focus_support
#>    <chr>                                               <dbl> <chr>         <dbl>
#>  1 {}                                                  1     Spec…        0.333 
#>  2 {}                                                  1     Spec…        0.333 
#>  3 {}                                                  1     Spec…        0.333 
#>  4 {Sepal.Width=(-Inf;3.2]}                            0.713 Spec…        0.113 
#>  5 {Sepal.Width=(-Inf;3.2]}                            0.713 Spec…        0.32  
#>  6 {Sepal.Width=(-Inf;3.2]}                            0.713 Spec…        0.28  
#>  7 {Petal.Length=(3.95;Inf],Sepal.Width=(…             0.527 Spec…        0     
#>  8 {Petal.Length=(3.95;Inf],Sepal.Width=(…             0.527 Spec…        0.247 
#>  9 {Petal.Length=(3.95;Inf],Sepal.Width=(…             0.527 Spec…        0.28  
#> 10 {Sepal.Length=(-Inf;6.1]}                           0.633 Spec…        0.333 
#> 11 {Sepal.Length=(-Inf;6.1]}                           0.633 Spec…        0.227 
#> 12 {Sepal.Length=(-Inf;6.1]}                           0.633 Spec…        0.0733
#> 13 {Petal.Length=(3.95;Inf]}                           0.593 Spec…        0     
#> 14 {Petal.Length=(3.95;Inf]}                           0.593 Spec…        0.26  
#> 15 {Petal.Length=(3.95;Inf]}                           0.593 Spec…        0.333 
#> 16 {Petal.Width=(-Inf;1.3]}                            0.520 Spec…        0.333 
#> 17 {Petal.Width=(-Inf;1.3]}                            0.520 Spec…        0.187 
#> 18 {Petal.Width=(-Inf;1.3]}                            0.520 Spec…        0