Association rules identify conditions (antecedents) under which a specific feature (consequent) is present very often.
- Scheme:
A => C
If condition A is satisfied, then the feature C is present very often.
- Example:
university_edu & middle_age & IT_industry => high_income
Middle-aged people with a university education who work in the IT industry are very likely to have a high income.
Antecedent A is usually a set of predicates, and consequent C is a single predicate.
For the following explanations we need a mathematical function \(supp(I)\), which
is defined for a set \(I\) of predicates as the relative frequency of rows satisfying
all predicates from \(I\). For logical data, \(supp(I)\) equals the relative
frequency of rows for which all predicates \(i_1, i_2, \ldots, i_n\) from \(I\) are TRUE.
For numerical (double) input, \(supp(I)\) is computed as the mean (over all rows)
of truth degrees of the formula i_1 AND i_2 AND ... AND i_n, where AND is a
triangular norm selected by the t_norm argument.
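The definition of \(supp(I)\) for numeric input can be illustrated with a small base R sketch. It is not the package's implementation: the helper supp_sketch(), the t_norms list and the toy data frame are hypothetical names introduced only for this illustration, and the sketch assumes a data frame whose columns hold truth degrees in [0, 1].

```r
# Illustrative only: supp_sketch() and t_norms are hypothetical helpers,
# not functions exported by the package.
t_norms <- list(
  goedel = function(a, b) pmin(a, b),          # minimum t-norm
  goguen = function(a, b) a * b,               # product t-norm
  lukas  = function(a, b) pmax(0, a + b - 1)   # Lukasiewicz t-norm
)

supp_sketch <- function(data, predicates, t_norm = "goguen") {
  conj <- t_norms[[t_norm]]
  # 1 is the neutral element of every t-norm, so an empty set has support 1
  degrees <- rep(1, nrow(data))
  for (p in predicates) {
    # logical columns are coerced to 0/1 truth degrees
    degrees <- conj(degrees, as.numeric(data[[p]]))
  }
  mean(degrees)  # mean truth degree of the conjunction over all rows
}

# Assumed toy data: two fuzzy predicates observed on four rows
toy <- data.frame(a = c(0.9, 0.4, 1.0, 0.0),
                  b = c(0.5, 0.8, 1.0, 0.3))

supp_sketch(toy, c("a", "b"), "goedel")  # 0.475
supp_sketch(toy, c("a", "b"), "goguen")  # 0.4425
supp_sketch(toy, c("a", "b"), "lukas")   # 0.4
```

For purely logical columns all three t-norms give the same result, namely the share of rows in which every predicate is TRUE.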
Association rules are characterized by the following quality measures.
Length of a rule is the number of elements in the antecedent.
Coverage of a rule is equal to \(supp(A)\).
Consequent support of a rule is equal to \(supp(\{c\})\), where \(c\) is the single predicate of the consequent.
Support of a rule is equal to \(supp(A \cup \{c\})\).
Confidence of a rule is the fraction \(supp(A \cup \{c\}) / supp(A)\).
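As a worked example, the measures can be computed by hand for one rule on logical data. The sketch assumes two made-up logical vectors over eight rows (one flagging rows that satisfy the whole antecedent, one flagging the consequent) and takes confidence as the share of antecedent-covered rows that also satisfy the consequent.

```r
# Assumed toy data: eight rows, one flag per row
antecedent <- c(TRUE, TRUE, TRUE, TRUE,  FALSE, FALSE, FALSE, FALSE)
consequent <- c(TRUE, TRUE, TRUE, FALSE, TRUE,  FALSE, FALSE, FALSE)

coverage       <- mean(antecedent)               # supp(A)        = 4/8 = 0.5
conseq_support <- mean(consequent)               # supp({c})      = 4/8 = 0.5
support        <- mean(antecedent & consequent)  # supp(A u {c})  = 3/8 = 0.375
confidence     <- support / coverage             # 0.375 / 0.5    = 0.75
```

Because the support of a rule can never exceed its coverage, confidence computed this way always lies between 0 and 1.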
Usage
dig_associations(
x,
antecedent = everything(),
consequent = everything(),
disjoint = var_names(colnames(x)),
excluded = NULL,
min_length = 0L,
max_length = Inf,
min_coverage = 0,
min_support = 0,
min_confidence = 0,
contingency_table = FALSE,
measures = NULL,
t_norm = "goguen",
max_results = Inf,
verbose = FALSE,
threads = 1,
error_context = list(arg_x = "x", arg_antecedent = "antecedent", arg_consequent =
"consequent", arg_disjoint = "disjoint", arg_excluded = "excluded", arg_min_length =
"min_length", arg_max_length = "max_length", arg_min_coverage = "min_coverage",
arg_min_support = "min_support", arg_min_confidence = "min_confidence",
arg_contingency_table = "contingency_table", arg_measures = "measures", arg_t_norm =
"t_norm", arg_max_results = "max_results", arg_verbose = "verbose", arg_threads =
"threads", call = current_env())
)
Arguments
- x
a matrix or data frame with data to search in. The matrix must be numeric (double) or logical. If x is a data frame, then each column must be either numeric (double) or logical.
- antecedent
a tidyselect expression (see tidyselect syntax) specifying the columns to use in the antecedent (left) part of the rules
- consequent
a tidyselect expression (see tidyselect syntax) specifying the columns to use in the consequent (right) part of the rules
- disjoint
an atomic vector of size equal to the number of columns of x that specifies the groups of predicates: if some elements of the disjoint vector are equal, then the corresponding columns of x will NOT be present together in a single condition. If x is prepared with partition(), using the var_names() function on x's column names is a convenient way to create the disjoint vector.
- excluded
NULL or a list of character vectors, where each character vector contains the names of columns that must not appear together in a single antecedent.
- min_length
the minimum length, i.e., the minimum number of predicates in the antecedent, of a rule to be generated. The value must be greater than or equal to 0. If 0, rules with an empty antecedent are generated first.
- max_length
the maximum length, i.e., the maximum number of predicates in the antecedent, of a rule to be generated. If equal to Inf, the maximum length is limited only by the number of available predicates.
- min_coverage
the minimum coverage of a rule in the dataset x. (See Description for the definition of coverage.)
- min_support
the minimum support of a rule in the dataset x. (See Description for the definition of support.)
- min_confidence
the minimum confidence of a rule in the dataset x. (See Description for the definition of confidence.)
- contingency_table
a logical value indicating whether to provide a contingency table for each rule. If TRUE, the columns pp, pn, np, and nn are added to the output table. These columns contain the number of rows satisfying the antecedent and the consequent, the antecedent but not the consequent, the consequent but not the antecedent, and neither the antecedent nor the consequent, respectively.
- measures
a character vector specifying the additional quality measures to compute. If NULL, no additional measures are computed. Possible values are "lift", "conviction", "added_value". See https://mhahsler.github.io/arules/docs/measures for a description of the measures.
- t_norm
a t-norm used to compute conjunction of weights. It must be one of "goedel" (minimum t-norm), "goguen" (product t-norm), or "lukas" (Łukasiewicz t-norm).
- max_results
the maximum number of generated conditions to execute the callback function on. If the number of found conditions exceeds max_results, the function stops generating new conditions and returns the results. To avoid long computations during the search, it is recommended to set max_results to a reasonable positive value. Setting max_results to Inf will generate all possible conditions.
- verbose
a logical value indicating whether to print progress messages.
- threads
the number of threads to use for parallel computation.
- error_context
a named list providing context for error messages. This is mainly useful when dig_associations() is called from another function and you want error messages to refer to the argument names of that calling function. The list must contain the following elements:
  - arg_x: name of the argument x
  - arg_antecedent: name of the argument antecedent
  - arg_consequent: name of the argument consequent
  - arg_disjoint: name of the argument disjoint
  - arg_excluded: name of the argument excluded
  - arg_min_length: name of the argument min_length
  - arg_max_length: name of the argument max_length
  - arg_min_coverage: name of the argument min_coverage
  - arg_min_support: name of the argument min_support
  - arg_min_confidence: name of the argument min_confidence
  - arg_contingency_table: name of the argument contingency_table
  - arg_measures: name of the argument measures
  - arg_t_norm: name of the argument t_norm
  - arg_max_results: name of the argument max_results
  - arg_verbose: name of the argument verbose
  - arg_threads: name of the argument threads
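To make the contingency_table and measures arguments more concrete, the following base R sketch derives the four contingency counts and the three optional measures for a single rule on logical data. It is only an illustration: the toy vectors are made up, and the formulas for lift, conviction and added value are the commonly used textbook definitions, assumed here rather than taken from the package source.

```r
# Assumed toy data: eight rows, one flag per row
antecedent <- c(TRUE, TRUE, TRUE, TRUE,  FALSE, FALSE, FALSE, FALSE)
consequent <- c(TRUE, TRUE, TRUE, FALSE, TRUE,  FALSE, FALSE, FALSE)

# Contingency counts, named after the output columns described above
pp <- sum(antecedent & consequent)    # antecedent and consequent
pn <- sum(antecedent & !consequent)   # antecedent but not consequent
np <- sum(!antecedent & consequent)   # consequent but not antecedent
nn <- sum(!antecedent & !consequent)  # neither

n              <- pp + pn + np + nn
confidence     <- pp / (pp + pn)
conseq_support <- (pp + np) / n

# Assumed standard definitions of the additional measures
lift        <- confidence / conseq_support
conviction  <- (1 - conseq_support) / (1 - confidence)
added_value <- confidence - conseq_support
```

Here pp = 3, pn = 1, np = 1 and nn = 3, which gives a lift of 1.5, a conviction of 2 and an added value of 0.25. Note that conviction is infinite for a rule with confidence 1.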
Value
An S3 object that is an instance of the associations and nugget classes and that is a tibble with the found patterns and the computed quality measures.
Examples
d <- partition(mtcars, .breaks = 2)
#> Warning: n same as number of different finite values\neach different finite value is a separate class
#> Warning: n same as number of different finite values\neach different finite value is a separate class
dig_associations(d,
                 antecedent = !starts_with("mpg"),
                 consequent = starts_with("mpg"),
                 min_support = 0.3,
                 min_confidence = 0.8,
                 measures = c("lift", "conviction"))
#> # A tibble: 524 × 10
#> antecedent consequent support confidence coverage conseq_support count
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 {carb=(-Inf;4.5]… {mpg=(-In… 0.344 0.846 0.406 0.719 11
#> 2 {carb=(-Inf;4.5]… {mpg=(-In… 0.312 0.909 0.344 0.719 10
#> 3 {am=(-Inf;0.5],c… {mpg=(-In… 0.312 0.909 0.344 0.719 10
#> 4 {am=(-Inf;0.5],c… {mpg=(-In… 0.375 0.857 0.438 0.719 12
#> 5 {carb=(-Inf;4.5]… {mpg=(-In… 0.5 0.889 0.562 0.719 16
#> 6 {carb=(-Inf;4.5]… {mpg=(-In… 0.375 1 0.375 0.719 12
#> 7 {am=(-Inf;0.5],c… {mpg=(-In… 0.375 1 0.375 0.719 12
#> 8 {am=(-Inf;0.5],c… {mpg=(-In… 0.375 1 0.375 0.719 12
#> 9 {am=(-Inf;0.5],c… {mpg=(-In… 0.375 1 0.375 0.719 12
#> 10 {am=(-Inf;0.5],c… {mpg=(-In… 0.375 1 0.375 0.719 12
#> # ℹ 514 more rows
#> # ℹ 3 more variables: antecedent_length <int>, lift <dbl>, conviction <dbl>