Introduction
Package nuggets searches for patterns that can be
expressed as formulae in the form of elementary conjunctions, referred
to in this text as conditions. Conditions are constructed from
predicates, which correspond to data columns. The
interpretation of conditions depends on the choice of underlying
logic:
- Crisp (Boolean) logic: each predicate takes the value TRUE (1) or FALSE (0). The truth value of a condition is computed according to the rules of classical Boolean algebra.
- Fuzzy logic: each predicate is assigned a truth degree from the interval \([0, 1]\). The truth degree of a conjunction is then computed using a chosen triangular norm (t-norm). The package supports three common t-norms, which are defined for predicates’ truth degrees \(a, b \in [0, 1]\) as follows:
  - Gödel (minimum) t-norm: \(\min(a, b)\);
  - Goguen (product) t-norm: \(a \cdot b\);
  - Łukasiewicz t-norm: \(\max(0, a + b - 1)\).
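The three t-norms can be tried out directly in base R. The following is a minimal sketch that does not use the package; the function names `tnorm_goedel()`, `tnorm_goguen()`, and `tnorm_lukasiewicz()` are hypothetical helpers introduced only for this illustration.

```r
# Three t-norms, vectorised over truth degrees in [0, 1].
# The function names are illustrative and are not part of nuggets.
tnorm_goedel      <- function(a, b) pmin(a, b)
tnorm_goguen      <- function(a, b) a * b
tnorm_lukasiewicz <- function(a, b) pmax(0, a + b - 1)

a <- 0.7
b <- 0.6
degrees <- c(goedel      = tnorm_goedel(a, b),
             goguen      = tnorm_goguen(a, b),
             lukasiewicz = tnorm_lukasiewicz(a, b))
print(degrees)

# On crisp inputs (0 or 1) every t-norm reduces to the Boolean AND.
crisp <- expand.grid(a = c(0, 1), b = c(0, 1))
crisp$and <- as.numeric(crisp$a & crisp$b)
```

For the same pair of truth degrees, the Gödel t-norm gives the largest result and the Łukasiewicz t-norm the smallest, so the choice of t-norm affects how quickly the truth degree of a long conjunction drops.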
Before applying nuggets, data columns intended as
predicates must be prepared either by dichotomization
(conversion into dummy variables) or by transformation into
fuzzy sets. The package provides functions for both
transformations. See the section Data
Preparation below for more details.
nuggets implements functions to search for pre-defined
types of patterns, for example:
- dig_associations() for association rules,
- dig_baseline_contrasts(), dig_complement_contrasts(), and dig_paired_baseline_contrasts() for various contrast patterns on numeric variables,
- dig_correlations() for conditional correlations.
See Pre-defined Patterns below for further details.
Discovered rules and patterns can be post-processed, visualized, and explored interactively. Section Post-processing and Visualization describes these features.
Finally, the package allows users to provide custom evaluation functions for conditions and to search for user-defined types of patterns:
- dig() is a general function for searching arbitrary pattern types.
- dig_grid() is a wrapper around dig() for patterns defined by conditions and a pair of columns evaluated by a user-defined function.
See Custom Patterns for more information.
Data Preparation
Before patterns can be searched for, the data columns that serve
as predicates in conditions must be transformed either to logical
(TRUE/FALSE) columns or to fuzzy sets with
values from the interval \([0, 1]\).
The first option is simpler and faster, and it is recommended
for most applications. The second option is more flexible and makes it
possible to model uncertainty in the data, but it is more computationally demanding.
Preparation of Crisp (Boolean) Predicates
For patterns based on crisp conditions, the data columns that
serve as predicates in conditions have to be transformed to logical
(TRUE/FALSE) columns. That can be done in two
steps:
- numeric columns are transformed to factors with a selected number of levels, and then
- factors are transformed to dummy logical columns.
Both operations can be done with the help of the
partition() function. The partition() function
requires the dataset as its first argument and a tidyselect
selection expression to select the columns to be transformed.
Factors and logical columns are automatically transformed to dummy
logical columns by the partition() function. For numeric
columns, the partition() function requires the
.method argument to specify the method of partitioning:
- .method = "dummy" transforms numeric columns to factors and then to dummy logical columns. That effectively creates a separate logical column for each distinct value of the numeric column.
- .method = "crisp" transforms numeric columns to crisp predicates by dividing the range of values into intervals and coding the values into dummy logical columns according to the intervals.
- Other methods of partitioning numeric columns exist. They create fuzzy predicates and are described in the next section.
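To make the term "dummy logical columns" concrete, the following base R sketch builds such columns by hand for a small factor. It only illustrates the idea and is not the package's implementation; the `name=level` column naming is copied from the partition() output shown in this text.

```r
# A small factor with the same levels as the cyl example in this text.
cyl <- factor(c(6, 6, 4, 8),
              levels = c(4, 6, 8),
              labels = c("four", "six", "eight"))

# One logical column per level: TRUE where the row has that level.
dummies <- sapply(levels(cyl), function(lev) cyl == lev)
colnames(dummies) <- paste0("cyl=", levels(cyl))
print(dummies)
```

Each row has exactly one TRUE among the columns derived from the same factor, which is why such columns are later treated as mutually exclusive.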
For example, consider the built-in mtcars dataset. This
dataset contains information about various car models. For the sake of
illustration, let us first transform the cyl column into a
factor:
mtcars$cyl <- factor(mtcars$cyl,
                     levels = c(4, 6, 8),
                     labels = c("four", "six", "eight"))
head(mtcars)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 six 160 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 six 160 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 four 108 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 six 258 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 eight 360 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 six 225 105 2.76 3.460 20.22 1 0 3 1
Factors are transformed to dummy logical columns by the
partition() function automatically:
library(nuggets)  # provides partition()
partition(mtcars, cyl)
#> # A tibble: 32 × 13
#> mpg disp hp drat wt qsec vs am gear carb `cyl=four`
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
#> 1 21 160 110 3.9 2.62 16.5 0 1 4 4 FALSE
#> 2 21 160 110 3.9 2.88 17.0 0 1 4 4 FALSE
#> 3 22.8 108 93 3.85 2.32 18.6 1 1 4 1 TRUE
#> 4 21.4 258 110 3.08 3.22 19.4 1 0 3 1 FALSE
#> 5 18.7 360 175 3.15 3.44 17.0 0 0 3 2 FALSE
#> 6 18.1 225 105 2.76 3.46 20.2 1 0 3 1 FALSE
#> 7 14.3 360 245 3.21 3.57 15.8 0 0 3 4 FALSE
#> 8 24.4 147. 62 3.69 3.19 20 1 0 4 2 TRUE
#> 9 22.8 141. 95 3.92 3.15 22.9 1 0 4 2 TRUE
#> 10 19.2 168. 123 3.92 3.44 18.3 1 0 4 4 FALSE
#> `cyl=six` `cyl=eight`
#> <lgl> <lgl>
#> 1 TRUE FALSE
#> 2 TRUE FALSE
#> 3 FALSE FALSE
#> 4 TRUE FALSE
#> 5 FALSE TRUE
#> 6 TRUE FALSE
#> 7 FALSE TRUE
#> 8 FALSE FALSE
#> 9 FALSE FALSE
#> 10 TRUE FALSE
#> # ℹ 22 more rows
The vs, am, and gear columns
are numeric but actually represent categories. To transform them to
dummy logical columns in the same way as factors, we can use the
partition() function with the .method argument
set to "dummy":
partition(mtcars, vs:gear, .method = "dummy")
#> # A tibble: 32 × 15
#> mpg cyl disp hp drat wt qsec carb `vs=0` `vs=1` `am=0` `am=1`
#> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <lgl> <lgl>
#> 1 21 six 160 110 3.9 2.62 16.5 4 TRUE FALSE FALSE TRUE
#> 2 21 six 160 110 3.9 2.88 17.0 4 TRUE FALSE FALSE TRUE
#> 3 22.8 four 108 93 3.85 2.32 18.6 1 FALSE TRUE FALSE TRUE
#> 4 21.4 six 258 110 3.08 3.22 19.4 1 FALSE TRUE TRUE FALSE
#> 5 18.7 eight 360 175 3.15 3.44 17.0 2 TRUE FALSE TRUE FALSE
#> 6 18.1 six 225 105 2.76 3.46 20.2 1 FALSE TRUE TRUE FALSE
#> 7 14.3 eight 360 245 3.21 3.57 15.8 4 TRUE FALSE TRUE FALSE
#> 8 24.4 four 147. 62 3.69 3.19 20 2 FALSE TRUE TRUE FALSE
#> 9 22.8 four 141. 95 3.92 3.15 22.9 2 FALSE TRUE TRUE FALSE
#> 10 19.2 six 168. 123 3.92 3.44 18.3 4 FALSE TRUE TRUE FALSE
#> `gear=3` `gear=4` `gear=5`
#> <lgl> <lgl> <lgl>
#> 1 FALSE TRUE FALSE
#> 2 FALSE TRUE FALSE
#> 3 FALSE TRUE FALSE
#> 4 TRUE FALSE FALSE
#> 5 TRUE FALSE FALSE
#> 6 TRUE FALSE FALSE
#> 7 TRUE FALSE FALSE
#> 8 FALSE TRUE FALSE
#> 9 FALSE TRUE FALSE
#> 10 FALSE TRUE FALSE
#> # ℹ 22 more rows
The mpg column is numeric and takes many distinct values, so the
"dummy" method would create a separate logical column for nearly every
value. A better approach is to
use the "crisp" method of partitioning.
The "crisp" method divides the range of values of the
selected columns into intervals specified by the .breaks
argument and then encodes the values into dummy logical columns
corresponding to the intervals. The .breaks argument is a
numeric vector that specifies the interval boundaries.
For example, the mpg values can be divided into four
intervals: (-Inf, 15], (15, 20], (20, 30], and (30, Inf). The
.breaks argument is then the vector
c(-Inf, 15, 20, 30, Inf), which defines the boundaries of
these intervals.
partition(mtcars, mpg, .method = "crisp", .breaks = c(-Inf, 15, 20, 30, Inf))
#> # A tibble: 32 × 14
#> cyl disp hp drat wt qsec vs am gear carb `mpg=(-Inf;15]`
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
#> 1 six 160 110 3.9 2.62 16.5 0 1 4 4 FALSE
#> 2 six 160 110 3.9 2.88 17.0 0 1 4 4 FALSE
#> 3 four 108 93 3.85 2.32 18.6 1 1 4 1 FALSE
#> 4 six 258 110 3.08 3.22 19.4 1 0 3 1 FALSE
#> 5 eight 360 175 3.15 3.44 17.0 0 0 3 2 FALSE
#> 6 six 225 105 2.76 3.46 20.2 1 0 3 1 FALSE
#> 7 eight 360 245 3.21 3.57 15.8 0 0 3 4 TRUE
#> 8 four 147. 62 3.69 3.19 20 1 0 4 2 FALSE
#> 9 four 141. 95 3.92 3.15 22.9 1 0 4 2 FALSE
#> 10 six 168. 123 3.92 3.44 18.3 1 0 4 4 FALSE
#> `mpg=(15;20]` `mpg=(20;30]` `mpg=(30;Inf]`
#> <lgl> <lgl> <lgl>
#> 1 FALSE TRUE FALSE
#> 2 FALSE TRUE FALSE
#> 3 FALSE TRUE FALSE
#> 4 FALSE TRUE FALSE
#> 5 TRUE FALSE FALSE
#> 6 TRUE FALSE FALSE
#> 7 FALSE FALSE FALSE
#> 8 FALSE TRUE FALSE
#> 9 FALSE TRUE FALSE
#> 10 TRUE FALSE FALSE
#> # ℹ 22 more rows
Note: it is advisable to put -Inf and Inf
as the first and last elements of the .breaks vector to
ensure that all values are covered by the intervals.
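The effect of the infinite outer breaks can be illustrated with base R's cut(), which codes values into right-closed intervals in a similar way. This is only an analogy: how partition() itself treats values that fall outside finite breaks is not shown in this text.

```r
mpg <- c(10.4, 21.0, 33.9)

# Finite breaks only: values outside (15, 30] fall into no interval.
finite_only <- cut(mpg, breaks = c(15, 20, 30))

# Infinite outer breaks: every value falls into some interval.
with_inf <- cut(mpg, breaks = c(-Inf, 15, 20, 30, Inf))

print(finite_only)
print(with_inf)
```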
If we want the breaks to be evenly spaced across the range of values,
we can set .breaks to a single integer. This value
specifies the number of intervals to create. For example, the following
command divides the disp values into three intervals of
equal width:
partition(mtcars, disp, .method = "crisp", .breaks = 3)
#> # A tibble: 32 × 13
#> mpg cyl hp drat wt qsec vs am gear carb `disp=(-Inf;205]`
#> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
#> 1 21 six 110 3.9 2.62 16.5 0 1 4 4 TRUE
#> 2 21 six 110 3.9 2.88 17.0 0 1 4 4 TRUE
#> 3 22.8 four 93 3.85 2.32 18.6 1 1 4 1 TRUE
#> 4 21.4 six 110 3.08 3.22 19.4 1 0 3 1 FALSE
#> 5 18.7 eight 175 3.15 3.44 17.0 0 0 3 2 FALSE
#> 6 18.1 six 105 2.76 3.46 20.2 1 0 3 1 FALSE
#> 7 14.3 eight 245 3.21 3.57 15.8 0 0 3 4 FALSE
#> 8 24.4 four 62 3.69 3.19 20 1 0 4 2 TRUE
#> 9 22.8 four 95 3.92 3.15 22.9 1 0 4 2 TRUE
#> 10 19.2 six 123 3.92 3.44 18.3 1 0 4 4 TRUE
#> `disp=(205;338]` `disp=(338;Inf]`
#> <lgl> <lgl>
#> 1 FALSE FALSE
#> 2 FALSE FALSE
#> 3 FALSE FALSE
#> 4 TRUE FALSE
#> 5 FALSE TRUE
#> 6 TRUE FALSE
#> 7 FALSE TRUE
#> 8 FALSE FALSE
#> 9 FALSE FALSE
#> 10 FALSE FALSE
#> # ℹ 22 more rows
Each call to partition() returns a tibble with the
selected columns transformed to dummy logical columns, while the other
columns remain unchanged.
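The interval borders that appear in the disp column names above (205 and 338) can be reproduced by hand. The sketch below assumes that `.breaks = 3` splits the observed range into three intervals of equal width and then replaces the two outer edges with -Inf and Inf, as the column names suggest.

```r
disp <- datasets::mtcars$disp

# Three equal-width intervals need four edges over the observed range.
edges <- seq(min(disp), max(disp), length.out = 4)

# The two inner edges separate the three intervals.
inner <- edges[2:3]
print(round(inner))
```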
The transformation of the whole mtcars dataset to crisp
predicates can be done as follows:
crisp_mtcars <- mtcars |>
  partition(cyl, vs:gear, .method = "dummy") |>
  partition(mpg, .method = "crisp", .breaks = c(-Inf, 15, 20, 30, Inf)) |>
  partition(disp:carb, .method = "crisp", .breaks = 3)
head(crisp_mtcars, n = 3)
#> # A tibble: 3 × 32
#> `cyl=four` `cyl=six` `cyl=eight` `vs=0` `vs=1` `am=0` `am=1` `gear=3` `gear=4`
#> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE
#> 2 FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE
#> 3 TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE
#> `gear=5` `mpg=(-Inf;15]` `mpg=(15;20]` `mpg=(20;30]` `mpg=(30;Inf]`
#> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 FALSE FALSE FALSE TRUE FALSE
#> 2 FALSE FALSE FALSE TRUE FALSE
#> 3 FALSE FALSE FALSE TRUE FALSE
#> `disp=(-Inf;205]` `disp=(205;338]` `disp=(338;Inf]` `hp=(-Inf;146]`
#> <lgl> <lgl> <lgl> <lgl>
#> 1 TRUE FALSE FALSE TRUE
#> 2 TRUE FALSE FALSE TRUE
#> 3 TRUE FALSE FALSE TRUE
#> `hp=(146;241]` `hp=(241;Inf]` `drat=(-Inf;3.48]` `drat=(3.48;4.21]`
#> <lgl> <lgl> <lgl> <lgl>
#> 1 FALSE FALSE FALSE TRUE
#> 2 FALSE FALSE FALSE TRUE
#> 3 FALSE FALSE FALSE TRUE
#> `drat=(4.21;Inf]` `wt=(-Inf;2.82]` `wt=(2.82;4.12]` `wt=(4.12;Inf]`
#> <lgl> <lgl> <lgl> <lgl>
#> 1 FALSE TRUE FALSE FALSE
#> 2 FALSE FALSE TRUE FALSE
#> 3 FALSE TRUE FALSE FALSE
#> `qsec=(-Inf;17.3]` `qsec=(17.3;20.1]` `qsec=(20.1;Inf]` `carb=(-Inf;3.33]`
#> <lgl> <lgl> <lgl> <lgl>
#> 1 TRUE FALSE FALSE FALSE
#> 2 TRUE FALSE FALSE FALSE
#> 3 FALSE TRUE FALSE TRUE
#> `carb=(3.33;5.67]` `carb=(5.67;Inf]`
#> <lgl> <lgl>
#> 1 TRUE FALSE
#> 2 TRUE FALSE
#> 3 FALSE FALSE
Now all columns are logical and can be used as predicates in crisp conditions.
Preparation of Triangular and Raised-Cosine Fuzzy Predicates
In many real-world datasets, numeric attributes do not lend themselves to clear-cut, crisp boundaries. For example, deciding whether a car has “low mileage” or “high mileage” is often subjective. A vehicle with 19 miles per gallon may be considered “low” in one context but “medium” in another. Crisp intervals force a strict separation between categories, which can be too rigid and may lose information about gradual changes in the data.
To address this, fuzzy predicates are used. A fuzzy
predicate expresses the degree to which a condition is satisfied.
Instead of being strictly TRUE or FALSE
(crisp values remain allowed as the special cases 1 and 0), each
predicate is represented by a number in the
interval \([0,1]\). A truth degree of 0
means the predicate is entirely false, 1 means it is fully true, and
values in between indicate partial membership. This allows us to model
smooth transitions between categories and capture more nuanced
patterns.
For example, a fuzzy predicate could represent “medium horsepower” in
the mtcars dataset. A car with 120 hp may belong to this
category to a degree of 0.8, while a car with 150 hp may belong to it
only to a degree of 0.2. Such representations are more faithful to human
reasoning and often yield patterns that are both more robust and more
interpretable.
The transformation of numeric columns to fuzzy predicates can be done
with the partition() function. As with crisp partitioning,
factors are transformed to dummy logical columns. Numeric columns,
however, are transformed into fuzzy truth values. The
partition() function provides two fuzzy partitioning
methods:
- .method = "triangle" creates fuzzy sets with triangular or trapezoidal membership functions;
- .method = "raisedcos" creates fuzzy sets with raised cosine or trapezoidal raised-cosine membership functions.
These membership functions specify how strongly a value belongs to a fuzzy set. The choice of function depends on the desired smoothness of the transition between sets.
More advanced fuzzy partitioning of numeric columns can be achieved with the lfl package, which provides tools for defining fuzzy sets of many types, including linguistic terms such as “very small” or “extremely big”. See the lfl documentation for more information.
Both triangular and raised cosine shapes are fully defined by three
points: the left border, the peak, and the right border. The
.breaks argument in the partition() function
specifies these points. See the following figure for an illustration of
triangular and raised cosine membership functions for
.breaks = c(-10, 0, 10):

Comparison of triangular and raised cosine membership functions for
.breaks = c(-10, 0, 10)
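The two shapes for `.breaks = c(-10, 0, 10)` can also be compared numerically. The sketch below uses the usual textbook formulas for a triangular and a raised cosine membership function with finite borders a < b < c; the helper names `triangle()` and `raisedcos()` are hypothetical, and it is an assumption that the package uses exactly these formulas.

```r
# Triangular membership: linear rise from a to b, linear fall from b to c.
triangle <- function(x, a, b, c) {
  pmax(0, pmin((x - a) / (b - a), (c - x) / (c - b)))
}

# Raised cosine membership: half a cosine wave on each flank.
raisedcos <- function(x, a, b, c) {
  width <- ifelse(x <= b, b - a, c - b)  # width of the flank x lies on
  inside <- x > a & x < c
  ifelse(inside, 0.5 * (1 + cos(pi * (x - b) / width)), 0)
}

x <- c(-10, -5, 0, 5, 10)
tri <- triangle(x, -10, 0, 10)
rc  <- raisedcos(x, -10, 0, 10)
print(rbind(x = x, triangle = tri, raisedcos = rc))
```

Both functions agree at the borders, at the peak, and halfway up each flank. In between, the raised cosine lies above the triangle near the peak and below it near the borders, which gives the smoother transition.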
Each consecutive triplet of values in .breaks defines
one fuzzy set. To create e.g. three fuzzy sets, five break points are
needed. For instance, .breaks = c(-10, -5, 0, 5, 10)
defines three fuzzy sets with peaks at -5, 0, and 5. See the following
figure for an illustration of these fuzzy sets:

Fuzzy sets with triangular membership functions for
partition(x, .method = "triangle", .breaks = c(-10, -5, 0, 5, 10))
It is often useful to extend the fuzzy sets on the edges to infinity.
That ensures that all values are covered by the fuzzy sets. To achieve
that, -Inf and Inf can be added as the first
and last elements of the .breaks vector:

Fuzzy sets with triangular membership functions for
partition(x, .method = "triangle", .breaks = c(-Inf, -5, 0, 5, Inf))
If a regular partitioning of the range of values is desired,
.breaks can be set to a single integer, which specifies the
number of fuzzy sets to create. For example, .breaks = 4
creates a partitioning with four fuzzy sets:

Fuzzy sets with triangular membership functions for
partition(x, .method = "triangle", .breaks = 4)
The same is valid for raised cosine fuzzy sets. For instance, the
following figure shows five raised cosine fuzzy sets defined by
.breaks = c(-Inf, -10, -5, 0, 5, 10, Inf):

Fuzzy sets with raised cosine membership functions for
partition(x, .method = "raisedcos", .breaks = c(-Inf, -10, -5, 0, 5, 10, Inf))
A fuzzy transformation of the whole mtcars dataset can
be done as follows:
fuzzy_mtcars <- mtcars |>
  partition(cyl, vs:gear, .method = "dummy") |>
  partition(mpg, .method = "triangle", .breaks = c(-Inf, 15, 20, 30, Inf)) |>
  partition(disp:carb, .method = "triangle", .breaks = 3)
head(fuzzy_mtcars, n = 3)
#> # A tibble: 3 × 31
#> `cyl=four` `cyl=six` `cyl=eight` `vs=0` `vs=1` `am=0` `am=1` `gear=3` `gear=4`
#> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE
#> 2 FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE
#> 3 TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE
#> `gear=5` `mpg=(-Inf;15;20)` `mpg=(15;20;30)` `mpg=(20;30;Inf)`
#> <lgl> <dbl> <dbl> <dbl>
#> 1 FALSE 0 0.9 0.1
#> 2 FALSE 0 0.9 0.1
#> 3 FALSE 0 0.72 0.28
#> `disp=(-Inf;71.1;272)` `disp=(71.1;272;472)` `disp=(272;472;Inf)`
#> <dbl> <dbl> <dbl>
#> 1 0.557 0.443 0
#> 2 0.557 0.443 0
#> 3 0.816 0.184 0
#> `hp=(-Inf;52;194)` `hp=(52;194;335)` `hp=(194;335;Inf)`
#> <dbl> <dbl> <dbl>
#> 1 0.592 0.408 0
#> 2 0.592 0.408 0
#> 3 0.711 0.289 0
#> `drat=(-Inf;2.76;3.84)` `drat=(2.76;3.84;4.93)` `drat=(3.84;4.93;Inf)`
#> <dbl> <dbl> <dbl>
#> 1 0 0.945 0.0550
#> 2 0 0.945 0.0550
#> 3 0 0.991 0.00917
#> `wt=(-Inf;1.51;3.47)` `wt=(1.51;3.47;5.42)` `wt=(3.47;5.42;Inf)`
#> <dbl> <dbl> <dbl>
#> 1 0.434 0.566 0
#> 2 0.304 0.696 0
#> 3 0.587 0.413 0
#> `qsec=(-Inf;14.5;18.7)` `qsec=(14.5;18.7;22.9)` `qsec=(18.7;22.9;Inf)`
#> <dbl> <dbl> <dbl>
#> 1 0.533 0.467 0
#> 2 0.4 0.6 0
#> 3 0.0214 0.979 0
#> `carb=(-Inf;1;4.5)` `carb=(1;4.5;8)` `carb=(4.5;8;Inf)`
#> <dbl> <dbl> <dbl>
#> 1 0.143 0.857 0
#> 2 0.143 0.857 0
#> 3 1 0 0
Note that the cyl, vs, am, and
gear columns are still represented by dummy logical
columns, while the mpg, disp, and other
columns are now represented by fuzzy sets. This combination allows both
crisp and fuzzy predicates to be used together in pattern discovery,
offering more flexibility and interpretability.
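The mpg truth degrees in the output above can be checked by hand. The following sketch assumes the usual triangular membership formula, with an infinite border meaning that the set stays at 1 on that side. The helper `triangle_open()` and the labels low, medium, and high are hypothetical names used only here.

```r
# Triangular membership that tolerates -Inf / Inf as outer borders.
triangle_open <- function(x, a, b, c) {
  left  <- if (is.finite(a)) (x - a) / (b - a) else rep(1, length(x))
  right <- if (is.finite(c)) (c - x) / (c - b) else rep(1, length(x))
  pmax(0, pmin(left, right, 1))
}

# mpg of the first three cars in mtcars (21.0, 21.0, 22.8).
mpg <- c(21.0, 21.0, 22.8)

degrees <- cbind(
  low    = triangle_open(mpg, -Inf, 15, 20),  # mpg=(-Inf;15;20)
  medium = triangle_open(mpg, 15, 20, 30),    # mpg=(15;20;30)
  high   = triangle_open(mpg, 20, 30, Inf)    # mpg=(20;30;Inf)
)
print(degrees)
```

Under this formula the degrees of neighbouring sets sum to 1 for every value, so each observation is fully distributed among the fuzzy sets of its column.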
Preparation of Trapezoidal Fuzzy Predicates
The triangular and raised cosine membership functions are often sufficient to capture gradual transitions in numeric data. However, in some situations it is useful to have fuzzy sets that stay fully true (membership = 1) over a wider interval before decreasing again. This generalization corresponds to a trapezoidal fuzzy set, which can be seen as a triangle or raised cosine with a “flat top”.
With partition(), trapezoids can be defined for both
"triangle" and "raisedcos" methods by
controlling how many consecutive break points constitute one fuzzy set
and how far the window shifts along the breaks. That can be accomplished
with the .span and .inc arguments:
- .span: specifies the width of the flat top in terms of the number of break intervals that should be merged.
- .inc: the shift of the window along .breaks when forming the next fuzzy set.
By default, .span = 1 and .inc = 1, which
means that each fuzzy set is triangular or raised cosine. Setting
.span to a value greater than 1 creates trapezoidal fuzzy
sets. With .span = 2, each fuzzy set is defined by four
consecutive break points, and the flat top lies between the two inner
ones. The
following figure is the result of setting .span = 2 and
.breaks = c(-10, -5, 5, 10):

Fuzzy sets with triangular membership functions for
partition(x, .method = "triangle", .span = 2, .breaks = c(-10, -5, 5, 10))
Additional fuzzy sets are created by shifting the window along the
break points. The shift is controlled by the .inc argument.
By default, .inc = 1, which means that the window shifts by
one break point. Consider the following example that shows the effect of
setting .inc = 1 in addition to .span = 2 and
.breaks = c(-15, -10, -5, 0, 5, 10, 15):

Fuzzy sets with triangular membership functions for
partition(x, .method = "triangle", .inc = 1, .span = 2, .breaks = c(-15, -10, -5, 0, 5, 10, 15))
Setting .inc to a value greater than 1 modifies the
shift of the window along the break points. For example, with
.inc = 3, the window shifts by three break points, which
effectively skips two fuzzy sets after each created fuzzy set:

Fuzzy sets with triangular membership functions for
partition(x, .method = "triangle", .inc = 3, .span = 2, .breaks = c(-15, -10, -5, 0, 5, 10, 15))
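The sliding window over the break points can be sketched in base R. This is a reconstruction from the description in this section, not the package's code: it assumes that each fuzzy set takes `.span + 2` consecutive break points and that the window start advances by `.inc`. The helper `fuzzy_windows()` is a hypothetical name.

```r
# List the break points that would define each fuzzy set.
fuzzy_windows <- function(breaks, span = 1, inc = 1) {
  width <- span + 2                      # break points per fuzzy set
  stopifnot(length(breaks) >= width)
  starts <- seq(1, length(breaks) - width + 1, by = inc)
  lapply(starts, function(s) breaks[s:(s + width - 1)])
}

breaks <- c(-15, -10, -5, 0, 5, 10, 15)
dense  <- fuzzy_windows(breaks, span = 2, inc = 1)
sparse <- fuzzy_windows(breaks, span = 2, inc = 3)

print(length(dense))   # number of trapezoids with .inc = 1
print(sparse)          # windows kept with .inc = 3
```

Under these assumptions, `.inc = 3` keeps the first and the fourth window and drops the two in between, which matches the description of skipping two fuzzy sets after each created one.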
Pre-defined Patterns
The package nuggets provides a set of functions for
discovering some of the best-known pattern types. These functions can
process Boolean data, fuzzy data, or both. Each function returns a
tibble, where every row represents one detected pattern.
Note: This section assumes that the data have already been preprocessed, i.e., transformed into a binarized or fuzzified form. See the previous section Data Preparation for details on how to prepare your dataset (for example, crisp_mtcars and fuzzy_mtcars).
For more advanced workflows — such as defining custom pattern types or computing user-defined measures — see the section Custom Patterns.
Search for Association Rules
Association rules identify conditions (antecedents) under which a specific feature (consequent) is present very often.
\[ A \Rightarrow C \]
If condition A is satisfied, then the feature
C tends to be present.
For example, the rule university_edu & middle_age & IT_industry => high_income
can be read as:
People in middle age with university education working in IT
industry are very likely to have a high income.
In practice, the antecedent A is a set of predicates,
and the consequent C is usually a single predicate.
For a set of predicates \(I\), let \(\text{supp}(I)\) denote the support, i.e., the relative frequency (for logical data) or the mean truth degree (for fuzzy data) of rows satisfying all predicates in \(I\). Writing \(c\) for the single consequent predicate, the basic measures of a rule are:
- Length: number of predicates in the antecedent.
- Coverage: \(\text{supp}(A)\).
- Consequent support: \(\text{supp}(\{c\})\).
- Support: \(\text{supp}(A \cup \{c\})\).
- Confidence: \(\text{supp}(A \cup \{c\}) / \text{supp}(A)\).
Optional additional measures ("lift",
"conviction", "added_value") can be computed
using the measures argument.
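For crisp data, these measures can be computed by hand in base R. The sketch below does so for the rule "eight cylinders implies automatic transmission" (am = 0) on an untouched copy of mtcars. The formulas for lift and conviction are the usual textbook definitions; that the package computes them in exactly this way is an assumption.

```r
cars <- datasets::mtcars   # untouched copy; cyl is still numeric here

antecedent <- cars$cyl == 8   # A
consequent <- cars$am == 0    # c

coverage       <- mean(antecedent)               # supp(A)
conseq_support <- mean(consequent)               # supp({c})
support        <- mean(antecedent & consequent)  # supp(A u {c})
confidence     <- support / coverage

# Textbook definitions of two optional measures.
lift       <- confidence / conseq_support
conviction <- (1 - conseq_support) / (1 - confidence)

print(round(c(coverage = coverage, support = support,
              confidence = confidence, lift = lift,
              conviction = conviction), 3))
```

For fuzzy data, the logical AND and the relative frequency would be replaced by a t-norm and the mean truth degree, as described above.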
Before searching for rules, it is recommended to create a vector of disjoints, which specifies predicates that must not appear together in the same condition. This vector should have the same length as the number of dataset columns.
For example, columns representing gear=3 and
gear=4 are mutually exclusive, so their shared group label
in disj prevents meaningless conditions like
gear=3 & gear=4. You can conveniently generate this
vector with var_names():
disj <- var_names(colnames(fuzzy_mtcars))
print(disj)
#> [1] "cyl" "cyl" "cyl" "vs" "vs" "am" "am" "gear" "gear" "gear"
#> [11] "mpg" "mpg" "mpg" "disp" "disp" "disp" "hp" "hp" "hp" "drat"
#> [21] "drat" "drat" "wt" "wt" "wt" "qsec" "qsec" "qsec" "carb" "carb"
#> [31] "carb"
The dig_associations() function searches for association
rules. Its main arguments are:
- x: the data matrix or data frame (logical or numeric);
- antecedent, consequent: tidyselect expressions selecting columns for each side of the rule;
- disjoint: a vector defining mutually exclusive predicates;
- rule filtering thresholds such as min_support, min_confidence, min_coverage, and limits like min_length, max_length;
- optional parameters such as measures, t_norm, and contingency_table.
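The role of the disjoint argument can be illustrated without the package. The following base R sketch derives group labels from `name=value` column names by keeping the text before the first `=`, which reproduces the kind of labels shown in the var_names() output above. It mimics that output only and is not the implementation of var_names().

```r
cols <- c("gear=3", "gear=4", "mpg=(-Inf;15;20)", "am=0")

# Group label: the text before the first "=".
groups <- sub("=.*$", "", cols)
print(groups)

# TRUE where two different columns share a label and therefore
# must not appear together in one condition.
clash <- outer(groups, groups, "==") & !diag(length(cols))
dimnames(clash) <- list(cols, cols)
print(clash)
```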
In the following example, we search for fuzzy association rules in
the dataset fuzzy_mtcars, such that:
- any column except those starting with "am" may appear in the antecedent;
- columns starting with "am" may appear in the consequent;
- minimum support is 0.02;
- minimum confidence is 0.8;
- additional quality measures "lift" and "conviction" are computed.
result <- dig_associations(fuzzy_mtcars,
                           antecedent = !starts_with("am"),
                           consequent = starts_with("am"),
                           disjoint = disj,
                           min_support = 0.02,
                           min_confidence = 0.8,
                           measures = c("lift", "conviction"),
                           contingency_table = TRUE)
The result is a tibble containing the discovered rules and their quality metrics. You can arrange them, for example, by decreasing support:
library(dplyr)  # provides arrange() and desc()
result <- arrange(result, desc(support))
print(result)
#> # A tibble: 526 × 14
#> antecedent consequent support confidence coverage
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 {gear=3} {am=0} 0.469 1 0.469
#> 2 {gear=3,vs=0} {am=0} 0.375 1 0.375
#> 3 {cyl=eight,gear=3,vs=0} {am=0} 0.375 1 0.375
#> 4 {cyl=eight,vs=0} {am=0} 0.375 0.857 0.438
#> 5 {cyl=eight,gear=3} {am=0} 0.375 1 0.375
#> 6 {cyl=eight} {am=0} 0.375 0.857 0.438
#> 7 {mpg=(-Inf;15;20)} {am=0} 0.327 0.847 0.387
#> 8 {drat=(-Inf;2.76;3.84)} {am=0} 0.311 0.948 0.328
#> 9 {gear=3,mpg=(-Inf;15;20)} {am=0} 0.309 1 0.309
#> 10 {drat=(-Inf;2.76;3.84),gear=3} {am=0} 0.307 1 0.307
#> conseq_support count antecedent_length pp pn np nn lift
#> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.594 15 1 15 0 4 13 1.68
#> 2 0.594 12 2 12 0 7 13 1.68
#> 3 0.594 12 3 12 0 7 13 1.68
#> 4 0.594 12 2 12 2 7 11 1.44
#> 5 0.594 12 2 12 0 7 13 1.68
#> 6 0.594 12 1 12 2 7 11 1.44
#> 7 0.594 10.5 1 10.5 1.90 8.52 11.1 1.43
#> 8 0.594 9.96 1 9.96 0.546 9.04 12.5 1.60
#> 9 0.594 9.88 2 9.88 0 9.12 13.0 1.68
#> 10 0.594 9.82 2 9.82 0 9.18 13 1.68
#> conviction
#> <dbl>
#> 1 Inf
#> 2 Inf
#> 3 Inf
#> 4 2.84
#> 5 Inf
#> 6 2.84
#> 7 2.65
#> 8 7.82
#> 9 Inf
#> 10 Inf
#> # ℹ 516 more rows
This example illustrates the typical workflow for mining association
rules with nuggets. The same structure and arguments apply
when analyzing either fuzzy or Boolean datasets.
Custom Patterns
The nuggets package can execute a user-defined
callback function on each generated frequent condition. In this way,
custom types of patterns can be searched for. The following example
replicates the search for association rules with a custom callback
function. For that, a dataset has to be dichotomized and the disjoint
vector created as in the Data Preparation section
above:
#head(fuzzyCO2)
#print(disj)
As we want to search for association rules with some minimum support and confidence, we define variables to hold those thresholds. We also need to define a callback function that will be called for each frequent condition found. Its purpose is to generate the rules with the obtained condition as an antecedent:
min_support <- 0.02
min_confidence <- 0.8
f <- function(condition, support, foci_supports) {
  # confidence of each candidate rule: focus support / condition support
  conf <- foci_supports / support
  # keep only the foci that pass both thresholds
  sel <- !is.na(conf) & conf >= min_confidence &
    !is.na(foci_supports) & foci_supports >= min_support
  conf <- conf[sel]
  supp <- foci_supports[sel]
  # return one list per generated rule
  lapply(seq_along(conf), function(i) {
    list(antecedent = format_condition(names(condition)),
         consequent = format_condition(names(conf)[[i]]),
         support = supp[[i]],
         confidence = conf[[i]])
  })
}
The callback function f() defines three arguments:
condition, support and
foci_supports. The names of the arguments are not arbitrary.
Based on the argument names of the callback function, the searching
algorithm decides which information to pass to the function. Here
condition is a vector of indices representing the
conjunction of predicates in a condition. By a predicate we mean a
column in the source dataset. The support argument gets the
relative frequency of the condition in the dataset.
foci_supports is a vector of supports of special
predicates, which we call “foci” (plural of “focus”), within the rows
satisfying the condition. For association rules, foci are potential
rule consequents.
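The filtering step inside the callback can be followed on hand-made inputs, without running the search. In the sketch below, the values of `support` and `foci_supports` are written by hand to mimic what might be passed for the condition gear=3 with foci am=0 and am=1, using the figures from the association rules table earlier in this text. The exact values the search algorithm passes are an assumption.

```r
min_support <- 0.02
min_confidence <- 0.8

# Hand-made inputs: support of the condition and of condition & focus.
support <- 15 / 32
foci_supports <- c("am=0" = 15 / 32, "am=1" = 0)

# Confidence of each candidate rule, then the threshold filter.
conf <- foci_supports / support
keep <- !is.na(conf) & conf >= min_confidence &
  foci_supports >= min_support

rules <- data.frame(consequent = names(conf)[keep],
                    support    = unname(foci_supports[keep]),
                    confidence = unname(conf[keep]))
print(rules)
```

Only the focus am=0 survives the filter, so the callback would emit a single rule for this condition.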
Now we can run the digging for rules:
#result <- dig(fuzzyCO2,
#f = f,
#condition = !starts_with("Treatment"),
#focus = starts_with("Treatment"),
#disjoint = disj,
#min_length = 1,
#min_support = min_support)
As we return a list of lists in the callback function, we have to flatten the first level of lists in the result and bind it into a data frame:
#result <- result |>
#unlist(recursive = FALSE) |>
#lapply(as_tibble) |>
#do.call(rbind, args = _) |>
#arrange(desc(support))
#
#print(result)
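The flattening step can be tried on a small hand-written stand-in for the search result. The sketch below uses base R only (as.data.frame() instead of as_tibble(), order() instead of arrange()). The nested list is illustrative placeholder data built from two rows of the association rules table earlier in this text, not output of dig().

```r
# One list element per condition; each holds zero or more rules.
per_condition <- list(
  list(list(antecedent = "{cyl=eight}", consequent = "{am=0}",
            support = 0.375, confidence = 0.857)),
  list(),   # a condition that produced no rule
  list(list(antecedent = "{gear=3}", consequent = "{am=0}",
            support = 0.469, confidence = 1))
)

# Drop the per-condition level, then bind the rules row by row.
flat  <- unlist(per_condition, recursive = FALSE)
rules <- do.call(rbind, lapply(flat, as.data.frame))

# Sort by decreasing support.
rules <- rules[order(-rules$support), ]
print(rules)
```

Conditions that returned an empty list simply disappear during flattening, so no special handling is needed for them.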