Introduction

The nuggets package searches for patterns that can be expressed as formulae in the form of elementary conjunctions, referred to in this text as conditions. Conditions are constructed from predicates, which correspond to data columns. The interpretation of conditions depends on the choice of underlying logic:

  • Crisp (Boolean) logic: each predicate takes values TRUE (1) or FALSE (0). The truth value of a condition is computed according to the rules of classical Boolean algebra.

  • Fuzzy logic: each predicate is assigned a truth degree from the interval \([0, 1]\). The truth degree of a conjunction is then computed using a chosen triangular norm (t-norm). The package supports three common t-norms, which are defined for predicates’ truth degrees \(a, b \in [0, 1]\) as follows:

    • Gödel (minimum) t-norm: \(\min(a, b)\);
    • Goguen (product) t-norm: \(a \cdot b\);
    • Łukasiewicz t-norm: \(\max(0, a + b - 1)\).
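
The three t-norms listed above can be sketched in a few lines of base R. The function names below are hypothetical and are not part of nuggets; the sketch only illustrates the formulas.

```r
# Three t-norms on truth degrees a, b in [0, 1] (base R, vectorised).
# The function names are illustrative only, not the nuggets API.
tnorm_goedel      <- function(a, b) pmin(a, b)
tnorm_goguen      <- function(a, b) a * b
tnorm_lukasiewicz <- function(a, b) pmax(0, a + b - 1)

a <- 0.7
b <- 0.6
tnorm_goedel(a, b)
#> [1] 0.6
tnorm_goguen(a, b)
#> [1] 0.42
tnorm_lukasiewicz(a, b)
#> [1] 0.3

# On crisp values (0 or 1), all three coincide with the Boolean AND
grid <- expand.grid(a = c(0, 1), b = c(0, 1))
boolean_and <- as.numeric(grid$a & grid$b)
all(tnorm_goedel(grid$a, grid$b) == boolean_and,
    tnorm_goguen(grid$a, grid$b) == boolean_and,
    tnorm_lukasiewicz(grid$a, grid$b) == boolean_and)
#> [1] TRUE
```

For the same inputs, the Gödel t-norm yields the largest truth degree and the Łukasiewicz t-norm the smallest, so the choice of t-norm affects how quickly long conjunctions lose support.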

Before applying nuggets, data columns intended as predicates must be prepared either by dichotomization (conversion into dummy variables) or by transformation into fuzzy sets. The package provides functions for both transformations. See the section Data Preparation below for more details.

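nuggets implements functions to search for pre-defined types of patterns, for example:

  • dig_associations() for association rules;
  • dig_correlations() for conditional correlations;
  • dig_contrasts() for contrast patterns.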

See Pre-defined Patterns below for further details.

Discovered rules and patterns can be post-processed, visualized, and explored interactively. Section Post-processing and Visualization describes these features.

Finally, the package allows users to provide custom evaluation functions for conditions and to search for user-defined types of patterns:

  • dig() is a general function for searching arbitrary pattern types.
  • dig_grid() is a wrapper around dig() for patterns defined by conditions and a pair of columns evaluated by a user-defined function.

See Custom Patterns for more information.

Data Preparation

Before searching for patterns, the data columns that serve as predicates in conditions must be transformed either to logical (TRUE/FALSE) columns or to fuzzy sets with values from the interval \([0, 1]\). The first option is simpler and faster, and it is recommended for most applications. The second option is more flexible and makes it possible to model uncertainty in data, but it is more computationally demanding.

Preparation of Crisp (Boolean) Predicates

For patterns based on crisp conditions, the data columns that serve as predicates in conditions have to be transformed to logical (TRUE/FALSE) columns. That can be done in two steps:

  • numeric columns can be transformed to factors with a selected number of levels, and then
  • factors can be transformed to dummy logical columns.

Both operations can be done with the help of the partition() function. The partition() function requires the dataset as its first argument and a tidyselect selection expression to select the columns to be transformed.

Factors and logical columns are automatically transformed to dummy logical columns by the partition() function. For numeric columns, the partition() function requires the .method argument to specify the method of partitioning:

  • .method = "dummy" transforms numeric columns to factors and then to dummy logical columns. That effectively creates a separate logical column for each distinct value of the numeric column.
  • .method = "crisp" transforms numeric columns to crisp predicates by dividing the range of values into intervals and coding the values into dummy logical columns according to the intervals.
  • Other methods of partitioning numeric columns create fuzzy predicates and are described in the following sections.

For example, consider the built-in mtcars dataset. This dataset contains information about various car models. For the sake of illustration, let us transform the cyl column into factor first:

mtcars$cyl <- factor(mtcars$cyl,
                     levels = c(4, 6, 8),
                     labels = c("four", "six", "eight"))
head(mtcars)
#>                    mpg   cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   six  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   six  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8  four  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   six  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7 eight  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   six  225 105 2.76 3.460 20.22  1  0    3    1

Factors are transformed to dummy logical columns by the partition() function automatically:

partition(mtcars, cyl)
#> # A tibble: 32 × 13
#>      mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb `cyl=four`
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>     
#>  1  21    160    110  3.9   2.62  16.5     0     1     4     4 FALSE     
#>  2  21    160    110  3.9   2.88  17.0     0     1     4     4 FALSE     
#>  3  22.8  108     93  3.85  2.32  18.6     1     1     4     1 TRUE      
#>  4  21.4  258    110  3.08  3.22  19.4     1     0     3     1 FALSE     
#>  5  18.7  360    175  3.15  3.44  17.0     0     0     3     2 FALSE     
#>  6  18.1  225    105  2.76  3.46  20.2     1     0     3     1 FALSE     
#>  7  14.3  360    245  3.21  3.57  15.8     0     0     3     4 FALSE     
#>  8  24.4  147.    62  3.69  3.19  20       1     0     4     2 TRUE      
#>  9  22.8  141.    95  3.92  3.15  22.9     1     0     4     2 TRUE      
#> 10  19.2  168.   123  3.92  3.44  18.3     1     0     4     4 FALSE     
#>    `cyl=six` `cyl=eight`
#>    <lgl>     <lgl>      
#>  1 TRUE      FALSE      
#>  2 TRUE      FALSE      
#>  3 FALSE     FALSE      
#>  4 TRUE      FALSE      
#>  5 FALSE     TRUE       
#>  6 TRUE      FALSE      
#>  7 FALSE     TRUE       
#>  8 FALSE     FALSE      
#>  9 FALSE     FALSE      
#> 10 TRUE      FALSE      
#> # ℹ 22 more rows

The vs, am, and gear columns are numeric but actually represent categories. To transform them to dummy logical columns in the same way as factors, we can use the partition() function with the .method argument set to "dummy":

partition(mtcars, vs:gear, .method = "dummy")
#> # A tibble: 32 × 15
#>      mpg cyl    disp    hp  drat    wt  qsec  carb `vs=0` `vs=1` `am=0` `am=1`
#>    <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>  <lgl>  <lgl>  <lgl> 
#>  1  21   six    160    110  3.9   2.62  16.5     4 TRUE   FALSE  FALSE  TRUE  
#>  2  21   six    160    110  3.9   2.88  17.0     4 TRUE   FALSE  FALSE  TRUE  
#>  3  22.8 four   108     93  3.85  2.32  18.6     1 FALSE  TRUE   FALSE  TRUE  
#>  4  21.4 six    258    110  3.08  3.22  19.4     1 FALSE  TRUE   TRUE   FALSE 
#>  5  18.7 eight  360    175  3.15  3.44  17.0     2 TRUE   FALSE  TRUE   FALSE 
#>  6  18.1 six    225    105  2.76  3.46  20.2     1 FALSE  TRUE   TRUE   FALSE 
#>  7  14.3 eight  360    245  3.21  3.57  15.8     4 TRUE   FALSE  TRUE   FALSE 
#>  8  24.4 four   147.    62  3.69  3.19  20       2 FALSE  TRUE   TRUE   FALSE 
#>  9  22.8 four   141.    95  3.92  3.15  22.9     2 FALSE  TRUE   TRUE   FALSE 
#> 10  19.2 six    168.   123  3.92  3.44  18.3     4 FALSE  TRUE   TRUE   FALSE 
#>    `gear=3` `gear=4` `gear=5`
#>    <lgl>    <lgl>    <lgl>   
#>  1 FALSE    TRUE     FALSE   
#>  2 FALSE    TRUE     FALSE   
#>  3 FALSE    TRUE     FALSE   
#>  4 TRUE     FALSE    FALSE   
#>  5 TRUE     FALSE    FALSE   
#>  6 TRUE     FALSE    FALSE   
#>  7 TRUE     FALSE    FALSE   
#>  8 FALSE    TRUE     FALSE   
#>  9 FALSE    TRUE     FALSE   
#> 10 FALSE    TRUE     FALSE   
#> # ℹ 22 more rows

The mpg column is numeric and takes many distinct values, so the "dummy" method would create a separate logical column for nearly every value. A better approach is to use the "crisp" method of partitioning.

The "crisp" method divides the range of values of the selected columns into intervals specified by the .breaks argument and then encodes the values into dummy logical columns corresponding to the intervals. The .breaks argument is a numeric vector that specifies the interval boundaries.

For example, the mpg values can be divided into four intervals: (-Inf, 15], (15, 20], (20, 30], and (30, Inf). The .breaks argument is then the vector c(-Inf, 15, 20, 30, Inf), which defines the boundaries of these intervals.

partition(mtcars, mpg, .method = "crisp", .breaks = c(-Inf, 15, 20, 30, Inf))
#> # A tibble: 32 × 14
#>    cyl    disp    hp  drat    wt  qsec    vs    am  gear  carb `mpg=(-Inf;15]`
#>    <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>          
#>  1 six    160    110  3.9   2.62  16.5     0     1     4     4 FALSE          
#>  2 six    160    110  3.9   2.88  17.0     0     1     4     4 FALSE          
#>  3 four   108     93  3.85  2.32  18.6     1     1     4     1 FALSE          
#>  4 six    258    110  3.08  3.22  19.4     1     0     3     1 FALSE          
#>  5 eight  360    175  3.15  3.44  17.0     0     0     3     2 FALSE          
#>  6 six    225    105  2.76  3.46  20.2     1     0     3     1 FALSE          
#>  7 eight  360    245  3.21  3.57  15.8     0     0     3     4 TRUE           
#>  8 four   147.    62  3.69  3.19  20       1     0     4     2 FALSE          
#>  9 four   141.    95  3.92  3.15  22.9     1     0     4     2 FALSE          
#> 10 six    168.   123  3.92  3.44  18.3     1     0     4     4 FALSE          
#>    `mpg=(15;20]` `mpg=(20;30]` `mpg=(30;Inf]`
#>    <lgl>         <lgl>         <lgl>         
#>  1 FALSE         TRUE          FALSE         
#>  2 FALSE         TRUE          FALSE         
#>  3 FALSE         TRUE          FALSE         
#>  4 FALSE         TRUE          FALSE         
#>  5 TRUE          FALSE         FALSE         
#>  6 TRUE          FALSE         FALSE         
#>  7 FALSE         FALSE         FALSE         
#>  8 FALSE         TRUE          FALSE         
#>  9 FALSE         TRUE          FALSE         
#> 10 TRUE          FALSE         FALSE         
#> # ℹ 22 more rows

Note: it is advisable to put -Inf and Inf as the first and last elements of the .breaks vector to ensure that all values are covered by the intervals.
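
The interval coding described above can be imitated in base R with cut(). This is only a sketch of the idea, not how partition() is implemented, and the column labels produced by cut() differ from those shown in the tibble output.

```r
# Base-R sketch of interval coding; partition() itself is not used here.
breaks <- c(-Inf, 15, 20, 30, Inf)

# cut() uses right-closed intervals by default: (-Inf,15], (15,20], ...
bins <- cut(mtcars$mpg, breaks = breaks)

# One logical column per interval
dummies <- sapply(levels(bins), function(lvl) bins == lvl)
head(dummies, n = 3)

# Because the breaks start at -Inf and end at Inf,
# every row falls into exactly one interval
all(rowSums(dummies) == 1)
#> [1] TRUE
```

With finite outer breaks, values outside the range would get NA from cut(), which is the situation the note above warns against.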

If we want the breaks to be evenly spaced across the range of values, we can set .breaks to a single integer. This value specifies the number of intervals to create. For example, the following command divides the disp values into three intervals of equal width:

partition(mtcars, disp, .method = "crisp", .breaks = 3)
#> # A tibble: 32 × 13
#>      mpg cyl      hp  drat    wt  qsec    vs    am  gear  carb `disp=(-Inf;205]`
#>    <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>            
#>  1  21   six     110  3.9   2.62  16.5     0     1     4     4 TRUE             
#>  2  21   six     110  3.9   2.88  17.0     0     1     4     4 TRUE             
#>  3  22.8 four     93  3.85  2.32  18.6     1     1     4     1 TRUE             
#>  4  21.4 six     110  3.08  3.22  19.4     1     0     3     1 FALSE            
#>  5  18.7 eight   175  3.15  3.44  17.0     0     0     3     2 FALSE            
#>  6  18.1 six     105  2.76  3.46  20.2     1     0     3     1 FALSE            
#>  7  14.3 eight   245  3.21  3.57  15.8     0     0     3     4 FALSE            
#>  8  24.4 four     62  3.69  3.19  20       1     0     4     2 TRUE             
#>  9  22.8 four     95  3.92  3.15  22.9     1     0     4     2 TRUE             
#> 10  19.2 six     123  3.92  3.44  18.3     1     0     4     4 TRUE             
#>    `disp=(205;338]` `disp=(338;Inf]`
#>    <lgl>            <lgl>           
#>  1 FALSE            FALSE           
#>  2 FALSE            FALSE           
#>  3 FALSE            FALSE           
#>  4 TRUE             FALSE           
#>  5 FALSE            TRUE            
#>  6 TRUE             FALSE           
#>  7 FALSE            TRUE            
#>  8 FALSE            FALSE           
#>  9 FALSE            FALSE           
#> 10 FALSE            FALSE           
#> # ℹ 22 more rows

Each call to partition() returns a tibble with the selected columns transformed to dummy logical columns, while the other columns remain unchanged.
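
The boundaries seen in the disp example (205 and 338) are consistent with splitting the observed range into equally wide pieces and opening the outer intervals to infinity. The helper below is a hypothetical base-R sketch of that computation; it is an assumption about how the boundaries arise, not code taken from nuggets.

```r
# Hypothetical helper: equal-width breaks with infinite outer boundaries
equal_width_breaks <- function(x, n) {
  inner <- seq(min(x), max(x), length.out = n + 1)
  # Replace the outermost boundaries by -Inf and Inf so that no value is left out
  c(-Inf, inner[-c(1, n + 1)], Inf)
}

b <- equal_width_breaks(mtcars$disp, 3)
round(b)
#> [1] -Inf  205  338  Inf
```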

The transformation of the whole mtcars dataset to crisp predicates can be done as follows:

crispMtcars <- mtcars |>
    partition(cyl, vs:gear, .method = "dummy") |>
    partition(mpg, .method = "crisp", .breaks = c(-Inf, 15, 20, 30, Inf)) |>
    partition(disp:carb, .method = "crisp", .breaks = 3) 

head(crispMtcars, n = 3)
#> # A tibble: 3 × 32
#>   `cyl=four` `cyl=six` `cyl=eight` `vs=0` `vs=1` `am=0` `am=1` `gear=3` `gear=4`
#>   <lgl>      <lgl>     <lgl>       <lgl>  <lgl>  <lgl>  <lgl>  <lgl>    <lgl>   
#> 1 FALSE      TRUE      FALSE       TRUE   FALSE  FALSE  TRUE   FALSE    TRUE    
#> 2 FALSE      TRUE      FALSE       TRUE   FALSE  FALSE  TRUE   FALSE    TRUE    
#> 3 TRUE       FALSE     FALSE       FALSE  TRUE   FALSE  TRUE   FALSE    TRUE    
#>   `gear=5` `mpg=(-Inf;15]` `mpg=(15;20]` `mpg=(20;30]` `mpg=(30;Inf]`
#>   <lgl>    <lgl>           <lgl>         <lgl>         <lgl>         
#> 1 FALSE    FALSE           FALSE         TRUE          FALSE         
#> 2 FALSE    FALSE           FALSE         TRUE          FALSE         
#> 3 FALSE    FALSE           FALSE         TRUE          FALSE         
#>   `disp=(-Inf;205]` `disp=(205;338]` `disp=(338;Inf]` `hp=(-Inf;146]`
#>   <lgl>             <lgl>            <lgl>            <lgl>          
#> 1 TRUE              FALSE            FALSE            TRUE           
#> 2 TRUE              FALSE            FALSE            TRUE           
#> 3 TRUE              FALSE            FALSE            TRUE           
#>   `hp=(146;241]` `hp=(241;Inf]` `drat=(-Inf;3.48]` `drat=(3.48;4.21]`
#>   <lgl>          <lgl>          <lgl>              <lgl>             
#> 1 FALSE          FALSE          FALSE              TRUE              
#> 2 FALSE          FALSE          FALSE              TRUE              
#> 3 FALSE          FALSE          FALSE              TRUE              
#>   `drat=(4.21;Inf]` `wt=(-Inf;2.82]` `wt=(2.82;4.12]` `wt=(4.12;Inf]`
#>   <lgl>             <lgl>            <lgl>            <lgl>          
#> 1 FALSE             TRUE             FALSE            FALSE          
#> 2 FALSE             FALSE            TRUE             FALSE          
#> 3 FALSE             TRUE             FALSE            FALSE          
#>   `qsec=(-Inf;17.3]` `qsec=(17.3;20.1]` `qsec=(20.1;Inf]` `carb=(-Inf;3.33]`
#>   <lgl>              <lgl>              <lgl>             <lgl>             
#> 1 TRUE               FALSE              FALSE             FALSE             
#> 2 TRUE               FALSE              FALSE             FALSE             
#> 3 FALSE              TRUE               FALSE             TRUE              
#>   `carb=(3.33;5.67]` `carb=(5.67;Inf]`
#>   <lgl>              <lgl>            
#> 1 TRUE               FALSE            
#> 2 TRUE               FALSE            
#> 3 FALSE              FALSE

Now all columns are logical and can be used as predicates in crisp conditions.

Preparation of Triangular and Raised-Cosine Fuzzy Predicates

In many real-world datasets, numeric attributes do not lend themselves to clear-cut, crisp boundaries. For example, deciding whether a car has “low mileage” or “high mileage” is often subjective. A vehicle with 19 miles per gallon may be considered “low” in one context but “medium” in another. Crisp intervals force a strict separation between categories, which can be too rigid and may lose information about gradual changes in the data.

To address this, fuzzy predicates are used. A fuzzy predicate expresses the degree to which a condition is satisfied. Instead of being restricted to TRUE or FALSE (which remain possible as the extreme cases), each predicate is represented by a number in the interval \([0,1]\). A truth degree of 0 means the predicate is entirely false, 1 means it is fully true, and values in between indicate partial membership. This allows us to model smooth transitions between categories and capture more nuanced patterns.

For example, a fuzzy predicate could represent “medium horsepower” in the mtcars dataset. A car with 120 hp may belong to this category to a degree of 0.8, while a car with 150 hp may belong to it only to a degree of 0.2. Such representations are more faithful to human reasoning and often yield patterns that are both more robust and more interpretable.

The transformation of numeric columns to fuzzy predicates can be done with the partition() function. As with crisp partitioning, factors are transformed to dummy logical columns. Numeric columns, however, are transformed into fuzzy truth values. The partition() function provides two fuzzy partitioning methods:

  • .method = "triangle" creates fuzzy sets with triangular or trapezoidal membership functions;
  • .method = "raisedcos" creates fuzzy sets with raised cosine or trapezoidal raised-cosine membership functions.

These membership functions specify how strongly a value belongs to a fuzzy set. The choice of function depends on the desired smoothness of the transition between sets.

More advanced fuzzy partitioning of numeric columns can be achieved with the lfl package, which provides tools for defining fuzzy sets of many types, including linguistic terms such as “very small” or “extremely big”. See the lfl documentation for more information.

Both triangular and raised cosine shapes are fully defined by three points: the left border, the peak, and the right border. The .breaks argument in the partition() function specifies these points. See the following figure for an illustration of triangular and raised cosine membership functions for .breaks = c(-10, 0, 10):

Comparison of triangular and raised cosine membership functions for .breaks = c(-10, 0, 10)

Each consecutive triplet of values in .breaks defines one fuzzy set. To create e.g. three fuzzy sets, five break points are needed. For instance, .breaks = c(-10, -5, 0, 5, 10) defines three fuzzy sets with peaks at -5, 0, and 5. See the following figure for an illustration of these fuzzy sets:

Fuzzy sets with triangular membership functions for partition(x, .method = "triangle", .breaks = c(-10, -5, 0, 5, 10))
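
To make the role of the break triplets concrete, here is a minimal base-R sketch of a triangular membership function applied to each consecutive triplet of .breaks = c(-10, -5, 0, 5, 10). The triangle() helper is hypothetical and handles finite borders only; it is not the nuggets implementation.

```r
# Triangular membership function given by left border, peak and right border
# (illustrative helper, finite borders only)
triangle <- function(x, left, peak, right) {
  pmax(0, pmin((x - left) / (peak - left), (right - x) / (right - peak)))
}

breaks <- c(-10, -5, 0, 5, 10)

# Each consecutive triplet of breaks defines one fuzzy set
n_sets <- length(breaks) - 2
x <- c(-5, -2.5, 0, 4)
degrees <- sapply(seq_len(n_sets), function(i) {
  triangle(x, breaks[i], breaks[i + 1], breaks[i + 2])
})
colnames(degrees) <- paste0("peak=", breaks[2:(n_sets + 1)])
rownames(degrees) <- x
degrees
#>      peak=-5 peak=0 peak=5
#> -5       1.0    0.0    0.0
#> -2.5     0.5    0.5    0.0
#> 0        0.0    1.0    0.0
#> 4        0.0    0.2    0.8
```

Between the first and the last peak, the membership degrees of neighbouring sets add up to 1, so a value such as -2.5 belongs half to the set peaking at -5 and half to the set peaking at 0.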

It is often useful to extend the fuzzy sets on the edges to infinity. That ensures that all values are covered by the fuzzy sets. To achieve that, -Inf and Inf can be added as the first and last elements of the .breaks vector:

Fuzzy sets with triangular membership functions for partition(x, .method = "triangle", .breaks = c(-Inf, -5, 0, 5, Inf))

If a regular partitioning of the range of values is desired, .breaks can be set to a single integer, which specifies the number of fuzzy sets to create. For example, .breaks = 4 creates a partitioning with four fuzzy sets:

Fuzzy sets with triangular membership functions for partition(x, .method = "triangle", .breaks = 4)

The same is valid for raised cosine fuzzy sets. For instance, the following figure shows five raised cosine fuzzy sets defined by .breaks = c(-Inf, -10, -5, 0, 5, 10, Inf):

Fuzzy sets with raised cosine membership functions for partition(x, .method = "raisedcos", .breaks = c(-Inf, -10, -5, 0, 5, 10, Inf))
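
For comparison with the triangular shape, the sketch below uses a common textbook form of the raised-cosine membership function. The exact formula used by nuggets is not stated in this text, so the raisedcos() helper is an assumption for illustration only and covers finite borders only.

```r
# Assumed raised-cosine membership function (finite borders only);
# a common definition, not necessarily identical to the one in nuggets
raisedcos <- function(x, left, peak, right) {
  # Position of x relative to the peak, scaled so that the borders map to -1 and 1
  u <- ifelse(x < peak, (x - peak) / (peak - left), (x - peak) / (right - peak))
  ifelse(abs(u) >= 1, 0, (1 + cos(pi * u)) / 2)
}

# Membership degrees for the set given by .breaks = c(-10, 0, 10)
x <- c(-10, -7.5, -5, -2.5, 0)
round(raisedcos(x, -10, 0, 10), 3)
#> [1] 0.000 0.146 0.500 0.854 1.000
```

A triangle with the same three points would give 0, 0.25, 0.5, 0.75 and 1 at these positions, so the raised cosine stays closer to 0 near the borders and closer to 1 near the peak.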

A fuzzy transformation of the whole mtcars dataset can be done as follows:

fuzzyMtcars <- mtcars |>
    partition(cyl, vs:gear, .method = "dummy") |>
    partition(mpg, .method = "triangle", .breaks = c(-Inf, 15, 20, 30, Inf)) |>
    partition(disp:carb, .method = "triangle", .breaks = 3) 

head(fuzzyMtcars, n = 3)
#> # A tibble: 3 × 31
#>   `cyl=four` `cyl=six` `cyl=eight` `vs=0` `vs=1` `am=0` `am=1` `gear=3` `gear=4`
#>   <lgl>      <lgl>     <lgl>       <lgl>  <lgl>  <lgl>  <lgl>  <lgl>    <lgl>   
#> 1 FALSE      TRUE      FALSE       TRUE   FALSE  FALSE  TRUE   FALSE    TRUE    
#> 2 FALSE      TRUE      FALSE       TRUE   FALSE  FALSE  TRUE   FALSE    TRUE    
#> 3 TRUE       FALSE     FALSE       FALSE  TRUE   FALSE  TRUE   FALSE    TRUE    
#>   `gear=5` `mpg=(-Inf;15;20)` `mpg=(15;20;30)` `mpg=(20;30;Inf)`
#>   <lgl>                 <dbl>            <dbl>             <dbl>
#> 1 FALSE                     0             0.9               0.1 
#> 2 FALSE                     0             0.9               0.1 
#> 3 FALSE                     0             0.72              0.28
#>   `disp=(-Inf;71.1;272)` `disp=(71.1;272;472)` `disp=(272;472;Inf)`
#>                    <dbl>                 <dbl>                <dbl>
#> 1                  0.557                 0.443                    0
#> 2                  0.557                 0.443                    0
#> 3                  0.816                 0.184                    0
#>   `hp=(-Inf;52;194)` `hp=(52;194;335)` `hp=(194;335;Inf)`
#>                <dbl>             <dbl>              <dbl>
#> 1              0.592             0.408                  0
#> 2              0.592             0.408                  0
#> 3              0.711             0.289                  0
#>   `drat=(-Inf;2.76;3.84)` `drat=(2.76;3.84;4.93)` `drat=(3.84;4.93;Inf)`
#>                     <dbl>                   <dbl>                  <dbl>
#> 1                       0                   0.945                0.0550 
#> 2                       0                   0.945                0.0550 
#> 3                       0                   0.991                0.00917
#>   `wt=(-Inf;1.51;3.47)` `wt=(1.51;3.47;5.42)` `wt=(3.47;5.42;Inf)`
#>                   <dbl>                 <dbl>                <dbl>
#> 1                 0.434                 0.566                    0
#> 2                 0.304                 0.696                    0
#> 3                 0.587                 0.413                    0
#>   `qsec=(-Inf;14.5;18.7)` `qsec=(14.5;18.7;22.9)` `qsec=(18.7;22.9;Inf)`
#>                     <dbl>                   <dbl>                  <dbl>
#> 1                  0.533                    0.467                      0
#> 2                  0.4                      0.6                        0
#> 3                  0.0214                   0.979                      0
#>   `carb=(-Inf;1;4.5)` `carb=(1;4.5;8)` `carb=(4.5;8;Inf)`
#>                 <dbl>            <dbl>              <dbl>
#> 1               0.143            0.857                  0
#> 2               0.143            0.857                  0
#> 3               1                0                      0

Note that the cyl, vs, am, and gear columns are still represented by dummy logical columns, while the mpg, disp, and other columns are now represented by fuzzy sets. This combination allows both crisp and fuzzy predicates to be used together in pattern discovery, offering more flexibility and interpretability.
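
The fuzzy mpg degrees printed above can be reproduced by hand. The following base-R sketch uses a hypothetical triangle_inf() helper that treats an infinite border as a plateau at 1; this is an assumption that matches the printed values, not the package's own code.

```r
# Triangular membership with optional infinite borders (illustrative helper)
triangle_inf <- function(x, left, peak, right) {
  rising  <- if (is.infinite(left))  rep(1, length(x)) else (x - left) / (peak - left)
  falling <- if (is.infinite(right)) rep(1, length(x)) else (right - x) / (right - peak)
  pmax(0, pmin(rising, falling, 1))
}

mpg <- c(21.0, 21.0, 22.8)            # mpg of the first three cars in mtcars
breaks <- c(-Inf, 15, 20, 30, Inf)

# One column per fuzzy set: (-Inf;15;20), (15;20;30), (20;30;Inf)
mpg_degrees <- sapply(1:3, function(i) {
  triangle_inf(mpg, breaks[i], breaks[i + 1], breaks[i + 2])
})
mpg_degrees
#>      [,1] [,2] [,3]
#> [1,]    0 0.90 0.10
#> [2,]    0 0.90 0.10
#> [3,]    0 0.72 0.28
```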

Preparation of Trapezoidal Fuzzy Predicates

The triangular and raised cosine membership functions are often sufficient to capture gradual transitions in numeric data. However, in some situations it is useful to have fuzzy sets that stay fully true (membership = 1) over a wider interval before decreasing again. This generalization corresponds to a trapezoidal fuzzy set, which can be seen as a triangle or raised cosine with a “flat top”.

With partition(), trapezoids can be defined for both "triangle" and "raisedcos" methods by controlling how many consecutive break points constitute one fuzzy set and how far the window shifts along the breaks. That can be accomplished with the .span and .inc arguments:

  • .span: controls the width of the flat top. With .span = 1 the top is a single break point (a peak); larger values stretch the flat top over more consecutive break points.
  • .inc: the shift of the window along .breaks when forming the next fuzzy set.

By default, .span = 1 and .inc = 1, which means that each fuzzy set is triangular or raised cosine. Setting .span to a value greater than 1 creates trapezoidal fuzzy sets. With .span = 2, each fuzzy set is defined by four consecutive break points: the outer two are the borders, and the flat top lies between the inner two. The following figure is the result of setting .span = 2 and .breaks = c(-10, -5, 5, 10), where the flat top stretches from -5 to 5:

Fuzzy sets with triangular membership functions for partition(x, .method = "triangle", .span = 2, .breaks = c(-10, -5, 5, 10))

Additional fuzzy sets are created by shifting the window along the break points. The shift is controlled by the .inc argument. By default, .inc = 1, which means that the window shifts by one break point. Consider the following example that shows the effect of setting .inc = 1 in addition to .span = 2 and .breaks = c(-15, -10, -5, 0, 5, 10, 15):

Fuzzy sets with triangular membership functions for partition(x, .method = "triangle", .inc = 1, .span = 2, .breaks = c(-15, -10, -5, 0, 5, 10, 15))

Setting .inc to a value greater than 1 modifies the shift of the window along the break points. For example, with .inc = 3, the window shifts by three break points, which effectively skips two fuzzy sets after each created fuzzy set:

Fuzzy sets with triangular membership functions for partition(x, .method = "triangle", .inc = 3, .span = 2, .breaks = c(-15, -10, -5, 0, 5, 10, 15))
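
The interplay of .span and .inc can be sketched as a sliding window over the break points. The fuzzy_windows() helper below is hypothetical and follows the description in this section (three break points per set for .span = 1, four for .span = 2); it is an assumption about the windowing, not code from nuggets.

```r
# Hypothetical helper: list the break points that define each fuzzy set
fuzzy_windows <- function(breaks, span = 1, inc = 1) {
  width <- span + 2                     # two borders plus the flat-top points
  stopifnot(length(breaks) >= width)
  starts <- seq(1, length(breaks) - width + 1, by = inc)
  lapply(starts, function(s) breaks[s:(s + width - 1)])
}

breaks <- c(-15, -10, -5, 0, 5, 10, 15)

# .span = 2, .inc = 1: the window moves one break point at a time
length(fuzzy_windows(breaks, span = 2, inc = 1))
#> [1] 4

# .span = 2, .inc = 3: two of the four windows are skipped
fuzzy_windows(breaks, span = 2, inc = 3)
#> [[1]]
#> [1] -15 -10  -5   0
#>
#> [[2]]
#> [1]  0  5 10 15
```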

Pre-defined Patterns

nuggets provides a set of functions for searching for some of the best-known pattern types. These functions can process Boolean data, fuzzy data, or both. The result is always a tibble with patterns stored as rows. For more advanced usage, such as searching for custom patterns or computing user-defined measures and statistics, see the section Custom Patterns.

Search for Association Rules

Association rules are rules of the form \(A \Rightarrow B\), where \(A\) is a Boolean or fuzzy condition in the form of a conjunction, and \(B\) is a Boolean or fuzzy predicate.

Before searching for rules, it is advisable to create a so-called vector of disjoints. It is a character vector whose length equals the number of columns in the analyzed dataset. It identifies predicates that are mutually exclusive and should not be combined in a single condition: columns with equal values in the vector of disjoints will never appear together in one condition. Providing the vector of disjoints speeds up the search, because it makes no sense, e.g., to combine Plant=Qn1 and Plant=Qn2 into the condition Plant=Qn1 & Plant=Qn2, which is never true for any data row.

The vector of disjoints can be easily created from the column names of the dataset, e.g., by obtaining the first part of column names before the equal sign, which is neatly provided by the var_names() function as follows:

#disj <- var_names(colnames(fuzzyCO2))
#print(disj)
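
The idea of taking the part of each column name before the equal sign can be sketched in base R as follows. The column names are taken from the mtcars examples earlier in this text; var_names() may treat edge cases differently, so this is only an approximation of its behaviour.

```r
# Column names as produced by the partitioning examples above
cols <- c("cyl=four", "cyl=six", "cyl=eight", "mpg=(-Inf;15]", "mpg=(15;20]")

# Keep only the part before the first equal sign
disj_sketch <- sub("=.*$", "", cols)
disj_sketch
#> [1] "cyl" "cyl" "cyl" "mpg" "mpg"
```

Columns sharing a value in this vector, such as the three cyl columns, would not be combined in a single condition.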

The dig_associations() function takes the analyzed dataset as its first argument, followed by a pair of tidyselect expressions that select the columns allowed on the left-hand (antecedent) and right-hand (consequent) side of the rule. The following command searches for association rules such that:

  • any column except those starting with “Treatment” is in the antecedent;
  • any column starting with “Treatment” is in the consequent;
  • the minimum support is 0.02 (support is the proportion of rows that satisfy both the antecedent and the consequent);
  • the minimum confidence is 0.8 (confidence is the proportion of rows satisfying the consequent among the rows that satisfy the antecedent).

#result <- dig_associations(fuzzyCO2,
#                           antecedent = !starts_with("Treatment"),
#                           consequent = starts_with("Treatment"),
#                           disjoint = disj,
#                           min_support = 0.02,
#                           min_confidence = 0.8)

The result is a tibble with the rules found. We may arrange it by support in descending order:

#result <- arrange(result, desc(support))
#print(result)
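
As a reminder of what the two thresholds measure, the following base-R sketch computes support and confidence of a single crisp rule on made-up data. The vectors are invented for illustration and have nothing to do with the dataset used above.

```r
# Made-up truth values of an antecedent and a consequent for eight rows
antecedent <- c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
consequent <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)

# Support: proportion of rows satisfying both the antecedent and the consequent
support <- mean(antecedent & consequent)
support
#> [1] 0.5

# Confidence: proportion of rows satisfying the consequent
# among the rows that satisfy the antecedent
confidence <- sum(antecedent & consequent) / sum(antecedent)
confidence
#> [1] 0.8
```

Such a rule would pass both thresholds used above (0.02 for support and 0.8 for confidence).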

Conditional Correlations

TBD (dig_correlations)

Contrast Patterns

TBD (dig_contrasts)

Post-processing and Visualization

TBD

Custom Patterns

The nuggets package can execute a user-defined callback function on each generated frequent condition, which makes it possible to search for custom types of patterns. The following example replicates the search for association rules with a custom callback function. For that, the dataset has to be dichotomized and the disjoint vector created as described in the sections above:

#head(fuzzyCO2)
#print(disj)

As we want to search for association rules with some minimum support and confidence, we define variables to hold those thresholds. We also need to define a callback function that will be called for each frequent condition found. Its purpose is to generate the rules with the obtained condition as an antecedent:

min_support <- 0.02
min_confidence <- 0.8

f <- function(condition, support, foci_supports) {
    # confidence of each candidate rule: condition => focus
    conf <- foci_supports / support
    # keep only the foci that reach both thresholds
    sel <- !is.na(conf) & conf >= min_confidence &
        !is.na(foci_supports) & foci_supports >= min_support
    conf <- conf[sel]
    supp <- foci_supports[sel]

    # return one list element per generated rule
    lapply(seq_along(conf), function(i) {
      list(antecedent = format_condition(names(condition)),
           consequent = format_condition(names(conf)[[i]]),
           support = supp[[i]],
           confidence = conf[[i]])
    })
}

The callback function f() defines three arguments: condition, support and foci_supports. The argument names are significant: the search algorithm decides which information to pass to the callback based on them. Here, condition is a vector of indices representing the conjunction of predicates in a condition, where a predicate means a column of the source dataset. The support argument receives the relative frequency of the condition in the dataset. foci_supports is a vector of supports of special predicates, which we call “foci” (plural of “focus”), within the rows satisfying the condition. For association rules, foci are the potential rule consequents.

Now we can run the digging for rules:

#result <- dig(fuzzyCO2,
              #f = f,
              #condition = !starts_with("Treatment"),
              #focus = starts_with("Treatment"),
              #disjoint = disj,
              #min_length = 1,
              #min_support = min_support)

As the callback function returns a list of lists, we have to flatten the first level of lists in the result and bind it into a data frame:

#result <- result |>
  #unlist(recursive = FALSE) |>
  #lapply(as_tibble) |>
  #do.call(rbind, args = _) |>
  #arrange(desc(support))
#
#print(result)
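
The flattening step can be tried out without running the search. The sketch below uses a made-up stand-in for the raw result (one list per condition, each holding zero or more rules) and base R only; the rule strings and numbers are invented for illustration.

```r
# Toy stand-in for the raw result: one list per condition,
# each holding zero or more rules (all values are made up)
raw <- list(
  list(list(antecedent = "{a}", consequent = "{x}",
            support = 0.10, confidence = 0.90)),
  list(),
  list(list(antecedent = "{a,b}", consequent = "{x}",
            support = 0.05, confidence = 0.85),
       list(antecedent = "{a,b}", consequent = "{y}",
            support = 0.04, confidence = 0.80))
)

# Drop the per-condition level, so that each element is a single rule
rules <- unlist(raw, recursive = FALSE)

# Turn each rule into a one-row data frame and stack the rows
rules <- do.call(rbind, lapply(rules, as.data.frame))

# Sort by support in descending order
rules <- rules[order(-rules$support), ]
nrow(rules)
#> [1] 3
```

Conditions that produced no rule, like the empty second element here, simply disappear during flattening.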