[Deprecated]

Contrast patterns generalize association rules: each pattern specifies a condition under which some statistical feature differs significantly between two numeric variables.

Scheme:

theta(xvar) >> theta(yvar) | C

The feature theta of the first variable xvar is significantly higher than the feature theta of the second variable yvar under the condition C.

Example:

mean(daily_ice_cream_income) >> mean(daily_tea_income) | sunny

The mean of daily ice-cream income is significantly higher than the mean of daily tea income under the condition of sunny weather.

The contrast is computed using a statistical test, which is specified by the method argument. The function computes the contrast between all pairs of variables, where the first variable is specified by the xvars argument and the second variable is specified by the yvars argument. The contrast is computed in sub-data corresponding to conditions generated from the condition columns. The dig_contrasts() function supports crisp conditions only, i.e., the condition columns must be logical.
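As a rough base-R illustration of what a single contrast involves, the sketch below tests mean(conc) against mean(uptake) in the built-in CO2 data under the condition Treatment == "chilled". It does not call dig_contrasts(). The use of a paired t-test is an assumption, suggested by the degrees of freedom reported in the example output and by the name of the replacement function dig_paired_contrasts().

```r
# Hand-rolled sketch of one contrast; not the package's implementation.
co2 <- datasets::CO2

# A crisp (logical) condition and its support (relative frequency of TRUE rows).
cond <- co2$Treatment == "chilled"
support <- mean(cond)

# Sub-data corresponding to the condition.
x <- co2$conc[cond]
y <- co2$uptake[cond]

# Assumption: the two variables are compared row by row (paired test).
res <- t.test(x, y, paired = TRUE, alternative = "two.sided")

# A pattern would be reported only if its p-value does not exceed max_p_value.
keep <- res$p.value <= 0.05
```

In the actual function, the same kind of test is repeated for every generated condition and every (xvar, yvar) pair.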

Usage

dig_contrasts(
  x,
  condition = where(is.logical),
  xvars = where(is.numeric),
  yvars = where(is.numeric),
  method = "t",
  alternative = "two.sided",
  min_length = 0L,
  max_length = Inf,
  min_support = 0,
  max_p_value = 0.05,
  threads = 1,
  ...
)

Arguments

x

a matrix or data frame with data to search in.

condition

a tidyselect expression (see tidyselect syntax) specifying the columns to use as condition predicates. Only logical columns are supported.

xvars

a tidyselect expression (see tidyselect syntax) specifying the columns to use as the first variable (xvar) of each contrast.

yvars

a tidyselect expression (see tidyselect syntax) specifying the columns to use as the second variable (yvar) of each contrast.

method

a character string indicating which contrast to compute. One of "t", "wilcox", or "var". "t" (resp. "wilcox") computes a parametric (resp. non-parametric) test of equality in location, and "var" performs the F-test of equality of variances.

alternative

indicates the alternative hypothesis and must be one of "two.sided", "greater" or "less". "greater" corresponds to positive association, "less" to negative association.

min_length

the minimum size (the minimum number of predicates) of the condition to be generated (must be greater than or equal to 0). If 0, the empty condition is generated first.

max_length

the maximum size (the maximum number of predicates) of the condition to be generated. If equal to Inf, the maximum length of conditions is limited only by the number of available predicates.

min_support

the minimum support a condition must have for contrasts to be computed on it. The support of a condition is its relative frequency in the dataset x. For logical data, it equals the relative frequency of rows in which all condition predicates are TRUE. For numerical (double) input, the support is computed as the mean (over all rows) of the products of predicate values.

max_p_value

the maximum p-value of a test for the pattern to be considered significant. If the p-value of the test is greater than max_p_value, the pattern is not included in the result.

threads

the number of threads to use for parallel computation.

...

Further arguments passed to the underlying test function (t.test(), wilcox.test(), or var.test(), according to the selected method).
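The definition of support given for min_support can be checked by hand. The following base-R sketch uses two small made-up data frames (crisp and fuzzy are hypothetical names, not part of the package) to compute the support of the condition {a & b} in both the logical and the numeric case.

```r
# Hypothetical toy data; column names a and b are illustrative only.
crisp <- data.frame(
  a = c(TRUE, TRUE, FALSE, TRUE),
  b = c(TRUE, FALSE, FALSE, TRUE)
)

# Logical predicates: share of rows where every predicate is TRUE.
support_crisp <- mean(crisp$a & crisp$b)

fuzzy <- data.frame(
  a = c(1.0, 0.5, 0.0, 0.8),
  b = c(0.5, 1.0, 1.0, 0.5)
)

# Numeric predicates: mean over rows of the product of predicate values.
support_fuzzy <- mean(fuzzy$a * fuzzy$b)
```

With min_support = 0.4, the crisp condition above (support 0.5) would pass the threshold, while the numeric one (support 0.35) would not.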

Value

A tibble with found patterns in rows. The following columns are always present:

condition

the condition of the pattern as a character string in the form {p1 & p2 & ... & pn} where p1, p2, ..., pn are x's column names.

support

the support of the condition, i.e., the relative frequency of the condition in the dataset x.

xvar

the name of the first variable in the contrast.

yvar

the name of the second variable in the contrast.

p_value

the p-value of the underlying test.

rows

the number of rows in the sub-data corresponding to the condition.

alternative

a character string indicating the alternative hypothesis.

method

a character string indicating the method used for the test.

For the "t" method, the following additional columns are also present (see also t.test()):

estimate_x

the estimated mean of variable xvar.

estimate_y

the estimated mean of variable yvar.

t_statistic

the t-statistic of the t test.

df

the degrees of freedom of the t test.

conf_int_lo

the lower bound of the confidence interval.

conf_int_hi

the upper bound of the confidence interval.

stderr

the standard error of the mean difference.

For the "wilcox" method, the following additional columns are also present (see also wilcox.test()):

estimate

the estimate of the location parameter.

W_statistic

the Wilcoxon rank sum statistic.

conf_int_lo

the lower bound of the confidence interval.

conf_int_hi

the upper bound of the confidence interval.

For the "var" method, the following additional columns are also present (see also var.test()):

estimate

the ratio of the sample variances of variables xvar and yvar.

F_statistic

the value of the F test statistic.

df1

the numerator degrees of freedom.

df2

the denominator degrees of freedom.

conf_int_lo

the lower bound of the confidence interval for the ratio of the population variances.

conf_int_hi

the upper bound of the confidence interval for the ratio of the population variances.
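To relate the columns above to the underlying test objects, the sketch below runs var.test() on two CO2 columns and collects the pieces of the returned htest object under the column names listed for the "var" method. The mapping is an assumption inferred from the column names and descriptions, and the code runs on the full data with no condition applied; it is not the package's implementation.

```r
# Illustrative mapping from an htest object to result columns (assumed, not confirmed).
x <- datasets::CO2$conc
y <- datasets::CO2$uptake

res <- var.test(x, y, alternative = "two.sided")

row <- data.frame(
  estimate    = unname(res$estimate),      # ratio of sample variances
  F_statistic = unname(res$statistic),     # value of the F statistic
  df1         = unname(res$parameter[1]),  # numerator degrees of freedom
  df2         = unname(res$parameter[2]),  # denominator degrees of freedom
  conf_int_lo = res$conf.int[1],
  conf_int_hi = res$conf.int[2],
  p_value     = res$p.value
)
```

The same pattern applies to t.test() and wilcox.test(), whose htest objects expose estimate, statistic, and conf.int components as well (wilcox.test() only when called with conf.int = TRUE).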

Author

Michal Burda

Examples

crispCO2 <- partition(CO2, Plant:Treatment)
dig_contrasts(crispCO2,
             condition = where(is.logical),
             xvars = conc,
             yvars = uptake,
             method = "t",
             min_support = 0.1)
#> Warning: `dig_contrasts()` was deprecated in nuggets 1.3.0.
#>  Please use `dig_paired_contrasts()` instead.
#> # A tibble: 9 × 15
#>   condition support xvar  yvar  estimate_x estimate_y t_statistic    df  p_value
#>   <chr>       <dbl> <chr> <chr>      <dbl>      <dbl>       <dbl> <dbl>    <dbl>
#> 1 {}           1    conc  upta…       408.         NA       12.9     83 1.94e-21
#> 2 {Type=Qu…    0.5  conc  upta…       401.         NA        8.94    41 3.50e-11
#> 3 {Type=Mi…    0.5  conc  upta…       414.         NA        9.12    41 2.01e-11
#> 4 {Treatme…    0.5  conc  upta…       404.         NA        8.98    41 3.16e-11
#> 5 {Treatme…    0.5  conc  upta…       411.         NA        9.09    41 2.25e-11
#> 6 {Treatme…    0.25 conc  upta…       403.         NA        6.28    20 3.95e- 6
#> 7 {Treatme…    0.25 conc  upta…       419.         NA        6.42    20 2.91e- 6
#> 8 {Treatme…    0.25 conc  upta…       400.         NA        6.21    20 4.54e- 6
#> 9 {Treatme…    0.25 conc  upta…       409.         NA        6.33    20 3.56e- 6
#> # ℹ 6 more variables: rows <int>, conf_int_lo <dbl>, conf_int_hi <dbl>,
#> #   stderr <dbl>, alternative <chr>, method <chr>