Tests all columns selected by the .what argument and removes those that are almost constant. A column is considered almost constant if the proportion of its most frequent value is greater than or equal to the threshold given by the .threshold argument. See is_almost_constant() for details.
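
The criterion can be sketched in base R as follows. This is only an illustration inferred from the examples on this page, not the package's implementation: the helper name `almost_constant_sketch()` is hypothetical, and treating an all-NA column as constant when NA values are dropped is an assumption based on the example output.

```r
# Hypothetical helper (not part of the package): TRUE if the most frequent
# value of x makes up at least `threshold` of its elements.
almost_constant_sketch <- function(x, threshold = 1, na_rm = FALSE) {
  if (na_rm) {
    x <- x[!is.na(x)]            # drop NA values before counting
  }
  if (length(x) == 0) {
    return(TRUE)                 # assumption: nothing left counts as constant
  }
  counts <- table(x, useNA = "ifany")  # NA is counted as a value if present
  max(counts) / length(x) >= threshold
}

almost_constant_sketch(rep("b", 10))   # TRUE: one value fills the column
almost_constant_sketch(1:10)           # FALSE: every value occurs once
```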

Usage

remove_almost_constant(
  .data,
  .what = everything(),
  ...,
  .threshold = 1,
  .na_rm = FALSE,
  .verbose = FALSE
)

Arguments

.data

a data frame

.what

a tidyselect expression (see tidyselect syntax) selecting the columns to be processed

...

optional other tidyselect expressions selecting additional columns to be processed

.threshold

a numeric scalar in the range \([0, 1]\) specifying the threshold for the proportion of the most frequent value; a column is removed if that proportion is at least the threshold

.na_rm

a logical scalar indicating whether to remove NA values before computing the proportion of the most frequent value. See is_almost_constant() for details of how NA values are handled.

.verbose

a logical scalar indicating whether to print a message about removed columns
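
To see how .na_rm interacts with .threshold, the following base R arithmetic works through a column with the same values as column d from the Examples section. It is a sketch that assumes NA is counted as an ordinary value when .na_rm = FALSE, as the example output suggests; the variable names are illustrative only.

```r
# Same values as column `d` in the Examples section.
d_col <- c(rep(TRUE, 4), rep(FALSE, 4), NA, NA)

# NA kept: the most frequent value occurs 4 times out of 10.
prop_keep_na <- max(table(d_col, useNA = "ifany")) / length(d_col)

# NA dropped: the most frequent value occurs 4 times out of 8.
non_na <- d_col[!is.na(d_col)]
prop_drop_na <- max(table(non_na)) / length(non_na)

prop_keep_na   # 0.4
prop_drop_na   # 0.5
```

With .threshold = 0.5, the first proportion falls below the threshold and the second reaches it, which matches the examples where d is kept with .na_rm = FALSE and removed with .na_rm = TRUE.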

Value

A data frame from which all (almost) constant columns among those selected by the .what argument have been removed. Columns that are not selected are always kept.
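
The effect of the column selection on the returned data frame can be sketched in base R. The function `remove_almost_constant_sketch()` and its `cols` argument are hypothetical stand-ins (a plain character vector of column names instead of a tidyselect expression), and the per-column test is an assumption inferred from the examples on this page.

```r
# Hypothetical base R sketch: only columns named in `cols` are tested;
# all other columns are returned untouched.
remove_almost_constant_sketch <- function(data, cols = names(data),
                                          threshold = 1, na_rm = FALSE) {
  is_const <- vapply(data[cols], function(x) {
    if (na_rm) x <- x[!is.na(x)]
    if (length(x) == 0) return(TRUE)  # assumption: empty counts as constant
    max(table(x, useNA = "ifany")) / length(x) >= threshold
  }, logical(1))
  # Keep every column except the tested ones flagged as almost constant.
  data[setdiff(names(data), cols[is_const])]
}

d <- data.frame(a1 = 1:10,
                a2 = c(1:9, NA),
                b1 = "b",
                b2 = NA,
                c1 = rep(c(TRUE, FALSE), 5),
                c2 = rep(c(TRUE, NA), 5),
                d = c(rep(TRUE, 4), rep(FALSE, 4), NA, NA))

# Only a1, a2, b1 and b2 are tested, so c1, c2 and d survive
# even though they would be removed at this threshold if selected.
out <- remove_almost_constant_sketch(d, c("a1", "a2", "b1", "b2"),
                                     threshold = 0.5, na_rm = TRUE)
names(out)
```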

Author

Michal Burda

Examples

d <- data.frame(a1 = 1:10,
                a2 = c(1:9, NA),
                b1 = "b",
                b2 = NA,
                c1 = rep(c(TRUE, FALSE), 5),
                c2 = rep(c(TRUE, NA), 5),
                d = c(rep(TRUE, 4), rep(FALSE, 4), NA, NA))
remove_almost_constant(d, .threshold = 1.0, .na_rm = FALSE)
#> # A tibble: 10 × 5
#>       a1    a2 c1    c2    d    
#>    <int> <int> <lgl> <lgl> <lgl>
#>  1     1     1 TRUE  TRUE  TRUE 
#>  2     2     2 FALSE NA    TRUE 
#>  3     3     3 TRUE  TRUE  TRUE 
#>  4     4     4 FALSE NA    TRUE 
#>  5     5     5 TRUE  TRUE  FALSE
#>  6     6     6 FALSE NA    FALSE
#>  7     7     7 TRUE  TRUE  FALSE
#>  8     8     8 FALSE NA    FALSE
#>  9     9     9 TRUE  TRUE  NA   
#> 10    10    NA FALSE NA    NA   
remove_almost_constant(d, .threshold = 1.0, .na_rm = TRUE)
#> # A tibble: 10 × 4
#>       a1    a2 c1    d    
#>    <int> <int> <lgl> <lgl>
#>  1     1     1 TRUE  TRUE 
#>  2     2     2 FALSE TRUE 
#>  3     3     3 TRUE  TRUE 
#>  4     4     4 FALSE TRUE 
#>  5     5     5 TRUE  FALSE
#>  6     6     6 FALSE FALSE
#>  7     7     7 TRUE  FALSE
#>  8     8     8 FALSE FALSE
#>  9     9     9 TRUE  NA   
#> 10    10    NA FALSE NA   
remove_almost_constant(d, .threshold = 0.5, .na_rm = FALSE)
#> # A tibble: 10 × 3
#>       a1    a2 d    
#>    <int> <int> <lgl>
#>  1     1     1 TRUE 
#>  2     2     2 TRUE 
#>  3     3     3 TRUE 
#>  4     4     4 TRUE 
#>  5     5     5 FALSE
#>  6     6     6 FALSE
#>  7     7     7 FALSE
#>  8     8     8 FALSE
#>  9     9     9 NA   
#> 10    10    NA NA   
remove_almost_constant(d, .threshold = 0.5, .na_rm = TRUE)
#> # A tibble: 10 × 2
#>       a1    a2
#>    <int> <int>
#>  1     1     1
#>  2     2     2
#>  3     3     3
#>  4     4     4
#>  5     5     5
#>  6     6     6
#>  7     7     7
#>  8     8     8
#>  9     9     9
#> 10    10    NA
remove_almost_constant(d, a1:b2, .threshold = 0.5, .na_rm = TRUE)
#> # A tibble: 10 × 5
#>       a1    a2 c1    c2    d    
#>    <int> <int> <lgl> <lgl> <lgl>
#>  1     1     1 TRUE  TRUE  TRUE 
#>  2     2     2 FALSE NA    TRUE 
#>  3     3     3 TRUE  TRUE  TRUE 
#>  4     4     4 FALSE NA    TRUE 
#>  5     5     5 TRUE  TRUE  FALSE
#>  6     6     6 FALSE NA    FALSE
#>  7     7     7 TRUE  TRUE  FALSE
#>  8     8     8 FALSE NA    FALSE
#>  9     9     9 TRUE  TRUE  NA   
#> 10    10    NA FALSE NA    NA