Exploring Base R's cut() function

library(adcp)
library(dplyr)
library(gbRd)

This vignette explores the behavior of base R’s cut() function, which is used to calculate the intervals in adcp_count_obs(). These intervals are used in adcp_plot_speed_hist() and can be passed to adcp_plot_current_rose() using the breaks argument.

`cut()` function

First, let’s look at the help file for cut:

cut	R Documentation

Convert Numeric to Factor

Description

cut divides the range of x into intervals and codes the values in x according to which interval they fall. The leftmost interval corresponds to level one, the next leftmost to level two and so on.

Usage

cut(x, ...)

## Default S3 method:
cut(x, breaks, labels = NULL,
    include.lowest = FALSE, right = TRUE, dig.lab = 3,
    ordered_result = FALSE, ...)

Arguments

`x`	a numeric vector which is to be converted to a factor by cutting.
`breaks`	either a numeric vector of two or more unique cut points or a single number (greater than or equal to 2) giving the number of intervals into which `x` is to be cut.
`labels`	labels for the levels of the resulting category. By default, labels are constructed using `“(a,b]”` interval notation. If `labels = FALSE`, simple integer codes are returned instead of a factor.
`include.lowest`	logical, indicating if an ‘x[i]’ equal to the lowest (or highest, for `right = FALSE`) ‘breaks’ value should be included.
`right`	logical, indicating if the intervals should be closed on the right (and open on the left) or vice versa.
`dig.lab`	integer which is used when labels are not given. It determines the number of digits used in formatting the break numbers.
`ordered_result`	logical: should the result be an ordered factor?
`…`	further arguments passed to or from other methods.

Details

When breaks is specified as a single number, the range of the data is divided into breaks pieces of equal length, and then the outer limits are moved away by 0.1% of the range to ensure that the extreme values both fall within the break intervals. (If x is a constant vector, equal-length intervals are created, one of which includes the single value.)

If a labels parameter is specified, its values are used to name the factor levels. If none is specified, the factor level labels are constructed as “(b1, b2]”, “(b2, b3]” etc. for right = TRUE and as “[b1, b2)”, … if right = FALSE. In this case, dig.lab indicates the minimum number of digits should be used in formatting the numbers b1, b2, …. A larger value (up to 12) will be used if needed to distinguish between any pair of endpoints: if this fails labels such as “Range3” will be used. Formatting is done by formatC.

The default method will sort a numeric vector of breaks, but other methods are not required to and labels will correspond to the intervals after sorting.

As from R 3.2.0, getOption("OutDec") is consulted when labels are constructed for labels = NULL.

Value

A factor is returned, unless labels = FALSE which results in an integer vector of level codes.

Values which fall outside the range of breaks are coded as NA, as are NaN and NA values.

Note

Instead of table(cut(x, br)), hist(x, br, plot = FALSE) is more efficient and less memory hungry. Instead of cut(*, labels = FALSE), findInterval() is more efficient.

References

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.

Examples

Z <- stats::rnorm(10000)
table(cut(Z, breaks = -6:6))
sum(table(cut(Z, breaks = -6:6, labels = FALSE)))
sum(graphics::hist(Z, breaks = -6:6, plot = FALSE)$counts)

cut(rep(1,5), 4) #-- dummy
tx0 <- c(9, 4, 6, 5, 3, 10, 5, 3, 5)
x <- rep(0:8, tx0)
stopifnot(table(x) == tx0)

table( cut(x, breaks = 8))
table( cut(x, breaks = 3*(-2:5)))
table( cut(x, breaks = 3*(-2:5), right = FALSE))

##--- some values OUTSIDE the breaks :
table(cx  <- cut(x, breaks = 2*(0:4)))
table(cxl <- cut(x, breaks = 2*(0:4), right = FALSE))
which(is.na(cx));  x[is.na(cx)]  #-- the first 9  values  0
which(is.na(cxl)); x[is.na(cxl)] #-- the last  5  values  8


## Label construction:
y <- stats::rnorm(100)
table(cut(y, breaks = pi/3*(-3:3)))
table(cut(y, breaks = pi/3*(-3:3), dig.lab = 4))

table(cut(y, breaks =  1*(-3:3), dig.lab = 4))
# extra digits don't "harm" here
table(cut(y, breaks =  1*(-3:3), right = FALSE))
#- the same, since no exact INT!

## sometimes the default dig.lab is not enough to be avoid confusion:
aaa <- c(1,2,3,4,5,2,3,4,5,6,7)
cut(aaa, 3)
cut(aaa, 3, dig.lab = 4, ordered_result = TRUE)

## one way to extract the breakpoints
labs <- levels(cut(aaa, 3))
cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs) ),
      upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs) ))

Explore with simple example

The arguments we want to explore are include.lowest, right, and dig.lab.

Create a simple dataframe:

df <- data.frame(value = 1:10)

df
#>    value
#> 1      1
#> 2      2
#> 3      3
#> 4      4
#> 5      5
#> 6      6
#> 7      7
#> 8      8
#> 9      9
#> 10    10

Assign number of intervals

Assign values to three equal-width intervals using the default arguments (include.lowest = FALSE, right = TRUE, and dig.lab = 3):

df$int1 <- cut(
  df$value, 
  breaks = 3, 
  include.lowest = FALSE,
  right = TRUE,
  dig.lab = 3
)

df
#>    value      int1
#> 1      1 (0.991,4]
#> 2      2 (0.991,4]
#> 3      3 (0.991,4]
#> 4      4 (0.991,4]
#> 5      5     (4,7]
#> 6      6     (4,7]
#> 7      7     (4,7]
#> 8      8    (7,10]
#> 9      9    (7,10]
#> 10    10    (7,10]

Notes:

- Intervals are right-inclusive.
- The outer limits are expanded by 0.1 % of the range ((10 - 1) * 0.001 = 0.009).
  - The start of the first interval is the minimum value minus 0.1 % of the range (1 - 0.009 = 0.991).
  - The end of the last interval is the maximum value plus 0.1 % of the range (10 + 0.009 = 10.009).
- The end of the last interval is really 10.009. Because dig.lab = 3, the label is formatted with only three significant digits, so it is displayed as 10.
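
The expansion described in these notes can be reproduced by hand. The following is a minimal sketch that follows the rule quoted in the Details section of the help file, assuming the same vector of 1 to 10. The names x, rng, width, and manual_breaks are illustrative and are not part of cut() or adcp.

```r
# Rebuild the break points that cut(x, breaks = 3) derives from the data
x <- 1:10

rng <- range(x)      # minimum and maximum: 1 and 10
width <- diff(rng)   # range of the data: 9

# Three equal-width intervals need four break points: 1, 4, 7, 10
manual_breaks <- seq(rng[1], rng[2], length.out = 4)

# Move the outer limits away by 0.1 % of the range (9 / 1000 = 0.009)
manual_breaks[c(1, 4)] <- c(rng[1] - width / 1000, rng[2] + width / 1000)

# Labels built from the manual breaks, with enough digits to show the limits
levels(cut(x, breaks = manual_breaks, dig.lab = 5))
#> [1] "(0.991,4]"  "(4,7]"      "(7,10.009]"

# Every value should land in the same interval as with breaks = 3
all(as.integer(cut(x, breaks = manual_breaks)) == as.integer(cut(x, breaks = 3)))
```

Supplying the break points explicitly like this is one way to see exactly which limits a single-number breaks argument implies.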

To verify the end of the last interval, set dig.lab = 5:

df$int2 <- cut(
  df$value, 
  breaks = 3, 
  include.lowest = FALSE,
  right = TRUE,
  dig.lab = 5
)

df
#>    value      int1       int2
#> 1      1 (0.991,4]  (0.991,4]
#> 2      2 (0.991,4]  (0.991,4]
#> 3      3 (0.991,4]  (0.991,4]
#> 4      4 (0.991,4]  (0.991,4]
#> 5      5     (4,7]      (4,7]
#> 6      6     (4,7]      (4,7]
#> 7      7     (4,7]      (4,7]
#> 8      8    (7,10] (7,10.009]
#> 9      9    (7,10] (7,10.009]
#> 10    10    (7,10] (7,10.009]
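
The help file states that label formatting is done by formatC. As a rough illustration, the sketch below calls formatC directly on the upper limit with three and then five significant digits. It is not a copy of what cut() does internally, and the name upper is illustrative.

```r
# Upper limit of the last interval when breaks = 3 and the data run from 1 to 10
upper <- 10 + (10 - 1) * 0.001

# Three significant digits, as with dig.lab = 3
formatC(upper, digits = 3, width = 1)
#> [1] "10"

# Five significant digits, as with dig.lab = 5
formatC(upper, digits = 5, width = 1)
#> [1] "10.009"
```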

For intervals that are left inclusive, set right = FALSE:

df$int3 <- cut(
  df$value, 
  breaks = 3, 
  include.lowest = FALSE,
  right = FALSE,
  dig.lab = 3
)

df
#>    value      int1       int2      int3
#> 1      1 (0.991,4]  (0.991,4] [0.991,4)
#> 2      2 (0.991,4]  (0.991,4] [0.991,4)
#> 3      3 (0.991,4]  (0.991,4] [0.991,4)
#> 4      4 (0.991,4]  (0.991,4]     [4,7)
#> 5      5     (4,7]      (4,7]     [4,7)
#> 6      6     (4,7]      (4,7]     [4,7)
#> 7      7     (4,7]      (4,7]    [7,10)
#> 8      8    (7,10] (7,10.009]    [7,10)
#> 9      9    (7,10] (7,10.009]    [7,10)
#> 10    10    (7,10] (7,10.009]    [7,10)

Notes:

- The assigned interval changes for the values 4 and 7.
- In this example, dig.lab = 3, so the final interval is displayed as [7,10). The value 10 is assigned to this right-exclusive interval because the interval is really [7,10.009).

To verify the last interval, set right = FALSE and dig.lab = 5:

df$int4 <- cut(
  df$value, 
  breaks = 3, 
  include.lowest = FALSE,
  right = FALSE,
  dig.lab = 5
)

df
#>    value      int1       int2      int3       int4
#> 1      1 (0.991,4]  (0.991,4] [0.991,4)  [0.991,4)
#> 2      2 (0.991,4]  (0.991,4] [0.991,4)  [0.991,4)
#> 3      3 (0.991,4]  (0.991,4] [0.991,4)  [0.991,4)
#> 4      4 (0.991,4]  (0.991,4]     [4,7)      [4,7)
#> 5      5     (4,7]      (4,7]     [4,7)      [4,7)
#> 6      6     (4,7]      (4,7]     [4,7)      [4,7)
#> 7      7     (4,7]      (4,7]    [7,10) [7,10.009)
#> 8      8    (7,10] (7,10.009]    [7,10) [7,10.009)
#> 9      9    (7,10] (7,10.009]    [7,10) [7,10.009)
#> 10    10    (7,10] (7,10.009]    [7,10) [7,10.009)

The behavior of include.lowest depends on the value of right:

- When right = TRUE, include.lowest = TRUE makes the first interval left-inclusive.
- When right = FALSE, include.lowest = TRUE makes the last interval right-inclusive.

For right = TRUE and include.lowest = TRUE (note the two square brackets for the first interval):

df$int5 <- cut(
  df$value, 
  breaks = 3, 
  include.lowest = TRUE,
  right = TRUE,
  dig.lab = 5
)

df
#>    value      int1       int2      int3       int4       int5
#> 1      1 (0.991,4]  (0.991,4] [0.991,4)  [0.991,4)  [0.991,4]
#> 2      2 (0.991,4]  (0.991,4] [0.991,4)  [0.991,4)  [0.991,4]
#> 3      3 (0.991,4]  (0.991,4] [0.991,4)  [0.991,4)  [0.991,4]
#> 4      4 (0.991,4]  (0.991,4]     [4,7)      [4,7)  [0.991,4]
#> 5      5     (4,7]      (4,7]     [4,7)      [4,7)      (4,7]
#> 6      6     (4,7]      (4,7]     [4,7)      [4,7)      (4,7]
#> 7      7     (4,7]      (4,7]    [7,10) [7,10.009)      (4,7]
#> 8      8    (7,10] (7,10.009]    [7,10) [7,10.009) (7,10.009]
#> 9      9    (7,10] (7,10.009]    [7,10) [7,10.009) (7,10.009]
#> 10    10    (7,10] (7,10.009]    [7,10) [7,10.009) (7,10.009]

For right = FALSE and include.lowest = TRUE (note the two square brackets for the last interval):

df$int6 <- cut(
  df$value, 
  breaks = 3, 
  include.lowest = TRUE,
  right = FALSE,
  dig.lab = 5
)

df
#>    value      int1       int2      int3       int4       int5       int6
#> 1      1 (0.991,4]  (0.991,4] [0.991,4)  [0.991,4)  [0.991,4]  [0.991,4)
#> 2      2 (0.991,4]  (0.991,4] [0.991,4)  [0.991,4)  [0.991,4]  [0.991,4)
#> 3      3 (0.991,4]  (0.991,4] [0.991,4)  [0.991,4)  [0.991,4]  [0.991,4)
#> 4      4 (0.991,4]  (0.991,4]     [4,7)      [4,7)  [0.991,4]      [4,7)
#> 5      5     (4,7]      (4,7]     [4,7)      [4,7)      (4,7]      [4,7)
#> 6      6     (4,7]      (4,7]     [4,7)      [4,7)      (4,7]      [4,7)
#> 7      7     (4,7]      (4,7]    [7,10) [7,10.009)      (4,7] [7,10.009]
#> 8      8    (7,10] (7,10.009]    [7,10) [7,10.009) (7,10.009] [7,10.009]
#> 9      9    (7,10] (7,10.009]    [7,10) [7,10.009) (7,10.009] [7,10.009]
#> 10    10    (7,10] (7,10.009]    [7,10) [7,10.009) (7,10.009] [7,10.009]
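
One way to read the tables above is that, when breaks is a single number, include.lowest changes only the bracket on the outermost label, because the expanded limits already contain the minimum and maximum values. The sketch below checks that reading by comparing integer codes (labels = FALSE) on the same vector. The object names are illustrative.

```r
x <- 1:10

# Integer interval codes with and without include.lowest, right-closed
codes_default <- cut(x, breaks = 3, labels = FALSE)
codes_lowest  <- cut(x, breaks = 3, labels = FALSE, include.lowest = TRUE)

# Same comparison for left-closed intervals
codes_left        <- cut(x, breaks = 3, labels = FALSE, right = FALSE)
codes_left_lowest <- cut(x, breaks = 3, labels = FALSE, right = FALSE,
                         include.lowest = TRUE)

# The assignments do not change, only the labels do
identical(codes_default, codes_lowest)
#> [1] TRUE
identical(codes_left, codes_left_lowest)
#> [1] TRUE

# No value is left without an interval in any of the four cases
anyNA(c(codes_default, codes_lowest, codes_left, codes_left_lowest))
#> [1] FALSE
```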

Assign break values

Start fresh with a simple dataframe:

df <- data.frame(value = 1:10)

df
#>    value
#> 1      1
#> 2      2
#> 3      3
#> 4      4
#> 5      5
#> 6      6
#> 7      7
#> 8      8
#> 9      9
#> 10    10

Assign values to intervals defined by the breaks c(1, 4, 7, 10), with the default arguments (include.lowest = FALSE, right = TRUE, and dig.lab = 3):

df$int1 <- cut(
  df$value, 
  breaks = c(1, 4, 7, 10), 
  include.lowest = FALSE,
  right = TRUE,
  dig.lab = 3
)

df
#>    value   int1
#> 1      1   <NA>
#> 2      2  (1,4]
#> 3      3  (1,4]
#> 4      4  (1,4]
#> 5      5  (4,7]
#> 6      6  (4,7]
#> 7      7  (4,7]
#> 8      8 (7,10]
#> 9      9 (7,10]
#> 10    10 (7,10]

Notes:

The value 1 is not assigned to an interval (it is coded as NA) because the first interval, (1,4], is open on the left and include.lowest = FALSE.

Set include.lowest = TRUE so that the value 1 will be assigned to an interval.

df$int2 <- cut(
  df$value, 
  breaks = c(1, 4, 7, 10), 
  include.lowest = TRUE,
  right = TRUE,
  dig.lab = 3
)

df
#>    value   int1   int2
#> 1      1   <NA>  [1,4]
#> 2      2  (1,4]  [1,4]
#> 3      3  (1,4]  [1,4]
#> 4      4  (1,4]  [1,4]
#> 5      5  (4,7]  (4,7]
#> 6      6  (4,7]  (4,7]
#> 7      7  (4,7]  (4,7]
#> 8      8 (7,10] (7,10]
#> 9      9 (7,10] (7,10]
#> 10    10 (7,10] (7,10]

Increase dig.lab to check that the break values were not shortened in the labels:

df$int3 <- cut(
  df$value, 
  breaks = c(1, 4, 7, 10), 
  include.lowest = TRUE,
  right = TRUE,
  dig.lab = 5
)

df
#>    value   int1   int2   int3
#> 1      1   <NA>  [1,4]  [1,4]
#> 2      2  (1,4]  [1,4]  [1,4]
#> 3      3  (1,4]  [1,4]  [1,4]
#> 4      4  (1,4]  [1,4]  [1,4]
#> 5      5  (4,7]  (4,7]  (4,7]
#> 6      6  (4,7]  (4,7]  (4,7]
#> 7      7  (4,7]  (4,7]  (4,7]
#> 8      8 (7,10] (7,10] (7,10]
#> 9      9 (7,10] (7,10] (7,10]
#> 10    10 (7,10] (7,10] (7,10]
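
The help file notes that include.lowest applies to the highest break when right = FALSE. The examples in this section use right = TRUE only, so the following sketch applies the same breaks with left-closed intervals. The object names are illustrative.

```r
x <- 1:10
brks <- c(1, 4, 7, 10)

# Left-closed intervals: the last interval is [7,10), so 10 is not assigned
left_closed <- cut(x, breaks = brks, right = FALSE)
x[is.na(left_closed)]
#> [1] 10

# include.lowest = TRUE closes the last interval on the right: [7,10]
left_closed_all <- cut(x, breaks = brks, right = FALSE, include.lowest = TRUE)
levels(left_closed_all)
#> [1] "[1,4)"  "[4,7)"  "[7,10]"
anyNA(left_closed_all)
#> [1] FALSE
```

With explicit breaks that coincide with the data limits, include.lowest therefore decides whether an extreme value is kept or coded as NA, which is different from the single-number case where it affects only the label.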
