Exploring Base R's cut() function
cut_function.RmdThis vignette explores the behavior of base R’s cut()
function, which is used to calculate the intervals in
adcp_count_obs(). These intervals are used in
adcp_plot_speed_hist() and can be passed to
adcp_plot_current_rose() using the breaks
argument.
cut() function
First, let’s look at the help file for cut:
| cut | R Documentation | 
Convert Numeric to Factor
Description
cut divides the range of x into intervals and
codes the values in x according to which interval they
fall. The leftmost interval corresponds to level one, the next leftmost
to level two and so on.
Usage
cut(x, ...)
## Default S3 method:
cut(x, breaks, labels = NULL,
    include.lowest = FALSE, right = TRUE, dig.lab = 3,
    ordered_result = FALSE, ...)Arguments
| x | a numeric vector which is to be converted to a factor by cutting. | 
| breaks | 
either a numeric vector of two or more unique cut points or a single
number (greater than or equal to 2) giving the number of intervals into
which  | 
| labels | 
labels for the levels of the resulting category. By default, labels are
constructed using  | 
| include.lowest | 
logical, indicating if an ‘x[i]’ equal to the lowest (or highest, for
 | 
| right | logical, indicating if the intervals should be closed on the right (and open on the left) or vice versa. | 
| dig.lab | integer which is used when labels are not given. It determines the number of digits used in formatting the break numbers. | 
| ordered_result | logical: should the result be an ordered factor? | 
| … | further arguments passed to or from other methods. | 
Details
When breaks is specified as a single number, the range of
the data is divided into breaks pieces of equal length, and
then the outer limits are moved away by 0.1% of the range to ensure that
the extreme values both fall within the break intervals. (If
x is a constant vector, equal-length intervals are created,
one of which includes the single value.)
If a labels parameter is specified, its values are used to
name the factor levels. If none is specified, the factor level labels
are constructed as “(b1, b2]”, “(b2, b3]” etc.
for right = TRUE and as “[b1, b2)”, … if
right = FALSE. In this case, dig.lab indicates
the minimum number of digits should be used in formatting the numbers
b1, b2, …. A larger value (up to 12) will be
used if needed to distinguish between any pair of endpoints: if this
fails labels such as “Range3” will be used. Formatting is
done by formatC.
The default method will sort a numeric vector of breaks,
but other methods are not required to and labels will
correspond to the intervals after sorting.
As from R 3.2.0,
getOption(“OutDec”) is consulted when labels are
constructed for labels = NULL.
Value
A factor is returned, unless labels = FALSE
which results in an integer vector of level codes.
Values which fall outside the range of breaks are coded as
NA, as are NaN and NA values.
Note
Instead of table(cut(x, br)), hist(x, br, plot =
FALSE) is more efficient and less memory hungry. Instead of
cut(*, labels = FALSE), findInterval() is more
efficient.
References
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
See Also
split for splitting a variable according to a group factor;
factor, tabulate, table,
findInterval.
quantile for ways of choosing breaks of roughly equal
content (rather than length).
.bincode for a bare-bones version.
Examples
Z <- stats::rnorm(10000)
table(cut(Z, breaks = -6:6))
sum(table(cut(Z, breaks = -6:6, labels = FALSE)))
sum(graphics::hist(Z, breaks = -6:6, plot = FALSE)$counts)
cut(rep(1,5), 4) #-- dummy
tx0 <- c(9, 4, 6, 5, 3, 10, 5, 3, 5)
x <- rep(0:8, tx0)
stopifnot(table(x) == tx0)
table( cut(x, breaks = 8))
table( cut(x, breaks = 3*(-2:5)))
table( cut(x, breaks = 3*(-2:5), right = FALSE))
##--- some values OUTSIDE the breaks :
table(cx  <- cut(x, breaks = 2*(0:4)))
table(cxl <- cut(x, breaks = 2*(0:4), right = FALSE))
which(is.na(cx));  x[is.na(cx)]  #-- the first 9  values  0
which(is.na(cxl)); x[is.na(cxl)] #-- the last  5  values  8
## Label construction:
y <- stats::rnorm(100)
table(cut(y, breaks = pi/3*(-3:3)))
table(cut(y, breaks = pi/3*(-3:3), dig.lab = 4))
table(cut(y, breaks =  1*(-3:3), dig.lab = 4))
# extra digits don't "harm" here
table(cut(y, breaks =  1*(-3:3), right = FALSE))
#- the same, since no exact INT!
## sometimes the default dig.lab is not enough to be avoid confusion:
aaa <- c(1,2,3,4,5,2,3,4,5,6,7)
cut(aaa, 3)
cut(aaa, 3, dig.lab = 4, ordered_result = TRUE)
## one way to extract the breakpoints
labs <- levels(cut(aaa, 3))
cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs) ),
      upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs) ))