Title: | Optimal Binning of Data |
---|---|
Description: | Defines thresholds for breaking data into a number of discrete levels, minimizing the (mean) squared error within all bins. |
Authors: | Greg Kreider [aut, cre] |
Maintainer: | Greg Kreider <[email protected]> |
License: | BSD_3_clause + file LICENSE |
Version: | 1.4 |
Built: | 2024-12-19 06:47:44 UTC |
Source: | CRAN |
assign.optbin returns an object with the same shape as the input data, with values replaced by bin numbers.
assign.optbin(x, binspec, extend.upper=FALSE, by.value=FALSE)
x |
numeric data to assign |
binspec |
an optimal binning partition |
extend.upper |
if true then any value in x above the last bin is assigned to that bin, otherwise its bin is set to NA |
by.value |
if true then return average value for bin instead of bin numbers |
Replaces the values in a copy of the input data by the bin number it belongs to, or by the bin average value with by.value. The lowest bin always extends to -Inf. The extend.upper argument can open the last bin to +Inf if true. Use this function to get in-place bin assignments for the unsorted data that was passed to optbin.
An object of the same shape as the data.
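The assignment rule described above can be illustrated in base R without the package. This is only a sketch under stated assumptions: `assign_sketch`, the thresholds `thr`, and the averages `binavg` are hypothetical stand-ins for what a binning partition would carry, and the package's own implementation may differ.

```r
# Hypothetical helper: maps values to bins given inclusive upper thresholds.
assign_sketch <- function(x, thr, binavg, extend.upper = FALSE, by.value = FALSE) {
  nbin <- length(thr)
  # left.open = TRUE yields intervals (thr[b-1], thr[b]]; values at or below
  # thr[1] return 0, so adding 1 puts them in the lowest bin (open to -Inf).
  bin <- findInterval(x, thr, left.open = TRUE) + 1L
  # Values above the last threshold either join the last bin or become NA.
  bin[which(bin > nbin)] <- if (extend.upper) nbin else NA_integer_
  out <- if (by.value) binavg[bin] else bin
  dim(out) <- dim(x)   # keep the shape of the input
  out
}

thr    <- c(15, 25, 40)   # assumed thresholds, for illustration only
binavg <- c(10, 20, 30)   # assumed bin averages, for illustration only
x <- c(12, 15, 15.1, 26, 41, -3)

assign_sketch(x, thr, binavg)                       # 1 1 2 3 NA 1
assign_sketch(x, thr, binavg, extend.upper = TRUE)  # 1 1 2 3 3 1
assign_sketch(x, thr, binavg, by.value = TRUE)      # 10 10 20 30 NA 10
```

Note that 15 falls in bin 1 because thresholds are inclusive at the upper end, while 15.1 falls in bin 2.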
d <- c(rnorm(30, mean=10, sd=2), rnorm(40, mean=20, sd=2),
       rnorm(30, mean=30, sd=3))
binned <- optbin(d, 3)
assign.optbin(d, binned)
Draw a histogram of the data used to build the optimal binning and mark the extent of the bins.
## S3 method for class 'optbin'
hist(x, bincol=NULL, main=NULL, xlab=NULL, ...)
x |
an object of class optbin |
bincol |
vector of colors for showing extent of bins (default uses an internal set) |
main |
plot title, if not specified will modify the normal histogram title |
xlab |
x axis label, if not specified will modify the normal histogram label |
... |
other parameters passed through to hist |
The points behind the binning are passed unchanged to the histogram function. Bins are marked with colored bars under the x axis, and lines showing the average value in each are also drawn on top.
None
Determines break points in numeric data that minimize the difference between each point in a bin and the average over it.
optbin(x, numbin, metric=c('se', 'mse'), is.sorted=FALSE, max.cache=2^31, na.rm=FALSE)
x |
numeric data |
numbin |
number of bins to partition vector into |
metric |
minimize squared error (se) between values and average over bin, or mean squared error (mse) dividing squared error by bin length |
is.sorted |
set true if x is already in increasing order |
max.cache |
maximum memory in bytes to use to cache bin metrics; if the analysis would need more than this, a slower calculation without the cache is used |
na.rm |
drop NA values (which may occur when converting the data to a vector), otherwise cannot proceed with binning |
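To make the two metric choices concrete, here is a minimal base R sketch of the cost of a single bin as the argument table describes it. The helper names `bin_se` and `bin_mse` are hypothetical and are not part of the package.

```r
# Squared error: sum of squared differences from the bin average.
bin_se <- function(v) sum((v - mean(v))^2)

# Mean squared error: the squared error divided by the bin length.
bin_mse <- function(v) bin_se(v) / length(v)

v <- c(1, 2, 3, 6)   # one bin's values; the average is 3
bin_se(v)            # 4 + 1 + 0 + 9 = 14
bin_mse(v)           # 14 / 4 = 3.5
```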
Data is converted into a numeric vector and sorted if necessary. Internally, bins are determined by positions within the vector, with the breaks inclusive at the upper end. Thresholds follow the same convention, so bin b covers the range thr[b-1] < x <= thr[b], where thr[0] is -Inf. If more than one split achieves the best metric, the routine returns the first one found.
The library uses an exhaustive search over all possible breakpoints. It begins by finding the best splits with 2 bins for all pairs of start and endpoints, then adds a third bin, and so on. This rejects most alternatives at each level, leaving an O(nbin * nval * nval) algorithm.
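For intuition about the O(nbin * nval * nval) cost, the following base R sketch finds a minimum squared error partition with a textbook dynamic program over sorted data. It is an assumption-laden illustration, not the package's code: `optbin_sketch` is a hypothetical name, it only handles the 'se' metric, and its tie-breaking may not match the library's.

```r
optbin_sketch <- function(x, numbin) {
  x <- sort(x)
  n <- length(x)
  stopifnot(numbin >= 1, numbin <= n)
  # Prefix sums give the squared error of x[i..j] in constant time.
  s1 <- c(0, cumsum(x))
  s2 <- c(0, cumsum(x^2))
  se <- function(i, j) {
    (s2[j + 1] - s2[i]) - (s1[j + 1] - s1[i])^2 / (j - i + 1)
  }
  cost <- matrix(Inf, numbin, n)   # cost[k, j]: best SE for x[1..j] in k bins
  from <- matrix(0L, numbin, n)    # from[k, j]: start position of bin k
  cost[1, ] <- se(1, seq_len(n))
  from[1, ] <- 1L
  for (k in seq_len(numbin)[-1]) {
    for (j in k:n) {
      i <- k:j                                  # candidate starts of bin k
      total <- cost[k - 1, i - 1] + se(i, j)
      best <- which.min(total)                  # first minimum on ties
      cost[k, j] <- total[best]
      from[k, j] <- i[best]
    }
  }
  # Walk back from the last bin to recover inclusive end positions.
  breaks <- integer(numbin)
  j <- n
  for (k in numbin:1) {
    breaks[k] <- j
    j <- from[k, j] - 1L
  }
  list(breaks = breaks, thr = x[breaks], minse = cost[numbin, n])
}

res <- optbin_sketch(c(1, 2, 3, 10, 11, 12, 20, 21, 22), 3)
res$breaks   # 3 6 9
res$thr      # 3 12 22
res$minse    # 6
```

The two nested loops over `k` and `j`, plus the vectorized scan over `i`, account for the nbin * nval * nval work.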
An object of class 'optbin' with components:
x |
the original data, sorted |
numbins |
the number of bins created |
call |
argument values when function called |
metric |
cost function used to select best partition |
minse |
value of SE/MSE metric for all bins |
thr |
upper threshold of bin range, inclusive |
binavg |
average of values in each bin |
binse |
value of SE/MSE metric for each bin |
breaks |
positions of endpoint (inclusive) of each bin in x |
assign.optbin, print.optbin, summary.optbin, plot.optbin
## Well separated groups
set.seed(17)
d1 <- c(rnorm(75, mean=1, sd=0.2), rnorm(75, mean=3, sd=0.2),
        rnorm(84, mean=6, sd=0.2), rnorm(75, mean=9, sd=0.2),
        rnorm(75, mean=11, sd=0.2), rnorm(150, mean=15, sd=0.2))
## Divides into groups 1+2+3, 4+5, 6, metric is 1176.3
binned3 <- optbin(d1, 3)
summary(binned3)
plot(binned3)
## Divides into groups 1, 2, 3, 4+5, and 6, metric is 169.9
binned5 <- optbin(d1, 5)
plot(binned5)
## Divides into separate groups, metric is 24.4
binned6 <- optbin(d1, 6)
summary(binned6)
plot(binned6)
## Each rnorm group divides roughly in half.
binned12 <- optbin(d1, 12)
plot(binned12)
## A grouping that overlaps, bins near but not at minima between peaks
d2 <- c(rnorm(300, mean=1, sd=0.25), rnorm(400, mean=2, sd=0.25),
        rnorm(300, mean=3, sd=0.25))
binned3b <- optbin(d2, 3)
hist(binned3b, breaks=50, col='yellow')
plot method for class optbin.
## S3 method for class 'optbin'
plot(x, col=NULL, main="Binned Observations", ...)
x |
an object of class optbin |
col |
vector of colors to apply to bins (default uses an internal set) |
main |
title of graph |
... |
other parameters passed through to the underlying plotting routines (do not set xaxt or ann) |
The plot will contain the sorted points of the data that generated the bins. Points are color-coded per bin, and the plot contains the average value over the bin as a line. x axis labels are the upper thresholds for each bin.
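A rough base R approximation of such a plot is sketched below. The data, the break positions, and the palette are made up for illustration, and the package's actual drawing code is not shown in this document. Because the x axis is drawn by hand here, a caller-supplied xaxt would interfere, which may be why the method asks that it not be set.

```r
# Sorted data and hypothetical inclusive end positions of three bins.
x <- c(1, 2, 3, 10, 11, 12, 20, 21, 22)
breaks <- c(3L, 6L, 9L)
start <- c(1L, head(breaks, -1) + 1L)
bin <- rep(seq_along(breaks), times = breaks - start + 1L)
binavg <- as.vector(tapply(x, bin, mean))
cols <- c("firebrick", "steelblue", "darkgreen")   # arbitrary palette

# Points in sorted order, colored by bin; the default x axis is suppressed.
plot(seq_along(x), x, col = cols[bin], pch = 19, xaxt = "n",
     xlab = "upper threshold", ylab = "value", main = "Binned Observations")
# Label the last position of each bin with its upper threshold.
axis(1, at = breaks, labels = x[breaks])
# Horizontal line at the bin average, spanning the bin.
segments(start, binavg, breaks, binavg, col = cols)
```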
None
print method for class optbin.
## S3 method for class 'optbin'
print(x, ...)
x |
an object of class optbin |
... |
generic arguments (ignored) |
Shows the upper bound of each bin, i.e., bin b covers threshold[b-1] < x <= threshold[b], where threshold[0] is -Inf. Also prints the total (mean) squared error summed over all bins.
The argument x unchanged, an object of class 'optbin' with components:
x |
the original data, sorted |
numbins |
the number of bins created |
call |
argument values when function called |
metric |
cost function used to select best partition |
minse |
value of SE/MSE metric for all bins |
thr |
upper threshold of bin range, inclusive |
binavg |
average of values in each bin |
binse |
value of SE/MSE metric for each bin |
breaks |
positions of endpoint (inclusive) of each bin in x |
summary method for class optbin.
## S3 method for class 'optbin'
summary(object, show.range=FALSE, ...)
object |
an object of class optbin |
show.range |
if true then print the bin's range of points (endpoint inclusive) in the sorted data |
... |
generic arguments (ignored) |
Prints a table with the upper threshold (inclusive), the average of the data within the bin, and the (mean) squared error sum. show.range also adds a column with the start and end indices of the sorted data belonging to the bin, although this applies to the sorted list and is less useful in general.
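The columns of such a table can be reproduced by hand from sorted data and break positions. The sketch below uses made-up data and hypothetical end positions, and the column names and layout are assumptions; the method's real output format is not shown in this document.

```r
x <- c(1, 2, 3, 10, 11, 12, 20, 21, 22)   # sorted data
breaks <- c(3L, 6L, 9L)                   # hypothetical inclusive end positions
start <- c(1L, head(breaks, -1) + 1L)     # first position of each bin
bin <- rep(seq_along(breaks), times = breaks - start + 1L)

tab <- data.frame(
  thr   = x[breaks],                                   # upper threshold, inclusive
  avg   = as.vector(tapply(x, bin, mean)),             # average within the bin
  se    = as.vector(tapply(x, bin, function(v) sum((v - mean(v))^2))),
  range = paste(start, breaks, sep = "-")              # positions in sorted data
)
print(tab)
```

The `range` column indexes the sorted vector, which is why it says little about where the values sat in the original, unsorted input.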
Only called for side-effects (printing). There is no return value.