Package 'optbin' reference manual

Package 'optbin'

Title:	Optimal Binning of Data
Description:	Defines thresholds for breaking data into a number of discrete levels, minimizing the (mean) squared error within all bins.
Authors:	Greg Kreider [aut, cre]
Maintainer:	Greg Kreider <[email protected]>
License:	BSD_3_clause + file LICENSE
Version:	1.4
Built:	2025-02-17 06:50:16 UTC
Source:	CRAN

Bin Assignment

Description

assign.optbin returns an object with the same shape as the input data and values replaced by bin numbers.

Usage

assign.optbin(x, binspec, extend.upper=FALSE, by.value=FALSE)

Arguments

`x`	numeric data to assign
`binspec`	an optimal binning partition
`extend.upper`	if true then any value in x above the last bin is assigned to that bin, otherwise its bin is set to NA
`by.value`	if true then return average value for bin instead of bin numbers

Details

Replaces each value in a copy of the input data with the number of the bin it belongs to or, if by.value is true, with the average value of that bin. The lowest bin always extends to -Inf. If extend.upper is true, the last bin is opened to +Inf. Use this function to get in-place bin assignments for the unsorted data that was passed to optbin.
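
The assignment rule described above can be sketched in base R. This is an illustration only, not the package's code: `assign_by_threshold` and its `thr` argument (a plain vector of upper thresholds) are hypothetical names, and the sketch assumes the thresholds are sorted in increasing order.

```r
# Hypothetical helper, not part of optbin: applies the documented rule
# thr[b-1] < x <= thr[b] with the lowest bin open to -Inf.
assign_by_threshold <- function(x, thr, extend.upper = FALSE) {
  # With left.open = TRUE, findInterval() returns how many thresholds lie
  # strictly below each value, so adding 1 gives the bin number.
  bin <- findInterval(x, thr, left.open = TRUE) + 1L
  # Values above the last threshold go to the last bin or become NA.
  above <- which(bin > length(thr))
  bin[above] <- if (extend.upper) length(thr) else NA_integer_
  dim(bin) <- dim(x)  # keep the shape of matrix or array input
  bin
}

thr <- c(15, 25, 40)                 # assumed upper thresholds of 3 bins
x <- c(3, 15, 15.1, 25, 39, 41)
low_na <- assign_by_threshold(x, thr)                        # 41 -> NA
low_ext <- assign_by_threshold(x, thr, extend.upper = TRUE)  # 41 -> bin 3
```

The value 15 lands in bin 1 because each threshold is inclusive at the upper end. Given a vector of bin averages, indexing it by the returned bin numbers would give the by.value form of the result.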

Value

An object of the same shape as the data.

Examples

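## Three normal groups centered at 10, 20 and 30
d <- c(rnorm(30, mean=10, sd=2), rnorm(40, mean=20, sd=2),
       rnorm(30, mean=30, sd=3))
binned <- optbin(d, 3)
## Bin number for each element of d, in the original order
assign.optbin(d, binned)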

Histogram with Optimal Bins Marked

Description

Draw a histogram of the data used to build the optimal binning and mark the extent of the bins.

Usage

## S3 method for class 'optbin'
hist(x, bincol=NULL, main=NULL, xlab=NULL, ...)

Arguments

`x`	an object of class `optbin`.
`bincol`	vector of colors for showing extent of bins (default uses an internal set)
`main`	plot title; if not specified, a modified version of the normal histogram title is used
`xlab`	x axis label; if not specified, a modified version of the normal histogram label is used
`...`	other parameters passed through to hist

Details

The points behind the binning are passed unchanged to the histogram function. Bins are marked with colored bars under the x axis, and lines showing the average value in each are also drawn on top.

Value

None

Optimal Binning of Continuous Variables

Description

Determines break points in numeric data that minimize the squared difference between each point in a bin and the average over that bin.

Usage

optbin(x, numbin, metric=c('se', 'mse'), is.sorted=FALSE, max.cache=2^31, na.rm=FALSE)

Arguments

`x`	numeric data
`numbin`	number of bins to partition vector into
`metric`	minimize squared error (se) between values and average over bin, or mean squared error (mse) dividing squared error by bin length
`is.sorted`	set true if x is already in increasing order
`max.cache`	maximum memory in bytes to use to cache bin metrics; if the analysis would need more than this, a slower calculation without the cache is used
`na.rm`	if true, drop NA values (which may occur when converting the data to a vector); otherwise binning cannot proceed when NA values are present
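
The two `metric` choices differ only in a per-bin normalization. A minimal sketch for a single bin, using hypothetical helper names `bin_se` and `bin_mse` that are not part of the package:

```r
# Hypothetical helpers illustrating the two cost functions for one bin.
# Squared error: sum of squared deviations from the bin average.
bin_se <- function(v) sum((v - mean(v))^2)
# Mean squared error: squared error divided by the bin length.
bin_mse <- function(v) bin_se(v) / length(v)

v <- c(2, 4, 6)       # bin average is 4
se_v <- bin_se(v)     # (-2)^2 + 0^2 + 2^2 = 8
mse_v <- bin_mse(v)   # 8 / 3
```

Because mse divides by the bin length, a long bin is penalized less for the same total spread than it is under se.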

Details

Data is converted into a numeric vector and sorted if necessary. Internally, bins are determined by positions within the vector, with the breaks inclusive at the upper end. The bin thresholds follow the same convention, so bin b covers the range thr[b-1] < x <= thr[b], where thr[0] is -Inf. If more than one split achieves the best metric, the routine returns the first one found.

The package uses an exhaustive search over all possible breakpoints. It begins by finding the best splits with 2 bins for all pairs of start and end points, then adds a third bin, and so on. This rejects most alternatives at each level, leaving an O(nbin * nval * nval) algorithm for nbin bins and nval data values.
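
One way to organize a search of this kind is dynamic programming over the sorted data. The sketch below is an assumption about how such a search can be structured, not the package's implementation: `optimal_breaks` and its internals are hypothetical names, it handles only the squared error metric, and it keeps the whole cost table in memory rather than honoring a cache limit. The names of the returned elements mirror the documented components only for readability.

```r
# Hypothetical sketch, not optbin's source: dynamic program over sorted data.
optimal_breaks <- function(x, numbin) {
  x <- sort(x)
  n <- length(x)
  stopifnot(numbin >= 1, numbin <= n)
  # Prefix sums give the squared error of any run x[i..j] in O(1).
  s1 <- c(0, cumsum(x))
  s2 <- c(0, cumsum(x^2))
  se <- function(i, j) {
    len <- j - i + 1
    sm <- s1[j + 1] - s1[i]
    (s2[j + 1] - s2[i]) - sm^2 / len
  }
  # cost[b, j]: best total squared error for x[1..j] split into b bins.
  cost <- matrix(Inf, numbin, n)
  last <- matrix(NA_integer_, numbin, n)  # start index of the final bin
  cost[1, ] <- se(1, seq_len(n))
  last[1, ] <- 1L
  for (b in seq_len(numbin)[-1]) {
    for (j in b:n) {
      starts <- b:j                       # candidate starts of bin b
      total <- cost[b - 1, starts - 1] + se(starts, j)
      k <- which.min(total)               # first minimum wins on ties
      cost[b, j] <- total[k]
      last[b, j] <- starts[k]
    }
  }
  # Walk back to recover the inclusive end position of each bin.
  ends <- integer(numbin)
  j <- n
  for (b in numbin:1) {
    ends[b] <- j
    j <- last[b, j] - 1L
  }
  list(breaks = ends, thr = x[ends], minse = cost[numbin, n])
}

res <- optimal_breaks(c(1, 1.1, 0.9, 5, 5.2, 4.8, 10, 10.1, 9.9), 3)
```

Each level reuses the best costs from the level before it, so the work is one pass over all (start, end) pairs per added bin, which matches the stated order of growth.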

Value

An object of class 'optbin' with components:

`x`	the original data, sorted
`numbins`	the number of bins created
`call`	argument values when function called
`metric`	cost function used to select best partition
`minse`	value of SE/MSE metric for all bins
`thr`	upper threshold of bin range, inclusive
`binavg`	average of values in each bin
`binse`	value of SE/MSE metric for each bin
`breaks`	positions of endpoint (inclusive) of each bin in x

Examples

## Well separated groups
set.seed(17)
d1 <- c(rnorm(75, mean=1, sd=0.2), rnorm(75, mean=3, sd=0.2),
        rnorm(84, mean=6, sd=0.2), rnorm(75, mean=9, sd=0.2),
        rnorm(75, mean=11, sd=0.2), rnorm(150, mean=15, sd=0.2))
## Divides into groups 1+2+3, 4+5, 6, metric is 1176.3
binned3 <- optbin(d1, 3)
summary(binned3)
plot(binned3)
## Divides into groups 1, 2, 3, 4+5, and 6, metric is 169.9
binned5 <- optbin(d1, 5)
plot(binned5)
## Divides into separate groups, metric is 24.4
binned6 <- optbin(d1, 6)
summary(binned6)
plot(binned6)
## Each rnorm group divides roughly in half.
binned12 <- optbin(d1, 12)
plot(binned12)
## A grouping that overlaps, bins near but not at minima between peaks
d2 <- c(rnorm(300, mean=1, sd=0.25), rnorm(400, mean=2, sd=0.25),
        rnorm(300, mean=3, sd=0.25))
binned3b <- optbin(d2, 3)
hist(binned3b, breaks=50, col='yellow')

Plotting Optimal Bins

Description

plot method for class optbin.

Usage

## S3 method for class 'optbin'
plot(x, col=NULL, main="Binned Observations", ...)

Arguments

`x`	an object of class `optbin`.
`col`	vector of colors to apply to bins (default uses an internal set)
`main`	title of graph
`...`	other parameters passed through to the underlying plotting routines (do not set xaxt or ann)

Details

The plot contains the sorted points of the data that generated the bins. Points are color-coded per bin, and the average value over each bin is drawn as a line. The x axis labels are the upper thresholds for each bin.
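
A plot of this kind can be approximated with base graphics. The sketch below is not the package's plot method: `plot_bins_sketch` is a hypothetical name, its `breaks` argument is assumed to hold the inclusive end position of each bin in the sorted data (with the last entry equal to the number of points), and the default colors are an arbitrary choice.

```r
# Hypothetical sketch in base graphics, not optbin's plot method.
# `breaks` holds the inclusive end position of each bin in the sorted data.
plot_bins_sketch <- function(x, breaks, col = seq_along(breaks) + 1L) {
  x <- sort(x)
  starts <- c(1L, breaks[-length(breaks)] + 1L)
  bin <- rep(seq_along(breaks), times = breaks - starts + 1L)
  avg <- as.numeric(tapply(x, bin, mean))
  # Sorted points, colored per bin, with the default x axis suppressed.
  plot(seq_along(x), x, col = col[bin], xaxt = "n",
       xlab = "upper threshold", ylab = "value",
       main = "Binned Observations")
  # Label each bin's end position with its upper threshold.
  axis(1, at = breaks, labels = signif(x[breaks], 3))
  # Horizontal line at the bin average across the bin's extent.
  segments(starts, avg, breaks, avg)
  invisible(avg)
}
```

For example, `plot_bins_sketch(c(1, 2, 3, 10, 11, 12), breaks = c(3L, 6L))` draws two bins with averages 2 and 11. The sketch sets xaxt itself to draw threshold labels, which is a plausible reason a method like this would ask callers not to pass xaxt.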

Value

None

Printing Optimal Bins

Description

print method for class optbin.

Usage

## S3 method for class 'optbin'
print(x, ...)

Arguments

`x`	an object of class `optbin`.
`...`	generic arguments (ignored)

Details

Shows the upper bound of each bin, i.e., bin b covers threshold[b-1] < x <= threshold[b], where threshold[0] is -Inf. Also prints the total (mean) squared error summed over all bins.

Value

The argument x unchanged, an object of class 'optbin' with components:

`x`	the original data, sorted
`numbins`	the number of bins created
`call`	argument values when function called
`metric`	cost function used to select best partition
`minse`	value of SE/MSE metric for all bins
`thr`	upper threshold of bin range, inclusive
`binavg`	average of values in each bin
`binse`	value of SE/MSE metric for each bin
`breaks`	positions of endpoint (inclusive) of each bin in x

Summarizing Optimal Bins

Description

summary method for class optbin.

Usage

## S3 method for class 'optbin'
summary(object, show.range=FALSE, ...)

Arguments

`object`	an object of class `optbin`
`show.range`	if true then print the bin's range of points (endpoint inclusive) in the sorted data
`...`	generic arguments (ignored)

Details

Prints a table with the upper threshold (inclusive), the average of the data within the bin, and the (mean) squared error sum. With show.range, a column is added with the start and end indices of the data belonging to the bin; these indices refer to the sorted data rather than the original order, so they are less useful in general.

Value

Only called for side-effects (printing). There is no return value.
