Title: | The Essential Histogram |
---|---|
Description: | Provide an optimal histogram, in the sense of probability density estimation and features detection, by means of multiscale variational inference. In other words, the resulting histogram servers as an optimal density estimator, and meanwhile recovers the features, such as increases or modes, with both false positive and false negative controls. Moreover, it provides a parsimonious representation in terms of the number of blocks, which simplifies data interpretation. The only assumption for the method is that data points are independent and identically distributed, so it applies to fairly general situations, including continuous distributions, discrete distributions, and mixtures of both. For details see Li, Munk, Sieling and Walther (2016) <arXiv:1612.07216>. |
Authors: | Housen Li [aut, cre], Hannes Sieling [aut] |
Maintainer: | Housen Li <[email protected]> |
License: | GPL-3 |
Version: | 1.2.2 |
Built: | 2024-11-26 06:33:20 UTC |
Source: | CRAN |
Provide an optimal histogram, in the sense of probability density estimation and features detection, by means of multiscale variational inference. In other words, the resulting histogram servers as an optimal density estimator, and meanwhile recovers the features, such as increases or modes, with both false positive and false negative controls. Moreover, it provides a parsimonious representation in terms of the number of blocks, which simplifies data interpretation. The only assumption for the method is that data points are independent and identically distributed, so it applies to fairly general situations, including continuous distributions, discrete distributions, and mixtures of both. For details see Li, Munk, Sieling and Walther (2016) <arXiv:1612.07216>.
Package: | essHist |
Type: | Package |
Version: | 1.2.2 |
Date: | 2019-05-10 |
License: | GPL-3 |
Index:
essHistogram Compute the essential histogram checkHistogram Check any estimator by the multiscale confidence set genIntv Generate the system of intervals msQuantile Simulate the quantile of multiscale statistics dmixnorm Compute density function of Gaussian mixtures pmixnorm Compute distribution function of Gaussian mixtures rmixnorm Generate random number of Gaussian mixtures paramExample Output detailed parameters for some famous examples
Housen Li [aut, cre], Hannes Sieling [aut]
Maintainer: Housen Li <[email protected]>
Li, H., Munk, A., Sieling, H., and Walther, G. (2016). The essential histogram. arXiv:1612.07216
# Simulate data set.seed(123) type = 'skewed_unimodal' n = 500 y = rmixnorm(n, type = type) # Compute the essential histogram eh = essHistogram(y, plot = FALSE) # Plot results # compute oracle density x = sort(y) od = dmixnorm(x, type = type) # compare with orcle density plot(x, od, type = "l", xlab = NA, ylab = NA, col = "red", main = type) lines(eh) legend("topleft", c("Oracle density", "Essential histogram"), lty = c(1,1), col = c("red", "black")) ##### Evaluate other method set.seed(123) # Data: mixture of Gaussians "harp" n = 500 y = rmixnorm(n, type = 'harp') # Oracle density x = sort(y) ho = dmixnorm(x, type = 'harp') # R default histogram h = hist(y, plot = FALSE) # Check R default histogram to local multiscale constraints b = checkHistogram(h, y, ylim=c(-0.1,0.16)) lines(x, ho, col = "red") rug(x, col = 'blue') legend("topright", c("R-Histogram", "Truth"), col = c("black", "red"), lty = c(1,1))
# Simulate data set.seed(123) type = 'skewed_unimodal' n = 500 y = rmixnorm(n, type = type) # Compute the essential histogram eh = essHistogram(y, plot = FALSE) # Plot results # compute oracle density x = sort(y) od = dmixnorm(x, type = type) # compare with orcle density plot(x, od, type = "l", xlab = NA, ylab = NA, col = "red", main = type) lines(eh) legend("topleft", c("Oracle density", "Essential histogram"), lty = c(1,1), col = c("red", "black")) ##### Evaluate other method set.seed(123) # Data: mixture of Gaussians "harp" n = 500 y = rmixnorm(n, type = 'harp') # Oracle density x = sort(y) ho = dmixnorm(x, type = 'harp') # R default histogram h = hist(y, plot = FALSE) # Check R default histogram to local multiscale constraints b = checkHistogram(h, y, ylim=c(-0.1,0.16)) lines(x, ho, col = "red") rug(x, col = 'blue') legend("topright", c("R-Histogram", "Truth"), col = c("black", "red"), lty = c(1,1))
Provide the locations, i.e., intervals, where features are potentially missing (a.k.a. false negatives), and the break-points that are potentially redundant (a.k.a. false positives), by means of the multiscale confidence set.
checkHistogram(h, x, alpha = 0.1, q = NULL, intv = NULL, mode = ifelse(anyDuplicated(x),"Gen","Con"), plot = TRUE, xlim = NULL, ylim = NULL, xlab = "", ylab = "", yaxt = "n", ...)
checkHistogram(h, x, alpha = 0.1, q = NULL, intv = NULL, mode = ifelse(anyDuplicated(x),"Gen","Con"), plot = TRUE, xlim = NULL, ylim = NULL, xlab = "", ylab = "", yaxt = "n", ...)
h |
a numeric vector specifying values of a histogram at sample points; or a |
x |
a numeric vector containing the data. |
alpha |
significance level, default as 0.1, see also |
q |
threshold of the multiscale constraint; by default, |
intv |
a data frame provides the system of intervals on which the multiscale statistic is defined. The data frame constains the following two columns
By default, it is set to the sparse interval system proposed by Rivera and Walther (2013), see also Li et al. (2016). |
mode |
By default, |
plot |
logical. If |
xlim , ylim
|
numeric vectors of length 2 (default |
xlab |
a title for the |
ylab |
a title for the |
yaxt |
A character which specifies the |
... |
further arguments and |
This function presents a visualization: the upper part plots the given histogram; in the middle part short vertical lines mark all removable break-points; in the lower part intervals of violation are shown, and a graybar below the middle horizontal line (blue) sumarizes such violations with the darkness scaling with the number of violation intervals covering a location. See Examples below and Li et al. (2016) for further details.
A list consists of one data frame, and one numeric vector:
violatedIntervals |
A data frame provides the intervals where the corresponding local side constraint is violated; an empty data frame if there is no violation. It constains the following four columns
An empty |
removableBreakpoints |
A numeric vector contains all removable breakpoints, with zero length if there is no removable breakpoint. |
The argument intv
is internally adjusted ensure it contains no empty intervals in case of tied observations. Only the intervals on which the input histogram is constant will be checked! All the printing messages can be disabled by calling suppressMessages
.
Li, H., Munk, A., Sieling, H., and Walther, G. (2016). The essential histogram. arXiv:1612.07216.
essHistogram
,
genIntv
,
msQuantile
set.seed(123) # Data: mixture of Gaussians "harp" n = 500 y = rmixnorm(n, type = 'harp') # Oracle density x = sort(y) ho = dmixnorm(x, type = 'harp') # R default histogram h = hist(y, plot = FALSE) # Check R default histogram to local multiscale constriants b = checkHistogram(h, y, ylim=c(-0.1,0.16)) lines(x, ho, col = "red") rug(x, col = 'blue') legend("topright", c("R-Histogram", "Truth"), col = c("black", "red"), lty = c(1,1))
set.seed(123) # Data: mixture of Gaussians "harp" n = 500 y = rmixnorm(n, type = 'harp') # Oracle density x = sort(y) ho = dmixnorm(x, type = 'harp') # R default histogram h = hist(y, plot = FALSE) # Check R default histogram to local multiscale constriants b = checkHistogram(h, y, ylim=c(-0.1,0.16)) lines(x, ho, col = "red") rug(x, col = 'blue') legend("topright", c("R-Histogram", "Truth"), col = c("black", "red"), lty = c(1,1))
Compute the essential histogram via (pruned) dynamic programming.
essHistogram(x, alpha = 0.5, q = NULL, intv = NULL, plot = TRUE, mode = ifelse(anyDuplicated(x),"Gen","Con"), xname = deparse(substitute(x)), ...)
essHistogram(x, alpha = 0.5, q = NULL, intv = NULL, plot = TRUE, mode = ifelse(anyDuplicated(x),"Gen","Con"), xname = deparse(substitute(x)), ...)
x |
a numeric vector containing the data. |
alpha |
significance level; default as 0.5. One should set |
q |
threshold value; by default, |
intv |
a data frame provides the system of intervals on which the multiscale statistic is defined. The data frame constains the following two columns
By default, it is set to the sparse interval system proposed by Rivera and Walther (2013), see also Li et al. (2016). |
plot |
logical. If |
mode |
By default, |
xname |
a character string with the actual |
... |
further arguments and |
The essential histogram is defined as the histogram with least blocks within the multiscale constraint. The one with highest likelihood is picked if there are more than one solutions. The essential histogram involves only one parameter q
, the threshold of the multiscale constraint. Such a parameter can be chosen by means of the significance level alpha
, which leads to nature statistical significance statements for the multiscale constraint. The computational complexity is often linear in terms of sample size, although the worst complexity bound is quadratic up to a log-factor in case of the sparse interval system. See Li et al. (2016) for further details.
An object of class "histogram
", which is of the same class as returned by function hist
.
The argument intv
is internally adjusted to ensure it contains no empty intervals, especially in case of tied observations. The first block of the returned histogram is a closed interval, and the rest blocks are left open right closed intervals. All the printing messages can be disabled by calling suppressMessages
.
Li, H., Munk, A., Sieling, H., and Walther, G. (2016). The essential histogram. arXiv:1612.07216.
Rivera, C., & Walther, G. (2013). Optimal detection of a jump in the intensity of a Poisson process or in a density with likelihood ratio statistics. Scand. J. Stat. 40, 752–769.
checkHistogram
,
genIntv
,
hist
,
msQuantile
# Simulate data set.seed(123) type = 'skewed_unimodal' n = 500 y = rmixnorm(n, type = type) # Compute the essential histogram eh = essHistogram(y, plot = FALSE) # Plot results # compute oracle density x = sort(y) od = dmixnorm(x, type = type) # compare with orcle density plot(x, od, type = "l", xlab = NA, ylab = NA, col = "red", main = type) lines(eh) legend("topleft", c("Oracle density", "Essential histogram"), lty = c(1,1), col = c("red", "black"))
# Simulate data set.seed(123) type = 'skewed_unimodal' n = 500 y = rmixnorm(n, type = type) # Compute the essential histogram eh = essHistogram(y, plot = FALSE) # Plot results # compute oracle density x = sort(y) od = dmixnorm(x, type = type) # compare with orcle density plot(x, od, type = "l", xlab = NA, ylab = NA, col = "red", main = type) lines(eh) legend("topleft", c("Oracle density", "Essential histogram"), lty = c(1,1), col = c("red", "black"))
Generate the system of intervals on which the multiscale statistic is defined, see Li et al. (2016).
genIntv(n, type = c("Sparse", "Full"))
genIntv(n, type = c("Sparse", "Full"))
n |
number of observations. |
type |
type of interval system. |
A data frame provides the system of intervals, and consists two columns
left |
left index of an interval |
right |
right index of an interval |
Li, H., Munk, A., Sieling, H., and Walther, G. (2016). The essential histogram. arXiv:1612.07216.
Rivera, C., & Walther, G. (2013). Optimal detection of a jump in the intensity of a Poisson process or in a density with likelihood ratio statistics. Scand. J. Stat. 40, 752–769.
checkHistogram
,
essHistogram
,
msQuantile
n = 5 intv = genIntv(n,"Full") print(intv)
n = 5 intv = genIntv(n,"Full") print(intv)
Density, distribution function and random generation for the mixture of normals with each component specified by mean
and sd
, and mixture weights by prob
. paramExample
gives detailed parameters for some examples specified by type
.
dmixnorm(x, mean = c(0), sd = rep(1,length(mean)), prob = rep(1,length(mean)), type = NULL, ...) pmixnorm(x, mean = c(0), sd = rep(1,length(mean)), prob = rep(1,length(mean)), type = NULL, ...) rmixnorm(n, mean = c(0), sd = rep(1,length(mean)), prob = rep(1,length(mean)), type = NULL) paramExample(type)
dmixnorm(x, mean = c(0), sd = rep(1,length(mean)), prob = rep(1,length(mean)), type = NULL, ...) pmixnorm(x, mean = c(0), sd = rep(1,length(mean)), prob = rep(1,length(mean)), type = NULL, ...) rmixnorm(n, mean = c(0), sd = rep(1,length(mean)), prob = rep(1,length(mean)), type = NULL) paramExample(type)
x |
vector of locations. |
n |
integer; number of observations. |
mean |
vector of means for each mixture component. |
sd |
vector of standard deviations for each mixture component. Default is of unit variance for each component. |
prob |
vector of prior probability for each mixture component (i.e. mixture weights). All nonnegative values are allowed, and automatically recaled to ensure their sum equal to 1. Default is of equal probability for each component. |
type |
a (case insensitive) character string of example name; It includes examples from Marron & Wand (1992): "MW1", ..., "MW15", or equivalently "gauss", "skewed_unimodal", "strong_skewed", "kurtotic_unimodal", "outlier", "bimodal", "separated_bimodal", "skewed_bimodal", "trimodal", "claw", "double_claw", "asymmetric_claw", "asymmetric_double_claw", "smooth_comb", "discrete_comb"; It also includes "harp" example from Li et al. (2016). |
... |
Users either provide, optionally, mean
, sd
and prob
; or type
. In case of providing type
, the values of mean
, sd
and prob
are ignored.
The default case is standard normal, the same as dnorm
, pnorm
and rnorm
.
dmixnorm
gives the density, pmixnorm
gives the distribution function, and rmixnorm
generates random deviates.
The length of the result is determined by n
for rmixnorm
, and is the length of x
for dmixnorm
and pmixnorm
.
paramExample
gives a data frame with components mean
, sd
and prob
.
Li, H., Munk, A., Sieling, H., and Walther, G. (2016). The essential histogram. arXiv:1612.07216.
Marron, J. S., & Wand, M. P. (1992). Exact mean integrated squred error. Ann. Statist., 20(2), 712–736.
Normal for standard normal distributions; Distributions for other standard distributions.
## Example harp type = "harp" # generate random numbers n = 500 Y = rmixnorm(n, type = type) # compute the density x = seq(min(Y), max(Y), length.out = n) f = dmixnorm(x, type = type) # compute the distribution F = pmixnorm(x, type = type) # plots op = par(mfrow = c(1,2)) plot(x, f, type = "l", main = "Harp Density") rug(Y, col = 'red') plot(x, F, type = "l", main = "Harp Distribution") rug(Y, col = 'red') par(op)
## Example harp type = "harp" # generate random numbers n = 500 Y = rmixnorm(n, type = type) # compute the density x = seq(min(Y), max(Y), length.out = n) f = dmixnorm(x, type = type) # compute the distribution F = pmixnorm(x, type = type) # plots op = par(mfrow = c(1,2)) plot(x, f, type = "l", main = "Harp Density") rug(Y, col = 'red') plot(x, F, type = "l", main = "Harp Distribution") rug(Y, col = 'red') par(op)
Simulate quantiles of the multiscale statistics under any continuous distribution function.
msQuantile(n, alpha = c(0.5), nsim = 5e3, is.sim = (n < 1e4), intv = genIntv(n), mode = c("Con", "Gen"), ...)
msQuantile(n, alpha = c(0.5), nsim = 5e3, is.sim = (n < 1e4), intv = genIntv(n), mode = c("Con", "Gen"), ...)
n |
number of observations. |
alpha |
significance level; default as 0.5, see also |
nsim |
numer of Monte Carlo simulations. |
is.sim |
logical. If |
intv |
a data frame provides the system of intervals on which the multiscale statistic is defined. The data frame constains the following two columns
By default, it is set to the sparse interval system proposed by Rivera and Walther (2013), see |
mode |
See Li et al. (2016) for further details. |
... |
further arguments passed to function |
Empirically, it turns out that the quantile of the multiscale statistic converges fast to that of the limit distribution as the number of observations n
increases. Thus, for the sake of computational efficiency, the quantile with n
= 10,000 are used by default for that with n
> 10,000, which has already been precomputed and stored. Of course, for arbitrary sample size n
, one can always simulate the quantile by setting is.sim = TRUE
, and use the precomputed value by setting is.sim = FALSE
. For a given sample size n
, simulations are once computed, and then automatically recorded in the R memory for later usage. For memory efficiency, only the last simulation is stored.
A vector of length length(alpha)
is returned, the same structure as returned by funtion quantile
with option names = FALSE
; The values are the (1-alpha
)-quantile(s) of the null distribution of the multiscale statistic via Monte Carlo simulation, corresponding to (1-alpha
)-confidence level(s). See Li et al. (2016) for further details.
All the printing messages can be disabled by calling suppressMessages
.
Li, H., Munk, A., Sieling, H., and Walther, G. (2016). The essential histogram. arXiv:1612.07216.
Rivera, C., & Walther, G. (2013). Optimal detection of a jump in the intensity of a Poisson process or in a density with likelihood ratio statistics. Scand. J. Stat. 40, 752–769.
checkHistogram
,
essHistogram
,
genIntv
n = 100 # number of observations nsim = 100 # number of simulations alpha = c(0.1, 0.9) # significance level q = msQuantile(n, alpha, nsim) print(q)
n = 100 # number of observations nsim = 100 # number of simulations alpha = c(0.1, 0.9) # significance level q = msQuantile(n, alpha, nsim) print(q)