library(PCObw)

Introduction

When you estimate the density of a sample with a kernel density estimator (as in the density() function of package stats R Core Team (2018)) you need the crucial parameter “bw”. This parameter is one of the most important for the quality of the estimator. The package PCO provides the optimal bandwidth according to the PCO criterion (for details on this criterion see Varet et al. (2019)).

The package is based on a golden section search where the objective function can be the exact criterion or a binned version of the criterion. For multivariate case, the univariate golden section search has been adapted with no theoretical guarantee of convergence to the global optimum.

There is no limitation on the dimension of the data. However, it is known that kernel density estimation is not adapted for high dimensional data. Moreover, for time and memory reasons, it is recommended to use this package with small dimensions data (lower than 10 for example).

This package contains only two functions. For univariate data, these two functions return the same result. The difference holds for multivariate data:

bw.L2PCO: searches for a full matrix
bw.L2PCO.diag: searches for a diagonal matrix

Univariate data

The input data must be a numeric vector.

# simple example with univariate data
data("gauss_1D_sample")
bw.L2PCO(gauss_1D_sample)
#> [1] 0.504263

Different kernel

For univariate data, it is possible to change the kernel used for the criterion computation with the “K_name” parameter. As default value, “K_name” is set to “gaussian” but there is two other kernels implemented : “epanechnikov” and “biweight”.

# simple example with epanechnikov kernel
data("gauss_1D_sample")
bw.L2PCO(gauss_1D_sample, K_name = "epanechnikov")
#> [1] 0.3992373

# simple example with biweight kernel
bw.L2PCO(gauss_1D_sample, K_name = "biweight")
#> [1] 0.4420822

Optimisation parameters

The optimisation is based on the algorithm of the golden section search. The search stops when one of the two following condition is true:

The number of criterion evaluations is greater or equal to “nh” (default value: nh = 40)
The interval of search has a length lower than the tolerance “tol” (default value: tol = 10^(-6))

# example when the tolerance is not reached
data("gauss_1D_sample")
bw.L2PCO(gauss_1D_sample, nh = 3)
#> Warning in h_GK_1D_exact(xi = xi, nh_max = nh, tol = tol): Warning: The maximum
#> number of evaluations has been reached but not the tolerance
#> [1] 0.5746656

bw.L2PCO(gauss_1D_sample, tol = 10^(-6))
#> [1] 0.504263

Exact or binned criterion

Two ways are possible for the PCO criterion computation. The exact formula is used as default. If the size of the sample is huge, it is preferable to use a binned version with the option “binning = TRUE” for time computation reasons.

# binning example
data("gauss_1D_sample")
bw.L2PCO(gauss_1D_sample, binning = TRUE)
#> Warning in h_GK_1D_bin(xi = xi, nb_bin = nb, nh_max = nh, tol = tol,
#> adapt_nb_bin = adapt_nb_bin): Warning: the number of bins, nb, is probably too
#> small. Increase nb_bin or try with adapt_nb_bin = TRUE
#> [1] 0.1505407

For univariate data, the default number of bins is 32. But, as seen in the previous example, it can be too small. In that case, it is possible to increase the number of bins in setting the option “nb”, or to set the option “adapt_nb_bin” to TRUE.

# change the number of bins "nb"
data("gauss_1D_sample")
bw.L2PCO(gauss_1D_sample, binning = TRUE, nb = 130)
#> [1] 0.5007865

# or use "adapt_nb_bin = TRUE"
bw.L2PCO(gauss_1D_sample, binning = TRUE, adapt_nb_bin = TRUE)
#> the number of bins has been increased up to 132
#> [1] 0.5011275

# time comparison between exact and binned criterion with an huge sample
huge_sample <- rnorm(n = 10000, mean = 0, sd = 1)

ptm0 <- proc.time()
bw.L2PCO(huge_sample)
#> [1] 0.1250733
proc.time() - ptm0
#>    user  system elapsed 
#>  21.975   0.002  21.978

ptm0 <- proc.time()
bw.L2PCO(huge_sample, binning = TRUE, adapt_nb_bin = TRUE)
#> the number of bins has been increased up to 20032
#> [1] 0.1201239
proc.time() - ptm0
#>    user  system elapsed 
#>   0.359   0.000   0.358

Multivariate data

The input data must be, in this case, a numeric matrix. The choice of the function to use is link to the output. If only a diagonal matrix is intended, then use bw.L2PCO.diag. If a full matrix is necessary, then use bw.L2PCO. Please note that the off-diagonal elements are not directly optimised, they are computed using a transformation P^(-1) * diag(H) * P, with P the matrix of the eigen decomposion of the covariance matrix of the data and diag(H) is a diagonal matrix (as described in Varet et al. (2019)).


# example with 2D data
data("gauss_mD_sample")

# to return a full matrix
bw.L2PCO(gauss_mD_sample)
#> Warning in h_GK_mD_full_exact(x_i = xi, S = S, nh_max = nh, tol = tol):
#> Warning: The maximum number of evaluations has been reached but not the
#> tolerance
#>           [,1]      [,2]
#> [1,] 0.3673714 0.1677193
#> [2,] 0.1677193 0.3905522

# to return a diagonal matrix
bw.L2PCO.diag(gauss_mD_sample)
#> Warning in h_GK_mD_diag_exact(x_i = xi, nh_max = nh, tol = tol): Warning: The
#> maximum number of evaluations has been reached but not the tolerance
#>          [,1]      [,2]
#> [1,] 0.312619 0.0000000
#> [2,] 0.000000 0.2554287

Optimisation parameters

The optimisation is based on the algorithm of the golden section search on each dimension. The search stops when one of the two following condition is true:

The number of criterion evaluations is greater or equal to “nh” (default value: nh = 40)
The interval of search has a length lower than the tolerance “tol” (default value: tol = 10^(-6)) on each dimension

As the previous example illustrates, it can be necessary to increase the maximal number of criterion evaluations for more accurate results or to increase the tolerance for faster results.

data("gauss_mD_sample")

# increase the tolerance for faster results
bw.L2PCO.diag(gauss_mD_sample, tol = 10^(-3))
#>          [,1]      [,2]
#> [1,] 0.312619 0.0000000
#> [2,] 0.000000 0.2554287

# increase "nh" for more accurate results
bw.L2PCO.diag(gauss_mD_sample, nh = 80)
#>           [,1]     [,2]
#> [1,] 0.3122675 0.000000
#> [2,] 0.0000000 0.255734

# increase the tolerance for faster results
bw.L2PCO(gauss_mD_sample, tol = 10^(-3))
#>           [,1]      [,2]
#> [1,] 0.3673714 0.1677193
#> [2,] 0.1677193 0.3905522

# increase "nh" for more accurate results
bw.L2PCO(gauss_mD_sample, nh = 80)
#>           [,1]      [,2]
#> [1,] 0.3673956 0.1673685
#> [2,] 0.1673685 0.3905280

Exact or binned criterion

As for univariate data, it is possible to use a binned version of the criterion instead of the exact formula (which is the default computation) in setting “binning” parameter to TRUE. Binning is faster when sample is huge, otherwise it can be slower. For multivariate data, the default number of bins is 32 on each dimension. It is possible to have a different number of bins according to the dimension. For this purpose, use nb_bin_vect parameter. However, contrary to 1D case, there is no check of the number of bins.

data("gauss_mD_sample")

# with a too small number of bins, the results are degenerated
bw.L2PCO.diag(gauss_mD_sample, binning = TRUE, nh = 80, nb_bin_vect = c(5, 10))
#>            [,1]       [,2]
#> [1,] 0.04274339 0.00000000
#> [2,] 0.00000000 0.04274535

bw.L2PCO.diag(gauss_mD_sample, binning = TRUE, nh = 80, nb_bin_vect = c(40, 80))
#>           [,1]      [,2]
#> [1,] 0.3152471 0.0000000
#> [2,] 0.0000000 0.2643706

# with a too small number of bins, the results are degenerated
bw.L2PCO(gauss_mD_sample, binning = TRUE, nh = 80, nb_bin_vect = c(5, 10))
#>             [,1]         [,2]
#> [1,] 4.27443e-02 9.764010e-07
#> [2,] 9.76401e-07 4.274444e-02

bw.L2PCO(gauss_mD_sample, binning = TRUE, nh = 80, nb_bin_vect = c(40, 45))
#>           [,1]      [,2]
#> [1,] 0.3031635 0.2156144
#> [2,] 0.2156144 0.3329640

References

R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Varet, Suzanne, Claire Lacour, Pascal Massart, and Vincent Rivoirard. 2019. “Numerical Performance of Penalized Comparison to Overfitting for Multivariate Kernel Density Estimation.” Hal-02002275. https://hal.archives-ouvertes.fr/hal-02002275.