Package 'capushe' reference manual

Title:	CAlibrating Penalities Using Slope HEuristics
Description:	Calibration of penalized criteria for model selection. The calibration methods available are based on the slope heuristics.
Authors:	Sylvain Arlot, Vincent Brault, Jean-Patrick Baudry, Cathy Maugis and Bertrand Michel
Maintainer:	Vincent Brault <[email protected]>
License:	GPL (>= 2.0)
Version:	1.1.2
Built:	2024-12-22 06:22:48 UTC
Source:	CRAN

Capushe

Description

This package includes functions for model selection via penalization. The model selection criterion has the following form: $\gamma_n (\hat{s}_m)+scoef\times\kappa\times pen_{shape}(m)$ . Two algorithms based on the slope heuristics are proposed to calibrate the parameter $\kappa$ in the penalty: the data-driven slope estimation algorithm (DDSE) and the dimension jump algorithm (Djump).

Details

The data-driven slope estimation algorithm and the dimension jump algorithm are implemented in the DDSE function and the Djump function, respectively. Some classes are defined for the outputs of DDSE and Djump, and a graphical display is available for each of these two classes. DDSE and Djump are both included in the capushe function, which is the main function of the package.

Author(s)

Sylvain Arlot, Vincent Brault, Jean-Patrick Baudry, Cathy Maugis and Bertrand Michel.

Maintainer: Vincent Brault <[email protected]>

References

http://www.math.univ-toulouse.fr/~maugis/CAPUSHE.html

http://www.math.u-psud.fr/~brault/capushe.html

Article: Baudry, J.-P., Maugis, C. and Michel, B. (2011) Slope heuristics: overview and implementation. Statistics and Computing, to appear. doi: 10.1007/s11222-011-9236-1

Examples

data(datacapushe)
## capushe returns the same model with DDSE and Djump:
capushe(datacapushe)
## capushe also returns the model selected by AIC and BIC
capushe(datacapushe,n=1000)
## Djump only
Djump(datacapushe)
## DDSE only
DDSE(datacapushe)
## Graphical representations
plot(Djump(datacapushe))
plot(DDSE(datacapushe))
plot(capushe(datacapushe))
## Validation procedure
data(datapartialcapushe)
capushepartial=capushe(datapartialcapushe)
plot(capushepartial)
## Additional data
data(datavalidcapushe)
validation(capushepartial,datavalidcapushe) ## The slope heuristics should not 
## be applied for datapartialcapushe.

AICcapushe and BICcapushe

Description

These functions return the model selected by the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC).

Usage

AICcapushe(data, n)
BICcapushe(data, n)

Arguments

data

data is a matrix or a data.frame with four columns of the same length and each line corresponds to a model:

The first column contains the model names.
The second column contains the penalty shape values.
The third column contains the model complexity values.
The fourth column contains the minimum contrast value for each model.

n

n is the sample size.

Details

The penalty shape values should increase with the complexity values (column 3). The complexity values must be positive. n is required to compute the AIC and BIC criteria: it is the size of the sample used to compute the contrast values given in the data matrix. Do not confuse n with the size of the model collection, which is the number of rows of the data matrix.
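
As an illustration of how n enters the two criteria, the following base R sketch selects a model from a four-column table. It is not the package's implementation: the helper name ic_sketch, the toy table and the criterion forms (contrast + complexity/n for AIC, contrast + complexity × log(n)/(2n) for BIC) are assumptions, which are reasonable when the contrast is the negative log-likelihood divided by n.

```r
# Hypothetical helper (not part of capushe): AIC- and BIC-type selection
# from a four-column table (model, pen, complexity, contrast).
# Assumption: contrast = negative log-likelihood divided by n.
ic_sketch <- function(data, n) {
  names(data) <- c("model", "pen", "complexity", "contrast")
  stopifnot(n > 0, all(data$complexity > 0))
  aic <- data$contrast + data$complexity / n
  bic <- data$contrast + data$complexity * log(n) / (2 * n)
  list(AIC_model = as.character(data$model[which.min(aic)]),
       BIC_model = as.character(data$model[which.min(bic)]))
}

# Toy collection: 20 models (rows) fitted on a sample of size n = 100.
D <- 1:20
n <- 100
toy <- data.frame(model = paste0("m", D), pen = D / n, complexity = D,
                  contrast = exp(-D) - D / (2 * n))
sel <- ic_sketch(toy, n = n)
sel
```

Note that the table has 20 rows while n = 100: the sample size is passed separately and is never inferred from the number of models.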

Value

`model`	The model selected by AIC or BIC.
`AIC`	The corresponding value of AIC (for AICcapushe only).
`BIC`	The corresponding value of BIC (for BICcapushe only).

Author(s)

Vincent Brault

References

http://www.math.univ-toulouse.fr/~maugis/CAPUSHE.html

http://www.math.u-psud.fr/~brault/capushe.html

Article: Baudry, J.-P., Maugis, C. and Michel, B. (2011) Slope heuristics: overview and implementation. Statistics and Computing, to appear. doi: 10.1007/s11222-011-9236-1

Examples

data(datacapushe)
AICcapushe(datacapushe,n=1000)
BICcapushe(datacapushe,n=1000)


CAlibrating Penalities Using Slope HEuristics (CAPUSHE)

Description

The capushe function proposes two algorithms based on the slope heuristics to calibrate penalties in the context of model selection via penalization.

Usage

capushe(data,n=0,pct=0.15,point=0,psi.rlm=psi.bisquare,scoef=2,
			Careajump=0,Ctresh=0)

Arguments

`data`	`data` is a matrix or a data.frame with four columns of the same length and each line corresponds to a model: The first column contains the model names. The second column contains the penalty shape values. The third column contains the model complexity values. The fourth column contains the minimum contrast value for each model.
`n`	`n` is the sample size.
`pct`	Minimum percentage of points for the plateau selection. See `DDSE` for more details.
`point`	Minimum number of points for the plateau selection (see `DDSE` for more details). If `point` is different from 0, `pct` is ignored.
`psi.rlm`	Weight function used by `rlm`. See `DDSE` for more details. `psi.rlm`="lm" for non robust linear regression.
`scoef`	Ratio parameter. Default value is 2.
`Careajump`	Constant of jump area (See `Djump` for more details). Default value is 0 (no area).
`Ctresh`	Maximal threshold for the complexity associated with the penalty coefficient (see `Djump` for more details). Default value is 0 (the maximal jump is selected as the greatest jump).

Details

The model $\hat{m}$ selected by the procedure fulfills

$\hat{m}=$ argmin $\gamma_n (\hat{s}_m)+scoef\times \kappa\times pen_{shape}(m)$

where

$\kappa$ is the penalty coefficient.
$\gamma_n$ is the empirical contrast.
$\hat{s}_m$ is the estimator for the model $m$ .
$scoef$ is the ratio parameter.
$pen_{shape}$ is the penalty shape.

The capushe function calls the functions DDSE and Djump to calibrate $\kappa$ , see the description of these functions for more details. In the case of equality between two penalty shape values, only the model with the smallest contrast is considered.
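
For a fixed value of $\kappa$, the selection rule above and the handling of equal penalty shape values can be sketched in base R as follows. The function select_model and the toy table are hypothetical and only illustrate the criterion; in the package, $\kappa$ is calibrated by DDSE or Djump rather than supplied by the user.

```r
# Hypothetical helper (not part of capushe): minimise
# contrast + scoef * kappa * pen_shape for a given kappa.
select_model <- function(data, kappa, scoef = 2) {
  names(data) <- c("model", "pen", "complexity", "contrast")
  # Among models sharing a penalty shape value, keep the smallest contrast.
  data <- data[order(data$pen, data$contrast), ]
  data <- data[!duplicated(data$pen), ]
  crit <- data$contrast + scoef * kappa * data$pen
  as.character(data$model[which.min(crit)])
}

# Toy collection of 20 models, plus a model "m5b" that shares the penalty
# shape of "m5" but has a larger contrast, so it is never considered.
D <- 1:20
toy <- data.frame(model = paste0("m", D), pen = D / 100, complexity = D,
                  contrast = exp(-D) - D / 200)
toy <- rbind(data.frame(model = "m5b", pen = 5 / 100, complexity = 5,
                        contrast = 0.5), toy)
best <- select_model(toy, kappa = 0.5)
best
```

Larger values of kappa penalise complexity more heavily and therefore select smaller models.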

Value

`@DDSE`	A list returned by the `DDSE` function.
`@DDSE@model`	The `model` selected by the `DDSE` function.
`@DDSE@kappa`	The vector of the successive slope values.
`@DDSE@ModelHat`	A list providing details about the model selected by the `DDSE` function.
`@DDSE@interval`	A list about the "slope interval" corresponding to the plateau selected in `DDSE`. See `DDSE` for more details.
`@DDSE@graph`	A list computed for the `plot` function.
`@Djump`	A list returned by the `Djump` function.
`@Djump@model`	The `model` selected by the `Djump` function.
`@Djump@ModelHat`	A list providing details about the model selected by the `Djump` function.
`@Djump@graph`	A list computed for the `plot` function.
`@AIC_capushe`	A list returned by the `AICcapushe` function.
`@BIC_capushe`	A list returned by the `BICcapushe` function.
`@n`	Sample size.

Author(s)

Vincent Brault

References

http://www.math.univ-toulouse.fr/~maugis/CAPUSHE.html

http://www.math.u-psud.fr/~brault/capushe.html

Article: Baudry, J.-P., Maugis, C. and Michel, B. (2011) Slope heuristics: overview and implementation. Statistics and Computing, to appear. doi: 10.1007/s11222-011-9236-1

Examples

data(datacapushe)
capushe(datacapushe) 
capushe(datacapushe,1000)

datacapushe

Description

A data frame example for the capushe package based on a simulated Gaussian mixture dataset in $\mathbb{R}^3$ .

Usage

data(datacapushe)

Format

A data frame with 50 rows (models) and the following 4 variables:

model: a character vector of model names.
pen: a numeric vector of model penalty shape values.
complexity: a numeric vector of model complexity values.
contrast: a numeric vector of model contrast values.

Details

The simulated dataset is composed of $n=1000$ observations in $\mathbb{R}^3$ . It consists of an equiprobable mixture of three large "bubble" groups centered at $\nu_1=(0,0,0)$ , $\nu_2=(6,0,0)$ and $\nu_3=(0,6,0)$ respectively. Each bubble group $j$ is simulated from a mixture of seven components according to the following density:

$x\in\mathbb{R}^3\rightarrow 0.4\Phi(x|\mu_1+\nu_j,I_3)+\sum_{k=2}^{7}0.1\Phi(x|\mu_k+\nu_j,0.1I_3)$

with $\mu_1=(0,0,0)$ , $\mu_2=(0,0,1.5)$ , $\mu_3=(0,1.5,0)$ , $\mu_4=(1.5,0,0)$ , $\mu_5=(0,0,-1.5)$ , $\mu_6=(0,-1.5,0)$ and $\mu_7=(-1.5,0,0)$ . Thus the distribution of the dataset is actually a $21$ -component Gaussian mixture.

A model collection of spherical Gaussian mixtures is considered and the dataframe datacapushe contains the maximum likelihood estimations for each of these models. The number of free parameters of each model is used for the complexity values and $pen_{shape}$ is defined by this complexity divided by $n$ .

datapartialcapushe and datavalidcapushe can be used to run the validation function. datapartialcapushe only contains the models with fewer than $21$ components. datavalidcapushe contains three models with $30$ , $40$ and $50$ components respectively.
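
The simulation design described above can be sketched in base R as follows. This is only an illustration of the stated mixture, not the code used to build datacapushe: the function name simulate_bubbles and the seed are assumptions, and $\Phi(x|\mu,\Sigma)$ is read as the Gaussian density with covariance matrix $\Sigma$, so that $0.1I_3$ corresponds to a standard deviation of $\sqrt{0.1}$ per coordinate.

```r
# Hypothetical helper (not part of capushe): draw n points from the
# 21-component "bubble" mixture described in the Details section.
simulate_bubbles <- function(n = 1000) {
  nu <- rbind(c(0, 0, 0), c(6, 0, 0), c(0, 6, 0))          # group centres
  mu <- rbind(c(0, 0, 0), c(0, 0, 1.5), c(0, 1.5, 0), c(1.5, 0, 0),
              c(0, 0, -1.5), c(0, -1.5, 0), c(-1.5, 0, 0)) # component offsets
  group <- sample(3, n, replace = TRUE)                    # equiprobable groups
  comp <- sample(7, n, replace = TRUE, prob = c(0.4, rep(0.1, 6)))
  # Covariance I_3 for the first component, 0.1 * I_3 for the six others.
  sdev <- ifelse(comp == 1, 1, sqrt(0.1))
  # sdev is recycled down the columns, so row i is scaled by sdev[i].
  x <- nu[group, ] + mu[comp, ] + matrix(rnorm(3 * n), n, 3) * sdev
  list(x = x, group = group, component = comp)
}

set.seed(1)
sim <- simulate_bubbles()
dim(sim$x)
```

Fitting a collection of spherical Gaussian mixtures to such a sample and recording, for each model, its penalty shape, complexity and contrast would produce a table with the same layout as datacapushe.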

Source

http://www.math.univ-toulouse.fr/~maugis/CAPUSHE.html

References

Article: Baudry, J.-P., Maugis, C. and Michel, B. (2011) Slope heuristics: overview and implementation. Statistics and Computing, to appear. doi: 10.1007/s11222-011-9236-1

Examples

data(datacapushe)
capushe(datacapushe,n=1000)
## BIC, DDSE and Djump all three select the true model
plot(capushe(datacapushe))
## Validation:
data(datapartialcapushe)
capushepartial=capushe(datapartialcapushe)
data(datavalidcapushe)
validation(capushepartial,datavalidcapushe) ## The slope heuristics should not 
## be applied for datapartialcapushe.

Model selection by Data-Driven Slope Estimation

Description

DDSE is a model selection function based on the slope heuristics.

Usage

DDSE(data, pct = 0.15, point = 0, psi.rlm = psi.bisquare, scoef = 2)

Arguments

`data`	`data` is a matrix or a data.frame with four columns of the same length and each line corresponds to a model: The first column contains the model names. The second column contains the penalty shape values. The third column contains the model complexity values. The fourth column contains the minimum contrast value for each model.
`pct`	Minimum percentage of points for the plateau selection. It must be between 0 and 1. Default value is 0.15.
`point`	Minimum number of points for the plateau selection. If `point` is different from 0, `pct` is ignored.
`psi.rlm`	Weight function used by `rlm`. `psi.rlm`="lm" for non robust linear regression.
`scoef`	Ratio parameter. Default value is 2.

Details

Let $M$ be the model collection and $P=\{pen_{shape}(m),m\in M\}$ . The DDSE algorithm proceeds in four steps:

1. If several models in the collection have the same penalty shape value (column 2), only the model having the smallest contrast value $\gamma_n(\hat{s}_m)$ (column 4) is considered.
2. For any $p\in P$ , the slope $\hat{\kappa}(p)$ (output @kappa) of the linear regression (argument psi.rlm) on the pairs of points $\{(pen_{shape}(m),-\gamma_n (\hat{s}_m)); pen_{shape}(m)\geq p\}$ is computed.
3. For any $p\in P$ , the model fulfilling the following condition is selected:

$\hat{m}(p)=\arg\min_{m\in M} \{\gamma_n (\hat{s}_m)+scoef\times \hat{\kappa}(p)\times pen_{shape}(m)\}$ .

This gives an increasing sequence of change-points $(p_i)_{1\leq i\leq I+1}$ (output @ModelHat$point_breaking). Let $(N_i)_{1\leq i\leq I}$ (output @ModelHat$number_plateau) be the lengths of the successive "plateaus".
4. If point is different from 0, let $\hat{i}=\max\{1\leq i\leq I; N_i\geq point\}$ ; otherwise let $\hat{i}=\max\{1\leq i\leq I; N_i\geq pct\sum_{l=1}^{I}N_l\}$ (output @ModelHat$imax). The model $\hat{m}(p_{\hat{i}})$ (output @model) is finally returned.

The "slope interval" is the interval $[a,b]$ where $a=\inf\{\hat{\kappa}(p),p\in[p_{\hat{i}},p_{\hat{i}+1}[\cap P\}$ and $b=\sup\{\hat{\kappa}(p),p\in[p_{\hat{i}},p_{\hat{i}+1}[\cap P\}$ .
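
The four steps can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the DDSE function itself: the name ddse_sketch and the toy table are hypothetical, the regression uses lm (comparable to psi.rlm="lm") instead of a robust fit, only the pct rule of the last step is handled, and plateaus are taken to be runs of consecutive identical selections ordered by increasing $p$.

```r
# Hypothetical helper (not part of capushe): simplified DDSE steps.
ddse_sketch <- function(data, pct = 0.15, scoef = 2) {
  names(data) <- c("model", "pen", "complexity", "contrast")
  # Step 1: among equal penalty shapes, keep the smallest contrast.
  data <- data[order(data$pen, data$contrast), ]
  data <- data[!duplicated(data$pen), ]
  P <- data$pen
  n_fit <- length(P) - 1              # each regression needs two points
  kappa <- numeric(n_fit)
  chosen <- character(n_fit)
  for (i in seq_len(n_fit)) {
    keep <- data$pen >= P[i]
    # Step 2: slope of -contrast against pen_shape on the points pen >= p.
    fit <- lm(I(-contrast) ~ pen, data = data[keep, ])
    kappa[i] <- unname(coef(fit)[2])
    # Step 3: model minimising contrast + scoef * kappa(p) * pen_shape.
    crit <- data$contrast + scoef * kappa[i] * data$pen
    chosen[i] <- as.character(data$model[which.min(crit)])
  }
  # Step 4: last plateau holding at least a proportion pct of the points.
  runs <- rle(chosen)
  i_hat <- max(which(runs$lengths >= pct * sum(runs$lengths)))
  list(model = runs$values[i_hat], kappa = kappa,
       plateau_lengths = runs$lengths, imax = i_hat)
}

# Toy collection in which -contrast is nearly linear in pen_shape
# (slope 0.5) for the most complex models.
D <- 1:20
toy <- data.frame(model = paste0("m", D), pen = D / 100, complexity = D,
                  contrast = exp(-D) - D / 200)
res <- ddse_sketch(toy)
res$model
```

If no plateau reaches the proportion pct, this sketch fails; the handling of that case in the package is not described here.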

Value

`@model`	The `model` selected by the DDSE algorithm.
`@kappa`	The vector of the successive slope values.
`@ModelHat`	A list describing the algorithm.
`@ModelHat$model_hat`	The vector of preselected models $\hat{m}(p)$ .
`@ModelHat$point_breaking`	The vector of the breaking points $(p_i)_{1\leq i\leq I+1}$ .
`@ModelHat$number_plateau`	The vector of the lengths $(N_i)_{1\leq i\leq I}$ .
`@ModelHat$imax`	The rank $\hat{i}$ of the selected plateau.
`@interval`	A list about the "slope interval".
`@interval$interval`	The slope interval.
`@interval$percent_of_points`	The proportion $N_{\hat{i}}/\sum_{l=1}^IN_l$ .
`@graph`	A list computed for the `plot` method.

Author(s)

Vincent Brault

References

http://www.math.univ-toulouse.fr/~maugis/CAPUSHE.html

http://www.math.u-psud.fr/~brault/capushe.html

Article: Baudry, J.-P., Maugis, C. and Michel, B. (2011) Slope heuristics: overview and implementation. Statistics and Computing, to appear. doi: 10.1007/s11222-011-9236-1

Examples

data(datacapushe)
DDSE(datacapushe)
plot(DDSE(datacapushe))
## DDSE with "lm" for the regression
DDSE(datacapushe,psi.rlm="lm")

Model selection by dimension jump

Description

Djump is a model selection function based on the slope heuristics.

Usage

Djump(data,scoef=2,Careajump=0,Ctresh=0)

Arguments

`data`	`data` is a matrix or a data.frame with four columns of the same length and each line corresponds to a model: The first column contains the model names. The second column contains the penalty shape values. The third column contains the model complexity values. The fourth column contains the minimum contrast value for each model.
`scoef`	Ratio parameter. Default value is 2.
`Careajump`	Constant of jump area. Default value is 0 (no area). In practice, it is advisable to take $Careajump=\sqrt{\frac{\log(n)}{n}}$ where $n$ is the number of observations.
`Ctresh`	Maximal threshold for the complexity associated with the penalty coefficient. Default value is 0 (the maximal jump is selected as the greatest jump). In practice, it is advisable to take $Ctresh=\frac{n}{\log(n)}$ where $n$ is the number of observations.

Details

The Djump algorithm proceeds in three steps:

1. For all $\kappa>0$ , compute

$m(\kappa)\in \arg\min_{m\in M} \{\gamma_n(\hat{s}_m)+\kappa\times pen_{shape}(m)\}$

This gives a decreasing step function $\kappa \mapsto C_{m(\kappa)}$ .
2. Find $\hat{\kappa}$ : if $C_{tresh}=0$ , $\hat{\kappa}$ is the value of $\kappa$ at which the greatest jump of complexity $C_{m(\kappa)}$ occurs; otherwise

$\hat{\kappa}=\inf\{\kappa>0: C_{m(\kappa)}\leq C_{tresh}\}.$
3. Select $\hat{m}=m(scoef\times\hat{\kappa})$ (output @model).

Arlot has proposed a jump area containing the maximal jump, defined by:

$[\kappa(1-Careajump);\kappa(1+Careajump)].$

If $Careajump>0$ , Djump returns the area with the greatest jump. In practice, it is advisable to take $Careajump=\sqrt{\frac{\log(n)}{n}}$ where $n$ is the number of observations.
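
The three steps can be sketched in base R as follows. This is a simplified illustration, not the Djump function: the name djump_sketch, the grid of $\kappa$ values and the toy table are assumptions, the step function is only evaluated on the grid (so $\hat{\kappa}$ is located up to the grid step), and only the case Ctresh=0 and Careajump=0 is covered.

```r
# Hypothetical helper (not part of capushe): dimension jump on a kappa grid.
djump_sketch <- function(data, scoef = 2,
                         kappa_grid = seq(0.001, 3, by = 0.001)) {
  names(data) <- c("model", "pen", "complexity", "contrast")
  # Step 1: index of the model m(kappa) for each kappa on the grid.
  pick <- function(kappa) which.min(data$contrast + kappa * data$pen)
  idx <- vapply(kappa_grid, pick, integer(1))
  comp <- data$complexity[idx]
  # Step 2: greatest drop of complexity between consecutive grid values.
  jumps <- -diff(comp)
  j <- which.max(jumps)
  kappa_hat <- kappa_grid[j + 1]
  # Step 3: model selected with the penalty scoef * kappa_hat.
  list(model = as.character(data$model[pick(scoef * kappa_hat)]),
       kappa_hat = kappa_hat, Kopt = scoef * kappa_hat)
}

# Toy collection in which the minimal penalty constant is close to 0.5.
D <- 1:20
toy <- data.frame(model = paste0("m", D), pen = D / 100, complexity = D,
                  contrast = exp(-D) - D / 200)
dj <- djump_sketch(toy)
dj$model
```

A grid is used here for brevity; the exact jump locations could instead be obtained from the breakpoints of the piecewise-linear function $\kappa \mapsto \min_m \{\gamma_n(\hat{s}_m)+\kappa\times pen_{shape}(m)\}$ .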

Value

`@model`	The `model` selected by the dimension jump method.
`@ModelHat`	A list describing the algorithm.
`@ModelHat$jump`	The vector of jump heights.
`@ModelHat$kappa`	The vector of the values of $\kappa$ at each jump.
`@ModelHat$model_hat`	The vector of the selected models $m(\kappa)$ by the jump.
`@ModelHat$JumpMax`	The location of the greatest jump.
`@ModelHat$Kopt`	$\kappa_{opt}=scoef\hat{\kappa}$ .
`@graph`	A list computed for the `plot` method.

Author(s)

Vincent Brault

References

http://www.math.univ-toulouse.fr/~maugis/CAPUSHE.html

http://www.math.u-psud.fr/~brault/capushe.html

Article: Baudry, J.-P., Maugis, C. and Michel, B. (2011) Slope heuristics: overview and implementation. Statistics and Computing, to appear. doi: 10.1007/s11222-011-9236-1

Examples

data(datacapushe)
Djump(datacapushe)
plot(Djump(datacapushe))
Djump(datacapushe,Careajump=sqrt(log(1000)/1000))
plot(Djump(datacapushe,Careajump=sqrt(log(1000)/1000)))
Djump(datacapushe,Ctresh=1000/log(1000))
plot(Djump(datacapushe,Ctresh=1000/log(1000)))

Plot for capushe

Description

The plot methods allow the user to check that the slope heuristics can be applied confidently.

Usage

plot(x,newwindow=TRUE,ask=TRUE) for capushe.

plot(x,newwindow=TRUE) for DDSE and Djump.

Arguments

`x`	Output of `DDSE`, `Djump` or `capushe`.
`newwindow`	If `newwindow`=`TRUE`, a new window is created for each plot.
`ask`	If `ask`=`TRUE`, `plot` waits for the user to press a key to display the next plot (only for the class `capushe`).

Details

The graphical window of DDSE is composed of three graphics (see DDSE for more details):

left: The left plot shows $-\gamma_n(\hat{s}_m)$ with respect to the penalty shape values.
topright: Successive slope values $\hat{\kappa}(p)$ .
bottomright: The bottomright plot shows the selected models $\hat{m}(p)$ with respect to the successive slope values. The plateau in blue is selected.

The graphical window of Djump shows the complexity $C_{m(\kappa)}$ of the selected model with respect to $\kappa$ . $\hat{\kappa}^{dj}$ corresponds to the greatest jump. $\kappa_{opt}$ is defined by $\kappa_{opt}=scoef\times \hat{\kappa}^{dj}$ . The red line represents the slope interval computed by the DDSE algorithm (only for capushe). See Djump for more details.

Methods

signature(x = "Capushe"): This graphical function displays the DDSE plot and the Djump plot.
signature(x = "DDSE"): This graphical function displays the DDSE plot.
signature(x = "Djump"): This graphical function displays the Djump plot.

Note

Use newwindow=FALSE to produce PDF files (for an object of class capushe, also set ask=FALSE).

validation

Description

validation checks that the slope heuristics can be applied confidently.

Usage

validation(x,data2,...)

Arguments

x

x must be an object of class capushe or DDSE, in practice an output of the capushe function or the DDSE function.

data2

data2 is a matrix or a data.frame with four columns of the same length and each line corresponds to a model:

The first column contains the model names.
The second column contains the penalty shape values.
The third column contains the model complexity values.
The fourth column contains the minimum contrast value for each model.

...

If newwindow==TRUE, a new window is created for the plot.

Details

The validation function plots the additional, more complex models given in data2 to check that the linear relation between the penalty shape values and the contrast values (which is recorded in x) still holds for these more complex models.

Author(s)

Vincent Brault

References

http://www.math.univ-toulouse.fr/~maugis/CAPUSHE.html

http://www.math.u-psud.fr/~brault/capushe.html

Article: Baudry, J.-P., Maugis, C. and Michel, B. (2011) Slope heuristics: overview and implementation. Statistics and Computing, to appear. doi: 10.1007/s11222-011-9236-1

Examples

data(datapartialcapushe)
capushepartial=capushe(datapartialcapushe)
data(datavalidcapushe)
validation(capushepartial,datavalidcapushe) ## The slope heuristics should not 
## be applied for datapartialcapushe.
data(datacapushe)
plot(capushe(datacapushe))
