CVST: Fast Cross-Validation via Sequential Testing
Package: CVST
Type: Package
Title: Fast Cross-Validation via Sequential Testing
Version: 0.2-3
Date: 2022-02-19
Depends: kernlab, Matrix
Author: Tammo Krueger, Mikio Braun
Maintainer: Tammo Krueger <[email protected]>
Description: The fast cross-validation via sequential testing (CVST) procedure is an improved cross-validation procedure which uses non-parametric testing coupled with sequential analysis to determine the best parameter set on linearly increasing subsets of the data. By eliminating under-performing candidates quickly and keeping promising candidates as long as possible, the method speeds up the computation while preserving the capability of a full cross-validation. In addition to CVST, the package contains an implementation of ordinary k-fold cross-validation with a flexible and powerful set of helper objects and methods to handle the overall model selection process. The implementations of Cochran's Q test with permutations and the sequential testing framework of Wald are generic and can therefore also be used in other contexts.
License: GPL (>= 2.0)
NeedsCompilation: no
Packaged: 2022-02-21 18:10:19 UTC; tammok
Repository: CRAN
Date/Publication: 2022-02-21 18:40:02 UTC
Index of help topics:
CV                        Perform a k-fold Cross-validation
CVST-package              Fast Cross-Validation via Sequential Testing
cochranq.test             Cochran's Q Test with Permutation
constructCVSTModel        Setup for a CVST Run.
constructData             Construction and Handling of 'CVST.data' Objects
constructLearner          Construction of Specific Learners for CVST
constructParams           Construct a Grid of Parameters
constructSequentialTest   Construct and Handle Sequential Tests.
fastCV                    The Fast Cross-Validation via Sequential Testing (CVST) Procedure
noisyDonoho               Generate Donoho's Toy Data Sets
noisySine                 Regression and Classification Toy Data Set
Tammo Krueger, Mikio Braun
Maintainer: Tammo Krueger <[email protected]>
Tammo Krueger, Danny Panknin, and Mikio Braun. Fast cross-validation via sequential testing. Journal of Machine Learning Research 16 (2015) 1103-1155. URL https://jmlr.org/papers/volume16/krueger15a/krueger15a.pdf.
Abraham Wald. Sequential Analysis. Wiley, 1947.
W. G. Cochran. The comparison of percentages in matched samples. Biometrika, 37 (3-4):256–266, 1950.
M. Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32 (200):675–701, 1937.
ns = noisySine(100)
svm = constructSVMLearner()
params = constructParams(kernel="rbfdot", sigma=10^(-3:3),
                         nu=c(0.05, 0.1, 0.2, 0.3))
opt = fastCV(ns, svm, params, constructCVSTModel())
Performs Cochran's Q test on the data. If the data matrix contains too few elements, the chi-squared distribution of the test statistic is replaced by a permutation variant.
cochranq.test(mat)
mat: The data matrix with the individuals in the rows and treatments in the columns.
Returns an htest object with the usual entries.
Tammo Krueger <[email protected]>
W. G. Cochran. The comparison of percentages in matched samples. Biometrika, 37 (3-4):256–266, 1950.
Kashinath D. Patil. Cochran's Q test: Exact distribution. Journal of the American Statistical Association, 70 (349):186–189, 1975.
Merle W. Tate and Sara M. Brown. Note on the Cochran Q test. Journal of the American Statistical Association, 65 (329):155–160, 1970.
mat = matrix(c(rep(0, 10), 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
               0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0,
               0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1,
               1, 1, 0, 1, 0, 1), ncol=4)
cochranq.test(mat)
mat = matrix(c(rep(0, 7), 1, rep(0, 12), 1, 1, 0, 1, rep(0, 5), 1, 0, 1, 0,
               1, 0, 0, 0, 1, 0, 1), nrow=8)
cochranq.test(mat)
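Since the result is a standard htest object, its components can be inspected directly. A minimal sketch (the random matrix here is illustrative only):

mat = matrix(rbinom(40, 1, 0.5), ncol=4)  # 10 individuals, 4 treatments
res = cochranq.test(mat)
res$statistic  # the Q statistic
res$p.value    # the (possibly permutation-based) p-value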
This is a helper object of type CVST.setup containing all necessary parameters for a CVST run.
constructCVSTModel(steps = 10, beta = 0.1, alpha = 0.01,
                   similaritySignificance = 0.05,
                   earlyStoppingSignificance = 0.05,
                   earlyStoppingWindow = 3,
                   regressionSimilarityViaOutliers = FALSE)
steps: Number of steps CVST should run.
beta: Significance level for H0.
alpha: Significance level for H1.
similaritySignificance: Significance level of the similarity test.
earlyStoppingSignificance: Significance level of the early stopping test.
earlyStoppingWindow: Size of the early stopping window.
regressionSimilarityViaOutliers: Should the less strict outlier-based similarity measure be used for regression tasks?
A CVST.setup object suitable for fastCV.
Tammo Krueger <[email protected]>
Tammo Krueger, Danny Panknin, and Mikio Braun. Fast cross-validation via sequential testing. Journal of Machine Learning Research 16 (2015) 1103-1155. URL https://jmlr.org/papers/volume16/krueger15a/krueger15a.pdf.
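A minimal usage sketch with only the documented arguments: a more conservative setup with twice the default number of steps and stricter early stopping, which can then be passed to fastCV as in the examples below.

setup = constructCVSTModel(steps=20, earlyStoppingSignificance=0.01)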
Construction and Handling of CVST.data Objects
The CVST methods need a structured interface to both regression and classification data sets. These helper methods allow the construction and consistent handling of these types of data sets.
constructData(x, y)
getN(data)
getSubset(data, subset)
getX(data, subset = NULL)
shuffleData(data)
isClassification(data)
isRegression(data)
x: The feature data as vector or matrix.
y: The observed values (regressands/labels) as list, vector or factor.
data: A CVST.data object.
subset: An index set.
constructData returns a CVST.data object. getN returns the number of data points in the data set. getSubset returns a subset of the data as a CVST.data object, while getX just returns the feature data. shuffleData returns a randomly shuffled instance of the data.
Tammo Krueger <[email protected]>
nsine = noisySine(10)
isClassification(nsine)
isRegression(nsine)
getN(nsine)
getX(nsine)
nsineShuffled = shuffleData(nsine)
getX(nsineShuffled)
getSubset(nsineShuffled, 1:3)
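constructData can also wrap raw vectors directly. A minimal sketch for a hand-made regression set (assuming, as the examples above suggest, that a numeric y marks the task as regression and a factor y as classification):

x = as.matrix(seq(0, 2*pi, length.out=50))
y = sin(x[,1]) + rnorm(50, sd=0.1)
d = constructData(x, y)
isRegression(d)  # TRUE for numeric y under this assumption
getN(d)          # 50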
These methods construct a CVST.learner object suitable for the CVST method. These objects provide the common interface needed for the CV and fastCV methods. We provide kernel logistic regression, kernel ridge regression, support vector machines and support vector regression as fully functional implementation templates.
constructLearner(learn, predict)
constructKlogRegLearner()
constructKRRLearner()
constructSVMLearner()
constructSVRLearner()
learn: The learning method, which takes a CVST.data object and a list of parameters and returns the learned model.
predict: The prediction method, which takes a model and new data as a CVST.data object and returns the predicted values.
The nu-SVM and nu-SVR are built on top of the corresponding implementations of the kernlab package (see reference). In the list of parameters these implementations expect an entry named kernel, which gives the name of the kernel that should be used, an entry named nu specifying the nu parameter, and, for the nu-SVR, an entry named C giving the C parameter.

The KRR and KLR also expect kernel and any other parameters necessary to construct the kernel. Both methods expect a lambda parameter, and KLR additionally tol and maxiter parameters in the parameter list.
Note that the lambda of KRR/KLR and the C parameter of SVR are scaled by the data set size to allow for comparable results in the fast CV loop.
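constructLearner also accepts user-supplied functions, so learners beyond the four bundled templates can take part in CV and fastCV. A hedged sketch (this trivial constant-mean regressor is purely illustrative and not part of the package; the signatures follow the argument description above):

meanLearner = constructLearner(
  learn = function(data, params) list(mu = mean(data$y)),     # fit: store the mean of y
  predict = function(model, newData) rep(model$mu, getN(newData))  # predict the stored mean
)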
Returns a learner of type CVST.learner suitable for CV and fastCV.
Tammo Krueger <[email protected]>
Alexandros Karatzoglou, Alexandros Smola, Kurt Hornik, Achim Zeileis. kernlab - An S4 Package for Kernel Methods in R. Journal of Statistical Software, 11(9), 2004. DOI: 10.18637/jss.v011.i09.
Volker Roth. Probabilistic discriminative kernel classifiers for multi-class problems. In Proceedings of the 23rd DAGM-Symposium on Pattern Recognition, pages 246–253, 2001.
# SVM
ns = noisySine(100)
svm = constructSVMLearner()
p = list(kernel="rbfdot", sigma=100, nu=.1)
m = svm$learn(ns, p)
nsTest = noisySine(1000)
pred = svm$predict(m, nsTest)
sum(pred != nsTest$y) / getN(nsTest)

# Kernel logistic regression
klr = constructKlogRegLearner()
p = list(kernel="rbfdot", sigma=100, lambda=.1/getN(ns), tol=10e-6, maxiter=100)
m = klr$learn(ns, p)
pred = klr$predict(m, nsTest)
sum(pred != nsTest$y) / getN(nsTest)

# SVR
ns = noisySinc(100)
svr = constructSVRLearner()
p = list(kernel="rbfdot", sigma=100, nu=.1, C=1*getN(ns))
m = svr$learn(ns, p)
nsTest = noisySinc(1000)
pred = svr$predict(m, nsTest)
sum((pred - nsTest$y)^2) / getN(nsTest)

# Kernel ridge regression
krr = constructKRRLearner()
p = list(kernel="rbfdot", sigma=100, lambda=.1/getN(ns))
m = krr$learn(ns, p)
pred = krr$predict(m, nsTest)
sum((pred - nsTest$y)^2) / getN(nsTest)
This is a helper function which, given a named list of parameter choices, expands the complete grid and returns a CVST.params object suitable for CV and fastCV.
constructParams(...)
...: The parameters that should be expanded.
Returns a CVST.params object, which is basically a named list of possible parameter values.
Tammo Krueger <[email protected]>
params = constructParams(kernel="rbfdot", sigma=10^(-1:5), nu=c(0.1, 0.2))
# the expanded grid contains 14 parameter lists:
length(params)
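Since the grid is expanded over every named argument, supplying several values per argument multiplies the number of candidates. A sketch (assuming, as the description states, a full cross product over all arguments):

params2 = constructParams(kernel=c("rbfdot", "laplacedot"),
                          sigma=10^(-1:5), nu=c(0.1, 0.2))
length(params2)  # 2 kernels * 7 sigmas * 2 nus = 28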
These functions handle the construction of and calculation with sequential tests as introduced by Wald (1947). getCVSTTest constructs the special sequential test used in the CVST procedure (see Krueger et al., 2015). testSequence tests whether a sequence of 0/1 values is distributed according to H0 or H1.
constructSequentialTest(piH0 = 0.5, piH1 = 0.9, beta, alpha)
getCVSTTest(steps, beta = 0.1, alpha = 0.01)
testSequence(st, s)
plotSequence(st, s)
piH0: Probability of the binomial distribution for H0.
piH1: Probability of the binomial distribution for H1.
beta: Significance level for H0.
alpha: Significance level for H1.
steps: Number of steps the CVST procedure should be executed.
st: A sequential test of type CVST.sequentialTest.
s: A sequence of 0/1 values.
constructSequentialTest and getCVSTTest return a CVST.sequentialTest with the specified properties. testSequence returns 1 if H1 can be expected, -1 if H0 can be accepted, and 0 if the test needs more data for a decision. plotSequence gives a graphical impression of this testing procedure.
Tammo Krueger <[email protected]>
Abraham Wald. Sequential Analysis. Wiley, 1947.
Tammo Krueger, Danny Panknin, and Mikio Braun. Fast cross-validation via sequential testing. Journal of Machine Learning Research 16 (2015) 1103-1155. URL https://jmlr.org/papers/volume16/krueger15a/krueger15a.pdf.
st = getCVSTTest(10)
s = rbinom(10, 1, .5)
plotSequence(st, s)
testSequence(st, s)
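constructSequentialTest can also be used on its own for a plain Wald test. A minimal sketch with the documented arguments (the chosen probabilities and significance levels are illustrative only):

wald = constructSequentialTest(piH0=0.5, piH1=0.9, beta=0.1, alpha=0.01)
s = rbinom(20, 1, 0.9)        # a sequence that should favour H1
testSequence(wald, s)         # 1: H1 expected, -1: H0 accepted, 0: undecided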
Performs the usual k-fold cross-validation procedure on a given data set, parameter grid and learner.
CV(data, learner, params, fold = 5, verbose = TRUE)
data: The data set as a CVST.data object.
learner: The learner as a CVST.learner object.
params: The parameter grid as a CVST.params object.
fold: The number of folds that should be generated for each set of parameters.
verbose: Should the procedure report the performance for each model?
Returns the optimal parameter settings as determined by k-fold cross-validation.
Tammo Krueger <[email protected]>
M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society. Series B, 36(2):111–147, 1974.
Sylvain Arlot and Alain Celisse. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79, 2010.
See also: fastCV, constructData, constructLearner, constructParams
ns = noisySine(100)
svm = constructSVMLearner()
params = constructParams(kernel="rbfdot", sigma=10^(-3:3),
                         nu=c(0.05, 0.1, 0.2, 0.3))
opt = CV(ns, svm, params)
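The fold and verbose arguments follow the usage above. A sketch of a quieter 10-fold run on the same grid:

opt10 = CV(ns, svm, params, fold=10, verbose=FALSE)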
CVST is an improved cross-validation procedure which uses non-parametric testing coupled with sequential analysis to determine the best parameter set on linearly increasing subsets of the data. By eliminating underperforming candidates quickly and keeping promising candidates as long as possible, the method speeds up the computation while preserving the capability of a full cross-validation.
fastCV(train, learner, params, setup, test = NULL, verbose = TRUE)
train: The training data set as a CVST.data object.
learner: The learner as a CVST.learner object.
params: The parameter grid as a CVST.params object.
setup: A CVST.setup object as returned by constructCVSTModel.
test: An independent test set that should be used at each step. If NULL, the remaining data not used for learning at each step are used instead.
verbose: Should the procedure report the performance after each step?
Returns the optimal parameter settings as determined by fast cross-validation via sequential testing.
Tammo Krueger <[email protected]>
Tammo Krueger, Danny Panknin, and Mikio Braun. Fast cross-validation via sequential testing. Journal of Machine Learning Research 16 (2015) 1103-1155. URL https://jmlr.org/papers/volume16/krueger15a/krueger15a.pdf.
See also: CV, constructCVSTModel, constructData, constructLearner, constructParams
ns = noisySine(100)
svm = constructSVMLearner()
params = constructParams(kernel="rbfdot", sigma=10^(-3:3),
                         nu=c(0.05, 0.1, 0.2, 0.3))
opt = fastCV(ns, svm, params, constructCVSTModel())
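A sketch of supplying an independent test set via the test argument (a fresh noisySine sample serves as the test set here, since the toy data are drawn i.i.d.):

nsTest = noisySine(500)
opt = fastCV(ns, svm, params, constructCVSTModel(), test=nsTest)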
This function generates noisy variants of the toy signals introduced by Donoho (see the references). The scaling is chosen to reflect the setting discussed in the original paper.
noisyDonoho(n, fun = doppler, sigma = 1)
blocks(x, scale = 3.656993)
bumps(x, scale = 10.52884)
doppler(x, scale = 24.22172)
heavisine(x, scale = 2.356934)
n: Number of data points that should be generated.
fun: Function used to generate the data.
sigma: Standard deviation of the noise component.
x: The points at which the function is evaluated.
scale: Scaling parameter.
Returns a data set of type CVST.data.
Tammo Krueger <[email protected]>
David L. Donoho and Iain M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994.
bumpsSet = noisyDonoho(1000, fun=bumps)
plot(bumpsSet)
dopplerSet = noisyDonoho(1000, fun=doppler)
plot(dopplerSet)
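The underlying signal functions can also be evaluated directly, e.g. to plot a noise-free reference curve (assuming, as usual for these signals, the domain [0, 1]):

x = seq(0, 1, length.out=500)
plot(x, doppler(x), type="l")  # noise-free doppler signal at default scale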
Regression and classification toy data sets based on the sine and sinc functions.
noisySine(n, dim = 5, sigma = 0.25)
noisySinc(n, dim = 2, sigma = 0.1)
n: Number of data points that should be generated.
dim: Intrinsic dimensionality of the data set (see references for details).
sigma: Standard deviation of the noise component.
Returns a data set of type CVST.data.
Tammo Krueger <[email protected]>
Tammo Krueger, Danny Panknin, and Mikio Braun. Fast cross-validation via sequential testing. Journal of Machine Learning Research 16 (2015) 1103-1155. URL https://jmlr.org/papers/volume16/krueger15a/krueger15a.pdf.
nsine = noisySine(1000)
plot(nsine, col=nsine$y)
nsinc = noisySinc(1000)
plot(nsinc)