Title: Estimate Vapnik-Chervonenkis Dimension and Sample Complexity
Description: We provide a suite of tools for estimating the sample complexity of a chosen model through theoretical bounds and simulation. The package incorporates methods for estimating the Vapnik-Chervonenkis dimension (VCD) of a chosen algorithm, which can be used to estimate its sample complexity. Alternatively, we provide simulation methods to estimate sample complexity directly. For more details, see Carter, P & Choi, D (2024). "Learning from Noise: Applying Sample Complexity for Political Science Research" <doi:10.31219/osf.io/evrcj>.
Authors: Perry Carter [aut, cre], Dahyun Choi [aut]
Maintainer: Perry Carter <[email protected]>
License: MIT + file LICENSE
Version: 0.3.0
Built: 2024-12-18 23:46:46 UTC
Source: CRAN
Utility function to generate accuracy metrics, for use with estimate_accuracy()
acc_sim(n, method = "Uniform", p = NULL, ...)
n |
An integer giving the desired sample size for which the target function is to be calculated. |
method |
An optional string stating the distribution from which data is to be generated. Default is i.i.d. uniform sampling. Currently also supports "Class Imbalance". Can also take a function outputting a vector of probabilities if the user wishes to specify a custom distribution. |
p |
If method is 'Class Imbalance', gives the degree of weight placed on the positive class. |
... |
Additional model parameters to be specified by the user. |
A data frame giving performance metrics for the specified sample size.
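As a hedged illustration only: acc_sim() is the per-sample-size worker behind estimate_accuracy(), so in a direct call the model setup presumably has to be supplied through .... The argument names below (model, dim, predictfn) are assumptions borrowed from the other functions on this page, not confirmed for acc_sim(); mylogit() and mypred() are the helper functions defined in the estimate_accuracy() example further down.

## Not run: illustrative sketch; the names passed through '...' are assumed
metrics <- acc_sim(200, method = "Class Imbalance", p = 0.75,
                   model = mylogit, dim = 7, predictfn = mypred)
head(metrics)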
Replication data for 'Predicting Recidivism'
br
An object of class data.frame with 7214 rows and 14 columns.
Julia Dressel and Hany Farid
https://www.science.org/doi/full/10.1126/sciadv.aao5580
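A quick structural check of the data before fitting; this assumes the package is attached so that br is lazy-loaded, and uses the binary outcome two_year_recid named in the examples below.

str(br)                    # 7214 observations of 14 variables
table(br$two_year_recid)   # distribution of the binary outcome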
Estimate sample complexity bounds for a binary classification algorithm using either simulated or user-supplied data.
estimate_accuracy(
  formula,
  model,
  data = NULL,
  dim = NULL,
  maxn = NULL,
  upperlimit = NULL,
  nsample = 30,
  steps = 50,
  eta = 0.05,
  delta = 0.05,
  epsilon = 0.05,
  predictfn = NULL,
  power = FALSE,
  effect_size = NULL,
  powersims = NULL,
  alpha = 0.05,
  parallel = TRUE,
  coreoffset = 0,
  packages = list(),
  method = c("Uniform", "Class Imbalance"),
  p = NULL,
  minn = ifelse(is.null(data), (dim + 1), (ncol(data) + 1)),
  x = NULL,
  y = NULL,
  ...
)
formula |
A formula specifying the model to be estimated, with the binary outcome on the left-hand side and the predictors on the right. |
model |
A binary classification model supplied by the user. Must take arguments formula and data and return a fitted model object, as in the mylogit() example below. |
data |
Optional. A rectangular data.frame or matrix-like object containing the outcome and predictors named in formula. If omitted, data are simulated using dim and maxn. |
dim |
Required if data is not supplied. A positive integer giving the horizontal dimension (number of predictor variables) of the data to be simulated. |
maxn |
Required if data is not supplied. A positive integer giving the maximum sample size (number of observations) to be simulated. |
upperlimit |
Optional. A positive integer giving the maximum sample size to be simulated, if data was supplied. |
nsample |
A positive integer giving the number of samples to be generated for each value of $n$. Larger values give more accurate results. |
steps |
A positive integer giving the number of values of $n$ for which simulations should be conducted. Larger values give more accurate results. |
eta |
A real number between 0 and 1 giving the probability of misclassification error in the training data. |
delta |
A real number between 0 and 1 giving the targeted maximum probability of observing an OOS error rate higher than epsilon. |
epsilon |
A real number between 0 and 1 giving the targeted maximum out-of-sample (OOS) error rate |
predictfn |
An optional user-defined function giving a custom predict method. If also using a user-defined model, the function must take a fitted model as its first argument and a data frame of new observations as its second, and must return a factor of predicted classes, as in the mypred() example below. |
power |
A logical indicating whether experimental power based on the predictions should also be reported |
effect_size |
Required if power is TRUE. A real number giving the hypothesized effect size for which experimental power is to be calculated. |
powersims |
Required if power is TRUE. A positive integer giving the number of simulations to be used in estimating power. |
alpha |
Required if power is TRUE. A real number between 0 and 1 giving the significance level to be used in the power calculation. Defaults to 0.05. |
parallel |
Boolean indicating whether or not to use parallel processing. |
coreoffset |
If parallel is TRUE, a positive integer giving the number of cores to be withheld from the parallel cluster. Defaults to 0. |
packages |
A list of packages that need to be loaded in order to run model. |
method |
An optional string stating the distribution from which data is to be generated. Default is i.i.d. uniform sampling. Can also take a function outputting a vector of probabilities if the user wishes to specify a custom distribution. |
p |
If method is 'Class Imbalance', gives the degree of weight placed on the positive class. |
minn |
Optional argument to set a different minimum n than the dimension of the algorithm. Useful with e.g. regularized regression models such as elastic net. |
x |
Optional argument for methods that take separate predictor and outcome data. Specifies a matrix-like object containing predictors. Note that if used, the x and y objects are bound together columnwise; this must be handled in the user-supplied helper function. |
y |
Optional argument for methods that take separate predictor and outcome data. Specifies a vector-like object containing outcome values. Note that if used, the x and y objects are bound together columnwise; this must be handled in the user-supplied helper function. |
... |
Additional arguments that need to be passed to model. |
A list containing two named elements. Raw gives the exact output of the simulations, while Summary gives a table of accuracy metrics, including the achieved levels of epsilon and delta given the specified values. Alternative values can be calculated using getpac().

plot_accuracy(), to represent simulations visually; getpac(), to calculate summaries for alternate values of epsilon and delta without conducting a new simulation; and gendata(), to generate synthetic datasets.
mylogit <- function(formula, data){
  m <- structure(
    glm(formula = formula, data = data, family = binomial(link = "logit")),
    class = c("svrclass", "glm")  # IMPORTANT - must use the class svrclass to work correctly
  )
  return(m)
}
mypred <- function(m, newdata){
  out <- predict.glm(m, newdata, type = "response")
  # Important - must specify levels to account for the possibility of all
  # observations being classified into the same class in smaller samples
  out <- factor(ifelse(out > 0.5, 1, 0), levels = c("0", "1"))
  return(out)
}
library(parallel)
results <- estimate_accuracy(
  two_year_recid ~ race + sex + age + juv_fel_count + juv_misd_count +
    priors_count + charge_degree..misd.fel.,
  mylogit, br,
  predictfn = mypred,
  nsample = 10,
  steps = 10,
  coreoffset = (detectCores() - 2)
)
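Continuing from the example above, the returned list can be inspected directly or handed to the other functions in the package:

print(results$Summary)                               # table of accuracy metrics
getpac(results, epsilon = 0.1, delta = 0.1)$Summary  # recompute bounds for new targets
plot_accuracy(results)                               # visualise the simulations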
Simulate data with appropriate structure to be used in estimating sample complexity bounds
gendata(model, dim, maxn, predictfn = NULL, varnames = NULL, ...)
model |
A binary classification model supplied by the user. Must take arguments |
dim |
Gives the horizontal dimension of the data (number of predictor variables) to be generated. |
maxn |
Gives the vertical dimension of the data (number of observations) to be generated. |
predictfn |
An optional user-defined function giving a custom predict method. If also using a user-defined model, the |
varnames |
An optional character vector giving the names of variables to be used for the generated data |
... |
Additional arguments that need to be passed to |
A data.frame
containing the simulated data.
estimate_accuracy()
, to estimate sample complexity bounds given the generated data
mylogit <- function(formula, data){
  m <- structure(
    glm(formula = formula, data = data, family = binomial(link = "logit")),
    class = c("svrclass", "glm")  # IMPORTANT - must use the class svrclass to work correctly
  )
  return(m)
}
mypred <- function(m, newdata){
  out <- predict.glm(m, newdata, type = "response")
  # Important - must specify levels to account for the possibility of all
  # observations being classified into the same class in smaller samples
  out <- factor(ifelse(out > 0.5, 1, 0), levels = c("0", "1"))
  return(out)
}
formula <- two_year_recid ~ race + sex + age + juv_fel_count + juv_misd_count +
  priors_count + charge_degree..misd.fel.
dat <- gendata(mylogit, 7, 7214, mypred, all.vars(formula))
library(parallel)
results <- estimate_accuracy(formula, mylogit, dat,
                             predictfn = mypred,
                             nsample = 10, steps = 10,
                             coreoffset = (detectCores() - 2))
Recalculate achieved sample complexity bounds given different parameter inputs
getpac(table, epsilon = 0.05, delta = 0.05)
table |
A list containing an element named Raw, as returned by estimate_accuracy(). |
epsilon |
A real number between 0 and 1 giving the targeted maximum out-of-sample (OOS) error rate |
delta |
A real number between 0 and 1 giving the targeted maximum probability of observing an OOS error rate higher than epsilon. |
A list containing two named elements. Raw gives the exact output of the simulations, while Summary gives a table of accuracy metrics, including the achieved levels of epsilon and delta given the specified values. Alternative values can be calculated by calling getpac() again.

plot_accuracy(), to represent simulations visually; getpac(), to calculate summaries for alternate values of epsilon and delta without conducting a new simulation; and gendata(), to generate synthetic datasets.
mylogit <- function(formula, data){
  m <- structure(
    glm(formula = formula, data = data, family = binomial(link = "logit")),
    class = c("svrclass", "glm")  # IMPORTANT - must use the class svrclass to work correctly
  )
  return(m)
}
mypred <- function(m, newdata){
  out <- predict.glm(m, newdata, type = "response")
  # Important - must specify levels to account for the possibility of all
  # observations being classified into the same class in smaller samples
  out <- factor(ifelse(out > 0.5, 1, 0), levels = c("0", "1"))
  return(out)
}
library(parallel)
results <- estimate_accuracy(
  two_year_recid ~ race + sex + age + juv_fel_count + juv_misd_count +
    priors_count + charge_degree..misd.fel.,
  mylogit, br,
  predictfn = mypred, nsample = 10, steps = 10,
  coreoffset = (detectCores() - 2)
)
resultsalt <- getpac(results, epsilon = 0.5, delta = 0.3)
print(resultsalt$Summary)
Utility function to define the least-squares loss function to be optimized for simvcd()
loss(h, ngrid, xi, a = 0.16, a1 = 1.2, a11 = 0.14927)
h |
A positive real number giving the current guess at VC dimension |
ngrid |
Vector of sample sizes for which the bounding function is estimated. |
xi |
Vector of estimated values of the bounding function, usually obtained from |
a |
Scaling coefficient for the bounding function. Defaults to the value given by Vapnik, Levin and Le Cun 1994. |
a1 |
Scaling coefficient for the bounding function. Defaults to the value given by Vapnik, Levin and Le Cun 1994. |
a11 |
Scaling coefficient for the bounding function. Defaults to the value given by Vapnik, Levin and Le Cun 1994. |
A real number giving the estimated value of the MSE given the current guess.
simvcd(), the user-facing function for simulating VC dimension, and risk_bounds(), to generate estimates for xi.
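A minimal sketch of how loss() can be minimised to recover a VC dimension estimate using optimize(); the ngrid and xi values below are illustrative placeholders rather than output from risk_bounds(), and simvcd() performs this optimisation for you in practice.

ngrid <- seq(100, 1000, by = 100)            # design points (sample sizes)
xi <- c(0.30, 0.22, 0.18, 0.15, 0.13,        # placeholder bounding-function values;
        0.12, 0.11, 0.10, 0.095, 0.09)       # in practice these come from risk_bounds()
# Least-squares fit: the value of h minimising loss() is the VCD estimate
fit <- optimize(loss, interval = c(1, 100), ngrid = ngrid, xi = xi)
fit$minimum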
Represent simulated sample complexity bounds graphically
plot_accuracy(
  table,
  metrics = c("Accuracy", "Precision", "Recall", "Fscore", "Delta", "Epsilon", "Power"),
  plottype = c("ggplot", "plotly"),
  letters = c("greek", "latin")
)
table |
A list containing an element named Raw, as returned by estimate_accuracy(). |
metrics |
A character vector containing the metrics to display in the plot. Can be any of "Accuracy", "Precision", "Recall", "Fscore", "Delta", "Epsilon", or "Power". |
plottype |
A string giving the graphics package to be used to generate the plot. Can be one of "ggplot" or "plotly" |
letters |
A string determining whether delta and epsilon should be given as greek letters in the plot legend. Defaults to Greek lettering but available in case of rendering issues. |
Either a ggplot or plot_ly plot object, depending on the chosen option of plottype.
estimate_accuracy()
, to generate estimated sample complexity bounds.
mylogit <- function(formula, data){
  m <- structure(
    glm(formula = formula, data = data, family = binomial(link = "logit")),
    class = c("svrclass", "glm")  # IMPORTANT - must use the class svrclass to work correctly
  )
  return(m)
}
mypred <- function(m, newdata){
  out <- predict.glm(m, newdata, type = "response")
  # Important - must specify levels to account for the possibility of all
  # observations being classified into the same class in smaller samples
  out <- factor(ifelse(out > 0.5, 1, 0), levels = c("0", "1"))
  return(out)
}
library(parallel)
results <- estimate_accuracy(
  two_year_recid ~ race + sex + age + juv_fel_count + juv_misd_count +
    priors_count + charge_degree..misd.fel.,
  mylogit, br,
  predictfn = mypred, nsample = 10, steps = 10,
  coreoffset = (detectCores() - 2)
)
fig <- plot_accuracy(results, letters = "latin")
fig
Utility function to generate data points for estimation of the VC Dimension of a user-specified binary classification algorithm given a specified sample size.
risk_bounds(x, ...)
x |
An integer giving the desired sample size for which the target function is to be approximated. |
... |
Additional model parameters to be specified by the user. |
A real number giving the estimated value of Xi(n), the bounding function
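A hedged sketch of calling risk_bounds() directly to build the xi vector used by loss() (see the sketch under the loss() entry above); the argument names forwarded through ... are assumptions modelled on simvcd()'s interface, and mylogit()/mypred() are the helper functions defined in the other examples on this page.

## Not run: argument names passed through '...' are assumed, not confirmed
ngrid <- c(50, 100, 200, 400)                      # sample sizes to evaluate
xi <- sapply(ngrid, risk_bounds,                   # one Xi(n) estimate per sample size
             model = mylogit, dim = 7, predictfn = mypred)
optimize(loss, interval = c(1, 100), ngrid = ngrid, xi = xi)$minimum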
Calculate sample complexity bounds for a classifier given target accuracy
scb(vcd = NULL, epsilon = NULL, delta = NULL, eta = NULL, theor = TRUE, ...)
vcd |
The Vapnik-Chervonenkis dimension (VCD) of the chosen classifier. If theor is FALSE, this may be left NULL and the VCD is instead estimated by simulation via simvcd(). |
epsilon |
A real number between 0 and 1 giving the targeted maximum out-of-sample (OOS) error rate |
delta |
A real number between 0 and 1 giving the targeted maximum probability of observing an OOS error rate higher than epsilon. |
eta |
A real number between 0 and 1 giving the probability of misclassification error in the training data. |
theor |
A Boolean indicating whether the theoretical VCD is to be used. If FALSE, the VCD is instead estimated by simulation via simvcd(), and the necessary arguments must be supplied through .... |
... |
Arguments to be passed to simvcd() if theor is FALSE. |
A real number giving the sample complexity bound for the specified parameters.
simvcd()
, to calculate VCD for a chosen model
mylogit <- function(formula, data){
  m <- structure(
    glm(formula = formula, data = data, family = binomial(link = "logit")),
    class = c("svrclass", "glm")  # IMPORTANT - must use the class svrclass to work correctly
  )
  return(m)
}
mypred <- function(m, newdata){
  out <- predict.glm(m, newdata, type = "response")
  # Important - must specify levels to account for the possibility of all
  # observations being classified into the same class in smaller samples
  out <- factor(ifelse(out > 0.5, 1, 0), levels = c("0", "1"))
  return(out)
}
library(parallel)
scb(epsilon = 0.05, delta = 0.05, eta = 0.05, theor = FALSE,
    model = mylogit, dim = 7, m = 10, k = 10, maxn = 50,
    predictfn = mypred, coreoffset = (detectCores() - 2))
vcd <- 7
scb(vcd, epsilon = 0.05, delta = 0.05, eta = 0.05)
Estimate the Vapnik-Chervonenkis (VC) dimension of an arbitrary binary classification algorithm.
simvcd(
  model,
  dim,
  packages = list(),
  m = 1000,
  k = 1000,
  maxn = 5000,
  parallel = TRUE,
  coreoffset = 0,
  predictfn = NULL,
  a = 0.16,
  a1 = 1.2,
  a11 = 0.14927,
  ...
)
model |
A binary classification model supplied by the user. Must take arguments formula and data and return a fitted model object, as in the mylogit() example below. |
dim |
A positive integer giving dimension (number of input features) of the model. |
packages |
A list of packages that need to be loaded in order to run model. |
m |
A positive integer giving the number of simulations to be performed at each design point (sample size value). Higher values give more accurate results but increase computation time. |
k |
A positive integer giving the number of design points (sample size values) for which the bounding function is to be estimated. Higher values give more accurate results but increase computation time. |
maxn |
Gives the vertical dimension of the data (number of observations) to be generated. |
parallel |
Boolean indicating whether or not to use parallel processing. |
coreoffset |
If parallel is TRUE, a positive integer giving the number of cores to be withheld from the parallel cluster. Defaults to 0. |
predictfn |
An optional user-defined function giving a custom predict method. If also using a user-defined model, the function must take a fitted model as its first argument and a data frame of new observations as its second, and must return a factor of predicted classes, as in the mypred() example below. |
a |
Scaling coefficient for the bounding function. Defaults to the value given by Vapnik, Levin and Le Cun 1994. |
a1 |
Scaling coefficient for the bounding function. Defaults to the value given by Vapnik, Levin and Le Cun 1994. |
a11 |
Scaling coefficient for the bounding function. Defaults to the value given by Vapnik, Levin and Le Cun 1994. |
... |
Additional arguments that need to be passed to model. |
A real number giving the estimated value of the VC dimension of the supplied model.
scb()
, to calculate sample complexity bounds given estimated VCD.
mylogit <- function(formula, data){
  m <- structure(
    glm(formula = formula, data = data, family = binomial(link = "logit")),
    class = c("svrclass", "glm")  # IMPORTANT - must use the class svrclass to work correctly
  )
  return(m)
}
mypred <- function(m, newdata){
  out <- predict.glm(m, newdata, type = "response")
  # Important - must specify levels to account for the possibility of all
  # observations being classified into the same class in smaller samples
  out <- factor(ifelse(out > 0.5, 1, 0), levels = c("0", "1"))
  return(out)
}
library(parallel)
vcd <- simvcd(model = mylogit, dim = 7, m = 10, k = 10, maxn = 50,
              predictfn = mypred, coreoffset = (detectCores() - 2))