Title: | Data Driving Multiple Classifier System |
---|---|
Description: | Provides a novel framework able to automatically develop and deploy an accurate Multiple Classifier System based on the feature-clustering distribution obtained from an input dataset. 'D2MCS' was developed with a focus on four main aspects: (i) the ability to determine an effective method to evaluate the independence of features, (ii) the identification of the optimal number of feature clusters, (iii) the training and tuning of ML models and (iv) the execution of voting schemes to combine the outputs of each classifier comprising the Multiple Classifier System. |
Authors: | David Ruano-Ordás [aut, ctb], Miguel Ferreiro-Díaz [aut, cre], José Ramón Méndez [aut, ctb], University of Vigo [cph] |
Maintainer: | Miguel Ferreiro-Díaz <[email protected]> |
License: | GPL-3 |
Version: | 1.0.1 |
Built: | 2024-11-20 06:59:47 UTC |
Source: | CRAN |
Computes the ratio of number of correct predictions to the total number of input samples.
D2MCS::MeasureFunction
-> Accuracy
new()
Method for initializing the object arguments during runtime.
Accuracy$new(performance.output = NULL)
performance.output
An optional ConfMatrix used as the basis to compute the performance.
compute()
The function computes the Accuracy achieved by the M.L. model.
Accuracy$compute(performance.output = NULL)
performance.output
An optional ConfMatrix parameter defining the type of object used as the basis to compute the Accuracy measure.
This function is automatically invoked by the ClassificationOutput object.
A numeric vector of size 1 or NULL if an error occurred.
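The ratio described above can be sketched in base R. This is a minimal illustration, not the package's implementation: `accuracy_from_counts` is a hypothetical helper, and returning NULL for an empty matrix is only an assumption that mirrors the documented return value.

```r
# Base-R sketch: accuracy from the four cells of a binary confusion matrix.
# `accuracy_from_counts` is a hypothetical helper, not a D2MCS function.
accuracy_from_counts <- function(tp, tn, fp, fn) {
  total <- tp + tn + fp + fn
  if (total == 0) return(NULL)  # assumption: no samples is treated as an error
  (tp + tn) / total             # correct predictions / all predictions
}

acc <- accuracy_from_counts(tp = 40, tn = 45, fp = 5, fn = 10)
print(acc)
```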
clone()
The objects of this class are cloneable with this method.
Accuracy$clone(deep = FALSE)
deep
Whether to make a deep clone.
MeasureFunction, ClassificationOutput, ConfMatrix.
The BinaryPlot implements a basic plot for bi-class problems.
D2MCS::GenericPlot
-> BinaryPlot
new()
Empty function used to initialize the object arguments in runtime.
BinaryPlot$new()
plot()
Plots feature-clustering data from a bi-class problem.
BinaryPlot$plot(summary)
summary
A data.frame comprising the elements to be plotted.
clone()
The objects of this class are cloneable with this method.
BinaryPlot$clone(deep = FALSE)
deep
Whether to make a deep clone.
Performs feature-clustering based on ChiSquare method.
D2MCS::GenericHeuristic
-> ChiSquareHeuristic
new()
Empty function used to initialize the object arguments in runtime.
ChiSquareHeuristic$new()
heuristic()
Function responsible for performing the ChiSquare feature-clustering operation.
ChiSquareHeuristic$heuristic(col1, col2, column.names = NULL)
A numeric vector of length 1 or NA if an error occurs.
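A chi-squared score between two columns can be sketched with base R as follows. The helper name `chi_square_score` is hypothetical, and the package may compute or scale the statistic differently; the sketch only shows the general idea of returning a single number, or NA when the test fails.

```r
# Base-R sketch: chi-squared statistic between two categorical columns.
# `chi_square_score` is a hypothetical helper, not the D2MCS implementation.
chi_square_score <- function(col1, col2) {
  tryCatch({
    tbl <- table(col1, col2)                       # contingency table
    test <- suppressWarnings(stats::chisq.test(tbl, correct = FALSE))
    unname(test$statistic)                         # numeric vector of length 1
  }, error = function(e) NA_real_)                 # NA if the test cannot run
}

feature <- rep(c("a", "b"), times = c(50, 50))
target  <- c(rep("yes", 40), rep("no", 10), rep("yes", 10), rep("no", 40))
score <- chi_square_score(feature, target)
print(score)
```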
clone()
The objects of this class are cloneable with this method.
ChiSquareHeuristic$clone(deep = FALSE)
deep
Whether to make a deep clone.
Allows computing the classification performance values achieved by D2MCS. The class is automatically created when the D2MCS classification method is invoked.
new()
Method for initializing the object arguments during runtime.
ClassificationOutput$new(voting.schemes, models)
voting.schemes
A list containing the voting schemes used (inherited from VotingStrategy).
models
A list containing the Model objects used during the classification stage.
getMetrics()
The function returns the measures used during training stage.
ClassificationOutput$getMetrics()
A character vector or NULL if training was not performed.
getPositiveClass()
The function gets the name of the positive class used for training/classification.
ClassificationOutput$getPositiveClass()
A character vector of size 1.
getModelInfo()
The function compiles all the information concerning the M.L. models used during training/classification.
ClassificationOutput$getModelInfo(metrics = NULL)
metrics
A character vector defining the metrics used during training/classification.
A list with the information of each M.L. model.
getPerformances()
The function is used to compute the performance of D2MCS.
ClassificationOutput$getPerformances( test.set, measures, voting.names = NULL, metric.names = NULL, cutoff.values = NULL )
test.set
A Subset object used to compute the performance.
measures
A character vector with the measures to be used to compute the performance value (inherited from MeasureFunction).
voting.names
A character vector with the name of the voting schemes to analyze the performance. If not defined, all the voting schemes used during classification stage will be taken into account.
metric.names
A character containing the measures used during training stage. If not defined, all training metrics used during classification will be taken into account.
cutoff.values
A character vector defining the minimum probability used to perform a positive classification. If not defined, all cutoffs used during the classification stage will be taken into account.
dir.path
A character vector with location where the plot will be saved.
A list of performance values.
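How a cutoff value turns probabilities into classes, and how measures such as PPV and MCC follow from the result, can be sketched in base R. The data are made up for illustration, and the use of `>=` at the cutoff is an assumption; the package may treat the boundary differently.

```r
# Base-R sketch (not D2MCS code): apply a probability cutoff, then score it.
probs  <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.6, 0.1)  # P(positive) per instance
actual <- c("1", "1", "0", "1", "0", "0", "1", "0")  # true labels, "1" = positive
cutoff <- 0.5

# Assumption: an instance is positive when its probability reaches the cutoff.
predicted <- ifelse(probs >= cutoff, "1", "0")

tp <- sum(predicted == "1" & actual == "1")
tn <- sum(predicted == "0" & actual == "0")
fp <- sum(predicted == "1" & actual == "0")
fn <- sum(predicted == "0" & actual == "1")

ppv <- tp / (tp + fp)                                   # positive predictive value
mcc <- (tp * tn - fp * fn) /
  sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))   # Matthews correlation
```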
savePerformances()
The function is used to save the computed performance values into a CSV file.
ClassificationOutput$savePerformances( dir.path, test.set, measures, voting.names = NULL, metric.names = NULL, cutoff.values = NULL )
dir.path
A character vector with the location where the CSV file will be saved.
test.set
A Subset object used to compute the performance.
measures
A character vector with the measures to be used to compute the performance value (inherited from MeasureFunction).
voting.names
A character vector with the name of the voting schemes to analyze the performance. If not defined, all the voting schemes used during classification stage will be taken into account.
metric.names
A character containing the measures used during training stage. If not defined, all training metrics used during classification will be taken into account.
cutoff.values
A character vector defining the minimum probability used to perform a positive classification. If not defined, all cutoffs used during the classification stage will be taken into account.
plotPerformances()
The function plots the computed performance.
ClassificationOutput$plotPerformances( dir.path, test.set, measures, voting.names = NULL, metric.names = NULL, cutoff.values = NULL )
dir.path
A character vector with location where the plot will be saved.
test.set
A Subset object used to compute the performance.
measures
A character vector with the measures to be used to compute the performance value (inherited from MeasureFunction).
voting.names
A character vector with the name of the voting schemes to analyze the performance. If not defined, all the voting schemes used during classification stage will be taken into account.
metric.names
A character containing the measures used during training stage. If not defined, all training metrics used during classification will be taken into account.
cutoff.values
A character vector defining the minimum probability used to perform a positive classification. If not defined, all cutoffs used during the classification stage will be taken into account.
getPredictions()
The function is used to obtain the computed predictions.
ClassificationOutput$getPredictions( voting.names = NULL, metric.names = NULL, cutoff.values = NULL, type = NULL, target = NULL, filter = FALSE )
voting.names
A character vector with the name of the voting schemes to analyze the performance. If not defined, all the voting schemes used during classification stage will be taken into account.
metric.names
A character containing the measures used during training stage. If not defined, all training metrics used during classification will be taken into account.
cutoff.values
A character vector defining the minimum probability used to perform a positive classification. If not defined, all cutoffs used during the classification stage will be taken into account.
type
A character to define which type of predictions should be returned. If not defined, all types of predictions will be returned. Conversely, if "prob" or "raw" is defined, the computed probabilistic or class values are returned, respectively.
target
A character defining the value of the positive class.
filter
A logical value used to specify if only predictions matching the target value should be returned or not. If TRUE the function returns only the predictions matching the target value. Conversely if FALSE (by default) the function returns all the predictions.
A PredictionOutput object.
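The interplay of the type, target and filter arguments can be illustrated on a plain data.frame. This is only a sketch of the documented semantics: `select_predictions` and the column names `raw` and `prob` are hypothetical, and the real method returns a PredictionOutput object rather than a data.frame.

```r
# Base-R sketch of the `type`, `target` and `filter` semantics (not D2MCS code).
preds <- data.frame(
  raw  = c("1", "0", "1", "0"),       # predicted class values
  prob = c(0.91, 0.22, 0.67, 0.48),   # predicted probabilities
  stringsAsFactors = FALSE
)

select_predictions <- function(preds, type = NULL, target = NULL, filter = FALSE) {
  out <- preds
  # filter = TRUE keeps only the rows predicted as the target class
  if (isTRUE(filter) && !is.null(target)) {
    out <- out[out$raw == target, , drop = FALSE]
  }
  # type = "raw" or "prob" keeps a single column; NULL keeps both
  if (!is.null(type)) {
    out <- out[, type, drop = FALSE]
  }
  out
}

positives <- select_predictions(preds, type = "prob", target = "1", filter = TRUE)
```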
savePredictions()
The function saves the predictions into a CSV file.
ClassificationOutput$savePredictions( dir.path, voting.names = NULL, metric.names = NULL, cutoff.values = NULL, type = NULL, target = NULL, filter = FALSE )
dir.path
A character vector defining the location of the CSV file.
voting.names
A character vector with the name of the voting schemes to analyze the performance. If not defined, all the voting schemes used during classification stage will be taken into account.
metric.names
A character containing the measures used during training stage. If not defined, all training metrics used during classification will be taken into account.
cutoff.values
A character vector defining the minimum probability used to perform a positive classification. If not defined, all cutoffs used during the classification stage will be taken into account.
type
A character to define which type of predictions should be returned. If not defined, all types of predictions will be returned. Conversely, if "prob" or "raw" is defined, the computed probabilistic or class values are returned, respectively.
target
A character defining the value of the positive class.
filter
A logical value used to specify if only predictions matching the target value should be returned or not. If TRUE the function returns only the predictions matching the target value. Conversely if FALSE (by default) the function returns all the predictions.
clone()
The objects of this class are cloneable with this method.
ClassificationOutput$clone(deep = FALSE)
deep
Whether to make a deep clone.
Implementation of the parliamentary 'majority voting' procedure. The majority class value is defined as final class. All class values have the same importance.
D2MCS::SimpleVoting
-> ClassMajorityVoting
new()
Method for initializing the object arguments during runtime.
ClassMajorityVoting$new(cutoff = 0.5, class.tie = NULL, majority.class = NULL)
cutoff
The minimum probability used to perform a positive classification. If not defined, 0.5 is used as the default value.
class.tie
A character used to define the target class value used when a tie is found. If NULL, the positive class value is assigned.
majority.class
A character defining the value of the majority class. If NULL, the same value as in the training stage is used.
getMajorityClass()
The function returns the value of the majority class.
ClassMajorityVoting$getMajorityClass()
A character vector of length 1 with the name of the majority class.
getClassTie()
The function gets the class value assigned to solve ties.
ClassMajorityVoting$getClassTie()
A character vector of length 1.
execute()
The function implements the majority voting procedure.
ClassMajorityVoting$execute(predictions, verbose = FALSE)
predictions
A ClusterPredictions object containing all the predictions achieved for each cluster.
verbose
A logical value to specify if more verbosity is needed.
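The majority voting idea, including the role of class.tie, can be sketched in base R. The vote matrix and the helper `majority_vote` are hypothetical and only illustrate the procedure; they are not taken from the package.

```r
# Base-R sketch of per-instance majority voting across clusters (not D2MCS code).
# Rows are instances; columns are the class predicted by each cluster's model.
votes <- rbind(
  c("1", "1", "0", "1"),
  c("0", "0", "1", "0"),
  c("1", "0", "1", "0")   # two votes each: a tie
)

majority_vote <- function(row, class.tie) {
  counts  <- table(row)                             # votes per class value
  winners <- names(counts)[counts == max(counts)]   # most voted class(es)
  if (length(winners) > 1) class.tie else winners   # ties fall back to class.tie
}

final <- apply(votes, 1, majority_vote, class.tie = "1")
print(final)
```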
clone()
The objects of this class are cloneable with this method.
ClassMajorityVoting$clone(deep = FALSE)
deep
Whether to make a deep clone.
D2MCS, ClassMajorityVoting, ClassWeightedVoting, ProbAverageVoting, ProbAverageWeightedVoting, ProbBasedMethodology
A new implementation of ClassMajorityVoting where each class value has a different weight.
D2MCS::SimpleVoting
-> ClassWeightedVoting
new()
Method for initializing the object arguments during runtime.
ClassWeightedVoting$new(cutoff = 0.5, weights = NULL)
getWeights()
The function returns the weights used to perform the voting scheme.
ClassWeightedVoting$getWeights()
A numeric vector.
setWeights()
The function allows changing the value of the weights.
ClassWeightedVoting$setWeights(weights)
weights
A numeric vector containing the new weights.
execute()
The function implements the cluster-weighted majority voting procedure.
ClassWeightedVoting$execute(predictions, verbose = FALSE)
predictions
A ClusterPredictions object containing all the predictions achieved for each cluster.
verbose
A logical value to specify if more verbosity is needed.
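A weighted variant of the vote can be sketched as follows, assuming one weight per cluster and a positive decision when the weighted share of positive votes reaches the cutoff. Both assumptions, and the helper `weighted_vote`, are illustrative only; the package's actual weighting rule may differ.

```r
# Base-R sketch of cluster-weighted voting (not D2MCS code).
votes <- rbind(
  c("1", "0", "0"),   # only the heaviest cluster votes positive
  c("0", "1", "1")    # the two lightest clusters vote positive
)
weights <- c(0.6, 0.25, 0.15)   # assumption: one weight per cluster

weighted_vote <- function(row, weights, positive = "1", negative = "0",
                          cutoff = 0.5) {
  # Share of the total weight that backs the positive class
  share <- sum(weights[row == positive]) / sum(weights)
  if (share >= cutoff) positive else negative
}

final <- apply(votes, 1, weighted_vote, weights = weights)
print(final)
```

Note that the first instance is classified as positive although two of the three clusters voted negative, which is the difference from plain majority voting.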
clone()
The objects of this class are cloneable with this method.
ClassWeightedVoting$clone(deep = FALSE)
deep
Whether to make a deep clone.
D2MCS, ClassMajorityVoting, ClassWeightedVoting, ProbAverageVoting, ProbAverageWeightedVoting, ProbBasedMethodology
Stores the predictions achieved by the best M.L. model of each cluster.
new()
Method for initializing the object arguments during runtime.
ClusterPredictions$new(class.values, positive.class)
add()
The function is used to add the prediction achieved by a specific M.L. model.
ClusterPredictions$add(prediction)
prediction
A Prediction object containing the computed predictions.
get()
The function returns the predictions placed at a specific position.
ClusterPredictions$get(position)
position
A numeric value indicating the position of the predictions to be obtained.
A Prediction object.
getAll()
The function returns all the predictions.
ClusterPredictions$getAll()
A list containing all computed predictions.
size()
The function returns the number of computed predictions.
ClusterPredictions$size()
A numeric value.
getPositiveClass()
The function gets the value of the positive class.
ClusterPredictions$getPositiveClass()
A character vector of size 1.
getClassValues()
The function returns all the values of the target class.
ClusterPredictions$getClassValues()
A character vector containing all target values.
clone()
The objects of this class are cloneable with this method.
ClusterPredictions$clone(deep = FALSE)
deep
Whether to make a deep clone.
D2MCS, ClassificationOutput, Prediction
Abstract class used as a template to define new customized strategies to combine the class predictions made by different metrics.
new()
Method for initializing the object arguments during runtime.
CombinedMetrics$new(required.metrics)
required.metrics
A character vector of length greater than 2 with the name of the required metrics.
getRequiredMetrics()
The function returns the required metrics that will participate in the combined metric process.
CombinedMetrics$getRequiredMetrics()
A character vector of length greater than 2 with the name of the required metrics.
getFinalPrediction()
Function used to implement the strategy to obtain the final prediction based on different metrics.
CombinedMetrics$getFinalPrediction( raw.pred, prob.pred, positive.class, negative.class )
raw.pred
A character list of length greater than 2 with the class value of the predictions made by the metrics.
prob.pred
A numeric list of length greater than 2 with the probability of the predictions made by the metrics.
positive.class
A character with the value of the positive class.
negative.class
A character with the value of the negative class.
A logical value indicating if the instance is predicted as positive class or not.
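One possible strategy with the same argument list as getFinalPrediction() is sketched below as a plain function. The rule itself (positive only when every metric agrees) is a hypothetical example chosen for illustration, not a strategy documented here; a real strategy would be written as a class inheriting from CombinedMetrics.

```r
# Base-R sketch of a combination rule (hypothetical, not shipped by D2MCS).
# The instance is positive only when every metric predicted the positive class.
final_prediction <- function(raw.pred, prob.pred, positive.class, negative.class) {
  # prob.pred and negative.class are unused by this simple rule; they are kept
  # so that the signature matches the documented method.
  all(unlist(raw.pred) == positive.class)
}

agree    <- final_prediction(list("1", "1"), list(0.8, 0.7), "1", "0")
disagree <- final_prediction(list("1", "0"), list(0.8, 0.4), "1", "0")
```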
clone()
The objects of this class are cloneable with this method.
CombinedMetrics$clone(deep = FALSE)
deep
Whether to make a deep clone.
Calculates the final prediction by combining the predictions of different metrics obtained through a SimpleVoting class.
D2MCS::VotingStrategy
-> CombinedVoting
new()
Method for initializing the object arguments during runtime.
CombinedVoting$new(voting.schemes, combined.metrics, methodology, metrics)
voting.schemes
A list of elements inherited from SimpleVoting.
combined.metrics
An object defining the metrics used to combine the voting schemes. The object must inherit from the CombinedMetrics class.
methodology
An object specifying the methodology used to execute the combined voting. The object must inherit from the Methodology class.
metrics
A character vector with the name of the metrics used to perform the combined voting operations. Metrics should be previously defined during training stage.
getCombinedMetrics()
The function returns the metrics used to combine the metrics results.
CombinedVoting$getCombinedMetrics()
An object inherited from the CombinedMetrics class.
getMethodology()
The function gets the methodology used to execute the combined votings.
CombinedVoting$getMethodology()
An object inherited from the Methodology class.
getFinalPred()
The function returns the predictions obtained after executing the combined-voting methodology.
CombinedVoting$getFinalPred(type = NULL, target = NULL, filter = NULL)
type
A character to define which type of predictions should be returned. If not defined, all types of predictions will be returned. Conversely, if "prob" or "raw" is defined, the computed probabilistic or class values are returned, respectively.
target
A character defining the value of the positive class.
filter
A logical value used to specify if only predictions matching the target value should be returned or not. If TRUE the function returns only the predictions matching the target value. Conversely if FALSE (by default) the function returns all the predictions.
A data.frame with the computed predictions.
execute()
The function implements the combined voting scheme.
CombinedVoting$execute(predictions, verbose = FALSE)
predictions
A ClusterPredictions object containing the predictions computed for each cluster.
verbose
A logical value to specify if more verbosity is needed.
clone()
The objects of this class are cloneable with this method.
CombinedVoting$clone(deep = FALSE)
deep
Whether to make a deep clone.
D2MCS, ClassMajorityVoting, ClassWeightedVoting, ProbAverageVoting, ProbAverageWeightedVoting, ProbBasedMethodology, SimpleVoting
Creates an R6 confusion matrix from the confusionMatrix object of the caret package.
new()
Method to create a confusion matrix object from a caret confusionMatrix.
ConfMatrix$new(confMatrix)
confMatrix
A caret confusionMatrix argument.
getConfusionMatrix()
The function obtains the confusionMatrix following the same structure as defined in the caret package.
ConfMatrix$getConfusionMatrix()
A confusionMatrix object.
getTP()
The function is used to compute the number of True Positive values achieved.
ConfMatrix$getTP()
A numeric vector of size 1.
getTN()
The function computes the True Negative values.
ConfMatrix$getTN()
A numeric vector of size 1.
getFN()
The function returns the number of Type II errors (False Negative).
ConfMatrix$getFN()
A numeric vector of size 1.
getFP()
The function returns the number of Type I errors (False Positive).
ConfMatrix$getFP()
A numeric vector of size 1.
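The four counts returned by these getters correspond to the cells of a 2x2 contingency table. The base-R sketch below uses table() instead of caret so that it stays self-contained; the data are made up and "1" is assumed to be the positive class.

```r
# Base-R sketch: the four confusion-matrix cells for a bi-class problem.
actual    <- factor(c("1", "1", "1", "0", "0", "0", "1", "0"), levels = c("1", "0"))
predicted <- factor(c("1", "1", "0", "0", "0", "1", "1", "0"), levels = c("1", "0"))

cm <- table(Predicted = predicted, Actual = actual)

tp <- cm["1", "1"]   # predicted positive, actually positive
fn <- cm["0", "1"]   # predicted negative, actually positive (Type II error)
fp <- cm["1", "0"]   # predicted positive, actually negative (Type I error)
tn <- cm["0", "0"]   # predicted negative, actually negative
```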
clone()
The objects of this class are cloneable with this method.
ConfMatrix$clone(deep = FALSE)
deep
Whether to make a deep clone.
D2MCS, MeasureFunction, ClassificationOutput
The class is responsible for managing the whole process. Specifically, it builds the M.L. models (optimizing their hyperparameters), selects the best M.L. model for each cluster and executes the classification stage.
new()
The function is used to initialize all parameters needed to build a Multiple Classifier System.
D2MCS$new( dir.path, num.cores = NULL, socket.type = "PSOCK", outfile = NULL, serialize = FALSE )
dir.path
A character defining the location where the trained models should be saved.
num.cores
An optional numeric value specifying the number of CPU cores used for training the models (only if parallelization is allowed). If not defined (num.cores - 2) cores will be used.
socket.type
A character value defining the type of socket used to communicate with the workers. The default type, "PSOCK", calls makePSOCKcluster. Type "FORK" calls makeForkCluster. For more information see makeCluster.
outfile
Where to direct the stdout and stderr connection output from the workers. "" indicates no redirection (which may only be useful for workers on the local machine). Defaults to '/dev/null'
serialize
A logical value (FALSE in the signature above). If TRUE, serialization will use XDR: where large amounts of data are to be transferred and all the nodes are little-endian, communication may be substantially faster if this is set to FALSE.
train()
The function is responsible for performing the M.L. model training stage.
D2MCS$train( train.set, train.function, num.clusters = NULL, model.recipe = DefaultModelFit$new(), ex.classifiers = c(), ig.classifiers = c(), metrics = NULL, saveAllModels = FALSE )
train.set
A Trainset object used as training input for the M.L. models.
train.function
A TrainFunction defining the training configuration options.
num.clusters
A numeric value used to define the number of clusters from the Trainset that should be utilized during the training stage. If not defined, all clusters will be taken into account for training.
model.recipe
An unprepared recipe object inherited from the GenericModelFit class.
ex.classifiers
A character vector containing the name of the M.L. models used in the training stage. See getModelInfo and https://topepo.github.io/caret/available-models.html for more information about all the available models.
ig.classifiers
A character vector containing the name of the M.L. models that should be ignored when performing the training stage. See getModelInfo and https://topepo.github.io/caret/available-models.html for more information about all the available models.
metrics
A character vector containing the metrics used to perform the M.L. model hyperparameter optimization during the training stage. See SummaryFunction, UseProbability and NoProbability for more information.
saveAllModels
A logical parameter. TRUE saves all trained models, while FALSE saves only the M.L. model achieving the best performance on each cluster.
A TrainOutput object containing all the information computed during the training stage.
classify()
The function is responsible for executing the classification stage.
D2MCS$classify(train.output, subset, voting.types, positive.class = NULL)
train.output
The TrainOutput object computed in the train stage.
subset
A Subset containing the data to be classified.
voting.types
A list containing SingleVoting or CombinedVoting objects.
positive.class
An optional character parameter used to define the positive class value.
A ClassificationOutput with all the values computed during the classification stage.
getAvailableModels()
The function obtains all the available M.L. models.
D2MCS$getAvailableModels()
A data.frame containing the information of the available M.L. models.
clone()
The objects of this class are cloneable with this method.
D2MCS$clone(deep = FALSE)
deep
Whether to make a deep clone.
# Specify the random number generation seed
set.seed(1234)

## Create Dataset Handler object.
loader <- DatasetLoader$new()

## Load 'hcc-data-complete-balanced.csv' dataset file.
data <- loader$load(filepath = system.file(file.path("examples",
                                                     "hcc-data-complete-balanced.csv"),
                                           package = "D2MCS"),
                    header = TRUE, normalize.names = TRUE)

## Get column names
data$getColumnNames()

## Split data into 4 partitions keeping balance ratio of 'Class' column.
data$createPartitions(num.folds = 4, class.balance = "Class")

## Create a subset comprising the first 2 partitions for clustering purposes.
cluster.subset <- data$createSubset(num.folds = c(1, 2),
                                    class.index = "Class",
                                    positive.class = "1")

## Create a subset comprising second and third partitions for training purposes.
train.subset <- data$createSubset(num.folds = c(2, 3),
                                  class.index = "Class",
                                  positive.class = "1")

## Create a subset comprising the last partition for testing purposes.
test.subset <- data$createSubset(num.folds = 4,
                                 class.index = "Class",
                                 positive.class = "1")

## Distribute the features into clusters using MCC heuristic.
distribution <- SimpleStrategy$new(subset = cluster.subset,
                                   heuristic = MCCHeuristic$new())
distribution$execute()

## Get the best achieved distribution
distribution$getBestClusterDistribution()

## Create a train set from the computed clustering distribution
train.set <- distribution$createTrain(subset = train.subset)

## Not run: 
## Initialization of D2MCS configuration parameters.
##  - Defining training operation.
##    + 10-fold cross-validation
##    + Use only 1 CPU core.
##    + Seed was set to ensure straightforward reproducibility of experiments.
trFunction <- TwoClass$new(method = "cv", number = 10, savePredictions = "final",
                           classProbs = TRUE, allowParallel = TRUE,
                           verboseIter = FALSE, seed = 1234)

##  - Specify the models to be trained
ex.classifiers <- c("ranger", "lda", "lda2")

## Initialize D2MCS
d2mcs <- D2MCS$new(dir.path = tempdir(),
                   num.cores = 1)

## Execute training stage using 'MCC' and 'PPV' measures to optimize model hyperparameters.
trained.models <- d2mcs$train(train.set = train.set,
                              train.function = trFunction,
                              ex.classifiers = ex.classifiers,
                              metrics = c("MCC", "PPV"))

## Execute classification stage using two different voting schemes
predictions <- d2mcs$classify(train.output = trained.models,
                              subset = test.subset,
                              voting.types = c(
                                SingleVoting$new(voting.schemes = c(ClassMajorityVoting$new(),
                                                                    ClassWeightedVoting$new()),
                                                 metrics = c("MCC", "PPV"))))

## Compute the performance of each voting scheme using PPV and MCC measures.
predictions$getPerformances(test.subset, measures = list(MCC$new(), PPV$new()))

## Execute classification stage using multiple voting schemes (simple and combined)
predictions <- d2mcs$classify(train.output = trained.models,
                              subset = test.subset,
                              voting.types = c(
                                SingleVoting$new(voting.schemes = c(ClassMajorityVoting$new(),
                                                                    ClassWeightedVoting$new()),
                                                 metrics = c("MCC", "PPV")),
                                CombinedVoting$new(voting.schemes = ClassMajorityVoting$new(),
                                                   combined.metrics = MinimizeFP$new(),
                                                   methodology = ProbBasedMethodology$new(),
                                                   metrics = c("MCC", "PPV"))))

## Compute the performance of each voting scheme using PPV and MCC measures.
predictions$getPerformances(test.subset, measures = list(MCC$new(), PPV$new()))

## End(Not run)
Creates a valid simple dataset object.
new()
Method for initializing the object arguments during runtime.
Dataset$new( filepath, header = TRUE, sep = ",", skip = 0, normalize.names = FALSE, string.as.factor = FALSE, ignore.columns = NULL )
filepath
The name of the file which the data are to be read from. Each row of the table appears as one line of the file. If it does not contain an absolute path, the file name is relative to the current working directory, 'getwd()'.
header
A logical value indicating whether the file contains the names of the variables as its first line. If missing, the value is determined from the file format: 'header' is set to 'TRUE' if and only if the first row contains one fewer field than the number of columns.
sep
The field separator character. Values on each line of the file are separated by this character.
skip
Defines the number of header lines that should be skipped.
normalize.names
A logical value indicating whether the columns names should be automatically renamed to ensure R compatibility.
string.as.factor
A logical value indicating if character columns should be converted to factors (default = FALSE).
ignore.columns
Specify the columns from the input file that should be ignored.
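The loading options above roughly correspond to what base R offers through read.csv() and make.names(). The sketch below is an approximation under that assumption, not the package's loader: the file contents are made up, and how the package normalizes names or interprets ignore.columns (names or indices) may differ.

```r
# Base-R sketch of the loading options (an approximation, not D2MCS code).
csv <- tempfile(fileext = ".csv")
writeLines(c("patient id,Age (years),Class",
             "1,54,1",
             "2,61,0"), csv)

data <- utils::read.csv(csv, header = TRUE, sep = ",", skip = 0,
                        stringsAsFactors = FALSE,   # like string.as.factor = FALSE
                        check.names = FALSE)        # keep the raw column names

# Assumption: name normalization behaves like make.names()
names(data) <- make.names(names(data), unique = TRUE)

# Assumption: ignored columns are simply dropped after loading
ignore.columns <- "patient.id"
data <- data[, setdiff(names(data), ignore.columns), drop = FALSE]
```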
getColumnNames()
Get the name of the columns comprising the dataset.
Dataset$getColumnNames()
A character vector with the name of each column.
getDataset()
Gets the full dataset.
Dataset$getDataset()
A data.frame with all the loaded information.
getNcol()
Obtains the number of columns present in the dataset.
Dataset$getNcol()
An integer of length 1 or NULL
getNrow()
Obtains the number of rows present in the dataset.
Dataset$getNrow()
An integer of length 1 or NULL
getRemovedColumns()
Get the columns removed or ignored.
Dataset$getRemovedColumns()
A list containing the name of the removed columns.
cleanData()
Removes data.frame columns matching some criterion.
Dataset$cleanData(remove.funcs = NULL, remove.na = TRUE, remove.const = FALSE)
removeColumns()
Applies the cleanData function over a specific set of columns.
Dataset$removeColumns( columns, remove.funcs = NULL, remove.na = FALSE, remove.const = FALSE )
columns
Set of columns (numeric or character) where removal operation should be applied.
remove.funcs
A vector of functions used to define which columns must be removed.
remove.na
A logical value indicating whether NA values should be removed.
remove.const
A logical value used to indicate if constant values should be removed.
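The effect of remove.na and remove.const can be sketched on a plain data.frame. The helper `drop_columns` is hypothetical, and it assumes that whole columns containing NA values, or holding a single constant value, are the ones removed.

```r
# Base-R sketch of column cleaning (hypothetical helper, not D2MCS code).
df <- data.frame(a = c(1, 2, 3),
                 b = c(NA, 5, 6),      # contains an NA value
                 c = c(7, 7, 7),       # constant column
                 d = c("x", "y", "x"),
                 stringsAsFactors = FALSE)

drop_columns <- function(df, remove.na = TRUE, remove.const = FALSE) {
  keep <- rep(TRUE, ncol(df))
  if (remove.na) {
    keep <- keep & !vapply(df, anyNA, logical(1))
  }
  if (remove.const) {
    keep <- keep & vapply(df, function(col) length(unique(col)) > 1, logical(1))
  }
  df[, keep, drop = FALSE]
}

cleaned <- drop_columns(df, remove.na = TRUE, remove.const = TRUE)
```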
createPartitions()
Creates a k-folds partition from the initial dataset.
Dataset$createPartitions( num.folds = NULL, percent.folds = NULL, class.balance = NULL )
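A class-balanced k-fold split, as requested through num.folds and class.balance, can be sketched in base R by assigning folds within each class separately. This is a generic stratification sketch with made-up labels, not the algorithm used by the package.

```r
# Base-R sketch of a class-balanced k-fold partition (not D2MCS code).
set.seed(1234)
classes   <- rep(c("0", "1"), times = c(12, 8))   # made-up class column
num.folds <- 4

fold <- integer(length(classes))
for (cl in unique(classes)) {
  idx <- which(classes == cl)
  idx <- idx[sample.int(length(idx))]             # shuffle rows of this class
  # Deal the shuffled rows round-robin so each fold keeps the class ratio
  fold[idx] <- rep_len(seq_len(num.folds), length(idx))
}

balance <- table(fold, classes)
print(balance)
```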
createSubset()
Creates a Subset for testing or classification purposes. A target class should be provided for testing purposes.
Dataset$createSubset( num.folds = NULL, opts = list(remove.na = TRUE, remove.const = FALSE), class.index = NULL, positive.class = NULL )
num.folds
A numeric defining the number of folds that should be used to build the Subset.
opts
A list with optional parameters. Valid arguments are
remove.na
(removes columns with NA values) and
remove.const
(ignore columns with constant values).
class.index
A numeric value identifying the column representing the target class
positive.class
Defines the positive class value.
A Subset object.
createTrain()
Creates a set for training purposes. A class should be defined to guarantee full-compatibility with supervised models.
Dataset$createTrain( class.index, positive.class, num.folds = NULL, opts = list(remove.na = TRUE, remove.const = FALSE) )
class.index
A numeric value identifying the column representing the target class
positive.class
Defines the positive class value.
num.folds
A numeric defining the number of folds that should be used to build the Subset.
opts
A list with optional parameters. Valid arguments are
remove.na
(removes columns with NA values) and
remove.const
(ignore columns with constant values).
A Trainset
object.
Wrapper class able to automatically create a Dataset or HDDataset object according to the input data.
new()
Empty function used to initialize the object arguments in runtime.
DatasetLoader$new()
load()
Stores the input source into a Dataset
or
HDDataset
type object.
DatasetLoader$load( filepath, header = TRUE, sep = ",", skip.lines = 0, normalize.names = FALSE, string.as.factor = FALSE, ignore.columns = NULL )
filepath
The name of the file which the data are to be read from.
Each row of the table appears as one line of the file. If it does not
contain an _absolute_ path, the file name is _relative_ to the current
working directory, 'getwd()
'.
header
A logical value indicating whether the file contains
the names of the variables as its first line. If missing, the value is
determined from the file format: 'header
' is set to 'TRUE'
if and only if the first row contains one fewer field than the number of
columns.
sep
The field separator character. Values on each line of the file are separated by this character.
skip.lines
Defines the number of header lines that should be skipped.
normalize.names
A logical value indicating whether the column names should be automatically renamed to ensure R compatibility.
string.as.factor
A logical value indicating if character columns should be converted to factors (default = FALSE).
ignore.columns
Specify the columns from the input file that should be ignored.
A Dataset
or HDDataset
object.
## Not run:
# Create Dataset Handler object.
loader <- DatasetLoader$new()
# Load input file.
data <- loader$load(filepath = system.file(file.path("examples",
                                                     "hcc-data-complete-balanced.csv"),
                                           package = "D2MCS"),
                    header = T, normalize.names = T)
## End(Not run)
Creates the default recipe and formula objects used in the model training stage.
D2MCS::GenericModelFit
-> DefaultModelFit
new()
Method for initializing the object arguments during runtime.
DefaultModelFit$new()
createFormula()
The function is responsible for creating a formula for the M.L. model.
DefaultModelFit$createFormula(instances, class.name, simplify = FALSE)
instances
A data.frame containing the instances used to create the recipe.
class.name
A character vector representing the name of the target class.
simplify
A logical argument defining whether the formula should be generated as simple as possible.
A formula
object.
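A minimal base R sketch of how such a formula could be built from a data.frame and a class name is shown below. The helper build_formula is hypothetical, and reading simplify = TRUE as the compact "class ~ ." form is an assumption, not something the documentation states.

```r
# Hypothetical sketch (not the D2MCS implementation): build the formula
# "class ~ f1 + f2 + ..." from the column names of a data.frame.
# Assumption: simplify = TRUE is taken to mean the compact "class ~ ." form.
build_formula <- function(instances, class.name, simplify = FALSE) {
  if (simplify) {
    return(as.formula(paste(class.name, "~ .")))
  }
  predictors <- setdiff(names(instances), class.name)
  reformulate(termlabels = predictors, response = class.name)
}

toy <- data.frame(Class = c("a", "b"), x1 = c(1, 2), x2 = c(3, 4))
f <- build_formula(toy, "Class")
print(f)
```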
createRecipe()
The function is responsible for creating a recipe with five operations over the data: step_zv, step_nzv, step_corr, step_center and step_scale.
DefaultModelFit$createRecipe(instances, class.name)
instances
A data.frame
containing the instances used
to create the recipe.
class.name
A character
vector representing the name
of the target class.
This function is automatically invoked by D2MCS
during model training stage.
An object of class recipe
.
clone()
The objects of this class are cloneable with this method.
DefaultModelFit$clone(deep = FALSE)
deep
Whether to make a deep clone.
Features are distributed according to their independence values. The strategy is divided into two steps. The first step forms groups with the features that are most dependent on each other, and also identifies the features that are independent of all the others. The second step tries out different numbers of clusters until the one judged best is found. Each cluster is formed by inserting all the independent features identified previously and by distributing the features of the groups formed in the first step across separate clusters. In this way, the strategy seeks to ensure that the features within the same cluster are as independent as possible from one another.
The strategy is suitable only for binary and real features. Other
features are automatically grouped into a specific cluster named
'unclustered'. This class requires that the StrategyConfiguration
type object implement the following methods:
- getBinaryCutoff()
: The function is used to define the interval to
consider the dependency between binary features.
- getRealCutoff()
: The function allows defining the cutoff to consider
the dependency between real features.
- tiebreak(feature, clus.candidates, fea.dep.dist.clus, corpus,
heuristic, class, class.name)
: The function solves the ties between two
(or more) features.
- qualityOfCluster(clusters, metrics)
: The function determines the quality of a cluster.
- isImprovingClustering(clusters.deltha)
: The function indicates whether the clustering improves as the number of clusters increases.
An example of implementation with the description of each parameter is the
DependencyBasedStrategyConfiguration
class.
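The first phase described above (finding dependent feature pairs and the features that are independent of all others) can be sketched in base R. This is only an illustration under stated assumptions: Kendall's tau stands in for whichever heuristic is supplied, the toy data are synthetic, and the variable names are hypothetical. It is not the package's algorithm.

```r
# Hypothetical sketch of the first phase: flag feature pairs whose absolute
# dependence exceeds a cutoff, and list features with no dependent partner.
set.seed(42)
x1 <- rnorm(100)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(100, sd = 0.1),  # strongly tied to x1
  x3 = rnorm(100)                  # unrelated to the others
)

cutoff <- 0.6  # same value as the documented default for realCutoff
dep <- abs(cor(toy, method = "kendall"))
diag(dep) <- 0  # a feature is not compared with itself

dependent <- dep > cutoff
independent.features <- names(which(rowSums(dependent) == 0))
print(independent.features)
```

In this toy data x1 and x2 form a dependent group, while x3 has no partner above the cutoff and would be treated as independent.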
D2MCS::GenericClusteringStrategy
-> DependencyBasedStrategy
new()
Method for initializing the object parameters during runtime.
DependencyBasedStrategy$new( subset, heuristic, configuration = DependencyBasedStrategyConfiguration$new() )
subset
The Subset
used to apply the feature-clustering
strategy.
heuristic
The heuristic used to compute the relevance of each
feature. Must inherit from GenericHeuristic
abstract class.
configuration
Optional parameter to customize the configuration parameters of the strategy. Must inherit from the StrategyConfiguration abstract class.
execute()
Function responsible for performing the dependency-based feature clustering strategy over the defined Subset.
DependencyBasedStrategy$execute(verbose = TRUE)
verbose
A logical value to specify if more verbosity is needed.
getDistribution()
Function used to obtain a specific cluster distribution.
DependencyBasedStrategy$getDistribution( num.clusters = NULL, num.groups = NULL, include.unclustered = FALSE )
A list with the features comprising a specific clustering distribution.
createTrain()
The function is used to create a Trainset
object from a specific clustering distribution.
DependencyBasedStrategy$createTrain( subset, num.clusters = NULL, num.groups = NULL, include.unclustered = FALSE )
subset
The Subset
object used as a basis to create
the train set (see Trainset
class).
num.clusters
A numeric value to select the number of clusters (define the distribution).
num.groups
A single or numeric vector value to identify a specific group that forms the clustering distribution.
include.unclustered
A logical value to determine if unclustered features should be included.
If num.clusters and num.groups are not defined, the best clustering distribution is used to create the train set.
plot()
The function is responsible for creating a plot to visualize the clustering distribution.
DependencyBasedStrategy$plot(dir.path = NULL, file.name = NULL)
dir.path
An optional argument to define the name of the directory
where the exported plot will be saved. If not defined, the file path will
be automatically assigned to the current working directory,
'getwd()
'.
file.name
A character to define the name of the PDF file where the plot is exported.
saveCSV()
The function is used to save the clustering distribution to a CSV file.
DependencyBasedStrategy$saveCSV( dir.path = NULL, name = NULL, num.clusters = NULL )
dir.path
The name of the directory to save the CSV file.
name
Defines the name of the CSV file.
num.clusters
An optional parameter to select the number of clusters to be saved. If not defined, all cluster distributions will be saved.
clone()
The objects of this class are cloneable with this method.
DependencyBasedStrategy$clone(deep = FALSE)
deep
Whether to make a deep clone.
GenericClusteringStrategy
,
StrategyConfiguration
,
DependencyBasedStrategyConfiguration
Define the default configuration parameters for the DependencyBasedStrategy strategy.
D2MCS::StrategyConfiguration
-> DependencyBasedStrategyConfiguration
new()
Method for initializing the object arguments during runtime.
DependencyBasedStrategyConfiguration$new( binaryCutoff = 0.6, realCutoff = 0.6, tiebreakMethod = "lfdc", metric = "dep.tar" )
binaryCutoff
The numeric value of binary cutoff.
realCutoff
The numeric value of real cutoff.
tiebreakMethod
The character value of tie-break method. The two tiebreak methods available are "lfdc" (less dependence cluster with the features) and "ltdc" (less dependence cluster with the target). These methods are used to add the features in the candidate feature clusters.
metric
The character value of the metric whose mean is used to obtain the quality of a cluster. The two metrics available are "dep.tar" (dependence of cluster features on the target) and "dep.fea" (dependence between cluster features).
minNumClusters()
Function used to return the minimum number of cluster distributions used. By default the minimum is set to 2.
DependencyBasedStrategyConfiguration$minNumClusters(...)
...
Further arguments passed down to minNumClusters
function.
A numeric vector of length 1.
maxNumClusters()
The function is responsible for returning the maximum number of cluster distributions used. By default the maximum is set to 50.
DependencyBasedStrategyConfiguration$maxNumClusters(...)
...
Further arguments passed down to maxNumClusters
function.
A numeric vector of length 1.
getBinaryCutoff()
Gets the cutoff to consider the dependency between binary features.
DependencyBasedStrategyConfiguration$getBinaryCutoff()
The numeric value of binary cutoff.
getRealCutoff()
Gets the cutoff to consider the dependency between real features.
DependencyBasedStrategyConfiguration$getRealCutoff()
The numeric value of real cutoff.
setBinaryCutoff()
Sets the cutoff to consider the dependency between binary features.
DependencyBasedStrategyConfiguration$setBinaryCutoff(cutoff)
cutoff
The new numeric value of binary cutoff.
setRealCutoff()
Sets the cutoff to consider the dependency between real features.
DependencyBasedStrategyConfiguration$setRealCutoff(cutoff)
cutoff
The new numeric value of real cutoff.
tiebreak()
The function solves the ties between two (or more) features.
DependencyBasedStrategyConfiguration$tiebreak( feature, clus.candidates, fea.dep.dist.clus, corpus, heuristic, class, class.name )
feature
A character containing the name of the feature
clus.candidates
A single or numeric vector value to identify the candidate groups to insert the feature.
fea.dep.dist.clus
A list containing the groups chosen for the features.
corpus
A data.frame containing the features of the initial data.
heuristic
The heuristic used to compute the relevance of each feature. Must inherit from GenericHeuristic abstract class.
class
A character vector containing all the values of the target class.
class.name
A character value representing the name of the target class.
qualityOfCluster()
The function determines the quality of a cluster.
DependencyBasedStrategyConfiguration$qualityOfCluster(clusters, metrics)
A numeric vector of length 1.
isImprovingClustering()
The function indicates whether the clustering improves as the number of clusters increases.
DependencyBasedStrategyConfiguration$isImprovingClustering(clusters.deltha)
clusters.deltha
A numeric vector value with the quality values of the built clusters.
A numeric vector of length 1.
clone()
The objects of this class are cloneable with this method.
DependencyBasedStrategyConfiguration$clone(deep = FALSE)
deep
Whether to make a deep clone.
StrategyConfiguration
,
DependencyBasedStrategy
Performs feature-clustering based on Fisher's exact test for testing the null hypothesis of independence of rows and columns in a contingency table with fixed marginals.
D2MCS::GenericHeuristic
-> FisherTestHeuristic
new()
Empty function used to initialize the object arguments in runtime.
FisherTestHeuristic$new()
heuristic()
Performs Fisher's exact test for testing the null hypothesis of independence between two columns (col1 and col2).
FisherTestHeuristic$heuristic(col1, col2, column.names = NULL)
A numeric vector of length 1 or NA if an error occurs.
clone()
The objects of this class are cloneable with this method.
FisherTestHeuristic$clone(deep = FALSE)
deep
Whether to make a deep clone.
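For orientation, Fisher's exact test between two binary columns can be run in base R with fisher.test on their contingency table. The sketch below uses made-up columns and reads off the p-value; which value the heuristic method itself returns is not stated here, so treating the p-value as the result is an assumption.

```r
# Sketch with toy data: Fisher's exact test on the contingency table of two
# binary columns. Using the p-value as the score is an assumption.
col1 <- c(1, 1, 1, 1, 0, 0, 0, 0)
col2 <- c(1, 1, 1, 0, 0, 0, 0, 1)

contingency <- table(col1, col2)
p.value <- fisher.test(contingency)$p.value
print(p.value)
```

A large p-value, as in this small example, gives no evidence against the independence of the two columns.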
Computes the ratio of the number of Type II errors (false negatives) achieved by the final M.L. model.
D2MCS::MeasureFunction
-> FN
new()
Method for initializing the object arguments during runtime.
FN$new(performance.output = NULL)
performance.output
An optional ConfMatrix
parameter
to define the type of object used to compute the FN measure.
compute()
The function computes the FN achieved by the M.L. model.
FN$compute(performance.output = NULL)
performance.output
An optional ConfMatrix
parameter
to define the type of object used as basis to compute the FN measure.
This function is automatically invoked by the
ClassificationOutput
framework.
A numeric vector of size 1 or NULL if an error occurred.
clone()
The objects of this class are cloneable with this method.
FN$clone(deep = FALSE)
deep
Whether to make a deep clone.
MeasureFunction
, ClassificationOutput
,
ConfMatrix
This is the number of individuals with a negative condition for which the test result is positive. The value entered here must be non-negative.
D2MCS::MeasureFunction
-> FP
new()
Method for initializing the object arguments during runtime.
FP$new(performance.output = NULL)
performance.output
An optional ConfMatrix parameter to define the type of object used to compute the FP measure.
compute()
The function computes the FP achieved by the M.L. model.
FP$compute(performance.output = NULL)
performance.output
An optional ConfMatrix
parameter
to define the type of object used as basis to compute the FP
measure.
This function is automatically invoked by the
ClassificationOutput
object.
A numeric vector of size 1 or NULL if an error occurred.
clone()
The objects of this class are cloneable with this method.
FP$clone(deep = FALSE)
deep
Whether to make a deep clone.
MeasureFunction
, ClassificationOutput
,
ConfMatrix
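The FN and FP quantities can be read directly from a 2x2 confusion table. The base R sketch below uses made-up labels and a plain table rather than a ConfMatrix object, so the object layout shown is an assumption for illustration only.

```r
# Sketch with toy labels: count false negatives and false positives from a
# plain confusion table (a stand-in for the ConfMatrix object).
observed  <- factor(c("pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"),
                    levels = c("pos", "neg"))
predicted <- factor(c("pos", "neg", "pos", "neg", "pos", "neg", "neg", "neg"),
                    levels = c("pos", "neg"))

conf <- table(Predicted = predicted, Observed = observed)

# False negatives: observed positive, predicted negative (Type II errors).
fn <- conf["neg", "pos"]
# False positives: observed negative, predicted positive (Type I errors).
fp <- conf["pos", "neg"]
print(c(FN = fn, FP = fp))
```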
Performs the feature-clustering using entropy-based filters.
D2MCS::GenericHeuristic
-> GainRatioHeuristic
new()
Empty function used to initialize the object arguments in runtime.
GainRatioHeuristic$new()
heuristic()
The algorithm finds weights of discrete attributes based on their correlation with a continuous class attribute.
GainRatioHeuristic$heuristic(col1, col2, column.names = NULL)
A numeric vector of length 1 or NA if an error occurs.
clone()
The objects of this class are cloneable with this method.
GainRatioHeuristic$clone(deep = FALSE)
deep
Whether to make a deep clone.
Abstract class used as a template to ensure the proper definition of new customized clustering strategies.
The GenericClusteringStrategy is an archetype class so it cannot be instantiated.
new()
A function responsible for creating a GenericClusteringStrategy object.
GenericClusteringStrategy$new(subset, heuristic, description, configuration)
subset
A Subset
object to perform the clustering strategy.
heuristic
The heuristic to be applied. Must inherit from
GenericHeuristic
class.
description
A character vector describing the strategy operation.
configuration
Optional customized configuration parameters for the strategy. Must inherit from the StrategyConfiguration abstract class.
getDescription()
The function is used to obtain the description of the strategy.
GenericClusteringStrategy$getDescription()
A character vector or NULL if not defined.
getHeuristic()
The function returns the heuristic applied for the clustering strategy.
GenericClusteringStrategy$getHeuristic()
An object inherited from the GenericHeuristic class.
getConfiguration()
The function returns the configuration parameters used to perform the clustering strategy.
GenericClusteringStrategy$getConfiguration()
An object inherited from StrategyConfiguration
class.
getBestClusterDistribution()
The function obtains the best clustering distribution.
GenericClusteringStrategy$getBestClusterDistribution()
A list of clusters. Each list element represents a feature group.
getUnclustered()
The function is used to return the features that cannot be clustered due to incompatibilities with the used heuristic.
GenericClusteringStrategy$getUnclustered()
A character vector containing the unclassified features.
execute()
Abstract function responsible for performing the clustering strategy over the defined Subset.
GenericClusteringStrategy$execute(verbose, ...)
verbose
A logical value to specify if more verbosity is needed.
...
Further arguments passed down to execute
function.
getDistribution()
Abstract function used to obtain the set of features following a specific clustering distribution.
GenericClusteringStrategy$getDistribution( num.clusters = NULL, num.groups = NULL, include.unclustered = FALSE )
A list with the features comprising a specific clustering distribution.
createTrain()
Abstract function in charge of creating a
Trainset
object for training purposes.
GenericClusteringStrategy$createTrain( subset, num.cluster = NULL, num.groups = NULL, include.unclustered = FALSE )
subset
num.cluster
A numeric value to select the number of clusters (define the distribution).
num.groups
A single or numeric vector value to identify a specific group that forms the clustering distribution.
include.unclustered
A logical value to determine if unclustered features should be included.
plot()
Abstract function responsible for creating a plot to visualize the clustering distribution.
GenericClusteringStrategy$plot(dir.path = NULL, file.name = NULL, ...)
dir.path
An optional character argument to define the name
of the directory where the exported plot will be saved. If not defined,
the file path will be automatically assigned to the current working
directory, 'getwd()
'.
file.name
The name of the PDF file where the plot is exported.
...
Further arguments passed down to execute
function.
saveCSV()
Abstract function to save the clustering distribution to a CSV file.
GenericClusteringStrategy$saveCSV(dir.path, name, num.clusters = NULL)
dir.path
The name of the directory to save the CSV file.
name
Defines the name of the CSV file.
num.clusters
An optional parameter to select the number of clusters to be saved. If not defined, all clusters will be saved.
clone()
The objects of this class are cloneable with this method.
GenericClusteringStrategy$clone(deep = FALSE)
deep
Whether to make a deep clone.
Abstract class used as a template to define new customized clustering heuristics.
The GenericHeuristic is an archetype class so it cannot be instantiated.
new()
Empty function used to initialize the object arguments in runtime.
GenericHeuristic$new()
heuristic()
Function used to implement the clustering heuristic.
GenericHeuristic$heuristic(col1, col2, column.names = NULL, ...)
A numeric vector of length 1.
clone()
The objects of this class are cloneable with this method.
GenericHeuristic$clone(deep = FALSE)
deep
Whether to make a deep clone.
Template to create the recipe or formula objects used in the model training stage.
new()
Method for initializing the object arguments during runtime.
GenericModelFit$new()
createFormula()
The function is responsible for creating a formula for the M.L. model.
GenericModelFit$createFormula(instances, class.name, simplify = TRUE)
instances
A data.frame containing the instances used to create the recipe.
class.name
A character vector representing the name of the target class.
simplify
A logical argument defining whether the formula should be generated as simple as possible.
A formula
object.
createRecipe()
The function is responsible for creating a recipe for the M.L. model.
GenericModelFit$createRecipe(instances, class.name)
instances
A data.frame containing the instances used to create the recipe.
class.name
A character vector representing the name of the target class.
An object of class recipe.
clone()
The objects of this class are cloneable with this method.
GenericModelFit$clone(deep = FALSE)
deep
Whether to make a deep clone.
The GenericPlot
implements a basic plot.
new()
Empty function used to initialize the object arguments in runtime.
GenericPlot$new()
plot()
Implements a generic plot to visualize basic feature-clustering data.
GenericPlot$plot(summary)
summary
A data.frame comprising the elements to be plotted.
clone()
The objects of this class are cloneable with this method.
GenericPlot$clone(deep = FALSE)
deep
Whether to make a deep clone.
Creates a high dimensional dataset object. Only the required instances are loaded in memory to avoid unnecessary use of resources and memory.
new()
Method for initializing the object arguments during runtime.
HDDataset$new( filepath, header = TRUE, sep = ",", skip = 0, normalize.names = FALSE, ignore.columns = NULL )
filepath
The name of the file which the data are to be read from.
Each row of the table appears as one line of the file. If it does not
contain an _absolute_ path, the file name is _relative_ to the current
working directory, 'getwd()
'.
header
A logical value indicating whether the file contains
the names of the variables as its first line. If missing, the value is
determined from the file format: 'header
' is set to 'TRUE
'
if and only if the first row contains one fewer field than the number of
columns.
sep
The field separator character. Values on each line of the file are separated by this character.
skip
Defines the number of header lines that should be skipped.
normalize.names
A logical value indicating whether the column names should be automatically renamed to ensure R compatibility.
ignore.columns
Specify the columns from the input file that should be ignored.
getColumnNames()
Gets the names of the columns comprising the dataset.
HDDataset$getColumnNames()
A character vector with the name of each column.
getNcol()
Obtains the number of columns present in the dataset.
HDDataset$getNcol()
An integer of length 1 or NULL
createSubset()
Creates a blinded HDSubset for classification purposes.
HDDataset$createSubset(column.id = FALSE, chunk.size = 1e+05)
A HDSubset
object.
Dataset
, HDSubset
,
DatasetLoader
Creates a high dimensional subset from a HDDataset
object. Only the required instances are loaded in memory to avoid unnecessary
use of resources and memory.
Use HDDataset
to ensure the creation of a valid
HDSubset
object.
new()
Method for initializing the object arguments during runtime.
HDSubset$new( file.path, feature.names, feature.id, start.at = 0, sep = ",", chunk.size )
file.path
The name of the file which the data are to be read from.
Each row of the table appears as one line of the file. If it does not
contain an _absolute_ path, the file name is _relative_ to the current
working directory, 'getwd()
'.
feature.names
A character vector specifying the name of the
features that should be included in the HDDataset
object.
feature.id
An integer or character indicating the identifier column (by number or name, respectively). The default NULL value is valid and means that no identification column is defined.
start.at
A numeric value to identify the reading start position.
sep
The field separator character. Values on each line of the file are separated by this character.
chunk.size
An integer value indicating the size of the chunks taken over each iteration. By default chunk.size is defined as 10000.
getColumnNames()
Gets the name of the columns comprising the subset.
HDSubset$getColumnNames()
A character vector containing the name of each column.
getNcol()
Obtains the number of columns present in the dataset.
HDSubset$getNcol()
A numeric value or 0 if it is empty.
getID()
Obtains the column identifier.
HDSubset$getID()
A character vector of size 1.
getIterator()
Creates the FIterator
object.
HDSubset$getIterator(chunk.size = private$chunk.size, verbose = FALSE)
An FIterator object to traverse the HDSubset instances.
isBlinded()
Checks if the subset contains a target class.
HDSubset$isBlinded()
A logical to specify if the subset contains a target class or not.
clone()
The objects of this class are cloneable with this method.
HDSubset$clone(deep = FALSE)
deep
Whether to make a deep clone.
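The idea of loading only the required instances, one chunk at a time, can be sketched with base R connections. The example below writes a small temporary CSV file and reads it back in chunks of three rows; it is an illustrative stand-in and does not use the FIterator class, whose interface is not described here.

```r
# Sketch of chunked reading (not the FIterator implementation): only
# chunk.size rows are held in memory at any time.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:7, x = (1:7) * 10), path, row.names = FALSE)

con <- file(path, open = "r")
header <- readLines(con, n = 1)  # keep the header to parse every chunk
chunk.size <- 3
chunk.rows <- integer(0)

repeat {
  lines <- readLines(con, n = chunk.size)
  if (length(lines) == 0) break  # end of file reached
  chunk <- read.csv(text = c(header, lines))
  chunk.rows <- c(chunk.rows, nrow(chunk))
}

close(con)
unlink(path)
print(chunk.rows)
```

The seven data rows are delivered as chunks of 3, 3 and 1 rows.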
Performs the feature-clustering using entropy-based filters.
D2MCS::GenericHeuristic
-> InformationGainHeuristic
new()
Empty function used to initialize the object arguments in runtime.
InformationGainHeuristic$new()
heuristic()
The algorithm finds weights of discrete attributes based on
their correlation with a continuous class attribute. In particular,
Information Gain uses H(Class) + H(Attribute) - H(Class, Attribute).
InformationGainHeuristic$heuristic(col1, col2, column.names = NULL)
A numeric vector of length 1 or NA if an error occurs.
clone()
The objects of this class are cloneable with this method.
InformationGainHeuristic$clone(deep = FALSE)
deep
Whether to make a deep clone.
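The expression H(Class) + H(Attribute) - H(Class, Attribute) given above can be evaluated in base R for discrete vectors. The helpers entropy and information_gain are hypothetical, and the choice of base-2 logarithms is an assumption; the package may compute the value differently.

```r
# Hypothetical helpers: Shannon entropy (in bits) of a discrete vector and
# the information gain H(Class) + H(Attribute) - H(Class, Attribute).
entropy <- function(x) {
  p <- as.numeric(table(x)) / length(x)
  p <- p[p > 0]
  -sum(p * log2(p))
}

information_gain <- function(attribute, target) {
  # The joint distribution is obtained by pairing the two vectors.
  joint <- interaction(attribute, target, drop = TRUE)
  entropy(target) + entropy(attribute) - entropy(joint)
}

target <- c("x", "x", "y", "y")
# The attribute fully determines the class.
ig.dependent <- information_gain(c("a", "a", "b", "b"), target)
# The attribute carries no information about the class.
ig.independent <- information_gain(c("a", "b", "a", "b"), target)
print(c(ig.dependent, ig.independent))
```

The first attribute yields a gain of 1 bit, the second a gain of 0.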
Cohen's Kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories.
D2MCS::MeasureFunction
-> Kappa
new()
Method for initializing the object arguments during runtime.
Kappa$new(performance.output = NULL)
performance.output
An optional ConfMatrix
used as
basis to compute the performance.
compute()
The function computes the Kappa achieved by the M.L. model.
Kappa$compute(performance.output = NULL)
performance.output
An optional ConfMatrix
parameter
to define the type of object used as basis to compute the Kappa
measure.
This function is automatically invoked by the ClassificationOutput object.
A numeric vector of size 1 or NULL if an error occurred.
clone()
The objects of this class are cloneable with this method.
Kappa$clone(deep = FALSE)
deep
Whether to make a deep clone.
MeasureFunction
, ClassificationOutput
,
ConfMatrix
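As a worked example, Cohen's Kappa can be computed from a confusion matrix as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The base R sketch below uses a plain matrix with made-up counts instead of a ConfMatrix object; cohen_kappa is a hypothetical helper.

```r
# Hypothetical helper: Cohen's Kappa from a square confusion matrix.
cohen_kappa <- function(conf) {
  n <- sum(conf)
  observed.agreement <- sum(diag(conf)) / n
  expected.agreement <- sum(rowSums(conf) * colSums(conf)) / n^2
  (observed.agreement - expected.agreement) / (1 - expected.agreement)
}

# Toy counts: rows are predictions, columns are observations.
conf <- matrix(c(20, 5, 10, 15), nrow = 2,
               dimnames = list(Predicted = c("pos", "neg"),
                               Observed = c("pos", "neg")))
kappa.value <- cohen_kappa(conf)
print(kappa.value)
```

Here the observed agreement is 0.7 and the chance agreement is 0.5, which gives a Kappa of 0.4.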
Performs the feature-clustering using Kendall correlation tests.
The method estimates the association between paired samples and computes a test of the value being zero. It uses a measure of association in the range [-1, 1], with 0 indicating no association. The method is valid only for bi-class problems.
D2MCS::GenericHeuristic
-> KendallHeuristic
new()
Empty function used to initialize the object arguments in runtime.
KendallHeuristic$new()
heuristic()
Test for association between paired samples using Kendall's tau value.
KendallHeuristic$heuristic(col1, col2, column.names = NULL)
A numeric vector of length 1 or NA if an error occurs.
clone()
The objects of this class are cloneable with this method.
KendallHeuristic$clone(deep = FALSE)
deep
Whether to make a deep clone.
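Kendall's tau for two paired columns is available in base R through cor.test with method = "kendall". The sketch below uses made-up columns; whether the heuristic returns the tau estimate or another quantity of the test is not specified here, so reading off the estimate is an assumption.

```r
# Sketch with toy data: Kendall's tau between two paired columns.
col1 <- c(1, 2, 3, 4, 5)
col2 <- c(1, 3, 2, 5, 4)

result <- cor.test(col1, col2, method = "kendall")
tau <- unname(result$estimate)
print(tau)
```

Of the 10 pairs of observations, 8 are concordant and 2 are discordant, so tau is (8 - 2) / 10 = 0.6.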
The Matthews correlation coefficient is used in machine learning as a measure of the quality of binary (two-class) classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes. The MCC is in essence a correlation coefficient between the observed and predicted binary classifications; it returns a value between -1 and +1.
D2MCS::MeasureFunction
-> MCC
new()
Method for initializing the object arguments during runtime.
MCC$new(performance.output = NULL)
performance.output
An optional ConfMatrix
parameter
used as basis to compute the MCC
measure.
compute()
The function computes the MCC achieved by the M.L. model.
MCC$compute(performance.output = NULL)
performance.output
An optional ConfMatrix
parameter
to define the type of object used as basis to compute the MCC
measure.
This function is automatically invoked by the ClassificationOutput object.
A numeric vector of size 1 or NULL if an error occurred.
clone()
The objects of this class are cloneable with this method.
MCC$clone(deep = FALSE)
deep
Whether to make a deep clone.
MeasureFunction
, ClassificationOutput
,
ConfMatrix
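For reference, the Matthews correlation coefficient of a binary confusion matrix is (TP * TN - FP * FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)). The base R sketch below applies this formula to made-up counts; the helper mcc and its handling of a zero denominator (returning NA) are assumptions, not the package's behaviour.

```r
# Hypothetical helper: Matthews correlation coefficient from the four cells
# of a binary confusion matrix. Returning NA when the denominator is zero is
# an assumption made for this sketch.
mcc <- function(tp, tn, fp, fn) {
  denominator <- sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  if (denominator == 0) {
    return(NA_real_)
  }
  (tp * tn - fp * fn) / denominator
}

score <- mcc(tp = 3, tn = 3, fp = 1, fn = 1)
print(score)
```

With these counts the numerator is 9 - 1 = 8 and the denominator is sqrt(4 * 4 * 4 * 4) = 16, giving an MCC of 0.5.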
Performs the feature-clustering using MCC score. Valid for both bi-class and multi-class problems.
D2MCS::GenericHeuristic
-> MCCHeuristic
new()
Empty function used to initialize the object arguments in runtime.
MCCHeuristic$new()
heuristic()
Calculates the Matthews correlation coefficient (MCC) score.
MCCHeuristic$heuristic(col1, col2, column.names = NULL)
A numeric vector of length 1 or NA if an error occurs.
clone()
The objects of this class are cloneable with this method.
MCCHeuristic$clone(deep = FALSE)
deep
Whether to make a deep clone.
Abstract class used as a template to define new M.L. performance measures.
The MeasureFunction is a fully abstract class, so it cannot be instantiated. To ensure proper operation, the compute method is automatically invoked by the D2MCS framework when needed.
new()
Method for initializing the object arguments during runtime.
MeasureFunction$new(performance = NULL)
performance
An optional ConfMatrix
parameter to
define the type of object used to compute the measure.
compute()
The function implements the metric used to measure the performance achieved by the M.L. model.
MeasureFunction$compute(performance.output = NULL)
performance.output
An optional ConfMatrix
parameter
to define the type of object used to compute the measure.
This function is automatically invoked by the D2MCS framework.
A numeric vector of size 1 or NULL if an error occurred.
clone()
The objects of this class are cloneable with this method.
MeasureFunction$clone(deep = FALSE)
deep
Whether to make a deep clone.
Abstract class used as a template to define new customized strategies to combine the probability predictions made by different metrics.
new()
Method for initializing the object arguments during runtime.
Methodology$new(required.metrics)
required.metrics
A character vector of length greater than 2 with the name of the required metrics.
getRequiredMetrics()
The function returns the required metrics that will participate in the methodology to compute a metric based on all of them.
Methodology$getRequiredMetrics()
A character vector of length greater than 2 with the name of the required metrics.
compute()
Function to compute the probability of the final prediction based on different metrics.
Methodology$compute(raw.pred, prob.pred, positive.class, negative.class)
raw.pred
A character list of length greater than 2 with the class value of the predictions made by the metrics.
prob.pred
A numeric list of length greater than 2 with the probability of the predictions made by the metrics.
positive.class
A character with the value of the positive class.
negative.class
A character with the value of the negative class.
A numeric value indicating the probability that the instance is predicted as the positive class.
clone()
The objects of this class are cloneable with this method.
Methodology$clone(deep = FALSE)
deep
Whether to make a deep clone.
Determines whether the positive class is predicted by any of the metrics; otherwise, the instance is not considered to belong to the positive class.
D2MCS::CombinedMetrics
-> MinimizeFN
new()
Method for initializing the object arguments during runtime.
MinimizeFN$new(required.metrics = c("MCC", "PPV"))
required.metrics
A character vector of length 1 with the name of the required metrics.
getFinalPrediction()
Function to obtain the final prediction based on different metrics.
MinimizeFN$getFinalPrediction( raw.pred, prob.pred, positive.class, negative.class )
raw.pred
A character list of length 2 or greater with the class value of the predictions made by the metrics.
prob.pred
A numeric list of length 2 or greater with the probability of the predictions made by the metrics.
positive.class
A character with the value of the positive class.
negative.class
A character with the value of the negative class.
A logical value indicating if the instance is predicted as positive class or not.
clone()
The objects of this class are cloneable with this method.
MinimizeFN$clone(deep = FALSE)
deep
Whether to make a deep clone.
Predicts the positive class only if all the metrics predict it; otherwise, the instance is not assigned the positive class.
D2MCS::CombinedMetrics
-> MinimizeFP
new()
Method for initializing the object arguments during runtime.
MinimizeFP$new(required.metrics = c("MCC", "PPV"))
required.metrics
A character vector of length 2 or greater with the names of the required metrics.
getFinalPrediction()
Function to obtain the final prediction based on different metrics.
MinimizeFP$getFinalPrediction( raw.pred, prob.pred, positive.class, negative.class )
raw.pred
A character list of length 2 or greater with the class value of the predictions made by the metrics.
prob.pred
A numeric list of length 2 or greater with the probability of the predictions made by the metrics.
positive.class
A character with the value of the positive class.
negative.class
A character with the value of the negative class.
A logical value indicating if the instance is predicted as positive class or not.
clone()
The objects of this class are cloneable with this method.
MinimizeFP$clone(deep = FALSE)
deep
Whether to make a deep clone.
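The "all metrics" rule above and the "any metric" rule of MinimizeFN can be sketched in base R as follows. This is only an illustration under stated assumptions: it does not call the package, and the helper names `predicts_positive_all` and `predicts_positive_any`, as well as the sample class labels, are hypothetical.

```r
# Hypothetical helpers, not part of D2MCS.
# raw.pred: list with one class prediction per metric.
predicts_positive_all <- function(raw.pred, positive.class) {
  all(unlist(raw.pred) == positive.class)   # every metric must agree
}
predicts_positive_any <- function(raw.pred, positive.class) {
  any(unlist(raw.pred) == positive.class)   # one metric is enough
}

raw.pred <- list(MCC = "spam", PPV = "ham")  # illustrative, disagreeing predictions
predicts_positive_all(raw.pred, positive.class = "spam")
predicts_positive_any(raw.pred, positive.class = "spam")
```

When the metrics disagree, the "all" rule rejects the positive class (favouring fewer false positives), while the "any" rule accepts it (favouring fewer false negatives).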
Performs the feature-clustering using mutual information. Only valid for bi-class problems.
D2MCS::GenericHeuristic
-> MultinformationHeuristic
new()
Empty function used to initialize the object arguments in runtime.
MultinformationHeuristic$new()
heuristic()
Mutinformation takes two random variables as input and computes the mutual information in nats according to the entropy estimator method.
MultinformationHeuristic$heuristic(col1, col2, column.names = NULL)
col1
A vector/factor denoting a random variable or a data.frame denoting a random vector where columns contain variables/features and rows contain outcomes/samples.
col2
Another random variable or random vector (vector/factor or data.frame).
column.names
An optional character vector with the names of both columns.
Returns the mutual information I(X;Y) in nats.
clone()
The objects of this class are cloneable with this method.
MultinformationHeuristic$clone(deep = FALSE)
deep
Whether to make a deep clone.
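As an illustration of what "mutual information in nats" means, the following base-R sketch computes I(X;Y) from empirical frequencies using natural logarithms. It is an assumption-laden example: the helper `mutual_information` is hypothetical, it uses the plain empirical (plug-in) estimator, and the package may rely on a different estimator internally.

```r
# Hypothetical helper: empirical mutual information I(X;Y) in nats.
mutual_information <- function(x, y) {
  joint <- table(x, y) / length(x)     # joint probabilities p(x, y)
  px <- rowSums(joint)                 # marginal p(x)
  py <- colSums(joint)                 # marginal p(y)
  expected <- outer(px, py)            # p(x) * p(y) under independence
  nonzero <- joint > 0                 # empty cells contribute nothing
  sum(joint[nonzero] * log(joint[nonzero] / expected[nonzero]))
}

x <- c(0, 0, 1, 1)
mi.same  <- mutual_information(x, x)             # identical variables
mi.indep <- mutual_information(x, c(0, 1, 0, 1)) # independent variables
```

For a balanced binary variable compared with itself, the result equals its entropy, log(2) nats; for the independent pair it is zero.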
Computes the performance across resamples when class probabilities cannot be computed.
D2MCS::SummaryFunction
-> NoProbability
new()
The function defines, at runtime, the use of five measures: 'Kappa', 'Accuracy', 'TCR_9', 'MCC' and 'PPV'.
NoProbability$new()
execute()
The function computes the performance across resamples using the previously defined measures.
NoProbability$execute(data, lev = NULL, model = NULL)
data
A data.frame containing the data used to compute the performance.
lev
An optional value used to define the levels of the target class.
model
An optional value used to define the M.L. model used.
A vector of performance estimates.
clone()
The objects of this class are cloneable with this method.
NoProbability$clone(deep = FALSE)
deep
Whether to make a deep clone.
Negative Predictive Values are the proportions of negative results in statistics and diagnostic tests that are true negative results.
D2MCS::MeasureFunction
-> NPV
new()
Method for initializing the object arguments during runtime.
NPV$new(performance.output = NULL)
performance.output
An optional ConfMatrix parameter to define the type of object used as basis to compute the NPV measure.
compute()
The function computes the NPV achieved by the M.L. model.
NPV$compute(performance.output = NULL)
performance.output
An optional ConfMatrix parameter to define the type of object used as basis to compute the NPV measure.
This function is automatically invoked by the ClassificationOutput object.
A numeric vector of size 1 or NULL if an error occurred.
clone()
The objects of this class are cloneable with this method.
NPV$clone(deep = FALSE)
deep
Whether to make a deep clone.
MeasureFunction, ClassificationOutput, ConfMatrix
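The NPV definition above reduces to a simple ratio of confusion-matrix counts. A minimal sketch in base R, assuming hypothetical counts rather than a real ConfMatrix object:

```r
# Hypothetical counts taken from a confusion matrix.
tn <- 40  # true negatives: negative predictions that are correct
fn <- 10  # false negatives: negative predictions that are wrong

# NPV: share of negative predictions that are actually negative.
npv <- tn / (tn + fn)
npv
```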
Performs the feature-clustering using Odds Ratio methodology. Valid only for bi-class problems.
D2MCS::GenericHeuristic
-> OddsRatioHeuristic
new()
Empty function used to initialize the object arguments in runtime.
OddsRatioHeuristic$new()
heuristic()
Calculates the Odds Ratio method.
OddsRatioHeuristic$heuristic(col1, col2, column.names = NULL)
A numeric vector of length 1 or NA if an error occurs.
clone()
The objects of this class are cloneable with this method.
OddsRatioHeuristic$clone(deep = FALSE)
deep
Whether to make a deep clone.
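For reference, the classical sample odds ratio of two binary features can be computed from their 2x2 contingency table. The sketch below is an assumption: `odds_ratio` is a hypothetical helper, and the value returned by the package's heuristic may be derived differently.

```r
# Hypothetical helper: sample odds ratio of two binary (0/1) vectors.
odds_ratio <- function(col1, col2) {
  counts <- table(factor(col1, levels = c(0, 1)),
                  factor(col2, levels = c(0, 1)))
  # (both 1) * (both 0) divided by (only col1 is 1) * (only col2 is 1)
  (counts["1", "1"] * counts["0", "0"]) / (counts["1", "0"] * counts["0", "1"])
}

col1 <- c(1, 1, 1, 0, 0, 0, 1, 0)
col2 <- c(1, 1, 0, 0, 0, 1, 1, 0)
or.value <- odds_ratio(col1, col2)
```

Values well above 1 indicate that the two features tend to take the value 1 together; a value near 1 suggests independence.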
Performs the feature-clustering using Pearson correlation tests. Valid for both bi-class and multi-class problems.
The test statistic is based on Pearson's product moment correlation coefficient cor(x, y) and follows a t distribution with length(x)-2 degrees of freedom if the samples follow independent normal distributions. If there are at least 4 complete pairs of observations, an asymptotic confidence interval is given based on Fisher's Z transform.
D2MCS::GenericHeuristic
-> PearsonHeuristic
new()
Creates a PearsonHeuristic object.
PearsonHeuristic$new()
heuristic()
Test for association between paired samples using Pearson test.
PearsonHeuristic$heuristic(col1, col2, column.names = NULL)
A numeric vector of length 1 or NA if an error occurs.
clone()
The objects of this class are cloneable with this method.
PearsonHeuristic$clone(deep = FALSE)
deep
Whether to make a deep clone.
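The test described above matches the behaviour of `cor.test()` from base R's stats package. The sketch below runs it on simulated data; the data are illustrative, and which component of the test result the heuristic actually returns is not stated here, so no assumption is made about it.

```r
set.seed(1)                       # reproducible illustrative data
x <- rnorm(30)
y <- 2 * x + rnorm(30, sd = 0.5)  # y depends linearly on x, plus noise

result <- cor.test(x, y, method = "pearson")
result$estimate    # Pearson's product-moment correlation coefficient
result$parameter   # degrees of freedom: length(x) - 2
result$conf.int    # confidence interval based on Fisher's Z transform
```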
Positive Predictive Values are the proportions of positive results in statistics and diagnostic tests that are true positive results.
D2MCS::MeasureFunction
-> PPV
new()
Method for initializing the object arguments during runtime.
PPV$new(performance.output = NULL)
performance.output
An optional ConfMatrix parameter to define the type of object used as basis to compute the PPV measure.
compute()
The function computes the PPV achieved by the M.L. model.
PPV$compute(performance.output = NULL)
performance.output
An optional ConfMatrix parameter to define the type of object used as basis to compute the PPV measure.
This function is automatically invoked by the ClassificationOutput object.
A numeric vector of size 1 or NULL if an error occurred.
clone()
The objects of this class are cloneable with this method.
PPV$clone(deep = FALSE)
deep
Whether to make a deep clone.
MeasureFunction, ClassificationOutput, ConfMatrix
Precision is the fraction of relevant instances among the retrieved instances.
D2MCS::MeasureFunction
-> Precision
new()
Method for initializing the object arguments during runtime.
Precision$new(performance.output = NULL)
performance.output
An optional ConfMatrix parameter to define the type of object used as basis to compute the measure.
compute()
The function computes the Precision achieved by the M.L. model.
Precision$compute(performance.output = NULL)
performance.output
An optional ConfMatrix parameter to define the type of object used as basis to compute the Precision measure.
This function is automatically invoked by the ClassificationOutput object.
A numeric vector of size 1 or NULL if an error occurred.
clone()
The objects of this class are cloneable with this method.
Precision$clone(deep = FALSE)
deep
Whether to make a deep clone.
MeasureFunction, ClassificationOutput, ConfMatrix
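In classification terms, "retrieved" instances are those predicted as positive and "relevant" ones are the actual positives. A minimal base-R sketch with hypothetical label vectors (it does not use the package's ConfMatrix):

```r
# Illustrative predicted and actual labels.
predicted <- c("pos", "pos", "neg", "pos", "neg")
actual    <- c("pos", "neg", "neg", "pos", "pos")

tp <- sum(predicted == "pos" & actual == "pos")  # retrieved and relevant
fp <- sum(predicted == "pos" & actual == "neg")  # retrieved but not relevant

precision <- tp / (tp + fp)
precision
```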
The class is used to encapsulate all the computed predictions to facilitate their access and maintenance.
new()
Method for initializing the object arguments during runtime.
PredictionOutput$new(predictions, type, target)
predictions
type
A character to define which type of predictions should be returned. If not defined, all types of predictions will be returned. Conversely, if "prob" or "raw" is defined, then the computed 'probabilistic' or 'class' values are returned.
target
A character defining the value of the positive class.
getPredictions()
The function returns the final predictions.
PredictionOutput$getPredictions()
A list containing the final predictions or NULL if classification stage was not successfully performed.
getType()
The function returns the type of prediction that should be returned. If "prob" or "raw" is defined, then the computed 'probabilistic' or 'class' values are returned.
PredictionOutput$getType()
A character value.
getTarget()
The function returns the value of the target class.
PredictionOutput$getTarget()
A character value.
clone()
The objects of this class are cloneable with this method.
PredictionOutput$clone(deep = FALSE)
deep
Whether to make a deep clone.
Computes the final prediction as the mean of the probabilities achieved by each cluster prediction.
D2MCS::SimpleVoting
-> ProbAverageVoting
new()
Method for initializing the object arguments during runtime.
ProbAverageVoting$new(cutoff = 0.5, class.tie = NULL, majority.class = NULL)
cutoff
A numeric value defining the minimum probability used to perform a positive classification. If not defined, 0.5 is used as the default value.
class.tie
A character used to define the target class value used when a tie is found. If NULL, the positive class value is assigned.
majority.class
A character defining the value of the majority class. If NULL, the same value as in the training stage is used.
getMajorityClass()
The function returns the value of the majority class.
ProbAverageVoting$getMajorityClass()
A character vector of length 1 with the name of the majority class.
getClassTie()
The function gets the class value assigned to solve ties.
ProbAverageVoting$getClassTie()
A character vector of length 1.
execute()
The function implements the probability-averaging voting procedure.
ProbAverageVoting$execute(predictions, verbose = FALSE)
predictions
A ClusterPredictions object containing all the predictions achieved for each cluster.
verbose
A logical value to specify if more verbosity is needed.
clone()
The objects of this class are cloneable with this method.
ProbAverageVoting$clone(deep = FALSE)
deep
Whether to make a deep clone.
D2MCS, ClassMajorityVoting, ClassWeightedVoting, ProbAverageVoting, ProbAverageWeightedVoting, ProbBasedMethodology
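The averaging idea behind this voting scheme can be sketched in base R. The probability matrix and the class labels are hypothetical, and treating a mean equal to the cutoff as positive is an assumption; the sketch does not reproduce the package's tie handling.

```r
# Hypothetical positive-class probabilities:
# one row per instance, one column per cluster.
probs <- matrix(c(0.9, 0.6, 0.3,
                  0.2, 0.4, 0.3),
                nrow = 2, byrow = TRUE,
                dimnames = list(NULL, c("cluster1", "cluster2", "cluster3")))
cutoff <- 0.5

mean.prob <- rowMeans(probs)   # average probability per instance
final <- ifelse(mean.prob >= cutoff, "positive", "negative")
final
```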
Computes the final prediction by performing the weighted mean of the probability achieved by each cluster prediction. By default, weight values are consistent with the performance value achieved by the best M.L. model on each cluster.
D2MCS::SimpleVoting
-> ProbAverageWeightedVoting
new()
Method for initializing the object arguments during runtime.
ProbAverageWeightedVoting$new(cutoff = 0.5, class.tie = NULL, weights = NULL)
cutoff
A numeric value defining the minimum probability used to perform a positive classification. If not defined, 0.5 is used as the default value.
class.tie
A character used to define the target class value used when a tie is found. If NULL, the positive class value is assigned.
weights
A numeric vector with the weights of each cluster. If NULL, the performance achieved during training is used as the default.
getClassTie()
The function gets the class value assigned to solve ties.
ProbAverageWeightedVoting$getClassTie()
A character vector of length 1.
getWeights()
The function returns the weights assigned to each cluster.
ProbAverageWeightedVoting$getWeights()
A numeric vector with the weights of each cluster.
setWeights()
The function allows changing the value of the weights.
ProbAverageWeightedVoting$setWeights(weights)
weights
A numeric vector containing the new weights.
execute()
The function implements the cluster-weighted probabilistic voting procedure.
ProbAverageWeightedVoting$execute(predictions, verbose = FALSE)
predictions
A ClusterPredictions object containing all the predictions achieved for each cluster.
verbose
A logical value to specify if more verbosity is needed.
clone()
The objects of this class are cloneable with this method.
ProbAverageWeightedVoting$clone(deep = FALSE)
deep
Whether to make a deep clone.
D2MCS, ClassMajorityVoting, ClassWeightedVoting, ProbAverageVoting, ProbAverageWeightedVoting, ProbBasedMethodology
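The effect of the cluster weights can be illustrated with `weighted.mean()` from base R. The probabilities and weights below are hypothetical stand-ins for per-cluster predictions and per-cluster training performance.

```r
# Hypothetical positive-class probabilities of one instance, per cluster.
probs   <- c(cluster1 = 0.9, cluster2 = 0.4, cluster3 = 0.3)
# Hypothetical weights, e.g. the performance of the best model per cluster.
weights <- c(0.8, 0.6, 0.6)

plain.prob    <- mean(probs)                    # unweighted average
weighted.prob <- weighted.mean(probs, weights)  # better clusters count more
```

Here the most confident cluster also has the largest weight, so the weighted average ends up higher than the plain one.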
Calculates the mean of the probabilities of the different metrics.
D2MCS::Methodology
-> ProbBasedMethodology
new()
Method for initializing the object arguments during runtime.
ProbBasedMethodology$new(required.metrics = c("MCC", "PPV"))
required.metrics
A character vector of length 2 or greater with the names of the required metrics.
compute()
Function to compute the probability of the final prediction based on different metrics.
ProbBasedMethodology$compute( raw.pred, prob.pred, positive.class, negative.class )
raw.pred
A character list of length 2 or greater with the class value of the predictions made by the metrics.
prob.pred
A numeric list of length 2 or greater with the probability of the predictions made by the metrics.
positive.class
A character with the value of the positive class.
negative.class
A character with the value of the negative class.
A numeric value indicating the probability that the instance is predicted as the positive class.
clone()
The objects of this class are cloneable with this method.
ProbBasedMethodology$clone(deep = FALSE)
deep
Whether to make a deep clone.
Recall (also known as sensitivity) is the fraction of the total amount of relevant instances that were actually retrieved.
D2MCS::MeasureFunction
-> Recall
new()
Method for initializing the object arguments during runtime.
Recall$new(performance.output = NULL)
performance.output
An optional ConfMatrix parameter to define the type of object used as basis to compute the measure.
compute()
The function computes the Recall achieved by the M.L. model.
Recall$compute(performance.output = NULL)
performance.output
An optional ConfMatrix parameter to define the type of object used as basis to compute the Recall measure.
This function is automatically invoked by the ClassificationOutput object.
A numeric vector of size 1 or NULL if an error occurred.
clone()
The objects of this class are cloneable with this method.
Recall$clone(deep = FALSE)
deep
Whether to make a deep clone.
MeasureFunction, ClassificationOutput, ConfMatrix
Sensitivity is a measure of the proportion of actual positive cases that got predicted as positive (or true positive).
D2MCS::MeasureFunction
-> Sensitivity
new()
Method for initializing the object arguments during runtime.
Sensitivity$new(performance.output = NULL)
performance.output
An optional ConfMatrix parameter to define the type of object used as basis to compute the Sensitivity measure.
compute()
The function computes the Sensitivity achieved by the M.L. model.
Sensitivity$compute(performance.output = NULL)
performance.output
An optional ConfMatrix parameter to define the type of object used as basis to compute the Sensitivity measure.
This function is automatically invoked by the ClassificationOutput object.
A numeric vector of size 1 or NULL if an error occurred.
clone()
The objects of this class are cloneable with this method.
Sensitivity$clone(deep = FALSE)
deep
Whether to make a deep clone.
MeasureFunction, ClassificationOutput, ConfMatrix
Features are sorted in descending order of the relevance value obtained after applying a specific heuristic. Next, the features are distributed into N clusters following a card-dealing methodology. Finally, the distribution with the highest homogeneity is selected as the best distribution.
The strategy is suitable for all features that are valid for the indicated heuristics. Invalid features are automatically grouped into a specific cluster named 'unclustered'.
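The sort-then-deal step can be sketched in base R as follows. The relevance values and the fixed number of clusters are hypothetical, and the homogeneity criterion used to choose the best N is not shown because it is not defined here.

```r
# Hypothetical relevance values produced by some heuristic.
relevance <- c(f1 = 0.10, f2 = 0.90, f3 = 0.40, f4 = 0.70, f5 = 0.20, f6 = 0.60)
num.clusters <- 3

# Sort features in descending order of relevance.
sorted <- names(sort(relevance, decreasing = TRUE))

# Deal features like cards: 1st to cluster 1, 2nd to cluster 2, ...,
# wrapping around once every cluster has received a feature.
dealt.to <- rep_len(seq_len(num.clusters), length(sorted))
clusters <- unname(split(sorted, dealt.to))
clusters
```

Dealing in rounds spreads the most relevant features across clusters instead of concentrating them in the first one.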
D2MCS::GenericClusteringStrategy
-> SimpleStrategy
new()
Method for initializing the object arguments during runtime.
SimpleStrategy$new( subset, heuristic, configuration = StrategyConfiguration$new() )
subset
The Subset used to apply the feature-clustering strategy.
heuristic
The heuristic used to compute the relevance of each feature. Must inherit from GenericHeuristic abstract class.
configuration
Optional parameter to customize configuration parameters for the strategy. Must inherit from StrategyConfiguration abstract class.
execute()
Function responsible for performing the clustering strategy over the defined Subset.
SimpleStrategy$execute(verbose = FALSE)
verbose
A logical value to specify if more verbosity is needed.
getBestClusterDistribution()
The function obtains the best clustering distribution.
SimpleStrategy$getBestClusterDistribution()
A list of clusters. Each list element represents a feature group.
getUnclustered()
The function is used to return the features that cannot be clustered due to incompatibilities with the used heuristic.
SimpleStrategy$getUnclustered()
A character vector containing the unclassified features.
getDistribution()
Function used to obtain a specific cluster distribution.
SimpleStrategy$getDistribution( num.clusters = NULL, num.groups = NULL, include.unclustered = FALSE )
A list with the features comprising a specific clustering distribution.
createTrain()
The function is used to create a Trainset object from a specific clustering distribution.
SimpleStrategy$createTrain( subset, num.clusters = NULL, num.groups = NULL, include.unclustered = FALSE )
subset
The Subset object used as a basis to create the train set (see Trainset class).
num.clusters
A numeric value to select the number of clusters (define the distribution).
num.groups
A single numeric value or a numeric vector identifying the specific groups that form the clustering distribution.
include.unclustered
A logical value to determine if unclustered features should be included.
If num.clusters and num.groups are not defined, the best clustering distribution is used to create the train set.
A Trainset object.
plot()
The function is responsible for creating a plot to visualize the clustering distribution.
SimpleStrategy$plot(dir.path = NULL, file.name = NULL)
dir.path
An optional argument to define the name of the directory where the exported plot will be saved. If not defined, the file path will be automatically assigned to the current working directory, 'getwd()'.
file.name
A character to define the name of the PDF file where the plot is exported.
saveCSV()
The function is used to save the clustering distribution to a CSV file.
SimpleStrategy$saveCSV(dir.path, name = NULL, num.clusters = NULL)
dir.path
The name of the directory to save the CSV file.
name
Defines the name of the CSV file.
num.clusters
An optional parameter to select the number of clusters to be saved. If not defined, all cluster distributions will be saved.
clone()
The objects of this class are cloneable with this method.
SimpleStrategy$clone(deep = FALSE)
deep
Whether to make a deep clone.
GenericClusteringStrategy, StrategyConfiguration
Abstract class used as a template to define new customized simple voting schemes.
new()
Method for initializing the object arguments during runtime.
SimpleVoting$new(cutoff = NULL)
cutoff
A numeric value defining the minimum probability used to perform a positive classification. If not defined, 0.5 is used as the default value.
getCutoff()
The function obtains the minimum probabilistic value used to perform a positive classification.
SimpleVoting$getCutoff()
A numeric value.
getFinalPred()
The function is used to return the prediction values computed by a voting strategy.
SimpleVoting$getFinalPred(type = NULL, target = NULL, filter = NULL)
type
A character to define which type of predictions should be returned. If not defined, all types of predictions will be returned. Conversely, if 'prob' or 'raw' is defined, then the computed 'probabilistic' or 'class' values are returned.
target
A character defining the value of the positive class.
filter
A logical value used to specify if only predictions matching the target value should be returned or not. If TRUE the function returns only the predictions matching the target value. Conversely if FALSE (by default) the function returns all the predictions.
A FinalPred object.
execute()
Abstract function used to implement the operation of the voting scheme.
SimpleVoting$execute(predictions, verbose = FALSE)
predictions
A ClusterPredictions object containing all the predictions achieved for each cluster.
verbose
A logical value to specify if more verbosity is needed.
clone()
The objects of this class are cloneable with this method.
SimpleVoting$clone(deep = FALSE)
deep
Whether to make a deep clone.
D2MCS, ClassMajorityVoting, ClassWeightedVoting, ProbAverageVoting, ProbAverageWeightedVoting, ProbBasedMethodology, CombinedVoting
The class is responsible for initializing and executing voting schemes. Additionally, to ensure proper operation, the class automatically checks the compatibility of the defined voting schemes.
D2MCS::VotingStrategy
-> SingleVoting
new()
The function initializes the object arguments during runtime.
SingleVoting$new(voting.schemes, metrics)
voting.schemes
A vector of voting schemes inheriting from SimpleVoting class.
metrics
A list containing the metrics used as basis to perform the voting strategy.
execute()
The function is used to execute all the previously defined (and compatible) voting schemes.
SingleVoting$execute(predictions, verbose = FALSE)
predictions
A ClusterPredictions object containing all the predictions computed in the classification stage.
verbose
A logical value to specify if more verbosity is needed.
clone()
The objects of this class are cloneable with this method.
SingleVoting$clone(deep = FALSE)
deep
Whether to make a deep clone.
D2MCS, SimpleVoting, CombinedVoting
Performs the feature-clustering using Spearman's rho statistic.
Spearman's rho statistic is used to estimate a rank-based measure of association. This test may be used if the data do not necessarily come from a bivariate normal distribution.
D2MCS::GenericHeuristic
-> SpearmanHeuristic
new()
Creates a SpearmanHeuristic object.
SpearmanHeuristic$new()
heuristic()
Test for correlation between paired samples using Spearman rho statistic.
SpearmanHeuristic$heuristic(col1, col2, column.names = NULL)
A numeric vector of length 1 or NA if an error occurs.
clone()
The objects of this class are cloneable with this method.
SpearmanHeuristic$clone(deep = FALSE)
deep
Whether to make a deep clone.
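Because Spearman's rho works on ranks, it captures any monotonic relationship, not only linear ones. The base-R sketch below contrasts it with Pearson's coefficient on illustrative data; it does not call the heuristic itself.

```r
x <- c(1, 2, 3, 4, 5, 6)
y <- x^3   # monotonic but clearly non-linear relationship

rho.spearman <- cor(x, y, method = "spearman")  # computed on ranks
rho.pearson  <- cor(x, y, method = "pearson")   # computed on raw values
```

The ranks of `x` and `y` coincide, so the Spearman coefficient is 1, while the Pearson coefficient stays below 1 because the relationship is not linear.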
Specificity is the proportion of actual negatives that are predicted as negative (true negatives). The remaining actual negatives, which are predicted as positive, are the false positives.
D2MCS::MeasureFunction
-> Specificity
new()
Method for initializing the object arguments during runtime.
Specificity$new(performance.output = NULL)
performance.output
An optional ConfMatrix parameter to define the type of object used as basis to compute the measure.
compute()
The function computes the Specificity achieved by the M.L. model.
Specificity$compute(performance.output = NULL)
performance.output
An optional ConfMatrix parameter to define the type of object used as basis to compute the Specificity measure.
This function is automatically invoked by the ClassificationOutput object.
A numeric vector of size 1 or NULL if an error occurred.
clone()
The objects of this class are cloneable with this method.
Specificity$clone(deep = FALSE)
deep
Whether to make a deep clone.
MeasureFunction, ClassificationOutput, ConfMatrix
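The relationship between specificity and the false positive proportion can be checked with a short base-R sketch. The counts are hypothetical and are not taken from a real ConfMatrix.

```r
# Hypothetical counts among the actual negatives.
tn <- 45  # actual negatives predicted as negative
fp <- 5   # actual negatives predicted as positive

specificity         <- tn / (tn + fp)
false.positive.rate <- fp / (tn + fp)

# Both proportions cover all actual negatives, so they add up to 1.
specificity + false.positive.rate
```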
Defines default configuration parameters for the clustering strategies.
The StrategyConfiguration can be used to define the default configuration parameters for a feature clustering strategy or as an archetype to define new customized parameters.
new()
Empty function used to initialize the object arguments in runtime.
StrategyConfiguration$new()
minNumClusters()
Function used to return the minimum number of cluster distributions used. By default, the minimum is set to 2.
StrategyConfiguration$minNumClusters(...)
...
Further arguments passed down to minNumClusters function.
A numeric vector of length 1.
maxNumClusters()
The function is responsible for returning the maximum number of cluster distributions used. By default, the maximum is set to 50.
StrategyConfiguration$maxNumClusters(...)
...
Further arguments passed down to maxNumClusters function.
A numeric vector of length 1.
clone()
The objects of this class are cloneable with this method.
StrategyConfiguration$clone(deep = FALSE)
deep
Whether to make a deep clone.
DependencyBasedStrategyConfiguration
The Subset is used for testing or classification purposes. If a target class is defined, the Subset can be used for both testing and classification; otherwise, it is only compatible with classification.
Use Dataset to ensure the creation of a valid Subset object.
new()
Method for initializing the object arguments during runtime.
Subset$new( dataset, class.index = NULL, class.values = NULL, positive.class = NULL, feature.id = NULL )
dataset
A fully filled data.frame.
class.index
A numeric value identifying the column representing the target class.
class.values
A character vector containing all the values of the target class.
positive.class
A character value representing the positive class value.
feature.id
A numeric value specifying the column number used as identifier.
getColumnNames()
Get the name of the columns comprising the subset.
Subset$getColumnNames()
A character vector containing the name of each column.
getFeatures()
Gets the values of all features or those indicated by arguments.
Subset$getFeatures(feature.names = NULL)
feature.names
A character vector comprising the name of the features to be obtained.
A character vector or NULL if subset is empty.
getID()
Gets the column name used as identifier.
Subset$getID()
A character vector of size 1 or NULL if column id is not defined.
getIterator()
Creates the DIterator object.
Subset$getIterator(chunk.size = private$chunk.size, verbose = FALSE)
A DIterator object to traverse the Subset instances.
getClassValues()
Gets all the values of the target class.
Subset$getClassValues()
A factor vector with all the values of the target class.
getClassBalance()
The function is used to compute the ratio of each class value in the Subset.
Subset$getClassBalance(target.value = NULL)
target.value
The class value used as reference to perform the comparison.
A numeric value.
getClassIndex()
The function is used to obtain the index of the column containing the target class.
Subset$getClassIndex()
A numeric value.
getClassName()
The function is used to obtain the name of the column containing the target class.
Subset$getClassName()
A character value.
getNcol()
The function is in charge of obtaining the number of columns comprising the Subset. See ncol for more information.
Subset$getNcol()
An integer of length 1 or NULL.
getNrow()
The function is used to determine the number of rows present in the Subset. See nrow for more information.
Subset$getNrow()
An integer of length 1 or NULL.
getPositiveClass()
The function returns the value of the positive class.
Subset$getPositiveClass()
A character vector of size 1 or NULL if not defined.
isBlinded()
The function is used to check if the Subset contains a target class.
Subset$isBlinded()
A logical value where TRUE represents the absence of target class and FALSE its presence.
Dataset, DatasetLoader, Trainset
Abstract class used as a template to define customized metrics to compute model performance during training.
This class is an archetype, so it cannot be instantiated.
new()
The function carries out the initialization of parameters during runtime.
SummaryFunction$new(measures)
measures
A character vector with the measures used.
execute()
Abstract function used to implement the performance calculator method. To guarantee proper operation, this method is automatically invoked by the D2MCS framework.
SummaryFunction$execute()
getMeasures()
The function obtains the measures used to compute the performance across resamples.
SummaryFunction$getMeasures()
A character vector or NULL if measures are not defined.
clone()
The objects of this class are cloneable with this method.
SummaryFunction$clone(deep = FALSE)
deep
Whether to make a deep clone.
This is the number of individuals with a negative condition for which the test result is negative. The value entered here must be non-negative.
D2MCS::MeasureFunction
-> TN
new()
Method for initializing the object arguments during runtime.
TN$new(performance.output = NULL)
performance.output
An optional ConfMatrix parameter to define the type of object used to compute the TN measure.
compute()
The function computes the TN achieved by the M.L. model.
TN$compute(performance.output = NULL)
performance.output
An optional ConfMatrix parameter to define the type of object used as basis to compute the TN measure.
This function is automatically invoked by the ClassificationOutput object.
A numeric vector of size 1 or NULL if an error occurred.
clone()
The objects of this class are cloneable with this method.
TN$clone(deep = FALSE)
deep
Whether to make a deep clone.
MeasureFunction, ClassificationOutput, ConfMatrix
TP is the number of individuals with a positive condition for which the test result is positive. The value entered here must be non-negative.
D2MCS::MeasureFunction
-> TP
new()
Method for initializing the object arguments during runtime.
TP$new(performance.output = NULL)
performance.output
An optional ConfMatrix parameter to define the type of object used to compute the measure.
compute()
The function computes the TP achieved by the M.L. model.
TP$compute(performance.output = NULL)
performance.output
An optional ConfMatrix parameter to define the type of object used as basis to compute the TP measure.
This function is automatically invoked by the ClassificationOutput object.
A numeric vector of size 1 or NULL if an error occurred.
clone()
The objects of this class are cloneable with this method.
TP$clone(deep = FALSE)
deep
Whether to make a deep clone.
MeasureFunction, ClassificationOutput, ConfMatrix
Abstract class used as template to define customized functions to control the computational nuances of train function.
new()
Function used to initialize the object parameters during execution time.
TrainFunction$new( method, number, savePredictions, classProbs, allowParallel, verboseIter, seed )
method
The resampling method: "boot", "boot632", "optimism_boot", "boot_all", "cv", "repeatedcv", "LOOCV", "LGOCV" (for repeated training/test splits), "none" (only fits one model to the entire training set), "oob" (only for random forest, bagged trees, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models), "timeslice", "adaptive_cv", "adaptive_boot" or "adaptive_LGOCV".
number
Either the number of folds or number of resampling iterations
savePredictions
An indicator of how much of the hold-out predictions for each resample should be saved. Values can be either "all", "final", or "none". A logical value can also be used, which converts to "all" (for true) or "none" (for false). "final" saves the predictions for the optimal tuning parameters.
classProbs
A logical value. Should class probabilities be computed for classification models (along with predicted values) in each resample?
allowParallel
A logical value. If a parallel backend is loaded and available, should the function use it?
verboseIter
A logical for printing a training log.
seed
An optional integer that will be used to set the seed during model training stage.
create()
Creates the trainControl object required for the training stage.
TrainFunction$create(summaryFunction, search.method = "grid", class.probs)
summaryFunction
An object inherited from SummaryFunction class.
search.method
Either "grid" or "random", describing how the tuning parameter grid is determined.
class.probs
A logical indicating if class probabilities should be computed for classification models (along with predicted values) in each resample.
getResamplingMethod()
Returns the resampling method used during the training stage.
TrainFunction$getResamplingMethod()
A character vector of length 1 or NULL if not defined.
getNumberFolds()
Returns the number of folds or number of iterations used during training.
TrainFunction$getNumberFolds()
An integer vector of length 1 or NULL if not defined.
getSavePredictions()
Indicates if the predictions for each resample should be saved.
TrainFunction$getSavePredictions()
A logical value or NULL if not defined.
getClassProbs()
Indicates if class probabilities should be computed for classification models in each resample.
TrainFunction$getClassProbs()
A logical value.
getAllowParallel()
Determines if model training is performed in parallel.
TrainFunction$getAllowParallel()
A logical value. TRUE indicates parallelization is enabled and FALSE otherwise.
getVerboseIter()
Determines if training log should be printed.
TrainFunction$getVerboseIter()
A logical value. TRUE indicates training log is enabled and FALSE otherwise.
getTrFunction()
Function used to return the
trainControl
object.
TrainFunction$getTrFunction()
A trainControl
object.
getMeasures()
Returns the measures used to optimize model hyperparameters.
TrainFunction$getMeasures()
A character vector.
getType()
Obtains the type of classification problem ("Bi-class" or "Multi-class").
TrainFunction$getType()
A character vector with length 1. Either "Bi-class" or "Multi-class".
getSeed()
Indicates the seed used during the model training stage.
TrainFunction$getSeed()
An integer value or NULL if not defined.
setSummaryFunction()
Function used to change the SummaryFunction
used in the training stage.
TrainFunction$setSummaryFunction(summaryFunction)
summaryFunction
An object inherited from
SummaryFunction
class.
setClassProbs()
The function enables or disables the computation of class probabilities.
TrainFunction$setClassProbs(class.probs)
class.probs
A logical indicating if class probabilities should be computed for classification models (along with predicted values) in each resample.
clone()
The objects of this class are cloneable with this method.
TrainFunction$clone(deep = FALSE)
deep
Whether to make a deep clone.
This class manages the results achieved during the training stage (such as optimized hyperparameters, model information and the metrics used).
new()
Function used to initialize the object arguments during runtime.
TrainOutput$new(models, class.values, positive.class)
getModels()
The function is used to obtain the best M.L. model of each cluster.
TrainOutput$getModels(metric)
metric
A character vector which specifies the metric(s) used for configuring M.L. hyperparameters.
A list of objects of class train.
getPerformance()
The function returns the performance value of M.L. models during training stage.
TrainOutput$getPerformance(metrics = NULL)
metrics
A character vector which specifies the metric(s) used to train the M.L. models.
A character vector containing the metrics used for configuring M.L. hyperparameters.
savePerformance()
The function is used to save to a CSV file the performance achieved by the M.L. models during the training stage.
TrainOutput$savePerformance(dir.path, metrics = NULL)
dir.path
The location where the CSV file containing the performance of the trained M.L. models will be stored.
metrics
An optional parameter specifying the metric(s) used to train the M.L. models. If not defined, all the metrics used in train stage will be saved.
plot()
The function is responsible for creating a plot to visualize the performance achieved by the best M.L. model on each cluster.
TrainOutput$plot(dir.path, metrics = NULL)
dir.path
The location where the exported plot will be saved.
metrics
An optional parameter specifying the metric(s) used to train the M.L. models. If not defined, all the metrics used in train stage will be plotted.
getMetrics()
The function returns all metrics used for configuring M.L. hyperparameters during train stage.
TrainOutput$getMetrics()
A character value.
getClassValues()
The function is used to get the values of the target class.
TrainOutput$getClassValues()
A character containing the values of the target class.
getPositiveClass()
The function returns the value of the positive class.
TrainOutput$getPositiveClass()
A character vector of size 1.
getSize()
The function is used to get the number of the trained M.L. models. Each cluster contains the best M.L. model.
TrainOutput$getSize()
A numeric value or NULL if training was not successfully performed.
clone()
The objects of this class are cloneable with this method.
TrainOutput$clone(deep = FALSE)
deep
Whether to make a deep clone.
The Trainset
is used to perform training
operations over M.L. models. A target class should be defined to guarantee
full compatibility with supervised models.
Use Dataset
object to ensure the creation of a valid
Trainset
object.
new()
Method for initializing the object arguments during runtime.
Trainset$new(cluster.dist, class.name, class.values, positive.class)
cluster.dist
The type of cluster distribution used as basis
to build the Trainset
. See
GenericClusteringStrategy
for more information.
class.name
Used to specify the name of the column containing the target class.
class.values
Specifies all the possible values of the target class.
positive.class
A character with the value of the positive class.
getPositiveClass()
The function is used to obtain the value of the positive class.
Trainset$getPositiveClass()
A numeric value with the positive class value.
getClassName()
The function is used to return the name of the target class.
Trainset$getClassName()
A character vector with length 1.
getClassValues()
The function is used to compute all the possible target class values.
Trainset$getClassValues()
A factor value.
getColumnNames()
The function returns the names of the columns comprising a specific cluster distribution.
Trainset$getColumnNames(num.cluster)
A character vector with all column names.
getFeatureValues()
The function returns the values of the columns comprising a specific cluster distribution. The target class is omitted.
Trainset$getFeatureValues(num.cluster)
A data.frame with the values of the features comprising the selected cluster distribution.
getInstances()
The function returns the values of the columns comprising a specific cluster distribution. The target class is included as the last column.
Trainset$getInstances(num.cluster)
A data.frame with the values of the features comprising the selected cluster distribution.
getNumClusters()
The function obtains the number of groups (clusters) that form the cluster distribution.
Trainset$getNumClusters()
A numeric vector of size 1.
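The difference between getFeatureValues() and getInstances() can be sketched in base R. This is only an illustration of the documented behavior, not the package's implementation; the data frame `data`, the list `cluster.dist` and the helper functions `feature_values()` and `instances()` are hypothetical.

```r
# Hypothetical dataset: three features plus a target class column.
data <- data.frame(
  f1 = c(1, 0, 1),
  f2 = c(0.2, 0.5, 0.9),
  f3 = c(0, 0, 1),
  Class = factor(c("yes", "no", "yes"))
)

# Hypothetical cluster distribution: each element lists the features of a group.
cluster.dist <- list(c("f1", "f3"), c("f2"))
class.name <- "Class"

# Feature values of one cluster, target class omitted.
feature_values <- function(num.cluster) {
  data[, cluster.dist[[num.cluster]], drop = FALSE]
}

# Same columns with the target class appended as the last column.
instances <- function(num.cluster) {
  cbind(feature_values(num.cluster), data[, class.name, drop = FALSE])
}

print(names(feature_values(1)))
print(names(instances(1)))
print(length(cluster.dist))  # number of clusters in the distribution
```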
Dataset, DatasetLoader, Subset, GenericClusteringStrategy
Implementation to control the computational nuances of the train function for bi-class problems.
D2MCS::TrainFunction
-> TwoClass
new()
TwoClass$new( method, number, savePredictions, classProbs, allowParallel, verboseIter, seed = NULL )
method
The resampling method: "boot", "boot632", "optimism_boot", "boot_all", "cv", "repeatedcv", "LOOCV", "LGOCV" (for repeated training/test splits), "none" (only fits one model to the entire training set), "oob" (only for random forest, bagged trees, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models), "timeslice", "adaptive_cv", "adaptive_boot" or "adaptive_LGOCV".
number
Either the number of folds or number of resampling iterations
savePredictions
An indicator of how much of the hold-out predictions for each resample should be saved. Values can be either "all", "final", or "none". A logical value can also be used, which converts to "all" (for TRUE) or "none" (for FALSE). "final" saves the predictions for the optimal tuning parameters.
classProbs
A logical value. Should class probabilities be computed for classification models (along with predicted values) in each resample?
allowParallel
A logical value. If a parallel backend is loaded and available, should the function use it?
verboseIter
A logical for printing a training log.
seed
An optional integer used to set the seed during the model training stage.
create()
Creates the trainControl
object required for the
training stage.
TwoClass$create(summaryFunction, search.method = "grid", class.probs = NULL)
summaryFunction
An object inherited from
SummaryFunction
class.
search.method
Either "grid" or "random", describing how the tuning parameter grid is determined.
class.probs
A logical indicating if class probabilities should be computed for classification models (along with predicted values) in each resample.
getTrFunction()
Function used to return the
trainControl
object.
TwoClass$getTrFunction()
A trainControl
object.
setClassProbs()
The function enables or disables the computation of class probabilities.
TwoClass$setClassProbs(class.probs)
getMeasures()
Returns the measures used to optimize model hyperparameters.
TwoClass$getMeasures()
A character vector.
getType()
Obtains the type of classification problem ("Bi-class" or "Multi-class").
TwoClass$getType()
A character vector with "Bi-class" value.
setSummaryFunction()
Function used to change the SummaryFunction
used in the training stage.
TwoClass$setSummaryFunction(summaryFunction)
summaryFunction
An object inherited from
SummaryFunction
class.
clone()
The objects of this class are cloneable with this method.
TwoClass$clone(deep = FALSE)
deep
Whether to make a deep clone.
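A minimal usage sketch of the constructor and two inherited getters, assuming the D2MCS package is installed. The argument values ("cv", 10 folds, the seed) are illustrative choices, not recommended defaults, and the code is skipped when the package is not available.

```r
# Run only when D2MCS is installed, so the sketch is safe to source anywhere.
if (requireNamespace("D2MCS", quietly = TRUE)) {
  # 10-fold cross-validation, keeping predictions of the final model only.
  tr <- D2MCS::TwoClass$new(
    method = "cv",
    number = 10,
    savePredictions = "final",
    classProbs = TRUE,
    allowParallel = FALSE,
    verboseIter = FALSE,
    seed = 1234
  )

  # Getters inherited from TrainFunction.
  print(tr$getResamplingMethod())
  print(tr$getNumberFolds())
}
```

The resulting object would then typically be completed with a SummaryFunction through create() before being used in the training stage.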
Features are sorted in descending order of the relevance value obtained after applying a specific heuristic. Next, features are distributed into N clusters following a card-dealing methodology. Finally, the distribution with the highest homogeneity is selected as the best distribution.
The strategy is suitable only for binary and real features. Other features are automatically grouped into a specific cluster named as 'unclustered'.
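The sort-then-deal step can be sketched in base R as follows. This is an assumption-laden illustration rather than the package's code: the relevance scores and the helper `deal_features()` are hypothetical, dealing is done in simple round-robin order, and the homogeneity evaluation used to pick the best distribution is not shown.

```r
# Hypothetical relevance scores, as a heuristic might produce them.
relevance <- c(f1 = 0.10, f2 = 0.80, f3 = 0.35, f4 = 0.60,
               f5 = 0.05, f6 = 0.50, f7 = 0.20)

# Deal features into n.clusters groups like cards: the most relevant feature
# goes to cluster 1, the next to cluster 2, and so on, wrapping around.
deal_features <- function(relevance, n.clusters) {
  ranked <- names(sort(relevance, decreasing = TRUE))
  dealt.to <- (seq_along(ranked) - 1) %% n.clusters + 1
  unname(split(ranked, dealt.to))
}

distribution <- deal_features(relevance, n.clusters = 3)
print(distribution)
```

Repeating the deal for several values of N and scoring each result would give the set of candidate distributions from which the best one is chosen.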
D2MCS::GenericClusteringStrategy
-> TypeBasedStrategy
new()
Method for initializing the object arguments during runtime.
TypeBasedStrategy$new( subset, heuristic, configuration = StrategyConfiguration$new() )
subset
The Subset
used to apply the
feature-clustering strategy.
heuristic
The heuristic used to compute the relevance of each
feature. Must inherit from GenericHeuristic
abstract class.
configuration
Optional parameter to customize configuration
parameters for the strategy. Must inherit from
StrategyConfiguration
abstract class.
execute()
Function responsible for performing the clustering strategy
over the defined Subset
.
TypeBasedStrategy$execute(verbose = FALSE)
verbose
A logical value to specify if more verbosity is needed.
getDistribution()
Function used to obtain a specific cluster distribution.
TypeBasedStrategy$getDistribution( num.clusters = NULL, num.groups = NULL, include.unclustered = FALSE )
A list with the features comprising a specific clustering distribution.
createTrain()
The function is used to create a Trainset object from a specific clustering distribution.
TypeBasedStrategy$createTrain( subset, num.clusters = NULL, num.groups = NULL, include.unclustered = FALSE )
subset
The Subset
object used as a basis to create
the train set (see Trainset
class).
num.clusters
A numeric value to select the number of clusters (define the distribution).
num.groups
A single numeric value or a numeric vector identifying the specific group(s) that form the clustering distribution.
include.unclustered
A logical value to determine if unclustered features should be included.
If num.clusters
and num.groups
are not defined,
the best clustering distribution is used to create the train set.
A Trainset
object.
plot()
The function is responsible for creating a plot to visualize the clustering distribution.
TypeBasedStrategy$plot(dir.path = NULL, file.name = NULL)
dir.path
An optional character argument to define the name
of the directory where the exported plot will be saved. If not defined,
the file path will be automatically assigned to the current working
directory, 'getwd()
'.
file.name
A character to define the name of the PDF file where the plot is exported.
saveCSV()
The function is used to save the clustering distribution to a CSV file.
TypeBasedStrategy$saveCSV(dir.path = NULL, name = NULL, num.clusters = NULL)
dir.path
The name of the directory to save the CSV file.
name
Defines the name of the CSV file.
num.clusters
An optional parameter to select the number of clusters to be saved. If not defined, all cluster distributions will be saved.
clone()
The objects of this class are cloneable with this method.
TypeBasedStrategy$clone(deep = FALSE)
deep
Whether to make a deep clone.
GenericClusteringStrategy
,
StrategyConfiguration
Computes the performance across resamples when class probabilities can be computed.
D2MCS::SummaryFunction
-> UseProbability
new()
The function defines at runtime the use of seven measures: 'ROC', 'Sens', 'Kappa', 'Accuracy', 'TCR_9', 'MCC' and 'PPV'.
UseProbability$new()
execute()
The function computes the performance across resamples using the previously defined measures.
UseProbability$execute(data, lev = NULL, model = NULL)
data
A data.frame containing the data used to compute the performance.
lev
An optional value used to define the levels of the target class.
model
An optional value used to define the M.L. model used.
A vector of performance estimates.
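The shape of such a summary function can be sketched in base R. The sketch follows the caret convention in which `data` carries the observed labels in an `obs` column and the predicted labels in a `pred` column; it computes only two of the seven measures, treats the first level as the positive class (an assumption made here), and the name `summary_sketch()` is hypothetical.

```r
# Hypothetical summary function with the same signature as execute().
summary_sketch <- function(data, lev = NULL, model = NULL) {
  if (is.null(lev)) lev <- levels(data$obs)
  positive <- lev[1]  # assumption: first level is the positive class

  # Fraction of correctly classified samples.
  accuracy <- mean(data$pred == data$obs)
  # Sensitivity: correctly predicted positives over all actual positives.
  sens <- sum(data$pred == positive & data$obs == positive) /
    sum(data$obs == positive)

  c(Accuracy = accuracy, Sens = sens)
}

# Hypothetical resample results.
resample <- data.frame(
  obs  = factor(c("spam", "spam", "ham", "ham", "spam"),
                levels = c("spam", "ham")),
  pred = factor(c("spam", "ham", "ham", "ham", "spam"),
                levels = c("spam", "ham"))
)

result <- summary_sketch(resample)
print(result)
```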
clone()
The objects of this class are cloneable with this method.
UseProbability$clone(deep = FALSE)
deep
Whether to make a deep clone.
Abstract class used to define new SingleVoting
and
CombinedVoting
schemes.
new()
Abstract method used to initialize the object arguments during runtime.
VotingStrategy$new()
getVotingSchemes()
The function returns the voting schemes that will participate in the voting strategy.
VotingStrategy$getVotingSchemes()
A vector of objects inheriting from VotingStrategy
class.
getMetrics()
The function is used to get the metric that will be used during the voting strategy.
VotingStrategy$getMetrics()
A character vector.
execute()
Abstract function used to implement the operation of the voting schemes.
VotingStrategy$execute(predictions, ...)
predictions
A ClusterPredictions
object containing
the prediction achieved for each cluster.
...
Further arguments passed down to execute
function.
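To illustrate the kind of operation a concrete voting scheme might implement in execute(), the base R sketch below combines per-cluster predictions by simple majority. It is one possible scheme chosen for illustration and does not use the ClusterPredictions class; the data frame `cluster.preds` and the helper `majority_vote()` are hypothetical.

```r
# Hypothetical predictions: one column per cluster, one row per sample.
cluster.preds <- data.frame(
  cluster1 = c("yes", "no", "yes", "no"),
  cluster2 = c("yes", "yes", "no", "no"),
  cluster3 = c("no", "yes", "yes", "no"),
  stringsAsFactors = FALSE
)

# For each sample, return the class predicted by most clusters.
# With an odd number of clusters and two classes, ties cannot occur.
majority_vote <- function(preds) {
  unname(apply(preds, 1, function(row) names(which.max(table(row)))))
}

final <- majority_vote(cluster.preds)
print(final)
```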
getName()
The function returns the name of the voting scheme.
VotingStrategy$getName()
A character vector of size 1.
clone()
The objects of this class are cloneable with this method.
VotingStrategy$clone(deep = FALSE)
deep
Whether to make a deep clone.
D2MCS, SingleVoting, CombinedVoting