--- title: "A Brief Introduction to D2MCS" author: "Miguel Ferreiro Diaz" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{A Brief Introduction to D2MCS} %\VignetteEngine{knitr::rmarkdown} usepackage[utf8]{inputenc} --- # Introduction
*D2MCS* is an object-oriented framework able to identify and exploit the intrinsic characteristic of input data to (i) accurately distribute features in groups (feature clustering) and (ii) design and deploy effective MCS models. Below are included code snippets belonging to the different stages of *D2MCS* framework: (i) Data Manipulation, (ii) Feature Clustering, (iii) MCS Creation and (iv) Classification Results. Furthermore, the package provides the facility to read the descriptions and details of all functions through the *help(D2MCS)* command.
## Stage 1: Data Manipulation
The first step starts using the *DatasetLoader* class to convert the data to be analyzed into the structure compatible with *D2MCS* called *Dataset* (or *HDDataset* in case the dataset cannot be stored in memory). The following code fragment shows the parameters included in the *load* function after instantiating the *DatasetLoader* class. ```{R, echo = TRUE, results = "hide", eval = FALSE} data.loader <- DatasetLoader$new() data <- data.loader$load(filepath, header = TRUE, sep = ",", skip.lines = 0, normalize.names = FALSE, ignore.columns = NULL) ``` Once the loading process is completed and the dataset is available in a *Dataset* object, it is possible to perform different methods divided into three main categories taking into account their behaviour: (i) dataset information obtainer, (ii) dataset column removal and (iii) dataset splitting operation. The following code snippet shows some of the different functions in each category. ```{R, echo = TRUE, results = "hide", eval = FALSE} ## DATASET INFORMATION OBTAINER data$getNcol() data$getNrow() data$getColumnNames() data$getDataset() ## DATASET COLUMN REMOVAL data$cleanData(columns = NULL, remove.funcs = NULL, remove.na = FALSE, remove.const = FALSE) ## DATASET HANDLING AND SPLITTING data$createPartitions(num.folds = NULL, percent.folds = NULL, class.balance = NULL) subset <- data$createSubset(num.folds = NULL, column.id = NULL, opts = list(remove.na = TRUE, remove.const = FALSE), class.index = NULL, positive.class = NULL) train <- data$createTrain(num.folds = NULL, class.index, positive.class, opts = list(remove.na = TRUE, remove.const = FALSE)) ``` Using the *createPartitions()* method, the dataset is divided in order to use the divisions to create the data structures required in the following phases, using the *createSubset()* and *createTrain() * methods. While the first method performs the creation of a *Subset* object used both for clustering operations and for validation purposes. On the other hand, the second method is responsible for creating a *Trainset* object necessary to perform the model training stage.
## Stage 2: Feature Clustering
After creating a Subset object, the stage two based on the distribution of features in clusters starts. The code snippet below exemplifies the three steps necessary to create and execute the clustering strategy called *DependencyBasedStrategy* included by default in *D2MCS*. ```{R, echo = TRUE, results = "hide", eval = FALSE} ## FEATURE-CLUSTERING ALGORITHM CREATION conf <- DependencyBasedStrategyConfiguration$new() dbs <- DependencyBasedStrategy$new(subset, heuristic, configuration = conf) ## FEATURE-CLUSTERING ALGORITHM EXECUTION dbs$execute(verbose = FALSE) ## FEATURE-CLUSTERING ALGORITHM FUNCTIONALITIES dbs$getBestClusterDistribution() dbs.train <- dbs$createTrain(subset, num.clusters = NULL, num.groups = NULL, include.unclustered = FALSE) ```
## Stage 3: MCS Creation
From the previous execution of the selected clustering strategy, a *Trainset* object is obtained to be used as input for the SMC creation phase. This stage is divided into three main steps (i) *D2MCS* framework initialization, (ii) MCS behaviour customization options and (iii) execution of MCS discovery operation. ```{R, echo = TRUE, results = "hide", eval = FALSE} ## D2MCS FRAMEWORK INITIALIZATION d2mcs <- D2MCS$new(dir.path, num.cores = 2, socket.type = "PSOCK", outfile = NULL) ## MCS BEHAVIOUR CUSTOMIZATION OPTIONS trFunction <- TwoClass$new(method, number, savePredictions, classProbs, allowParallel, verboseIter, seed = NULL) ## EXECUTION OF MCS DISCOVERY OPERATION trained <- d2mcs$train(train.set, train.function, num.clusters = NULL, model.recipe = DefaultModelFit$new(), ex.classifiers = c(), ig.classifiers = c(), metrics = NULL, saveAllModels = FALSE) ``` Using the code fragment shown previously, the training of the ML models provided by the *caret* library is performed in order to find out which models offer the best performance for the dataset, taking into account the indicated parameters.
## Stage 4: Classification Results
After building the MCS, starts the next stage related to the classification of the data. To this end, the *D2MCS* tool needs to use the *TrainOutput* structure obtained by training the MCS, the dataset to be predicted and the voting schemes chosen to combine the results of the different MSC models. ```{R, echo = TRUE, results = "hide", eval = FALSE} ## VOTING SCHEMES AVAILABLE IN THE CLASSIFICATION STAGE voting.types <- c(SingleVoting$new(voting.schemes, metrics), CombinedVoting$new(voting.schemes, combined.metrics, methodology, metrics)) ## EXECUTE THE CLASSIFICATION STAGE predictions <- d2mcs$classify(train.output, subset, voting.types, positive.class = NULL) ## COMPUTE THE PERFORMANCE OF EACH VOTING SCHEME predictions$getPerformances(test.set, measures, voting.names = NULL, metric.names = NULL, cutoff.values = NULL) ## OBTAIN THE PREDICTIONS OBTAINED OF EACH VOTING SCHEME USED prediction$getPredictions(voting.names = NULL, metric.names = NULL, cutoff.values = NULL, type = NULL, target = NULL, filter = FALSE) ``` When the classification stage is completed, the tool produces a *ClassificationOutput* object to allow the user to obtain information about the classification performance (*getPerformances()* method) and to observe in depth the predictions obtained (*getPredictions()* method).
# Development The *D2MCS* package is also available in a development version at the Github development page: [github.com/drordas/D2MCS](https://github.com/drordas/D2MCS)