Title: | Preprocessing Algorithms for Imbalanced Datasets |
---|---|
Description: | Class imbalance usually damages the performance of classifiers. Thus, it is important to treat data before applying a classifier algorithm. This package includes recent resampling algorithms in the literature: (Barua et al. 2014) <doi:10.1109/tkde.2012.232>; (Das et al. 2015) <doi:10.1109/tkde.2014.2324567>; (Zhang et al. 2014) <doi:10.1016/j.inffus.2013.12.003>; (Gao et al. 2014) <doi:10.1016/j.neucom.2014.02.006>; (Almogahed et al. 2014) <doi:10.1007/s00500-014-1484-5>. It also includes a useful interface to perform oversampling. |
Authors: | Ignacio Cordón [aut, cre], Salvador García [aut], Alberto Fernández [aut], Francisco Herrera [aut] |
Maintainer: | Ignacio Cordón <[email protected]> |
License: | GPL (>= 2) | file LICENSE |
Version: | 1.0.2.1 |
Built: | 2024-12-24 06:59:00 UTC |
Source: | CRAN |
Dataset containing two attributes and a class attribute which, when plotted, form a banana shape.
banana banana_orig
First attribute.
Second attribute.
Two possible classes: positive (banana shape), negative (surrounding of the banana).
banana: A data frame with 2640 instances, 264 of which belong to the positive class, and 3 variables.
banana_orig: A data frame with 5300 instances, 2376 of which belong to the positive class, and 3 variables.
Imbalanced binary dataset containing protein traits for predicting their cellular localization sites.
ecoli1
A data frame with 336 instances, 77 of which belong to positive class, and 8 variables:
McGeoch's method for signal sequence recognition. Continuous attribute.
Von Heijne's method for signal sequence recognition. Continuous attribute.
Von Heijne's Signal Peptidase II consensus sequence score. Discrete attribute.
Presence of charge on N-terminus of predicted lipoproteins. Discrete attribute.
Score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins. Continuous attribute.
Score of the ALOM membrane spanning region prediction program. Continuous attribute.
Score of the ALOM program after excluding putative cleavable signal regions from the sequence. Continuous attribute.
Two possible classes: positive (type im), negative (the rest).
Original available in UCI ML Repository.
Imbalanced binary classification dataset containing variables to identify types of glass.
glass0
A data frame with 214 instances, 70 of which belong to the positive class, and 10 variables:
Refractive Index. Continuous attribute.
Sodium, weight percent in component. Continuous attribute.
Magnesium, weight percent in component. Continuous attribute.
Aluminum, weight percent in component. Continuous attribute.
Silicon, weight percent in component. Continuous attribute.
Potassium, weight percent in component. Continuous attribute.
Calcium, weight percent in component. Continuous attribute.
Barium, weight percent in component. Continuous attribute.
Iron, weight percent in component. Continuous attribute.
Two possible glass types: positive (building windows, float processed) and negative (the rest).
Original available in UCI ML Repository.
The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.
haberman
A data frame with 306 instances, 81 of which belong to positive class, and 4 variables:
Age of patient at time of operation. Discrete attribute.
Patient's year of operation. Discrete attribute.
Number of positive axillary nodes detected. Discrete attribute.
Two possible survival statuses: positive (survival rate of less than 5 years), negative (survival rate of more than 5 years).
Original available in UCI ML Repository.
Focused on binary class datasets, the imbalance package provides methods to generate synthetic examples and achieve balance between the minority and majority classes in dataset distributions.
Methods to oversample the minority class: racog, wracog, rwo, pdfos, mwmote.
Method to measure the imbalance ratio in a given two-class dataset: imbalanceRatio.
Method to visually evaluate algorithms: plotComparison.
Method to filter oversampled instances: neater.
Given a two-class dataset, it computes its imbalance ratio as (size of minority class) / (size of majority class).
imbalanceRatio(dataset, classAttr = "Class")
dataset | A target |
classAttr | A |
A real number in [0,1] representing the imbalance ratio of dataset.
data(glass0)
imbalanceRatio(glass0, classAttr = "Class")
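For intuition, the ratio described above can be reproduced in base R with a few lines. The helper name imbalance_ratio_sketch and the toy data frame are hypothetical and only illustrate the formula; this is a sketch, not the package's implementation.

```r
# Hypothetical helper mirroring the minority/majority formula.
# It is NOT the package's imbalanceRatio(); it only illustrates the arithmetic.
imbalance_ratio_sketch <- function(dataset, classAttr = "Class") {
  counts <- table(dataset[[classAttr]])
  stopifnot(length(counts) == 2)       # two-class datasets only
  as.numeric(min(counts) / max(counts))
}

# Toy data: 2 positive rows against 8 negative rows
toy <- data.frame(
  x = 1:10,
  Class = c(rep("positive", 2), rep("negative", 8))
)
imbalance_ratio_sketch(toy)  # 2 / 8 = 0.25
```

Applied to the counts documented for glass0 (70 positive instances out of 214, hence 144 negative), the same arithmetic gives 70 / 144, roughly 0.486.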
Modification of the iris dataset. Measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The possible classifications are positive (setosa) and negative (versicolor + virginica).
iris0
A data frame with 150 instances, 50 of which belong to positive class, and 5 variables:
Measurement of sepal length, in cm. Continuous attribute.
Measurement of sepal width, in cm. Continuous attribute.
Measurement of petal length, in cm. Continuous attribute.
Measurement of petal width, in cm. Continuous attribute.
Two possible classes: positive (setosa) and negative (versicolor + virginica).
Modification of the SMOTE technique that overcomes some of SMOTE's problems when there are noisy instances, in which case SMOTE would generate more noisy instances out of them.
mwmote(dataset, numInstances, kNoisy = 5, kMajority = 3, kMinority, threshold = 5, cmax = 2, cclustering = 3, classAttr = "Class")
dataset | |
numInstances | Integer. Number of new minority examples to generate. |
kNoisy | Integer. Parameter of the Euclidean KNN used to detect noisy examples, that is, those whose whole kNoisy-neighbourhood belongs to the opposite class. |
kMajority | Integer. Parameter of the Euclidean KNN used to detect majority borderline examples, that is, those that lie in the kMajority-neighbourhood of some minority instance. Should be a low integer. |
kMinority | Integer. Parameter of the Euclidean KNN used to detect minority borderline examples, that is, those that lie in the kMinority-neighbourhood of majority borderline ones. It should be a large integer. By default, if no parameter is fed to the function, … |
threshold | Numeric. A positive real indicating how much tolerance is given to the closeness of minority boundary examples to the boundary. A larger value leaves more margin of distance for an example to be considered an important boundary one. |
cmax | Numeric. A positive real indicating how much tolerance is given to the closeness of minority boundary examples to the boundary. The larger this number, the more boundary examples are valued. |
cclustering | Numeric. A positive real for tuning the output of an internal clustering. The larger this parameter, the more area-focused the oversampling is going to be. |
classAttr | |
A data.frame with the same structure as dataset, containing the generated synthetic examples.
Barua, Sukarna; Islam, Md. M.; Yao, Xin; Murase, Kazuyuki. MWMOTE–Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning. IEEE Transactions on Knowledge and Data Engineering 26 (2014), Nr. 2, p. 405–425.
data(iris0)
# Generates new minority examples
newSamples <- mwmote(iris0, numInstances = 100, classAttr = "Class")
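The kNoisy argument describes noisy examples as those whose whole kNoisy-neighbourhood comes from the opposite class. A minimal base R sketch of that rule for minority rows is shown below. The helper flag_noisy_minority and the toy coordinates are hypothetical, and the code only illustrates the neighbourhood test, not the internals of mwmote.

```r
# Hypothetical helper: flags minority rows whose k nearest neighbours
# (Euclidean distance) all belong to the majority class.
# Illustrative only; not the package's internal code.
flag_noisy_minority <- function(x, isMinority, k = 5) {
  d <- as.matrix(dist(x))   # pairwise Euclidean distances
  diag(d) <- Inf            # a point is not its own neighbour
  vapply(seq_len(nrow(x)), function(i) {
    if (!isMinority[i]) return(FALSE)
    nn <- order(d[i, ])[seq_len(k)]
    all(!isMinority[nn])
  }, logical(1))
}

x <- rbind(
  cbind(c(0, 0.1, 0.2, 0.1), c(0, 0.1, 0, 0.2)),   # minority cluster
  cbind(5, 5),                                     # isolated minority point
  cbind(c(5.1, 4.9, 5, 5.2, 4.8, 5.1),
        c(5, 5.1, 4.9, 5.2, 5, 4.8))               # majority cluster
)
isMinority <- c(rep(TRUE, 5), rep(FALSE, 6))

flags <- flag_noisy_minority(x, isMinority, k = 3)
which(flags)  # only row 5, the minority point surrounded by majority rows
```

A low k makes the test stricter about what counts as surrounded, which is one way to read the advice to keep neighbourhood parameters small for noise detection.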
Filters oversampled examples from a binary class dataset, using game theory to decide whether each example is worth keeping.
neater(dataset, newSamples, k = 3, iterations = 100, smoothFactor = 1, classAttr = "Class")
dataset | The original |
newSamples | A |
k | Integer. Number of nearest neighbours to use in the KNN algorithm to rule out samples. By default, 3. |
iterations | Integer. Number of iterations for the algorithm. By default, 100. |
smoothFactor | A positive |
classAttr | |
Uses game theory and Nash equilibria to calculate the probability that each minority example truly belongs to the minority class. It discards the examples which, at the final stage of the algorithm, are more likely to be majority examples than minority ones.
Filtered samples as a data.frame with the same structure as newSamples.
Almogahed, B.A.; Kakadiaris, I.A. NEATER: Filtering of Over-Sampled Data Using Non-Cooperative Game Theory. Soft Computing 19 (2014), Nr. 11, p. 3301–3322.
data(iris0)
newSamples <- smotefamily::SMOTE(iris0[,-5], iris0[,5])$syn_data
# SMOTE renames the Class attribute to class, but dataset and
# newSamples must share the same class attribute name
names(newSamples) <- c(names(newSamples)[-5], "Class")
neater(iris0, newSamples, k = 5, iterations = 100, smoothFactor = 1, classAttr = "Class")
Data to predict patient's hyperthyroidism.
newthyroid1
A data frame with 215 instances, 35 of which belong to positive class, and 6 variables:
T3-resin uptake test, percentage. Discrete attribute.
Total serum thyroxin as measured by the isotopic displacement method. Continuous attribute.
Total serum triiodothyronine as measured by radioimmunoassay. Continuous attribute.
Basal thyroid-stimulating hormone (TSH) as measured by radioimmunoassay. Continuous attribute.
Maximal absolute difference of TSH value after injection of 200 micrograms of thyrotropin-releasing hormone as compared to the basal value. Continuous attribute.
Two possible classes: positive as hyperthyroidism, negative as non hyperthyroidism.
Original available in UCI ML Repository.
Wrapper that encapsulates a collection of algorithms to perform a class balancing preprocessing task for binary class datasets
oversample(dataset, ratio = NA, method = c("RACOG", "wRACOG", "PDFOS", "RWO", "ADASYN", "ANSMOTE", "SMOTE", "MWMOTE", "BLSMOTE", "DBSMOTE", "SLMOTE", "RSLSMOTE"), filtering = FALSE, classAttr = "Class", wrapper = c("KNN", "C5.0"), ...)
dataset | A binary class |
ratio | Number between 0 and 1 indicating the desired ratio between minority examples and majority ones, that is, the quotient size of minority class/size of majority class. There are methods, such as |
method | A |
filtering | Logical (TRUE or FALSE) indicating whether to apply filtering of oversampled instances with |
classAttr | |
wrapper | A |
... | Further arguments to apply in the selected method. |
A balanced data.frame with the same structure as dataset, containing both original instances and new ones.
data(glass0)
# Oversample glass0 to get an imbalance ratio of 0.8
imbalanceRatio(glass0)
# 0.4861111
newDataset <- oversample(glass0, ratio = 0.8, method = "MWMOTE")
imbalanceRatio(newDataset)
newDataset <- oversample(glass0, method = "ADASYN")
newDataset <- oversample(glass0, ratio = 0.8, method = "SMOTE")
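To see what a target ratio means in terms of generated rows, the arithmetic can be worked out by hand. The helper instances_needed below is hypothetical, and rounding up with ceiling() is an assumption of this sketch; the package may round differently.

```r
# Hypothetical helper: how many synthetic minority rows are needed so that
# minority / majority reaches at least `ratio`.
# Rounding up with ceiling() is an assumption of this sketch.
instances_needed <- function(nMinority, nMajority, ratio) {
  stopifnot(ratio > 0, ratio <= 1)
  max(0, ceiling(ratio * nMajority) - nMinority)
}

# glass0 has 214 instances, 70 of them positive, hence 144 negative
needed <- instances_needed(nMinority = 70, nMajority = 144, ratio = 0.8)
needed                 # 46
(70 + needed) / 144    # resulting ratio, slightly above 0.8
```

A count like this could then be passed as numInstances to one of the individual oversampling functions when more control is wanted than the wrapper offers.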
Generates synthetic minority examples for a numerical dataset approximating a Gaussian multivariate distribution which best fits the minority data.
pdfos(dataset, numInstances, classAttr = "Class")
dataset | |
numInstances | Integer. Number of new minority examples to generate. |
classAttr | |
To generate the synthetic data, it approximates a normal distribution whose mean is a given example of the minority class and whose variance is the minority class variance multiplied by a constant; that constant is computed so that it minimizes the mean integrated squared error of a Gaussian multivariate kernel function.
A data.frame with the same structure as dataset, containing the generated synthetic examples.
Gao, Ming; Hong, Xia; Chen, Sheng; Harris, Chris J.; Khalaf, Emad. PDFOS: PDF Estimation Based Oversampling for Imbalanced Two-Class Problems. Neurocomputing 138 (2014), p. 248–259.
Silverman, B. W. Density Estimation for Statistics and Data Analysis. Chapman & Hall, 1986. – ISBN 0412246201
data(iris0)
newSamples <- pdfos(iris0, numInstances = 100, classAttr = "Class")
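The sampling step described above (a normal distribution centred on a minority example, with the minority variance scaled by a constant) can be sketched in base R. The function pdfos_sketch is hypothetical. In particular, the scaling constant here comes from Silverman's rule of thumb for a Gaussian kernel, used as a stand-in for the error-minimizing search that the description mentions, so the output will not match pdfos.

```r
# Hypothetical sketch of kernel-based oversampling; not the package's pdfos().
# Assumption: the bandwidth h uses Silverman's rule of thumb instead of the
# numerical minimization of the mean integrated squared error.
pdfos_sketch <- function(minority, numInstances, h = NULL) {
  x <- as.matrix(minority)
  n <- nrow(x)
  d <- ncol(x)
  if (is.null(h)) h <- (4 / (d + 2))^(1 / (d + 4)) * n^(-1 / (d + 4))
  S <- cov(x)                 # minority class covariance
  R <- chol(S)                # upper triangular, t(R) %*% R == S
  # Pick a minority example at random as the centre of each new sample
  seeds <- x[sample.int(n, numInstances, replace = TRUE), , drop = FALSE]
  # Rows of z %*% R follow N(0, S); scaling by h gives covariance h^2 * S
  z <- matrix(rnorm(numInstances * d), numInstances, d)
  out <- seeds + h * (z %*% R)
  dimnames(out) <- list(NULL, colnames(x))
  as.data.frame(out)
}

set.seed(1)
minority <- iris[iris$Species == "setosa", 1:4]
newSamples <- pdfos_sketch(minority, numInstances = 100)
dim(newSamples)
```

Because every synthetic row is a minority example plus correlated Gaussian noise, the new points stay close to the minority region while still differing from the originals.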
Plots a grid of one-to-one variable comparisons, placing the plot of the original dataset next to that of the balanced one for each pair of attributes.
plotComparison(dataset, anotherDataset, attrs, cols = 2, classAttr = "Class")
dataset | A |
anotherDataset | A |
attrs | Vector of |
cols | Integer. Number of columns of the resulting grid. Must be an even number. By default, 2. |
classAttr | |
Plot of 2D comparison between the variables.
data(iris0)
set.seed(12345)
rwoSamples <- rwo(iris0, numInstances = 100)
rwoBalanced <- rbind(iris0, rwoSamples)
plotComparison(iris0, rwoBalanced, names(iris0), cols = 2, classAttr = "Class")
Allows you to treat imbalanced discrete numeric datasets by generating synthetic minority examples, approximating their probability distribution.
racog(dataset, numInstances, burnin = 100, lag = 20, classAttr = "Class")
dataset | |
numInstances | Integer. Number of new minority examples to generate. |
burnin | Integer. Number of initial examples generated from a given one that are discarded. By default, 100. |
lag | Integer. Number of iterations between consecutive accepted examples generated from a minority one. By default, 20. |
classAttr | |
Approximates the minority distribution using a Gibbs sampler. The dataset must be discretized and numeric. In each iteration, it builds a new sample using a Markov chain. It discards the first burnin iterations and, from then on, every lag iterations it validates the example as a new minority example. It generates … where … is the number of minority examples.
A data.frame with the same structure as dataset, containing the generated synthetic examples.
Das, Barnan; Krishnan, Narayanan C.; Cook, Diane J. RACOG and wRACOG: Two Probabilistic Oversampling Techniques. IEEE Transactions on Knowledge and Data Engineering 27 (2015), Nr. 1, p. 222–234.
data(iris0)
# Generates new minority examples
newSamples <- racog(iris0, numInstances = 40, burnin = 20, lag = 10, classAttr = "Class")
newSamples <- racog(iris0, numInstances = 100)
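The roles of burnin and lag are easier to see in a toy chain. The sketch below is hypothetical and deliberately simplified: each attribute is redrawn from its empirical distribution among the minority rows, ignoring dependencies between attributes, so it illustrates only the discard-then-thin schedule and not the sampler used by racog.

```r
# Hypothetical sketch of the burn-in / lag schedule of a Gibbs-style chain.
# Assumption: attributes are resampled from their empirical marginals, which
# ignores dependencies between attributes. Not the package's racog().
gibbs_thinning_sketch <- function(minority, iterations, burnin = 100, lag = 20) {
  x <- as.matrix(minority)
  accepted <- list()
  for (i in seq_len(nrow(x))) {
    state <- x[i, ]                       # each minority example seeds a chain
    for (t in seq_len(iterations)) {
      for (j in seq_along(state)) {       # update one attribute at a time
        state[j] <- x[sample.int(nrow(x), 1), j]
      }
      # Discard the first `burnin` states, then keep one state every `lag`
      if (t > burnin && (t - burnin) %% lag == 0) {
        accepted[[length(accepted) + 1]] <- state
      }
    }
  }
  as.data.frame(do.call(rbind, accepted))
}

set.seed(42)
toy <- data.frame(a = c(1, 2, 2, 3, 1), b = c(0, 1, 1, 0, 1))
newSamples <- gibbs_thinning_sketch(toy, iterations = 200, burnin = 100, lag = 20)
nrow(newSamples)  # 5 chains x 5 accepted states each = 25
```

In this sketch a larger burnin discards more of the early states that still resemble the seed, and a larger lag spaces out the accepted states of the same chain.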
Generates synthetic minority examples for a dataset trying to preserve the variance and mean of the minority class. Works on every type of dataset.
rwo(dataset, numInstances, classAttr = "Class")
dataset | |
numInstances | Integer. Number of new minority examples to generate. |
classAttr | |
Generates numInstances new minority examples for dataset, adding to each numeric column of the j-th example its variance scaled by the inverse of the number of minority examples and by a factor following a distribution which depends on the example. When the column is nominal, it uses a roulette scheme.
A data.frame with the same structure as dataset, containing the generated synthetic examples.
Zhang, Huaxiang; Li, Mingfang. RWO-Sampling: A Random Walk Over-Sampling Approach to Imbalanced Data Classification. Information Fusion 20 (2014), p. 99–116.
data(iris0)
newSamples <- rwo(iris0, numInstances = 100, classAttr = "Class")
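For numeric columns, the random walk idea can be sketched in base R as follows. The function rwo_sketch is hypothetical, and two choices in it are assumptions rather than facts taken from this page: the step size is the column standard deviation divided by the square root of the number of minority examples, and the random factor is standard normal. Nominal columns and the roulette scheme are left out.

```r
# Hypothetical sketch of random walk oversampling for numeric columns.
# Assumptions: step = sd / sqrt(n) and a standard normal random factor.
# Not the package's rwo(); nominal columns are not handled.
rwo_sketch <- function(minority, numInstances) {
  x <- as.matrix(minority)
  n <- nrow(x)
  step <- apply(x, 2, sd) / sqrt(n)            # per-column step size
  idx <- rep_len(seq_len(n), numInstances)     # cycle through minority rows
  r <- matrix(rnorm(numInstances * ncol(x)), numInstances, ncol(x))
  # Move each selected example by a small random step in every column
  out <- x[idx, , drop = FALSE] - sweep(r, 2, step, `*`)
  dimnames(out) <- list(NULL, colnames(x))
  as.data.frame(out)
}

set.seed(7)
minority <- iris[iris$Species == "setosa", 1:4]
newSamples <- rwo_sketch(minority, numInstances = 100)
# Column means stay close to those of the original minority class
round(colMeans(newSamples) - colMeans(minority), 3)
```

Because the steps are small and centred on zero, the synthetic rows roughly keep the mean and spread of the minority class, which is the property the description emphasizes.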
Generic methods to train classifiers
trainWrapper(wrapper, train, trainClass, ...)
wrapper | The wrapper instance |
train | |
trainClass | A vector containing the class column for |
... | Further arguments for |
A model on which predict can be called.
myWrapper <- structure(list(), class="C50Wrapper")
trainWrapper.C50Wrapper <- function(wrapper, train, trainClass){
  C50::C5.0(train, trainClass)
}
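The example relies on S3 dispatch: the class of the wrapper object selects the training method, and the returned model must support predict. The self-contained sketch below shows that contract without extra packages. The generic trainWrapperSketch, the CentroidWrapper class and the nearest-centroid model are all hypothetical stand-ins chosen for illustration; they are not part of the package.

```r
# Hypothetical stand-in for the generic, so the sketch runs without the package
trainWrapperSketch <- function(wrapper, train, trainClass, ...) {
  UseMethod("trainWrapperSketch")
}

# Training method selected by the class of the wrapper object:
# a nearest-centroid classifier built with base R only
trainWrapperSketch.CentroidWrapper <- function(wrapper, train, trainClass, ...) {
  centroids <- lapply(split(train, trainClass), colMeans)
  structure(list(centroids = centroids), class = "centroidModel")
}

# The returned model must work with predict(), so define a predict method
predict.centroidModel <- function(object, newdata, ...) {
  x <- as.matrix(newdata)
  dists <- sapply(object$centroids,
                  function(ctr) rowSums(sweep(x, 2, ctr)^2))
  dists <- matrix(dists, nrow = nrow(x),
                  dimnames = list(NULL, names(object$centroids)))
  colnames(dists)[max.col(-dists, ties.method = "first")]
}

myWrapper <- structure(list(), class = "CentroidWrapper")
train <- iris[, 1:4]
trainClass <- ifelse(iris$Species == "setosa", "positive", "negative")
model <- trainWrapperSketch(myWrapper, train, trainClass)
predicted <- predict(model, train)
table(predicted, trainClass)
```

The same pattern applies to any classifier: give the wrapper a class name, write a training method for that class, and make sure the object it returns has a predict method.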
Binary class dataset containing traits about patients with cancer. Original dataset was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg.
wisconsin
A data frame with 683 instances, 239 of which belong to positive class, and 10 variables:
Discrete attribute.
Discrete attribute.
Discrete attribute.
Discrete attribute.
Discrete attribute.
Discrete attribute.
Discrete attribute.
Discrete attribute.
Discrete attribute.
Two possible classes: positive (cancer) and negative (not cancer).
Original available in UCI ML Repository.
Generates synthetic minority examples by approximating their probability distribution until the sensitivity of wrapper over validation cannot be further improved. Works only on discrete numeric datasets.
wracog(train, validation, wrapper, slideWin = 10, threshold = 0.02, classAttr = "Class", ...)
train | |
validation | |
wrapper | An |
slideWin | Number of last sensitivities to take into account to meet the stopping criteria. By default, 10. |
threshold | Threshold that the last |
classAttr | |
... | Further arguments for |
Until the last slideWin executions of wrapper over the validation dataset reach a mean sensitivity lower than threshold, the algorithm keeps generating samples with a Gibbs sampler and adding to the train dataset those samples that are misclassified by the model built in the previous training round. The initial model is built on the initial train.
A data.frame with the same structure as train, containing the generated synthetic examples.
Das, Barnan; Krishnan, Narayanan C.; Cook, Diane J. RACOG and wRACOG: Two Probabilistic Oversampling Techniques. IEEE Transactions on Knowledge and Data Engineering 27 (2015), Nr. 1, p. 222–234.
data(haberman)
# Create train and validation partitions of haberman
trainFold <- sample(1:nrow(haberman), nrow(haberman)/2, FALSE)
trainSet <- haberman[trainFold, ]
validationSet <- haberman[-trainFold, ]
# Defines our own wrapper with a C5.0 tree
myWrapper <- structure(list(), class="TestWrapper")
trainWrapper.TestWrapper <- function(wrapper, train, trainClass){
  C50::C5.0(train, trainClass)
}
# Execute wRACOG with our own wrapper
newSamples <- wracog(trainSet, validationSet, myWrapper, classAttr = "Class")
# Execute wRACOG with predefined wrappers for "KNN" or "C5.0"
KNNSamples <- wracog(trainSet, validationSet, "KNN")
C50Samples <- wracog(trainSet, validationSet, "C5.0")
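The stopping rule is driven by the sensitivity of the wrapper on the validation set, that is, the fraction of truly positive instances that the model labels as positive. A minimal base R sketch of that measure follows. The helper sensitivity_sketch and the toy label vectors are hypothetical and only illustrate the quantity being tracked.

```r
# Hypothetical helper: sensitivity = true positives / (true positives + false negatives)
sensitivity_sketch <- function(actual, predicted, positive = "positive") {
  tp <- sum(actual == positive & predicted == positive)
  fn <- sum(actual == positive & predicted != positive)
  tp / (tp + fn)
}

actual    <- c("positive", "positive", "positive", "negative", "negative")
predicted <- c("positive", "negative", "positive", "negative", "positive")
sensitivity_sketch(actual, predicted)  # 2 of the 3 positives are recovered
```

Tracking this value across rounds, for instance over the last slideWin evaluations, is what lets an algorithm of this kind decide when further oversampling no longer helps the minority class.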
Imbalanced binary dataset containing protein traits for predicting their cellular localization sites.
yeast4
A data frame with 1484 instances, 51 of which belong to positive class, and 9 variables:
McGeoch's method for signal sequence recognition. Continuous attribute.
Von Heijne's method for signal sequence recognition. Continuous attribute.
Score of the ALOM membrane spanning region prediction program. Continuous attribute.
Score of discriminant analysis of the amino acid content of the N-terminal region (20 residues long) of mitochondrial and non-mitochondrial proteins. Continuous attribute.
Presence of "HDEL" substring (thought to act as a signal for retention in the endoplasmic reticulum lumen). Binary attribute. Discrete attribute.
Peroxisomal targeting signal in the C-terminus. Continuous attribute.
Score of discriminant analysis of the amino acid content of vacuolar and extracellular proteins. Continuous attribute.
Score of discriminant analysis of nuclear localization signals of nuclear and non-nuclear proteins. Continuous attribute.
Two possible classes: positive (membrane protein, uncleaved signal), negative (rest of localizations).
Original available in UCI ML Repository.