Title: | A Collection of Oversampling Techniques for Class Imbalance Problem Based on SMOTE |
---|---|
Description: | A collection of various oversampling techniques developed from SMOTE is provided. SMOTE is an oversampling technique that synthesizes a new minority instance between a minority instance and one of its K nearest neighbors. Other techniques adopt this concept with other criteria in order to generate a balanced dataset for the class imbalance problem. |
Authors: | Wacharasak Siriseriwan [aut, cre] |
Maintainer: | Wacharasak Siriseriwan <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.4.0 |
Built: | 2024-11-10 06:43:53 UTC |
Source: | CRAN |
Generate synthetic positive instances using the ADASYN algorithm. The number of majority neighbors of each minority instance determines the number of synthetic instances generated from that minority instance.
ADAS(X,target,K=5)
X |
A data frame or matrix of numeric-attributed dataset |
target |
A vector of a target class attribute corresponding to a dataset X. |
K |
The number of nearest neighbors used during the sampling process |
data |
The resulting dataset, consisting of the original minority instances, the synthetic minority instances and the original majority instances, with a vector of their respective target classes appended as the last column |
syn_data |
A set of synthetic minority instances with a vector of minority target class appended at the last column |
orig_N |
A set of original instances whose class is not oversampled with a vector of their target class appended at the last column |
orig_P |
A set of original instances whose class is oversampled with a vector of their target class appended at the last column |
K |
The value of parameter K for nearest neighbor process used for generating data |
K_all |
Unavailable for this method |
dup_size |
A vector giving, for each instance, the multiple of synthetic minority instances over original majority instances used in the oversampling |
outcast |
Unavailable for this method |
eps |
Unavailable for this method |
method |
The name of oversampling method used for this generated dataset (ADASYN) |
Wacharasak Siriseriwan <[email protected]>
He, H., Bai, Y., Garcia, E. and Li, S. 2008. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. Proceedings of IJCNN 2008. (IEEE World Congress on Computational Intelligence). IEEE International Joint Conference. pp.1322-1328.
data_example = sample_generator(10000, ratio = 0.80)
genData = ADAS(data_example[, -3], data_example[, 3])
genData_2 = ADAS(data_example[, -3], data_example[, 3], K = 7)
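The allocation idea behind ADAS, where minority instances with more majority neighbors receive more synthetic instances, can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the helper name `adasyn_counts`, its arguments, and the use of `round()` are all hypothetical.

```r
# Hypothetical helper (not part of smotefamily): allocate synthetic instances
# in proportion to the number of majority-class neighbors of each minority instance.
adasyn_counts <- function(n_major_neighbors, K, total_synthetic) {
  r <- n_major_neighbors / K          # share of majority neighbors per instance
  if (sum(r) == 0) {
    return(rep(0, length(r)))         # no instance borders the majority class
  }
  w <- r / sum(r)                     # normalize the shares so they sum to 1
  round(w * total_synthetic)          # assumed rounding rule
}

# Three minority instances with 0, 2 and 3 majority neighbors out of K = 5
counts <- adasyn_counts(c(0, 2, 3), K = 5, total_synthetic = 10)
print(counts)
```

An instance surrounded only by other minority instances receives no synthetic instances in this sketch, while the instances closest to the majority class receive the most.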
Generate an oversampled dataset from an imbalanced dataset using Adaptive Neighbor SMOTE, which assigns the parameter K to each minority instance automatically.
ANS(X, target, dupSize = 0)
X |
A data frame or matrix of numeric-attributed dataset |
target |
A vector of a target class attribute corresponding to a dataset X. |
dupSize |
A number or vector representing the desired multiple of synthetic minority instances over the original number of majority instances; 0 for a balanced dataset. |
data |
The resulting dataset, consisting of the original minority instances, the synthetic minority instances and the original majority instances, with a vector of their respective target classes appended as the last column |
syn_data |
A set of synthetic minority instances with a vector of minority target class appended at the last column |
orig_N |
A set of original instances whose class is not oversampled with a vector of their target class appended at the last column |
orig_P |
A set of original instances whose class is oversampled with a vector of their target class appended at the last column |
K |
A vector of parameter K for each minority instance |
K_all |
The value of parameter C for nearest neighbor process used for identifying outcasts |
dup_size |
The maximum times of synthetic minority instances over original majority instances in the oversampling |
outcast |
A set of original minority instances which is defined as minority outcast |
eps |
The value of eps which determines automatic K |
method |
The name of oversampling method used for this generated dataset (ANS) |
Wacharasak Siriseriwan <[email protected]>
Siriseriwan, W. and Sinapiromsaran, K. Adaptive neighbor Synthetic Minority Oversampling TEchnique under 1NN outcast handling. Songklanakarin Journal of Science and Technology.
data_example = sample_generator(5000, ratio = 0.80)
genData = ANS(data_example[, -3], data_example[, 3])
Generate synthetic positive instances using the Borderline-SMOTE algorithm. The number of majority neighbors of each minority instance is used to divide the minority instances into three groups: SAFE, DANGER and NOISE. Only the DANGER instances are used to generate synthetic instances.
BLSMOTE(X,target,K=5,C=5,dupSize=0,method =c("type1","type2"))
X |
A data frame or matrix of numeric-attributed dataset |
target |
A vector of a target class attribute corresponding to a dataset X. |
K |
The number of nearest neighbors used during the sampling process |
C |
The number of nearest neighbors used when calculating the safe level |
dupSize |
A number or vector representing the desired multiple of synthetic minority instances over the original number of majority instances; 0 for duplicating until balanced |
method |
A parameter indicating which type of Borderline-SMOTE presented in the paper is used |
data |
The resulting dataset, consisting of the original minority instances, the synthetic minority instances and the original majority instances, with a vector of their respective target classes appended as the last column |
syn_data |
A set of synthetic minority instances with a vector of minority target class appended at the last column |
orig_N |
A set of original instances whose class is not oversampled with a vector of their target class appended at the last column |
orig_P |
A set of original instances whose class is oversampled with a vector of their target class appended at the last column |
K |
The value of parameter K for nearest neighbor process used for generating data |
K_all |
The value of parameter C for nearest neighbor process used for determining SAFE/DANGER/NOISE |
dup_size |
The maximum times of synthetic minority instances over original majority instances in the oversampling |
outcast |
Unavailable for this method |
eps |
Unavailable for this method |
method |
The name of oversampling method and type used for this generated dataset (BLSMOTE type1/2) |
Wacharasak Siriseriwan <[email protected]>
Han, H., Wang, W.Y. and Mao, B.H. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In Proceedings of the 2005 international conference on Advances in Intelligent Computing - Volume Part I (ICIC'05), De-Shuang Huang, Xiao-Ping Zhang, and Guang-Bin Huang (Eds.), Vol. Part I. Springer-Verlag, Berlin, Heidelberg, 2005. 878-887. DOI=http://dx.doi.org/10.1007/11538059_91
data_example = sample_generator(5000, ratio = 0.80)
genData = BLSMOTE(data_example[, -3], data_example[, 3])
genData_2 = BLSMOTE(data_example[, -3], data_example[, 3], K = 7, C = 5, method = "type2")
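The SAFE/DANGER/NOISE grouping used by BLSMOTE can be illustrated with a small base R sketch. The thresholds below follow the rule given in the Borderline-SMOTE paper (all neighbors in the majority class: NOISE; at least half: DANGER; otherwise SAFE). Whether the package applies exactly these boundaries is an assumption, and `borderline_group` is a hypothetical helper name.

```r
# Hypothetical helper (not part of smotefamily): label each minority instance
# by how many of its C nearest neighbors belong to the majority class.
borderline_group <- function(n_major_neighbors, C) {
  ifelse(n_major_neighbors == C, "NOISE",
         ifelse(n_major_neighbors >= C / 2, "DANGER", "SAFE"))
}

# Majority-neighbor counts for five minority instances with C = 5
groups <- borderline_group(c(0, 2, 3, 4, 5), C = 5)
print(groups)
```

Only the instances labelled DANGER would then be passed on to the interpolation step.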
Generate an oversampled dataset from an imbalanced dataset using Density-based SMOTE, which uses the density-reachability concept to cluster minority instances and generate synthetic instances.
DBSMOTE(X, target, dupSize = 0, MinPts = NULL, eps = NULL)
X |
A data frame or matrix of numeric-attributed dataset |
target |
A vector of a target class attribute |
dupSize |
A number or vector representing the desired multiple of synthetic minority instances over the original number of majority instances |
MinPts |
The minimum number of instances used to decide whether each instance inside eps is reachable. If no positive integer is given, an automatic algorithm is used to find the value instead. |
eps |
The radius within which instances are considered neighbors. |
data |
The resulting dataset, consisting of the original minority instances, the synthetic minority instances and the original majority instances, with a vector of their respective target classes appended as the last column |
syn_data |
A set of synthetic minority instances with a vector of minority target class appended at the last column |
orig_N |
A set of original instances whose class is not oversampled with a vector of their target class appended at the last column |
orig_P |
A set of original instances whose class is oversampled with a vector of their target class appended at the last column |
K |
Unavailable for this method |
K_all |
Unavailable for this method |
dup_size |
The maximum times of synthetic minority instances over original majority instances in the oversampling |
outcast |
A set of original minority instances which is defined as NOISE/minority outcast |
eps |
The value of parameter eps |
method |
The name of oversampling method used for this generated dataset |
Wacharasak Siriseriwan <[email protected]>
Bunkhumpornpat, C., Sinapiromsaran, K. and Lursinsap, C. 2012. DBSMOTE: Density-based synthetic minority oversampling technique. Applied Intelligence. 36, 664-684.
data_example = sample_generator(5000, ratio = 0.90)
genData = DBSMOTE(data_example[, -3], data_example[, 3])
A function that provides a random number used to determine the location of each synthetic instance. The interval of possible values depends on the safe-level values of the two instances in a pair.
gap(sl_p = 1, sl_n = 1)
sl_p |
The safe-level value of the first instance |
sl_n |
The safe-level value of the second instance |
A value between 0 and 1 that is used to determine the location of the synthetic instance.
If sl_p >= sl_n, it gives a random number between 0 and sl_n/sl_p.
If sl_p < sl_n, it gives a random number between 1-sl_p/sl_n and 1.
Wacharasak Siriseriwan <[email protected]>
r_num = gap()
r_num_2 = gap(sl_p = 4, sl_n = 2)
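The two intervals described for gap can be written out as a short base R sketch. It follows only the rule stated in the value description and assumes both safe levels are positive; `gap_sketch` is a hypothetical name, and the package's own gap function may handle edge cases such as zero safe levels differently.

```r
# Hypothetical re-statement of the documented rule (assumes sl_p > 0 and sl_n > 0).
gap_sketch <- function(sl_p = 1, sl_n = 1) {
  if (sl_p >= sl_n) {
    runif(1, min = 0, max = sl_n / sl_p)       # stay close to the first instance
  } else {
    runif(1, min = 1 - sl_p / sl_n, max = 1)   # stay close to the second instance
  }
}

set.seed(1)
g1 <- gap_sketch(sl_p = 4, sl_n = 2)   # drawn from [0, 0.5]
g2 <- gap_sketch(sl_p = 1, sl_n = 4)   # drawn from [0.75, 1]
print(c(g1, g2))
```

In both cases the interval is shifted toward the instance with the higher safe level, so the synthetic instance lands nearer to the safer end of the pair.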
A function that counts how many neighbors of each instance belong to each class.
kncount(knidex, classArray)
knidex |
The matrix of K nearest neighbor of dataset |
classArray |
The index of last instance of the first class in the dataset or the vector containing indices of last instances of each class. |
The dataset is expected to be sorted so that all m1 instances of the first class occupy the first m1 rows of the dataset, followed by all m2 instances of the second class in the next m2 rows, and so on, before performing the k-nearest-neighbor search with the knearest function.
A matrix with the number of columns equal to the number of classes. Each entry a[i][j] is the number of K nearest neighbors of the i-th instance that belong to the j-th class.
Wacharasak Siriseriwan <[email protected]>
D = sample_generator(1000, ratio = 0.8)
P = D[D[, 3] == "p", ]
N = D[D[, 3] == "n", ]
D_arr = rbind(P, N)
knear = knearest(D_arr[, -3], P[, -3], 5)
kncount_result = kncount(knear, nrow(P))
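Because the dataset is sorted by class, a neighbor's class can be read directly from its row index. The base R sketch below illustrates that counting idea. It is not the package's code: `count_neighbors_by_class` is a hypothetical helper, and it assumes `class_last` lists the last row index of every class except the final one.

```r
# Hypothetical helper (not part of smotefamily).
# knn_index:  matrix of neighbor row indices, one row per instance.
# class_last: last row index of each class except the final class (assumption).
count_neighbors_by_class <- function(knn_index, class_last) {
  breaks <- c(0, class_last, Inf)              # class j covers (breaks[j], breaks[j + 1]]
  n_class <- length(breaks) - 1
  counts <- matrix(0L, nrow = nrow(knn_index), ncol = n_class)
  for (i in seq_len(nrow(knn_index))) {
    cls <- findInterval(knn_index[i, ], breaks, left.open = TRUE)
    counts[i, ] <- tabulate(cls, nbins = n_class)
  }
  counts
}

# Rows 1..3 belong to the first class, rows 4..6 to the second
knn_index <- rbind(c(2, 3, 4), c(4, 5, 6))
res <- count_neighbors_by_class(knn_index, class_last = 3)
print(res)
```

Here the first instance has two neighbors in the first class and one in the second, while all three neighbors of the second instance fall in the second class.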
This function finds the n_clust nearest neighbors of each instance using fast nearest-neighbor search (via the KD-tree method), and corrects the result if it reports an instance's own index among its neighbors.
knearest(D, P, n_clust)
D |
a query data matrix. |
P |
an input data matrix |
n_clust |
the maximum number of nearest neighbors to search |
This function performs a K-nearest-neighbor search of the instances in P on the instances in P based on FNN. It then checks whether any neighbor of each instance is the instance itself and removes it if so.
The index matrix of the K nearest neighbors of each instance
Wacharasak Siriseriwan <[email protected]>
data_example = sample_generator(10000, ratio = 0.80)
P = data_example[data_example[, 3] == "p", -3]
N = data_example[data_example[, 3] == "n", -3]
D = rbind(P, N)
knear = knearest(D, P, n_clust = 5)
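The self-exclusion step that knearest performs can be illustrated without FNN. The sketch below swaps in a brute-force distance search in base R in place of the KD-tree, and it assumes, as in the example above, that the rows of P are the first nrow(P) rows of D. `nearest_excluding_self` is a hypothetical helper, not the package's function.

```r
# Hypothetical brute-force stand-in for the KD-tree search (not part of smotefamily).
# Assumes the rows of P are the first nrow(P) rows of D.
nearest_excluding_self <- function(D, P, k) {
  D <- as.matrix(D)
  P <- as.matrix(P)
  out <- matrix(0L, nrow = nrow(P), ncol = k)
  for (i in seq_len(nrow(P))) {
    d <- sqrt(colSums((t(D) - P[i, ])^2))  # Euclidean distance to every row of D
    d[i] <- Inf                            # never report the instance itself
    out[i, ] <- order(d)[seq_len(k)]       # indices of the k closest rows
  }
  out
}

D <- rbind(c(0, 0), c(0, 1), c(5, 5), c(0, 3))
P <- D[1:2, , drop = FALSE]
nn <- nearest_excluding_self(D, P, k = 2)
print(nn)
```

A brute-force search scales quadratically with the number of instances, which is why a KD-tree method is preferable for datasets of the size used in the examples.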
A function that calculates the maximum number of rounds each sampling is repeated. If dup_size is given as 0, it calculates the maximum number of rounds the positive instances must be duplicated to nearly match the number of negative instances.
n_dup_max(size_input, size_P, size_N, dup_size = 0)
size_input |
The size of overall dataset |
size_P |
The number of positive instances |
size_N |
The number of negative instances |
dup_size |
A number or vector of the number of times to be duplicated. The default is zero which means duplicating until nearly balanced. |
If dup_size is zero or contains zero, the number of rounds needed to duplicate the positive instances until they nearly equal the number of negative instances.
If dup_size is non-zero and contains no zero, the maximum value in dup_size.
Wacharasak Siriseriwan <[email protected]>
data_example = sample_generator(10000, ratio = 0.80)
P = data_example[data_example[, 3] == "p", -3]
N = data_example[data_example[, 3] == "n", -3]
D = rbind(P, N)
max_round = n_dup_max(nrow(D), nrow(P), nrow(N), dup_size = 0)
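The two cases in the return value of n_dup_max can be sketched as follows. This is a simplified illustration, not the package's code: `n_dup_sketch` is a hypothetical name, it omits the size_input argument, and the use of `floor()` to get "nearly balanced" is an assumption about the rounding.

```r
# Hypothetical simplification of the documented rule (not part of smotefamily).
n_dup_sketch <- function(size_P, size_N, dup_size = 0) {
  if (any(dup_size == 0)) {
    # rounds of duplication needed for positives to nearly match negatives
    floor((size_N - size_P) / size_P)
  } else {
    max(dup_size)
  }
}

balanced_rounds <- n_dup_sketch(size_P = 2000, size_N = 8000)
explicit_rounds <- n_dup_sketch(size_P = 2000, size_N = 8000, dup_size = c(1, 4, 2))
print(c(balanced_rounds, explicit_rounds))
```

With 2000 positive and 8000 negative instances, three rounds of synthetic instances bring the positive class up to the size of the negative class in this sketch.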
Generate synthetic positive instances using the Relocating Safe-level SMOTE algorithm. It uses the "Safe-Level" parameter to determine possible locations and relocates synthetic instances that are too close to majority instances.
RSLS(X, target, K = 5, C = 5, dupSize = 0)
X |
A data frame or matrix of numeric-attributed dataset |
target |
A vector of a target class attribute corresponding to a dataset X. |
K |
The number of nearest neighbors used during the sampling process |
C |
The number of nearest neighbors used when calculating the safe level |
dupSize |
A number or vector representing the desired multiple of synthetic minority instances over the original number of majority instances |
data |
The resulting dataset, consisting of the original minority instances, the synthetic minority instances and the original majority instances, with a vector of their respective target classes appended as the last column |
syn_data |
A set of synthetic minority instances with a vector of minority target class appended at the last column |
orig_N |
A set of original instances whose class is not oversampled with a vector of their target class appended at the last column |
orig_P |
A set of original instances whose class is oversampled with a vector of their target class appended at the last column |
K |
The value of parameter K for nearest neighbor process used for generating data |
K_all |
The value of parameter C for nearest neighbor process used for calculating safe-level |
dup_size |
The maximum times of synthetic minority instances over original majority instances in the oversampling |
outcast |
A set of original minority instances that have a safe level equal to zero and are defined as minority outcasts |
eps |
Unavailable for this method |
method |
The name of oversampling method used for this generated dataset (RSLS) |
Wacharasak Siriseriwan <[email protected]>
Siriseriwan, W. and Sinapiromsaran, K. The Effective Redistribution for Imbalance Dataset : Relocating Safe-Level SMOTE with Minority Outcast Handling. Chiang Mai Journal of Science. 43(1), 234 - 246.
library(smotefamily)
data_example = sample_generator(5000, ratio = 0.80)
genData = RSLS(data_example[, -3], data_example[, 3])
genData_2 = RSLS(data_example[, -3], data_example[, 3], K = 7, C = 5)
A function that generates a 2-dimensional dataset given the number of instances and the ratio of the number of negative instances to the total number of instances. The positive instances are distributed uniformly within a circle in the center, while the negative instances are spread around it over the domain. Random positive outcasts are also generated. The dataset is used to show the differences between datasets generated by each sampling technique.
sample_generator(n, ratio = 0.8, xlim = c(0, 1), ylim = c(0, 1), radius = 0.25, overlap = -0.05, outcast_ratio = 0.01)
n |
The number of instances in the dataset |
ratio |
The ratio of negative instances to the total number of instances |
xlim |
The range of values in the first dimension |
ylim |
The range of values in the second dimension |
radius |
The radius of the circle of positive instances |
overlap |
The gap between the set of positive and negative instances |
outcast_ratio |
The ratio of outcasts to be generated in this dataset. |
A 2-dimensional dataset with the 3rd column as its target class vector.
Wacharasak Siriseriwan <[email protected]>
data_example = sample_generator(5000, ratio = 0.80)
plot(data_example[data_example[, 3] == "n", 1], data_example[data_example[, 3] == "n", 2], col = "yellow")
points(data_example[data_example[, 3] == "p", 1], data_example[data_example[, 3] == "p", 2], col = "red", pch = 14)
Generate synthetic positive instances using the Safe-level SMOTE algorithm. It uses the "Safe-level" parameter to determine the possible locations of synthetic instances.
SLS(X, target, K = 5, C = 5, dupSize = 0)
X |
A data frame or matrix of numeric-attributed dataset |
target |
A vector of a target class attribute corresponding to a dataset X. |
K |
The number of nearest neighbors used during the sampling process |
C |
The number of nearest neighbors used when calculating the safe level |
dupSize |
A number or vector representing the desired multiple of synthetic minority instances over the original number of majority instances |
data |
The resulting dataset, consisting of the original minority instances, the synthetic minority instances and the original majority instances, with a vector of their respective target classes appended as the last column |
syn_data |
A set of synthetic minority instances with a vector of minority target class appended at the last column |
orig_N |
A set of original instances whose class is not oversampled with a vector of their target class appended at the last column |
orig_P |
A set of original instances whose class is oversampled with a vector of their target class appended at the last column |
K |
The value of parameter K for nearest neighbor process used for generating data |
K_all |
The value of parameter C for nearest neighbor process used for calculating safe-level |
dup_size |
The maximum times of synthetic minority instances over original majority instances in the oversampling |
outcast |
A set of original minority instances that have a safe level equal to zero and are defined as minority outcasts |
eps |
Unavailable for this method |
method |
The name of oversampling method used for this generated dataset (SLS) |
Wacharasak Siriseriwan <[email protected]>
Bunkhumpornpat, C., Sinapiromsaran, K. and Lursinsap, C. 2009. Safe-level-SMOTE: Safe-level-synthetic minority oversampling technique for handling the class imbalanced problem. Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining. 2009, 475-482.
data_example = sample_generator(5000, ratio = 0.80)
genData = SLS(data_example[, -3], data_example[, 3])
genData_2 = SLS(data_example[, -3], data_example[, 3], K = 7, C = 5)
Generate synthetic positive instances using the SMOTE algorithm.
SMOTE(X, target, K = 5, dup_size = 0)
X |
A data frame or matrix of numeric-attributed dataset |
target |
A vector of a target class attribute corresponding to a dataset X. |
K |
The number of nearest neighbors used during the sampling process |
dup_size |
A number or vector representing the desired multiple of synthetic minority instances over the original number of majority instances |
data |
The resulting dataset, consisting of the original minority instances, the synthetic minority instances and the original majority instances, with a vector of their respective target classes appended as the last column |
syn_data |
A set of synthetic minority instances with a vector of minority target class appended at the last column |
orig_N |
A set of original instances whose class is not oversampled with a vector of their target class appended at the last column |
orig_P |
A set of original instances whose class is oversampled with a vector of their target class appended at the last column |
K |
The value of parameter K for nearest neighbor process used for generating data |
K_all |
Unavailable for this method |
dup_size |
The maximum times of synthetic minority instances over original majority instances in the oversampling |
outcast |
Unavailable for this method |
eps |
Unavailable for this method |
method |
The name of oversampling method used for this generated dataset (SMOTE) |
Wacharasak Siriseriwan <[email protected]>
Chawla, N., Bowyer, K., Hall, L. and Kegelmeyer, W. 2002. SMOTE: Synthetic minority oversampling technique. Journal of Artificial Intelligence Research. 16, 321-357.
data_example = sample_generator(10000, ratio = 0.80)
genData = SMOTE(data_example[, -3], data_example[, 3])
genData_2 = SMOTE(data_example[, -3], data_example[, 3], K = 7)
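The core step of SMOTE, placing a synthetic instance between a minority instance and one of its K nearest neighbors, can be sketched in a few lines of base R. This is a minimal illustration of the interpolation only, not the package's implementation. The helper `smote_point` is hypothetical and takes the neighbors as a ready-made matrix instead of searching for them.

```r
# Hypothetical helper (not part of smotefamily): create one synthetic instance
# on the line segment between x and a randomly chosen neighbor.
smote_point <- function(x, neighbors) {
  nb <- neighbors[sample(nrow(neighbors), 1), ]  # pick one neighbor at random
  u <- runif(1)                                   # position along the segment
  x + u * (nb - x)
}

set.seed(42)
x <- c(0, 0)
neighbors <- rbind(c(1, 0), c(0, 2))
s <- smote_point(x, neighbors)
print(s)
```

The variants documented above keep this interpolation step and change which instances are used as seeds, which neighbors are eligible, or how the position along the segment is drawn.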
A collection of the SMOTE algorithm and some of its variants for oversampling numeric data
This package collects several oversampling techniques for imbalanced data that were part of my doctoral research. Data used with the techniques in this package must be entirely numeric, with one nominal attribute serving as the target class.
Wacharasak Siriseriwan <[email protected]>
Chawla, N., Bowyer, K., Hall, L. and Kegelmeyer, W. 2002. SMOTE: Synthetic minority oversampling technique. Journal of Artificial Intelligence Research. 16, 321-357.
Bunkhumpornpat, C., Sinapiromsaran, K. and Lursinsap, C. 2009. Safe-level-SMOTE: Safe-level-synthetic minority oversampling technique for handling the class imbalanced problem. Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining. 2009, 475-482.
Bunkhumpornpat, C., Sinapiromsaran, K. and Lursinsap, C. 2012. DBSMOTE: Density-based synthetic minority oversampling technique. Applied Intelligence. 36, 664-684.
Siriseriwan, W. and Sinapiromsaran, K. The Effective Redistribution for Imbalance Dataset : Relocating Safe-Level SMOTE with Minority Outcast Handling. Chiang Mai Journal of Science. 43(1), 234 - 246.
Siriseriwan, W. and Sinapiromsaran, K. Adaptive neighbor Synthetic Minority Oversampling TEchnique under 1NN outcast handling. Songklanakarin Journal of Science and Technology.
Han, H., Wang, W.Y. and Mao, B.H. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In Proceedings of the 2005 international conference on Advances in Intelligent Computing - Volume Part I (ICIC'05), De-Shuang Huang, Xiao-Ping Zhang, and Guang-Bin Huang (Eds.), Vol. Part I. Springer-Verlag, Berlin, Heidelberg, 2005. 878-887. DOI=http://dx.doi.org/10.1007/11538059_91
SMOTE
SLS
DBSMOTE
RSLS
ANS
BLSMOTE
## Not run:
data_example = sample_generator(10000, ratio = 0.80)
genData = SMOTE(data_example[, -3], data_example[, 3])
genData_2 = SMOTE(data_example[, -3], data_example[, 3], K = 7)
## End(Not run)