| Title: | Robust Oversampling with RM-SMOTE for Imbalanced Classification |
|---|---|
| Description: | Provides the ROBOSRMSMOTE (Robust Oversampling with RM-SMOTE) framework for imbalanced classification tasks. This package extends Mahalanobis distance-based oversampling techniques by integrating robust covariance estimators to better handle outliers and complex data distributions. The implemented methodology builds upon and significantly expands the RM-SMOTE algorithm originally proposed by Taban et al. (2025) <doi:10.1007/s10260-025-00819-8>. |
| Authors: | Emre Dunder [aut], Mehmet Ali Cengiz [aut], Zainab Subhi Mahmood Hawrami [aut, cre], Abdulmohsen Alharthi [aut] |
| Maintainer: | Zainab Subhi Mahmood Hawrami <[email protected]> |
| License: | GPL-3 |
| Version: | 1.0.0 |
| Built: | 2026-05-09 09:32:05 UTC |
| Source: | https://github.com/cran/ROBOSRMSMOTE |
Computes a robust estimate of the center (location) and covariance matrix for a given dataset using one of seven supported robust estimators.
get_robust_cov(data, method = "mcd")
data |
A numeric matrix or data frame containing only the feature columns (no class column). Rows are observations, columns are variables. |
method |
A character string specifying the robust covariance estimator. One of "mcd" (default), "mve", "mest", "mmest", "sde", "sest", or "ogk". |
The following estimators are available via the rrcov package:
mcd: Minimum Covariance Determinant (Rousseeuw & Van Driessen, 1999). The default and most widely used robust estimator. Suitable for most cases.
mve: Minimum Volume Ellipsoid (Rousseeuw & Van Zomeren, 1990). An alternative to MCD, generally slower.
mest: M-estimator of location and scatter. Iteratively re-weighted least squares approach.
mmest: MM-estimator. Combines a high breakdown point with high efficiency.
sde: Stahel-Donoho Estimator. Projection-based robust estimator, useful for high-dimensional data.
sest: S-estimator. High breakdown point estimator based on minimizing a robust scale.
ogk: Orthogonalized Gnanadesikan-Kettenring estimator. Fast and stable for moderate dimensions.
A list with two elements:
center: A numeric vector of length ncol(data) representing the robust location estimate.
cov: A numeric matrix of size ncol(data) x ncol(data) representing the robust covariance matrix estimate.
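The value of a robust estimate is easiest to see on contaminated data. The following is a minimal sketch, not the package's implementation: it assumes the MASS package is available and uses MASS::cov.rob() as a stand-in for the rrcov-based estimators behind get_robust_cov(). It compares the classical and MCD location estimates when 10% of the rows are shifted outliers.

```r
# Sketch only: MASS::cov.rob() stands in for the package's rrcov-based estimators.
library(MASS)

set.seed(42)
clean    <- matrix(rnorm(90 * 3), nrow = 90, ncol = 3)             # bulk, centred at 0
outliers <- matrix(rnorm(10 * 3, mean = 10), nrow = 10, ncol = 3)  # 10% contamination
X <- rbind(clean, outliers)

# Classical estimates are pulled towards the outliers
classical <- list(center = colMeans(X), cov = cov(X))

# MCD estimate; the returned list contains $center and $cov among other elements
robust <- MASS::cov.rob(X, method = "mcd")

# Euclidean distance of each location estimate from the true centre (0, 0, 0)
classical_err <- sqrt(sum(classical$center^2))
robust_err    <- sqrt(sum(robust$center^2))

print(round(classical$center, 2))
print(round(robust$center, 2))
```

The list shape (a center vector of length 3 and a 3 x 3 cov matrix) mirrors the return value described above.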
Rousseeuw, P.J. and Van Driessen, K. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41(3), 212-223.
Todorov, V. and Filzmoser, P. (2009). An object-oriented framework for robust multivariate analysis. Journal of Statistical Software, 32(3), 1-47.
# Generate a simple numeric dataset
set.seed(42)
X <- matrix(rnorm(100 * 3), nrow = 100, ncol = 3)

# MCD estimator (default)
result_mcd <- get_robust_cov(X, method = "mcd")
result_mcd$center
result_mcd$cov

# OGK estimator
result_ogk <- get_robust_cov(X, method = "ogk")
result_ogk$center
Binary imbalanced dataset from the Haberman survival study (1958-1970). The minority class represents patients who did not survive 5+ years after breast cancer surgery.
haberman
A data frame with 306 rows and 4 columns:
Age of patient at operation (numeric).
Year of operation, 1958-1969 (numeric).
Number of positive axillary nodes detected (numeric).
Class label: "negative" = survived 5+ years (n = 225); "positive" = did not survive (n = 81). Imbalance ratio (IR) = 225/81 ≈ 2.78.
KEEL Repository https://sci2s.ugr.es/keel/. Used as benchmark dataset in Hawrami et al. (2025).
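The imbalance ratio quoted above follows directly from the class counts. The short sketch below reproduces it. The final step is an assumption rather than documented behaviour: it treats eIR as the majority-to-minority ratio targeted after oversampling, which gives the number of synthetic minority rows needed for eIR = 1.

```r
# Class counts as stated in the dataset description
counts <- c(negative = 225, positive = 81)

# Imbalance ratio: majority class size divided by minority class size
ir <- max(counts) / min(counts)
print(round(ir, 2))

# Assumption: eIR is the majority/minority ratio targeted after oversampling,
# so the number of synthetic minority rows is n_majority / eIR - n_minority.
eIR <- 1
n_synthetic <- unname(ceiling(max(counts) / eIR) - min(counts))
print(n_synthetic)
```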
data(haberman)
table(haberman$class)
balanced <- ROBOS_RM_SMOTE(dt = haberman, target = "positive", eIR = 1)
table(balanced$class)
Generates synthetic minority class observations using a robust version of SMOTE as part of the ROBOSRMSMOTE (Robust Oversampling with RM-SMOTE) framework. Atypical minority class observations (outliers) are down-weighted based on their robust Mahalanobis distance so that they have a lower probability of being selected as parents in the resampling step. The k-nearest neighbours of each candidate parent are also found using the robust Mahalanobis distance rather than the standard Euclidean distance.
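A neighbour search under the Mahalanobis distance can be reduced to an ordinary Euclidean search on whitened data. The sketch below illustrates that idea in base R. The helper mahalanobis_knn() is hypothetical and not part of the package, and the covariance matrix S is passed in by the caller (in the package's setting it would be a robust estimate; the usage lines use the classical cov() only to stay self-contained).

```r
# Hypothetical helper: indices of the k nearest neighbours of every row of X
# under the Mahalanobis distance induced by the covariance matrix S.
mahalanobis_knn <- function(X, S, k) {
  X <- as.matrix(X)
  # Whitening: with S = R'R (Cholesky factor R), the Euclidean distance between
  # rows of X %*% solve(R) equals the Mahalanobis distance between rows of X.
  R <- chol(S)
  Z <- X %*% solve(R)
  D <- as.matrix(dist(Z))
  diag(D) <- Inf  # a point is not its own neighbour
  nn <- apply(D, 1, function(d) order(d)[seq_len(k)])
  # Row i holds the indices of the k closest observations to observation i
  matrix(nn, ncol = k, byrow = TRUE)
}

set.seed(1)
X <- cbind(x1 = rnorm(30), x2 = rnorm(30, sd = 5))
nn <- mahalanobis_knn(X, S = cov(X), k = 5)
dim(nn)
```

Because x2 has a much larger scale than x1, a Euclidean search on the raw columns would be dominated by x2; the whitening step removes that effect.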
ROBOS_RM_SMOTE(
  dt,
  target = "positive",
  dup_size = 0,
  eIR = 1,
  k = 5,
  threshold = 0.01,
  weight_func = 1,
  cov_method = "mcd"
)
dt |
A data frame containing the full (imbalanced) training set. Must
include a column named |
target |
A character string identifying the minority class label in
the |
dup_size |
A non-negative numeric value. When |
eIR |
Expected imbalance ratio after oversampling. Used only when
|
k |
A positive integer specifying the number of nearest neighbours used in the SMOTE resampling step. Default is 5. |
threshold |
A numeric value in |
weight_func |
An integer (1, 2, or 3) passed to
|
cov_method |
A character string passed to |
The algorithm proceeds as follows (Algorithm 1 in Taban et al., 2025):
1. Extract the minority class observations $X_{\min}$.
2. Robustly estimate the mean vector $\hat{\mu}$ and covariance matrix $\hat{\Sigma}$ of $X_{\min}$ using the selected cov_method.
3. Compute the squared robust Mahalanobis distance $d_i^2 = (x_i - \hat{\mu})^\top \hat{\Sigma}^{-1} (x_i - \hat{\mu})$ for every minority observation $x_i$.
4. Apply the selected weighting function to obtain a probability distribution $p$ over $X_{\min}$.
5. Build the k-nearest neighbour graph over $X_{\min}$ using the robust Mahalanobis distance.
6. Repeat until the desired number of synthetic observations is reached:
   a. Sample the first parent $x_i$ according to $p$.
   b. Choose the second parent $x_j$ uniformly from the k neighbours of $x_i$.
   c. Generate $x_{new} = x_i + u (x_j - x_i)$, where $u \sim U(0, 1)$.
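The steps of the algorithm can be sketched end to end in a few lines of R. This is an illustration under stated assumptions, not the package's code: rm_smote_sketch() is a hypothetical name, MASS::cov.rob() replaces the rrcov estimators, the weighting is assumed to be hard exclusion beyond the chi-square quantile at 1 - threshold, neighbours are drawn from all minority rows, and the interpolation is the standard SMOTE rule of Chawla et al. (2002).

```r
library(MASS)  # cov.rob() is a stand-in for the package's rrcov-based estimators

# Hypothetical sketch; X_min is a numeric matrix of minority class rows.
rm_smote_sketch <- function(X_min, n_new, k = 5, threshold = 0.01) {
  X_min <- as.matrix(X_min)
  p <- ncol(X_min)

  # Robust location/scatter and squared robust Mahalanobis distances
  est <- MASS::cov.rob(X_min, method = "mcd")
  md2 <- mahalanobis(X_min, center = est$center, cov = est$cov)

  # Assumed hard-exclusion rule: zero weight beyond the chi-square cutoff
  w <- as.numeric(md2 <= qchisq(1 - threshold, df = p))
  prob <- w / sum(w)

  synthetic <- matrix(NA_real_, nrow = n_new, ncol = p,
                      dimnames = list(NULL, colnames(X_min)))
  for (s in seq_len(n_new)) {
    # First parent, drawn according to the selection probabilities
    i <- sample(nrow(X_min), size = 1, prob = prob)
    # k nearest neighbours of parent i under the robust metric
    d <- mahalanobis(X_min, center = X_min[i, ], cov = est$cov)
    d[i] <- Inf
    neighbours <- order(d)[seq_len(k)]
    # Second parent, chosen uniformly among the neighbours
    j <- neighbours[sample.int(k, 1)]
    # Standard SMOTE interpolation between the two parents
    u <- runif(1)
    synthetic[s, ] <- X_min[i, ] + u * (X_min[j, ] - X_min[i, ])
  }
  synthetic
}

set.seed(42)
X_min <- cbind(x1 = c(rnorm(18), 10, 12), x2 = c(rnorm(18), 9, 11))
new_rows <- rm_smote_sketch(X_min, n_new = 10, k = 5)
dim(new_rows)
```

Each synthetic row lies on the segment between its two parents, so it always stays inside the coordinate-wise range of the minority data.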
A data frame with the same columns as dt, containing the
original observations plus the newly generated synthetic minority class
observations. Row names are reset to NULL.
Dunder, E., Cengiz, M.A., Hawrami, Z.S.M. and Alharthi, A. (2025). Robust Covariance-Based Oversampling Strategies for Imbalanced Classification. Manuscript in preparation.
Taban, R., Nunes, C. and Oliveira, M.R. (2025). RM-SMOTE: a new robust balancing technique. Statistical Methods & Applications. doi:10.1007/s10260-025-00819-8
Chawla, N.V., Bowyer, K.W., Hall, L.O. and Kegelmeyer, W.P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357.
# Load the package example dataset
data(haberman)

# Basic usage: balance with MCD (default) and hard exclusion
balanced <- ROBOS_RM_SMOTE(dt = haberman, target = "positive", eIR = 1)
table(balanced$class)

# Use MVE estimator and soft weighting (omega_B)
balanced_mve <- ROBOS_RM_SMOTE(dt = haberman, target = "positive", eIR = 1,
                               cov_method = "mve", weight_func = 2)
table(balanced_mve$class)

# Control exact number of synthetic samples with dup_size
balanced_dup <- ROBOS_RM_SMOTE(dt = haberman, target = "positive", dup_size = 2,
                               cov_method = "ogk")
table(balanced_dup$class)
For each minority class observation, computes the robust Mahalanobis distance (MD) to the class center and assigns a weight based on the chosen weighting function. Observations flagged as outliers (MD exceeds the chi-square threshold) receive reduced or zero weight, lowering their probability of being selected as parents in the SMOTE resampling step.
weighting(data, threshold = 0.01, weight_func = 1, cov_method = "mcd")
data |
A data frame of minority class observations. The last column
must be the class label column named |
threshold |
A numeric value in |
weight_func |
An integer (1, 2, or 3) selecting the weighting function applied to outlier observations:
|
cov_method |
A character string passed to |
The input data frame with three additional columns appended:
MD: Squared robust Mahalanobis distance for each observation.
weights: Raw weight assigned to each observation (1 for non-outliers, reduced for outliers).
prob: Normalised selection probability derived from weights. Sums to 1 across all rows.
Taban, R., Nunes, C. and Oliveira, M.R. (2025). RM-SMOTE: a new robust balancing technique. Statistical Methods & Applications. doi:10.1007/s10260-025-00819-8
get_robust_cov, ROBOS_RM_SMOTE
# Create a small imbalanced dataset
set.seed(42)
minority <- data.frame(
  x1 = c(rnorm(18), 10, 12),  # last two are outliers
  x2 = c(rnorm(18), 9, 11),
  class = "positive"
)

# Weight with hard exclusion (omega_A)
result <- weighting(minority, threshold = 0.01, weight_func = 1)
table(result$weights)  # outliers get weight 0

# Weight with soft inverse (omega_B)
result2 <- weighting(minority, threshold = 0.01, weight_func = 2)
round(result2$prob, 4)
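For readers who want to see how MD, weights and prob relate, the following base R sketch rebuilds the three quantities by hand on the same toy data. It rests on several assumptions that are not confirmed by this manual: MASS::cov.rob() replaces the rrcov estimator, the outlier cutoff is taken as the chi-square quantile at 1 - threshold with ncol degrees of freedom, and the soft rule cutoff / MD is a hypothetical stand-in for the package's omega_B.

```r
library(MASS)  # stand-in for the package's rrcov-based estimators

set.seed(42)
X <- cbind(x1 = c(rnorm(18), 10, 12), x2 = c(rnorm(18), 9, 11))

# Squared robust Mahalanobis distance of every row to the robust centre
est <- MASS::cov.rob(X, method = "mcd")
md2 <- mahalanobis(X, center = est$center, cov = est$cov)

# Assumed cutoff: chi-square quantile at 1 - threshold, df = number of features
threshold <- 0.01
cutoff <- qchisq(1 - threshold, df = ncol(X))
is_outlier <- md2 > cutoff

# Hard exclusion: outliers can never be selected as parents
w_hard <- ifelse(is_outlier, 0, 1)
# Hypothetical soft rule: outlier weight shrinks as the distance grows
w_soft <- ifelse(is_outlier, cutoff / md2, 1)

# Normalise the raw weights into selection probabilities
prob_hard <- w_hard / sum(w_hard)
prob_soft <- w_soft / sum(w_soft)

print(which(is_outlier))
print(round(prob_soft, 4))
```

Under the hard rule the two planted outliers receive probability zero, while under the soft rule they keep a small but positive chance of being selected.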