| Title: | Unified Interface for Ensemble Machine Learning Methods |
|---|---|
| Description: | Provides a clean, unified interface for training, predicting, and evaluating ensemble machine learning models including Random Forest, Gradient Boosting ('XGBoost'), 'AdaBoost', and 'Bagging'. All algorithms share a consistent API: em_fit(), em_predict(), em_evaluate(), and em_tune(). Includes built-in cross-validation, feature importance, calibration diagnostics, partial dependence plots, and model comparison utilities. Methods: Breiman (2001) <doi:10.1023/A:1010933404324>; Chen and Guestrin (2016) <doi:10.1145/2939672.2939785>; Freund and Schapire (1997) <doi:10.1006/jcss.1997.1504>; Breiman (1996) <doi:10.1007/BF00058655>. |
| Authors: | Sadikul Islam [aut, cre] (ORCID: <https://orcid.org/0000-0003-2924-7122>) |
| Maintainer: | Sadikul Islam <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.2.5 |
| Built: | 2026-06-05 19:56:53 UTC |
| Source: | https://github.com/cran/ensembleML |
A clean, consistent API for ensemble machine learning covering training, prediction, evaluation, tuning, diagnostics, and model comparison.
em_fit(): Train any supported ensemble model
em_predict(): Generate predictions or class probabilities
em_evaluate(): Compute held-out performance metrics
em_cv(): k-fold cross-validation for a single model
em_tune(): Grid-search hyperparameter tuning via cross-validation
em_compare(): Side-by-side comparison of multiple algorithms
em_importance(): Feature importance extraction and visualisation
em_confusion(): Styled confusion matrix (classification)
em_calibration(): Calibration / reliability diagram (classification)
em_residuals(): Residual diagnostics plot (regression)
em_partial(): Partial dependence plot for one predictor
Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32. doi:10.1023/A:1010933404324
Chen, T. and Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794. doi:10.1145/2939672.2939785
Freund, Y. and Schapire, R. E. (1997). A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55(1), 119–139. doi:10.1006/jcss.1997.1504
Breiman, L. (1996). Bagging Predictors. Machine Learning, 24(2), 123–140. doi:10.1007/BF00058655
Checks how well predicted class probabilities match observed frequencies. Binary classification only. A well-calibrated model lies on the diagonal.

    em_calibration(object, newdata, n_bins = 10L, positive = NULL)

| Argument | Description |
|---|---|
| object | An ensembleML_model object. |
| newdata | A data.frame. |
| n_bins | Integer. Number of probability bins. Default 10L. |
| positive | Character. The positive class. Defaults to the second level. |

A data.frame of bin midpoints, mean predicted probability, and
observed fraction (invisibly).
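
The binning behind a reliability diagram can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the helper name calibration_bins and its column names are hypothetical, and equal-width bins on [0, 1] are assumed.

```r
# Hypothetical helper (not part of ensembleML): equal-width probability bins.
calibration_bins <- function(prob, observed, n_bins = 10L) {
  stopifnot(length(prob) == length(observed), all(prob >= 0 & prob <= 1))
  breaks <- seq(0, 1, length.out = n_bins + 1L)
  # include.lowest = TRUE keeps prob == 0 in the first bin
  bin <- cut(prob, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  mids <- (breaks[-1] + breaks[-length(breaks)]) / 2
  used <- sort(unique(bin))          # skip empty bins
  key <- as.character(used)
  data.frame(
    bin_mid = mids[used],
    mean_predicted = as.numeric(tapply(prob, bin, mean)[key]),
    observed_fraction = as.numeric(tapply(observed, bin, mean)[key]),
    n = as.integer(table(bin)[key])
  )
}

# Simulated data: outcomes are drawn with probability p, so the
# "predictions" are calibrated by construction.
set.seed(42)
p <- runif(500)
y <- rbinom(500, size = 1, prob = p)
cal <- calibration_bins(p, y, n_bins = 10L)
print(cal)
```

For a well-calibrated model, mean_predicted and observed_fraction should be close in every bin, which corresponds to points near the diagonal.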
Trains several ensemble algorithms on the same train/test split and returns a tidy comparison table plus an optional bar chart. Useful for algorithm selection before committing to hyperparameter tuning.

    em_compare(
      formula, train, test,
      methods = NULL, task = NULL, metrics = NULL, sort_by = NULL,
      plot = TRUE, verbose = TRUE, ...
    )

| Argument | Description |
|---|---|
| formula | A formula. |
| train | A data.frame of training data. |
| test | A data.frame of held-out test data. |
| methods | Character vector of algorithms to compare. Defaults to all algorithms appropriate for the detected task. |
| task | "classification" or "regression". Default NULL (detected automatically). |
| metrics | Metrics to compute (forwarded to em_evaluate()). |
| sort_by | Character. Metric to sort by in the table. Defaults to the first computed metric. |
| plot | Logical. Print a bar chart? Default TRUE. |
| verbose | Logical. Print fitting progress messages? Default TRUE. |
| ... | Extra arguments forwarded to em_fit(). |

A list:
table: data.frame of algorithms × metrics, sorted by sort_by.
models: Named list of fitted ensembleML_model objects.
fit_times: Named numeric vector of training times (seconds).
plot: A ggplot bar chart (if plot = TRUE).

    data(iris)
    set.seed(42)
    idx <- sample(nrow(iris), 120)
    cmp <- em_compare(Species ~ ., train = iris[idx, ], test = iris[-idx, ])
    cmp$table

Computes and visualises a confusion matrix with per-class recall (sensitivity) on the diagonal. For classification tasks only.

    em_confusion(object, newdata, normalise = FALSE, plot = TRUE)

| Argument | Description |
|---|---|
| object | An ensembleML_model object. |
| newdata | A data.frame. |
| normalise | Logical. Show row-normalised proportions instead of raw counts? Default FALSE. |
| plot | Logical. Print a ggplot2 heatmap? Default TRUE. |

The confusion matrix table (invisibly).

    data(iris)
    set.seed(1)
    idx <- sample(nrow(iris), 120)
    m <- em_fit(Species ~ ., data = iris[idx, ], method = "random_forest")
    em_confusion(m, iris[-idx, ])

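
To make the link between row normalisation and per-class recall concrete, here is a small base R sketch. It assumes rows hold the actual class and columns the predicted class; the labels are made up for illustration and nothing here calls ensembleML.

```r
# Made-up labels for a three-class problem (illustration only).
actual <- factor(c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c"),
                 levels = c("a", "b", "c"))
predicted <- factor(c("a", "a", "b", "b", "b", "c", "c", "a", "c", "c"),
                    levels = levels(actual))

# Raw counts: rows = actual class, columns = predicted class.
cm <- table(Actual = actual, Predicted = predicted)

# Row-normalising divides each row by the number of actual cases in that
# class, so the diagonal becomes per-class recall (sensitivity).
cm_norm <- prop.table(cm, margin = 1)
recall <- diag(cm_norm)

print(cm)
print(round(cm_norm, 3))
print(recall)
```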
Estimates generalisation performance of a model specification via repeated k-fold cross-validation. Returns fold-level metrics and aggregate statistics (mean ± SD), helping you assess stability as well as average performance.

    em_cv(
      formula, data,
      method = "random_forest", task = NULL, metrics = NULL,
      cv_folds = 5L, repeats = 1L, seed = 42L, verbose = TRUE, ...
    )

| Argument | Description |
|---|---|
| formula | A formula. |
| data | A data.frame. |
| method | Algorithm name (see em_fit()). Default "random_forest". |
| task | "classification" or "regression". Default NULL (detected automatically). |
| metrics | Character vector of metrics. Defaults to task-appropriate set. |
| cv_folds | Integer. Number of folds. Default 5L. |
| repeats | Integer. Number of complete CV repeats (increases stability of estimates). Default 1L. |
| seed | Integer for reproducibility. Default 42L. |
| verbose | Logical. Print fold progress? Default TRUE. |
| ... | Extra arguments forwarded to em_fit(). |

A list with:
summary: data.frame of mean, SD, min, max per metric.
fold_results: data.frame of per-fold metric values.
cv_folds: Number of folds used.
repeats: Number of repeats used.

    data(iris)
    cv_result <- em_cv(Species ~ ., data = iris, method = "random_forest",
                       cv_folds = 5, repeats = 3)
    cv_result$summary

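
The mechanics of repeated k-fold cross-validation can be sketched in base R. The code below is an illustration only and makes no claim about how em_cv() is implemented: cv_accuracy, fit_centroid, and predict_centroid are hypothetical names, and a simple nearest-centroid classifier stands in for an ensemble model so the sketch has no package dependencies.

```r
# Hypothetical sketch: repeated k-fold CV with accuracy as the metric.
cv_accuracy <- function(data, response, fit_fun, predict_fun,
                        cv_folds = 5L, repeats = 1L, seed = 42L) {
  set.seed(seed)
  rows <- list()
  for (r in seq_len(repeats)) {
    # Shuffled fold labels: each row is held out exactly once per repeat.
    fold_id <- sample(rep(seq_len(cv_folds), length.out = nrow(data)))
    for (k in seq_len(cv_folds)) {
      train <- data[fold_id != k, , drop = FALSE]
      test <- data[fold_id == k, , drop = FALSE]
      pred <- predict_fun(fit_fun(train), test)
      acc <- mean(as.character(pred) == as.character(test[[response]]))
      rows[[length(rows) + 1L]] <- data.frame(rep = r, fold = k, accuracy = acc)
    }
  }
  fold_results <- do.call(rbind, rows)
  a <- fold_results$accuracy
  list(
    summary = data.frame(metric = "accuracy", mean = mean(a), sd = sd(a),
                         min = min(a), max = max(a)),
    fold_results = fold_results
  )
}

# Stand-in learner: class centroids of the four iris measurements.
fit_centroid <- function(train) {
  sapply(split(train[, 1:4], train$Species), colMeans)  # features x classes
}
predict_centroid <- function(centroids, test) {
  x <- as.matrix(test[, rownames(centroids)])
  d <- sapply(colnames(centroids), function(cl) {
    rowSums(sweep(x, 2, centroids[, cl])^2)  # squared distance to centroid
  })
  d <- matrix(d, nrow = nrow(x))
  colnames(centroids)[max.col(-d, ties.method = "first")]
}

res <- cv_accuracy(iris, "Species", fit_centroid, predict_centroid,
                   cv_folds = 5L, repeats = 3L)
print(res$summary)
```

A large SD relative to the mean in the summary suggests the estimate depends heavily on how the folds happened to be drawn, which is the instability that additional repeats help to expose.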
Compute held-out performance metrics for a fitted ensembleML_model.

    em_evaluate(object, newdata, metrics = NULL, positive = NULL)

| Argument | Description |
|---|---|
| object | An ensembleML_model object. |
| newdata | A data.frame. |
| metrics | Character vector of metrics to compute, e.g. "accuracy" or "f1" for classification. Default NULL. |
| positive | Character. Positive class for binary precision/recall/F1. Defaults to the second factor level (conventional for binary tasks). |

A named numeric vector of metric values.

    data(iris)
    set.seed(42)
    idx <- sample(nrow(iris), 120)
    m <- em_fit(Species ~ ., data = iris[idx, ], method = "random_forest")
    em_evaluate(m, iris[-idx, ])
    em_evaluate(m, iris[-idx, ], metrics = c("accuracy", "f1"))

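
How the positive class affects binary precision, recall, and F1 can be shown with a short base R sketch. The helper binary_metrics is hypothetical and uses the standard textbook definitions; it is not taken from the package source, and the labels are made up.

```r
# Hypothetical helper: binary metrics with a configurable positive class.
binary_metrics <- function(actual, predicted, positive = NULL) {
  actual <- factor(actual)
  # Mirror the documented default: the second factor level is "positive".
  if (is.null(positive)) positive <- levels(actual)[2]
  a <- as.character(actual)
  p <- as.character(predicted)
  tp <- sum(p == positive & a == positive)  # true positives
  fp <- sum(p == positive & a != positive)  # false positives
  fn <- sum(p != positive & a == positive)  # false negatives
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  c(accuracy = mean(p == a),
    precision = precision,
    recall = recall,
    f1 = 2 * precision * recall / (precision + recall))
}

actual <- factor(c("no", "no", "no", "no", "yes", "yes", "yes", "yes"))
predicted <- c("no", "no", "yes", "no", "yes", "yes", "no", "yes")

m_yes <- binary_metrics(actual, predicted)                  # positive = "yes"
m_no <- binary_metrics(actual, predicted, positive = "no")  # roles swapped
print(m_yes)
print(m_no)
```

Accuracy is unaffected by the choice of positive class, while precision, recall, and F1 generally change when the roles are swapped (they happen to coincide in this symmetric toy example).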
Unified entry point for training any supported ensemble algorithm.
All fitted objects share a consistent structure enabling seamless use
with em_predict(), em_evaluate(), em_tune(), and em_compare().

    em_fit(
      formula, data,
      method = "random_forest", task = NULL, weights = NULL,
      verbose = FALSE, ...
    )

| Argument | Description |
|---|---|
| formula | A formula. |
| data | A data.frame. |
| method | Character. One of the supported algorithm names, e.g. "random_forest" (the default). |
| task | "classification" or "regression". Default NULL (detected automatically). |
| weights | Optional non-negative numeric vector of observation weights (length nrow(data)). Default NULL. |
| verbose | Logical. Print a model summary after fitting? Default FALSE. |
| ... | Algorithm-specific hyperparameters forwarded to the underlying engine (e.g. ntree and mtry for Random Forest). |

An ensembleML_model S3 object with named fields:
model: Raw fitted object from the underlying engine.
method: Algorithm name.
task: "classification" or "regression".
formula: The formula used.
feature_names: Character vector of predictor names.
response_name: Name of the response variable.
levels: Factor levels of the response (classification only; NULL otherwise).
n_train: Number of training rows.
call: The original function call.
fit_time: Training wall-clock time (seconds).
train_metrics: In-sample metrics; use for sanity checks only.
em_predict(), em_evaluate(), em_tune(), em_compare(),
em_cv(), em_importance()

    data(iris)
    m <- em_fit(Species ~ ., data = iris, method = "random_forest",
                verbose = TRUE)
    summary(m)

Extracts and optionally plots variable importance scores from a fitted ensemble model. The interpretation of importance varies by algorithm (e.g. mean decrease in impurity for Random Forest, gain for XGBoost).

    em_importance(
      object,
      top_n = NULL, plot = TRUE, normalise = TRUE,
      type = "MeanDecreaseGini"
    )

| Argument | Description |
|---|---|
| object | An ensembleML_model object. |
| top_n | Integer. Return/show only the top-n features. Default NULL. |
| plot | Logical. Print a horizontal bar chart? Default TRUE. |
| normalise | Logical. Scale scores to sum to 100%? Default TRUE. |
| type | Character. Importance measure for Random Forest. Default "MeanDecreaseGini". |

A data.frame with columns feature and importance (invisibly
when plot = TRUE).

    data(iris)
    m <- em_fit(Species ~ ., data = iris, method = "random_forest")
    imp <- em_importance(m, top_n = 4)

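
The effect of normalise and top_n can be illustrated with a few lines of base R. The helper normalise_importance is hypothetical, the raw scores are made-up numbers rather than real model output, and the sketch assumes scores are rescaled over all features before the top-n cut is applied; the package may order these steps differently.

```r
# Hypothetical helper: rescale raw importance scores to sum to 100,
# sort in decreasing order, and optionally keep the top-n features.
normalise_importance <- function(scores, top_n = NULL) {
  imp <- data.frame(feature = names(scores),
                    importance = 100 * as.numeric(scores) / sum(scores),
                    stringsAsFactors = FALSE)
  imp <- imp[order(imp$importance, decreasing = TRUE), ]
  if (!is.null(top_n)) imp <- head(imp, top_n)
  rownames(imp) <- NULL
  imp
}

# Made-up raw scores (illustration only, not real Gini decreases).
raw <- c(Sepal.Length = 9.5, Sepal.Width = 2.3,
         Petal.Length = 43.1, Petal.Width = 44.2)

imp_all <- normalise_importance(raw)
imp_top <- normalise_importance(raw, top_n = 2L)
print(imp_all)
print(imp_top)
```

Normalised scores are easier to compare across algorithms whose raw importance measures live on different scales, though the underlying quantities still differ in meaning.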
Shows the marginal effect of a single predictor on the model output, averaging over the joint distribution of all other predictors. Helps understand non-linear effects and is model-agnostic.

    em_partial(object, data, feature, n_grid = 30L, class = NULL)

| Argument | Description |
|---|---|
| object | An ensembleML_model object. |
| data | The training (or any reference) data.frame. |
| feature | Character. Name of the predictor to vary. |
| n_grid | Integer. Number of equally-spaced grid points for a numeric predictor. Default 30L. |
| class | Character. For multi-class classification, which class probability to plot? Defaults to the first class level. |

A data.frame of grid values and mean predicted response (invisibly).

    data(iris)
    m <- em_fit(Species ~ ., data = iris, method = "random_forest")
    em_partial(m, iris, feature = "Petal.Length")

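
The averaging step behind a partial dependence curve can be sketched in base R. The helper partial_dependence is hypothetical and works with any model that has a predict() method returning a numeric vector; a linear model on mtcars stands in for an ensemble so that the result is easy to check. For classification, the same idea would average a class probability instead.

```r
# Hypothetical helper: partial dependence of one numeric predictor.
partial_dependence <- function(model, data, feature, n_grid = 30L) {
  grid <- seq(min(data[[feature]]), max(data[[feature]]), length.out = n_grid)
  avg <- vapply(grid, function(value) {
    modified <- data
    modified[[feature]] <- value  # fix the feature at one grid value for all rows
    mean(predict(model, newdata = modified))  # average over the other predictors
  }, numeric(1))
  data.frame(value = grid, mean_prediction = avg)
}

fit <- lm(mpg ~ wt + hp, data = mtcars)
pd <- partial_dependence(fit, mtcars, feature = "wt", n_grid = 30L)
head(pd)

# For an additive linear model the curve is a straight line whose slope
# equals the fitted coefficient of the varied predictor.
pd_slope <- unname(coef(lm(mean_prediction ~ value, data = pd))[2])
```

With a tree ensemble in place of lm(), the same loop would trace out a step-like, possibly non-linear curve.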
Visualise the distribution of a metric across all CV folds, with a reference line at the mean. Useful for assessing model stability.

    em_plot_cv(cv_result, metric = NULL)

| Argument | Description |
|---|---|
| cv_result | Output from em_cv(). |
| metric | Character. Metric to plot. Defaults to first available metric. |

A ggplot object (invisibly).
Generate predictions from a fitted ensembleML_model. Output type is
consistent regardless of the underlying algorithm.

    em_predict(object, newdata, type = NULL, ...)

| Argument | Description |
|---|---|
| object | An ensembleML_model object. |
| newdata | A data.frame. |
| type | Character. "class" or "prob" for classification. Default NULL. |
| ... | Currently unused. |

For type = "class": a factor. For type = "prob": a numeric
matrix with one column per class. For regression: a numeric vector.

    data(iris)
    m <- em_fit(Species ~ ., data = iris, method = "random_forest")
    preds <- em_predict(m, iris[1:10, ])
    probs <- em_predict(m, iris[1:10, ], type = "prob")

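
The two classification output shapes are related in a simple way: a class prediction can be derived from a probability matrix by picking the most probable column in each row. The base R sketch below uses a made-up probability matrix and assumes the columns are named after the class levels; it does not call em_predict().

```r
# Made-up class probabilities: one row per observation, one column per class.
probs <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.3, 0.6,
    0.2, 0.5, 0.3),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("setosa", "versicolor", "virginica"))
)

# Most probable class per row; ties go to the first column.
best_col <- max.col(probs, ties.method = "first")
pred_class <- factor(colnames(probs)[best_col], levels = colnames(probs))

print(pred_class)
```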
Produces a 2-panel diagnostic plot: (1) residuals vs fitted values, and (2) a QQ plot of residuals. Useful for detecting heteroscedasticity and departures from normality.

    em_residuals(object, newdata)

| Argument | Description |
|---|---|
| object | An ensembleML_model object. |
| newdata | A data.frame. |

A data.frame of fitted values and residuals (invisibly).

    set.seed(1)
    x <- rnorm(200)
    # y depends on x, so the model has a signal to learn
    d <- data.frame(x = x, y = 3 + 2 * x + rnorm(200))
    m <- em_fit(y ~ x, data = d[1:160, ], method = "random_forest")
    em_residuals(m, d[161:200, ])

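
The two diagnostic panels can be reproduced with base graphics for any regression model. The sketch below uses lm() as a stand-in so it runs without the package; the variable names are illustrative and the layout is an assumption, not a copy of what em_residuals() draws.

```r
set.seed(1)
x <- rnorm(200)
d <- data.frame(x = x, y = 3 + 2 * x + rnorm(200))

fit <- lm(y ~ x, data = d[1:160, ])  # stand-in for a fitted ensemble
holdout <- d[161:200, ]

# Held-out fitted values and residuals (observed minus predicted).
diag_df <- data.frame(fitted = predict(fit, newdata = holdout))
diag_df$residual <- holdout$y - diag_df$fitted

old_par <- par(mfrow = c(1, 2))
# Panel 1: a funnel shape here would suggest heteroscedasticity.
plot(diag_df$fitted, diag_df$residual,
     xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs fitted")
abline(h = 0, lty = 2)
# Panel 2: points far from the line indicate non-normal residuals.
qqnorm(diag_df$residual, main = "Normal QQ plot")
qqline(diag_df$residual)
par(old_par)
```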
Performs an exhaustive grid search over a named list of hyperparameter values, using k-fold cross-validation to select the best configuration. After selection the best model is refit on the full dataset.

    em_tune(
      formula, data,
      method = "random_forest", param_grid = list(),
      task = NULL, metric = NULL,
      cv_folds = 5L, seed = 42L, verbose = TRUE
    )

| Argument | Description |
|---|---|
| formula | A formula. |
| data | A data.frame. |
| method | Algorithm name (see em_fit()). Default "random_forest". |
| param_grid | A named list of hyperparameter values to search over. Default list(). |
| task | "classification" or "regression". Default NULL (detected automatically). |
| metric | Optimisation criterion. Default NULL. |
| cv_folds | Integer. Number of CV folds. Default 5L. |
| seed | Integer. Random seed. Default 42L. |
| verbose | Logical. Print progress? Default TRUE. |

A list with:
best_params: Named list of the best hyperparameters found.
best_score: Cross-validated score for the best configuration.
metric: The metric that was optimised.
results: data.frame of all configurations sorted by score.
best_model: ensembleML_model refit on the full dataset.

    data(iris)
    tuned <- em_tune(
      Species ~ ., data = iris, method = "random_forest",
      param_grid = list(ntree = c(100, 300), mtry = c(1, 2, 3))
    )
    tuned$best_params
    tuned$best_score
    tuned$results
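
The search procedure described for em_tune() (expand the grid, score each configuration by k-fold cross-validation, refit the winner on all data) can be sketched in base R. Everything named here is hypothetical: grid_search_cv, knn_fit, and knn_accuracy are illustrative helpers, a hand-written k-nearest-neighbour classifier stands in for an ensemble, and the score is assumed to be one where higher is better.

```r
# Hypothetical sketch: exhaustive grid search scored by k-fold CV.
grid_search_cv <- function(data, param_grid, fit_fun, score_fun,
                           cv_folds = 5L, seed = 42L) {
  grid <- expand.grid(param_grid, stringsAsFactors = FALSE)  # all combinations
  set.seed(seed)
  fold_id <- sample(rep(seq_len(cv_folds), length.out = nrow(data)))
  score <- vapply(seq_len(nrow(grid)), function(i) {
    params <- as.list(grid[i, , drop = FALSE])
    mean(vapply(seq_len(cv_folds), function(k) {
      fit <- do.call(fit_fun, c(list(data[fold_id != k, ]), params))
      score_fun(fit, data[fold_id == k, ])
    }, numeric(1)))
  }, numeric(1))
  results <- cbind(grid, score = score)
  results <- results[order(results$score, decreasing = TRUE), ]
  rownames(results) <- NULL
  best_params <- as.list(results[1, names(param_grid), drop = FALSE])
  list(best_params = best_params, best_score = results$score[1],
       results = results,
       # Refit the winning configuration on the full dataset.
       best_model = do.call(fit_fun, c(list(data), best_params)))
}

# Stand-in learner: k-NN on the four iris measurements.
knn_fit <- function(train, k, standardise) {
  list(train = train, k = k, standardise = standardise)
}
knn_accuracy <- function(fit, test) {
  x_train <- as.matrix(fit$train[, 1:4])
  x_test <- as.matrix(test[, 1:4])
  if (fit$standardise) {
    s <- apply(x_train, 2, sd)  # scale by training-set SDs only
    x_train <- sweep(x_train, 2, s, "/")
    x_test <- sweep(x_test, 2, s, "/")
  }
  pred <- vapply(seq_len(nrow(x_test)), function(i) {
    d <- colSums((t(x_train) - x_test[i, ])^2)  # squared distances
    votes <- table(fit$train$Species[order(d)[seq_len(fit$k)]])
    names(votes)[which.max(votes)]
  }, character(1))
  mean(pred == as.character(test$Species))
}

tuned_sketch <- grid_search_cv(
  iris,
  param_grid = list(k = c(1, 5, 9), standardise = c(TRUE, FALSE)),
  fit_fun = knn_fit, score_fun = knn_accuracy
)
print(tuned_sketch$results)
```

Because the same folds are reused for every configuration, differences in score reflect the hyperparameters rather than the random split. The cost grows with the product of the grid sizes, so a coarse grid is usually a sensible starting point.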