| Title: | Centroid Decision Forest for High-Dimensional Classification |
|---|---|
| Description: | Implements the Centroid Decision Forest (CDF) as a single user-facing function CDF(). The method selects discriminative features via a multi-class class separability score (CSS), splits by nearest class centroid, and aggregates tree votes to produce predictions and class probabilities. Returns CSS-based feature importance as well. Amjad Ali, Saeed Aldahmani, Zardad Khan (2025) <doi:10.48550/arXiv.2503.19306>. |
| Authors: | Amjad Ali [aut, cre], Saeed Aldahmani [aut], Zardad Khan [aut] |
| Maintainer: | Amjad Ali <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-05-09 06:29:01 UTC |
| Source: | https://github.com/cran/CDF |
Trains an ensemble of centroid-splitting trees and predicts classes for new data.
Each node selects the top-k features by a multi-class class separability score (CSS)
and splits observations by nearest class centroid; the votes of all trees are then aggregated.
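The nearest-centroid split described above can be illustrated with a minimal base R sketch. The helper `nearest_centroid_split()` is hypothetical and is not the package's internal code. It assumes Euclidean distance and skips the CSS feature-selection step, whose formula is not given on this page.

```r
# Hypothetical helper, not part of the CDF package: route observations to the
# class whose centroid (per-feature mean) is nearest in Euclidean distance.
nearest_centroid_split <- function(x, y, newx = x) {
  x <- as.matrix(x)
  newx <- as.matrix(newx)
  y <- droplevels(factor(y))
  classes <- levels(y)

  # Class centroids: one row per class, one column per feature
  centroids <- do.call(rbind, lapply(classes, function(cl) {
    colMeans(x[y == cl, , drop = FALSE])
  }))

  # Squared Euclidean distance from each row of newx to each centroid
  d2 <- vapply(seq_along(classes), function(j) {
    rowSums(sweep(newx, 2, centroids[j, ])^2)
  }, numeric(nrow(newx)))
  d2 <- matrix(d2, nrow = nrow(newx))

  # Each observation goes to the branch of its nearest centroid
  factor(classes[max.col(-d2, ties.method = "first")], levels = classes)
}

# Toy node with two classes and two features (made-up values)
x <- rbind(c(0, 0), c(0, 1), c(5, 5), c(6, 5))
y <- c("H", "H", "P", "P")
branch <- nearest_centroid_split(x, y, newx = rbind(c(1, 1), c(4, 6)))
as.character(branch)
```

In this toy node the centroids are (0, 0.5) for H and (5.5, 5) for P, so the first new point is routed to the H branch and the second to the P branch.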
CDF(xtrain, ytrain, xtest, ntrees = 500, depth = 3, mnode = 3, k = round(2 * log(ncol(xtrain))), mtry = round(0.2 * ncol(xtrain)), seed = NULL)
xtrain |
Numeric matrix or data frame of training predictors. |
ytrain |
Factor or character vector of class labels (length = nrow(xtrain)). |
xtest |
Numeric matrix or data frame of test predictors. |
ntrees |
Integer. Number of trees (default 500). |
depth |
Integer. Maximum tree depth (default 3). |
mnode |
Integer. Minimum node size to split (default 3). |
k |
Integer. Top-k features selected per node by CSS (default round(2 * log(ncol(xtrain)))). |
mtry |
Integer. Candidate features per node (default round(0.2 * ncol(xtrain))). |
seed |
Optional integer seed for reproducibility. |
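To see what the default expressions for `k` and `mtry` in the usage line evaluate to, the short sketch below applies them to a given number of predictors `p`. The helper name `cdf_defaults()` is hypothetical and is not exported by the package.

```r
# Hypothetical helper mirroring the default expressions in the CDF() signature
cdf_defaults <- function(p) {
  c(k = round(2 * log(p)), mtry = round(0.2 * p))
}

# With 450 predictors: k = 12, mtry = 90
cdf_defaults(450)

# With 10 predictors: k = 5, mtry = 2
cdf_defaults(10)
```

For small `p` the default `k` can exceed the default `mtry`. This page does not say how CDF() handles that case, so setting both explicitly may be the safer choice on low-dimensional data.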
A list with:
predictions |
Character vector of predicted classes for xtest. |
probabilities |
Numeric matrix of class probabilities (columns are classes). |
feature_importance |
Named numeric vector of normalized CSS importances. |
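The sketch below shows one way per-tree votes could be turned into a list with `predictions` and `probabilities` of the shapes described above. It assumes that class probabilities are simply the fraction of trees voting for each class. That is a common convention, but this page does not confirm it is what CDF() does. The helper `aggregate_votes()` and the vote matrix are hypothetical.

```r
# Hypothetical helper: 'votes' is an n_test x n_trees character matrix holding
# the class label predicted by each tree for each test observation.
aggregate_votes <- function(votes, classes) {
  # Fraction of trees voting for each class (rows: observations, columns: classes)
  probabilities <- sapply(classes, function(cl) rowMeans(votes == cl))
  probabilities <- matrix(probabilities, nrow = nrow(votes),
                          dimnames = list(NULL, classes))
  # Majority vote; ties go to the first class in 'classes'
  predictions <- classes[max.col(probabilities, ties.method = "first")]
  list(predictions = predictions, probabilities = probabilities)
}

# Two test observations, four trees (made-up votes)
votes <- rbind(c("P", "P", "H", "P"),
               c("H", "H", "H", "P"))
res <- aggregate_votes(votes, classes = c("H", "P"))
res$predictions
res$probabilities
```

Each row of `res$probabilities` sums to 1, and the predicted class is the column with the largest vote share.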
Amjad Ali, Saeed Aldahmani, Zardad Khan
Ali, A., Khan, Z., and Aldahmani, S. (2025). Centroid Decision Forest. arXiv:2503.19306.
    data(DARWIN)
    set.seed(2025)
    n <- nrow(DARWIN)
    p <- ncol(DARWIN)

    # Split the data into training (70%) and test (30%) sets
    tr <- sample(seq_len(n), floor(0.7 * n))
    te <- setdiff(seq_len(n), tr)

    # Prepare training and test matrices
    Xtr <- as.matrix(DARWIN[tr, 1:(p - 1), drop = FALSE])
    ytr <- DARWIN$Y[tr]
    Xte <- as.matrix(DARWIN[te, 1:(p - 1), drop = FALSE])
    yte <- DARWIN$Y[te]

    # Fit the CDF model
    FitCDF <- CDF(Xtr, ytr, Xte, ntrees = 100, seed = 2025)

    # Compute classification accuracy
    mean(FitCDF$predictions == yte)

    # Predicted classes for the test data
    FitCDF$predictions

    # Predicted class probabilities for the test data
    FitCDF$probabilities

    # Top 10 most important features
    order(FitCDF$feature_importance, decreasing = TRUE)[1:10]
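Beyond overall accuracy, a confusion matrix and a ranking of features by name are often useful. The sketch below uses a stand-in list with the same element names as the documented return value, so it runs without the package. All values in it are made up for illustration and are not results from CDF().

```r
# Stand-in for the list returned by CDF(); the values are invented for illustration
fit <- list(
  predictions = c("P", "H", "P", "H", "P"),
  feature_importance = c(f1 = 0.10, f2 = 0.55, f3 = 0.35)
)
yte <- factor(c("P", "H", "H", "H", "P"), levels = c("H", "P"))

# Confusion matrix: rows are predicted classes, columns are true classes
conf_mat <- table(Predicted = factor(fit$predictions, levels = levels(yte)),
                  Actual = yte)
conf_mat

# Accuracy from the diagonal of the confusion matrix
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy

# Feature names ranked by importance; order() would return positions instead
top_features <- names(sort(fit$feature_importance, decreasing = TRUE))
top_features
```

With a real fit, the same calls apply with `FitCDF` in place of `fit`, assuming `feature_importance` carries feature names as documented.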
Handwriting data from 174 participants for a binary classification task:
distinguishing Alzheimer's disease patients (P) from healthy controls (H).
The feature matrix contains 450 numeric handwriting-derived measures; the last
column Y is the class label with levels P and H.
data(DARWIN)
A data.frame with 174 rows and 451 columns:
Numeric handwriting features (predictors).
Y |
Factor, class label with two levels: P (patients), H (healthy). |
Purpose: Research on predicting Alzheimer's via handwriting analysis.
Instances: 174
Features (X): 450 (numeric)
Target (Y): P/H
Missing values: None
OpenML dataset ID 46606. https://www.openml.org/search?type=data&status=any&id=46606&sort=runs
    data(DARWIN)
    dim(DARWIN)
    table(DARWIN$Y)