| Title: | Unsupervised Feature Selection using the Heterogeneous Correlation Matrix |
|---|---|
| Description: | Unsupervised multivariate filter feature selection using the UFS-rHCM or UFS-cHCM algorithms based on the heterogeneous correlation matrix (HCM). The HCM consists of Pearson's correlations between numerical features, polyserial correlations between numerical and ordinal features, and polychoric correlations between ordinal features. Tortora C., Madhvani S., Punzo A. (2025). "Designing unsupervised mixed-type feature selection techniques using the heterogeneous correlation matrix." International Statistical Review <doi:10.1111/insr.70016>. This work was supported by the National Science Foundation (NSF) grant no. 2209974 (Tortora) and by the Italian Ministry of University and Research (MUR) under the PRIN 2022 grant number 2022XRHT8R (CUP: E53D23005950006), as part of ‘The SMILE Project: Statistical Modelling and Inference to Live the Environment’, funded by the European Union – Next Generation EU (Punzo). |
| Authors: | Cristina Tortora [aut, cre, fnd], Antonio Punzo [aut], Shaam Madhvani [aut] |
| Maintainer: | Cristina Tortora <[email protected]> |
| License: | GPL-2 |
| Version: | 1.0.1 |
| Built: | 2026-05-23 07:55:07 UTC |
| Source: | https://github.com/cran/hetcorFS |
The Employee Satisfaction Index (ESI) data set, from Kaggle (Harris, 2023), is a fictional data set that measures employee satisfaction.
data(ESI)
A data frame with 500 rows and 10 features.
label.
continuous from 23 to 45.
categorical.
binary.
binary.
categorical.
ordinal from 1 to 5.
ordinal from 1 to 5.
binary.
number of awards 0-9.
binary.
continuous from 24.1 to 86.8.
binary.
Harris, M. (2023). Employee Satisfaction Index Dataset. Evanston, Illinois: Kaggle. Version 1
Displays retained features for different values of alpha in a bar plot.
FS_barplot(data = NULL, grid.alpha = seq(0.01, 0.99, by = 0.01), missing = FALSE, pv_adj = "none", smooth.tol = 10^-12, method = "c")
data |
A data frame. Values of type 'numeric' or 'integer' are treated as numerical. |
grid.alpha |
A vector of alpha values to be plotted, default = seq(0.01,0.99,by=0.01). |
missing |
Logical. If FALSE (default), pairwise complete observations are used; set to TRUE for complete-case deletion. |
pv_adj |
Correction method for p-value, "none" by default. For options see p.adjust. |
smooth.tol |
Minimum acceptable eigenvalue for the smoothing, default 10^-12. |
method |
Algorithm used: "c" (cell-wise, UFS-cHCM; default) or "r" (row-wise, UFS-rHCM). |
Displays a bar plot showing which features are selected at each value of alpha (multiplied by 100) and returns a list with elements:
survivors |
Vector giving the number of alpha values for which each feature is selected |
data_names |
Vector of the corresponding feature names |
Tortora C., Madhvani S., Punzo A. (2025). Designing unsupervised mixed-type feature selection techniques using the heterogeneous correlation matrix. International Statistical Review. https://doi.org/10.1111/insr.70016
data(ESI)
data = ESI[, -c(1, 3, 4, 6, 9)]  ## removing categorical features
FS_barplot(data, pv_adj = 'BH')  # using BH adjustment for the p-values
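The pv_adj argument takes its options from p.adjust. The following base-R sketch, which does not call hetcorFS at all, shows what some of those options do to a vector of p-values. The p-values are made up for illustration and are not output of FS_barplot.

```r
# Hypothetical raw p-values, made up for illustration only.
p_raw <- c(0.01, 0.02, 0.03, 0.04)

# Method names accepted by stats::p.adjust (and hence, per the
# argument description above, candidate values for pv_adj).
print(p.adjust.methods)

# Benjamini-Hochberg adjustment (controls the false discovery rate).
p_bh <- p.adjust(p_raw, method = "BH")

# Bonferroni adjustment (controls the family-wise error rate).
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# "none" leaves the p-values unchanged.
p_none <- p.adjust(p_raw, method = "none")

print(rbind(raw = p_raw, BH = p_bh, bonferroni = p_bonf, none = p_none))
```

Adjusted p-values are never smaller than the raw ones, so a stricter adjustment means fewer correlations are declared significant at a given alpha.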
Extends the traditional correlation matrix (between numerical data) to also include binary and ordinal categorical data, and computes the p-values for the tests of zero correlation.
HCPM(data = NULL)
data |
A data frame. Values of type 'numeric' or 'integer' are treated as numerical. |
A list with elements:
cor_mat |
An |
p_value |
An |
Tortora C., Madhvani S., Punzo A. (2025). Designing unsupervised mixed-type feature selection techniques using the heterogeneous correlation matrix. International Statistical Review. https://doi.org/10.1111/insr.70016
data(ESI)
data = ESI[, -c(1, 3, 4, 6, 9)]  ## removing categorical features
HCPM(data)
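To illustrate the shape of such an output, here is a minimal base-R sketch for the purely numerical case, where every entry is a Pearson correlation and each p-value comes from cor.test. It is not the HCPM implementation: the helper name pearson_cor_pvalues is hypothetical, polyserial and polychoric correlations are not handled, and the built-in mtcars data stand in for ESI.

```r
# Hypothetical helper: Pearson correlations and p-values for the tests of
# zero correlation, for numerical columns only (needs at least two columns).
pearson_cor_pvalues <- function(X) {
  X <- as.data.frame(X)
  p <- ncol(X)
  cor_mat <- diag(1, p)
  p_value <- matrix(NA_real_, p, p)  # diagonal left as NA: no test there
  dimnames(cor_mat) <- dimnames(p_value) <- list(names(X), names(X))
  for (i in seq_len(p - 1)) {
    for (j in (i + 1):p) {
      ct <- cor.test(X[[i]], X[[j]])  # Pearson test by default
      cor_mat[i, j] <- cor_mat[j, i] <- unname(ct$estimate)
      p_value[i, j] <- p_value[j, i] <- ct$p.value
    }
  }
  list(cor_mat = cor_mat, p_value = p_value)
}

num_data <- mtcars[, c("mpg", "wt", "hp")]
res_sketch <- pearson_cor_pvalues(num_data)
print(round(res_sketch$cor_mat, 3))
print(signif(res_sketch$p_value, 3))
```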
Computes the Jaccard index using Gower's dissimilarity.
JaccardRate(data, data_red, k = 6)
data |
A data frame. Values of type 'numeric' or 'integer' are treated as numerical. |
data_red |
A data frame. A subset of data with the selected features. |
k |
Number of nearest neighbors, default = 6. |
Jaccard Index |
numeric |
Zhao, Z., L. Wang, and H. Liu (2010). Efficient spectral feature selection with minimum redundancy. In Proceedings of the AAAI conference on artificial intelligence, Volume 24, pp. 673–678.
data(ESI)
data = ESI[, -c(1, 3, 4, 6, 9)]  ## removing categorical features
out = UFS(data, alpha = 0.01, method = 'c', pv_adj = 'BH')
JR = JaccardRate(data, out$selected.features)
JR  # visualize the index
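As a rough illustration of the idea behind a neighborhood-based Jaccard index (Zhao et al., 2010), the base-R sketch below compares each observation's k nearest neighbors in the full data with its k nearest neighbors in the reduced data, then averages the Jaccard similarity of the two neighbor sets. It is an assumption-laden sketch, not the JaccardRate implementation: the helper names are hypothetical, Gower's dissimilarity is implemented for numerical columns only (range-scaled absolute differences averaged over columns), and the iris data stand in for ESI.

```r
# Hypothetical helper: Gower's dissimilarity restricted to numerical columns.
gower_numeric <- function(X) {
  X <- as.matrix(X)
  rng <- apply(X, 2, function(col) diff(range(col)))
  rng[rng == 0] <- 1  # avoid division by zero for constant columns
  Xs <- sweep(X, 2, rng, "/")
  as.matrix(dist(Xs, method = "manhattan")) / ncol(X)
}

# Hypothetical helper: indices of the k nearest neighbors of each row.
knn_sets <- function(D, k) {
  lapply(seq_len(nrow(D)), function(i) {
    d <- D[i, ]
    d[i] <- Inf  # exclude the observation itself
    order(d)[seq_len(k)]
  })
}

# Hypothetical helper: average Jaccard similarity of neighbor sets.
jaccard_sketch <- function(data, data_red, k = 6) {
  nb_full <- knn_sets(gower_numeric(data), k)
  nb_red <- knn_sets(gower_numeric(data_red), k)
  mean(mapply(
    function(a, b) length(intersect(a, b)) / length(union(a, b)),
    nb_full, nb_red
  ))
}

full <- iris[, 1:4]
reduced <- iris[, c(1, 3)]  # an arbitrary subset, chosen for illustration
jr_same <- jaccard_sketch(full, full)
jr_red <- jaccard_sketch(full, reduced)
print(c(same = jr_same, reduced = jr_red))
```

Values close to 1 indicate that the reduced data preserve the local neighborhood structure of the full data.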
Computes the Redundancy Rate using the heterogeneous correlation matrix.
RedRate(data_red)
data_red |
A data frame. A subset of data with the selected features. |
Redundancy Rate |
numeric |
Zhao, Z., L. Wang, and H. Liu (2010). Efficient spectral feature selection with minimum redundancy. In Proceedings of the AAAI conference on artificial intelligence, Volume 24, pp. 673–678.
data(ESI)
data = ESI[, -c(1, 3, 4, 6, 9)]  ## removing categorical features
out = UFS(data, alpha = 0.01, method = 'c', pv_adj = 'BH')
RR = RedRate(out$selected.features)
RR  # visualize the index
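A redundancy measure of this kind summarizes how strongly the selected features are correlated with one another. The base-R sketch below computes one simple version, the average absolute pairwise Pearson correlation, for numerical features. The helper name red_rate_sketch, the use of absolute values, and the normalization are all assumptions made for illustration; RedRate itself works on the heterogeneous correlation matrix and may scale the result differently.

```r
# Hypothetical helper: mean absolute off-diagonal Pearson correlation.
red_rate_sketch <- function(X) {
  R <- abs(cor(X))
  mean(R[upper.tri(R)])
}

# Two perfectly correlated columns: maximal redundancy under this sketch.
dup <- cbind(a = 1:10, b = 2 * (1:10))
rr_dup <- red_rate_sketch(dup)

# A subset of mtcars, standing in for a set of selected features.
rr_cars <- red_rate_sketch(mtcars[, c("mpg", "wt", "qsec")])

print(c(duplicated = rr_dup, mtcars = rr_cars))
```

Lower values suggest that the selected features carry less overlapping information.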
Performs unsupervised feature selection for mixed-type data. Both algorithms (UFS-rHCM and UFS-cHCM) are based on the heterogeneous correlation matrix.
UFS(data = NULL, alpha = 0.05, missing = FALSE, pv_adj = "none", smooth.tol = 10^-12, method = "c")
data |
A data frame. Values of type 'numeric' or 'integer' are treated as numerical, factors as ordinal categorical. |
alpha |
Significance level to be used for testing, default = 0.05. |
missing |
Logical. If FALSE (default), pairwise complete observations are used; set to TRUE for complete-case deletion. |
pv_adj |
Correction method for p-value, "none" by default. For options see p.adjust. |
smooth.tol |
Minimum acceptable eigenvalue for the smoothing, default = 10^-12. |
method |
Algorithm used: "c" (cell-wise, UFS-cHCM; default) or "r" (row-wise, UFS-rHCM). |
A list with elements:
rearranged.data.set |
Original data frame with numerical features first |
selected.features |
A data frame of the selected features |
feature.indices |
The indices of the selected features from the original data frame |
original.corr.matrix |
The |
corr.matrix |
The |
original.p.value.matrix |
The |
p.value.matrix |
The |
Tortora C., Madhvani S., Punzo A. (2025). Designing unsupervised mixed-type feature selection techniques using the heterogeneous correlation matrix. International Statistical Review. https://doi.org/10.1111/insr.70016
data(ESI)  # loading the data
data = ESI[, -c(1, 3, 4, 6, 9)]  ## removing categorical features
res = UFS(data)
colnames(res$selected.features)  # visualize selected features