Title: | Robust Aggregative Feature Selection |
---|---|
Description: | A cross-validated minimal-optimal feature selection algorithm. It utilises popularity counting, hierarchical clustering with feature dissimilarity measures, and prefiltering with an all-relevant feature selection method to obtain the minimal-optimal set of features. |
Authors: | Radosław Piliszek [aut, cre], Witold Remigiusz Rudnicki [ths, aut] |
Maintainer: | Radosław Piliszek <[email protected]> |
License: | GPL-3 |
Version: | 0.2.4 |
Built: | 2024-11-06 09:23:14 UTC |
Source: | CRAN |
To be used in run_rafs.
builtin_dist_funs
An object of class list of length 5.
See also default_dist_funs.
This is a secondary function, useful when experimenting with different feature selection filters and rankings.
Its output is used in run_rafs_with_fs_results and it is called for the user in run_rafs.
compute_fs_results(data, decision, k, seeds, fs_fun = default_fs_fun)
data |
input data where columns are variables and rows are observations (all numeric) |
decision |
decision variable as a binary sequence of length equal to number of observations |
k |
number of folds for internal cross validation |
seeds |
a vector of seeds used for fold generation for internal cross validation |
fs_fun |
function to compute feature selection p-values, it must have the same signature as |
A list with feature selection results, e.g. from default_fs_fun.
library(MDFS)
mdfs_omp_set_num_threads(1) # only to pass CRAN checks
data(madelon)
fs_results <- compute_fs_results(madelon$data, madelon$decision, 2, c(12345))
run_rafs_with_fs_results(madelon$data, madelon$decision, fs_results)
To be used as one of the dist_funs in run_rafs.
cor_dist(relevant_train_data, train_decision = NULL, seed = NULL)
relevant_train_data |
input data where columns are variables and rows are observations (all numeric); assumed to contain only relevant data |
train_decision |
decision variable as a binary sequence of length equal to number of observations |
seed |
a numerical seed |
A matrix of distances (dissimilarities).
A utility function used in RAFS but useful also for external cross-validation.
create_seeded_folds(decision, k, seed)
decision |
decision variable as a binary sequence of length equal to number of observations |
k |
number of folds for cross validation |
seed |
a numerical seed |
A vector of folds, each fold being a vector of selected indices.
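The point of seeding the fold generation is that the same seed always reproduces the same split, which is what makes the folds reusable in external cross-validation. The base-R sketch below illustrates that idea only. `make_seeded_folds_sketch` is a hypothetical helper, not the package's implementation, and dealing observations into folds per class is an assumption made for the illustration.

```r
# Hypothetical illustration of seeded fold generation (NOT the package's code).
make_seeded_folds_sketch <- function(decision, k, seed) {
  set.seed(seed)  # same seed -> same folds
  fold_of <- integer(length(decision))
  for (cls in unique(decision)) {
    idx <- which(decision == cls)
    # shuffle the members of this class, then deal them round-robin into k folds
    shuffled <- idx[sample.int(length(idx))]
    fold_of[shuffled] <- rep_len(seq_len(k), length(idx))
  }
  # one vector of observation indices per fold
  unname(split(seq_along(decision), fold_of))
}

decision <- rep(c(0, 1), times = c(6, 4))
folds <- make_seeded_folds_sketch(decision, k = 2, seed = 12345)
folds_again <- make_seeded_folds_sketch(decision, k = 2, seed = 12345)
```

Calling the helper twice with the same seed gives identical folds, so an external loop can regenerate a split instead of storing it.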
As used in run_rafs.
default_dist_funs
An object of class list of length 3.
The default functions compute:
Pearson's correlation (cor: cor_dist),
Variation of Information (vi: vi_dist) and
Symmetric Target Information Gain (stig: stig_dist).
These functions follow a similar protocol to default_fs_fun.
They expect the same input except for the assumption that the data passed in is relevant.
Each of them outputs a matrix of distances (dissimilarities) between features.
See also builtin_dist_funs.
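Since each dissimilarity function takes the relevant training data (plus a decision and a seed it may not need) and returns a feature-by-feature matrix, a user-defined one could look like the sketch below. `spearman_dist` is a hypothetical name, and using 1 - |rho| as the dissimilarity is an assumption made for illustration, not something the package defines.

```r
# Hypothetical dissimilarity function following the protocol described above.
spearman_dist <- function(relevant_train_data, train_decision = NULL, seed = NULL) {
  # rank-based correlation between columns (features); decision and seed are unused
  rho <- stats::cor(relevant_train_data, method = "spearman")
  d <- 1 - abs(rho)  # strongly (anti-)correlated features end up close together
  diag(d) <- 0       # guard against floating-point noise on the diagonal
  d
}

set.seed(1)
x <- matrix(rnorm(60), nrow = 20, ncol = 3)
x[, 3] <- x[, 1] * 2 + 0.01 * rnorm(20)  # feature 3 nearly duplicates feature 1
d <- spearman_dist(x)

# Assumed usage, given that dist_funs is a list of such functions:
# run_rafs(data, decision, dist_funs = list(spearman = spearman_dist))
```

In the toy data, features 1 and 3 are nearly identical, so their dissimilarity is close to zero while the unrelated pair stays far apart.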
See run_rafs for how it is used. Only the train portion of the dataset is to be fed into this function.
default_fs_fun(train_data, train_decision, seed)
train_data |
input data where columns are variables and rows are observations (all numeric) |
train_decision |
decision variable as a binary sequence of length equal to number of observations |
seed |
a numerical seed |
The function MUST use this train_data and MAY ignore the train_decision.
If the function depends on randomness, it MUST use the seed parameter to seed the PRNG.
The function needs to return a list with at least two elements: rel_vars and rel_vars_rank, which are vectors and contain, respectively, the indices of variables considered relevant and the rank for each relevant variable. The function MAY return a list with more elements.
Other examples of sensible functions are included in the tests of this package.
A list with at least two fields: rel_vars and rel_vars_rank, which are vectors and contain, respectively, the indices of variables considered relevant and the rank for each relevant variable.
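The contract above (use train_data, seed any randomness, return rel_vars and rel_vars_rank) can be met by a simple filter. The sketch below uses per-variable Welch t-tests from base R. `ttest_fs_fun`, the 0/1 coding of the decision, the Holm correction, the 0.05 threshold, and the reading of rank 1 as "most significant" are all illustrative assumptions, not part of the package.

```r
# Hypothetical feature selection function obeying the contract above.
ttest_fs_fun <- function(train_data, train_decision, seed) {
  set.seed(seed)  # nothing below is random; seeding is kept to honour the protocol
  p_values <- apply(train_data, 2, function(column) {
    # assumes the decision is coded as 0/1
    stats::t.test(column[train_decision == 1], column[train_decision == 0])$p.value
  })
  adjusted <- stats::p.adjust(p_values, method = "holm")
  rel_vars <- unname(which(adjusted < 0.05))
  list(
    rel_vars = rel_vars,
    # assumed convention: rank 1 = smallest p-value among the relevant variables
    rel_vars_rank = unname(rank(p_values[rel_vars], ties.method = "first")),
    p_values = p_values  # extra elements are allowed
  )
}

set.seed(42)
n <- 40
decision <- rep(c(0, 1), each = n / 2)
data <- matrix(rnorm(n * 5), nrow = n)
data[, 2] <- data[, 2] + 3 * decision  # make variable 2 informative
res <- ttest_fs_fun(data, decision, seed = 1)
```

Such a function would then be passed as the fs_fun argument of run_rafs or compute_fs_results.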
As used in run_rafs to call hclust.
default_hclust_methods
An object of class character of length 4.
This helper function works on results of get_rafs_reps_popcnts to obtain all representatives at the chosen number of clusters.
get_rafs_all_reps_from_popcnts(reps_popcnts, n_clusters)
reps_popcnts |
representatives' popcnts for the chosen variant as obtained from |
n_clusters |
the desired number of clusters |
A vector of all representatives.
library(MDFS)
mdfs_omp_set_num_threads(1) # only to pass CRAN checks
data(madelon)
rafs_results <- run_rafs(madelon$data, madelon$decision, 2, c(12345))
rafs_reps_popcnts <- get_rafs_reps_popcnts(rafs_results, 5)
get_rafs_all_reps_from_popcnts(rafs_reps_popcnts$stig_single, 5)
This function obtains a matrix describing a graph of co-occurrence at each count of clusters (from n_clusters_range) computed over all runs of RAFS.
get_rafs_occurrence_matrix(rafs_results, interesting_reps, n_clusters_range = 2:15)
rafs_results |
RAFS results as obtained from |
interesting_reps |
the interesting representatives to build matrices for (in principle, these need not be representatives, but they usually are) |
n_clusters_range |
range of cluster counts to obtain matrices for |
If a single result over a cluster number range is desired, the selected matrices can be summed.
A nested list with matrices.
The first level is per the RAFS variant (combination of feature dissimilarity function and hclust method).
The second level is per the number of clusters.
The third (and last) level is the co-occurrence matrix.
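Summing the matrices over a range of cluster counts can be done with base R's Reduce. The sketch below uses toy matrices in place of real output. The layout (one matrix per cluster count within a variant) follows the description above, while the names `per_n_clusters`, `"4"`, `"5"`, `"6"` and all values are made up for illustration; how the real list is keyed is not assumed here.

```r
# Toy stand-in for one variant's matrices, one per number of clusters (hypothetical values).
reps <- c("v1", "v2", "v3")
toy <- function(values) matrix(values, nrow = 3, dimnames = list(reps, reps))
per_n_clusters <- list(
  "4" = toy(c(0, 1, 0, 1, 0, 2, 0, 2, 0)),
  "5" = toy(c(0, 2, 1, 2, 0, 0, 1, 0, 0)),
  "6" = toy(c(0, 0, 1, 0, 0, 1, 1, 1, 0))
)

# Element-wise sum of the selected matrices; dimnames are preserved.
summed <- Reduce(`+`, per_n_clusters[c("4", "5", "6")])
```

The result is a single matrix with the same row and column labels, aggregating the counts across the chosen cluster numbers.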
library(MDFS)
mdfs_omp_set_num_threads(1) # only to pass CRAN checks
data(madelon)
rafs_results <- run_rafs(madelon$data, madelon$decision, 2, c(12345))
rafs_reps_popcnts <- get_rafs_reps_popcnts(rafs_results, 5)
rafs_top_reps <- get_rafs_top_reps_from_popcnts(rafs_reps_popcnts$stig_single, 5)
get_rafs_occurrence_matrix(rafs_results, rafs_top_reps, 5)
This function obtains a matrix of representatives describing a graph of co-representation at each count of clusters (from n_clusters_range) computed over all runs of RAFS.
get_rafs_rep_tuples_matrix(rafs_results, interesting_reps, n_clusters_range = 2:15)
rafs_results |
RAFS results as obtained from |
interesting_reps |
the interesting representatives to build matrices for |
n_clusters_range |
range of cluster counts to obtain matrices for |
If a single result over a cluster number range is desired, the selected matrices can be summed.
A nested list with matrices.
The first level is per the RAFS variant (combination of feature dissimilarity function and hclust method).
The second level is per the number of clusters.
The third (and last) level is the co-representation matrix.
library(MDFS)
mdfs_omp_set_num_threads(1) # only to pass CRAN checks
data(madelon)
rafs_results <- run_rafs(madelon$data, madelon$decision, 2, c(12345))
rafs_reps_popcnts <- get_rafs_reps_popcnts(rafs_results, 5)
rafs_top_reps <- get_rafs_top_reps_from_popcnts(rafs_reps_popcnts$stig_single, 5)
get_rafs_rep_tuples_matrix(rafs_results, rafs_top_reps, 5)
This function obtains popularity counts (popcnts) of representatives' tuples present at each count of clusters (from n_clusters_range) computed over all runs of RAFS.
get_rafs_rep_tuples_popcnts(rafs_results, n_clusters_range = 2:15)
rafs_results |
RAFS results as obtained from |
n_clusters_range |
range of cluster counts to obtain popcnts for |
A nested list with popcnts.
The first level is per the RAFS variant (combination of feature dissimilarity function and hclust method).
The second level is per the number of clusters.
The third (and last) level is popcnts per representatives' tuple.
library(MDFS)
mdfs_omp_set_num_threads(1) # only to pass CRAN checks
data(madelon)
rafs_results <- run_rafs(madelon$data, madelon$decision, 2, c(12345))
get_rafs_rep_tuples_popcnts(rafs_results, 2:5)
This function obtains popularity counts (popcnts) of representatives present at each count of clusters (from n_clusters_range) computed over all runs of RAFS.
get_rafs_reps_popcnts(rafs_results, n_clusters_range = 2:15)
rafs_results |
RAFS results as obtained from |
n_clusters_range |
range of cluster counts to obtain popcnts for |
These results might be fed into further helper functions: get_rafs_top_reps_from_popcnts and get_rafs_all_reps_from_popcnts.
A nested list with popcnts.
The first level is per the RAFS variant (combination of feature dissimilarity function and hclust method).
The second level is per the number of clusters.
The third (and last) level is popcnts per representative.
library(MDFS)
mdfs_omp_set_num_threads(1) # only to pass CRAN checks
data(madelon)
rafs_results <- run_rafs(madelon$data, madelon$decision, 2, c(12345))
get_rafs_reps_popcnts(rafs_results, 2:5)
This helper function works on results of get_rafs_rep_tuples_popcnts to obtain the desired number of top (most common) representatives' tuples at the chosen number of clusters.
get_rafs_top_rep_tuples_from_popcnts(rep_tuples_popcnts, n_clusters, n_tuples = 1)
rep_tuples_popcnts |
tuples' popcnts for the chosen variant as obtained from |
n_clusters |
the desired number of clusters |
n_tuples |
the desired number of top tuples |
A list of top tuples (each tuple being a vector of representatives).
library(MDFS)
mdfs_omp_set_num_threads(1) # only to pass CRAN checks
data(madelon)
rafs_results <- run_rafs(madelon$data, madelon$decision, 2, c(12345))
rafs_rep_tuples_popcnts <- get_rafs_rep_tuples_popcnts(rafs_results, 5)
get_rafs_top_rep_tuples_from_popcnts(rafs_rep_tuples_popcnts$stig_single, 5)
This helper function works on results of get_rafs_reps_popcnts to obtain the desired number of top (most common) representatives at the chosen number of clusters.
get_rafs_top_reps_from_popcnts(reps_popcnts, n_clusters, n_reps = n_clusters)
reps_popcnts |
popcnts for the chosen variant as obtained from |
n_clusters |
the desired number of clusters |
n_reps |
the desired number of top representatives |
A vector of top representatives.
library(MDFS)
mdfs_omp_set_num_threads(1) # only to pass CRAN checks
data(madelon)
rafs_results <- run_rafs(madelon$data, madelon$decision, 2, c(12345))
rafs_reps_popcnts <- get_rafs_reps_popcnts(rafs_results, 5)
get_rafs_top_reps_from_popcnts(rafs_reps_popcnts$stig_single, 5)
This function obtains popularity counts (popcnts) of top variables computed over all runs of FS.
get_rafs_tops_popcnts(fs_results, n_top_range = 2:15)
fs_results |
RAFS FS results as obtained from |
n_top_range |
range of top-variable counts to obtain popcnts for |
These results might be fed into further helper functions: get_rafs_top_reps_from_popcnts and get_rafs_all_reps_from_popcnts.
A nested list with popcnts.
The first level is per the number of top variables.
The second (and last) level is popcnts per top variable.
library(MDFS)
mdfs_omp_set_num_threads(1) # only to pass CRAN checks
data(madelon)
fs_results <- compute_fs_results(madelon$data, madelon$decision, 2, c(12345))
get_rafs_tops_popcnts(fs_results, 2:5)
A utility function used in RAFS to generate cross validation run identifiers, thus useful also for external cross-validation.
get_run_id(seed, k, i)
seed |
a numerical seed |
k |
number of folds for cross validation |
i |
current fold number (1 to |
A string with the run identifier.
This is the main function of the RAFS library to run for analysis.
run_rafs(
  data,
  decision,
  k = 5,
  seeds = sample.int(32767, 10),
  fs_fun = default_fs_fun,
  dist_funs = default_dist_funs,
  hclust_methods = default_hclust_methods
)
data |
input data where columns are variables and rows are observations (all numeric) |
decision |
decision variable as a binary sequence of length equal to number of observations |
k |
number of folds for internal cross validation |
seeds |
a vector of seeds used for fold generation for internal cross validation |
fs_fun |
function to compute feature selection p-values, it must have the same signature as |
dist_funs |
a list of feature dissimilarity functions computed over the relevant portion of the training dataset (see the example |
hclust_methods |
a vector of |
Depending on your pipeline, you may also want to check out run_rafs_with_fs_results and compute_fs_results, which this function simply wraps.
The results from this function can be fed into one of the helper functions to analyse them further: get_rafs_reps_popcnts, get_rafs_rep_tuples_popcnts, get_rafs_rep_tuples_matrix and get_rafs_occurrence_matrix.
A nested list with hclust results.
The first level is per the cross validation run.
The second level is per the feature dissimilarity function.
The third (and last) level is per the hclust method.
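Each leaf of this nested list is an hclust result, so standard tools such as cutree apply to it. The base-R sketch below builds a comparable object from a toy dissimilarity matrix rather than from run_rafs output. The feature names and distances are invented, and the sketch assumes the leaves are ordinary stats::hclust objects.

```r
# Toy dissimilarity matrix between four features (hypothetical values).
features <- c("f1", "f2", "f3", "f4")
d <- matrix(
  c(0.00, 0.10, 0.90, 0.80,
    0.10, 0.00, 0.85, 0.90,
    0.90, 0.85, 0.00, 0.20,
    0.80, 0.90, 0.20, 0.00),
  nrow = 4, dimnames = list(features, features)
)

# Hierarchically cluster the features from their dissimilarities.
hc <- stats::hclust(stats::as.dist(d), method = "single")

# Cut the tree into the desired number of clusters.
clusters <- stats::cutree(hc, k = 2)
```

Here f1 and f2 fall into one cluster and f3 and f4 into the other, which is the kind of grouping from which a representative per cluster would be drawn.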
library(MDFS)
mdfs_omp_set_num_threads(1) # only to pass CRAN checks
data(madelon)
run_rafs(madelon$data, madelon$decision, 2, c(12345))
This is a secondary function, useful when experimenting with different feature selection filters and rankings. The output is exactly the same as from run_rafs.
run_rafs_with_fs_results(
  data,
  decision,
  fs_results,
  dist_funs = default_dist_funs,
  hclust_methods = default_hclust_methods
)
data |
input data where columns are variables and rows are observations (all numeric) |
decision |
decision variable as a binary sequence of length equal to number of observations |
fs_results |
output from |
dist_funs |
a list of feature dissimilarity functions computed over the relevant portion of the training dataset (see the example |
hclust_methods |
a vector of |
A nested list with hclust results.
The first level is per the cross validation run.
The second level is per the feature dissimilarity function.
The third (and last) level is per the hclust method.
library(MDFS)
mdfs_omp_set_num_threads(1) # only to pass CRAN checks
data(madelon)
fs_results <- compute_fs_results(madelon$data, madelon$decision, 2, c(12345))
run_rafs_with_fs_results(madelon$data, madelon$decision, fs_results)
To be used as one of the dist_funs in run_rafs.
stig_dist(relevant_train_data, train_decision, seed)
relevant_train_data |
input data where columns are variables and rows are observations (all numeric); assumed to contain only relevant data |
train_decision |
decision variable as a binary sequence of length equal to number of observations |
seed |
a numerical seed |
This function computes the STIG metric directly from the data, maximising it over 30 discretisations.
A matrix of distances (dissimilarities).
To be used as one of the dist_funs in run_rafs.
stig_from_ig_dist(relevant_train_data, train_decision, seed)
relevant_train_data |
input data where columns are variables and rows are observations (all numeric); assumed to contain only relevant data |
train_decision |
decision variable as a binary sequence of length equal to number of observations |
seed |
a numerical seed |
This function computes the STIG metric from single Information Gains (IGs) maximised over 30 discretisations and then summed pair-wise.
This function is similar to stig_dist, but the results differ slightly. In general, we recommend the direct computation (stig_dist).
A matrix of distances (dissimilarities).
To be used as one of the dist_funs in run_rafs.
stig_stable_dist(relevant_train_data, train_decision, seed)
relevant_train_data |
input data where columns are variables and rows are observations (all numeric); assumed to contain only relevant data |
train_decision |
decision variable as a binary sequence of length equal to number of observations |
seed |
a numerical seed |
This function computes the STIG metric directly from the data, maximising it over 30 discretisations, but reusing the common 1D conditional entropy.
A matrix of distances (dissimilarities).
To be used as one of the dist_funs in run_rafs.
vi_dist(relevant_train_data, train_decision = NULL, seed)
relevant_train_data |
input data where columns are variables and rows are observations (all numeric); assumed to contain only relevant data |
train_decision |
decision variable as a binary sequence of length equal to number of observations |
seed |
a numerical seed |
This function computes the Variation of Information (VI) averaged over 30 discretisations.
A matrix of distances (dissimilarities).