Package 'RAFS' reference manual

Title:	Robust Aggregative Feature Selection
Description:	A cross-validated minimal-optimal feature selection algorithm. It utilises popularity counting, hierarchical clustering with feature dissimilarity measures, and prefiltering with an all-relevant feature selection method to obtain the minimal-optimal set of features.
Authors:	Radosław Piliszek [aut, cre], Witold Remigiusz Rudnicki [ths, aut]
Maintainer:	Radosław Piliszek <[email protected]>
License:	GPL-3
Version:	0.2.4
Built:	2024-11-06 09:23:14 UTC
Source:	CRAN

All built-in feature dissimilarity functions

Description

To be used in run_rafs.

Usage

builtin_dist_funs

Format

An object of class list of length 5.

Details

Compute preliminary feature selection results for RAFS

Description

This is a secondary function, useful when experimenting with different feature selection filters and rankings. Its output is used in run_rafs_with_fs_results and it is called for the user in run_rafs.

Usage

compute_fs_results(data, decision, k, seeds, fs_fun = default_fs_fun)

Arguments

`data`	input data where columns are variables and rows are observations (all numeric)
`decision`	decision variable as a binary sequence of length equal to number of observations
`k`	number of folds for internal cross validation
`seeds`	a vector of seeds used for fold generation for internal cross validation
`fs_fun`	function to compute feature selection p-values; it must have the same signature as `default_fs_fun` (which is the default, see its help to learn more)

Value

A list with feature selection results, e.g. from default_fs_fun.

Examples

library(MDFS)
mdfs_omp_set_num_threads(1)  # only to pass CRAN checks
data(madelon)
fs_results <- compute_fs_results(madelon$data, madelon$decision, 2, c(12345))
run_rafs_with_fs_results(madelon$data, madelon$decision, fs_results)

Feature dissimilarity based on Pearson's Correlation (cor)

Description

To be used as one of the dist_funs in run_rafs.

Usage

cor_dist(relevant_train_data, train_decision = NULL, seed = NULL)

Arguments

`relevant_train_data`	input data where columns are variables and rows are observations (all numeric); assumed to contain only relevant data
`train_decision`	decision variable as a binary sequence of length equal to number of observations
`seed`	a numerical seed

Value

A matrix of distances (dissimilarities).

Create seeded folds

Description

A utility function used in RAFS but useful also for external cross-validation.

Usage

create_seeded_folds(decision, k, seed)

Arguments

`decision`	decision variable as a binary sequence of length equal to number of observations
`k`	number of folds for cross validation
`seed`	a numerical seed

Value

A vector of folds, each fold being a vector of selected indices.

Default feature dissimilarity functions

Description

As used in run_rafs.

Usage

default_dist_funs

Format

An object of class list of length 3.

Details

The default functions compute: Pearson's correlation (cor: cor_dist), Variation of Information (vi: vi_dist) and Symmetric Target Information Gain (stig: stig_dist).

These functions follow a similar protocol to default_fs_fun. They expect the same input except for the assumption that the data passed in is relevant. Each of them outputs a matrix of distances (dissimilarities) between features.
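
The protocol above can be illustrated with a user-defined dissimilarity function. The following is a minimal sketch only: `spearman_dist` is a hypothetical name that is not part of RAFS, and the formula (one minus the absolute Spearman correlation) is an assumption chosen for illustration, not the definition used by cor_dist.

```r
# Hypothetical custom dissimilarity: 1 - |Spearman correlation|.
# The signature mirrors cor_dist(); train_decision and seed are accepted
# for protocol compatibility but are not used in this sketch.
spearman_dist <- function(relevant_train_data, train_decision = NULL, seed = NULL) {
  rho <- stats::cor(relevant_train_data, method = "spearman")
  d <- 1 - abs(rho)
  diag(d) <- 0  # guard against floating-point noise on the diagonal
  d
}

# Synthetic data: 50 observations of 4 features.
set.seed(1)
x <- matrix(rnorm(200), nrow = 50, ncol = 4)
x[, 2] <- x[, 1] + rnorm(50, sd = 0.1)  # feature 2 closely tracks feature 1

d <- spearman_dist(x)
print(round(d, 2))
```

Such a function could then presumably be passed to run_rafs through the dist_funs argument, e.g. as `list(spearman = spearman_dist)`; that the list should be named this way is an assumption based on variant names such as stig_single appearing in the examples.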

Default (example) feature selection function for RAFS

Description

See run_rafs for how it is used. Only the train portion of the dataset is to be fed into this function.

Usage

default_fs_fun(train_data, train_decision, seed)

Arguments

`train_data`	input data where columns are variables and rows are observations (all numeric)
`train_decision`	decision variable as a binary sequence of length equal to number of observations
`seed`	a numerical seed

Details

The function MUST use this train_data and MAY ignore the train_decision.

If the function depends on randomness, it MUST use the seed parameter to seed the PRNG.

The function needs to return a list with at least two elements: rel_vars and rel_vars_rank, which are vectors and contain, respectively, the indices of variables considered relevant and the rank for each relevant variable. The function MAY return a list with more elements.

Other examples of sensible functions are included in the tests of this package.
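
As an illustration of the contract described above, here is a minimal sketch of a custom feature selection function built on base R only. The name `ttest_fs_fun`, the use of a Welch t-test, the 0.05 threshold, and the convention that rank 1 marks the most relevant variable are all assumptions made for this example; they are not taken from the package.

```r
# Hypothetical filter: Welch t-test per variable, keep variables with p < 0.05.
# The computation is deterministic, so the seed argument is accepted but unused.
ttest_fs_fun <- function(train_data, train_decision, seed) {
  p_values <- apply(train_data, 2, function(column) {
    stats::t.test(column[train_decision == 1],
                  column[train_decision == 0])$p.value
  })
  rel_vars <- unname(which(p_values < 0.05))  # assumed significance threshold
  list(
    rel_vars = rel_vars,
    # assumption: rank 1 denotes the most relevant variable
    rel_vars_rank = unname(rank(p_values[rel_vars], ties.method = "first")),
    p_values = unname(p_values)  # extra element, which the protocol permits
  )
}

# Synthetic data: 60 observations, 5 variables, variable 3 made informative.
set.seed(42)
decision <- rep(c(0, 1), each = 30)
data <- matrix(rnorm(60 * 5), nrow = 60)
data[, 3] <- data[, 3] + 3 * decision

res <- ttest_fs_fun(data, decision, seed = 1)
print(res$rel_vars)
print(res$rel_vars_rank)
```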

Value

A list with at least two fields: rel_vars and rel_vars_rank, which are vectors and contain, respectively, the indices of variables considered relevant and the rank for each relevant variable.

Default hclust methods

Description

As used in run_rafs to call hclust.

Usage

default_hclust_methods

Format

An object of class character of length 4.

Get all representatives from their popcnts

Description

This helper function works on results of get_rafs_reps_popcnts to obtain all representatives at the chosen number of clusters.

Usage

get_rafs_all_reps_from_popcnts(reps_popcnts, n_clusters)

Arguments

`reps_popcnts`	representatives' popcnts for the chosen variant as obtained from `get_rafs_reps_popcnts`
`n_clusters`	the desired number of clusters

Value

A vector of all representatives.

Examples

library(MDFS)
mdfs_omp_set_num_threads(1)  # only to pass CRAN checks
data(madelon)
rafs_results <- run_rafs(madelon$data, madelon$decision, 2, c(12345))
rafs_reps_popcnts <- get_rafs_reps_popcnts(rafs_results, 5)
get_rafs_all_reps_from_popcnts(rafs_reps_popcnts$stig_single, 5)

Get co-occurrence matrix from RAFS results

Description

This function obtains a matrix describing a graph of co-occurrence at each count of clusters (from n_clusters_range) computed over all runs of RAFS.

Usage

get_rafs_occurrence_matrix(
  rafs_results,
  interesting_reps,
  n_clusters_range = 2:15
)

Arguments

`rafs_results`	RAFS results as obtained from `run_rafs`
`interesting_reps`	the interesting representatives to build matrices for (in principle, these need not be representatives but it is more common)
`n_clusters_range`	range of cluster counts to obtain matrices for

Details

If a single result over a cluster number range is desired, the selected matrices can be summed.
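
Summing the selected matrices can be sketched with base R's Reduce. The matrices below are toy stand-ins, and the list names ("3", "4", "5") are hypothetical; how the package actually keys the per-cluster-count level is not stated here.

```r
# Toy stand-ins for co-occurrence matrices at 3, 4 and 5 clusters.
# The list names are hypothetical, not the package's actual keys.
mats <- list(
  "3" = matrix(c(2, 1, 1, 2), nrow = 2),
  "4" = matrix(c(2, 0, 0, 2), nrow = 2),
  "5" = matrix(c(1, 1, 1, 1), nrow = 2)
)

# Element-wise sum over the whole selected range.
total <- Reduce(`+`, mats)
print(total)
```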

Value

A nested list with matrices. The first level is per the RAFS variant (combination of feature dissimilarity function and hclust method). The second level is per the number of clusters. The third (and last) level is the co-occurrence matrix.

Examples

library(MDFS)
mdfs_omp_set_num_threads(1)  # only to pass CRAN checks
data(madelon)
rafs_results <- run_rafs(madelon$data, madelon$decision, 2, c(12345))
rafs_reps_popcnts <- get_rafs_reps_popcnts(rafs_results, 5)
rafs_top_reps <- get_rafs_top_reps_from_popcnts(rafs_reps_popcnts$stig_single, 5)
get_rafs_occurrence_matrix(rafs_results, rafs_top_reps, 5)

Get representatives' tuples' co-representation matrix from RAFS results

Description

This function obtains a matrix of representatives describing a graph of co-representation at each count of clusters (from n_clusters_range) computed over all runs of RAFS.

Usage

get_rafs_rep_tuples_matrix(
  rafs_results,
  interesting_reps,
  n_clusters_range = 2:15
)

Arguments

`rafs_results`	RAFS results as obtained from `run_rafs`
`interesting_reps`	the interesting representatives to build matrices for
`n_clusters_range`	range of cluster counts to obtain matrices for

Details

If a single result over a cluster number range is desired, the selected matrices can be summed.

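Value

A nested list with matrices. The first level is per the RAFS variant (combination of feature dissimilarity function and hclust method). The second level is per the number of clusters. The third (and last) level is the co-representation matrix.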

Examples

library(MDFS)
mdfs_omp_set_num_threads(1)  # only to pass CRAN checks
data(madelon)
rafs_results <- run_rafs(madelon$data, madelon$decision, 2, c(12345))
rafs_reps_popcnts <- get_rafs_reps_popcnts(rafs_results, 5)
rafs_top_reps <- get_rafs_top_reps_from_popcnts(rafs_reps_popcnts$stig_single, 5)
get_rafs_rep_tuples_matrix(rafs_results, rafs_top_reps, 5)

Get representatives' tuples' popularity counts (popcnts) from RAFS results

Description

This function obtains popularity counts (popcnts) of representatives' tuples present at each count of clusters (from n_clusters_range) computed over all runs of RAFS.

Usage

get_rafs_rep_tuples_popcnts(rafs_results, n_clusters_range = 2:15)

Arguments

`rafs_results`	RAFS results as obtained from `run_rafs`
`n_clusters_range`	range of cluster counts to obtain popcnts for

Value

A nested list with popcnts. The first level is per the RAFS variant (combination of feature dissimilarity function and hclust method). The second level is per the number of clusters. The third (and last) level is popcnts per representatives' tuple.

Examples

library(MDFS)
mdfs_omp_set_num_threads(1)  # only to pass CRAN checks
data(madelon)
rafs_results <- run_rafs(madelon$data, madelon$decision, 2, c(12345))
get_rafs_rep_tuples_popcnts(rafs_results, 2:5)

Get representatives' popularity counts (popcnts) from RAFS results

Description

This function obtains popularity counts (popcnts) of representatives present at each count of clusters (from n_clusters_range) computed over all runs of RAFS.

Usage

get_rafs_reps_popcnts(rafs_results, n_clusters_range = 2:15)

Arguments

`rafs_results`	RAFS results as obtained from `run_rafs`
`n_clusters_range`	range of cluster counts to obtain popcnts for

Details

These results might be fed into further helper functions: get_rafs_top_reps_from_popcnts and get_rafs_all_reps_from_popcnts.

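Value

A nested list with popcnts. The first level is per the RAFS variant (combination of feature dissimilarity function and hclust method). The second level is per the number of clusters. The third (and last) level is popcnts per representative.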

Examples

library(MDFS)
mdfs_omp_set_num_threads(1)  # only to pass CRAN checks
data(madelon)
rafs_results <- run_rafs(madelon$data, madelon$decision, 2, c(12345))
get_rafs_reps_popcnts(rafs_results, 2:5)

Get top (i.e., most common) representatives' tuples from their popcnts

Description

This helper function works on results of get_rafs_rep_tuples_popcnts to obtain the desired number of top (most common) representatives' tuples at the chosen number of clusters.

Usage

get_rafs_top_rep_tuples_from_popcnts(
  rep_tuples_popcnts,
  n_clusters,
  n_tuples = 1
)

Arguments

`rep_tuples_popcnts`	tuples' popcnts for the chosen variant as obtained from `get_rafs_rep_tuples_popcnts`
`n_clusters`	the desired number of clusters
`n_tuples`	the desired number of top tuples

Value

A list of top tuples (each tuple being a vector of representatives).

Examples

library(MDFS)
mdfs_omp_set_num_threads(1)  # only to pass CRAN checks
data(madelon)
rafs_results <- run_rafs(madelon$data, madelon$decision, 2, c(12345))
rafs_rep_tuples_popcnts <- get_rafs_rep_tuples_popcnts(rafs_results, 5)
get_rafs_top_rep_tuples_from_popcnts(rafs_rep_tuples_popcnts$stig_single, 5)

Get top (i.e., most common) representatives from their popcnts

Description

This helper function works on results of get_rafs_reps_popcnts to obtain the desired number of top (most common) representatives at the chosen number of clusters.

Usage

get_rafs_top_reps_from_popcnts(reps_popcnts, n_clusters, n_reps = n_clusters)

Arguments

`reps_popcnts`	popcnts for the chosen variant as obtained from `get_rafs_reps_popcnts`
`n_clusters`	the desired number of clusters
`n_reps`	the desired number of top representatives

Value

A vector of top representatives.

Examples

library(MDFS)
mdfs_omp_set_num_threads(1)  # only to pass CRAN checks
data(madelon)
rafs_results <- run_rafs(madelon$data, madelon$decision, 2, c(12345))
rafs_reps_popcnts <- get_rafs_reps_popcnts(rafs_results, 5)
get_rafs_top_reps_from_popcnts(rafs_reps_popcnts$stig_single, 5)

Get top popularity counts (popcnts) from FS results

Description

This function obtains popularity counts (popcnts) of top variables computed over all runs of FS.

Usage

get_rafs_tops_popcnts(fs_results, n_top_range = 2:15)

Arguments

`fs_results`	RAFS FS results as obtained from `compute_fs_results`
`n_top_range`	range of numbers of top variables to obtain popcnts for

Details

These results might be fed into further helper functions: get_rafs_top_reps_from_popcnts and get_rafs_all_reps_from_popcnts.

Value

A nested list with popcnts. The first level is per the number of top variables. The second (and last) level is popcnts per top variable.
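
The idea of popularity counting over runs can be illustrated in base R. This is a conceptual sketch on made-up rankings, not the package's implementation; the run names and variable indices are hypothetical.

```r
# Hypothetical per-run rankings: each vector lists variable indices
# from the best-ranked to the worst-ranked.
run_rankings <- list(
  run1 = c(7, 3, 12, 5),
  run2 = c(3, 7, 5, 12),
  run3 = c(7, 12, 3, 9)
)

n_top <- 2  # how many top variables to take from each run

# Collect the top n_top variables of every run, then count occurrences.
top_vars <- unlist(lapply(run_rankings, head, n = n_top))
popcnts <- sort(table(top_vars), decreasing = TRUE)
print(popcnts)
```

Here variable 7 is among the top two in all three runs, variable 3 in two runs, and variable 12 in one.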

Examples

library(MDFS)
mdfs_omp_set_num_threads(1)  # only to pass CRAN checks
data(madelon)
fs_results <- compute_fs_results(madelon$data, madelon$decision, 2, c(12345))
get_rafs_tops_popcnts(fs_results, 2:5)

Generate CV run identifiers

Description

A utility function used in RAFS to generate cross validation run identifiers, thus useful also for external cross-validation.

Usage

get_run_id(seed, k, i)

Arguments

`seed`	a numerical seed
`k`	number of folds for cross validation
`i`	current fold number (1 to `k`)

Value

A string with the run identifier.

Robust Aggregative Feature Selection (RAFS)

Description

This is the main function of the RAFS library to run for analysis.

Usage

run_rafs(
  data,
  decision,
  k = 5,
  seeds = sample.int(32767, 10),
  fs_fun = default_fs_fun,
  dist_funs = default_dist_funs,
  hclust_methods = default_hclust_methods
)

Arguments

`data`	input data where columns are variables and rows are observations (all numeric)
`decision`	decision variable as a binary sequence of length equal to number of observations
`k`	number of folds for internal cross validation
`seeds`	a vector of seeds used for fold generation for internal cross validation
`fs_fun`	function to compute feature selection p-values; it must have the same signature as `default_fs_fun` (which is the default, see its help to learn more)
`dist_funs`	a list of feature dissimilarity functions computed over the relevant portion of the training dataset (see the example `default_dist_funs` and `builtin_dist_funs` to learn more)
`hclust_methods`	a vector of `hclust` methods to use

Details

Depending on your pipeline, you may also want to check out run_rafs_with_fs_results and compute_fs_results, which this function simply wraps.

The results from this function can be fed into one of the helper functions to analyse them further: get_rafs_reps_popcnts, get_rafs_rep_tuples_popcnts, get_rafs_rep_tuples_matrix and get_rafs_occurrence_matrix.

Value

A nested list with hclust results. The first level is per the cross validation run. The second level is per the feature dissimilarity function. The third (and last) level is per the hclust method.
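
Each innermost element is an hclust result, so it can be cut into a chosen number of clusters with base R's cutree. The sketch below builds a toy hclust object from a made-up dissimilarity matrix instead of calling run_rafs; the "single" method is used only because it appears in variant names such as stig_single in the examples.

```r
# Toy dissimilarity matrix for four features:
# features 1 and 2 are close, and so are features 3 and 4.
d <- matrix(c(0.00, 0.10, 0.90, 0.80,
              0.10, 0.00, 0.85, 0.90,
              0.90, 0.85, 0.00, 0.20,
              0.80, 0.90, 0.20, 0.00), nrow = 4, byrow = TRUE)

# Hierarchical clustering on the dissimilarities, then a cut into 2 clusters.
hc <- stats::hclust(stats::as.dist(d), method = "single")
clusters <- stats::cutree(hc, k = 2)
print(clusters)
```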

Examples

library(MDFS)
mdfs_omp_set_num_threads(1)  # only to pass CRAN checks
data(madelon)
run_rafs(madelon$data, madelon$decision, 2, c(12345))

Robust Aggregative Feature Selection (RAFS) from feature selection results

Description

This is a secondary function, useful when experimenting with different feature selection filters and rankings. The output is exactly the same as from run_rafs.

Usage

run_rafs_with_fs_results(
  data,
  decision,
  fs_results,
  dist_funs = default_dist_funs,
  hclust_methods = default_hclust_methods
)

Arguments

`data`	input data where columns are variables and rows are observations (all numeric)
`decision`	decision variable as a binary sequence of length equal to number of observations
`fs_results`	output from `compute_fs_results` computed for the same `data` and `decision`
`dist_funs`	a list of feature dissimilarity functions computed over the relevant portion of the training dataset (see the example `default_dist_funs` to learn more)
`hclust_methods`	a vector of `hclust` methods to use

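Value

A nested list with hclust results, structured exactly as the output of run_rafs. The first level is per the cross validation run. The second level is per the feature dissimilarity function. The third (and last) level is per the hclust method.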

Examples

library(MDFS)
mdfs_omp_set_num_threads(1)  # only to pass CRAN checks
data(madelon)
fs_results <- compute_fs_results(madelon$data, madelon$decision, 2, c(12345))
run_rafs_with_fs_results(madelon$data, madelon$decision, fs_results)

Symmetric Target Information Gain (STIG) computed directly

Description

To be used as one of the dist_funs in run_rafs.

Usage

stig_dist(relevant_train_data, train_decision, seed)

Arguments

`relevant_train_data`	input data where columns are variables and rows are observations (all numeric); assumed to contain only relevant data
`train_decision`	decision variable as a binary sequence of length equal to number of observations
`seed`	a numerical seed

Details

This function computes the STIG metric directly from the data, maximising it over 30 discretisations.

Value

A matrix of distances (dissimilarities).

Symmetric Target Information Gain (STIG) computed from single Information Gains (IGs)

Description

To be used as one of the dist_funs in run_rafs.

Usage

stig_from_ig_dist(relevant_train_data, train_decision, seed)

Arguments

`relevant_train_data`	input data where columns are variables and rows are observations (all numeric); assumed to contain only relevant data
`train_decision`	decision variable as a binary sequence of length equal to number of observations
`seed`	a numerical seed

Details

This function computes the STIG metric from single Information Gains (IGs) maximised over 30 discretisations and then summed pair-wise.

This function is similar to stig_dist but the results differ slightly. We recommend the direct computation in general.

Value

A matrix of distances (dissimilarities).

Symmetric Target Information Gain (STIG) computed directly but with pre-computed 1D conditional entropy (aka stable)

Description

To be used as one of the dist_funs in run_rafs.

Usage

stig_stable_dist(relevant_train_data, train_decision, seed)

Arguments

`relevant_train_data`	input data where columns are variables and rows are observations (all numeric); assumed to contain only relevant data
`train_decision`	decision variable as a binary sequence of length equal to number of observations
`seed`	a numerical seed

Details

This function computes the STIG metric directly from the data, maximising it over 30 discretisations, but reusing the common 1D conditional entropy.

Value

A matrix of distances (dissimilarities).

Variation of Information (VI)

Description

To be used as one of the dist_funs in run_rafs.

Usage

vi_dist(relevant_train_data, train_decision = NULL, seed)

Arguments

`relevant_train_data`	input data where columns are variables and rows are observations (all numeric); assumed to contain only relevant data
`train_decision`	decision variable as a binary sequence of length equal to number of observations
`seed`	a numerical seed

Details

This function computes the Variation of Information (VI) averaged over 30 discretisations.

Value

A matrix of distances (dissimilarities).
