| Title: | Evolutionary Feature Engineering |
|---|---|
| Description: | Automates feature engineering using evolutionary algorithms inspired by genetic programming. Starting from raw input features, the package evolves candidate transformation recipes through selection, crossover, and mutation, evaluating fitness via cross-validation or train/validation splits with gradient-boosted tree models ('LightGBM' or 'XGBoost'). Built-in transformers include arithmetic, logarithmic, and power operations, interaction terms, target encoding, quantile and log-based binning, principal component analysis, truncated singular value decomposition, Uniform Manifold Approximation and Projection (UMAP) dimensionality reduction, and minimum spanning tree (MST) graph-based clustering. The evolutionary search yields an optimised feature recipe that can be applied to new data for prediction. Methods are described in McInnes et al. (2018) <doi:10.21105/joss.00861>, Ke et al. (2017) <https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-framework>, Chen and Guestrin (2016) <doi:10.1145/2939672.2939785>, Gagolewski (2021) <doi:10.1016/j.softx.2021.100722>, Gagolewski (2026) <doi:10.32614/CRAN.package.lumbermark>, and Gagolewski (2026) <doi:10.32614/CRAN.package.deadwood>. |
| Authors: | Gustavo Pereira [aut, cre] |
| Maintainer: | Gustavo Pereira <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-06-09 19:21:04 UTC |
| Source: | https://github.com/cran/evoFE |
Apply a single gene to a dataset
apply_gene(gene, train_data, val_data = NULL, target_col = NULL, state_cache = NULL, data_hash = NULL)

| Argument | Description |
|---|---|
| gene | A gene list representing a feature transformation. |
| train_data | A data.frame or data.table representing the training data. |
| val_data | Optional validation data.frame or data.table. |
| target_col | Name of the target column. |
| state_cache | Optional environment to cache full-dataset fitted states of stateful transformers. |
| data_hash | Optional pre-computed xxhash64 digest of the target column, to avoid redundant hashing when applying multiple genes. |

A list with three elements: train (the modified training
data.table with the new gene column appended), val (the
modified validation data.table or NULL), and gene
(the gene list, with its state element populated if the transformer
is stateful).
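A minimal usage sketch, assuming the package is installed under the name evoFE (taken from the Source URL) and that "log" is a valid transformer name, as the evo_transformers entry suggests. The toy data and the column names x and y are hypothetical.

```r
# Toy data; the column names x and y are hypothetical.
train <- data.frame(x = c(1, 10, 100), y = c(0, 1, 1))
val   <- data.frame(x = c(5, 50),      y = c(0, 1))

if (requireNamespace("evoFE", quietly = TRUE)) {
  # Assumption: "log" is one of the names in evoFE::evo_transformers.
  gene <- evoFE::create_gene("log", "x")
  res  <- evoFE::apply_gene(gene, train, val_data = val, target_col = "y")

  # Documented return value: train, val, and the (possibly fitted) gene.
  print(names(res$train))
  print(res$gene$output_col)
}
```

The package calls are guarded so the snippet still runs when evoFE is not installed.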
Apply an entire individual's recipe to data
apply_individual(ind, train_data, val_data = NULL, target_col = NULL, state_cache = NULL)

| Argument | Description |
|---|---|
| ind | An evo_individual object. |
| train_data | A data.frame or data.table representing the training data. |
| val_data | Optional validation data.frame or data.table. |
| target_col | Name of the target column. |
| state_cache | Optional environment to cache full-dataset fitted states of stateful transformers. |

A list with three elements: train (the transformed training
data.table with all gene columns applied), val (the
transformed validation data.table or NULL), and ind
(the updated evo_individual whose genes now carry fitted states).
Create a single gene
create_gene(transformer_name, input_cols)

| Argument | Description |
|---|---|
| transformer_name | Name of the transformer. |
| input_cols | Vector of input column names. |

A gene list with elements transformer_name, input_cols,
params (transformer-specific parameters), state (NULL
until fitted), and output_col (auto-generated column name).
Create an individual
create_individual(genes = list(), numeric_cols = character(0), categorical_cols = character(0))

| Argument | Description |
|---|---|
| genes | List of genes. |
| numeric_cols | Vector of numeric column names. |
| categorical_cols | Vector of categorical column names. |

An evo_individual S3 object: a list with elements
genes (topologically sorted), numeric_cols,
categorical_cols, and fitness (initialised to
NA_real_).
Create a transformer definition
create_transformer(name, type, input_type = "numeric", output_type = "numeric", fit_func = NULL, apply_func, name_generator, allow_replace = FALSE)

| Argument | Description |
|---|---|
| name | Transformer name. |
| type | Transformer type: "unary", "binary", or "supervised_unary". |
| input_type | Type of input: "numeric" or "categorical". |
| output_type | Type of output: "numeric" or "categorical". |
| fit_func | A function(data, input_cols, target_col = NULL) returning the fitted state. |
| apply_func | A function(data, input_cols, state = NULL) returning the new column vector. |
| name_generator | A function(input_cols) returning the output column name. |
| allow_replace | Logical. Whether column sampling allows replacement. |

An evo_transformer S3 object: a list with elements
name, type, input_type, output_type,
fit_func, apply_func, name_generator, and
allow_replace.
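The three callbacks can be sketched as follows for a hypothetical stateful z-score transformer. The names zscore_fit, zscore_apply, and zscore_name are illustrative and not part of the package; the function signatures follow the argument descriptions above. Whether a "unary" transformer accepts a fit_func is an assumption.

```r
# fit_func: learn the state (mean and sd) from the data it is given.
zscore_fit <- function(data, input_cols, target_col = NULL) {
  x <- data[[input_cols[1]]]
  list(mean = mean(x, na.rm = TRUE), sd = stats::sd(x, na.rm = TRUE))
}

# apply_func: reuse the fitted state, so new rows never influence the fit.
zscore_apply <- function(data, input_cols, state = NULL) {
  x <- data[[input_cols[1]]]
  if (is.null(state) || is.na(state$sd) || state$sd == 0) {
    return(rep(0, length(x)))  # degenerate column: fall back to zeros
  }
  (x - state$mean) / state$sd
}

# name_generator: derive the output column name from the inputs.
zscore_name <- function(input_cols) paste0("zscore_", input_cols[1])

train <- data.frame(x = c(1, 2, 3, 4, 5))
val   <- data.frame(x = c(3, 6))

state     <- zscore_fit(train, "x")
val_score <- zscore_apply(val, "x", state)

if (requireNamespace("evoFE", quietly = TRUE)) {
  # Assumption: a "unary" transformer may carry a fit_func.
  zscore_tf <- evoFE::create_transformer(
    name = "zscore", type = "unary",
    fit_func = zscore_fit, apply_func = zscore_apply,
    name_generator = zscore_name
  )
}
```

Fitting on the training rows and applying the stored state to validation rows keeps validation data out of the fitted statistics.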
Crossover two individuals
crossover(ind1, ind2, verbose = FALSE)

| Argument | Description |
|---|---|
| ind1 | Parent 1. |
| ind2 | Parent 2. |
| verbose | Logical. Whether to print crossover details. |

An evo_individual child created by randomly sampling genes
from both parents with duplicate gene outputs removed.
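The idea of sampling genes from both parents and dropping duplicate outputs can be illustrated in base R. This is a conceptual sketch, not the package's implementation: the helper names and the keep_prob parameter are hypothetical, and genes are reduced to plain lists carrying only an output_col field.

```r
# Illustrative stand-in for a gene: only the output column name matters here.
make_gene <- function(out) list(output_col = out)

parent1 <- list(make_gene("log_x"), make_gene("sqrt_y"))
parent2 <- list(make_gene("log_x"), make_gene("x_times_y"))

random_crossover_sketch <- function(genes1, genes2, keep_prob = 0.5) {
  pool  <- c(genes1, genes2)
  keep  <- stats::runif(length(pool)) < keep_prob  # keep_prob is an assumption
  child <- pool[keep]
  outs  <- vapply(child, function(g) g$output_col, character(1))
  child[!duplicated(outs)]                         # one gene per output column
}

set.seed(1)
child <- random_crossover_sketch(parent1, parent2)
child_outputs <- vapply(child, function(g) g$output_col, character(1))
print(child_outputs)
```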
Evaluate the fitness of an individual
evaluate_fitness(ind, data, target_col, task = "classification", cv_folds = 3, evaluation_strategy = "cv", split_ids = NULL, shared_splits = NULL, evaluator = "lightgbm", fold_ids = NULL, shared_folds = NULL, shared_full = NULL, state_cache = NULL, threads = 2)

| Argument | Description |
|---|---|
| ind | An evo_individual object. |
| data | A data.frame or data.table containing the dataset. |
| target_col | Name of the target column. |
| task | "classification" or "regression". |
| cv_folds | Number of cross-validation folds. |
| evaluation_strategy | Character string, either "cv" (cross-validation) or "split" (train/validation split). |
| split_ids | Optional vector of pre-defined split assignments (e.g. "train", "val", "holdout"). |
| shared_splits | Optional list of shared data.table splits for in-place caching. |
| evaluator | The ML model to use ("lightgbm" or "xgboost"). |
| fold_ids | Optional vector of pre-defined fold assignments. |
| shared_folds | Optional list of shared data.table CV folds for in-place caching. |
| shared_full | Optional data.table of the full dataset for in-place caching. |
| state_cache | Optional environment to cache full-dataset fitted states of stateful transformers. |
| threads | Number of threads to use for parallel execution (default 2). |

The input evo_individual with its fitness field set to
the computed score (higher is better), importances set to a named
numeric vector of feature importances, holdout_fitness set to
NULL, and genes updated with fitted transformer states.
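A sketch of scoring a baseline individual with pre-defined folds. It assumes that fold_ids is an integer vector with one fold label (1 to k) per row, which the documentation does not spell out, and that a 0/1 integer target is accepted for classification. The toy data is hypothetical.

```r
set.seed(7)
n   <- 90
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- as.integer(dat$x1 + dat$x2 > 0)

# Assumption: one integer fold label per row, balanced across k folds.
k        <- 3
fold_ids <- sample(rep(seq_len(k), length.out = n))

if (requireNamespace("evoFE", quietly = TRUE)) {
  # Baseline individual with no genes, i.e. original features only.
  ind <- evoFE::create_individual(genes = list(), numeric_cols = c("x1", "x2"))
  ind <- evoFE::evaluate_fitness(
    ind, dat, target_col = "y", task = "classification",
    cv_folds = k, fold_ids = fold_ids
  )
  print(ind$fitness)  # higher is better, per the return-value description
}
```

Reusing the same fold_ids for every individual keeps fitness scores comparable across candidates.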
A list of default transformer definitions available for feature engineering.
evo_transformers
A named list of evo_transformer objects, each defining a
feature transformation (e.g. log, pca, target_encode).
Run evolutionary feature engineering
evolve_features(data, target_col, task = "classification", generations = 10, pop_size = 10, cv_folds = 3, evaluation_strategy = "cv", split_ratio = c(0.6, 0.2, 0.2), split_ids = NULL, early_stopping_rounds = 3, evaluator = "lightgbm", dynamic_population = TRUE, crossover_type = "both", threads = 2, max_clustering_size = 5000, verbose = TRUE)

| Argument | Description |
|---|---|
| data | A data.frame or data.table. |
| target_col | Name of the target column. |
| task | "classification" or "regression". |
| generations | Maximum number of generations (iterations). |
| pop_size | Population size. |
| cv_folds | Number of cross-validation folds. |
| evaluation_strategy | "cv" or "split". Strategy used to evaluate candidate recipes. |
| split_ratio | A numeric vector of length 2 or 3 defining train/validation/holdout proportions (e.g. c(0.6, 0.2, 0.2)). |
| split_ids | An optional character vector of split assignments (e.g. "train", "val", "holdout"). |
| early_stopping_rounds | Stop if fitness does not improve for this many generations. |
| evaluator | The ML model to use ("lightgbm" or "xgboost"). |
| dynamic_population | Logical. If TRUE, the population expands dynamically during stagnation. |
| crossover_type | Crossover type: "both" (default, 50% random / 50% union), "random", or "union". |
| threads | Number of threads to use for parallel execution (default 2). |
| max_clustering_size | Maximum number of unique training rows to cluster (default 5000; 0 or NULL for unlimited). |
| verbose | Logical. If TRUE, prints progress. |

An evo_recipe S3 object: a list with elements
best_individual (the top-scoring evo_individual),
history (list of all evaluated individuals across generations),
task, best_model (the trained model object),
evaluator, and classes (class levels for multiclass tasks,
otherwise NULL).
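An end-to-end sketch on synthetic data, assuming evoFE and its default evaluator backend are installed and that a 0/1 integer target is accepted for classification. The data, column names, and the small generations and pop_size values are illustrative choices to keep the run short.

```r
set.seed(42)
n   <- 200
toy <- data.frame(x1 = runif(n, 1, 10), x2 = runif(n, 1, 10))
# The target depends on log(x1) * x2, a relationship the search could rediscover.
toy$y <- as.integer(log(toy$x1) * toy$x2 + rnorm(n) > 8)

if (requireNamespace("evoFE", quietly = TRUE)) {
  recipe <- evoFE::evolve_features(
    data = toy, target_col = "y", task = "classification",
    generations = 2, pop_size = 4, verbose = FALSE
  )

  # Inspect the winning recipe and its score.
  print(evoFE::individual_to_recipe_string(recipe$best_individual))
  print(recipe$best_individual$fitness)

  # Engineered features and model predictions for (here, the same) data.
  features <- predict(recipe, newdata = toy)
  preds    <- evoFE::predict_model(recipe, newdata = toy)
}
```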
Convert a gene to a formula string
gene_to_formula(gene)

| Argument | Description |
|---|---|
| gene | A gene list. |

A character string representing the gene as a human-readable
formula, e.g. "log(col1)" or "pca2(col1, col2)".
Convert a gene to a formula string for state caching (ignoring component index)
gene_to_state_formula(gene)

| Argument | Description |
|---|---|
| gene | A gene list. |

A character string representing the gene formula suitable for state caching. For multi-component transformers (PCA, SVD, UMAP) the component index is omitted so that all components share one cache key.
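Why a shared cache key helps can be shown with a small base R sketch. The helper state_key_sketch and the exact key format are hypothetical; the package's real state formula may look different. The point is only that two components of the same multi-component transformer resolve to one cached fit.

```r
# Hypothetical: strip a trailing component index from the transformer name,
# e.g. "pca2(col1, col2)" -> "pca(col1, col2)".
state_key_sketch <- function(formula) {
  sub("^([A-Za-z_]+)[0-9]+\\(", "\\1(", formula)
}

cache     <- new.env()
fit_calls <- 0

get_state <- function(formula) {
  key <- state_key_sketch(formula)
  if (is.null(cache[[key]])) {
    fit_calls <<- fit_calls + 1        # stands in for an expensive fit
    cache[[key]] <- list(fitted = TRUE)
  }
  cache[[key]]
}

invisible(get_state("pca1(col1, col2)"))
invisible(get_state("pca2(col1, col2)"))  # same key, so no second fit
print(fit_calls)
```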
Convert an individual to a recipe string of formulas
individual_to_recipe_string(ind)

| Argument | Description |
|---|---|
| ind | An evo_individual. |

A character string listing all gene formulas in bracket notation,
e.g. "[log(x), sqrt(y)]", or "[Original features only]"
when the individual has no genes.
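The bracket notation described above can be reproduced in a few lines of base R. The helper recipe_string_sketch is hypothetical and only mirrors the documented output format; the guarded call shows the package function itself, assuming "log" is a valid transformer name.

```r
# Hypothetical re-creation of the documented output format.
recipe_string_sketch <- function(formulas) {
  if (length(formulas) == 0) return("[Original features only]")
  paste0("[", paste(formulas, collapse = ", "), "]")
}

with_genes    <- recipe_string_sketch(c("log(x)", "sqrt(y)"))
without_genes <- recipe_string_sketch(character(0))
print(with_genes)
print(without_genes)

if (requireNamespace("evoFE", quietly = TRUE)) {
  ind <- evoFE::create_individual(
    genes = list(evoFE::create_gene("log", "x")),
    numeric_cols = "x"
  )
  print(evoFE::individual_to_recipe_string(ind))
}
```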
Initialize a population
initialize_population(pop_size, numeric_cols, categorical_cols, initial_genes = 2, task = "classification")

| Argument | Description |
|---|---|
| pop_size | Population size. |
| numeric_cols | Vector of numeric column names. |
| categorical_cols | Vector of categorical column names. |
| initial_genes | Number of initial genes per individual. |
| task | Task type ("classification", "regression", or "multiclass"). |

A list of evo_individual objects of length pop_size.
The first individual is a baseline with no genes; the remaining individuals
each carry initial_genes randomly generated genes.
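A sketch of preparing the column-name vectors from a data frame before building a population. The toy data is hypothetical, and treating every non-numeric feature as categorical is an assumption made for illustration.

```r
df <- data.frame(
  age    = c(21, 35),
  income = c(30, 52),
  city   = c("a", "b"),
  y      = c(0, 1),
  stringsAsFactors = FALSE
)
target_col <- "y"

# Split feature names by type, excluding the target.
feature_cols     <- setdiff(names(df), target_col)
is_num           <- vapply(df[feature_cols], is.numeric, logical(1))
numeric_cols     <- feature_cols[is_num]
categorical_cols <- feature_cols[!is_num]

if (requireNamespace("evoFE", quietly = TRUE)) {
  pop <- evoFE::initialize_population(
    pop_size = 6,
    numeric_cols = numeric_cols,
    categorical_cols = categorical_cols,
    initial_genes = 2,
    task = "classification"
  )
  print(length(pop))
}
```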
Mutate an individual
mutate(ind, verbose = FALSE, force_add = FALSE, importances = numeric(0), temperature = 1, task = "classification", tested_gene_outputs = NULL)

| Argument | Description |
|---|---|
| ind | An evo_individual. |
| verbose | Logical. Whether to print mutation details. |
| force_add | Logical. If TRUE, forces adding a new gene. |
| importances | A numeric vector of feature importances. |
| temperature | A numeric temperature value controlling selection weights. |
| task | The task type ("classification", "regression", or "multiclass"). |
| tested_gene_outputs | Character vector of gene output names that have been evaluated in a previous generation and are safe for chaining. When NULL (default), all existing gene outputs are available. Pass character(0) to block all chaining (e.g. during initialization). |

An evo_individual with the mutation applied (gene added,
removed, or modified) and fitness reset to NA_real_.
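How a temperature can shape importance-based selection weights is sketched below with a softmax. This is one common scheme and an assumption: the manual does not state which weighting the package uses, and selection_weights_sketch is a hypothetical helper.

```r
importances <- c(x1 = 0.6, x2 = 0.3, x3 = 0.1)

# Softmax over importances scaled by temperature (assumed scheme).
selection_weights_sketch <- function(importances, temperature = 1) {
  z <- importances / temperature
  w <- exp(z - max(z))   # subtract the max for numerical stability
  w / sum(w)
}

w_cold <- selection_weights_sketch(importances, temperature = 0.1)
w_hot  <- selection_weights_sketch(importances, temperature = 10)

print(round(w_cold, 3))  # low temperature: concentrates on the top feature
print(round(w_hot, 3))   # high temperature: close to uniform
```

Under this scheme a low temperature exploits known-important features, while a high temperature encourages exploration.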
Predict target values using the fully evolved model
predict_model(object, newdata, ...)

| Argument | Description |
|---|---|
| object | An evo_recipe object containing the trained model and best individual. |
| newdata | A data.frame or data.table to make predictions on. |
| ... | Additional arguments (currently unused). |

For binary classification and regression tasks a numeric vector of predictions. For multiclass tasks a numeric matrix with one column per class (columns named after class levels).
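For the multiclass case, the returned matrix can be turned into class labels with base R. The matrix below is a hypothetical stand-in shaped like the documented output, and the sketch assumes larger values mean a more likely class.

```r
# Hypothetical output: one row per observation, one named column per class.
probs <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.3, 0.6),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("setosa", "versicolor", "virginica"))
)

# Pick the column with the largest value in each row.
labels <- colnames(probs)[max.col(probs, ties.method = "first")]
print(labels)
```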
Apply feature engineering recipe to new data
## S3 method for class 'evo_recipe'
predict(object, newdata, ...)

| Argument | Description |
|---|---|
| object | An evo_recipe object. |
| newdata | A data.frame or data.table. |
| ... | Additional arguments. |

A data.table containing the engineered feature columns
(original plus all gene-derived columns) for newdata, ready for
downstream modelling.
Union Crossover of two individuals
union_crossover(ind1, ind2, verbose = FALSE)

| Argument | Description |
|---|---|
| ind1 | Parent 1. |
| ind2 | Parent 2. |
| verbose | Logical. Whether to print crossover details. |

An evo_individual child created by taking the union of all
genes from both parents with duplicate gene outputs removed.
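The union-then-deduplicate behaviour can be illustrated in base R. This is a conceptual sketch rather than the package's code: union_crossover_sketch is hypothetical, and genes are reduced to lists carrying only an output_col field.

```r
# Illustrative stand-in for a gene: only the output column name matters here.
make_gene <- function(out) list(output_col = out)

parent1 <- list(make_gene("log_x"), make_gene("sqrt_y"))
parent2 <- list(make_gene("log_x"), make_gene("x_times_y"))

union_crossover_sketch <- function(genes1, genes2) {
  pool <- c(genes1, genes2)                        # every gene from both parents
  outs <- vapply(pool, function(g) g$output_col, character(1))
  pool[!duplicated(outs)]                          # keep the first of each output
}

child         <- union_crossover_sketch(parent1, parent2)
child_outputs <- vapply(child, function(g) g$output_col, character(1))
print(child_outputs)
```

The shared gene log_x appears once in the child, so the child carries three genes rather than four.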