| Title: | QSAR Modelling Using Genetic Algorithm Based Variable Selection |
|---|---|
| Description: | Implements genetic algorithm-based variable selection for building quantitative structure-activity relationship (QSAR) models. The package provides a workflow for selecting optimal predictor subsets from large descriptor spaces using leave-one-out cross-validation (LOOCV) with Q2 as the fitness criterion. Features include automatic handling of multicollinearity via variance inflation factor (VIF) thresholding, customizable genetic algorithm operators, and diagnostic tools for model evaluation. Supports both training set optimization and external validation, plus nested (double) cross-validation for unbiased performance estimation and predictor stability diagnostics. Built-in visualization functions include Q2 curves and Williams plots to assess model applicability domain. The method is demonstrated in papers predicting antibacterial activity by Araya-Cloutier et al. (2018) <doi:10.1038/s41598-018-27545-4> and Kalli et al. (2021) <doi:10.1038/s41598-021-92964-9>. |
| Authors: | Jos Hageman [aut, cre] |
| Maintainer: | Jos Hageman <[email protected]> |
| License: | GPL-3 |
| Version: | 1.2.3 |
| Built: | 2026-06-24 13:25:46 UTC |
| Source: | https://github.com/cran/gaQSAR |
Builds a simple line plot of the GA best fitness value per generation.
Works with both gaQSAR and gaQSAR_dcv objects. For nested CV objects
(gaQSAR_dcv), one series is shown per outer fold (each inner GA run).
createBestFitnessPlot(object, title = NULL)createBestFitnessPlot(object, title = NULL)
object |
A |
title |
Optional character plot title. If missing, a default title is chosen based on the object type. |
A ggplot2 object.
Summarizes training metrics across outer folds for one or multiple
gaQSAR_dcv objects and plots mean +/- SE versus the number of predictors in
a single panel. Metrics share axes and are distinguished by color/shape.
createDCVTrainingMetricsPlot( dcvResults, metrics = c("R2", "R2adj", "Q2"), title = "", includeOuterQ2 = FALSE )createDCVTrainingMetricsPlot( dcvResults, metrics = c("R2", "R2adj", "Q2"), title = "", includeOuterQ2 = FALSE )
dcvResults |
A |
metrics |
Character vector of metrics to plot. Allowed values:
|
title |
Optional character plot title. |
includeOuterQ2 |
Logical; if |
SE is computed across outer folds with status "ok".
A ggplot2 object.
Build a Williams plot for a gaQSAR_dcv run using diagnostics collected
across training folds. Optionally aggregate per object (mean/median) or
show raw points (no aggregation).
createDCVWilliamsPlot( dcvResult, residualThreshold = 2.5, aggregation = c("mean", "median", "none"), colorBy = c("objectId", "fold", "none", "object"), label = "", labelOutliers = c("rowNumber", "rowName", "none") )createDCVWilliamsPlot( dcvResult, residualThreshold = 2.5, aggregation = c("mean", "median", "none"), colorBy = c("objectId", "fold", "none", "object"), label = "", labelOutliers = c("rowNumber", "rowName", "none") )
dcvResult |
A |
residualThreshold |
Numeric threshold for standardized residual bands. Default is 2.5. |
aggregation |
Character scalar: "mean", "median", or "none" (raw points). Default is "mean". |
colorBy |
Character scalar controlling color mapping: "objectId", "fold", or "none". "object" is accepted as a backward-compatible alias for "objectId". Default is "objectId". |
label |
Optional character string appended to the plot title. |
labelOutliers |
Character scalar controlling outlier labeling: "rowNumber", "rowName", or "none". Outliers are defined as points with high leverage (> h*) and/or high residual (> residualThreshold). Default is "rowNumber". |
A ggplot2 object.
Create a diagnostic plot of cross-validated performance against model size. The figure shows Q2 for the training set (LOOCV) and, when available, the external validation set as a function of the number of selected predictors. Use this for visual inspection of GA-based variable selection results and to compare model parsimony versus predictive ability.
createQ2Plot(output, label = "")createQ2Plot(output, label = "")
output |
A |
label |
Character string used as the plot title (set to |
Lines are drawn only for groups that contain two or more points; otherwise
individual points are shown. Point shape and color encode whether values
originate from the training (Q2Loocv) or external (Q2Ext) set. The
x-axis shows the number of predictors; tick labels are limited to integers.
Returns a Q2 versus number of predictors plot.
gaVariableSelection(), Q2(), createWilliamsPlot()
Creates Williams plots for one gaQSAR object or for a list of gaQSAR
objects. A Williams plot shows standardized residuals against leverage and is
commonly used to inspect outliers and the applicability domain of a QSAR
model.
createWilliamsPlot( output, xtrain, ytrain, xtest, ytest, label = "", residualThreshold = 2.5, labelPoints = c("outliers", "all", "none") )createWilliamsPlot( output, xtrain, ytrain, xtest, ytest, label = "", residualThreshold = 2.5, labelPoints = c("outliers", "all", "none") )
output |
A |
xtrain |
A data.frame or matrix with training predictors. Row names are used as object identifiers when available. |
ytrain |
Numeric vector with training responses, aligned with the rows of
|
xtest |
A data.frame or matrix with validation predictors. It must contain the selected predictor columns. Row names are used as object identifiers when available. |
ytest |
Numeric vector with validation responses, aligned with the rows of
|
label |
Optional character string appended to the plot title. |
residualThreshold |
Numeric threshold for standardized residuals. The default is 2.5. |
labelPoints |
Character scalar controlling point labels: |
For each model, the selected predictors are taken from importantPredictors.
The model is refitted on xtrain and ytrain. Training leverages are computed
from the fitted design matrix. Validation leverages are computed from the
validation design matrix using the inverse information matrix from the training
data.
Training standardized residuals are computed as
residual / (sigma * sqrt(1 - h)). Validation residuals are standardized as
prediction residuals using residual / (sigma * sqrt(1 + h)), where h is
the leverage of the validation object relative to the training design. This
avoids undefined residuals when validation leverages are larger than one.
Horizontal reference lines are drawn at 0 and at
+/- residualThreshold. The leverage threshold is 3 * p / n, where p is
the number of model coefficients including the intercept and n is the
number of training observations.
Points are treated as Williams plot outliers when they have an absolute
standardized residual larger than residualThreshold and/or a leverage
larger than the leverage threshold.
The input object, with each gaQSAR object augmented by:
- williamsData: data.frame with object identifiers, leverages, residuals and dataset type.
- williamsPlot: a ggplot2 Williams plot.
- williamsOutliers: counts of high-residual and high-leverage objects
Performs nested cross-validation combining an outer CV loop (for final model
assessment) with an inner CV loop (used by the GA for fitness evaluation during
variable selection). The inner loop always uses LOOCV via singleCV as the
GA fitness. The outer loop may be leave-one-out or k-fold (default), providing
an unbiased estimate of outer-loop predictive performance and stability analysis
of selected predictors.
gaDoubleCrossValidation( x, y, outerMethod = "kfold", outerK = 5, seed = NULL, residualThreshold = 2.5, verbose = FALSE, ... )gaDoubleCrossValidation( x, y, outerMethod = "kfold", outerK = 5, seed = NULL, residualThreshold = 2.5, verbose = FALSE, ... )
x |
A data.frame or matrix of predictors (rows = observations, columns = candidate variables). |
y |
Numeric response vector aligned with rows of |
outerMethod |
Character; outer CV method: "loo" for leave-one-out or "kfold" for k-fold cross-validation (default). |
outerK |
Integer; number of folds for outer k-fold CV (only used if
|
seed |
Integer; random seed for reproducible fold splits (especially relevant
for k-fold CV). If |
residualThreshold |
Numeric; threshold for standardized residuals in Williams plot diagnostics. |
verbose |
Logical; if |
... |
Additional arguments forwarded to |
For each outer fold:
Create outerTrain (all data except outerTest) and outerTest (hold-out fold).
Run gaVariableSelection() on outerTrain using inner CV for fitness evaluation.
Fit the selected lm model on outerTrain using selected predictors.
Predict on outerTest and store predictions and residuals.
Compute and store diagnostic measures (Williams plot data, VIF, etc.).
Two types of seeds are employed:
The seed parameter (outer fold seed) controls reproducible creation of outer CV folds.
The seeds parameter (passed via ... to gaVariableSelection) enables multiple GA runs
for robustness within each fold. The best seed is tracked in foldSummaries$bestSeed.
Across all outer folds, summaries are computed:
selectionFrequency: How often each predictor is selected across successful outer folds.
selectionFrequencyAllFolds: Optional frequency over all outer folds, including failed folds.
signStability: For each selected predictor, the proportion of times its coefficient is positive (vs negative) across folds.
modelSizeDistribution: Distribution of numbers of selected predictors.
perObject outlier/high-leverage frequencies based on Williams plot thresholds.
An object of class "gaQSAR_dcv" (list) containing:
call: The matched call.
outer: List with elements method, k, seed (the outer fold seed for reproducibility),
and folds (list of outer fold indices).
outerPredictions: Numeric vector of length n (full dataset) with out-of-fold predictions.
outerResiduals: Numeric vector of length n with outer residuals.
yObserved: Original observed response vector (aligned with rows of x).
outerQ2: Numeric; Q2 computed from outer predictions (NA if < 2 valid predictions).
foldModels: List of gaQSAR objects (one per outer fold; NULL for failed folds).
foldSummaries: Data.frame with one row per outer fold, including columns: nPredictors, selectedPredictors, innerQ2 (LOOCV fitness), trainingR2, trainingR2Adj, maxVif, bestSeed (the GA seed that produced the best model for that fold), status ("ok" or "failed"), errorMessage.
foldDiagnostics: List with Williams plot diagnostic data per fold (NULL for failed folds); each element includes williamsData, hStar, residualThreshold.
adThresholds: List describing applicability domain thresholds used: leverageCutoffMethod, leverageCutoff (NA if varies), residualCutoff, notes.
selectionFrequency: Named numeric vector of predictor frequencies (0 to 1) computed over successful folds only.
selectionFrequencyAllFolds: Named numeric vector of predictor frequencies (0 to 1) computed over all folds, including failed folds.
signStability: Named numeric vector of positive sign frequency for selected predictors.
modelSizeDistribution: Named integer vector of model size frequencies.
perObjectOutlierFrequency: Numeric vector of length n with outlier frequencies.
perObjectHighLeverageFrequency: Numeric vector of length n with high-leverage frequencies.
gaVariableSelection(), singleCV(), Q2(), createWilliamsPlot()
set.seed(42) n <- 50 p <- 20 x <- matrix(rnorm(n * p), nrow = n) colnames(x) <- paste0("X", seq_len(p)) y <- 2 * x[, 2] - 1.5 * x[, 5] + rnorm(n, sd = 0.5) # Nested CV with LOO outer and inner LOOCV fitness # NOTE: Settings below are example-only (fast runtime, not optimal). # For real QSAR work, use larger population sizes, more generations, and multiple seeds. # Good starting point are the default GA settings, or run an experimental design to # tune GA settings for your dataset. dcvFit <- gaDoubleCrossValidation( x = x, y = y, outerMethod = "kfold", outerK = 2, numberOfVariables = 3, popSize = 20, maxIter = 20, seeds = 1, interval = 5, verbose = TRUE ) print(dcvFit) summary(dcvFit) plot(dcvFit) plot(dcvFit, type = "selectionFrequency") plot(dcvFit, type = "williamsFrequency")set.seed(42) n <- 50 p <- 20 x <- matrix(rnorm(n * p), nrow = n) colnames(x) <- paste0("X", seq_len(p)) y <- 2 * x[, 2] - 1.5 * x[, 5] + rnorm(n, sd = 0.5) # Nested CV with LOO outer and inner LOOCV fitness # NOTE: Settings below are example-only (fast runtime, not optimal). # For real QSAR work, use larger population sizes, more generations, and multiple seeds. # Good starting point are the default GA settings, or run an experimental design to # tune GA settings for your dataset. dcvFit <- gaDoubleCrossValidation( x = x, y = y, outerMethod = "kfold", outerK = 2, numberOfVariables = 3, popSize = 20, maxIter = 20, seeds = 1, interval = 5, verbose = TRUE ) print(dcvFit) summary(dcvFit) plot(dcvFit) plot(dcvFit, type = "selectionFrequency") plot(dcvFit, type = "williamsFrequency")
Mutate an integer-encoded chromosome by independently replacing each gene,
with probability object@pmutation, by a new value drawn uniformly from
1:object@upper[i]. Intended for use as the mutation function in GA::ga().
gaintegerMutation(object, parent, ...)gaintegerMutation(object, parent, ...)
object |
A |
parent |
Integer; row index of the parent chromosome in |
... |
Ignored; for compatibility with the |
Mutation decisions are drawn independently per gene using
stats::runif(). When mutation occurs for gene i, a new integer in the
range [1, upper[i]] is sampled with replacement via sample.int().
A numeric vector containing the mutated chromosome.
GA::ga
Perform one-point crossover on two parent chromosomes. A single crossover
point is sampled uniformly over gene positions; genes to the left of the
point (inclusive) are swapped between parents. Intended for use as the
crossover function in GA::ga().
gaintegerOnePointCrossover(object, parents, ...)gaintegerOnePointCrossover(object, parents, ...)
object |
A |
parents |
Integer vector of length 2 giving the row indices of the two
parents in |
... |
Ignored; included for compatibility with the |
The function returns two children with identical lengths and integer
encoding as the parents. The fitness values are not evaluated here and are
returned as NA_real_ placeholders, to be computed by the GA engine.
A list with elements children (2 x nGenes integer matrix) and
fitness (numeric vector of length 2 filled with NA).
GA::ga
Create an initial population for integer-encoded chromosomes. For each gene
position j, values are drawn independently and uniformly from the range
1:object@upper[j]. Intended for use as the population function in
GA::ga().
gaintegerPopulation(object, ...)gaintegerPopulation(object, ...)
object |
A |
... |
Ignored; present for compatibility with the |
The returned matrix has one row per individual and one column per
gene. Sampling uses sample.int() with replacement.
A numeric matrix with dimensions popSize x length(upper) containing
the initial population.
GA::ga
Perform a two-point crossover on two parent chromosomes. Two crossover points
are sampled uniformly, ordered as p1 < p2, and the outer segments
(positions 1..p1 and p2..end) are swapped between parents. Intended for
use as the crossover function in GA::ga().
gaintegerTwoPointCrossover(object, parents, ...)gaintegerTwoPointCrossover(object, parents, ...)
object |
A |
parents |
Integer vector of length 2 giving the row indices of the two
parents in |
... |
Ignored; included for compatibility with the |
Returns two children with the same integer encoding and length as
the parents. Fitness values are not computed here and are returned as
NA_real_ placeholders.
A list with elements children (2 x nGenes integer matrix) and
fitness (numeric vector of length 2 filled with NA).
GA::ga
Performs a Y-scrambling permutation test for objects produced by gaQSAR or gaQSAR_dcv. In each permutation, the response vector is randomly permuted while the descriptor matrix is kept unchanged. The GA variable selection procedure is then repeated using the same settings as in the original analysis.
gaPermutationTest( object, x, nPermutations = 500, seed = NULL, validateSettings = FALSE, verbose = FALSE, workers = 1, ... )gaPermutationTest( object, x, nPermutations = 500, seed = NULL, validateSettings = FALSE, verbose = FALSE, workers = 1, ... )
object |
A |
x |
A data.frame or matrix of predictors (same as used in the original analysis). |
nPermutations |
Integer; number of permutations to perform. Default is 500. |
seed |
Integer; random seed for reproducibility of permutations. If |
validateSettings |
Logical; if |
verbose |
Logical; if |
workers |
Integer; number of parallel workers to use for permutations. Use |
... |
Currently unused. |
The permutation test works by:
Extracting the true performance metric (Q2 for gaQSAR, outer Q2 for gaQSAR_dcv).
For each permutation: scrambling y, re-running the analysis, and recording the permuted performance metric.
The empirical p-value is computed with a plus-one correction:
(1 + nGreaterEqual) / (1 + nValid), where nGreaterEqual is the number of
valid permutations with a performance metric greater than or equal to the
observed metric, and nValid is the number of valid permutation runs.
A list of class "gaQSAR_permTest" containing:
trueMetric: Numeric; the true performance metric (Q2 or outer Q2).
permutedMetrics: Numeric vector of length nPermutations with performance
metrics from permuted runs.
pValue: Numeric; empirical p-value with plus-one correction.
nPermutations: Integer; number of permutations performed.
objectType: Character; either "gaQSAR" or "gaQSAR_dcv".
seed: Integer or NULL; random seed used.
gaVariableSelection(), gaDoubleCrossValidation()
# Simple example with gaQSAR object set.seed(42) n <- 50 p <- 20 x <- matrix(rnorm(n * p), nrow = n) colnames(x) <- paste0("X", seq_len(p)) y <- 2 * x[, 2] - 1.5 * x[, 5] + rnorm(n, sd = 0.5) # NOTE: Settings below are example-only (fast runtime, not optimal). # For real QSAR work, use larger population sizes, more generations, and multiple seeds. # Good starting point are the default GA settings, or run an experimental design to # tune GA settings for your dataset. fit <- gaVariableSelection( x = x, y = y, numberOfVariables = 3, popSize = 10, maxIter = 10, seeds = 1, interval = 5, verbose = TRUE ) # For a meaningful permutation test, use a larger number of permutations (e.g., 100 or more). permTest <- gaPermutationTest( object = fit, x = x, nPermutations = 2, # small number for example; use 100+ for real tests verbose = TRUE ) print(permTest)# Simple example with gaQSAR object set.seed(42) n <- 50 p <- 20 x <- matrix(rnorm(n * p), nrow = n) colnames(x) <- paste0("X", seq_len(p)) y <- 2 * x[, 2] - 1.5 * x[, 5] + rnorm(n, sd = 0.5) # NOTE: Settings below are example-only (fast runtime, not optimal). # For real QSAR work, use larger population sizes, more generations, and multiple seeds. # Good starting point are the default GA settings, or run an experimental design to # tune GA settings for your dataset. fit <- gaVariableSelection( x = x, y = y, numberOfVariables = 3, popSize = 10, maxIter = 10, seeds = 1, interval = 5, verbose = TRUE ) # For a meaningful permutation test, use a larger number of permutations (e.g., 100 or more). permTest <- gaPermutationTest( object = fit, x = x, nPermutations = 2, # small number for example; use 100+ for real tests verbose = TRUE ) print(permTest)
Runs a genetic algorithm (GA) to select a fixed number of predictors using
leave-one-out cross-validation (LOOCV) for stable optimization.
The GA is executed for each seed in seeds for robustness, and the best run
(highest LOOCV Q2) is returned.
gaVariableSelection( x, y, numberOfVariables = 4, pMutation = 0.2, pCrossover = 0.7, crossoverFunc = "gaintegerOnePointCrossover", popSize = 100, maxIter = 1300, elitism = 3, seeds = 1:5, interval = 50, verbose = FALSE )gaVariableSelection( x, y, numberOfVariables = 4, pMutation = 0.2, pCrossover = 0.7, crossoverFunc = "gaintegerOnePointCrossover", popSize = 100, maxIter = 1300, elitism = 3, seeds = 1:5, interval = 50, verbose = FALSE )
x |
A data.frame or matrix of predictors (rows = observations, columns = candidate variables). |
y |
Numeric response vector aligned with rows of |
numberOfVariables |
Integer; number of predictors to select. |
pMutation |
Numeric; mutation probability. |
pCrossover |
Numeric; crossover probability. |
crossoverFunc |
Character or function; crossover function used by the GA. |
popSize |
Integer; GA population size. |
maxIter |
Integer; maximum number of GA iterations. |
elitism |
Integer; number of elite individuals carried over. |
seeds |
Integer vector; RNG seeds to repeat GA runs for robustness. |
interval |
Integer; iteration interval for progress monitoring when |
verbose |
Logical; if |
Fitness function: Uses LOOCV-based Q2 (via singleCV) as the GA fitness metric.
LOOCV provides a deterministic, non-random evaluation landscape that enables stable
variable selection, preventing fold-splitting variability from interfering with
optimization. The nested CV structure (outer: validation, inner: LOOCV fitness)
separates variable selection from performance assessment.
Robustness: For each seed, the GA runs with identical parameters. The run with the highest LOOCV Q2 is selected and returned. The best seed is tracked for reproducibility in permutation tests.
An object of class "gaQSAR" containing the best GA run for the
specified model size. The structure typically includes: numVar,
importantPredictors, yTrain, model, R2Train, R2AdjTrain, Q2Loocv
(LOOCV-based Q2 used by GA for fitness), VIF (variance inflation factors for
each selected predictor), bestFitnessPerGeneration (best GA fitness value at
each generation), bestSeed (the seed that produced the optimal solution), and
gaSettings (list of GA parameters used, for reproducibility in permutation
tests). Use plot.gaQSAR() to visualize any attached plots.
singleCV(), createQ2Plot(), createWilliamsPlot()
# This is a toy example for the documentation: GA settings are set so it runs fast. # For real QSAR work: start with the default GA settings, or run an experimental design # to tune GA settings for your dataset. set.seed(1) # Create toy descriptor matrix (n compounds x p predictors) n <- 40 p <- 20 x <- matrix(rnorm(n * p), nrow = n) # Add names to mimic typical QSAR data structures (molecules + descriptor names) rownames(x) <- paste0("mol_", seq_len(n)) colnames(x) <- paste0("X", seq_len(p)) # Create a synthetic response with a known signal in predictors 2 and 7 y <- 1.5 * x[, 2] - 0.8 * x[, 7] + rnorm(n, sd = 0.5) # Run GA variable selection # NOTE: Settings below are example-only (fast runtime, not optimal). # For real QSAR work, use larger population sizes, more generations, and multiple seeds. # Good starting point are the default GA settings, or run an experimental design to # tune GA settings for your dataset. fit <- gaVariableSelection( x = x, y = y, numberOfVariables = 2, # target model size (number of selected predictors) popSize = 20, # small population for speed (increase in real runs) maxIter = 20, # few generations for speed (increase in real runs) seeds = 1, # single seed for a minimal example (use multiple seeds) interval = 5 # print progress every 5 generations when verbose=TRUE ) # Methods for gaQSAR objects print(fit) summary(fit) plot(fit, type = "predObs") # observed vs predicted plot# This is a toy example for the documentation: GA settings are set so it runs fast. # For real QSAR work: start with the default GA settings, or run an experimental design # to tune GA settings for your dataset. set.seed(1) # Create toy descriptor matrix (n compounds x p predictors) n <- 40 p <- 20 x <- matrix(rnorm(n * p), nrow = n) # Add names to mimic typical QSAR data structures (molecules + descriptor names) rownames(x) <- paste0("mol_", seq_len(n)) colnames(x) <- paste0("X", seq_len(p)) # Create a synthetic response with a known signal in predictors 2 and 7 y <- 1.5 * x[, 2] - 0.8 * x[, 7] + rnorm(n, sd = 0.5) # Run GA variable selection # NOTE: Settings below are example-only (fast runtime, not optimal). # For real QSAR work, use larger population sizes, more generations, and multiple seeds. # Good starting point are the default GA settings, or run an experimental design to # tune GA settings for your dataset. fit <- gaVariableSelection( x = x, y = y, numberOfVariables = 2, # target model size (number of selected predictors) popSize = 20, # small population for speed (increase in real runs) maxIter = 20, # few generations for speed (increase in real runs) seeds = 1, # single seed for a minimal example (use multiple seeds) interval = 5 # print progress every 5 generations when verbose=TRUE ) # Methods for gaQSAR objects print(fit) summary(fit) plot(fit, type = "predObs") # observed vs predicted plot
Plot diagnostics for a gaQSAR result.
## S3 method for class 'gaQSAR' plot(x, type = "all", ...)## S3 method for class 'gaQSAR' plot(x, type = "all", ...)
x |
An object of class "gaQSAR" (returned from |
type |
Character scalar specifying which plot to create or print.
Supported values are |
... |
Additional arguments (currently unused). |
type = "all" prints available plots separately, one after another.
The fitness plot is always created with createBestFitnessPlot(). The
observed-versus-predicted plot is created from x$yTrain and, when available,
x$yExt. The Williams plot use stored plot objects created by createWilliamsPlot().
Invisibly returns the created or printed ggplot2 object for a
single plot type, or a named list of plot objects for type = "all".
gaVariableSelection(), createWilliamsPlot()
Generate diagnostic plots for nested cross-validation results using ggplot2.
## S3 method for class 'gaQSAR_dcv' plot(x, type = "outerPredObs", ...)## S3 method for class 'gaQSAR_dcv' plot(x, type = "outerPredObs", ...)
x |
An object of class "gaQSAR_dcv" (returned from |
type |
Character; plot type. Options:
|
... |
Additional arguments (currently unused). |
A ggplot2 object (or invisible NULL for type = "all").
print.gaQSAR_dcv(), summary.gaQSAR_dcv(), gaDoubleCrossValidation()
Creates a histogram of permuted metrics with a vertical line indicating the true metric value.
## S3 method for class 'gaQSAR_permTest' plot(x, bins = 30, title = NULL, ...)## S3 method for class 'gaQSAR_permTest' plot(x, bins = 30, title = NULL, ...)
x |
A |
bins |
Integer; number of histogram bins. Default is 30. |
title |
Optional character plot title. |
... |
Additional arguments (unused). |
A ggplot2 object.
Compute external predictions for test objects using stored coefficient
vectors (per selected model) and derive the external Q2 metric. For each
element in output, predictor columns in x are matched by name to the
coefficient names (excluding the intercept), an intercept term is prepended,
and predictions are obtained via a matrix product. The resulting external
Q2 and per-object predictions/residuals are attached to each element.
predictOOBObjects(output, x, y, verbose = FALSE)predictOOBObjects(output, x, y, verbose = FALSE)
output |
A |
x |
A data.frame or matrix with predictors in columns and observations in
rows. Column names must cover the predictor names used in |
y |
Numeric vector of observed response values aligned with the rows of |
verbose |
Logical; if |
Predictor order does not matter: columns are matched by name to the coefficient vector (excluding the intercept). An intercept column of ones is automatically added before prediction.
Returns an updated gaQSAR object (if a single object was supplied)
or an updated list of gaQSAR objects, with each element augmented by:
Q2Ext: numeric external Q2 computed by Q2(y, yHat).
yExt: data.frame with columns y, yHat, and residual for test data.
Print comprehensive information about a gaQSAR object returned from
gaVariableSelection(). Displays model configuration, performance metrics,
selected predictors, model coefficients, and data summaries.
## S3 method for class 'gaQSAR' print(x, ...)## S3 method for class 'gaQSAR' print(x, ...)
x |
An object of class "gaQSAR" or a list of such objects (returned
from |
... |
Additional arguments (currently unused). |
Prints detailed information including:
Number of selected predictors and their indices/names
Training R2, adjusted R2, and LOOCV Q2 metrics
External validation Q2 (if available)
OLS model coefficients (intercept and slopes)
Variance Inflation Factors (VIF) for each selected predictor
Summary statistics for training set residuals
Summary statistics for external validation residuals (if available)
Information about attached plots
Invisibly returns the object.
summary.gaQSAR(), gaVariableSelection(), plot.gaQSAR()
Display a brief overview of nested cross-validation results.
## S3 method for class 'gaQSAR_dcv' print(x, ...)## S3 method for class 'gaQSAR_dcv' print(x, ...)
x |
An object of class "gaQSAR_dcv" (returned from |
... |
Additional arguments (currently unused). |
Invisibly returns the object.
summary.gaQSAR_dcv(), gaDoubleCrossValidation()
Print method for gaQSAR_permTest objects
## S3 method for class 'gaQSAR_permTest' print(x, ...)## S3 method for class 'gaQSAR_permTest' print(x, ...)
x |
A |
... |
Additional arguments (unused). |
Invisibly returns the input object.
Compute the Q2 statistic from observed and predicted values:
This function does not perform cross-validation; it only evaluates the formula for supplied predictions (e.g., from LOOCV or an external test set).
Q2(y, yhat)Q2(y, yhat)
y |
Numeric vector of observed response values. |
yhat |
Numeric vector of predicted response values, aligned with |
The vectors y and yhat must have the same length. The denominator
is the total sum of squares of y; if y is constant this denominator is 0,
making Q2 undefined (the result will be NaN or Inf). Missing values are
not handled specially; if present, the result may be NA.
A numeric scalar Q2 value. Values can be negative when predictive performance is poor and approach 1 for perfect predictions.
singleCV(), createQ2Plot(), predictOOBObjects()
y_obs <- c(2.1, 4.0, 5.9, 8.1, 10.2) y_pred <- c(2.2, 3.8, 6.1, 7.9, 10.3) Q2(y_obs, y_pred)y_obs <- c(2.1, 4.0, 5.9, 8.1, 10.2) y_pred <- c(2.2, 3.8, 6.1, 7.9, 10.3) Q2(y_obs, y_pred)
Create a monitor callback for GA::ga() that prints concise progress lines
at fixed intervals: generation number, best fitness value, number of selected
variables, and their indices.
QSARMonitorFactory(interval = 50)QSARMonitorFactory(interval = 50)
interval |
Integer; print progress every |
The returned function closes over interval and is intended for the
monitor argument of GA::ga(). It inspects the current GA object to obtain
the best individual and reports its fitness and selected variable indices.
Output is written to standard output via cat() and nothing is returned.
A function suitable for the monitor argument of GA::ga().
GA::ga
Compute a leave-one-out cross-validated Q2 for a candidate set of predictors. Designed for use as a fitness function inside a genetic algorithm. Duplicate predictors are penalized with a large negative value, and candidate sets with variance inflation factor (VIF) above a threshold are rejected (fitness 0).
singleCV(predictors, x, y, vifThreshold = 5)singleCV(predictors, x, y, vifThreshold = 5)
predictors |
Integer vector of 1-based predictor indices referring to
columns of |
x |
Matrix or data.frame of predictors (rows = observations, columns = candidate variables). |
y |
Numeric response vector aligned with rows of |
vifThreshold |
Numeric; maximum allowed VIF (default 5). |
An intercept is added internally. VIF values are computed by
regressing each selected predictor on the remaining selected predictors; if
any VIF exceeds vifThreshold, the fitness is 0. LOOCV is performed via
explicit refits for each left-out observation.
A numeric scalar: Q2 if valid; 0 if the VIF constraint fails; or -100 if duplicate predictors are present.
Split observations into training and test sets using either a Kennard-Stone split ("KS") based on sample representativeness or a simple random split. The function returns row indices for both sets.
splitUp( data, method = c("KS", "random"), trainPercentage = 0.8, pc = 0.95, verbose = FALSE )splitUp( data, method = c("KS", "random"), trainPercentage = 0.8, pc = 0.95, verbose = FALSE )
data |
Data frame or matrix with observations in rows (predictors and/or response may be present; only row indices are returned). |
method |
Character; one of |
trainPercentage |
Numeric in (0, 1); fraction of observations assigned to the training set. |
pc |
Numeric; proportion of variance to retain in the PCA step used by
|
verbose |
Logical; if |
For method = "KS", the function requires the prospectr package
and delegates to prospectr::kenStone() with k = round(trainPercentage * n)
on a scaled version of data. For method = "random", rows are permuted and
the first k indices are assigned to training.
A list with integer vectors model (training indices) and test (test indices).
set.seed(123) toy <- data.frame(x = rnorm(40), y = rnorm(40)) split <- splitUp(toy, method = "random", trainPercentage = 0.75) str(split)set.seed(123) toy <- data.frame(x = rnorm(40), y = rnorm(40)) split <- splitUp(toy, method = "random", trainPercentage = 0.75) str(split)
Produce a concise summary of a gaQSAR object returned from
gaVariableSelection(). Displays key model configuration and performance
metrics.
## S3 method for class 'gaQSAR' summary(object, ...)## S3 method for class 'gaQSAR' summary(object, ...)
object |
An object of class "gaQSAR" (returned from |
... |
Additional arguments (currently unused). |
Prints a summary including:
Number of selected predictors
Selected predictor indices and names (if available)
Training R2, adjusted R2, and LOOCV Q2
External validation Q2 (if available)
Model coefficients
Variance Inflation Factors (VIF) for each selected predictor
Residual statistics for training and validation sets
An object of class "summary.gaQSAR" (invisibly).
print.gaQSAR(), gaVariableSelection(), plot.gaQSAR()
Produce a detailed summary of nested cross-validation results, including per-fold summaries and stability metrics.
## S3 method for class 'gaQSAR_dcv' summary(object, ...)## S3 method for class 'gaQSAR_dcv' summary(object, ...)
object |
An object of class "gaQSAR_dcv" (returned from
|
... |
Additional arguments (currently unused). |
An object of class "summary.gaQSAR_dcv" (invisibly).
print.gaQSAR_dcv(), gaDoubleCrossValidation()
Summary method for gaQSAR_permTest objects
## S3 method for class 'gaQSAR_permTest' summary(object, ...)## S3 method for class 'gaQSAR_permTest' summary(object, ...)
object |
A |
... |
Additional arguments (unused). |
Invisibly returns the input object.