Train/test QSAR workflow with gaQSAR

This vignette shows the train/test workflow used in gaQSAR. The workflow is useful when a separate test set is available, or when the user wants a quick first analysis before moving to double cross-validation.

The main idea is simple. First split the compounds into a training set and a test set. Then run the genetic algorithm for several model sizes. Each run gives one gaQSAR object. These objects are collected in a list, so that the models can be compared by their cross-validated Q2 and their performance on the external test set.

This is the workflow:

prepare the QSAR data;
split the data into training and test compounds;
run gaVariableSelection() for several numbers of predictors;
predict the test compounds with predictOOBObjects();
compare the models with createQ2Plot();
inspect the applicability domain with createWilliamsPlot();
select a final model and, if needed, run a permutation test.

The examples below use the AquaticTox data from the QSARdata package. For a short test run, set smokeTest <- TRUE.

Load packages and helper function

library(gaQSAR)
library(QSARdata)

timestampFileName <- function(fileName, time = Sys.time()) {
  paste0(format(time, "%Y%m%d_%H%M%S_"), fileName)
}

The timestamp helper is not needed by gaQSAR itself. It is only used here to save result files without overwriting earlier runs.

Prepare the data

data(AquaticTox)

qsarLabel <- "AquaticTox"

qsarData <- cbind(
  activity = AquaticTox_Outcome$Activity,
  AquaticTox_moe2D[, -1],
  AquaticTox_moe3D[, -1]
)

missingColumns <- which(colSums(is.na(qsarData)) != 0)

if (length(missingColumns) > 0) {
  qsarData <- qsarData[, -missingColumns]
}

The first column is the response variable. The remaining columns are molecular descriptors. In this example, descriptors with missing values are removed before modelling. For other data sets, a different preprocessing step may be more appropriate.

Choose settings

smokeTest <- FALSE
permutationTest <- TRUE
nWorkers <- 1L

numVars <- 2:7
theSeeds <- 1:5

gaSettings <- list(
  pMutation = 0.2,
  pCrossover = 0.7,
  popSize = 100,
  maxIter = 300,
  interval = 50,
  elitism = 3,
  crossoverType = "gaintegerOnePointCrossover",
  KSpercentage = 0.95
)

if (smokeTest) {
  gaSettings$popSize <- 25
  gaSettings$maxIter <- 10
  gaSettings$elitism <- 2
  gaSettings$interval <- 5
  numVars <- c(3, 5, 7)
  theSeeds <- 1:2
}

The vector numVars defines the model sizes to compare. For example, numVars <- 2:7 fits models with 2, 3, 4, 5, 6 and 7 descriptors.

The seeds argument is used because the genetic algorithm is stochastic. Running several seeds gives a more stable result than relying on a single GA run.

Split the data

splitupMolecules <- splitUp(qsarData, method = "KS", pc = gaSettings$KSpercentage)

xTrain <- as.matrix(qsarData[splitupMolecules$model, -1, drop = FALSE])
yTrain <- qsarData[splitupMolecules$model, 1]

xTest <- as.matrix(qsarData[splitupMolecules$test, -1, drop = FALSE])
yTest <- qsarData[splitupMolecules$test, 1]

The Kennard-Stone split is used here to select a training set that covers the descriptor space. With KSpercentage = 0.95, the external test set is small. That can be useful for a first analysis on small data sets, but it should not be oversold as a strong external validation.

Run the GA for several model sizes

baseArgs <- list(
  x = xTrain,
  y = yTrain,
  popSize = gaSettings$popSize,
  pMutation = gaSettings$pMutation,
  pCrossover = gaSettings$pCrossover,
  crossoverFunc = gaSettings$crossoverType,
  elitism = gaSettings$elitism,
  maxIter = gaSettings$maxIter,
  interval = gaSettings$interval,
  seeds = theSeeds,
  verbose = TRUE
)

fitOneModelSize <- function(numberOfVariables) {
  args <- baseArgs
  args$numberOfVariables <- numberOfVariables
  do.call(gaQSAR::gaVariableSelection, args)
}

output <- lapply(numVars, fitOneModelSize)
names(output) <- paste0("p", numVars)

Each element of output is a gaQSAR object. The list structure is important, because the plotting and comparison functions work on a set of candidate models.

Optional parallel execution

The same loop can be run in parallel with future.apply. This is useful for real analyses, but it is not shown as the default in this vignette because parallel code is less convenient for package checks and examples.

library(future)
library(future.apply)

useFuture <- nWorkers > 1L

if (useFuture) {
  oldPlan <- future::plan()
  on.exit(future::plan(oldPlan), add = TRUE)
  future::plan(future::multisession, workers = nWorkers)

  output <- future.apply::future_lapply(
    X = numVars,
    FUN = fitOneModelSize,
    future.seed = TRUE,
    future.packages = "gaQSAR"
  )
  names(output) <- paste0("p", numVars)
}

Predict the test set

output <- predictOOBObjects(output, xTest, yTest)

This adds external predictions and external validation statistics to each fitted model. The test set was not used during variable selection.

Compare Q2 values

myQ2Plot <- createQ2Plot(output, label = qsarLabel)
print(myQ2Plot)

The Q2 plot is used to compare model sizes. The training-set Q2 is based on LOOCV. If test set predictions were added, the external Q2 is shown as well.

For small data sets, small differences in Q2 should not be treated as decisive. A smaller model with similar Q2 may be easier to interpret and less fragile.

Williams plot, fitness plot and Observed versus Predicted plot

output <- createWilliamsPlot(
  output,
  xTrain,
  yTrain,
  xTest,
  yTest,
  residualThreshold = 2.5
)

print(plot(output[[2]]))

The Williams plot combines standardized residuals and leverages. It is mainly used to inspect the applicability domain and to find objects that behave differently from the rest of the data.

Objects with high leverage are not automatically wrong. They may be important compounds near the edge of the descriptor space. The plot should therefore be used as a diagnostic, not as a mechanical deletion rule.

Select one model

Q2Values <- vapply(output, function(object) object$Q2Loocv, numeric(1))
bestIdx <- which.max(Q2Values)
bestObj <- output[[bestIdx]]

cat(sprintf("Selected model size: %d predictors\n", numVars[bestIdx]))
cat(sprintf("LOOCV Q2: %.4f\n", bestObj$Q2Loocv))

summary(bestObj)

This selects the model with the highest LOOCV Q2. That is a practical rule, but not the only possible rule. When several model sizes perform similarly, the smaller model is often a reasonable choice.

Permutation test

if (permutationTest) {
  nPermutations <- if (smokeTest) 20 else 100

  permutationResult <- gaPermutationTest(
    bestObj,
    x = xTrain,
    y = yTrain,
    nPermutations = nPermutations,
    seed = 1,
    validateSettings = TRUE,
    verbose = TRUE,
    workers = nWorkers
  )

  print(plot(permutationResult))
  summary(permutationResult)
}

The permutation test scrambles the response and repeats the modelling procedure under the null situation where the descriptor matrix and the response have no real relation.

In this workflow, the permutation test is applied to the selected model size. It does not fully repeat the whole earlier choice over all values of numVars. That distinction matters if the permutation result is used as formal evidence.

Save results

save(output, file = timestampFileName(paste0(qsarLabel, "Results.Rdata")))

if (exists("permutationResult")) {
  save(
    permutationResult,
    file = timestampFileName(paste0(qsarLabel, "PermutationResults.Rdata"))
  )
}

Summary

The train/test workflow is useful for a direct QSAR analysis with an external test set. The result is a list of fitted gaQSAR objects, one for each model size. This list is then used for prediction, Q2 plotting and Williams plots.

For a less biased estimate of predictive performance, especially when model size is also chosen from the data, use the double cross-validation workflow shown in the second vignette.