Title: | Machine Learning Immunogenicity and Vaccine Response Analysis |
---|---|
Description: | Used for analyzing immune responses and predicting vaccine efficacy using machine learning and advanced data processing techniques. 'Immunaut' integrates both unsupervised and supervised learning methods, managing outliers and capturing immune response variability. It performs multiple rounds of predictive model testing to identify robust immunogenicity signatures that can predict vaccine responsiveness. The platform is designed to handle high-dimensional immune data, enabling researchers to uncover immune predictors and refine personalized vaccination strategies across diverse populations. |
Authors: | Ivan Tomic [aut, cre, cph] , Adriana Tomic [aut, ctb, cph, fnd] , Stephanie Hao [aut] |
Maintainer: | Ivan Tomic <[email protected]> |
License: | GPL-3 |
Version: | 1.0.1 |
Built: | 2024-10-26 03:36:29 UTC |
Source: | CRAN |
This function automates the process of building machine learning models using the caret package. It supports both binary and multi-class classification and allows users to specify a list of machine learning algorithms to be trained on the dataset. The function splits the dataset into training and testing sets, applies preprocessing steps, and trains models using cross-validation. It computes performance metrics including a confusion matrix, AUROC (macro-averaged for multi-class problems), and prAUC (binary classification only).
auto_simon_ml(dataset_ml, settings)
dataset_ml | A data frame containing the dataset for training. All columns except the outcome column should contain the features. |
settings | A list containing the following parameters: |
The function performs preprocessing (e.g., centering, scaling, and imputation of missing values) on the dataset based on the provided settings. It splits the data into training and testing sets using the specified partition, trains models using cross-validation, and computes performance metrics.
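The split, preprocess, train, and evaluate sequence described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the toy data comes from `iris`, the 75/25 split ratio is arbitrary, a plain logistic regression stands in for the caret-trained models, and cross-validation and imputation are omitted for brevity.

```r
set.seed(1337)

# Toy binary dataset built from iris (two species only), outcome in column 1.
toy <- droplevels(iris[iris$Species != "setosa", ])
toy <- data.frame(outcome = toy$Species, toy[, 1:4])

# Stratified 75/25 split: sample 75% of the rows within each outcome level.
train_idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$outcome), function(rows) {
  sample(rows, size = floor(0.75 * length(rows)))
}))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Learn centering/scaling parameters on the training set only, then apply
# the same parameters to both sets so no test information leaks into training.
feature_cols <- setdiff(names(toy), "outcome")
mu    <- colMeans(train[, feature_cols])
sigma <- apply(train[, feature_cols], 2, sd)
train[, feature_cols] <- scale(train[, feature_cols], center = mu, scale = sigma)
test[, feature_cols]  <- scale(test[, feature_cols],  center = mu, scale = sigma)

# Logistic regression stands in for the caret-trained models (an assumption).
fit  <- glm(outcome ~ ., data = train, family = binomial)
prob <- predict(fit, newdata = test, type = "response")
pred <- factor(ifelse(prob > 0.5, levels(toy$outcome)[2], levels(toy$outcome)[1]),
               levels = levels(toy$outcome))

# Confusion matrix on the held-out test set.
conf_mat <- table(predicted = pred, actual = test$outcome)
print(conf_mat)
```

Estimating the preprocessing parameters on the training partition alone mirrors the usual practice of keeping the test set untouched until evaluation.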
For binary classification problems, the function calculates AUROC and prAUC. For multi-class classification, it calculates a macro-averaged AUROC and does not compute prAUC.
The function returns a list of trained models along with their performance metrics, including confusion matrix, variable importance, and post-resample metrics.
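For the multi-class case mentioned above, one common way to obtain a macro-averaged AUROC is to compute a one-vs-rest AUROC per class and take the unweighted mean. The sketch below assumes that definition; the package may use a different averaging scheme. The helper names `binary_auroc` and `macro_auroc` and the toy probabilities are made up for illustration.

```r
# Rank-based (Mann-Whitney) AUROC for one binary problem.
# `scores`: numeric scores; `is_pos`: logical vector marking the positive class.
binary_auroc <- function(scores, is_pos) {
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(scores)  # average ranks, so tied scores count as half a win
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# One-vs-rest macro average: one AUROC per class, then an unweighted mean.
# `probs`: matrix of class probabilities with one named column per class.
macro_auroc <- function(probs, truth) {
  per_class <- sapply(colnames(probs), function(cls) {
    binary_auroc(probs[, cls], truth == cls)
  })
  mean(per_class)
}

# Toy data (made up): 6 samples, 3 classes.
truth <- c("a", "a", "b", "b", "c", "c")
probs <- rbind(
  c(0.7, 0.2, 0.1),
  c(0.3, 0.5, 0.2),
  c(0.2, 0.6, 0.2),
  c(0.4, 0.4, 0.2),
  c(0.1, 0.2, 0.7),
  c(0.2, 0.3, 0.5)
)
colnames(probs) <- c("a", "b", "c")

auc_macro <- macro_auroc(probs, truth)
print(auc_macro)
```

Because every class contributes equally to the mean, a rare class influences the macro average as much as a common one.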
A list where each element corresponds to a trained model for one of the algorithms specified in settings$selectedPackages. Each element contains:
info: General information about the model, including resampling indices, problem type, and outcome mapping.
training: The trained model object and variable importance.
predictions: Predictions on the test set, including probabilities, confusion matrix, post-resample statistics, AUROC (for binary classification), and prAUC (for binary classification).
## Not run:
dataset <- read.csv("fc_wo_noise.csv", header = TRUE, row.names = 1)

# Generate a file header for the dataset to use in downstream analysis
file_header <- generate_file_header(dataset)

settings <- list(
  fileHeader = file_header,
  # Columns selected for analysis
  selectedColumns = c("ExampleColumn1", "ExampleColumn2"),
  clusterType = "Louvain",
  removeNA = TRUE,
  preProcessDataset = c("scale", "center", "medianImpute", "corr", "zv", "nzv"),
  target_clusters_range = c(3, 4),
  resolution_increments = c(0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5),
  min_modularities = c(0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9),
  pickBestClusterMethod = "Modularity",
  seed = 1337
)

result <- immunaut(dataset, settings)

dataset_ml <- result$dataset$original
# Note: `tsne_clust` and `i` are not defined in this excerpt; they must be
# supplied by the surrounding code before this line is run.
dataset_ml$pandora_cluster <- tsne_clust[[i]]$info.norm$pandora_cluster
dataset_ml <- dplyr::rename(dataset_ml, immunaut = pandora_cluster)
# Move the outcome column to the front
dataset_ml <- dataset_ml[, c("immunaut", setdiff(names(dataset_ml), "immunaut"))]

settings_ml <- list(
  excludedColumns = c("ExampleColumn0"),
  preProcessDataset = c("scale", "center", "medianImpute", "corr", "zv", "nzv"),
  # Use the current partition split. Note: `split` is not defined in this
  # excerpt and must be set by the caller.
  selectedPartitionSplit = split,
  selectedPackages = c("rf", "RRF", "RRFglobal", "rpart2", "c5.0", "sparseLDA",
                       "gcvEarth", "cforest", "gaussPRPoly", "monmlp", "slda", "spls"),
  trainingTimeout = 180  # Timeout 3 minutes
)

ml_results <- auto_simon_ml(dataset_ml, settings_ml)
## End(Not run)
This function generates a demo dataset with a specified number of subjects, features, and desired number of clusters, ensuring that the generated clusters are not too far apart and have some degree of overlap to simulate real-world data. The generated dataset includes demographic information (outcome, age, and gender), as well as numeric features with a specified probability of missing values.
generate_demo_data(
  n_subjects = 1000,
  n_features = 200,
  missing_prob = 0.1,
  desired_number_clusters = 3,
  cluster_overlap_sd = 15
)
n_subjects | Integer. The number of subjects (rows) to generate. Defaults to 1000. |
n_features | Integer. The number of features (columns) to generate. Defaults to 200. |
missing_prob | Numeric. The probability of introducing missing values (NA) in the feature columns. Defaults to 0.1. |
desired_number_clusters | Integer. The approximate number of clusters to generate in the feature space. Defaults to 3. |
cluster_overlap_sd | Numeric. The standard deviation to control cluster overlap. Defaults to 15 for more overlap. |
The function generates n_features numeric columns based on Gaussian clusters with some overlap between clusters to simulate more realistic data. Missing values are introduced in each feature column based on missing_prob.
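The idea of overlapping Gaussian clusters with randomly missing cells can be sketched in base R as follows. This is a hypothetical illustration, not the code behind generate_demo_data(): the 0 to 100 range for cluster centres, the `feature_` column names, and the smaller sizes are all assumptions chosen to keep the example short.

```r
set.seed(1337)

n_subjects <- 100
n_features <- 5
n_clusters <- 3
cluster_overlap_sd <- 15
missing_prob <- 0.1

# Assign each subject to one cluster.
cluster_id <- sample(seq_len(n_clusters), n_subjects, replace = TRUE)

# One centre per cluster and feature. The 0-100 range is an assumption; with a
# standard deviation of 15 neighbouring clusters overlap noticeably.
centres <- matrix(runif(n_clusters * n_features, min = 0, max = 100),
                  nrow = n_clusters, ncol = n_features)

# Feature value = centre of the subject's cluster + Gaussian noise.
features <- centres[cluster_id, , drop = FALSE] +
  matrix(rnorm(n_subjects * n_features, sd = cluster_overlap_sd),
         nrow = n_subjects, ncol = n_features)
colnames(features) <- paste0("feature_", seq_len(n_features))

# Set each cell to NA independently with probability `missing_prob`.
is_missing <- matrix(runif(n_subjects * n_features) < missing_prob,
                     nrow = n_subjects, ncol = n_features)
features[is_missing] <- NA

# Add the demographic columns described in the documentation.
demo_sketch <- data.frame(
  outcome = sample(c("low", "high"), n_subjects, replace = TRUE),
  age = sample(18:90, n_subjects, replace = TRUE),
  gender = sample(c("male", "female"), n_subjects, replace = TRUE),
  features
)
head(demo_sketch)
```

A larger `cluster_overlap_sd` relative to the spacing of the centres blurs the cluster boundaries, which is what makes the data harder to separate.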
A data frame containing the generated demo dataset, with columns:
outcome: A categorical variable with values "low" or "high".
age: A numeric variable representing the age of the subject (range 18-90).
gender: A categorical variable with values "male" or "female".
Feature X: Numeric feature columns with random values and some missing data.
# Generate a demo dataset with 1000 subjects, 200 features, and 3 clusters
demo_data <- generate_demo_data(
  n_subjects = 1000,
  n_features = 200,
  desired_number_clusters = 3,
  cluster_overlap_sd = 15,
  missing_prob = 0.1
)

# View the first few rows of the dataset
head(demo_data)
This function generates a fileHeader object from a given data frame which includes original names and remapped names of the data frame columns.
generate_file_header(dataset)
dataset | The input data frame. |
A data frame containing original and remapped column names.
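To make the idea of a name mapping concrete, here is a minimal sketch of a lookup table that pairs original column names with generated safe names. The helper `sketch_file_header`, the output column names `original` and `remapped`, and the `column<N>` naming scheme are all assumptions for illustration; the structure actually returned by generate_file_header() may differ.

```r
# Hypothetical helper, not the package function: pair each original column
# name with a generated, syntactically safe name.
sketch_file_header <- function(dataset) {
  original <- names(dataset)
  data.frame(
    original = original,
    remapped = paste0("column", seq_along(original)),
    stringsAsFactors = FALSE
  )
}

# Column names with spaces and symbols are awkward in model formulas.
df <- data.frame(
  `CD4+ T cells` = c(1.2, 3.4),
  `IgG titer` = c(10, 20),
  check.names = FALSE
)

header <- sketch_file_header(df)
print(header)
```

Keeping the original names alongside the remapped ones lets results be reported with the labels the user supplied.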
This function performs clustering and dimensionality reduction analysis on a dataset using user-defined settings. It handles various preprocessing steps, dimensionality reduction via t-SNE, multiple clustering methods, and generates associated plots based on user-defined or default settings.
immunaut(dataset, settings = list())
dataset | A data frame representing the dataset on which the analysis will be performed. The dataset must contain numeric columns for dimensionality reduction and clustering. |
settings | A named list containing settings for the analysis. If NULL, defaults will be used. The settings list may contain: |
A list containing the following:
tsne_calc: The t-SNE results object.
tsne_clust: The clustering results.
dataset: A list containing the original dataset, the preprocessed dataset, and a dataset with machine learning-ready data.
clusters: The final cluster assignments.
settings: The list of settings used for the analysis.
data <- matrix(runif(2000), ncol = 20)
settings <- list(
  clusterType = "Louvain",
  resolution_increments = c(0.05, 0.1),
  min_modularities = c(0.3, 0.5)
)
result <- immunaut(data.frame(data), settings)
print(result$clusters)
Demo dataset from the immunaut package, used in the package examples. It consists of a 4x4 feature matrix plus additional dummy columns that can be used for testing.
data(immunautDemo)
An object of class data.frame with 4 rows and 7 columns.
## Not run:
data(immunautDemo)

## define settings variable
settings <- list()
settings$fileHeader <- generate_file_header(immunautDemo)
# ... and other settings

results <- immunaut(immunautDemo, settings)
## End(Not run)
This function generates a t-SNE plot with cluster assignments using consistent color mappings. It includes options for plotting points based on their t-SNE coordinates and adding cluster labels at the cluster centroids. The plot is saved as an SVG file in a temporary directory.
plot_clustered_tsne(info.norm, cluster_data, settings)
info.norm | A data frame containing t-SNE coordinates ( |
cluster_data | A data frame containing the cluster centroids and labels, with columns |
settings | A list of settings for the plot, including: |
A ggplot2 object representing the clustered t-SNE plot.
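The cluster centroids at which labels are placed can be derived from the t-SNE coordinates. The sketch below assumes the centroid is the mean coordinate of each cluster, which the documentation does not state. The coordinate column names `tsne1` and `tsne2` are hypothetical; `pandora_cluster` is the cluster column name used in the auto_simon_ml() example.

```r
# Hypothetical t-SNE output: two coordinate columns plus a cluster label.
info_sketch <- data.frame(
  tsne1 = c(0, 2, 10, 12),
  tsne2 = c(0, 2, 10, 14),
  pandora_cluster = c(1, 1, 2, 2)
)

# Centroid = mean coordinate of the points in each cluster (an assumption).
centroids <- aggregate(cbind(tsne1, tsne2) ~ pandora_cluster,
                       data = info_sketch, FUN = mean)
print(centroids)
```

A median-based centre is a reasonable alternative when clusters contain outlying points that would pull a mean away from the dense region.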
## Not run:
# Example usage
plot <- plot_clustered_tsne(info.norm, cluster_data, settings)
print(plot)
## End(Not run)