Title: | RNA-Seq Profile Classifier |
---|---|
Description: | We developed a lightweight machine learning tool for RNA profiling of acute lymphoblastic leukemia (ALL); however, it can be used for any problem where multiple classes need to be identified from multi-dimensional data. The methodology is described in Makinen V-P, Rehn J, Breen J, Yeung D, White DL (2022) Multi-cohort transcriptomic subtyping of B-cell acute lymphoblastic leukemia, International Journal of Molecular Sciences 23:4574, <doi:10.3390/ijms23094574>. The classifier contains optimized mean profiles of the classes (centroids) as observed in the training data, and new samples are matched to these centroids using the shortest Euclidean distance. Centroids derived from a dataset of 1,598 ALL patients are included, but users can train the models with their own data as well. The output includes both numerical and visual presentations of the classification results. Samples with mixed features from multiple classes or atypical values are also identified. |
Authors: | Ville-Petteri Makinen [aut, cre] |
Maintainer: | Ville-Petteri Makinen <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.0.7 |
Built: | 2024-12-18 06:41:26 UTC |
Source: | CRAN |
Trains a new classification model.
assemble(obj) <- value
obj |
An object of the class Asset. |
value |
A list that contains training data, see details. |
The value argument must contain three named elements: title, dat and bits. Optional predictors and covariates elements can also be included.
The title is a descriptive identifier for the asset that will be displayed by the Classifier object in report().
The dat element is a matrix that contains the training samples. The variables are organized into named rows and the samples into named columns. Non-finite values are not allowed.
The predictors element contains the names of the input variables that should be used for training the model. If empty, all inputs are used for automatic feature selection and subsequent training steps.
The covariates element contains additional information for constructing the final classification models. Unlike the data matrix, variables are organized into named columns and the samples are stored as rows. Non-finite values are not allowed.
The bits element contains labels for category memberships. Three formats are supported: 1) a character vector of named elements that contains non-empty strings, 2) a matrix or a data frame with row names and a single column of non-empty values, and 3) a matrix or a data frame with multiple columns that contain binary values where 1s indicate category membership (the name of the column is the name of the category). Overlapping categories are allowed.
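To illustrate the three accepted formats of the bits element, the sketch below builds the same labels in each form using base R only. The sample identities and category names are hypothetical, and the code does not call the package itself.

```r
# Hypothetical sample identities and category labels (not from the package).
ids <- c("s1", "s2", "s3", "s4")
labels <- c("A", "B", "A", "C")

# Format 1: named character vector of non-empty strings.
bits1 <- setNames(labels, ids)

# Format 2: data frame with row names and a single column.
bits2 <- data.frame(SUBTYPE = labels, row.names = ids,
                    stringsAsFactors = FALSE)

# Format 3: binary membership matrix, one column per category.
categ <- sort(unique(labels))
bits3 <- sapply(categ, function(k) as.integer(labels == k))
rownames(bits3) <- ids

# Overlapping categories: sample s4 also belongs to category A.
bits3["s4", "A"] <- 1L
print(bits3)
```

Only the third format can express overlapping memberships, since the first two store a single label per sample.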
The final asset is assembled in six steps. First, the training data are standardized and normalized. Second, input variables are sorted according to their univariate classification performance. Third, redundant features are excluded by testing the sorted variables for mutual correlations; this produces an optimized listing of non-redundant features that are the most predictive of the category labels. Fourth, mean centroids are calculated for each category. Fifth, training samples are matched to their nearest centroids and the distances collected as preliminary predictor scores. Lastly, logistic regression models are fitted to the preliminary scores, covariates and category labels to enable the calculation of standardized predictor scores for new data.
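The centroid steps described above can be sketched in base R as follows. This is a minimal illustration on made-up data, assuming plain z-score standardization and no feature selection or logistic regression; it is not the package's implementation.

```r
# Toy data: 3 variables (rows) x 6 samples (columns), two known groups.
set.seed(1)
dat <- cbind(matrix(rnorm(9, mean = 0), nrow = 3),
             matrix(rnorm(9, mean = 4), nrow = 3))
rownames(dat) <- paste0("var", 1:3)
colnames(dat) <- paste0("s", 1:6)
groups <- setNames(rep(c("A", "B"), each = 3), colnames(dat))

# Standardize each variable (row) to zero mean and unit variance.
z <- t(scale(t(dat)))

# Mean centroid of each category: variables x categories.
categ <- unique(groups)
centroids <- sapply(categ, function(k) {
  rowMeans(z[, groups == k, drop = FALSE])
})

# Euclidean distance from every sample to every centroid.
dist2cent <- sapply(categ, function(k) {
  sqrt(colSums((z - centroids[, k])^2))
})

# Assign each sample to its nearest centroid.
nearest <- categ[apply(dist2cent, 1, which.min)]
names(nearest) <- colnames(z)
print(data.frame(GROUP = groups, NEAREST = nearest))
```

The distances in dist2cent correspond to the preliminary predictor scores mentioned in the fifth step.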
Updates the Asset object.
# Prepare training data.
simu <- bcellALL(200)
materials <- list(title="Simutypes")
materials$dat <- simu$counts
materials$covariates <- simu$metadata[,c("MALE","AGE")]
materials$bits <- simu$metadata[,"SUBTYPE",drop=FALSE]
# Assemble classification asset.
bALL <- asset()
assemble(bALL) <- materials
# Export asset into a new folder.
tpath <- tempfile()
export(bALL, folder = tpath)
# Create a classifier.
cls <- classifier(tpath, verbose = FALSE)
# Classify new samples.
simu <- bcellALL(5)
covariates(cls) <- simu$metadata
profiles(cls) <- simu$counts
primary <- predictions(cls)[[1]]
print(primary[,c("LABEL","PROX","EXCL")])
Creates a new Asset object.
asset(folder = NULL, verbose = TRUE)
folder |
Path to a folder that contains the necessary files for a classification asset. |
verbose |
If TRUE, accessed asset items (file names) are printed on screen. |
Returns an Asset object.
# Set up an ALL subtyping asset.
folder <- system.file("subtypes", package="Allspice")
a <- asset(folder)
Simulated data of B-cell acute lymphoblastic leukemia.
bcellALL(n = 200, contamination = 0.05)
n |
Number of samples. |
contamination |
Proportion of samples with randomly shuffled values. |
Returns a list with two elements: counts contains gene RNA read counts and metadata contains age and sex information and the generating subtype label.
# Simulate B-cell ALL samples.
simu <- bcellALL(5)
print(head(simu$counts))
print(simu$metadata)
Creates a new Classifier object.
classifier(..., verbose = TRUE)
... |
Any number of paths to folders that contain assets
(see |
verbose |
If TRUE, information about the imported assets is printed on screen. |
The first input argument will set the primary asset, and the others will be considered secondary assets.
Returns a Classifier object.
# Set up an ALL classifier object.
cls <- classifier()
Assigns category labels to new data.
classify(obj, dat, covariates)
obj |
An object of the class Asset. |
dat |
A matrix that contains variables as rows and samples as columns. |
covariates |
A data.frame or matrix that contains samples as rows and covariates as columns. |
The input data will be automatically normalized and standardized using the internal asset parameters, see configuration() for details.
Returns a data frame that contains predicted category labels and performance indicators. The column 'CATEG' contains the final predictions, including "Unclassified" or "Ambiguous" for samples that could not be reliably classified. The columns 'MATCH.1st' and 'MATCH.2nd' contain the first and second best matching categories, respectively.
The column 'BIOMRK' contains a standardized biomarker score that indicates how similar the sample is to the best-matching category. The column 'PROX' gives the likelihood that the observed biomarker score would have been generated by a training sample from the best-matching category (assuming balanced group sizes). The column 'EXCL' gives the likelihood that the sample does not share characteristic features with any other category.
The returned data frame also has the attribute "biomarkers" that contains biomarker scores for all categories.
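Reading the "biomarkers" attribute can be sketched with a mock result in base R. The values below are made up, and the layout of the attribute (samples as rows, categories as columns) is an assumption for illustration only.

```r
# Mock result with the column names described above (values are made up).
res <- data.frame(CATEG = c("A", "Ambiguous"),
                  MATCH.1st = c("A", "B"),
                  MATCH.2nd = c("B", "A"),
                  row.names = c("s1", "s2"),
                  stringsAsFactors = FALSE)

# Assumed layout: samples as rows and categories as columns.
attr(res, "biomarkers") <- matrix(c(2.1, -0.4, 0.9, 1.0), nrow = 2,
    dimnames = list(c("s1", "s2"), c("A", "B")))

# Retrieve the full score matrix and look up the best-matching category.
bmark <- attr(res, "biomarkers")
best <- bmark[cbind(rownames(res), res$MATCH.1st)]
print(best)
```

Note that subsetting a data frame may drop custom attributes, so it is safer to extract the attribute before filtering rows or columns.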
# Import ALL subtyping asset.
base <- system.file(package = "Allspice")
folder <- file.path(base, "subtypes")
a <- asset(folder)
# Simulated data.
simu <- bcellALL(5)
# Predict categories.
res <- classify(a, dat = simu$counts, covariates = simu$metadata)
print(res[,c("LABEL","PROX","EXCL")])
Set internal parameters for an Asset object.
configuration(obj) <- value
configuration(obj)
obj |
An object of the class Asset. |
value |
A numeric vector with named elements. |
Element names from the input are compared with the internal list of parameters. Those that match will be updated.
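The name-matching behaviour can be pictured with a plain named vector in base R. The parameter values below are made up, and the sketch only mimics the described rule; it does not show how the package stores its parameters.

```r
# Hypothetical internal parameters (names from the text, values made up).
params <- c(norm = 1, nonzero.min = 100, nonzero.ratio = 0.01, standard = 1)

# User input with one unknown name ("foo") that should be ignored.
value <- c(nonzero.min = 0, nonzero.ratio = 0, foo = 42)

# Update only those elements whose names match an internal parameter.
shared <- intersect(names(value), names(params))
params[shared] <- value[shared]
print(params)
```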
Normalization parameters include 'norm' (if set to 0, normalization is not performed), 'nonzero.min' (the minimum data value considered larger than zero) and 'nonzero.ratio' (minimum ratio of non-zero values to include a variable in the output). See normalize() for additional details.
Standardization parameters include 'standard' (if set to 0, standardization is not performed) and 'logarithm' (if set to 0, data values are used without taking the logarithm). See standardize() for additional details.
Feature selection parameters include 'ninput.max' (the maximum number of features to be used for classification) and 'rrinput.max' (the maximum correlation r-squared to be allowed between features). See assemble() for details on selecting non-redundant inputs.
Classification parameters include 'probability.min' (minimum probability rating considered for reliable classification) and 'exclusivity.min' (minimum exclusivity for non-ambiguous classification). See classify() for additional details.
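One way to picture the two classification thresholds is the sketch below. The decision rule shown here is an assumption (low proximity leads to "Unclassified", low exclusivity leads to "Ambiguous"), and the threshold and metric values are made up; the package's actual logic may differ.

```r
# Hypothetical thresholds and per-sample quality metrics (values made up).
probability.min <- 0.5
exclusivity.min <- 0.5
res <- data.frame(MATCH.1st = c("A", "B", "C"),
                  PROX = c(0.95, 0.20, 0.90),
                  EXCL = c(0.99, 0.80, 0.30),
                  row.names = c("s1", "s2", "s3"),
                  stringsAsFactors = FALSE)

# Assumed rule: low proximity -> Unclassified, low exclusivity -> Ambiguous.
res$CATEG <- res$MATCH.1st
res$CATEG[res$EXCL < exclusivity.min] <- "Ambiguous"
res$CATEG[res$PROX < probability.min] <- "Unclassified"
print(res)
```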
Updates the Asset object.
# Change asset configuration.
a <- asset()
print(configuration(a))
configuration(a) <- c(nonzero.min=0, nonzero.ratio=0)
print(configuration(a))
Add covariate data into the classifier.
covariates(obj) <- value
obj |
An object of the class Classifier. |
value |
A numeric vector or a matrix. |
If the input is a vector, the elements must be named and these names will be used to identify variables.
If the input is a matrix, it must have named rows and named columns that will be matched with sample identities in profiles().
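Matching covariates to samples by name can be sketched in base R as shown below. The data are hypothetical and the code only illustrates the idea of name-based alignment, not the package's internal procedure.

```r
# Hypothetical profiles: variables as rows, samples as columns.
counts <- matrix(1:6, nrow = 2,
                 dimnames = list(c("g1", "g2"), c("s1", "s2", "s3")))

# Hypothetical covariates: samples as rows, stored in a different order.
covars <- matrix(c(1, 0, 1, 34, 7, 12), ncol = 2,
                 dimnames = list(c("s3", "s1", "s2"), c("MALE", "AGE")))

# Reorder covariate rows to follow the sample columns of the profiles.
pos <- match(colnames(counts), rownames(covars))
aligned <- covars[pos, , drop = FALSE]
print(aligned)
```

Because the matching relies on names, the order of rows in the covariate matrix does not need to agree with the order of columns in the profile matrix.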
Updates the Classifier object. Any previous data are discarded.
# Simulated data.
simu <- bcellALL(5)
# Predict subtypes without covariates.
cls <- classifier(verbose = FALSE)
profiles(cls) <- simu$counts
primary <- predictions(cls)[[1]]
print(primary[,c("LABEL","PROX","EXCL")])
# Predict subtypes with covariates.
cls <- classifier(verbose = FALSE)
covariates(cls) <- simu$metadata
profiles(cls) <- simu$counts
primary <- predictions(cls)[[1]]
print(primary[,c("LABEL","PROX","EXCL")])
Exports an asset into a folder.
export(obj, folder)
obj |
An object of the class Asset. |
folder |
Path to a folder that will contain the asset files. |
Returns the names of the files that were saved in the folder.
# Import ALL subtyping asset.
base <- system.file(package = "Allspice")
folder <- file.path(base, "subtypes")
a <- asset(folder)
# Export asset into a new folder.
tpath <- tempfile()
fnames <- export(a, folder = tpath)
print(dir(tpath))
Information about the assets of a classifier
information(obj)
obj |
An object of the class Classifier. |
A list with three elements: covariates is a data frame that contains the variable names that are included in the classification assets, configuration contains the analysis settings for each asset and categories contains the labels and visual attributes for the assets.
# Show the contents of the B-cell ALL classifier.
cls <- classifier(verbose=FALSE)
info <- information(cls)
print(info$covariates)
print(info$configuration)
print(head(info$categories))
print(tail(info$categories))
Conversion table between variable naming schemes.
nomenclature(obj) <- value
obj |
An object of the class Asset. |
value |
A data frame, see details. |
The data frame should contain character columns of variable names, with each column representing a specific naming scheme from which the variables are translated into the row names of the data frame.
For example, to convert gene symbols into Ensembl codes, a data frame with the Ensembl codes as row names and one column of gene symbols is needed.
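The translation rule, where columns hold the source naming schemes and row names hold the target names, can be sketched in base R. The identifiers below are hypothetical placeholders and the code does not reproduce the package's internal lookup.

```r
# Hypothetical conversion table: row names are the target (internal) names,
# the column holds an alternative naming scheme.
info <- data.frame(SYMBOL = c("GENE1", "GENE2", "GENE3"),
                   row.names = c("ID0001", "ID0002", "ID0003"),
                   stringsAsFactors = FALSE)

# Data with variables named according to the alternative scheme.
dat <- matrix(1:6, nrow = 3,
              dimnames = list(c("GENE2", "GENE3", "GENE1"), c("s1", "s2")))

# Translate row names into the internal scheme via the table.
pos <- match(rownames(dat), info$SYMBOL)
found <- !is.na(pos)
rownames(dat)[found] <- rownames(info)[pos[found]]
print(dat)
```

Names that are not found in the table are left unchanged in this sketch.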
Updates the Asset object.
# Import nomenclature from a system file.
base <- system.file(package = "Allspice")
fname <- file.path(base, "subtypes", "nomenclature.txt")
info <- read.delim(fname, stringsAsFactors = FALSE)
# Set ENSEMBL identities as row names.
rownames(info) <- info$ENSEMBL
info$ENSEMBL <- NULL
print(head(info))
# Create a new asset and set nomenclature.
a <- asset()
nomenclature(a) <- info
# Prepare training data.
simu <- bcellALL(200)
materials <- list(title="Simutypes")
materials$dat <- simu$counts
materials$covariates <- simu$metadata[,c("MALE","AGE")]
materials$bits <- simu$metadata[,"SUBTYPE",drop=FALSE]
# Assemble classification asset.
assemble(a) <- materials
# Check that nomenclature was set.
simu <- bcellALL(5)
expres <- normalize(a, dat = simu$counts)
print(head(simu$counts))
print(head(expres))
Adjust scale differences between samples.
normalize(obj, dat)
obj |
An object of the class Asset. |
dat |
A matrix that contains variables as rows and samples as columns. |
The normalization pipeline comprises three steps. First, variable names are checked against the internal nomenclature and converted to the internal naming scheme where necessary (see nomenclature()). Second, variables that are present in the internal normalization reference are imputed with reference values if not available in the data. Third, the data are normalized according to the DESeq2 algorithm (Love MI, Huber W & Anders S, Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2, Genome Biol 15, 550, 2014).
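The median-of-ratios idea behind DESeq2 size factors can be sketched in base R as follows. The counts are made up, and the pseudo-reference is computed from the data itself, whereas the asset is described as using an internal normalization reference; treat this as an illustration of the principle only.

```r
# Toy read counts: genes as rows, samples as columns (values made up).
counts <- matrix(c(10, 20, 30, 40,
                   20, 40, 60, 80,
                   5, 10, 15, 20), nrow = 4,
                 dimnames = list(paste0("g", 1:4), c("s1", "s2", "s3")))

# Log geometric mean of each gene across samples (the pseudo-reference).
logref <- rowMeans(log(counts))

# Size factor per sample: median ratio to the reference over usable genes.
usable <- is.finite(logref)
sizes <- apply(counts, 2, function(x) {
  exp(median(log(x[usable]) - logref[usable]))
})

# Divide each sample (column) by its size factor.
normed <- sweep(counts, 2, sizes, "/")
print(sizes)
print(normed)
```

Genes with a zero count in any sample have a non-finite log geometric mean and are excluded from the size factor estimate.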
Returns a matrix in the same format as the input.
# Import ALL subtyping asset.
base <- system.file(package = "Allspice")
folder <- file.path(base, "subtypes")
a <- asset(folder)
# Simulated data.
simu <- bcellALL(5)
# Normalize RNA read counts.
expres <- normalize(a, dat = simu$counts)
print(head(simu$counts))
print(head(expres))
Classifies samples based on their profiles.
predictions(obj)
obj |
An object of the class Classifier. |
Use the functions covariates() and profiles() to import data into the classifier.
Returns a list of data frames that contain the output from each classification asset within the classifier. See classify() for details on the result items.
# Simulated data.
simu <- bcellALL(5)
# Predict subtypes.
cls <- classifier(verbose = FALSE)
covariates(cls) <- simu$metadata
profiles(cls) <- simu$counts
pred <- predictions(cls)
print(pred[[1]][,c("LABEL","PROX","EXCL")])
print(pred[[2]][,c("LABEL","PROX","EXCL")])
print(pred[[3]][,c("LABEL","PROX","EXCL")])
Analyse new data using classification assets.
profiles(obj) <- value
obj |
An object of the class Classifier. |
value |
A numeric vector or a matrix where samples are organized into columns and variables into rows. |
If the input is a vector, the elements must be named and these names will be used to identify variables.
If the input is a matrix, it must have sample identities as column names and variables identified by row names.
Updates the Classifier object. Any previous data are discarded.
# Simulated data.
simu <- bcellALL(5)
# Predict subtypes.
cls <- classifier(verbose = FALSE)
covariates(cls) <- simu$metadata
profiles(cls) <- simu$counts
primary <- predictions(cls)[[1]]
print(primary[,c("LABEL","PROX","EXCL")])
Creates a visual report of the classification results.
report(obj, name, file = NULL)
obj |
An object of the class Classifier. |
name |
Name of the sample to be shown. |
file |
Name of the output file. |
The function generates a Scalable Vector Graphics figure that shows the results from each classification asset within the Classifier. The report will highlight the predicted category label and quality metrics for the primary asset and bar charts for the fits to categories in all assets.
If no file name is provided, the report is plotted on the current device. Note, however, that the best visual outcomes are achieved by plotting to a file, especially for classifiers with three or more assets.
Returns the name of the output file.
# Simulated data.
simu <- bcellALL(5)
keys <- colnames(simu$counts)
# Predict subtypes.
cls <- classifier(verbose = FALSE)
covariates(cls) <- simu$metadata
profiles(cls) <- simu$counts
# Show visual report by name.
dev.new()
report(cls, name = keys[3])
# Show visual report by sample index.
dev.new()
report(cls, name = 3)
Classification scores for samples based on their profiles.
scores(obj)
obj |
An object of the class Classifier. |
Use the functions covariates() and profiles() to import data into the classifier.
Returns a list of data frames that contain the output from each classification asset within the classifier.
# Simulated data.
simu <- bcellALL(5)
# Predict subtypes.
cls <- classifier(verbose = FALSE)
covariates(cls) <- simu$metadata
profiles(cls) <- simu$counts
z <- scores(cls)
print(z[[1]][,1:5])
print(z[[2]][,1:5])
print(z[[3]][,1:5])
Standardize scale and location of variables.
standardize(obj, dat, trim = FALSE)
obj |
An object of the class Asset. |
dat |
A matrix that contains variables as rows and samples as columns. |
trim |
If TRUE, returns only variables used as input features for classification. |
If the asset is so configured, the data are first transformed by log(x + 1). Values are processed with the mean and standard deviation that were calculated from the training data when the asset was assembled. The mean is subtracted and the values divided by SD. To control for outliers, extreme values are compressed by the t-distribution with 50 degrees of freedom.
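The standardization steps can be sketched in base R as shown below. The training statistics and data are made up, and the compression step is an assumption: it maps quantiles of the t-distribution with 50 degrees of freedom onto the normal distribution, which is one plausible reading of the description and not necessarily what the package does.

```r
# Hypothetical training statistics for two variables (values made up).
mu <- c(g1 = 3.0, g2 = 5.0)
sigma <- c(g1 = 1.0, g2 = 2.0)

# New data: variables as rows, samples as columns.
dat <- matrix(c(10, 200, 1000, 150), nrow = 2,
              dimnames = list(c("g1", "g2"), c("s1", "s2")))

# Log-transform, then center and scale with the training statistics.
z <- (log(dat + 1) - mu[rownames(dat)]) / sigma[rownames(dat)]

# Assumed compression: map t-distribution quantiles (50 df) to normal ones,
# which pulls extreme values towards the centre.
zc <- qnorm(pt(z, df = 50))
print(round(z, 2))
print(round(zc, 2))
```

Values close to zero are almost unchanged by this mapping, while large absolute values shrink noticeably.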
Returns a matrix in the same format as the input.
# Import ALL subtyping asset.
base <- system.file(package = "Allspice")
folder <- file.path(base, "subtypes")
a <- asset(folder)
# Simulated data.
simu <- bcellALL(5)
# Standardize RNA read counts.
expres <- normalize(a, dat = simu$counts)
zscores <- standardize(a, dat = expres)
print(head(simu$counts))
print(head(expres))
print(head(zscores))
Attach text and color attributes to categories.
visuals(obj) <- value
obj |
An object of the class Asset. |
value |
A character vector or a data frame, see details. |
If the input value is a character vector, the elements are stored as the category text and names of the elements are stored as category names.
If the input value is a data frame, the column 'LABEL' is used as the text and the row names as the names of the categories. Additional columns may include 'COLOR', 'COLOR.dark' and 'COLOR.light' that must contain strings of color names or hexadecimal codes as produced by rgb().
In the absence of color data, the function assigns automatic colors.
Updates the Asset object.
# Create a new asset.
a <- asset()
# Set category labels with automatic colors.
labels <- paste("Category", 1:8)
names(labels) <- paste0("cat", 1:8)
visuals(a) <- labels
print(a@categories)
# Add color information.
info <- data.frame(stringsAsFactors = FALSE, LABEL = labels, COLOR = "red")
rownames(info) <- names(labels)
visuals(a) <- info
print(a@categories)