Title: | Genome-Wide Discovery of Pre-miRNAs with few Labeled Examples |
---|---|
Description: | Machine learning method specifically designed for pre-miRNA prediction. It takes advantage of unlabeled sequences to improve the prediction rates even when there are only a few positive examples, or when the negative examples are unreliable or are not good representatives of their class. Furthermore, the method can automatically search for negative examples if the user is unable to provide them. MiRNAss can find a good boundary to separate the pre-miRNAs from other groups of sequences; it automatically optimizes the threshold that defines the class boundaries and is therefore robust to high class imbalance. Each step of the method is scalable and can handle large volumes of data. |
Authors: | Cristian Yones |
Maintainer: | Cristian Yones <[email protected]> |
License: | Apache License 2.0 |
Version: | 1.5 |
Built: | 2024-11-27 06:30:03 UTC |
Source: | CRAN |
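Before the function reference, the following minimal sketch (using the celegans example data set documented further below) illustrates the scenario described above in which only positive examples are available and negative examples are searched for automatically; the number of labeled positives (100) is an arbitrary choice for the illustration.

library(miRNAss)

# Start with every sequence unlabeled (0) and mark only a handful of
# known pre-miRNAs as positive (1); no negative examples are given.
y = rep(0, nrow(celegans))
y[sample(which(celegans$CLASS), 100)] = 1

# Feature columns only
x = subset(celegans, select = -CLASS)

# miRNAss labels a small fraction of the sequences as negative
# automatically (see the neg2label parameter) and scores the rest.
p = miRNAss(x, y)

# A positive score means "predicted pre-miRNA"
table(predicted = p > 0, known = celegans$CLASS)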
This function builds the adjacency matrix (the graph) from a data frame of numerical features.
adjacencyMatrixKNN(sequenceFeatures,
                   sequenceLabels = rep(0, nrow(sequenceFeatures)),
                   nNearestNeighbor = 10, threadNumber = NA)
sequenceFeatures | Data frame with features extracted from stem-loop sequences.
sequenceLabels | Vector of labels of the stem-loop sequences. It must contain -1 for negative examples, 1 for known miRNAs and 0 for unknown sequences (the ones to be classified).
nNearestNeighbor | Number of nearest neighbors in the KNN graph. The default value is 10.
threadNumber | Number of threads used for the calculations. If it is NA, the number of threads is decided by OpenMP (and may vary across platforms).
Returns the adjacency matrix of the KNN graph as a sparse matrix, which can be passed to miRNAss or eigenDecomposition through the AdjMatrix parameter.
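Before the full example, here is a minimal sketch (using the bundled celegans data) of the simplest possible call, relying on the default arguments; the inspection calls at the end are purely illustrative.

x = subset(celegans, select = -CLASS)

# Build the 10-nearest-neighbor graph without any labels (the defaults)
A = adjacencyMatrixKNN(x)

dim(A)    # one row and one column per stem-loop sequence
class(A)  # a sparse matrix object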
# First construct the label vector with the CLASS column
y = as.numeric(celegans$CLASS) * 2 - 1

# Remove some labels to make a test
y[sample(which(y > 0), 200)] = 0
y[sample(which(y < 0), 700)] = 0

# Take all the features but remove the label column
x = subset(celegans, select = -CLASS)

A = adjacencyMatrixKNN(x, y, 10, 8)

for (nev in seq(50, 200, 50)) {
    # the data frame of features 'x' should not be passed as a parameter
    p = miRNAss(sequenceLabels = y, AdjMatrix = A, nEigenVectors = nev)

    # Calculate some performance measures
    SE = mean(p[ celegans$CLASS & y == 0] > 0)
    SP = mean(p[!celegans$CLASS & y == 0] < 0)
    cat("N: ", nev, "\n SE: ", SE, "\n SP: ", SP, "\n")
}
Small dataset of features extracted from C. elegans hairpins. The full dataset is contained in the zip file "experiment_scripts.zip", which can be downloaded from the source URL given below.
celegans
A data frame with 1000 rows and 29 columns. The first 28 columns are numeric features used in [1]. The last column is a logical variable indicating if the stem-loop is a pre-miRNA or not.
http://sourceforge.net/projects/sourcesinc/files/mirnass/
[1] Gudyś, A., Szcześniak, M. W., Sikora, M., & Makałowska, I. (2013). HuntMi: an efficient and taxon-specific approach in pre-miRNA identification. BMC Bioinformatics, 14(1), 1.
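A short sketch for a first look at the bundled data frame (the expected shapes follow directly from the format description above):

library(miRNAss)

dim(celegans)            # 1000 rows, 29 columns
summary(celegans$CLASS)  # logical: TRUE marks pre-miRNA stem-loops
head(names(celegans))    # first few of the 28 numeric feature names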
This function calculates the eigenvectors and eigenvalues of the Laplacian of the graph. As this process is quite time consuming, this function allows obtaining the decomposition once, so that miRNAss can then be run several times in a shorter time.
eigenDecomposition(AdjMatrix, nEigenVectors)
AdjMatrix | Sparse adjacency matrix of the graph.
nEigenVectors | Number of eigenvectors to compute.
Returns the eigendecomposition as a list with two elements: the eigenvector matrix 'U' and the eigenvalue vector 'D'.
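As a small sketch of the returned structure (assuming an adjacency matrix A already built with adjacencyMatrixKNN, as in the example below):

E = eigenDecomposition(AdjMatrix = A, nEigenVectors = 100)

length(E$D)  # 100 eigenvalues
dim(E$U)     # eigenvector matrix, one column per eigenvector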
# First construct the label vector with the CLASS column
y = as.numeric(celegans$CLASS) * 2 - 1

# Remove some labels to make a test
y[sample(which(y > 0), 200)] = 0
y[sample(which(y < 0), 700)] = 0

# Take all the features but remove the label column
x = subset(celegans, select = -CLASS)

A = adjacencyMatrixKNN(x, y, 10, 8)
E = eigenDecomposition(AdjMatrix = A, nEigenVectors = 100)

for (mp in c(0.1, 1, 10)) {
    p = miRNAss(sequenceLabels = y, AdjMatrix = A, eigenVectors = E,
                missPenalization = mp)

    # Calculate some performance measures
    SE = mean(p[ celegans$CLASS & y == 0] > 0)
    SP = mean(p[!celegans$CLASS & y == 0] < 0)
    cat("mP: ", mp, "\n SE: ", SE, "\n SP: ", SP, "\n")
}
This is the main function of the miRNAss package and implements the miRNA prediction method. It takes as its main parameters a matrix with numerical features extracted from RNA hairpins and an incomplete vector of labels, where positive values represent known miRNAs, negative values represent non-miRNA hairpins, and zero values mark unknown sequences (those that will be classified). As a result, it returns a complete vector of prediction scores for all sequences.
miRNAss(sequenceFeatures = NULL, sequenceLabels, AdjMatrix = NULL,
        nNearestNeighbor = 10, missPenalization = 1,
        scallingMethod = "relief", thresholdObjective = "Gm",
        neg2label = 0.05, positiveProp = NULL, eigenVectors = NULL,
        nEigenVectors = min(400, round(length(sequenceLabels)/5)),
        threadNumber = NA)
sequenceFeatures | Data frame with features extracted from stem-loop sequences. It is not required if the adjacency matrix is provided.
sequenceLabels | Vector of labels of the stem-loop sequences. It must contain -1 for negative examples, 1 for known miRNAs and 0 for unknown sequences (the ones to be classified).
AdjMatrix | Sparse adjacency matrix representing the graph. If sequence features are provided, this parameter is ignored.
nNearestNeighbor | Number of nearest neighbors in the KNN graph. The default value is 10.
missPenalization | Penalization of the misclassification of known examples. The default value is 1. If the labeled examples are not very reliable, this value can be decreased.
scallingMethod | Method used for normalization and scaling of the features. The options are 'none', 'whitening' and 'relief' (the default). The first option does nothing, the second calls the built-in function 'scale', and the last one uses the ReliefFexpRank algorithm from the CORElearn package.
thresholdObjective | Performance measure to be optimized when estimating the classification threshold. The options are 'Gm' (geometric mean of the SE and the SP), 'G' (geometric mean of the SE and the precision), 'F1' (harmonic mean of the SE and the precision) and 'none' (do not calculate any threshold). The default value is 'Gm'. See the short sketch after this argument list for how these measures are computed.
neg2label | Proportion of unlabeled stem-loops that will be labeled as negative by the automatic method in order to start the classification algorithm. The default is 0.05.
positiveProp | Expected proportion of positive sequences. If it is not provided by the user, it is estimated as sum(y > 0) / sum(y != 0) when there are negative examples, or as 2 * sum(y > 0) / sum(y == 0) when there are not.
eigenVectors | Eigendecomposition of the Laplacian matrix, as returned by the function eigenDecomposition. If it is not provided, it is calculated internally (this parameter allows calculating the eigenvectors once and then running miRNAss several times with the same eigenvectors).
nEigenVectors | Number of eigenvectors used to approximate the solution of the optimization problem. If the number is too low, smoother solutions are found, probably losing SP but achieving a better SE. Generally, 400 eigenvectors are enough.
threadNumber | Number of threads used for the calculations. If it is NA, the number of threads is decided by OpenMP (and may vary across platforms).
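To make the threshold objectives concrete, here is a small illustrative sketch; the helper function is hypothetical (it is not part of the miRNAss API) and only spells out the three measures named in the thresholdObjective argument.

# Hypothetical helper: given prediction scores, the true logical labels
# and a candidate threshold, compute the three objectives that miRNAss
# can optimize when estimating its classification threshold.
thresholdObjectives = function(scores, truth, threshold) {
    pred = scores > threshold
    SE = mean(pred[truth])              # sensitivity (recall)
    SP = mean(!pred[!truth])            # specificity
    PR = sum(pred & truth) / sum(pred)  # precision
    c(Gm = sqrt(SE * SP),               # geometric mean of SE and SP
      G  = sqrt(SE * PR),               # geometric mean of SE and precision
      F1 = 2 * SE * PR / (SE + PR))     # harmonic mean of SE and precision
}

# e.g., with the score vector 'p' from the example further below:
# thresholdObjectives(p, celegans$CLASS, 0)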
Returns a vector of the same size as the input vector y with the prediction scores for all sequences (even the labeled examples). If a thresholdObjective different from 'none' was set, the threshold is estimated and subtracted from the scores, so the new threshold that divides the classes is zero. Also, the positive scores are divided by the maximum positive score, and the negative scores are divided by the magnitude of the minimum negative score.
# First construct the label vector with the CLASS column
y = as.numeric(celegans$CLASS) * 2 - 1

# Remove some labels to make a test
y[sample(which(y > 0), 200)] = 0
y[sample(which(y < 0), 700)] = 0

# Take all the features but remove the label column
x = subset(celegans, select = -CLASS)

# Call miRNAss with default parameters
p = miRNAss(x, y)

# Calculate some performance measures
SE = mean(p[ celegans$CLASS & y == 0] > 0)
SP = mean(p[!celegans$CLASS & y == 0] < 0)
cat("Sensitivity: ", SE, "\nSpecificity: ", SP, "\n")
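As a follow-up to the example above, a short sketch of how the returned scores can be interpreted (using the score vector p and the label vector y from that example):

# The estimated threshold has already been subtracted, so a positive
# score means "predicted pre-miRNA"; the normalization keeps all
# scores within [-1, 1].
predictedClass = p > 0
table(predicted = predictedClass[y == 0], known = celegans$CLASS[y == 0])
range(p)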