Title: | Feature (gene) selection based on the Proportional Overlapping Scores |
---|---|
Description: | A package for selecting the most relevant features (genes) in the high-dimensional binary classification problems. The discriminative features are identified using analyzing the overlap between the expression values across both classes. The package includes functions for measuring the proportional overlapping score for each gene avoiding the outliers effect. The used measure for the overlap is the one defined in the "Proportional Overlapping Score (POS)" technique for feature selection. A gene mask which represents a gene's classification power can also be produced for each gene (feature). The set size of the selected genes might be set by the user. The minimum set of genes that correctly classify the maximum number of the given tissue samples (observations) can be also produced. |
Authors: | Osama Mahmoud, Andrew Harrison, Aris Perperoglou, Asma Gul, Zardad Khan, Berthold Lausen |
Maintainer: | Osama Mahmoud <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.0 |
Built: | 2024-10-31 19:55:57 UTC |
Source: | CRAN |
A package for selecting the most relevant features (genes) in the high-dimensional binary classification problems. The discriminative features are identified using analyzing the overlap between the expression values across both classes. The package includes functions for measuring the proportional overlapping score for each gene avoiding the outliers effect. The used measure of the overlap is the one defined in the “Proportional Overlapping Score (POS)” technique for feature selection, see ‘References’ section below. A gene mask which represents a gene's classification power can also be produced for each gene (feature). The set size of the selected genes might be set by the user. The minimum set of genes that correctly classify the maximum number of the given tissue samples (observations) can be also produced.
Package: | propOverlap |
Type: | Package |
Version: | 1.0 |
Date: | 2014-09-15 |
License: | GPL (>= 2) |
Osama Mahmoud, Andrew Harrison, Aris Perperoglou, Asma Gul, Zardad Khan, Berthold Lausen Maintainer: Osama Mahmoud [email protected]
Mahmoud O., Harrison A., Perperoglou A., Gul A., Khan Z., Metodiev M. and Lausen B. (2014) A feature selection method for classification within functional genomics experiments based on the proportional overlapping score. BMC Bioinformatics, 2014, 15:274
CI.emprical
is used to compute the core interval boundaries for each class.
CI.emprical(ES, Y)
CI.emprical(ES, Y)
ES |
gene (feature) matrix: P, number of genes, by N, number of samples(observations). |
Y |
a vector of length N for samples' class label. |
CI.emprical
returns an object of class “data.frame” which has P rows and 4 columns. The first two columns represent a1, the minimum boundary of the first class, and b1, the maximum boundary of the first class, respectively. Whereas, the last two columns represent a2, the minimum boundary of the second class, and b2, the maximum boundary of the second class, respectively.
Osama Mahmoud [email protected]
Mahmoud O., Harrison A., Perperoglou A., Gul A., Khan Z., Metodiev M. and Lausen B. (2014) A feature selection method for classification within functional genomics experiments based on the proportional overlapping score. BMC Bioinformatics, 2014, 15:274.
data(lung) GenesExpression <- lung[1:12533,] #define the features matrix Class <- lung[12534,] #define the observations' class labels CoreIntervals <- CI.emprical(GenesExpression, Class) CoreIntervals[1:10,] #show classes' core interval for the first 10 features
data(lung) GenesExpression <- lung[1:12533,] #define the features matrix Class <- lung[12534,] #define the observations' class labels CoreIntervals <- CI.emprical(GenesExpression, Class) CoreIntervals[1:10,] #show classes' core interval for the first 10 features
GMask
produces the masks of features (genes). Each gene mask reports the samples that can unambiguously be assigned to their correct target classes by this gene.
GMask(ES, Core, Y)
GMask(ES, Core, Y)
ES |
gene (feature) matrix: P, number of genes, by N, number of samples(observations). |
Core |
a |
Y |
a vector of length N for samples' class label. |
GMask
gives the gene masks that can represent the capability of genes to correctly classify each sample. Such a mask represents a gene's classification power. Each element of a mask is set either to 1 or 0 based on whether the corresponding sample (observation) could be unambiguously assign to its correct target class by the considered gene or not respectively.
It returns a P by N matrix with elements of zeros and ones.
Osama Mahmoud [email protected]
Mahmoud O., Harrison A., Perperoglou A., Gul A., Khan Z., Metodiev M. and Lausen B. (2014) A feature selection method for classification within functional genomics experiments based on the proportional overlapping score. BMC Bioinformatics, 2014, 15:274.
CI.emprical
for the core interval boundaries.
data(leukaemia) GenesExpression <- leukaemia[1:7129,] #define the features matrix Class <- leukaemia[7130,] #define the observations' class labels Gene.Masks <- GMask(GenesExpression, CI.emprical(GenesExpression, Class), Class) Gene.Masks[1:100,] #show the masks of the first 100 features
data(leukaemia) GenesExpression <- leukaemia[1:7129,] #define the features matrix Class <- leukaemia[7130,] #define the observations' class labels Gene.Masks <- GMask(GenesExpression, CI.emprical(GenesExpression, Class), Class) Gene.Masks[1:100,] #show the masks of the first 100 features
The leukemia dataset was taken from a collection of leukemia patient samples reported by Golub et. al., (1999). This dataset often serves as a benchmark for microarray analysis methods. It contains gene expressions corresponding to acute lymphoblast leukemia (ALL) and acute myeloid leukemia (AML) samples from bone marrow and peripheral blood. The dataset consisted of 72 samples: 49 samples of ALL; 23 samples of AML. Each sample is measured over 7,129 genes.
data(leukaemia)
data(leukaemia)
A matrix with 7130 rows (7129 rows show the gene expressions while the last row reports the corresponding sample's class label), and 72 columns represent the samples. The samples class's label coded as follows:
1
acute lymphoblast leukemia sample (ALL).
2
acute myeloid leukemia sample (AML).
http://cilab.ujn.edu.cn/datasets.htm
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science: 286 (5439), 531-537.
data(leukaemia) str(leukaemia)
data(leukaemia) str(leukaemia)
Gene expression data for lung cancer classification between two classes: adenocarcinoma (ADCA); malignant pleural mesothe-lioma (MPM). The lung data set contains 181 tissue samples (150 ADCA and 31 MPM). Each sample is described by 12533 genes.
data(lung)
data(lung)
A matrix with 12534 rows (12533 rows show the gene expressions for 181 tissue samples, reported in columns, while the last row reports the corresponding sample's class label). The samples class's label coded as follows:
1
adenocarcinoma sample (ADCA).
2
malignant pleural mesothe-lioma sample (MPM).
http://cilab.ujn.edu.cn/datasets.htm
Gordon GJ, Jensen RV, Hsiao L-L, Gullans SR, Blumenstock JE, Ramaswamy S, Richards WG, Sugarbaker DJ, Bueno R. (2002) Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer research: 62(17), 4963-4967.
data(lung) str(lung)
data(lung) str(lung)
POS
computes the proportional overlapping scores of the given genes (features). This score measures the overlap degree between gene expression values across various classes. It produces a value lies in the interval [0,1]. A lower score denotes gene with higher discriminative power for the considered classification problem.
POS(ES, Core, Y)
POS(ES, Core, Y)
ES |
gene (feature) matrix: P, number of genes, by N, number of samples(observations). |
Core |
a |
Y |
a vector of length N for samples' class label. |
For each gene, POS
computes a measure that estimates the overlapping degree between the expression intervals of different classes. For estimating the overlap, POS measure takes into account three factors: the length of the overlapping region; number of the overlapped samples (observations); the proportion of each class's overlapped samples to the total number of overlapping samples.
It returns a vector of length P for ‘POS’ measures of all genes (features).
Osama Mahmoud [email protected]
Mahmoud O., Harrison A., Perperoglou A., Gul A., Khan Z., Metodiev M. and Lausen B. (2014) A feature selection method for classification within functional genomics experiments based on the proportional overlapping score. BMC Bioinformatics, 2014, 15:274.
CI.emprical
for the core interval boundaries and GMask
for the gene masks.
data(leukaemia) Score <- POS(leukaemia[1:7129,], CI.emprical(leukaemia[1:7129,], leukaemia[7130,]), leukaemia[7130,]) Score[1:5] #show the proportional overlapping scores for the first 5 features summary(Score) #show the the summary of the scores of all features.
data(leukaemia) Score <- POS(leukaemia[1:7129,], CI.emprical(leukaemia[1:7129,], leukaemia[7130,]), leukaemia[7130,]) Score[1:5] #show the proportional overlapping scores for the first 5 features summary(Score) #show the the summary of the scores of all features.
RDC
associates genes (features) with the class which it is more able to distingish. For each gene, a class that has the highest proportion, relative to classes' size, of correctly assigned samples (observations) is reported as the relative dominant class for the considered gene.
RDC(GMask, Y)
RDC(GMask, Y)
GMask |
gene (feature) mask matrix: P, number of genes, by N, number of samples(observations) with elements of zeros and ones. See the returned value of the |
Y |
a vector of length N for samples' class label. |
RDC
returns a vector of length P. Each element's value is either 1 or 2 indicating which class label is reported as the relative dominant class for the corresponding gene (feature).
Osama Mahmoud [email protected]
Mahmoud O., Harrison A., Perperoglou A., Gul A., Khan Z., Metodiev M. and Lausen B. (2014) A feature selection method for classification within functional genomics experiments based on the proportional overlapping score. BMC Bioinformatics, 2014, 15:274.
GMask
for gene (feature) mask matrix.
data(lung) Class <- lung[12534,] #define the observations' class labels Gene.Masks <- GMask(lung[1:12533,], CI.emprical(lung[1:12533,], Class), Class) RelativeDC <- RDC(Gene.Masks, Class) RelativeDC[1:10] #show the relative dominant classes for the first 10 features table(RelativeDC) #show the number of assignments for each class
data(lung) Class <- lung[12534,] #define the observations' class labels Gene.Masks <- GMask(lung[1:12533,], CI.emprical(lung[1:12533,], Class), Class) RelativeDC <- RDC(Gene.Masks, Class) RelativeDC[1:10] #show the relative dominant classes for the first 10 features table(RelativeDC) #show the number of assignments for each class
Sel.Feature
selects the most discriminative genes (features) among the given ones.
Sel.Features(ES, Y, K = "Min", Verbose = FALSE)
Sel.Features(ES, Y, K = "Min", Verbose = FALSE)
ES |
gene (feature) matrix: P, number of genes, by N, number of samples (observations). |
Y |
a vector of length N for samples' class label. |
K |
the number of genes to be selected. The default is to give the minimum subset of genes that correctly classify the maximum number of the given tissue samples (observations). Alternatively, |
Verbose |
logical. If |
Sel.Feature
selects the most relevant genes (features) in the high-dimensional binary classification problems. The discriminative genes are identified using analyzing the overlap between the expression values across both classes. The “POS” technique has been applied to produce the selected set of genes. A proportional overlapping score measures the overlapping degree avoiding the outliers effect for each gene. Each gene is described by a robust mask that represents its discriminative power. The constructed masks along with the gene scores are exploited to produce the selected subset of genes.
If K
is specified as ‘Min’ (the default), a list containing the following components is returned:
Features |
A matrix of the indices of selected genes with their POS measures. See |
Covered.Obs |
A vector showing the indices of the observations that have been covered by the returned minimum subset of genes. |
If K
is specified as a positive integer, a list containing the following components is returned:
features |
A vector of the indices of the selected genes. |
nMin.Features |
The number of genes included in the minimum subset. |
Measures |
Available only when |
Verbose
is only needed when K
is specified. If K
is set to “Min” (default), all information are automatically returned.
Osama Mahmoud [email protected]
Mahmoud O., Harrison A., Perperoglou A., Gul A., Khan Z., Metodiev M. and Lausen B. (2014) A feature selection method for classification within functional genomics experiments based on the proportional overlapping score. BMC Bioinformatics, 2014, 15:274.
POS
for calculating the proportional overlapping scores and RDC
for assigning the relative dominant class.
data(leukaemia) GenesExpression <- leukaemia[1:7129,] #define the features matrix Class <- leukaemia[7130,] #define the observations' class labels ## select the minimum subset of features Selection <- Sel.Features(GenesExpression, Class) attributes(Selection) (Candidates <- Selection$Features) #return the selected features (Covered.observations <- Selection$Covered.Obs) #return the covered observations by the selection ## select a specific number of features Selection.k <- Sel.Features(GenesExpression, Class, K=10, Verbose=TRUE) Selection.k$Features Selection.k$nMin.Features #return the size of the minimum subset of genes Selection.k$Measures #return the selected features' information
data(leukaemia) GenesExpression <- leukaemia[1:7129,] #define the features matrix Class <- leukaemia[7130,] #define the observations' class labels ## select the minimum subset of features Selection <- Sel.Features(GenesExpression, Class) attributes(Selection) (Candidates <- Selection$Features) #return the selected features (Covered.observations <- Selection$Covered.Obs) #return the covered observations by the selection ## select a specific number of features Selection.k <- Sel.Features(GenesExpression, Class, K=10, Verbose=TRUE) Selection.k$Features Selection.k$nMin.Features #return the size of the minimum subset of genes Selection.k$Measures #return the selected features' information