Title: | A Mixture Model-Based Approach to the Clustering of Microarray Expression Data |
---|---|
Description: | Provides unsupervised selection and clustering of microarray data using mixture models. Following the methods described in McLachlan, Bean and Peel (2002) <doi:10.1093/bioinformatics/18.3.413> a subset of genes are selected based one the likelihood ratio statistic for the test of one versus two components when fitting mixtures of t-distributions to the expression data for each gene. The dimensionality of this gene subset is further reduced through the use of mixtures of factor analyzers, allowing the tissue samples to be clustered by fitting mixtures of normal distributions. |
Authors: | Andrew Thomas Jones [aut, cre] |
Maintainer: | Andrew Thomas Jones <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.1.4 |
Built: | 2024-11-17 06:33:31 UTC |
Source: | CRAN |
Clusters tissues using all group means
all_cluster_tissues(gen, clusters, q = 6, G = 2)
all_cluster_tissues(gen, clusters, q = 6, G = 2)
gen |
EMMIXgene object |
clusters |
mclust object |
q |
number of factors if using mfa |
G |
number of components if using mfa |
a clustering for each sample (columns) by each group(rows)
example <- plot_single_gene(alon_data,1) #only run on first 100 genes for speed alon_sel <- select_genes(alon_data[seq_len(100), ]) alon_clust<- cluster_genes(alon_sel , 2) alon_tissue_all<-all_cluster_tissues(alon_sel, alon_clust, q=1, G=2)
example <- plot_single_gene(alon_data,1) #only run on first 100 genes for speed alon_sel <- select_genes(alon_data[seq_len(100), ]) alon_clust<- cluster_genes(alon_sel , 2) alon_tissue_all<-all_cluster_tissues(alon_sel, alon_clust, q=1, G=2)
A dataset containing centred and normalized values of the logged expression values of a subset of 2000 genes taken from Alon, Uri, et al. "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays." Proceedings of the National Academy of Sciences 96.12 (1999): 6745-6750. The method of subset selection was described in G. J. McLachlan, R. W. Bean, D. Peel; A mixture model-based approach to the clustering of microarray expression data , Bioinformatics, Volume 18, Issue 3, 1 March 2002, Pages 413–422.
data(alon_data)
data(alon_data)
A data frame with 2000 rows (genes) and 62 variables (samples).
dim(alon_data)
dim(alon_data)
Sorts genes into clusters using mixtures of normal distributions with covariance matrices restricted to be multiples of the identity matrix.
cluster_genes(gen, g = NULL)
cluster_genes(gen, g = NULL)
gen |
an EMMIXgene object produced by select_genes(). |
g |
The desired number of gene clusters. If not specified will be selected automatically on the basis of BIC. |
An array containing the clustering.
#only run on first 100 genes for speed alon_sel <- select_genes(alon_data[seq_len(100), ]) alon_clust<- cluster_genes(alon_sel , 2)
#only run on first 100 genes for speed alon_sel <- select_genes(alon_data[seq_len(100), ]) alon_clust<- cluster_genes(alon_sel , 2)
Clusters tissues
cluster_tissues(gen, clusters, method = "t", q = 6, G = 2)
cluster_tissues(gen, clusters, method = "t", q = 6, G = 2)
gen |
EMMIXgene object |
clusters |
mclust object |
method |
Method for separating tissue classes. Can be either 't' for a univariate mixture of t-distributions on gene cluster means, or 'mfa' for a mixture of factor analyzers. |
q |
number of factors if using mfa |
G |
number of components if using mfa |
a clustering for each sample (columns) by each group(rows)
#only run on first 100 genes for speed alon_sel <- select_genes(alon_data[seq_len(100), ]) alon_clust<- cluster_genes(alon_sel,2) alon_tissue_t<- cluster_tissues(alon_sel,alon_clust,method='t') alon_tissue_mfa<- cluster_tissues(alon_sel, alon_clust,method='mfa',q=2,G=2)
#only run on first 100 genes for speed alon_sel <- select_genes(alon_data[seq_len(100), ]) alon_clust<- cluster_genes(alon_sel,2) alon_tissue_t<- cluster_tissues(alon_sel,alon_clust,method='t') alon_tissue_mfa<- cluster_tissues(alon_sel, alon_clust,method='mfa',q=2,G=2)
Selects genes using the EMMIXgene algorithm, following the methodology of G. J. McLachlan, R. W. Bean, D. Peel; A mixture model-based approach to the clustering of microarray expression data , Bioinformatics, Volume 18, Issue 3, 1 March 2002, Pages 413–422, https://doi.org/10.1093/bioinformatics/18.3.413
select_genes
: Selects the most differentially expressed genes.
cluster_genes
: Clusters the genes using a mixture model
approach.
cluster_tissues
: Clusters the tissues based on the differences
between the tissue samples among the gene groups.
See vignette('The-EMMIXgene-Workflow')
for more details.
A dataset containing the centred and normalized values of the logged expression values of a subset of 3731 genes taken from Golub, Todd R., et al. "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring." Science 286.5439 (1999): 531-537. The method of subset selection was described in G. J. McLachlan, R. W. Bean, D. Peel; A mixture model-based approach to the clustering of microarray expression data , Bioinformatics, Volume 18, Issue 3, 1 March 2002, Pages 413–422.
data(golub_data)
data(golub_data)
A data frame with 3731 rows (genes) and 72 variables (samples). #'@examples dim(golub_data)
Plot heat maps of gene expression data. Optionally sort the x-axis according to a predetermined clustering.
heat_maps(dat, clustering = NULL, y_lab = NULL)
heat_maps(dat, clustering = NULL, y_lab = NULL)
dat |
matrix of gene expression data. |
clustering |
a vector of sample classifications. Must be same length as the number of columns in dat. |
y_lab |
optional label for y-axis. |
A ggplot2 heat map.
example <- heat_maps(alon_data[seq_len(100), ])
example <- heat_maps(alon_data[seq_len(100), ])
Plot a single gene expression histogram with best fitted mixture of t-distributions according to the EMMIX-gene algorithm.
plot_single_gene( dat, gene_id, g = NULL, random_starts = 8, max_it = 100, ll_thresh = 8, min_clust_size = 8, tol = 1e-04, start_method = "both", three = TRUE, min = -4, max = 2 )
plot_single_gene( dat, gene_id, g = NULL, random_starts = 8, max_it = 100, ll_thresh = 8, min_clust_size = 8, tol = 1e-04, start_method = "both", three = TRUE, min = -4, max = 2 )
dat |
matrix of gene expression data. |
gene_id |
row number of gene to be plotted. |
g |
force number of components, default = NULL |
random_starts |
The number of random initializations used per gene when fitting mixtures of t-distributions. Initialization uses k-means by default. |
max_it |
The maximum number of iterations per mixture fit. Default value is 100. |
ll_thresh |
The difference in -2 log lambda used as a threshold for selecting between g=1 and g=2 for each gene. Default value is 8, which was chosen arbitrarily in the original paper. |
min_clust_size |
The minimum number of observations per cluster used when fitting mixtures of t-distributions for each gene. Default value is 8. |
tol |
Tolerance value used for detecting convergence of EMMIX fits. |
start_method |
Default value is "both". Can also choose "random" for purely random starts. |
three |
Also test g=2 vs g=3 where appropriate. Defaults to TRUE. |
min , max
|
Minimum and maximum x-axis values for the plot window. |
A ggplot2 histogram with fitted t-distributions overlayed.
example <- plot_single_gene(alon_data,1) #plot(example)
example <- plot_single_gene(alon_data,1) #plot(example)
Follows the gene selection methodology of G. J. McLachlan, R. W. Bean, D. Peel; A mixture model-based approach to the clustering of microarray expression data , Bioinformatics, Volume 18, Issue 3, 1 March 2002, Pages 413–422, https://doi.org/10.1093/bioinformatics/18.3.413
select_genes( dat, filename, random_starts = 4, max_it = 100, ll_thresh = 8, min_clust_size = 8, tol = 1e-04, start_method = "both", three = FALSE )
select_genes( dat, filename, random_starts = 4, max_it = 100, ll_thresh = 8, min_clust_size = 8, tol = 1e-04, start_method = "both", three = FALSE )
dat |
A matrix or dataframe containing gene expression data. Rows are genes and columns are samples. Must supply one of filename and dat. |
filename |
Name of file containing gene data. Can be either .csv or space separated .dat. Rows are genes and columns are samples. Must supply one of filename and dat. |
random_starts |
The number of random initializations used per gene when fitting mixtures of t-distributions. Initialization uses k-means by default. |
max_it |
The maximum number of iterations per mixture fit. Default value is 100. |
ll_thresh |
The difference in -2 log lambda used as a threshold for selecting between g=1 and g=2 for each gene. Default value is 8, which was chosen arbitrarily in the original paper. |
min_clust_size |
The minimum number of observations per cluster used when fitting mixtures of t-distributions for each gene. Default value is 8. |
tol |
Tolerance value used for detecting convergence of EMMIX fits. |
start_method |
Default value is "both". Can also choose "random" for purely random starts. |
three |
Also test g=2 vs g=3 where appropriate. Defaults to FALSE. |
An EMMIXgene object containing:
stat |
The difference in log-likelihood for g=1 and g=2 for each gene (or for g=2 and g=3 where relevant). |
g |
The selected number of components for each gene. |
it |
The number of iterations for each genes selected fit. |
selected |
An indicator for each genes selected status |
ranks |
selected gene ids ranked by stat |
genes |
A dataframe of selected genes. |
all_genes |
Returns dat or contents of filename. |
#only run on first 100 genes for speed alon_sel <- select_genes(alon_data[seq_len(100), ])
#only run on first 100 genes for speed alon_sel <- select_genes(alon_data[seq_len(100), ])
Cluster tissues
top_genes_cluster_tissues(gen, n_top = 100, method = "mfa", q = 2, g = 2)
top_genes_cluster_tissues(gen, n_top = 100, method = "mfa", q = 2, g = 2)
gen |
An EMMIXgene object produced by select_genes(). |
n_top |
number of top genes (as ranked by likelihood) to be selected |
method |
Method for separating tissue classes. Can be either 't' for a univariate mixture of t-distributions on gene cluster means, or 'mfa' for a mixture of factor analysers. |
q |
number of factors if using mfa |
g |
number of components if using mfa |
An EMMIXgene object containing:
stat |
A matrix containing clustering (0 or 1) for each sample (columns) by each group(rows). |
top_gene |
The row numbers of the top genes. |
fit |
The fit object used to determine the clustering. |
alon_sel <- select_genes(alon_data[seq_len(100), ]) alon_top_10<-top_genes_cluster_tissues(alon_sel, 10, method='mfa', q=3, g=2)
alon_sel <- select_genes(alon_data[seq_len(100), ]) alon_top_10<-top_genes_cluster_tissues(alon_sel, 10, method='mfa', q=3, g=2)