Package 'EMMIXgene' reference manual

Title:	A Mixture Model-Based Approach to the Clustering of Microarray Expression Data
Description:	Provides unsupervised selection and clustering of microarray data using mixture models. Following the methods described in McLachlan, Bean and Peel (2002) <doi:10.1093/bioinformatics/18.3.413> a subset of genes are selected based one the likelihood ratio statistic for the test of one versus two components when fitting mixtures of t-distributions to the expression data for each gene. The dimensionality of this gene subset is further reduced through the use of mixtures of factor analyzers, allowing the tissue samples to be clustered by fitting mixtures of normal distributions.
Authors:	Andrew Thomas Jones [aut, cre]
Maintainer:	Andrew Thomas Jones <[email protected]>
License:	GPL (>= 3)
Version:	0.1.4
Built:	2025-02-15 06:33:35 UTC
Source:	CRAN

Clusters tissues using all group means

Description

Clusters tissues using all group means

Usage

all_cluster_tissues(gen, clusters, q = 6, G = 2)
all_cluster_tissues(gen, clusters, q = 6, G = 2)

Arguments

`gen`	EMMIXgene object
`clusters`	mclust object
`q`	number of factors if using mfa
`G`	number of components if using mfa

Value

a clustering for each sample (columns) by each group(rows)

Examples


example <- plot_single_gene(alon_data,1) 
#only run on first 100 genes for speed
alon_sel <- select_genes(alon_data[seq_len(100), ]) 
alon_clust<- cluster_genes(alon_sel , 2)
alon_tissue_all<-all_cluster_tissues(alon_sel, alon_clust, q=1, G=2)
example <- plot_single_gene(alon_data,1) 
#only run on first 100 genes for speed
alon_sel <- select_genes(alon_data[seq_len(100), ]) 
alon_clust<- cluster_genes(alon_sel , 2)
alon_tissue_all<-all_cluster_tissues(alon_sel, alon_clust, q=1, G=2)

Normalized gene expression values from Alon et al. (1999).

Description

A dataset containing centred and normalized values of the logged expression values of a subset of 2000 genes taken from Alon, Uri, et al. "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays." Proceedings of the National Academy of Sciences 96.12 (1999): 6745-6750. The method of subset selection was described in G. J. McLachlan, R. W. Bean, D. Peel; A mixture model-based approach to the clustering of microarray expression data , Bioinformatics, Volume 18, Issue 3, 1 March 2002, Pages 413–422.

Usage

data(alon_data)
data(alon_data)

Format

A data frame with 2000 rows (genes) and 62 variables (samples).

Examples

dim(alon_data)
dim(alon_data)

Clusters genes using mixtures of normal distributions

Description

Sorts genes into clusters using mixtures of normal distributions with covariance matrices restricted to be multiples of the identity matrix.

Usage

cluster_genes(gen, g = NULL)
cluster_genes(gen, g = NULL)

Arguments

`gen`	an EMMIXgene object produced by select_genes().
`g`	The desired number of gene clusters. If not specified will be selected automatically on the basis of BIC.

Value

An array containing the clustering.

Examples


#only run on first 100 genes for speed
alon_sel <- select_genes(alon_data[seq_len(100), ]) 
alon_clust<- cluster_genes(alon_sel , 2)
#only run on first 100 genes for speed
alon_sel <- select_genes(alon_data[seq_len(100), ]) 
alon_clust<- cluster_genes(alon_sel , 2)

Clusters tissues

Description

Clusters tissues

Usage

cluster_tissues(gen, clusters, method = "t", q = 6, G = 2)
cluster_tissues(gen, clusters, method = "t", q = 6, G = 2)

Arguments

`gen`	EMMIXgene object
`clusters`	mclust object
`method`	Method for separating tissue classes. Can be either 't' for a univariate mixture of t-distributions on gene cluster means, or 'mfa' for a mixture of factor analyzers.
`q`	number of factors if using mfa
`G`	number of components if using mfa

Value

a clustering for each sample (columns) by each group(rows)

Examples

#only run on first 100 genes for speed
alon_sel <- select_genes(alon_data[seq_len(100), ]) 
alon_clust<- cluster_genes(alon_sel,2)
alon_tissue_t<-
   cluster_tissues(alon_sel,alon_clust,method='t')
alon_tissue_mfa<-
   cluster_tissues(alon_sel, alon_clust,method='mfa',q=2,G=2) 
#only run on first 100 genes for speed
alon_sel <- select_genes(alon_data[seq_len(100), ]) 
alon_clust<- cluster_genes(alon_sel,2)
alon_tissue_t<-
   cluster_tissues(alon_sel,alon_clust,method='t')
alon_tissue_mfa<-
   cluster_tissues(alon_sel, alon_clust,method='mfa',q=2,G=2)

EMMIXgene:

Description

Selects genes using the EMMIXgene algorithm, following the methodology of G. J. McLachlan, R. W. Bean, D. Peel; A mixture model-based approach to the clustering of microarray expression data , Bioinformatics, Volume 18, Issue 3, 1 March 2002, Pages 413–422, https://doi.org/10.1093/bioinformatics/18.3.413

Functions

select_genes: Selects the most differentially expressed genes.

cluster_genes: Clusters the genes using a mixture model approach.

cluster_tissues: Clusters the tissues based on the differences between the tissue samples among the gene groups.

See vignette('The-EMMIXgene-Workflow') for more details.

Normalized gene expression values from Golub et al. (1999).

Description

A dataset containing the centred and normalized values of the logged expression values of a subset of 3731 genes taken from Golub, Todd R., et al. "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring." Science 286.5439 (1999): 531-537. The method of subset selection was described in G. J. McLachlan, R. W. Bean, D. Peel; A mixture model-based approach to the clustering of microarray expression data , Bioinformatics, Volume 18, Issue 3, 1 March 2002, Pages 413–422.

Usage

data(golub_data)
data(golub_data)

Format

A data frame with 3731 rows (genes) and 72 variables (samples). #'@examples dim(golub_data)

Heat maps

Description

Plot heat maps of gene expression data. Optionally sort the x-axis according to a predetermined clustering.

Usage

heat_maps(dat, clustering = NULL, y_lab = NULL)
heat_maps(dat, clustering = NULL, y_lab = NULL)

Arguments

`dat`	matrix of gene expression data.
`clustering`	a vector of sample classifications. Must be same length as the number of columns in dat.
`y_lab`	optional label for y-axis.

Value

A ggplot2 heat map.

Examples


example <- heat_maps(alon_data[seq_len(100), ])


example <- heat_maps(alon_data[seq_len(100), ])

Plot a single gene expression histogram with best fitted mixture of t-distributions.

Description

Plot a single gene expression histogram with best fitted mixture of t-distributions according to the EMMIX-gene algorithm.

Usage

plot_single_gene(
  dat,
  gene_id,
  g = NULL,
  random_starts = 8,
  max_it = 100,
  ll_thresh = 8,
  min_clust_size = 8,
  tol = 1e-04,
  start_method = "both",
  three = TRUE,
  min = -4,
  max = 2
)
plot_single_gene(
  dat,
  gene_id,
  g = NULL,
  random_starts = 8,
  max_it = 100,
  ll_thresh = 8,
  min_clust_size = 8,
  tol = 1e-04,
  start_method = "both",
  three = TRUE,
  min = -4,
  max = 2
)

Arguments

`dat`	matrix of gene expression data.
`gene_id`	row number of gene to be plotted.
`g`	force number of components, default = NULL
`random_starts`	The number of random initializations used per gene when fitting mixtures of t-distributions. Initialization uses k-means by default.
`max_it`	The maximum number of iterations per mixture fit. Default value is 100.
`ll_thresh`	The difference in -2 log lambda used as a threshold for selecting between g=1 and g=2 for each gene. Default value is 8, which was chosen arbitrarily in the original paper.
`min_clust_size`	The minimum number of observations per cluster used when fitting mixtures of t-distributions for each gene. Default value is 8.
`tol`	Tolerance value used for detecting convergence of EMMIX fits.
`start_method`	Default value is "both". Can also choose "random" for purely random starts.
`three`	Also test g=2 vs g=3 where appropriate. Defaults to TRUE.
`min`, `max`	Minimum and maximum x-axis values for the plot window.

Value

A ggplot2 histogram with fitted t-distributions overlayed.

Examples


example <- plot_single_gene(alon_data,1) 
#plot(example)

example <- plot_single_gene(alon_data,1) 
#plot(example)

Selects genes using the EMMIXgene algorithm.

Description

Follows the gene selection methodology of G. J. McLachlan, R. W. Bean, D. Peel; A mixture model-based approach to the clustering of microarray expression data , Bioinformatics, Volume 18, Issue 3, 1 March 2002, Pages 413–422, https://doi.org/10.1093/bioinformatics/18.3.413

Usage

select_genes(
  dat,
  filename,
  random_starts = 4,
  max_it = 100,
  ll_thresh = 8,
  min_clust_size = 8,
  tol = 1e-04,
  start_method = "both",
  three = FALSE
)
select_genes(
  dat,
  filename,
  random_starts = 4,
  max_it = 100,
  ll_thresh = 8,
  min_clust_size = 8,
  tol = 1e-04,
  start_method = "both",
  three = FALSE
)

Arguments

`dat`	A matrix or dataframe containing gene expression data. Rows are genes and columns are samples. Must supply one of filename and dat.
`filename`	Name of file containing gene data. Can be either .csv or space separated .dat. Rows are genes and columns are samples. Must supply one of filename and dat.
`random_starts`	The number of random initializations used per gene when fitting mixtures of t-distributions. Initialization uses k-means by default.
`max_it`	The maximum number of iterations per mixture fit. Default value is 100.
`ll_thresh`	The difference in -2 log lambda used as a threshold for selecting between g=1 and g=2 for each gene. Default value is 8, which was chosen arbitrarily in the original paper.
`min_clust_size`	The minimum number of observations per cluster used when fitting mixtures of t-distributions for each gene. Default value is 8.
`tol`	Tolerance value used for detecting convergence of EMMIX fits.
`start_method`	Default value is "both". Can also choose "random" for purely random starts.
`three`	Also test g=2 vs g=3 where appropriate. Defaults to FALSE.

Value

An EMMIXgene object containing:

`stat`	The difference in log-likelihood for g=1 and g=2 for each gene (or for g=2 and g=3 where relevant).
`g`	The selected number of components for each gene.
`it`	The number of iterations for each genes selected fit.
`selected`	An indicator for each genes selected status
`ranks`	selected gene ids ranked by stat
`genes`	A dataframe of selected genes.
`all_genes`	Returns dat or contents of filename.

Examples

#only run on first 100 genes for speed
alon_sel <- select_genes(alon_data[seq_len(100), ]) 

#only run on first 100 genes for speed
alon_sel <- select_genes(alon_data[seq_len(100), ])

Cluster tissues

Description

Cluster tissues

Usage

top_genes_cluster_tissues(gen, n_top = 100, method = "mfa", q = 2, g = 2)
top_genes_cluster_tissues(gen, n_top = 100, method = "mfa", q = 2, g = 2)

Arguments

`gen`	An EMMIXgene object produced by select_genes().
`n_top`	number of top genes (as ranked by likelihood) to be selected
`method`	Method for separating tissue classes. Can be either 't' for a univariate mixture of t-distributions on gene cluster means, or 'mfa' for a mixture of factor analysers.
`q`	number of factors if using mfa
`g`	number of components if using mfa

Value

An EMMIXgene object containing:

`stat`	A matrix containing clustering (0 or 1) for each sample (columns) by each group(rows).
`top_gene`	The row numbers of the top genes.
`fit`	The fit object used to determine the clustering.

Examples

alon_sel <- select_genes(alon_data[seq_len(100), ])
alon_top_10<-top_genes_cluster_tissues(alon_sel, 10, method='mfa', q=3, g=2)
alon_sel <- select_genes(alon_data[seq_len(100), ])
alon_top_10<-top_genes_cluster_tissues(alon_sel, 10, method='mfa', q=3, g=2)

Package 'EMMIXgene'

Help Index

Clusters tissues using all group means

Description

Usage

Arguments

Value

Examples

Normalized gene expression values from Alon et al. (1999).

Description

Usage

Format

Examples

Clusters genes using mixtures of normal distributions

Description

Usage

Arguments

Value

Examples

Clusters tissues

Description

Usage

Arguments

Value

Examples

EMMIXgene:

Description

Functions

Normalized gene expression values from Golub et al. (1999).

Description

Usage

Format

Heat maps

Description

Usage

Arguments

Value

Examples

Plot a single gene expression histogram with best fitted mixture of t-distributions.

Description

Usage

Arguments

Value

Examples

Selects genes using the EMMIXgene algorithm.

Description

Usage

Arguments

Value

Examples

Cluster tissues

Description

Usage

Arguments

Value

Examples