Title: | Fast Estimation of Gaussian Mixture Copula Models |
---|---|
Description: | Unsupervised Clustering and Meta-analysis using Gaussian Mixture Copula Models. |
Authors: | Anders Ellern Bilgrau [aut, cre, cph] , Poul Svante Eriksen [ths, ctb] , Martin Boegsted [ths, ctb] |
Maintainer: | Anders Ellern Bilgrau <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.4 |
Built: | 2024-11-07 06:42:57 UTC |
Source: | CRAN |
Gaussian mixture copula models (GMCM) are a flexible class of statistical
models which can be used for unsupervised clustering, meta analysis, and
many other things. In meta analysis, GMCMs can be used to
quantify and identify which features which have been reproduced across
multiple experiments. This package provides a fast and general
implementation of GMCM cluster analysis and serves as an improvement and
extension of the features available in the idr
package.
If the meta analysis of Li et al. (2011) is to be performed, the
function fit.meta.GMCM
is used to identify the maximum
likelihood estimate of the special Gaussian mixture copula model (GMCM)
defined by Li et al. (2011). The function get.IDR
computes the local and adjusted Irreproducible Discovery Rates defined
by Li et al. (2011) to determine the level of reproducibility.
Tewari et. al. (2011) proposed using GMCMs as an general unsupervised
clustering tool. If such a general unsupervised clustering is needed, like
above, the function fit.full.GMCM
computes the maximum
likelihood estimate of the general GMCM. The function
get.prob
is used to estimate the class membership
probabilities of each observation.
SimulateGMCMData
provide easy simulation from the GMCMs.
Anders Ellern Bilgrau, Martin Boegsted, Poul Svante Eriksen
Maintainer: Anders Ellern Bilgrau <[email protected]>
Anders Ellern Bilgrau, Poul Svante Eriksen, Jakob Gulddahl Rasmussen, Hans Erik Johnsen, Karen Dybkaer, Martin Boegsted (2016). GMCM: Unsupervised Clustering and Meta-Analysis Using Gaussian Mixture Copula Models. Journal of Statistical Software, 70(2), 1-23. doi:10.18637/jss.v070.i02
Li, Q., Brown, J. B. J. B., Huang, H., & Bickel, P. J. (2011). Measuring reproducibility of high-throughput experiments. The Annals of Applied Statistics, 5(3), 1752-1779. doi:10.1214/11-AOAS466
Tewari, A., Giering, M. J., & Raghunathan, A. (2011). Parametric Characterization of Multimodal Distributions with Non-gaussian Modes. 2011 IEEE 11th International Conference on Data Mining Workshops, 286-292. doi:10.1109/ICDMW.2011.135
Core user functions: fit.meta.GMCM
,
fit.full.GMCM
, get.IDR
,
get.prob
, SimulateGMCMData
,
SimulateGMMData
, rtheta
,
Uhat
, choose.theta
,
full2meta
, meta2full
Package by Li et. al. (2011): idr
.
# Loading data data(u133VsExon) # Subsetting data to reduce computation time u133VsExon <- u133VsExon[1:5000, ] # Ranking and scaling, # Remember large values should be critical to the null! uhat <- Uhat(1 - u133VsExon) # Visualizing P-values and the ranked and scaled P-values ## Not run: par(mfrow = c(1,2)) plot(u133VsExon, cex = 0.5, pch = 4, col = "tomato", main = "P-values", xlab = "P (U133)", ylab = "P (Exon)") plot(uhat, cex = 0.5, pch = 4, col = "tomato", main = "Ranked P-values", xlab = "rank(1-P) (U133)", ylab = "rank(1-P) (Exon)") ## End(Not run) # Fitting using BFGS fit <- fit.meta.GMCM(uhat, init.par = c(0.5, 1, 1, 0.5), pgtol = 1e-2, method = "L-BFGS", positive.rho = TRUE, verbose = TRUE) # Compute IDR values and classify idr <- get.IDR(uhat, par = fit) table(idr$K) # 1 = irreproducible, 2 = reproducible ## Not run: # See clustering results par(mfrow = c(1,2)) plot(u133VsExon, cex = 0.5, pch = 4, main = "Classified genes", col = c("tomato", "steelblue")[idr$K], xlab = "P-value (U133)", ylab = "P-value (Exon)") plot(uhat, cex = 0.5, pch = 4, main = "Classified genes", col = c("tomato", "steelblue")[idr$K], xlab = "rank(1-P) (U133)", ylab = "rank(1-P) (Exon)") ## End(Not run)
# Loading data data(u133VsExon) # Subsetting data to reduce computation time u133VsExon <- u133VsExon[1:5000, ] # Ranking and scaling, # Remember large values should be critical to the null! uhat <- Uhat(1 - u133VsExon) # Visualizing P-values and the ranked and scaled P-values ## Not run: par(mfrow = c(1,2)) plot(u133VsExon, cex = 0.5, pch = 4, col = "tomato", main = "P-values", xlab = "P (U133)", ylab = "P (Exon)") plot(uhat, cex = 0.5, pch = 4, col = "tomato", main = "Ranked P-values", xlab = "rank(1-P) (U133)", ylab = "rank(1-P) (Exon)") ## End(Not run) # Fitting using BFGS fit <- fit.meta.GMCM(uhat, init.par = c(0.5, 1, 1, 0.5), pgtol = 1e-2, method = "L-BFGS", positive.rho = TRUE, verbose = TRUE) # Compute IDR values and classify idr <- get.IDR(uhat, par = fit) table(idr$K) # 1 = irreproducible, 2 = reproducible ## Not run: # See clustering results par(mfrow = c(1,2)) plot(u133VsExon, cex = 0.5, pch = 4, main = "Classified genes", col = c("tomato", "steelblue")[idr$K], xlab = "P-value (U133)", ylab = "P-value (Exon)") plot(uhat, cex = 0.5, pch = 4, main = "Classified genes", col = c("tomato", "steelblue")[idr$K], xlab = "rank(1-P) (U133)", ylab = "rank(1-P) (Exon)") ## End(Not run)
A function that attempts to coerce a theta-like list into a proper formatted
object of class theta
.
as.theta(x)
as.theta(x)
x |
A theta-like object that can be coerced. |
First, if the list is of length 3 and not 5, the number of components and
dimension is assumed to be missing and added.
Secondly, the class is added.
Thirdly, names are added if needed.
Next, matrix means and array covariances are
coerced to list form.
Covariances on array form are assumed to be d
by d
by m
.
Means on matrix form are as assumed to be d
by m
. I.e.
rows correspond to the dimensions and columns to components, or the mean vectors
as column vectors.
Finally, the sum constraint of 1 for the mixture proportions is enforced.
A theta object. See rtheta
.
m <- 2 d <- 3 x <- list(m = m, d = d, pie = c(0.5, 0.5), mu = list(comp1=rep(0,d), comp2=rep(1,d)), sigma = list(comp1=diag(d), comp2=diag(d))) print(x) theta <- as.theta(x) print(theta) x2 <- unname(list( # Unnamed # missing m and d pie = c(1, 1), # Does not sum to 1 mu = simplify2array(list(comp1=rep(0,d), comp2=rep(1,d))), # matrix, not a list sigma = simplify2array(list(comp1=diag(d), comp2=diag(d))) # array, not a list )) theta2 <- as.theta(x2) print(theta2)
m <- 2 d <- 3 x <- list(m = m, d = d, pie = c(0.5, 0.5), mu = list(comp1=rep(0,d), comp2=rep(1,d)), sigma = list(comp1=diag(d), comp2=diag(d))) print(x) theta <- as.theta(x) print(theta) x2 <- unname(list( # Unnamed # missing m and d pie = c(1, 1), # Does not sum to 1 mu = simplify2array(list(comp1=rep(0,d), comp2=rep(1,d))), # matrix, not a list sigma = simplify2array(list(comp1=diag(d), comp2=diag(d))) # array, not a list )) theta2 <- as.theta(x2) print(theta2)
This function uses a k
-means algorithm to heuristically select
suitable starting values for the general model.
choose.theta(u, m, no.scaling = FALSE, ...)
choose.theta(u, m, no.scaling = FALSE, ...)
u |
A matrix of (estimates of) realizations from the GMCM. |
m |
The number of components to be fitted. |
no.scaling |
Logical. If TRUE, no scaling of the means and variance-covariance matrices is done. |
... |
Arguments passed to |
The function selects the centers from the k-means algorithm as an initial estimate of the means. The proportional sizes of the clusters are selected as the initial values of the mixture proportions. The within cluster standard deviations are squared and used as the variance of the clusters within each dimension. The correlations between each dimension are taken to be zero.
A list of parameters for the GMCM model on the form described in
rtheta
.
The function uses the kmeans
function from the
stats
-package.
Anders Ellern Bilgrau <[email protected]>
set.seed(2) # Simulating data data1 <- SimulateGMCMData(n = 10000, m = 3, d = 2) obs.data <- Uhat(data1$u) # The ranked observed data # Using choose.theta to get starting estimates theta <- choose.theta(u = obs.data, m = 3) print(theta) # To illustrate theta, we can simulate from the model data2 <- SimulateGMMData(n = 10000, theta = theta) cols <- apply(get.prob(obs.data,theta),1,which.max) # Plotting par(mfrow = c(1,3)) plot(data1$z, main = "True latent GMM") plot(Uhat(data1$u), col = cols, main = "Observed GMCM\nColoured by k-means clustering") plot(data2$z, main = "initial GMM") # Alteratively, theta can simply be plotted to illustrate the GMM density par(mfrow = c(1,1)) plot(theta, add.ellipses = TRUE) points(data2$z, pch = 16, cex = 0.4)
set.seed(2) # Simulating data data1 <- SimulateGMCMData(n = 10000, m = 3, d = 2) obs.data <- Uhat(data1$u) # The ranked observed data # Using choose.theta to get starting estimates theta <- choose.theta(u = obs.data, m = 3) print(theta) # To illustrate theta, we can simulate from the model data2 <- SimulateGMMData(n = 10000, theta = theta) cols <- apply(get.prob(obs.data,theta),1,which.max) # Plotting par(mfrow = c(1,3)) plot(data1$z, main = "True latent GMM") plot(Uhat(data1$u), col = cols, main = "Observed GMCM\nColoured by k-means clustering") plot(data2$z, main = "initial GMM") # Alteratively, theta can simply be plotted to illustrate the GMM density par(mfrow = c(1,1)) plot(theta, add.ellipses = TRUE) points(data2$z, pch = 16, cex = 0.4)
Classify observations according to the maximum a posterior probabilites.
classify(x, theta)
classify(x, theta)
x |
Either a |
theta |
A list of parameters for the full model as described in
|
A integer vector of class numbers with length equal to the number of
rows in x
.
# Classify using probabilites (usually returned from get.prob) probs <- matrix(runif(75), 25, 3) classify(probs) # Classify using a matrix of observations and theta theta <- rtheta(d = 4, m = 3) u <- SimulateGMCMData(n = 20, theta = theta)$u classify(x = u, theta = theta)
# Classify using probabilites (usually returned from get.prob) probs <- matrix(runif(75), 25, 3) classify(probs) # Classify using a matrix of observations and theta theta <- rtheta(d = 4, m = 3) u <- SimulateGMCMData(n = 20, theta = theta)$u classify(x = u, theta = theta)
Fast simulation from and evaluation of multivariate Gaussian probability densities.
dmvnormal(x, mu, sigma) rmvnormal(n, mu, sigma)
dmvnormal(x, mu, sigma) rmvnormal(n, mu, sigma)
x |
A |
mu |
The mean vector of dimension |
sigma |
The variance-covariance matrix of dimension |
n |
The number of observations to be simulated. |
dmvnormal
functions similarly to dmvnorm
from the
mvtnorm
-package and likewise for rmvnormal
and
rmvnorm
.
dmvnormal
returns a by
matrix of the
probability densities corresponding to each row of
x
.
sigma
. Each row corresponds to an observation.
rmvnormal
returns a p
by k
matrix of
observations from a multivariate normal distribution with the given mean
mu
and covariance
Anders Ellern Bilgrau
dmvnorm
and rmvnorm
in the mvtnorm
-package.
dmvnormal(x = matrix(rnorm(300), 100, 3), mu = 1:3, sigma = diag(3)) rmvnormal(n = 10, mu = 1:4, sigma = diag(4))
dmvnormal(x = matrix(rnorm(300), 100, 3), mu = 1:3, sigma = diag(3)) rmvnormal(n = 10, mu = 1:4, sigma = diag(4))
The regular expectation-maximization algorithm for general multivariate Gaussian mixture models.
EMAlgorithm(x, theta, m, eps = 1e-06, max.ite = 1e+05, trace.theta = FALSE, verbose = FALSE)
EMAlgorithm(x, theta, m, eps = 1e-06, max.ite = 1e+05, trace.theta = FALSE, verbose = FALSE)
x |
A |
theta |
A list of parameters of class |
m |
|
eps |
The maximal required difference in successive likelihoods to establish convergence. |
max.ite |
The maximum number of iterations. |
trace.theta |
Logical. If |
verbose |
Set to |
Though not as versatile, the algorithm can be a faster alternative
to Mclust
in the mclust
-package. If theta
is not given,
a k-means clustering is used to determine the initial theta
.
A list of length 3 with elements:
theta |
A list of the estimated parameters as described in
|
loglik.tr |
A numeric vector of the log-likelihood trace. |
kappa |
A matrix where |
Anders Ellern Bilgrau <[email protected]>
set.seed(3) true.theta <- rtheta(d = 2, m = 3, method = "old") true.theta$sigma <- lapply(true.theta$sigma, cov2cor) # Scale ## Not run: plot(true.theta, nlevels = 20, add.ellipses = TRUE) ## End(Not run) data <- SimulateGMCMData(n = 1000, theta = true.theta) start.theta <- rtheta(d = 2, m = 3) start.theta$mu <- t(kmeans(data$z, 3)$centers) # More sensible location estimates start.theta <- as.theta(start.theta) # Coerce the matrix to a list res <- GMCM:::EMAlgorithm(data$z, theta = start.theta) par(mfrow = c(1,2)) plot(data$z, cex = 0.5, pch = 16, main = "Simulated data", col = rainbow(3)[data$K]) plot(data$z, cex = 0.5, pch = 16, main = "GMM clustering", col = rainbow(3)[apply(res$kappa,1,which.max)])
set.seed(3) true.theta <- rtheta(d = 2, m = 3, method = "old") true.theta$sigma <- lapply(true.theta$sigma, cov2cor) # Scale ## Not run: plot(true.theta, nlevels = 20, add.ellipses = TRUE) ## End(Not run) data <- SimulateGMCMData(n = 1000, theta = true.theta) start.theta <- rtheta(d = 2, m = 3) start.theta$mu <- t(kmeans(data$z, 3)$centers) # More sensible location estimates start.theta <- as.theta(start.theta) # Coerce the matrix to a list res <- GMCM:::EMAlgorithm(data$z, theta = start.theta) par(mfrow = c(1,2)) plot(data$z, cex = 0.5, pch = 16, main = "Simulated data", col = rainbow(3)[data$K]) plot(data$z, cex = 0.5, pch = 16, main = "GMM clustering", col = rainbow(3)[apply(res$kappa,1,which.max)])
Estimates the parameters of general Gaussian mixture copula models (GMCM). The function finds the maximum likelihood estimate of a general GMCM with various optimization procedures. Note, all but the PEM methods provides the maximum likelihood estimate.
fit.full.GMCM(u, m, theta = choose.theta(u, m), method = c("NM", "SANN", "L-BFGS", "L-BFGS-B", "PEM"), max.ite = 1000, verbose = TRUE, ...) fit.general.GMCM(u, m, theta = choose.theta(u, m), method = c("NM", "SANN", "L-BFGS", "L-BFGS-B", "PEM"), max.ite = 1000, verbose = TRUE, ...)
fit.full.GMCM(u, m, theta = choose.theta(u, m), method = c("NM", "SANN", "L-BFGS", "L-BFGS-B", "PEM"), max.ite = 1000, verbose = TRUE, ...) fit.general.GMCM(u, m, theta = choose.theta(u, m), method = c("NM", "SANN", "L-BFGS", "L-BFGS-B", "PEM"), max.ite = 1000, verbose = TRUE, ...)
u |
An |
m |
The number of components to be fitted. |
theta |
A list of parameters as defined in |
method |
A character vector of length |
max.ite |
The maximum number of iterations. If the |
verbose |
Logical. If |
... |
Arguments passed to the |
The "L-BFGS-B"
method does not perform a transformation of
the parameters and uses box constraints as implemented in optim
.
Note that the many parameter configurations are poorly estimable or
directly unidentifiable.
fit.general.GMCM
is simply an alias of fit.full.gmcm
.
A list of parameters formatted as described in rtheta
.
When method
equals "PEM"
, a list of extra information
(log-likelihood trace, the matrix of group probabilities, theta trace) is
added as an attribute called "extra".
All the optimization procedures are strongly dependent on the initial values and other parameters (such as the cooling scheme for method SANN). Therefore it is advisable to apply multiple different initial parameters (and optimization routines) and select the best fit.
The choose.theta
itself chooses random a initialization.
Hence, the output when theta
is not directly supplied can vary.
See optim
for further details.
Anders Ellern Bilgrau <[email protected]>
Li, Q., Brown, J. B. J. B., Huang, H., & Bickel, P. J. (2011). Measuring reproducibility of high-throughput experiments. The Annals of Applied Statistics, 5(3), 1752-1779. doi:10.1214/11-AOAS466
Tewari, A., Giering, M. J., & Raghunathan, A. (2011). Parametric Characterization of Multimodal Distributions with Non-gaussian Modes. 2011 IEEE 11th International Conference on Data Mining Workshops, 286-292. doi:10.1109/ICDMW.2011.135
set.seed(17) sim <- SimulateGMCMData(n = 1000, m = 3, d = 2) # Plotting simulated data par(mfrow = c(1,2)) plot(sim$z, col = rainbow(3)[sim$K], main = "Latent process") plot(sim$u, col = rainbow(3)[sim$K], main = "GMCM process") # Observed data uhat <- Uhat(sim$u) # The model should be fitted multiple times using different starting estimates start.theta <- choose.theta(uhat, m = 3) # Random starting estimate res <- fit.full.GMCM(u = uhat, theta = start.theta, method = "NM", max.ite = 3000, reltol = 1e-2, trace = TRUE) # Note, 1e-2 is too big # Confusion matrix Khat <- apply(get.prob(uhat, theta = res), 1, which.max) table("Khat" = Khat, "K" = sim$K) # Note, some components have been swapped # Simulation from GMCM with the fitted parameters simfit <- SimulateGMCMData(n = 1000, theta = res) # As seen, the underlying latent process is hard to estimate. # The clustering, however, is very good. par(mfrow = c(2,2)) plot(simfit$z, col = simfit$K, main = "Model check 1\nSimulated GMM") plot(simfit$u, col = simfit$K, main = "Model check 2\nSimulated GMCM") plot(sim$u, col = Khat, main = "MAP clustering")
set.seed(17) sim <- SimulateGMCMData(n = 1000, m = 3, d = 2) # Plotting simulated data par(mfrow = c(1,2)) plot(sim$z, col = rainbow(3)[sim$K], main = "Latent process") plot(sim$u, col = rainbow(3)[sim$K], main = "GMCM process") # Observed data uhat <- Uhat(sim$u) # The model should be fitted multiple times using different starting estimates start.theta <- choose.theta(uhat, m = 3) # Random starting estimate res <- fit.full.GMCM(u = uhat, theta = start.theta, method = "NM", max.ite = 3000, reltol = 1e-2, trace = TRUE) # Note, 1e-2 is too big # Confusion matrix Khat <- apply(get.prob(uhat, theta = res), 1, which.max) table("Khat" = Khat, "K" = sim$K) # Note, some components have been swapped # Simulation from GMCM with the fitted parameters simfit <- SimulateGMCMData(n = 1000, theta = res) # As seen, the underlying latent process is hard to estimate. # The clustering, however, is very good. par(mfrow = c(2,2)) plot(simfit$z, col = simfit$K, main = "Model check 1\nSimulated GMM") plot(simfit$u, col = simfit$K, main = "Model check 2\nSimulated GMCM") plot(sim$u, col = Khat, main = "MAP clustering")
This function estimates the parameters of the special restricted Gaussian mixture copula model (GMCM) proposed by Li et. al. (2011). It is used to perform reproducibility (or meta) analysis using GMCMs. It features various optimization routines to identify the maximum likelihood estimate of the special GMCMs.
fit.meta.GMCM(u, init.par, method = c("NM", "SANN", "L-BFGS", "L-BFGS-B", "PEM"), max.ite = 1000, verbose = TRUE, positive.rho = TRUE, trace.theta = FALSE, ...) fit.special.GMCM(u, init.par, method = c("NM", "SANN", "L-BFGS", "L-BFGS-B", "PEM"), max.ite = 1000, verbose = TRUE, positive.rho = TRUE, trace.theta = FALSE, ...)
fit.meta.GMCM(u, init.par, method = c("NM", "SANN", "L-BFGS", "L-BFGS-B", "PEM"), max.ite = 1000, verbose = TRUE, positive.rho = TRUE, trace.theta = FALSE, ...) fit.special.GMCM(u, init.par, method = c("NM", "SANN", "L-BFGS", "L-BFGS-B", "PEM"), max.ite = 1000, verbose = TRUE, positive.rho = TRUE, trace.theta = FALSE, ...)
u |
An |
init.par |
A 4-dimensional vector of the initial parameters where,
|
method |
A character vector of length |
max.ite |
The maximum number of iterations. If the |
verbose |
Logical. If |
positive.rho |
|
trace.theta |
|
... |
Arguments passed to the |
The "L-BFGS-B"
method does not perform a transformation of
the parameters.
fit.special.GMCM
is simply an alias of fit.meta.gmcm
.
A vector par
of length 4 of the fitted parameters where
par[1]
is the probability of being from the first (or null)
component, par[2]
is the mean, par[3]
is the standard
deviation, and par[4]
is the correlation.
If trace.theta
is TRUE
, then a list
is returned where
the first entry is as described above and the second entry is the trace
information (dependent of method
.).
Simulated annealing is strongly dependent on the initial values and the cooling scheme.
See optim
for further details.
Anders Ellern Bilgrau <[email protected]>
Li, Q., Brown, J. B. J. B., Huang, H., & Bickel, P. J. (2011). Measuring reproducibility of high-throughput experiments. The Annals of Applied Statistics, 5(3), 1752-1779. doi:10.1214/11-AOAS466
set.seed(1) # True parameters true.par <- c(0.9, 2, 0.7, 0.6) # Simulation of data from the GMCM model data <- SimulateGMCMData(n = 1000, par = true.par) uhat <- Uhat(data$u) # Ranked observed data init.par <- c(0.5, 1, 0.5, 0.9) # Initial parameters # Optimization with Nelder-Mead nm.par <- fit.meta.GMCM(uhat, init.par = init.par, method = "NM") ## Not run: # Comparison with other optimization methods # Optimization with simulated annealing sann.par <- fit.meta.GMCM(uhat, init.par = init.par, method = "SANN", max.ite = 3000, temp = 1) # Optimization with the Pseudo EM algorithm pem.par <- fit.meta.GMCM(uhat, init.par = init.par, method = "PEM") # The estimates agree nicely rbind("True" = true.par, "Start" = init.par, "NM" = nm.par, "SANN" = sann.par, "PEM" = pem.par) ## End(Not run) # Get estimated cluster Khat <- get.IDR(x = uhat, par = nm.par)$Khat plot(uhat, col = Khat, main = "Clustering\nIDR < 0.05")
set.seed(1) # True parameters true.par <- c(0.9, 2, 0.7, 0.6) # Simulation of data from the GMCM model data <- SimulateGMCMData(n = 1000, par = true.par) uhat <- Uhat(data$u) # Ranked observed data init.par <- c(0.5, 1, 0.5, 0.9) # Initial parameters # Optimization with Nelder-Mead nm.par <- fit.meta.GMCM(uhat, init.par = init.par, method = "NM") ## Not run: # Comparison with other optimization methods # Optimization with simulated annealing sann.par <- fit.meta.GMCM(uhat, init.par = init.par, method = "SANN", max.ite = 3000, temp = 1) # Optimization with the Pseudo EM algorithm pem.par <- fit.meta.GMCM(uhat, init.par = init.par, method = "PEM") # The estimates agree nicely rbind("True" = true.par, "Start" = init.par, "NM" = nm.par, "SANN" = sann.par, "PEM" = pem.par) ## End(Not run) # Get estimated cluster Khat <- get.IDR(x = uhat, par = nm.par)$Khat plot(uhat, col = Khat, main = "Clustering\nIDR < 0.05")
This dataset contains a data.frame
of -scores (from a Linear
mixed effects model) and
-values for
differential expression between pre (Im, N) and post germinal (M, PB) centre
cells within peripheral blood.
The first and second column contain the the test for the hypothesis of no
differentially expression between pre and post germinal cells for the
freshly sorted and gene profiled cells.
The third and fourth column contain the the test for the hypothesis of no
differentially expression between pre and post germinal cells for the
cryopreserved (frozen), thawed, sorted, and gene profiled cells.
The fifth and sixth column contain the the test for the hypothesis of no
differentially expression between fresh and frozen cells.
The used array type was Affymetrix Human Exon 1.0 ST microarray.
The format of the data.frame
is:
'data.frame': 18708 obs. of 6 variables:
$ PreVsPost.Fresh.tstat : num -1.073 -0.381 -1.105 -0.559 -1.054 ...
$ PreVsPost.Fresh.pval : num 0.283 0.703 0.269 0.576 0.292 ...
$ PreVsPost.Frozen.tstat: num -0.245 -0.731 -0.828 -0.568 -1.083 ...
$ PreVsPost.Frozen.pval : num 0.806 0.465 0.408 0.57 0.279 ...
$ FreshVsFrozen.tstat : num 0.836 1.135 -0.221 0.191 -0.783 ...
$ FreshVsFrozen.pval : num 0.403 0.256 0.825 0.849 0.434 ...
Further details can be found in Rasmussen and Bilgrau et al. (2015).
Anders Ellern Bilgrau <[email protected]>
Rasmussen SM, Bilgrau AE, Schmitz A, Falgreen S, Bergkvist KS, Tramm AM, Baech J, Jacobsen CL, Gaihede M, Kjeldsen MK, Boedker JS, Dybkaer K, Boegsted M, Johnsen HE (2015). "Stable Phenotype Of B-Cell Subsets Following Cryopreservation and Thawing of Normal Human Lymphocytes Stored in a Tissue Biobank." Cytometry Part B: Clinical Cytometry, 88(1), 40-49.
data(freshVsFrozen) str(freshVsFrozen) # Plot P-values plot(freshVsFrozen[,c(2,4)], cex = 0.5) # Plot ranked and scaled P-values plot(Uhat(abs(freshVsFrozen[,c(1,3)])), cex = 0.5)
data(freshVsFrozen) str(freshVsFrozen) # Plot P-values plot(freshVsFrozen[,c(2,4)], cex = 0.5) # Plot ranked and scaled P-values plot(Uhat(abs(freshVsFrozen[,c(1,3)])), cex = 0.5)
These functions converts/coerces the parameters between the general Gaussian
mixture (copula) model and the special GMCM.
Most functions of the GMCM packages use the theta
format described in rtheta
.
full2meta(theta) meta2full(par, d)
full2meta(theta) meta2full(par, d)
theta |
A list of parameters for the full model. Formatted as described
in |
par |
A vector of length 4 where |
d |
An integer giving the dimension of the mixture distribution. |
If a theta
is supplied which is not on the form of Li et. al. (2011)
the output is coerced by simply picking the first element of the second
component mean vector as mean,
the square roof of the first diagonal entry of the second component
covariance matrix as standard deviation, and first off-diagonal entry as
correlation (properly scaled).
full2meta
returns a numeric vector of length 4 formatted as
par
.
meta2full returns a formatted 'theta' list of parameters as described
by rtheta
.
Anders Ellern Bilgrau <[email protected]>
Li, Q., Brown, J. B. J. B., Huang, H., & Bickel, P. J. (2011). Measuring reproducibility of high-throughput experiments. The Annals of Applied Statistics, 5(3), 1752-1779. doi:10.1214/11-AOAS466
Tewari, A., Giering, M., & Raghunathan, A. (2011). Parametric Characterization of Multimodal Distributions with Non-gaussian Modes. IEEE 11th International Conference on Data Mining Workshops, 2011, 286-292. doi:10.1109/ICDMW.2011.135
theta <- GMCM:::rtheta(m = 2, d = 2) print(par <- full2meta(theta)) print(theta.special.case <- meta2full(par, d = 2))
theta <- GMCM:::rtheta(m = 2, d = 2) print(par <- full2meta(theta)) print(theta.special.case <- meta2full(par, d = 2))
Functions for computing posterior cluster probabilities (get.prob
)
in the general GMCM as well as local and
adjusted irreproducibility discovery rates (get.IDR
) in the
special GMCM.
get.IDR(x, par, threshold = 0.05, ...) get.prob(x, theta, ...)
get.IDR(x, par, threshold = 0.05, ...) get.prob(x, theta, ...)
x |
A |
par |
A vector of length 4 where |
threshold |
The threshold level of the IDR rate. |
... |
Arguments passed to |
theta |
A list of parameters for the full model as described in
|
get.IDR
returns a list of length 5 with elements:
idr |
A vector of the local idr values. I.e. the posterior
probability that |
IDR |
A vector of the adjusted IDR values. |
l |
The number of reproducible features at the specified
|
threshold |
The IDR threshold at which features are deemed reproducible. |
Khat |
A vector signifying whether the corresponding feature is reproducible or not. |
get.prob
returns a matrix where entry (i,j)
is the
posterior probability that the observation x[i, ]
belongs to cluster
j
.
From GMCM version 1.1 get.IDR
has been an internal function.
Use get.prop
or get.IDR
instead. The function can still be
accessed with GMCM:::get.idr
. get.idr
returns a vector where
the 'th entry is the posterior probability that observation
is irreproducible. It is a simple wrapper for
get.prob
.
Anders Ellern Bilgrau <[email protected]>
Li, Q., Brown, J. B. J. B., Huang, H., & Bickel, P. J. (2011). Measuring reproducibility of high-throughput experiments. The Annals of Applied Statistics, 5(3), 1752-1779. doi:10.1214/11-AOAS466
Tewari, A., Giering, M., & Raghunathan, A. (2011). Parametric Characterization of Multimodal Distributions with Non-gaussian Modes. IEEE 11th International Conference on Data Mining Workshops, 2011, 286-292. doi:10.1109/ICDMW.2011.135
set.seed(1123) # True parameters true.par <- c(0.9, 2, 0.7, 0.6) # Simulation of data from the GMCM model data <- SimulateGMCMData(n = 1000, par = true.par, d = 2) # Initial parameters init.par <- c(0.5, 1, 0.5, 0.9) # Nelder-Mead optimization nm.par <- fit.meta.GMCM(data$u, init.par = init.par, method = "NM") # Get IDR values res <- get.IDR(data$u, nm.par, threshold = 0.05) # Plot results plot(data$u, col = res$Khat, pch = c(3,16)[data$K])
set.seed(1123) # True parameters true.par <- c(0.9, 2, 0.7, 0.6) # Simulation of data from the GMCM model data <- SimulateGMCMData(n = 1000, par = true.par, d = 2) # Initial parameters init.par <- c(0.5, 1, 0.5, 0.9) # Nelder-Mead optimization nm.par <- fit.meta.GMCM(data$u, init.par = init.par, method = "NM") # Get IDR values res <- get.IDR(data$u, nm.par, threshold = 0.05) # Plot results plot(data$u, col = res$Khat, pch = c(3,16)[data$K])
Compute goodness of fit as described in AIC
. The number of
parameters used correspond to the number of variables free to vary in the
general model.
goodness.of.fit(theta, u, method = c("AIC", "BIC"), k = 2)
goodness.of.fit(theta, u, method = c("AIC", "BIC"), k = 2)
theta |
A |
u |
An |
method |
A |
k |
A integer specifying the default used constant "k" in AIC. See
|
A single number giving the goodness of fit as requested.
set.seed(2) data(u133VsExon) u <- Uhat(u133VsExon[sample(19577, 500), ]) # Subset for faster fitting theta1 <- fit.full.GMCM(u, m = 2, method = "L-BFGS") goodness.of.fit(theta1, u) # AIC goodness.of.fit(theta1, u, method = "BIC") ## Not run: theta2 <- fit.full.GMCM(u, m = 3, method = "L-BFGS") goodness.of.fit(theta2, u) goodness.of.fit(theta2, u, method = "BIC") ## End(Not run)
set.seed(2) data(u133VsExon) u <- Uhat(u133VsExon[sample(19577, 500), ]) # Subset for faster fitting theta1 <- fit.full.GMCM(u, m = 2, method = "L-BFGS") goodness.of.fit(theta1, u) # AIC goodness.of.fit(theta1, u, method = "BIC") ## Not run: theta2 <- fit.full.GMCM(u, m = 3, method = "L-BFGS") goodness.of.fit(theta2, u) goodness.of.fit(theta2, u, method = "BIC") ## End(Not run)
Function to check whether the argument is coherent and in the correct format.
is.theta(theta, check.class = TRUE)
is.theta(theta, check.class = TRUE)
theta |
A list on the |
check.class |
Logical. If |
logical
. Returns TRUE
if theta
is coherent and
in the correct format. Otherwise, the function returns FALSE
with
an accompanying warning message of the problem.
Anders Ellern Bilgrau <[email protected]>
theta1 <- rtheta() # Create a random correctly formatted theta is.theta(theta1) theta2 <- rtheta(d = 3, m = 5) theta2$m <- 6 # m is now incoherent with the number of components is.theta(theta2) theta3 <- rtheta(d = 4, m = 2) theta3$sigma$comp1[1, 2] <- 0 # Making the covariance matrix non-symmetric is.theta(theta3) theta4 <- rtheta(d = 10, m = 10) theta4$sigma$comp1[1, 1] <- 0 # Destroy positive semi-definiteness is.theta(theta4) theta5 <- rtheta() names(theta5) <- c("m", "d", "prop", "mu", "sigmas") # Incorrect names is.theta(theta5)
theta1 <- rtheta() # Create a random correctly formatted theta is.theta(theta1) theta2 <- rtheta(d = 3, m = 5) theta2$m <- 6 # m is now incoherent with the number of components is.theta(theta2) theta3 <- rtheta(d = 4, m = 2) theta3$sigma$comp1[1, 2] <- 0 # Making the covariance matrix non-symmetric is.theta(theta3) theta4 <- rtheta(d = 10, m = 10) theta4$sigma$comp1[1, 1] <- 0 # Destroy positive semi-definiteness is.theta(theta4) theta5 <- rtheta() names(theta5) <- c("m", "d", "prop", "mu", "sigmas") # Incorrect names is.theta(theta5)
Visualizes the chosen dimensions of the theta object graphically by the GMM density and possibly the individual gaussian components.
## S3 method for class 'theta' plot(x, which.dims = c(1L, 2L), n.sd = qnorm(0.99), add.means = TRUE, ..., add.ellipses = FALSE)
## S3 method for class 'theta' plot(x, which.dims = c(1L, 2L), n.sd = qnorm(0.99), add.means = TRUE, ..., add.ellipses = FALSE)
x |
An object of class |
which.dims |
An integer vector of length 2 choosing which two dimensions to plot. |
n.sd |
An integer choosing the number of standard deviations in each dimension to determine the plotting window. |
add.means |
logical. If TRUE, dots corresponding to the means are added to the plot. |
... |
Arguments passed to |
add.ellipses |
logical. If TRUE, ellipses outlining a 95% confidence
regions for each component are added in the bivariate multivariate
distribution defined by theta and |
Plots via the contour
function. Invisibly returns a list with
x, y, z coordinates that is passed to contour.
Anders Ellern Bilgrau <[email protected]>
set.seed(5) theta <- rtheta(d = 3, m = 2) ## Not run: plot(theta) plot(theta, col = "blue", asp = 1, add.means = FALSE) plot(theta, col = "blue", asp = 1, add.means = TRUE) plot(theta, which.dims = c(3L, 2L), asp = 1) ## End(Not run) plot(theta, asp = 1, n.sd = 3, add.ellipses = TRUE, nlevels = 40, axes = FALSE, xlab = "Dimension 1", ylab = "Dimension 2")
set.seed(5) theta <- rtheta(d = 3, m = 2) ## Not run: plot(theta) plot(theta, col = "blue", asp = 1, add.means = FALSE) plot(theta, col = "blue", asp = 1, add.means = TRUE) plot(theta, which.dims = c(3L, 2L), asp = 1) ## End(Not run) plot(theta, asp = 1, n.sd = 3, add.ellipses = TRUE, nlevels = 40, axes = FALSE, xlab = "Dimension 1", ylab = "Dimension 2")
Print method for theta class
## S3 method for class 'theta' print(x, ...)
## S3 method for class 'theta' print(x, ...)
x |
A |
... |
Arguments to be passed to subsequent methods. |
Invisibly returns x
.
Anders Ellern Bilgrau <[email protected]>
theta <- rtheta() print(theta)
theta <- rtheta() print(theta)
Generate a random set parameters for the Gaussian mixture
model (GMM) and Gaussian mixture copula model (GMCM). Primarily, it provides
an easy prototype of the theta
-format used in GMCM.
rtheta(m = 3, d = 2, method = c("old", "EqualSpherical", "UnequalSpherical", "EqualEllipsoidal", "UnequalEllipsoidal"))
rtheta(m = 3, d = 2, method = c("old", "EqualSpherical", "UnequalSpherical", "EqualEllipsoidal", "UnequalEllipsoidal"))
m |
The number of components in the mixture. |
d |
The dimension of the mixture distribution. |
method |
The method by which the theta should be generated.
See details. Defaults to |
Depending on the method
argument the parameters are generated as
follows. The new behavior is inspired by the simulation scenarios in
Friedman (1989) but not exactly the same.
pie
is generated by draws of a chi-squared distribution with
degrees of freedom divided by their sum. If
method = "old"
the uniform distribution is used instead.
mu
is generated by i.i.d.
-dimensional zero-mean
normal vectors with covariance matrix
100I
.
(unchanged from the old behavior)
sigma
is dependent on method
. The covariance matrices for each
component are generated as follows. If the method
is
"EqualSpherical"
, then the covariance matrices are the
identity matrix and thus are all equal and spherical.
"UnequalSpherical"
, then the covariance matrices are
scaled identity matrices. In component , the covariance
matrix is
"EqualEllipsoidal"
, then highly elliptical covariance
matrices which equal for all components are used.
The square root of the eigenvalues are chosen
equidistantly on
the interval
to
and a randomly (uniformly)
oriented orthonormal basis is chosen and used for all
components.
"UnqualEllipsoidal"
, then highly elliptical covariance
matrices different for all components are used.
The eigenvalues of the covariance matrices equal as in all
components as in "EqualEllipsoidal"
. However, they are all
randomly (uniformly) oriented (unlike as described in
Friedman (1989)).
"old"
, then the old behavior is used.
The old behavior differs from "EqualEllipsoidal"
by using
the absolute value of zero-mean i.i.d. normal
eigenvalues with a standard deviation of 8.
In all cases, the orientation is selected uniformly.
A named list of parameters with the 4 elements:
m |
An integer giving the number of components in the mixture. Default is 3. |
d |
An integer giving the dimension of the mixture distribution. Default is 2. |
pie |
A numeric vector of length |
mu |
A |
sigma |
A |
The function is.theta
checks whether or not theta
is in the correct format.
Anders Ellern Bilgrau <[email protected]>
Friedman, Jerome H. "Regularized discriminant analysis." Journal of the American statistical association 84.405 (1989): 165-175.
rtheta() rtheta(d = 5, m = 2) rtheta(d = 3, m = 2, method = "EqualEllipsoidal") test <- rtheta() is.theta(test) summary(test) print(test) plot(test) ## Not run: A <- SimulateGMMData(n = 100, rtheta(d = 2, method = "EqualSpherical")) plot(A$z, col = A$K, pch = A$K, asp = 1) B <- SimulateGMMData(n = 100, rtheta(d = 2, method = "UnequalSpherical")) plot(B$z, col = B$K, pch = B$K, asp = 1) C <- SimulateGMMData(n = 100, rtheta(d = 2, method = "EqualEllipsoidal")) plot(C$z, col = C$K, pch = C$K, asp = 1) D <- SimulateGMMData(n = 100, rtheta(d = 2, method = "UnequalEllipsoidal")) plot(D$z, col = D$K, pch = D$K, asp = 1) ## End(Not run)
rtheta() rtheta(d = 5, m = 2) rtheta(d = 3, m = 2, method = "EqualEllipsoidal") test <- rtheta() is.theta(test) summary(test) print(test) plot(test) ## Not run: A <- SimulateGMMData(n = 100, rtheta(d = 2, method = "EqualSpherical")) plot(A$z, col = A$K, pch = A$K, asp = 1) B <- SimulateGMMData(n = 100, rtheta(d = 2, method = "UnequalSpherical")) plot(B$z, col = B$K, pch = B$K, asp = 1) C <- SimulateGMMData(n = 100, rtheta(d = 2, method = "EqualEllipsoidal")) plot(C$z, col = C$K, pch = C$K, asp = 1) D <- SimulateGMMData(n = 100, rtheta(d = 2, method = "UnequalEllipsoidal")) plot(D$z, col = D$K, pch = D$K, asp = 1) ## End(Not run)
Function for starting a local instance of the GMCM shiny application. The online application is found at https://gmcm.shinyapps.io/GMCM/.
runGMCM(...)
runGMCM(...)
... |
Arguments passed to |
Retuns nothing (usually). See runApp
.
Exit or stop the app by interrupting R.
## Not run: runGMCM() runGMCM(launch.browser = FALSE, port = 1111) # Open browser and enter URL http://127.0.0.1:1111/ ## End(Not run)
## Not run: runGMCM() runGMCM(launch.browser = FALSE, port = 1111) # Open browser and enter URL http://127.0.0.1:1111/ ## End(Not run)
Easy and fast simulation of data from Gaussian mixture copula models (GMCM) and Gaussian mixture models (GMM).
SimulateGMCMData(n = 1000, par, d = 2, theta, ...) SimulateGMMData(n = 1000, theta = rtheta(...), ...)
SimulateGMCMData(n = 1000, par, d = 2, theta, ...) SimulateGMMData(n = 1000, theta = rtheta(...), ...)
n |
A single integer giving the number of realizations (observations) drawn from the model. Default is 1000. |
par |
A vector of parameters of length 4 where |
d |
The number of dimensions (or, equivalently, experiments) in the mixture distribution. |
theta |
A list of parameters for the model as described in
|
... |
In |
The functions provide simulation of observations and
-dimensional GMCMs and GMMs with provided parameters.
The
par
argument specifies the parameters of the Li et. al. (2011)
GMCM. The theta
argument specifies an arbitrary GMCM of
Tewari et. al. (2011). Either one can be supplied. If both are missing,
random parameters are chosen for the general model.
SimulateGMCMData
returns a list of length 4 with elements:
u |
A matrix of the realized values of the GMCM. |
z |
A matrix of the latent GMM realizations. |
K |
An integer vector denoting the component from which the realization comes. |
theta |
A list containing the used parameters for the simulations
with the format described in |
SimulateGMMData
returns a list of length 3 with elements:
z |
A matrix of GMM realizations. |
K |
An integer vector denoting the component from which the realization comes. |
theta |
As above and in |
Anders Ellern Bilgrau <[email protected]>
Li, Q., Brown, J. B. J. B., Huang, H., & Bickel, P. J. (2011). Measuring reproducibility of high-throughput experiments. The Annals of Applied Statistics, 5(3), 1752-1779. doi:10.1214/11-AOAS466
Tewari, A., Giering, M. J., & Raghunathan, A. (2011). Parametric Characterization of Multimodal Distributions with Non-gaussian Modes. 2011 IEEE 11th International Conference on Data Mining Workshops, 286-292. doi:10.1109/ICDMW.2011.135
set.seed(2) # Simulation from the GMM gmm.data1 <- SimulateGMMData(n = 200, m = 3, d = 2) str(gmm.data1) # Plotting the simulated data plot(gmm.data1$z, col = gmm.data1$K) # Simulation from the GMCM gmcm.data1 <- SimulateGMCMData(n = 1000, m = 4, d = 2) str(gmcm.data1) # Plotthe 2nd simulation par(mfrow = c(1,2)) plot(gmcm.data1$z, col = gmcm.data1$K) plot(gmcm.data1$u, col = gmcm.data1$K) # Simulation from the special case of GMCM theta <- meta2full(c(0.7, 2, 1, 0.7), d = 3) gmcm.data2 <- SimulateGMCMData(n = 5000, theta = theta) str(gmcm.data2) # Plotting the 3rd simulation par(mfrow=c(1,2)) plot(gmcm.data2$z, col = gmcm.data2$K) plot(gmcm.data2$u, col = gmcm.data2$K)
set.seed(2) # Simulation from the GMM gmm.data1 <- SimulateGMMData(n = 200, m = 3, d = 2) str(gmm.data1) # Plotting the simulated data plot(gmm.data1$z, col = gmm.data1$K) # Simulation from the GMCM gmcm.data1 <- SimulateGMCMData(n = 1000, m = 4, d = 2) str(gmcm.data1) # Plotthe 2nd simulation par(mfrow = c(1,2)) plot(gmcm.data1$z, col = gmcm.data1$K) plot(gmcm.data1$u, col = gmcm.data1$K) # Simulation from the special case of GMCM theta <- meta2full(c(0.7, 2, 1, 0.7), d = 3) gmcm.data2 <- SimulateGMCMData(n = 5000, theta = theta) str(gmcm.data2) # Plotting the 3rd simulation par(mfrow=c(1,2)) plot(gmcm.data2$z, col = gmcm.data2$K) plot(gmcm.data2$u, col = gmcm.data2$K)
Summary method for theta class
## S3 method for class 'theta' summary(object, ...)
## S3 method for class 'theta' summary(object, ...)
object |
A |
... |
arguments to be passed to subsequent methods |
Invisibly returns x
.
Anders Ellern Bilgrau <[email protected]>
theta <- rtheta() summary(theta)
theta <- rtheta() summary(theta)
This dataset contains a data.frame
of unadjusted P-values for
differential expression between germinal center cells and other B-cells
within tonsils for two different experiments. The experiments differ
primarily in the microarray platform used. The first column corresponds the
evidence from the Affymetrix GeneChip Human Genome U133 Plus 2.0 Array.
The second column corresponds to the Affymetrix GeneChip Human Exon 1.0 ST
Array.
The format of the data.frame
is:
'data.frame': 19577 obs. of 2 variables:
$ u133: num 0.17561 0.00178 0.005371 0.000669 0.655261 ...
$ exon: num 1.07e-01 6.74e-10 1.51e-03 6.76e-05 3.36e-01 ...
Further details can be found in Bergkvist et al. (2014) and Rasmussen and Bilgrau et al. (2014).
Anders Ellern Bilgrau <[email protected]>
Bergkvist, Kim Steve, Mette Nyegaard, Martin Boegsted, Alexander Schmitz, Julie Stoeve Boedker, Simon Mylius Rasmussen, Martin Perez-Andres et al. (2014). "Validation and Implementation of a Method for Microarray Gene Expression Profiling of Minor B-Cell Subpopulations in Man". BMC immunology, 15(1), 3.
Rasmussen SM, Bilgrau AE, Schmitz A, Falgreen S, Bergkvist KS, Tramm AM, Baech J, Jacobsen CL, Gaihede M, Kjeldsen MK, Boedker JS, Dybkaer K, Boegsted M, Johnsen HE (2015). "Stable Phenotype Of B-Cell Subsets Following Cryopreservation and Thawing of Normal Human Lymphocytes Stored in a Tissue Biobank." Cytometry Part B: Clinical Cytometry, 88(1), 40-49.
data(u133VsExon) str(u133VsExon) # Plot P-values plot(u133VsExon, cex = 0.5) # Plot ranked and scaled P-values plot(Uhat(1-u133VsExon), cex = 0.5)
data(u133VsExon) str(u133VsExon) # Plot P-values plot(u133VsExon, cex = 0.5) # Plot ranked and scaled P-values plot(Uhat(1-u133VsExon), cex = 0.5)
Function for computing the scaled ranks for each column of the input matrix.
In other words, the values are ranked column-wise and divided by
nrow(x) + 1
. A "1334" ranking scheme is used where the lowest values
is awarded rank 1, second lowest value rank 2, and ties are given the
maximum available rank.
Uhat(x)
Uhat(x)
x |
A |
A matrix
with the same dimensions as x
of the scaled
ranks.
Anders Ellern Bilgrau <[email protected]>
SimulateGMMData
, SimulateGMCMData
data <- SimulateGMMData() par(mfrow = c(1,2)) plot(data$z, xlab = expression(z[1]), ylab = expression(z[2])) plot(Uhat(data$z), xlab = expression(hat(u)[1]), ylab = expression(hat(u)[2]))
data <- SimulateGMMData() par(mfrow = c(1,2)) plot(data$z, xlab = expression(z[1]), ylab = expression(z[2])) plot(Uhat(data$z), xlab = expression(hat(u)[1]), ylab = expression(hat(u)[2]))