Title: | Poisson Hurdle Clustering for Sparse Microbiome Data |
---|---|
Description: | Clustering analysis for sparse microbiome data, based on a Poisson hurdle model. |
Authors: | Zhili Qiao |
Maintainer: | Zhili Qiao <[email protected]> |
License: | GPL-3 |
Version: | 0.1.0 |
Built: | 2024-12-14 06:44:18 UTC |
Source: | CRAN |
This function estimates the optimal number of clusters for a given dataset.
Hybrid(data, absolute = FALSE, Kstart = NULL, Treatment)
Hybrid(data, absolute = FALSE, Kstart = NULL, Treatment)
data |
Data matrix with dimension N*P indicating N features and P samples. |
absolute |
Logical. Whether we should use absolute (TRUE) or relative (FALSE) abundance of features to determine clusters. |
Kstart |
Positive integer. The number of clusters for starting the hybrid merging algorithm. Should be relatively large to ensure that Kstart > optimal number of clusters. Uses max(50, sqrt(N)) by default. |
Treatment |
Vector of length p, indicating replicates of different treatment groups. For example, Treatment = c(1,1,2,2,3,3) indicates 3 treatment groups, each with 2 replicates. |
A positive integer indicating the optimal number of clusters
######## Run the following codes in order: ## ## This is a sample data set which has 100 features, and 4 treatment groups with 4 replicates each. data('sample_data') head(sample_data) set.seed(1) ## ## Finding the optimal number of clusters K <- Hybrid(sample_data, Kstart = 4, Treatment = rep(c(1,2,3,4), each = 4)) ## ## Clustering result from EM algorithm result <- PHcluster(sample_data, rep(c(1,2,3,4), each = 4), K, method = 'EM', nstart = 1) print(result$cluster) ## ## Plot the feature abundance level for each cluster plot_abundance(result, sample_data, Treatment = rep(c(1,2,3,4), each = 4))
######## Run the following codes in order: ## ## This is a sample data set which has 100 features, and 4 treatment groups with 4 replicates each. data('sample_data') head(sample_data) set.seed(1) ## ## Finding the optimal number of clusters K <- Hybrid(sample_data, Kstart = 4, Treatment = rep(c(1,2,3,4), each = 4)) ## ## Clustering result from EM algorithm result <- PHcluster(sample_data, rep(c(1,2,3,4), each = 4), K, method = 'EM', nstart = 1) print(result$cluster) ## ## Plot the feature abundance level for each cluster plot_abundance(result, sample_data, Treatment = rep(c(1,2,3,4), each = 4))
This function gives the clustering result based on a Poisson hurdle model.
PHcluster( data, Treatment, nK, method = c("EM", "SA"), absolute = FALSE, cool = 0.9, nstart = 1 )
PHcluster( data, Treatment, nK, method = c("EM", "SA"), absolute = FALSE, cool = 0.9, nstart = 1 )
data |
Data matrix with dimension N*P indicating N features and P samples. The cluster analysis is done feature-wised. |
Treatment |
Vector of length P. Indicating replicates of different treatment groups. For example, Treatment = c(1,1,2,2,3,3) indicates 3 treatment groups, each with 2 replicates. |
nK |
Positive integer. Number of clusters. |
method |
Method for the algorithm. Can choose between "EM" as Expectation Maximization or "SA" as Simulated Annealing. |
absolute |
Logical. Whether we should use absolute (TRUE) or relative (False) abundance of features to determine clusters. |
cool |
Real number between (0, 1). Cooling rate for the "SA" algorithm. Uses 0.9 by default. |
nstart |
Positive integer. Number of starts for the entire algorithm. Note that as nstart increases the computational time also grows linearly. Uses 1 by default. |
Vector of length N consisting of integers from 1 to nK. Indicating final clustering result. For evaluating the clustering result please check NMI for Normalized Mutual Information.
N*nK matrix. The (i, j)th element representing the probability that observation i belongs to cluster j.
Scaler. The Poisson hurdle log-likelihood of the final clustering result.
Vector of length N. The geometric mean abundance level for each feature, across all treatment groups.
vector of length P. The normalizing constant of sequencing depth for each sample.
######## Run the following codes in order: ## ## This is a sample data set which has 100 features, and 4 treatment groups with 4 replicates each. data('sample_data') head(sample_data) set.seed(1) ## ## Finding the optimal number of clusters K <- Hybrid(sample_data, Kstart = 4, Treatment = rep(c(1,2,3,4), each = 4)) ## ## Clustering result from EM algorithm result <- PHcluster(sample_data, rep(c(1,2,3,4), each = 4), K, method = 'EM', nstart = 1) print(result$cluster) ## ## Plot the feature abundance level for each cluster plot_abundance(result, sample_data, Treatment = rep(c(1,2,3,4), each = 4))
######## Run the following codes in order: ## ## This is a sample data set which has 100 features, and 4 treatment groups with 4 replicates each. data('sample_data') head(sample_data) set.seed(1) ## ## Finding the optimal number of clusters K <- Hybrid(sample_data, Kstart = 4, Treatment = rep(c(1,2,3,4), each = 4)) ## ## Clustering result from EM algorithm result <- PHcluster(sample_data, rep(c(1,2,3,4), each = 4), K, method = 'EM', nstart = 1) print(result$cluster) ## ## Plot the feature abundance level for each cluster plot_abundance(result, sample_data, Treatment = rep(c(1,2,3,4), each = 4))
This function plots the feature abundance level for each cluster, after extracting the effect of sample-wise normalization factors and feature-wise geometric mean.
plot_abundance(result, data, Treatment)
plot_abundance(result, data, Treatment)
result |
Clustering result from function PHclust(). |
data |
Data matrix with dimension N*P indicating N features and P samples. |
Treatment |
Vector of length P. Indicating replicates of different treatment groups. For example, Treatment = c(1,1,2,2,3,3) indicates 3 treatment groups, each with 2 replicates. |
A plot for feature abundance level will be shown. No value is returned.
######## Run the following codes in order: ## ## This is a sample data set which has 100 features, and 4 treatment groups with 4 replicates each. data('sample_data') head(sample_data) set.seed(1) ## ## Finding the optimal number of clusters K <- Hybrid(sample_data, Kstart = 4, Treatment = rep(c(1,2,3,4), each = 4)) ## ## Clustering result from EM algorithm result <- PHcluster(sample_data, rep(c(1,2,3,4), each = 4), K, method = 'EM', nstart = 1) print(result$cluster) ## ## Plot the feature abundance level for each cluster plot_abundance(result, sample_data, Treatment = rep(c(1,2,3,4), each = 4))
######## Run the following codes in order: ## ## This is a sample data set which has 100 features, and 4 treatment groups with 4 replicates each. data('sample_data') head(sample_data) set.seed(1) ## ## Finding the optimal number of clusters K <- Hybrid(sample_data, Kstart = 4, Treatment = rep(c(1,2,3,4), each = 4)) ## ## Clustering result from EM algorithm result <- PHcluster(sample_data, rep(c(1,2,3,4), each = 4), K, method = 'EM', nstart = 1) print(result$cluster) ## ## Plot the feature abundance level for each cluster plot_abundance(result, sample_data, Treatment = rep(c(1,2,3,4), each = 4))
A sample data matrix with 100 features in 2 true clusters, 4 treatment groups with 4 replicates in each group.
sample_data
sample_data
The dataset contains 16 columns, indexed as A1 ~ A4, B1 ~ B4, C1 ~ C4, D1 ~ D4 to represent 4 treatment groups.
head(sample_data)
head(sample_data)