Title: | Implementation of SparseMDC Algorithm |
---|---|
Description: | Implements the algorithm described in Barron, M., and Li, J. (Not yet published). This algorithm clusters samples from multiple ordered populations, links the clusters across the conditions and identifies marker genes for these changes. The package was designed for scRNA-Seq data but is also applicable to many other data types, just replace cells with samples and genes with variables. The package also contains functions for estimating the parameters for SparseMDC as outlined in the paper. We recommend that users further select their marker genes using the magnitude of the cluster centers. |
Authors: | Martin Barron [aut], Jun Li [aut, cre] |
Maintainer: | Jun Li <[email protected]> |
License: | GPL-3 |
Version: | 0.99.5 |
Built: | 2024-12-04 07:13:48 UTC |
Source: | CRAN |
The cell type of each of the cells in the Biase data.
cell_type_biase
cell_type_biase
An R.Data object containing a vector with the cell type of each of the cells in the Biase Data.
The condition for each sample in the Biase data. To be used when splitting the data to demonstrate SparseDC.
condition_biase
condition_biase
An R.Data object containing a vector with the conditon of the 49 cells in the Biase data.
This dataset was created by Biase et al. to study cell fat inclination in mouse embryos. It contains FPKM gene expression measurements for 49 cells and 16,514 genes. There are three cell types in the dataset, zygote, two-cell embryo, and four-cell embryo cells.
data_biase
data_biase
An R.Data object storing FPKM gene expression measurements for each of the samples.
Uniform data generator For use with the gap statistic. Generates datasets drawn from the reference distribution where each reference feature is generated uniformly over the range of observed values for that feature.
generate_uni_dat(data)
generate_uni_dat(data)
data |
A dataset with rows as features and columns as samples. |
A dataset drawn from the reference distribution for use internally with the gap statistic.
Calculates the lambda 1 value for the SparseMDC algorithm. The lambda 1 value controls the number of marker genes selected for each cluster in the output from SparseMDC. It is calculated as the value of lambda 1 that results in no marker genes being selected when then are no meaningful clusters present in the data. Please see the original manuscript for further details.
lambda1_calculator(dat_l, dim, nclust, nboot = 1000, alpha1 = 0.5, delta = 1e-07)
lambda1_calculator(dat_l, dim, nclust, nboot = 1000, alpha1 = 0.5, delta = 1e-07)
dat_l |
list with D entries, each entry contains data d, p * n matrix. This data should be centered and log-transformed. |
dim |
Total number of conditions, D. |
nclust |
Total number of clusters. |
nboot |
The number of bootstrap repetitions used for estimating lambda 1, the default value is 1000. |
alpha1 |
The quantile of the bootstrapped lambda 1 values to use, range is (0,1). The default value is 0.5, the median of the calculated lambda 1 values. |
delta |
Small value term added to ensure existance, default value is 0.0000001. |
The estimated value of lambda1 for use in main SparseMDC algorithm
set.seed(10) # Select small dataset for example data_test <- data_biase[1:100,] # Split data into condition A and B data_A <- data_test[ , which(condition_biase == "A")] data_B <- data_test[ , which(condition_biase == "B")] data_C <- data_test[ , which(condition_biase == "C")] # Store data as list dat_l <- list(data_A, data_B, data_C) # Pre-process the data pdat <- pre_proc_data(dat_l, dim=3, norm = FALSE, log = TRUE, center = TRUE) lambda1 <- lambda1_calculator(pdat, dim = 3, nclust = 3 )
set.seed(10) # Select small dataset for example data_test <- data_biase[1:100,] # Split data into condition A and B data_A <- data_test[ , which(condition_biase == "A")] data_B <- data_test[ , which(condition_biase == "B")] data_C <- data_test[ , which(condition_biase == "C")] # Store data as list dat_l <- list(data_A, data_B, data_C) # Pre-process the data pdat <- pre_proc_data(dat_l, dim=3, norm = FALSE, log = TRUE, center = TRUE) lambda1 <- lambda1_calculator(pdat, dim = 3, nclust = 3 )
Calculates the lambda 2 values for use in the main SparseMDC algorithm, the lambda 2 value controls the number of genes that show condition-dependent expression within each cell type. That is it controls the number of different mean values across the conditions for each cluster. It is calculated by estimating the value of lambda 2 that would result in no difference in mean values across conditions when there are no meaningful differences across between the conditions. For further details please see the original manuscript.
lambda2_calculator(dat_l, dim, nclust, nboot = 1000, alpha2 = 0.5, delta = 1e-07, lambda1)
lambda2_calculator(dat_l, dim, nclust, nboot = 1000, alpha2 = 0.5, delta = 1e-07, lambda1)
dat_l |
list with D entries, each entry contains data d, p * n matrix. This data should be centered and log-transformed. |
dim |
Total number of conditions, D. |
nclust |
Total number of clusters. |
nboot |
The number of bootstrap repetitions for estimating lambda 2, the default value is 1000. |
alpha2 |
The quantile of the bootstrapped lambda 2 values to use, range is (0,1). The default value is 0.5, the median of the calculated lambda 2 values. |
delta |
Small term to ensure existance of solution, default is 0.0000001. |
lambda1 |
Calcualted penalty parameter for mean size. |
The estimated value of lambda2
set.seed(10) # Select small dataset for example data_test <- data_biase[1:100,] # Split data into condition A and B data_A <- data_test[ , which(condition_biase == "A")] data_B <- data_test[ , which(condition_biase == "B")] data_C <- data_test[ , which(condition_biase == "C")] # Store data as list dat_l <- list(data_A, data_B, data_C) # Pre-process the data pdat <- pre_proc_data(dat_l, dim=3, norm = FALSE, log = TRUE, center = TRUE) lambda1 <- lambda1_calculator(pdat, dim = 3, nclust = 3) lambda2 <- lambda2_calculator(pdat, dim = 3, nclust = 3, lambda1 = lambda1)
set.seed(10) # Select small dataset for example data_test <- data_biase[1:100,] # Split data into condition A and B data_A <- data_test[ , which(condition_biase == "A")] data_B <- data_test[ , which(condition_biase == "B")] data_C <- data_test[ , which(condition_biase == "C")] # Store data as list dat_l <- list(data_A, data_B, data_C) # Pre-process the data pdat <- pre_proc_data(dat_l, dim=3, norm = FALSE, log = TRUE, center = TRUE) lambda1 <- lambda1_calculator(pdat, dim = 3, nclust = 3) lambda2 <- lambda2_calculator(pdat, dim = 3, nclust = 3, lambda1 = lambda1)
This function handles the calculations of the Mu Solver. This function
runs inside sparse_mdc
.
mu_calc(d, v, EQ, dim, x, nk, p1, p2, delta)
mu_calc(d, v, EQ, dim, x, nk, p1, p2, delta)
d |
- The current condition. |
v |
- Relationship matrix, describing relationship between d-1 and d. |
EQ |
- Equality matrix specifying the number of following terms to which mu_d is equal. |
dim |
Total number of conditions, D. |
x |
- Matrix of mean values. |
nk |
- Vector with the number of samples in each dimension for this cluster. |
p1 |
- Penalties on mean size. |
p2 |
- Penalties on mean difference. |
delta |
Small term to ensure existance of solution. |
A list containing two vectors containing the calculated values of mu_d | mu_d < mu_d-1 and mu_d | mu_d > mu_d-1 respecitively.
Calculates the regularized center values for a cluster. This function
runs inside sparse_mdc
.
mu_solver(d, mu, v, EQ, dim, x, nk, p1, p2, delta)
mu_solver(d, mu, v, EQ, dim, x, nk, p1, p2, delta)
d |
- The current condition. |
mu |
- A matrix of regularized mean values. |
v |
- Relationship matrix, describing relationship between d-1 and d. |
EQ |
- Equality matrix specifying the number of following terms to which mu_d is equal. |
dim |
- Total dimensions. |
x |
- Matrix of mean values. |
nk |
- Vector with the number of samples in each condition for this cluster. |
p1 |
- Penalties on mean size. |
p2 |
- Penalties on mean difference. |
delta |
Small term to ensure existance of solution. |
A matrix containing the regularized cluster means for each dimension.
Calcualtes the penalty terms for penalizing mean size and mean difference.
This function runs inside sparse_mdc
.
pen_calculator(lambda1, lambda2, nk, delta)
pen_calculator(lambda1, lambda2, nk, delta)
lambda1 |
Calculated penalty parameter for mean size. |
lambda2 |
Calculated penalty parameter for mean difference. |
nk |
Vector containing the number of samples in each dimension. |
delta |
Small term to ensure existance of solution, default is 0.0000001. |
a list with two vectors containing the penalty terms for mean size and mean difference respectively.
This function centers on a gene-by-gene basis, normalizes and/or log
transforms the data prior to the application of SparseMDC.For the
sequencing depth normalization we recommend that users use one of the many
methods developed for normalizing scRNA-Seq data prior to using SparseMDC and
so can set norm = FALSE
. However, here we normalize the data by
dividing by the total number of reads. This function log transforms the data
by applying log(x + 1)
to each of the data sets. By far the most
important pre-processing step for SparseMDC is the centralization of the
data. Having centralized data is a core component of the SparseMDC algorithm
and is necessary for both accurate clustering of the cells and identifying
marker genes. We therefore recommend that all users centralize their data
using this function and that only experienced users set center = FALSE
.
pre_proc_data(dat_l, dim, norm = FALSE, log = TRUE, center = TRUE)
pre_proc_data(dat_l, dim, norm = FALSE, log = TRUE, center = TRUE)
dat_l |
list with D entries, each entry contains data d, p * n matrix. The entries should be ordered according the condition of each dataset. The rows of the data matrix should contain samples while the columns contain features or genes. |
dim |
Total number of conditions, D. |
norm |
True/False on if the data should be normalized. This parameter
controls whether the data is normalized for sequencing depth by dividing
each column by the total number of reads for that sample. We recommend that
user use one of the many methods for normalizing scRNA-Seq data and so set
this as |
log |
True/False of if the data should be transformed as log(x + 1).
The default value is |
center |
This parameter controls whether the data is centered on a gene
by gene basis. We recommend all users center their data prior to applying
SparseMDC and only experienced users should set this as |
A list with D entries containing the pre-processed data.
set.seed(10) # Select small dataset for example data_test <- data_biase[1:100,] # Split data into condition A and B data_A <- data_test[ , which(condition_biase == "A")] data_B <- data_test[ , which(condition_biase == "B")] data_C <- data_test[ , which(condition_biase == "C")] # Store data as list dat_l <- list(data_A, data_B, data_C) # Pre-process the data pdat <- pre_proc_data(dat_l, dim=3, norm = FALSE, log = TRUE, center = TRUE)
set.seed(10) # Select small dataset for example data_test <- data_biase[1:100,] # Split data into condition A and B data_A <- data_test[ , which(condition_biase == "A")] data_B <- data_test[ , which(condition_biase == "B")] data_C <- data_test[ , which(condition_biase == "C")] # Store data as list dat_l <- list(data_A, data_B, data_C) # Pre-process the data pdat <- pre_proc_data(dat_l, dim=3, norm = FALSE, log = TRUE, center = TRUE)
Function to solve the soft thresholding problem
S_func(x, a)
S_func(x, a)
x |
The data value |
a |
The lambda value |
The solution to the soft thresholding operator.
Calculates the score for each combination of cluster assignments
and center values. This function runs inside sparse_mdc
.
score_calc(pdat, clusters, mu, lambda1, lambda2, nclust, delta, dim)
score_calc(pdat, clusters, mu, lambda1, lambda2, nclust, delta, dim)
pdat |
list with D entries, each entry contains data d, p * n matrix. This data should be centered and log-transformed. |
clusters |
List containig cluster assignments for each dimension as entries. |
mu |
list with D entries, each entry contains centers for data d, p*k matrix. |
lambda1 |
Penalty parameter for mean size. |
lambda2 |
Penalty parameter for mean difference. |
nclust |
Number of clusters in the data. |
delta |
Small term to ensure existance of solution, default is 0.0000001. |
dim |
Total number of conditions, D. |
The caluculated score.
Applies sparse clustering to data from multiple conditions, linking the clusters across conditions and selecting a set of marker variables for each cluster and condition. This is a wrapper function to run SparseMDC in parallel and choose the solution with the mimimum score for each run.
sdc_mpar(pdat, nclust, dim, lambda1, lambda2, nitter = 20, nstarts = 50, init_iter = 5, delta = 1e-07, par_starts, cores)
sdc_mpar(pdat, nclust, dim, lambda1, lambda2, nitter = 20, nstarts = 50, init_iter = 5, delta = 1e-07, par_starts, cores)
pdat |
list with D entries, each entry contains data d, p * n matrix. This data should be centered and log-transformed. |
nclust |
Number of clusters in the data. |
dim |
Total number of conditions, D. |
lambda1 |
The lambda 1 value to use in the SparseMDC function. This value controls the number of marker genes detected for each of the clusters in the final result. This can be calculated using the "lambda1_calculator" function or supplied by the user. |
lambda2 |
The lambda 2 value to use in the SparseMDC function. This value controls the number of genes that show condition-dependent expression within each cell type. This can be calculated using the "lambda2_calculator" function or supplied by the user. |
nitter |
The max number of iterations for each of the start values, the default value is 20. |
nstarts |
The max number of possible starts. The default value is 50. |
init_iter |
The number of iterations used to initialize the algorithm. Higher values result in less starts but more accurate and vice versa. Default is 5. |
delta |
Small term to ensure existance of solution, default is 0.0000001. |
par_starts |
Number of parallel starts. |
cores |
Number of cores to use. |
A list containing cluster assignments, center values and the scores for each start.
set.seed(10) # Select small dataset for example data_test <- data_biase[1:100,] # Split data into condition A and B data_A <- data_test[ , which(condition_biase == "A")] data_B <- data_test[ , which(condition_biase == "B")] data_C <- data_test[ , which(condition_biase == "C")] # Store data as list dat_l <- list(data_A, data_B, data_C) # Pre-process the data pdat <- pre_proc_data(dat_l, dim=3, norm = FALSE, log = TRUE, center = TRUE) # Calculate lambda1 lambda1 <- lambda1_calculator(pdat, dim = 3, nclust = 3) # Calcualte lambda2 lambda2 <- lambda2_calculator(pdat, dim = 3, nclust = 3, lambda1 = lambda1) # Prepare parallel enviornment library(doParallel) # Load package library(foreach) # Load the package library(doRNG) # Apply SparseMDC smdc_res <- sdc_mpar(pdat, nclust = 3, dim = 3, lambda1 = lambda1, lambda2 = lambda2, par_starts = 2, cores = 2)
set.seed(10) # Select small dataset for example data_test <- data_biase[1:100,] # Split data into condition A and B data_A <- data_test[ , which(condition_biase == "A")] data_B <- data_test[ , which(condition_biase == "B")] data_C <- data_test[ , which(condition_biase == "C")] # Store data as list dat_l <- list(data_A, data_B, data_C) # Pre-process the data pdat <- pre_proc_data(dat_l, dim=3, norm = FALSE, log = TRUE, center = TRUE) # Calculate lambda1 lambda1 <- lambda1_calculator(pdat, dim = 3, nclust = 3) # Calcualte lambda2 lambda2 <- lambda2_calculator(pdat, dim = 3, nclust = 3, lambda1 = lambda1) # Prepare parallel enviornment library(doParallel) # Load package library(foreach) # Load the package library(doRNG) # Apply SparseMDC smdc_res <- sdc_mpar(pdat, nclust = 3, dim = 3, lambda1 = lambda1, lambda2 = lambda2, par_starts = 2, cores = 2)
Applies sparse clustering to data from multiple conditions, linking the clusters across conditions and selecting a set of marker variables for each cluster and condition. See the manuscript for descriptions of the different categories of marker genes.
sparse_mdc(pdat, nclust, dim, lambda1, lambda2, nitter = 20, nstarts = 50, init_iter = 5, delta = 1e-07)
sparse_mdc(pdat, nclust, dim, lambda1, lambda2, nitter = 20, nstarts = 50, init_iter = 5, delta = 1e-07)
pdat |
list with D entries, each entry contains data d, p * n matrix. This data should be centered and log-transformed. |
nclust |
Number of clusters in the data. |
dim |
Total number of conditions, D. |
lambda1 |
The lambda 1 value to use in the SparseMDC function. This value controls the number of marker genes detected for each of the clusters in the final result. This can be calculated using the "lambda1_calculator" function or supplied by the user. |
lambda2 |
The lambda 2 value to use in the SparseMDC function. This value controls the number of genes that show condition-dependent expression within each cell type. This can be calculated using the "lambda2_calculator" function or supplied by the user. |
nitter |
The max number of iterations for each of the start values, the default value is 20. |
nstarts |
The max number of possible starts. The default value is 50. |
init_iter |
The number of iterations used to initialize the algorithm. Higher values result in less starts but more accurate and vice versa. Default is 5. |
delta |
Small term to ensure existance of solution, default is 0.0000001. |
A list containing cluster assignments, center values and the scores for each start.
set.seed(10) # Select small dataset for example data_test <- data_biase[1:100,] # Split data into condition A and B data_A <- data_test[ , which(condition_biase == "A")] data_B <- data_test[ , which(condition_biase == "B")] data_C <- data_test[ , which(condition_biase == "C")] # Store data as list dat_l <- list(data_A, data_B, data_C) # Pre-process the data pdat <- pre_proc_data(dat_l, dim=3, norm = FALSE, log = TRUE, center = TRUE) lambda1 <- lambda1_calculator(pdat, dim = 3, nclust = 3) lambda2 <- lambda2_calculator(pdat, dim = 3, nclust = 3, lambda1 = lambda1) smdc_res <- sparse_mdc(pdat, nclust = 3, dim = 3, lambda1 = lambda1, lambda2 = lambda2)
set.seed(10) # Select small dataset for example data_test <- data_biase[1:100,] # Split data into condition A and B data_A <- data_test[ , which(condition_biase == "A")] data_B <- data_test[ , which(condition_biase == "B")] data_C <- data_test[ , which(condition_biase == "C")] # Store data as list dat_l <- list(data_A, data_B, data_C) # Pre-process the data pdat <- pre_proc_data(dat_l, dim=3, norm = FALSE, log = TRUE, center = TRUE) lambda1 <- lambda1_calculator(pdat, dim = 3, nclust = 3) lambda2 <- lambda2_calculator(pdat, dim = 3, nclust = 3, lambda1 = lambda1) smdc_res <- sparse_mdc(pdat, nclust = 3, dim = 3, lambda1 = lambda1, lambda2 = lambda2)
This function calculates the gap statistic for SparseMDC. For use when the number of clusters in the data is unknown. We recommend using alternate methods to infer the number of clusters in the data.
sparsemdc_gap(pdat, dim, min_clus, max_clus, nboots = 200, nitter = 20, nstarts = 10, l1_boot = 50, l2_boot = 50)
sparsemdc_gap(pdat, dim, min_clus, max_clus, nboots = 200, nitter = 20, nstarts = 10, l1_boot = 50, l2_boot = 50)
pdat |
list with D entries, each entry contains data d, p * n matrix. This data should be centered and log-transformed. |
dim |
Total number of conditions, D. |
min_clus |
The minimum number of clusters to try, minimum value is 2. |
max_clus |
The maximum number of clusters to try. |
nboots |
The number of bootstrap repetitions to use, default = 200. |
nitter |
The max number of iterations for each of the start values, the default value is 20. |
nstarts |
The number of start values to use for SparseDC. The default value is 10. |
l1_boot |
The number of bootstrap repetitions used for estimating lambda 1. |
l2_boot |
The number of bootstrap repetitions used for estimating lambda 2. |
A list containing the optimal number of clusters, as well as gap statistics and the calculated standard error for each number of clusters.
set.seed(10) # Select small dataset for example data_test <- data_biase[1:100,] # Split data into condition A and B data_A <- data_test[ , which(condition_biase == "A")] data_B <- data_test[ , which(condition_biase == "B")] data_C <- data_test[ , which(condition_biase == "C")] # Store data as list dat_l <- list(data_A, data_B, data_C) # Pre-process the data pdat <- pre_proc_data(dat_l, dim=3, norm = FALSE, log = TRUE, center = TRUE) # Run with one bootstrap sample for example gap_stat <- sparsemdc_gap(pdat, dim=3, min_clus = 2, max_clus =3, nboots =2, nitter = 2, nstarts = 1, l1_boot = 5, l2_boot = 5)
set.seed(10) # Select small dataset for example data_test <- data_biase[1:100,] # Split data into condition A and B data_A <- data_test[ , which(condition_biase == "A")] data_B <- data_test[ , which(condition_biase == "B")] data_C <- data_test[ , which(condition_biase == "C")] # Store data as list dat_l <- list(data_A, data_B, data_C) # Pre-process the data pdat <- pre_proc_data(dat_l, dim=3, norm = FALSE, log = TRUE, center = TRUE) # Run with one bootstrap sample for example gap_stat <- sparsemdc_gap(pdat, dim=3, min_clus = 2, max_clus =3, nboots =2, nitter = 2, nstarts = 1, l1_boot = 5, l2_boot = 5)
Assigns samples to clusters. This function runs inside
sparse_mdc
.
update_c(mu, pdat, nclust, dim, lambda1, lambda2, delta)
update_c(mu, pdat, nclust, dim, lambda1, lambda2, delta)
mu |
list with D entries, each entry contains centers for data d, p*k matrix. |
pdat |
list with D entries, each entry contains data d, p * n matrix. This data should be centered and log-transformed. |
nclust |
Total number of clusters. |
dim |
Total number of conditions, D. |
lambda1 |
Calculated penalty parameter for mean size. |
lambda2 |
Calculated penalty parameter for mean difference. |
delta |
Small term to ensure existance of solution. |
A list with D entries containing cluster assignments for each sample.
Update the mean/center values for each cluster and dimension.
This function runs inside sparse_mdc
.
update_mu(clusters, pdat, nclust, dim, lambda1, lambda2, ngenes, delta)
update_mu(clusters, pdat, nclust, dim, lambda1, lambda2, ngenes, delta)
clusters |
List containig cluster assignments for each dimension as entries. |
pdat |
list with D entries, each entry contains data d, p * n matrix. This data should be centered and log-transformed. |
nclust |
Number of clusters in the data. |
dim |
Total number of conditions, D. |
lambda1 |
Calculated penalty parameter for mean size. |
lambda2 |
Calculated penalty parameter for mean difference. |
ngenes |
The number of genes in the data. |
delta |
Small term to ensure existance of solution. |
A list containing the center values for the clusters in each dimensions as entries.