| Title: | Semi-Supervised Deconvolution of Bulk RNA-Seq Data with Hyperparameter Optimization |
|---|---|
| Description: | Performs semi-supervised deconvolution of bulk RNA sequencing (RNA-seq) data. Known cell-type proportions are estimated using supervised methods -- 'CIBERSORTx' (CSx), 'CIBERSORT' (CS), 'FARDEEP' (Fast And Robust DEconvolution of Expression Profiles), and 'DCQ' (Digital Cell Quantifier) -- while unknown components are inferred using non-negative matrix factorization ('NMF') with limited-memory Broyden-Fletcher-Goldfarb-Shanno with bounds ('L-BFGS-B') optimization. Hyperparameters are selected automatically using a Pareto-frontier-based approach with knee-point detection, allowing application when the reference signature matrix is incomplete. More details about 'DICEpro' can be found in Ba et al. (2026) <doi:10.64898/2026.06.17.732876>. |
| Authors: | Kalidou BA [aut, cre], Boris P. Hejblum [aut] |
| Maintainer: | Kalidou BA <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.0.2 |
| Built: | 2026-06-30 16:50:24 UTC |
| Source: | https://github.com/cran/dicepro |
Performs semi-supervised deconvolution of bulk RNA sequencing (RNA-seq) data. Known cell-type proportions are estimated using supervised methods – 'CIBERSORTx' (CSx), 'CIBERSORT' (CS), 'FARDEEP' (Fast And Robust DEconvolution of Expression Profiles), and 'DCQ' (Digital Cell Quantifier) – while unknown components are inferred using non-negative matrix factorization ('NMF') with limited-memory Broyden-Fletcher-Goldfarb-Shanno with bounds ('L-BFGS-B') optimization. Hyperparameters are selected automatically using a Pareto-frontier-based approach with knee-point detection, allowing application when the reference signature matrix is incomplete. More details about 'DICEpro' can be found in Ba et al. (2026) doi:10.64898/2026.06.17.732876.
Maintainer: Kalidou BA [email protected]
Authors:
Boris P. Hejblum [email protected]
Useful links:
Identifies the best pair from a set of
optimization trials by:
Filtering trials where exceeds
constraint_threshold.
Computing abs_constraint = |1 - constraint|.
Extracting the Pareto frontier (minimize frobNorm AND
abs_constraint) via KraljicMatrix::get_frontier
with quadrant = "bottom.left".
Selecting the knee point via maximum perpendicular distance to the utopia-nadir line in normalized objective space.
best_hyperParams(trials_df, W, H, constraint_threshold = 0.1)best_hyperParams(trials_df, W, H, constraint_threshold = 0.1)
trials_df |
data.frame of optimization trials as returned
by |
W |
List of |
H |
List of |
constraint_threshold |
Numeric scalar. Maximum allowed
|
Named list, or invisible(NULL) when no valid configuration
survives filtering:
List with lambda and gamma.
List with frobNorm, abs_constraint, and
constraint for the selected trial.
Filtered trials data.frame (constraint-passing rows only).
The matrix of the selected trial.
The matrix of the selected trial.
A ggplot2 Pareto frontier figure.
Gene expression reference matrix for bulk RNA-seq deconvolution, derived from sorted cell populations spanning Immune, Stromal, Endothelial, epithelial, and muscle compartments. BlueCode serves as the default reference for dicepro and directly informs the hierarchical Dirichlet simulation framework.
data(BlueCode)data(BlueCode)
A numeric matrix of dimensions , where
denotes the number of genes after quality filtering. Rows are labeled
by HGNC gene symbol; columns are labeled by cell-type name. All values
are non-negative read counts prior to normalization; apply
.normalize_zscore_per_gene or TPM scaling before use in
downstream analyses.
BlueCode was built from publicly available sorted bulk RNA-seq profiles downloaded from the ENCODE and GEO repositories. For each cell population, replicate profiles were averaged after quantile normalization. Genes with zero variance across all 34 profiles, or with missing values in more than 10 \ log-transformed so that it can be used directly with both Gaussian and Poisson deconvolution models.
The 34 cell types are partitioned into five major tissue compartments. The column ordering in the matrix follows this partition exactly (columns 1–9, 10–17, 18–20, 21–25, 26–34).
Immune compartment – columns 1–9:
B.cell.naive
B.cell.memory
T.Cell.CD4
T.cell.CD4.memory
T.cell.CD8.memory
NK.cell.CD56
Monocyte.CD14
Macrophage
Mature.neutrophil
Stromal compartment – columns 10–17:
Fibroblast.cardiac.ventricle
Fibroblast.of.arm
Fibroblast.of.lung
Fibroblast.of.the.aortic.adventitia
Fibroblast.or.papilla.dermal.cell
MSC.like.pluripotent.cell
Chondrocyte.articular
Osteoblast
Endothelial compartment – columns 18–20:
Endothelial.large.blood.vessel
Endothelial.microvascular.mammary.or.endometrial
Endothelial.microvascular.non.reproductive
Epithelial compartment – columns 21–25:
Epithelial.cell.mammary
Epithelial.renal.cortical
Epithelial.respiratory
Hair.follicular.keratinocyte
Melanocyte.of.skin
Muscle compartment – columns 26–34:
Smooth.muscle.cell.aortic
Smooth.muscle.cell.bronchial
Smooth.muscle.cell.coronary.artery
Smooth.muscle.cell.pulmonary.artery
Smooth.muscle.cell.of.bladder
Smooth.muscle.cell.of.trachea
Smooth.muscle.cell.uterine
Myocyte.regular.cardiac
Myometrial.cell
This compartment structure directly informs the hierarchical Dirichlet
model in generateProp with scenario = "hierarchical".
The Dirichlet concentration parameters at each level
(, ,
,
) reflect the expected relative abundance
of each compartment in mixed tissue samples, so that simulated proportions
are biologically realistic with respect to BlueCode cell-type coverage.
data(BlueCode) # Quick inspection dim(BlueCode) # G x 34 colnames(BlueCode) # 34 cell-type labels # Use directly in dicepro out <- dicepro( reference = BlueCode, bulk = my_bulk_matrix, methodDeconv = "FARDEEP" )
BlueCode reference matrix constructed for the dicepro benchmarking framework from ENCODE (https://www.encodeproject.org) and GEO (https://www.ncbi.nlm.nih.gov/geo/) sorted RNA-seq profiles. See the accompanying manuscript for full details on data sources, replicate averaging, quality filtering, and normalization.
Experimentally constructed bulk RNA-seq mixtures of sorted cell populations, designed for benchmarking deconvolution algorithms. Each sample is a known mixture of cell types drawn from the BlueCode reference panel, making ground-truth proportions available for quantitative evaluation.
data(CellMixtures)data(CellMixtures)
A numeric matrix of dimensions , where
denotes the number of genes (~31 400 after quality filtering). Rows are
labeled by HGNC gene symbol; columns are labeled A through
L, corresponding to 12 experimentally mixed samples. Values are
raw read counts prior to normalization.
The 12 mixture samples (A–L) were prepared by combining sorted cell
populations at known proportions across the five tissue compartments
represented in BlueCode. Each sample targets a distinct
mixture composition, providing a controlled benchmark spanning a wide
range of cell-type abundance profiles – from immune-dominated samples
to Stromal- or muscle-enriched configurations.
CellMixtures covers approximately 31 400 gene symbols, a super-set of
those present in BlueCode. dicepro automatically intersects the two
gene sets before deconvolution; the effective number of informative
genes is therefore determined by the BlueCode reference (see
BlueCode for details on its gene filtering criteria).
A gene-overlap check prior to running dicepro is
recommended:
data(BlueCode)
data(CellMixtures)
n_common <- length(intersect(rownames(BlueCode), rownames(CellMixtures)))
cat(sprintf("Common genes: %d / %d reference genes\n",
n_common, nrow(BlueCode)))
Raw counts should be normalized before deconvolution.
dicepro applies .normalize_zscore_per_gene internally;
no preprocessing is required when using the main function.
For exploratory analyses, log-CPM or TPM normalization is recommended:
# log2-CPM normalization cpm <- sweep(CellMixtures, 2, colSums(CellMixtures) / 1e6, FUN = "/") log2_cpm <- log2(cpm + 1)
data(BlueCode) data(CellMixtures) out <- dicepro( reference = BlueCode, bulk = CellMixtures, methodDeconv = "FARDEEP", bulkName = "CellMixtures", refName = "BlueCode", hp_max_evals = 100, hspaceTechniqueChoose = "all" ) # Inspect estimated proportions head(out$H)
CellMixtures and BlueCode are designed as a paired benchmark: the reference matrix is derived from the same cell populations used to construct the mixtures, ensuring that all cell types present in the bulk are represented in the reference. This controlled setting enables direct evaluation of the reconstruction accuracy of dicepro.
CellMixtures was constructed for the DICEPro benchmarking framework. Sorted cell populations were obtained from ENCODE (https://www.encodeproject.org) and mixed in silico at predefined proportions from experimentally validated RNA-seq profiles. See the accompanying manuscript for full details on mixture design, library preparation, and sequencing.
Check if a value contains NaN or Inf
contains_nan_or_inf(value)contains_nan_or_inf(value)
value |
Numeric, vector, matrix, or data frame to check. |
Logical. TRUE if NaN or Inf is present, FALSE otherwise.
Generates a ggplot2 scatter plot of vs
sampled from the search space defined by hspaceTechniqueChoose,
with feasibility bounds overlaid for "restrictionEspace".
create_gamma_lambda_plot( hspaceTechniqueChoose = c("all", "restrictionEspace"), n_samples = 200, seed = NULL )create_gamma_lambda_plot( hspaceTechniqueChoose = c("all", "restrictionEspace"), n_samples = 200, seed = NULL )
hspaceTechniqueChoose |
Character scalar.
|
n_samples |
Positive integer. Configurations to draw (default |
seed |
Integer. Random seed used for full pipeline reproducibility.
Defaults to |
A ggplot2 figure object.
create_gamma_lambda_plot(hspaceTechniqueChoose = "all") create_gamma_lambda_plot(hspaceTechniqueChoose = "restrictionEspace")create_gamma_lambda_plot(hspaceTechniqueChoose = "all") create_gamma_lambda_plot(hspaceTechniqueChoose = "restrictionEspace")
Combines supervised estimation of known cell types with unsupervised discovery of latent components, with automatic Pareto-frontier-based hyper-parameter optimization.
dicepro( reference, bulk, methodDeconv = "CS", cibersortx_email = NULL, cibersortx_token = NULL, cibersort_perm = 0, cibersort_QN = TRUE, W_prime = 0, bulkName = "Bulk", refName = "Reference", hp_max_evals = 100, N_unknownCT = 1L, algo_select = "random", output_path = NULL, hspaceTechniqueChoose = "gamma_dominant", gamma_ratio_min = 10, constraint_threshold = 0.1, out_Decon = NULL, normalize = TRUE, seed = NULL )dicepro( reference, bulk, methodDeconv = "CS", cibersortx_email = NULL, cibersortx_token = NULL, cibersort_perm = 0, cibersort_QN = TRUE, W_prime = 0, bulkName = "Bulk", refName = "Reference", hp_max_evals = 100, N_unknownCT = 1L, algo_select = "random", output_path = NULL, hspaceTechniqueChoose = "gamma_dominant", gamma_ratio_min = 10, constraint_threshold = 0.1, out_Decon = NULL, normalize = TRUE, seed = NULL )
reference |
Numeric matrix (genes * cell types). |
bulk |
Numeric matrix (genes * samples). |
methodDeconv |
Character. One of |
cibersortx_email |
Character. CIBERSORTx email (required for
|
cibersortx_token |
Character. CIBERSORTx token (required for
|
cibersort_perm |
Non-negative integer. Number of permutations
for p-value estimation in the built-in |
cibersort_QN |
Logical. Apply quantile normalisation to the
bulk mixture in the built-in |
W_prime |
Initial unknown-signature matrix or |
bulkName |
Character scalar. Label for the bulk data-set. |
refName |
Character scalar. Label for the reference. |
hp_max_evals |
Positive integer. Number of hyper-parameter trials. |
N_unknownCT |
Positive integer. Number of unknown cell types. |
algo_select |
Character. One of |
output_path |
Character scalar. Root output directory. Defaults to tempdir(); pass an explicit directory to persist results across sessions. |
hspaceTechniqueChoose |
Character. One of
\item{\code{"all"}}{Full independent log-uniform grid for
\code{lambda_}, \code{gamma}, \code{p_prime}.}
\item{\code{"restrictionEspace"}}{Restricted space where
\code{lambda_ = gamma * lambda_factor},
\code{lambda_factor} \eqn{\in [2, 100]}.}
|
gamma_ratio_min |
Positive numeric. Minimum ratio
|
constraint_threshold |
Positive numeric. Maximum allowed deviation
from the constraint (|1 - constraint|) when filtering hyper-parameter
trials before Pareto selection (default
|
out_Decon |
Optional pre-computed deconvolution matrix (samples * cell types). When provided, the supervised deconvolution step is skipped entirely. |
normalize |
Logical. Apply z-score normalisation per
gene (default |
seed |
Integer. Random seed used for full pipeline reproducibility.
Defaults to |
When out_Decon is provided, the supervised step is skipped.
Gene matrices are optionally z-score normalized per gene, and only
intersecting genes are retained before optimization.
The built-in "CIBERSORT" method closely follows Newman et al.
(2015): nu-SVR with three candidate nu values (0.25, 0.50, 0.75),
optional quantile normalisation, and optional permutation-based p-values.
It requires packages e1071, parallel, and
preprocessCore. No external account or Docker installation is
needed, unlike "CSx".
An object of class "dicepro" (a named list) containing:
hyper-parameters – selected and
metrics – loss and constraint of the best trial
trials – all evaluated configurations
W – estimated unknown signature matrix
H – estimated proportion matrix
plot – Pareto frontier ggplot2 figure
plot_hyperopt – hyper-parameter space ggplot2 figure
Returns invisible(NULL) with a warning when no valid configuration
is found.
Newman AM et al. (2015). Robust enumeration of cell subsets from tissue expression profiles. Nature Methods, 12(5), 453-457. doi:10.1038/nmeth.3337
running_method, run_experiment,
best_hyperParams
sim_data <- simulation( loi = "gauss", scenario = "hierarchical", nSample = 5, nGenes = 50, nCellsType = 5, sigma_bio = 0.07, sigma_tech = 0.07, seed = 2101 ) out <- dicepro( reference = as.matrix(sim_data$W)[, -c(1)], bulk = as.matrix(sim_data$B), methodDeconv = "FARDEEP", W_prime = 0, bulkName = "SimBulk", refName = "SimRef", hp_max_evals = 5, algo_select = "random", output_path = tempdir(), hspaceTechniqueChoose = "all", normalize = FALSE )sim_data <- simulation( loi = "gauss", scenario = "hierarchical", nSample = 5, nGenes = 50, nCellsType = 5, sigma_bio = 0.07, sigma_tech = 0.07, seed = 2101 ) out <- dicepro( reference = as.matrix(sim_data$W)[, -c(1)], bulk = as.matrix(sim_data$B), methodDeconv = "FARDEEP", W_prime = 0, bulkName = "SimBulk", refName = "SimRef", hp_max_evals = 5, algo_select = "random", output_path = tempdir(), hspaceTechniqueChoose = "all", normalize = FALSE )
Fits a two-way mixed-effects model (populations and subjects as random
effects) and derives ICC(3,1), Concordance correlation coefficient, and relative RMSE, complemented by
sample-wise correlation and RMSE from samplewise_metrics.
full_metrics(x, y)full_metrics(x, y)
x |
Numeric matrix of observed values (rows = subjects, columns = variables / populations). |
y |
Numeric matrix of predicted values; same dimensions as |
Named list:
Mean sample-wise Pearson correlation.
Mean sample-wise RMSE.
ICC(3,1) from the mixed model
(NA for constant data).
Concordance correlation coefficient
(NA for constant data).
Relative RMSE
(NA for constant data).
Constructs a synthetic gene-by-cell-type reference matrix using either Poisson-based (via Gaussian copula) or log-normal gene expression models, with optional block sparsity and TPM normalization.
generate_ref_matrix( loi = "rpois", tpm = FALSE, bloc = FALSE, nGenesByCellType = 50, nCell = 500, nCellsType = 10, nGenes = 500, lambda_vec = NULL, corr = NULL, sparse = FALSE, prob_sparse = NULL )generate_ref_matrix( loi = "rpois", tpm = FALSE, bloc = FALSE, nGenesByCellType = 50, nCell = 500, nCellsType = 10, nGenes = 500, lambda_vec = NULL, corr = NULL, sparse = FALSE, prob_sparse = NULL )
loi |
Character. Expression model: |
tpm |
Logical. If |
bloc |
Logical. If |
nGenesByCellType |
Integer. Number of genes per cell-type block
(used when |
nCell |
Integer. Total number of cells (unused directly; retained for API compatibility). |
nCellsType |
Integer. Number of cell types. Default: |
nGenes |
Integer. Number of genes. Default: |
lambda_vec |
Numeric vector of length |
corr |
Numeric matrix. Inter-cell-type correlation matrix.
If |
sparse |
Logical. If |
prob_sparse |
Numeric. Probability of a non-zero entry when
|
For loi = "rpois", correlated Poisson counts are generated via a
Gaussian copula: multivariate normal draws with covariance corr are
mapped to uniform marginals via pnorm, then to Poisson quantiles via
qpois. This preserves the inter-cell-type correlation structure
without requiring SimMultiCorrData.
For any other value of loi, a log-normal model is used: multivariate
normal draws are exponentiated after a location shift of 6.
A numeric matrix of dimensions nGenes x nCellsType.
Row names are Gene_1, ..., Gene_nGenes; column names are
CellType_1, ..., CellType_nCellsType.
set.seed(2101) ref_pois <- generate_ref_matrix(loi = "rpois", nGenes = 50, nCellsType = 5) ref_gauss <- generate_ref_matrix(loi = "gauss", nGenes = 50, nCellsType = 5) dim(ref_pois) # 50 x 5 dim(ref_gauss) # 50 x 5set.seed(2101) ref_pois <- generate_ref_matrix(loi = "rpois", nGenes = 50, nCellsType = 5) ref_gauss <- generate_ref_matrix(loi = "gauss", nGenes = 50, nCellsType = 5) dim(ref_pois) # 50 x 5 dim(ref_gauss) # 50 x 5
Simulates cell-type proportion matrices for bulk RNA-seq deconvolution benchmarking. Four scenarios are supported: equal proportions across cell types, uniform random sampling, hierarchical Dirichlet sampling reflecting realistic tissue compartment organization, or a BlueCode-specific hierarchical model that preserves real cell-type names.
generateProp( n_cell_types = NULL, nSample, nCell = 500, scenario = NULL, alpha_groups = c(Immune = 4, Stromal = 2.5, Endothelial = 1.8, Epithelial = 1.8, Muscle = 1.5), alpha_subtypes = list(Immune = 8, Stromal = 8, Endothelial = 8, Epithelial = 8, Muscle = 8), seed = NULL )generateProp( n_cell_types = NULL, nSample, nCell = 500, scenario = NULL, alpha_groups = c(Immune = 4, Stromal = 2.5, Endothelial = 1.8, Epithelial = 1.8, Muscle = 1.5), alpha_subtypes = list(Immune = 8, Stromal = 8, Endothelial = 8, Epithelial = 8, Muscle = 8), seed = NULL )
n_cell_types |
Integer. Number of distinct cell types. Ignored when
|
nSample |
Integer. Number of samples to generate. |
nCell |
Integer. Total number of cells per sample (used for rounding precision in the uniform scenario). |
scenario |
Character. Proportion generation strategy. One of:
|
alpha_groups |
Named numeric vector of length 5. Dirichlet
concentration parameters for the five tissue compartments
(Immune, Stromal, Endothelial, Epithelial, Muscle). Only used when
|
alpha_subtypes |
Named list with one scalar per compartment. Controls
the concentration of sub-type Dirichlet draws within each compartment.
Higher values produce more even sub-type distributions. Only used when
|
seed |
Integer. Random seed. Only used when |
A numeric matrix of dimensions nSample x n_cell_types where
each row sums to 1. For all scenarios except "bluecode", row names
are Sample_1, ..., Sample_nSample and column names are
CellType_1, ..., CellType_n_cell_types. For
"bluecode", row names are sample_1, ...,
sample_nSample and column names are the real BlueCode cell-type
names ordered by compartment.
set.seed(2101) # Equal proportions prop_even <- generateProp(nSample = 20, n_cell_types = 10, scenario = "even") # Generic hierarchical Dirichlet prop_hier <- generateProp(nSample = 20, n_cell_types = 34, scenario = "hierarchical") all(abs(rowSums(prop_hier) - 1) < 1e-8) # TRUE # BlueCode hierarchical Dirichlet (real cell-type names) prop_bc <- generateProp(nSample = 20, scenario = "bluecode") colnames(prop_bc) # real BlueCode cell-type namesset.seed(2101) # Equal proportions prop_even <- generateProp(nSample = 20, n_cell_types = 10, scenario = "even") # Generic hierarchical Dirichlet prop_hier <- generateProp(nSample = 20, n_cell_types = 34, scenario = "hierarchical") all(abs(rowSums(prop_hier) - 1) < 1e-8) # TRUE # BlueCode hierarchical Dirichlet (real cell-type names) prop_bc <- generateProp(nSample = 20, scenario = "bluecode") colnames(prop_bc) # real BlueCode cell-type names
Generates a raster heatmap showing the estimated abundance of each cell type across NMF iterations (or samples). Color encodes abundance using the viridis scale (reversed: dark = high abundance).
heatmap_abundances(res2plot, base_size = 9, title = NULL, midpoint = NULL)heatmap_abundances(res2plot, base_size = 9, title = NULL, midpoint = NULL)
res2plot |
A
|
base_size |
Numeric. Base font size in points (default |
title |
Character. Plot title (default |
midpoint |
Numeric or |
A ggplot object. It can be printed or saved using
ggplot2::ggsave().
df <- data.frame( Iterate = 1:10, CellTypeA = runif(10), CellTypeB = runif(10) ) heatmap_abundances(df) heatmap_abundances(df, title = "Estimated abundances", midpoint = 0.5)df <- data.frame( Iterate = 1:10, CellTypeA = runif(10), CellTypeB = runif(10) ) heatmap_abundances(df) heatmap_abundances(df, title = "Estimated abundances", midpoint = 0.5)
Aligns predicted and observed matrices to their common populations, row-normalizes both, then computes global and sample-wise agreement metrics.
makeTable1Tool(pred_mat, obs_mat)makeTable1Tool(pred_mat, obs_mat)
pred_mat |
Numeric matrix of predicted compositions (rows = samples, columns = populations). |
obs_mat |
Numeric matrix of observed compositions; must share at
least one column name with |
Populations absent from pred_mat but present in obs_mat
(and vice versa) are silently dropped after the intersection step.
Both matrices are row-normalized via row_norm_pos before
any metric is computed.
A list with one element:
A one-row data.frame containing:
Correlation, RMSE, Correlation_mean,
RMSE_mean, NRMSE, ICC3, CCC.
All fields are NA when computation is not possible.
Plots the evolution of one or more scalar performance metrics (e.g.,
NRMSE, R, loss) over NMF iterations. Useful for monitoring
convergence and diagnosing early stopping.
metric_plot( perf2plot, ylab = "Error between folds", title = NULL, base_size = 9, add_smooth = FALSE, show_points = FALSE )metric_plot( perf2plot, ylab = "Error between folds", title = NULL, base_size = 9, add_smooth = FALSE, show_points = FALSE )
perf2plot |
A
Optionally a |
ylab |
Character. Y-axis label
(default |
title |
Character. Plot title (default |
base_size |
Numeric. Base font size in points (default |
add_smooth |
Logical. When |
show_points |
Logical. When |
A ggplot object.
df <- data.frame( Iterate = 1:50, metric = cumsum(runif(50, -0.01, 0.1)) ) metric_plot(df) metric_plot(df, ylab = "NRMSE", add_smooth = TRUE) # Multi-group df2 <- data.frame( Iterate = rep(1:20, 2), metric = c(cumsum(runif(20, 0, 0.1)), cumsum(runif(20, 0, 0.05))), group = rep(c("Fold 1", "Fold 2"), each = 20) ) metric_plot(df2)df <- data.frame( Iterate = 1:50, metric = cumsum(runif(50, -0.01, 0.1)) ) metric_plot(df) metric_plot(df, ylab = "NRMSE", add_smooth = TRUE) # Multi-group df2 <- data.frame( Iterate = rep(1:20, 2), metric = c(cumsum(runif(20, 0, 0.1)), cumsum(runif(20, 0, 0.05))), group = rep(c("Fold 1", "Fold 2"), each = 20) ) metric_plot(df2)
Performs non-negative matrix factorization (NMF) using L-BFGS-B optimization. The objective combines a Gaussian log-likelihood with an augmented Lagrangian penalty that enforces the sum-to-one constraint on cell-type proportions.
nmf_lbfgsb( r_dataset, W_prime = 0, p_prime = 0, lambda_ = 10, gamma_par = 100, N_unknownCT = 1L, con = list(maxit = 3000) )nmf_lbfgsb( r_dataset, W_prime = 0, p_prime = 0, lambda_ = 10, gamma_par = 100, N_unknownCT = 1L, con = list(maxit = 3000) )
r_dataset |
Named list with three elements:
|
W_prime |
Numeric scalar or matrix. Initial value(s) for the
unknown cell-type column(s) in |
p_prime |
Numeric scalar or matrix. Initial value(s) for the
unknown cell-type column(s) in |
lambda_ |
Numeric scalar. Augmented Lagrangian multiplier
initialization (default |
gamma_par |
Numeric scalar. Penalty coefficient for the sum-to-one
constraint (default |
N_unknownCT |
Positive integer. Number of latent cell types to
estimate (default |
con |
Named list of control parameters forwarded to
|
A named list on success:
Optimized signature matrix (N_gene × N_cells_Type).
Frobenius norm of W.
Variance of the unknown column of W.
Optimized proportion data.frame (N_sample × N_cells_Type + 1),
with a leading Mixture column.
Frobenius norm of H.
Variance of the unknown column of H.
Gaussian log-likelihood (objective terms 1 + 2).
Objective term 1 (data-fit component).
Objective term 2 (log-normalization component).
Penalty term 1 (linear Lagrangian).
Penalty term 2 (quadratic penalty).
Total objective value (loss).
Total penalty (c1 + c2).
Convergence code from lbfgsb3c.
Absolute constraint violation: .
Returns NULL when the optimizer fails or the solution is degenerate.
Thin wrapper around nmf_lbfgsb that normalizes the
p_prime argument and cleans the returned matrices.
nmf_lbfgsb_hyperOpt( dataset, W_prime = NULL, p_prime = NULL, lambda_ = 10, gamma_par = 100 )nmf_lbfgsb_hyperOpt( dataset, W_prime = NULL, p_prime = NULL, lambda_ = 10, gamma_par = 100 )
dataset |
List containing matrices |
W_prime |
Optional numeric matrix. Initial |
p_prime |
Optional numeric matrix or scalar. Initial |
lambda_ |
Numeric. Regularization parameter |
gamma_par |
Numeric. Regularization parameter |
List output from nmf_lbfgsb with H and
W cleaned by .clean_nmf_matrix.
Computes RMSE between back-transformed pred and obs vectors,
then normalizes by a summary statistic of the observations.
nrmse( pred, obs, method = "sd", transformation = "none", trans_function = "none" )nrmse( pred, obs, method = "sd", transformation = "none", trans_function = "none" )
pred |
Numeric vector of predicted values (on the transformed scale). |
obs |
Numeric vector of observed values (same scale as
|
method |
Normalization denominator: |
transformation |
Back-transformation applied before computing RMSE.
One of |
trans_function |
R expression string used when
|
Non-negative numeric scalar, or NA_real_ (with a message)
when NRMSE is undefined (fewer than 2 complete observations, or zero
denominator).
set.seed(1) obs <- rnorm(100) pred <- obs + rnorm(100, sd = 0.1) nrmse(pred, obs) nrmse(pred, obs, method = "maxmin")set.seed(1) obs <- rnorm(100) pred <- obs + rnorm(100, sd = 0.1) nrmse(pred, obs) nrmse(pred, obs, method = "maxmin")
Runs one NMF trial for a given configuration
and returns a structured result list, or NULL when the trial
produces invalid values.
objective_opt( dataset, config = list(), lambda_ = NULL, gamma_factor = NULL, gamma = NULL, p_prime = NULL, W_prime = 0 )objective_opt( dataset, config = list(), lambda_ = NULL, gamma_factor = NULL, gamma = NULL, p_prime = NULL, W_prime = 0 )
dataset |
List with matrices |
config |
List. Configuration object (needs |
lambda_ |
Numeric. Regularization parameter |
gamma_factor |
Numeric or |
gamma |
Numeric. |
p_prime |
Numeric matrix ( |
W_prime |
Numeric matrix or scalar. Initial |
A named list with elements loss, constraint,
status, current_params, W, H, cvrge;
or NULL when the trial is invalid.
Generic function for plotting the hyperparameter search report stored in a
dicepro object. Dispatches to plot_hyperopt.dicepro.
Builds a scatter-matrix of all evaluated
configurations, color-coded by loss value, with violin/bar marginals for
the top 5\
plot_hyperopt(x, ...) ## S3 method for class 'dicepro' plot_hyperopt( x, params, metric = "loss", loss_metric = "loss", loss_behaviour = "min", not_log = NULL, categorical = NULL, max_deviation = NULL, title = NULL, ... )plot_hyperopt(x, ...) ## S3 method for class 'dicepro' plot_hyperopt( x, params, metric = "loss", loss_metric = "loss", loss_behaviour = "min", not_log = NULL, categorical = NULL, max_deviation = NULL, title = NULL, ... )
x |
A |
... |
Currently unused. Reserved for future extensions. |
params |
Character vector of hyperparameter column names to
display (e.g. |
metric |
Character scalar. Column used for point size
(default |
loss_metric |
Character scalar. Column used as the loss axis
(default |
loss_behaviour |
Character scalar. Direction of the loss:
|
not_log |
Character vector of parameter names that should
not be log-scaled on their axis (default |
categorical |
Character vector of categorical parameter names;
these are displayed as bar charts in the marginal row
(default |
max_deviation |
Numeric scalar. Trials with
|
title |
Character scalar. Optional title displayed above the
combined figure (default |
Whatever the dispatched method returns (a patchwork figure
for dicepro objects).
A patchwork object which can be printed, saved with
ggplot2::ggsave(), or embedded in R Markdown documents.
plot_hyperopt.dicepro
sim_data <- simulation( loi = "gauss", scenario = "hierarchical", nSample = 5, nGenes = 150, nCellsType = 10, sigma_bio = 0.07, sigma_tech = 0.07, seed = 2101 ) out <- dicepro( reference = as.matrix(sim_data$W)[, -c(1, 5, 10)], bulk = as.matrix(sim_data$B), methodDeconv = "FARDEEP", W_prime = 0, bulkName = "SimBulk", refName = "SimRef", hp_max_evals = 10, algo_select = "random", output_path = tempdir(), hspaceTechniqueChoose = "all", normalize = FALSE ) plot_hyperopt(out, params = c("lambda_", "gamma", "p_prime"))sim_data <- simulation( loi = "gauss", scenario = "hierarchical", nSample = 5, nGenes = 150, nCellsType = 10, sigma_bio = 0.07, sigma_tech = 0.07, seed = 2101 ) out <- dicepro( reference = as.matrix(sim_data$W)[, -c(1, 5, 10)], bulk = as.matrix(sim_data$B), methodDeconv = "FARDEEP", W_prime = 0, bulkName = "SimBulk", refName = "SimRef", hp_max_evals = 10, algo_select = "random", output_path = tempdir(), hspaceTechniqueChoose = "all", normalize = FALSE ) plot_hyperopt(out, params = c("lambda_", "gamma", "p_prime"))
Combines a cell-abundance heatmap with a fold-error plot using
patchwork. Requires at least two unique iterations in
x$Matrix_prediction.
## S3 method for class 'dicepro' plot(x, ...)## S3 method for class 'dicepro' plot(x, ...)
x |
A |
... |
Additional arguments (currently unused; reserved for future use). |
A patchwork figure, or invisible(NULL) with a warning
when only one iteration is present.
heatmap_abundances, metric_plot
Runs hp_max_evals NMF trials by sampling from hp_space,
using either random or TPE sampling, and collects results.
research_hyperOpt( objective_opt, dataset, config, hp_space = NULL, W_prime = NULL, seed = NULL )research_hyperOpt( objective_opt, dataset, config, hp_space = NULL, W_prime = NULL, seed = NULL )
objective_opt |
Function passed verbatim to
|
dataset |
List with |
config |
List. Configuration object; must contain
|
hp_space |
prebuilt named list of parsed hyper-parameter specs.
If |
W_prime |
Optional initial |
seed |
Integer. Random seed used for full pipeline reproducibility. |
A named list:
data.frame of all successful trial results.
List of matrices, one per successful trial.
List of matrices, one per successful trial.
Clamps negative values to 0, then divides each row by its sum so that
rows sum to 1. Rows whose sum equals 0 become all-NaN (degenerate
samples are flagged rather than silently zeroed).
row_norm_pos(mat)row_norm_pos(mat)
mat |
Numeric matrix. |
Numeric matrix of the same dimensions as mat, with
each row summing to 1 (or all-NaN when the row sum is 0).
m <- matrix(c(-1, 2, 3, 0, 0, 0, 1, 1, 1), nrow = 3, byrow = TRUE) row_norm_pos(m)m <- matrix(c(-1, 2, 3, 0, 0, 0, 1, 1, 1), nrow = 3, byrow = TRUE) row_norm_pos(m)
This function runs the CIBERSORTx deconvolution method using Docker with the provided email and token. The method is used to estimate cell type proportions in bulk RNA-seq data.
run_CSx( bulk, reference, cibersortx_email, cibersortx_token, work_dir = tempdir() )run_CSx( bulk, reference, cibersortx_email, cibersortx_token, work_dir = tempdir() )
bulk |
A matrix of bulk RNA-seq data. |
reference |
A matrix of reference cell type gene expression profiles. |
cibersortx_email |
The CIBERSORTx account email. |
cibersortx_token |
The CIBERSORTx account token. |
work_dir |
Character. Path to the working directory where CIBERSORTx
output files are written. Defaults to |
The function performs the following steps:
Runs CIBERSORTx with the provided email and token.
Reads the output results from CIBERSORTx.
Extracts cell type proportions and returns them as a matrix.
A matrix containing the estimated cell type proportions.
Builds the hyperparameter search space from hspaceTechniqueChoose
and runs research_hyperOpt(), returning the collected trials.
run_experiment( dataset, W_prime = 0, bulkName, refName, hp_max_evals, algo_select, output_base_dir = tempdir(), hspaceTechniqueChoose, gamma_ratio_min = 10, seed )run_experiment( dataset, W_prime = 0, bulkName, refName, hp_max_evals, algo_select, output_base_dir = tempdir(), hspaceTechniqueChoose, gamma_ratio_min = 10, seed )
dataset |
List containing at least |
W_prime |
Numeric matrix (or |
bulkName |
Character scalar. Identifier for the bulk dataset (used in output path construction). |
refName |
Character scalar. Identifier for the reference dataset (used in output path construction). |
hp_max_evals |
Positive integer. Number of hyperparameter trials to run. |
algo_select |
Character scalar. Sampling algorithm passed to
the configuration (e.g. |
output_base_dir |
Character scalar. Root directory for all outputs
(default |
hspaceTechniqueChoose |
Character scalar. Search-space strategy:
|
gamma_ratio_min |
Positive numeric. Minimum ratio
|
seed |
Integer. Random seed used for full pipeline reproducibility. |
The list returned by research_hyperOpt:
trials, W, and H.
Performs cell-type deconvolution on bulk RNA-seq data using one of four supported methods. All methods receive gene-intersected, numeric matrices and return a proportion matrix normalised to sum to 1 per sample (columns).
running_method( bulk, reference, methodDeconv = "CS", cibersortx_email = NULL, cibersortx_token = NULL, cibersort_perm = 0, cibersort_QN = TRUE )running_method( bulk, reference, methodDeconv = "CS", cibersortx_email = NULL, cibersortx_token = NULL, cibersort_perm = 0, cibersort_QN = TRUE )
bulk |
Numeric matrix (genes x samples). Bulk RNA-seq expression data. Row names must be gene symbols. |
reference |
Numeric matrix (genes x cell types). Reference cell-type signature profiles. Row names must be gene symbols. |
methodDeconv |
Character scalar. Deconvolution method:
|
cibersortx_email |
Character. CIBERSORTx account email (required
when |
cibersortx_token |
Character. CIBERSORTx account token (required
when |
cibersort_perm |
Non-negative integer. Number of permutations for
CIBERSORT p-values (default |
cibersort_QN |
Logical. Apply quantile normalisation in
CIBERSORT (default |
Gene intersection is performed automatically: only genes present in both
bulk and reference are used. Both matrices are coerced to
numeric before processing.
For "CS", the built-in implementation closely follows the
original Newman et al. (2015) algorithm: nu-SVR with three candidate nu
values (0.25, 0.50, 0.75), optional quantile normalisation, and optional
permutation-based p-values. Packages e1071, parallel, and
preprocessCore must be installed.
Numeric matrix (cell types x samples) of estimated cell-type proportions. Each column sums to 1.
Newman AM et al. (2015). Robust enumeration of cell subsets from tissue expression profiles. Nature Methods, 12(5), 453-457. doi:10.1038/nmeth.3337
For each row (sample) of two matrices, computes the Pearson correlation and RMSE between predicted and observed values, then returns their cross-sample means.
samplewise_metrics(obs_mat, pred_mat)samplewise_metrics(obs_mat, pred_mat)
obs_mat |
Numeric matrix of observations (rows = samples, columns = variables). |
pred_mat |
Numeric matrix of predictions; must have the same
dimensions as |
Rows where either obs_mat or pred_mat are constant
(zero standard deviation) yield NA for correlation but still
contribute to the RMSE mean.
Named list with two elements:
Mean Pearson correlation across samples
(ignoring NA rows).
Mean RMSE across samples.
Generates synthetic bulk RNA-seq expression matrices by linearly mixing reference cell-type profiles according to simulated proportions, with optional biological and technical noise. The noise model follows the Gaussian likelihood framework underlying the dicepro objective function: biological noise is multiplicative, whereas technical noise is additive. Consequently, normalization to a Gaussian scale prior to deconvolution is required.
simulation( W = NULL, prop = NULL, nSample = 50, nCell = 500, nCellsType = 50, nGenes = 500, lambda_vec = NULL, corr = NULL, scenario = NULL, loi = "rpois", tpm = FALSE, bloc = FALSE, nGenesByCellType = 50, sparse = FALSE, prob_sparse = 0.5, sigma_bio = 0.07, sigma_tech = 0.07, seed = NULL )simulation( W = NULL, prop = NULL, nSample = 50, nCell = 500, nCellsType = 50, nGenes = 500, lambda_vec = NULL, corr = NULL, scenario = NULL, loi = "rpois", tpm = FALSE, bloc = FALSE, nGenesByCellType = 50, sparse = FALSE, prob_sparse = 0.5, sigma_bio = 0.07, sigma_tech = 0.07, seed = NULL )
W |
Numeric matrix (genes x cell types) or |
prop |
Numeric matrix (samples x cell types) or |
nSample |
Integer. Number of samples. Default: |
nCell |
Integer. Total cells per sample. Default: |
nCellsType |
Integer. Number of cell types. Default: |
nGenes |
Integer. Number of genes. Default: |
lambda_vec |
Numeric vector of length |
corr |
Numeric matrix. Inter-cell-type correlation.
If |
scenario |
Character. Proportion scenario passed to
|
loi |
Character. Expression law for reference generation.
|
tpm |
Logical. TPM normalization of reference.
Default: |
bloc |
Logical. Block sparsity in reference.
Default: |
nGenesByCellType |
Integer. Genes per block (when |
sparse |
Logical. Random sparsity in reference.
Default: |
prob_sparse |
Numeric. Sparsity probability. Default: |
sigma_bio |
Numeric. Standard deviation of multiplicative
biological noise. Default: |
sigma_tech |
Numeric. Standard deviation of additive technical
noise (proportional to expression level). Default: |
seed |
Integer. Random seed. Default: |
The bulk expression matrix is generated as:
Noise is then applied as:
where
and .
Negative values are clipped to zero.
A named list with three elements:
pdata.frame. True cell-type proportions (samples x cell types).
Wdata.frame. Reference signature matrix (genes x cell types).
Bdata.frame. Noisy bulk expression matrix (genes x samples).
set.seed(2026) sim <- simulation( scenario = "hierarchical", nSample = 20, nGenes = 100, nCellsType = 10, sigma_bio = 0.07, sigma_tech = 0.07, seed = 2101 ) dim(sim$p) # 20 x 10 dim(sim$B) # 100 x 20set.seed(2026) sim <- simulation( scenario = "hierarchical", nSample = 20, nGenes = 100, nCellsType = 10, sigma_bio = 0.07, sigma_tech = 0.07, seed = 2101 ) dim(sim$p) # 20 x 10 dim(sim$B) # 100 x 20
Generates synthetic bulk RNA-seq expression matrices by mixing the bundled BlueCode reference signature matrix with hierarchically simulated cell-type proportions, then applying biological and technical noise.
simulation_bluecode( nSample = 50, alpha_groups = c(Immune = 4, Stromal = 2.5, Endothelial = 1.8, Epithelial = 1.8, Muscle = 1.5), alpha_subtypes = list(Immune = 8, Stromal = 8, Endothelial = 8, Epithelial = 8, Muscle = 8), sigma_bio = 0.15, sigma_tech = 0.02, seed = NULL )simulation_bluecode( nSample = 50, alpha_groups = c(Immune = 4, Stromal = 2.5, Endothelial = 1.8, Epithelial = 1.8, Muscle = 1.5), alpha_subtypes = list(Immune = 8, Stromal = 8, Endothelial = 8, Epithelial = 8, Muscle = 8), sigma_bio = 0.15, sigma_tech = 0.02, seed = NULL )
nSample |
Integer. Number of samples to simulate.
Default: |
alpha_groups |
Named numeric vector of length 5. Dirichlet
concentration parameters for the five tissue compartments
(Immune, Stromal, Endothelial, Epithelial, Muscle). Larger values
produce more even compartment proportions.
Default: |
alpha_subtypes |
Named list with one numeric scalar per compartment.
Controls the concentration of sub-type Dirichlet draws within each
compartment. Higher values produce more even sub-type distributions.
Default: all compartments set to |
sigma_bio |
Numeric. Standard deviation of multiplicative
biological noise (log-normal). Default: |
sigma_tech |
Numeric. Standard deviation of additive technical
noise, expressed as a fraction of the global expression standard
deviation. Default: |
seed |
Integer. Random seed for full reproducibility.
Proportion sampling uses |
The proportion model follows a two-level Dirichlet hierarchy that mirrors the biological organization of human tissues:
Compartment level – five major tissue compartments
(Immune, Stromal, Endothelial, Epithelial, Muscle) are drawn from
.
Cell-type level – within each compartment, sub-type
proportions are drawn from a symmetric Dirichlet controlled by
alpha_subtypes and scaled by the compartment proportion.
This two-level structure induces realistic positive within-compartment and negative between-compartment correlations between cell types.
The bulk matrix is computed as , then noise is
applied as:
where
and
.
Negative values are clipped to zero.
The function validates at runtime that the cell-type names in
.bluecode_cell_groups exactly match colnames(BlueCode),
ensuring consistency between the proportion model and the reference matrix.
A named list with three elements:
pdata.frame (samples x 34 cell types). True cell-type proportions with real BlueCode cell-type names. Each row sums to 1.
Wdata.frame (genes x 34 cell types). BlueCode reference
signature matrix, column-ordered to match p.
Bdata.frame (genes x samples). Noisy simulated bulk expression matrix.
sim <- simulation_bluecode( nSample = 30, sigma_bio = 0.15, sigma_tech = 0.02 ) dim(sim$p) dim(sim$W) dim(sim$B) all(abs(rowSums(sim$p) - 1) < 1e-8)sim <- simulation_bluecode( nSample = 30, sigma_bio = 0.15, sigma_tech = 0.02 ) dim(sim$p) dim(sim$W) dim(sim$B) all(abs(rowSums(sim$p) - 1) < 1e-8)