| Title: | 'NetSurvProx': Network-Based Survival Analysis via Proximal Methods |
|---|---|
| Description: | Introduces a novel network-constrained survival analysis framework for variable selection and parameter estimation in penalized survival models with convex penalties. The package extends two classical survival models, the Cox Proportional Hazards (PH) model and the Accelerated Failure Time (AFT) model, by incorporating prior biological knowledge from curated interaction networks (e.g., KEGG) into a double-penalty framework. The first penalty enforces variable selection through a LASSO penalty, while the second preserves gene-gene correlations by incorporating Laplacian-based constraints, ensuring that biologically relevant network structures are maintained. Using censored survival data, the method enables the identification of predictive biomarkers and pathways with potential relevance for targeted therapies. Model estimation is performed via proximal optimization algorithms combined with cross-validation for reliable tuning. To enhance interpretability, dedicated utility functions are implemented to consolidate results, yielding biologically coherent insights that can support personalized medicine and contribute to improved patient outcomes. |
| Authors: | Maura Mecchi [aut, cre], Antonella Iuliano [aut] |
| Maintainer: | Maura Mecchi <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 1.0.0 |
| Built: | 2026-06-09 11:08:28 UTC |
| Source: | https://github.com/cran/NetSurvProx |
Builds a Laplacian network penalty based on a prior weighted graph. It encourages coefficients corresponding to connected covariates to behave similarly: if two covariates are strongly connected in the network, their estimated coefficients tend to be either both close to zero or both nonzero. In this way, the penalty promotes smoothness and structural coherence across related variables.
CreateNetwork( X, Y = NULL, delta = NULL, doid = NULL, tissue = NULL, disease_file = NULL, tissue_file = NULL, cache = FALSE, cache_dir = NULL, choice = 1, model = NULL, dist = NULL, verbose = FALSE )
X |
Numeric matrix of standardized covariates. |
Y |
Numeric vector of observed survival times (log-transformed under |
delta |
Integer vector of censoring indicators (1 = event, 0 = censored),
required for |
doid |
Character string specifying Disease Ontology ID ( |
tissue |
Character string specifying tissue name, used to retrieve the
tissue-specific network from HumanBase, used only if
|
disease_file |
Character string specifying optional path to a tab-delimited
file containing disease-associated genes (columns: |
tissue_file |
Character string specifying optional path to a tab-delimited
file with tissue-specific gene interactions (columns:
|
cache |
Logical value; if |
cache_dir |
Character string specifying a directory used to cache
downloaded HumanBase files (when |
choice |
Value specifying the choice for the signs of the adjacency matrix
|
model |
Character string specifying the fitted survival model
( |
dist |
Character string specifying the AFTNet distribution.
Must be one of |
verbose |
Logical value, if |
This prior network is represented by a weighted graph where each vertex
corresponds to a covariate and the edges describe relationships between covariates.
The edge weights are stored in an adjacency matrix , which has zeros on its diagonal.
The degree matrix contains on its diagonal the sum of the absolute
edge weights connected to each vertex. The Laplacian matrix is defined as ,
where is the weighted matrix estimated from .
Two strategies can be used.
Correlation-based signs (choice = 1): the sign of an edge is
set according to the Pearson correlation between the two corresponding covariates.
Ridge-based signs (choice = 2): the sign of an edge is determined
by the signs of ridge regression coefficients obtained from a penalized
survival model. This ridge estimator provides stable coefficient estimates
in high-dimensional settings. For the Cox model the ridge fit is obtained
via glmnet::glmnet(), while for the AFT model it is obtained via survival::survreg().
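The correlation-based strategy can be sketched in base R as follows. This is only an illustration under stated assumptions: the signed Laplacian is assumed to take the common form D - S∘A, where S holds the correlation signs, and the toy adjacency matrix and all variable names are hypothetical rather than taken from the package.

```r
# Toy data: 50 samples, 4 standardized covariates (illustrative only)
set.seed(1)
X <- scale(matrix(rnorm(50 * 4), nrow = 50, ncol = 4))

# Hypothetical prior adjacency matrix: symmetric, zero diagonal
A <- matrix(0, 4, 4)
A[1, 2] <- A[2, 1] <- 0.8
A[2, 3] <- A[3, 2] <- 0.5

# Analogue of choice = 1: edge signs taken from Pearson correlations
S <- sign(cor(X))
A_signed <- S * A

# Degree matrix: sum of absolute edge weights attached to each vertex
D <- diag(rowSums(abs(A)))

# Assumed form of the signed Laplacian
L <- D - A_signed

# Quadratic penalty for a coefficient vector
beta <- c(1, -0.5, 0.2, 0)
penalty <- drop(t(beta) %*% L %*% beta)

# Equivalent edge-wise form: sum of |a_ij| * (beta_i - s_ij * beta_j)^2
edges <- which(upper.tri(A) & A != 0, arr.ind = TRUE)
edgewise <- sum(abs(A[edges]) *
                (beta[edges[, 1]] - S[edges] * beta[edges[, 2]])^2)
```

The edge-wise form makes the smoothing effect explicit: each connected pair contributes a squared difference (or sum, for a negative sign) of its coefficients, weighted by the edge strength.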
The framework is used to construct a disease-specific gene interaction network, where edges represent biological relationships between genes relevant to a given cancer and tissue type.
Internally, the function relies on helper routines (see RepositoryDisease and RepositoryTissue)
to retrieve biological prior information from the HumanBase database.
These datasets are combined to construct a disease- and tissue-specific adjacency matrix
that defines the structure of the Laplacian penalty. User-provided files with
the same format can be supplied to bypass the download step.
A list with two elements:
disease_genes: data frame of disease genes used in the network.
L: final Laplacian matrix.
If tissue-specific or disease-specific files are not provided, the function downloads the relevant data from HumanBase. In this case, an active internet connection is required. Moreover, not all DOIDs and tissues are present in the HumanBase repository. If the requested DOID or tissue is not available, the function may return an empty list.
data(LUADdataset)
net <- CreateNetwork(
  LUADdataset$X_train, doid = "DOID:1324", tissue = "lung",
  choice = 1, verbose = TRUE)
L <- net$L # final Laplacian matrix
disease_genes <- net$disease_genes # disease genes and scores
COXNet and AFTNet
Performs K-fold cross-validation to select the optimal regularization
parameter for penalized survival models (COXNet, AFTNet)
estimated via ProxGDNet. The criterion is based on cross-validated
linear predictors and negative (partial) log-likelihood.
CvNet( X, Y, delta, L = NULL, lambda, alpha, model = NULL, dist = NULL, sigma = NULL, nfolds = 5, seed = 2026, value = 2, niter = 1000, conv = 0.001, parallel = TRUE, ncore_max = 5, verbose = FALSE )
X |
Numeric matrix of standardized covariates. |
Y |
Numeric vector of observed survival times (log-transformed under |
delta |
Integer vector of censoring indicators (1 = event, 0 = censored). |
L |
Optional positive semi-definite, symmetric, and diagonally dominant
Laplacian matrix encoding prior network information
(see |
lambda |
Numeric vector of candidate tuning parameters (in descending order). |
alpha |
Numeric parameter controlling the convex combination of the two
penalty terms (value in |
model |
Character string specifying the fitted survival model
( |
dist |
Character string specifying the error distribution
of |
sigma |
Positive numeric scalar representing the scale parameter of the
error distribution in |
nfolds |
Number of cross-validation folds ( |
seed |
Random seed for reproducibility ( |
value |
Numeric scalar greater than 1 specifying the multiplicative
factor used to increase the step-size constant during
backtracking line search ( |
niter |
Maximum number of proximal gradient iterations ( |
conv |
Convergence tolerance for proximal gradient ( |
parallel |
Logical value, whether to use parallel processing ( |
ncore_max |
Maximum number of cores for parallel processing over cross validation ( |
verbose |
Logical value, if |
The dataset is split into K folds. For each fold, the model is trained on K-1 folds, and evaluated on the held-out fold. The cross-validated linear predictor is computed as
for COXNet, or the cross-validated standardized residual as
for AFTNet, and used to evaluate the cross-validation criterion over a grid of lambda values.
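The K-fold split described above can be sketched in base R. How CvNet assigns observations to folds internally is not stated here, so the balanced random assignment below is an assumption, and the variable names are illustrative.

```r
set.seed(2026)
n <- 23   # number of subjects (toy value)
K <- 5    # number of folds

# Balanced random assignment of each subject to one of K folds
fold_id <- sample(rep(seq_len(K), length.out = n))

for (k in seq_len(K)) {
  test_idx  <- which(fold_id == k)   # held-out fold
  train_idx <- which(fold_id != k)   # remaining K - 1 folds
  # A model would be fitted on train_idx and evaluated on test_idx here
}

fold_sizes <- table(fold_id)
```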
The optimal lambda is selected according to one of two rules:
the lambda giving the minimum CV error (lambda.min).
the largest lambda whose CV error is within one standard error
of the minimum (lambda.1se).
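The two selection rules can be illustrated with a small base R sketch. The CV error values below are made up for illustration, and the variable names mirror, but are not guaranteed to match, the elements of the returned object.

```r
# Hypothetical CV summary over a descending lambda grid
lambda_grid <- c(1, 0.5, 0.25, 0.1, 0.05)
cv_mean     <- c(2.0, 1.28, 1.2, 1.25, 1.4)   # mean CV error per lambda
cv_se       <- rep(0.1, 5)                    # standard error per lambda

# Rule 1: lambda with the smallest mean CV error
ind_min    <- which.min(cv_mean)
lambda_min <- lambda_grid[ind_min]

# Rule 2: largest lambda whose error is within one SE of the minimum
threshold  <- cv_mean[ind_min] + cv_se[ind_min]
lambda_1se <- max(lambda_grid[cv_mean <= threshold])
ind_1se    <- which(lambda_grid == lambda_1se)

# Upper and lower error curves, as typically drawn in a CV plot
cvup <- cv_mean + cv_se
cvlo <- cv_mean - cv_se
```

The one-standard-error rule trades a small loss in CV error for a sparser model, since larger lambda values shrink more coefficients to zero.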
An object of class "cv.out" containing:
cv.err.linPred: CV error for each value of lambda.
cv.err.obj: estimated standard error associated with each value of CV error per fold.
lambda.grid: grid of regularization parameters values.
lambda.min: value of lambda minimizing the CV error.
ind.lambda.min: indices of lambda.min.
lambda.1se: largest lambda within one standard error of the minimum.
ind.lambda.1se: indices of lambda.1se.
cvup: upper error curve.
cvlo: lower error curve.
Computation can be performed sequentially (parallel = FALSE), or
in parallel (parallel = TRUE) using parLapply.
The number of cores is automatically determined based on system availability,
number of folds and user-specified maximum ncore_max.
PlotCvNet for visualization of the obtained cross-validation curve.
ProxGDNet for proximal network-penalized gradient descent algorithm details.
Performs pathway enrichment analysis to evaluate whether a set of
genes is over-represented in one or more pathways compared to a background set of genes.
For each pathway, it calculates the number of observed genes, the Fisher's exact test
p-value, and FDR-adjusted p-values. Significant pathways (padj < 0.05)
are marked with Yes in the highlight column.
Enrichment( genes, pathway_df, background_genes = NULL, min_genes = 2, top_n = 10, out_file = NULL )
genes |
Character vector specifying the list of selected gene symbols. |
pathway_df |
Data frame with at least the following columns:
|
background_genes |
Character vector specifying background gene set.
If |
min_genes |
Numeric value specifying the minimum number of background
genes that a pathway must have to be considered ( |
top_n |
Numeric value specifying the number of top pathways sorted by
adjusted p-value to return ( |
out_file |
Character string specifying the path to save the enrichment
results as an Excel file (.xlsx). If |
The function implements an over-representation analysis (ORA) workflow:
Intersects the input gene list with a background set (user-provided or derived from all pathway genes).
Filters pathways to retain only those with at least min_genes present in the background.
Performs Fisher's exact test for each pathway to assess over-representation.
Adjusts p-values using the false discovery rate (FDR) method.
Identifies significantly enriched pathways (padj < 0.05) and marks them in the highlight column.
Selects the top top_n pathways for visualization in dashboards or plots.
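The workflow above can be sketched with base R only. This is not the package's implementation: the column names of pathway_df (pathway, gene), the one-sided alternative, and the toy gene sets are assumptions made for illustration.

```r
# Toy inputs (hypothetical gene symbols and pathways)
genes      <- c("A", "B", "C", "D")
background <- LETTERS[1:20]
pathway_df <- data.frame(
  pathway = c(rep("P1", 4), rep("P2", 3)),
  gene    = c("A", "B", "C", "E", "F", "G", "H"),
  stringsAsFactors = FALSE
)
min_genes <- 2

# Step 1: restrict the selected genes to the background
selected <- intersect(genes, background)

# Step 2: keep pathways with at least min_genes background genes
members <- split(pathway_df$gene, pathway_df$pathway)
members <- lapply(members, intersect, background)
members <- members[lengths(members) >= min_genes]

# Step 3: one-sided Fisher's exact test per pathway
pval <- vapply(members, function(m) {
  in_sel_in_path   <- length(intersect(selected, m))
  in_sel_out_path  <- length(selected) - in_sel_in_path
  out_sel_in_path  <- length(m) - in_sel_in_path
  out_sel_out_path <- length(background) - in_sel_in_path -
    in_sel_out_path - out_sel_in_path
  tab <- matrix(c(in_sel_in_path, in_sel_out_path,
                  out_sel_in_path, out_sel_out_path), nrow = 2)
  fisher.test(tab, alternative = "greater")$p.value
}, numeric(1))

# Steps 4-5: FDR adjustment and highlighting
results <- data.frame(
  pathway = names(members),
  nGenes  = lengths(members),
  pval    = pval,
  padj    = p.adjust(pval, method = "BH")
)
results$highlight <- ifelse(results$padj < 0.05, "Yes", "No")
results <- results[order(results$padj), ]
```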
The results are automatically saved as an Excel file Enrichment_results.xlsx and are used by
PathwayDashboard to display enrichment results interactively
in the dedicated panel.
A list containing:
results: Full enrichment table with p-values and FDR correction,
including pathway, nGenes
(number of genes for pathway), pval,
padj, highlight
(Yes/No if the pathway is enriched), name.
bar_data: Top top_n enriched pathways.
PathwayDashboard for interactive visualization of enrichment results.
A pre-processed dataset containing clinical survival information and gene expression covariates for Lung Adenocarcinoma (TCGA-LUAD). This dataset allows users to bypass the computationally intensive download and preprocessing pipeline, providing immediate access to the covariate matrix, survival outcomes, and censoring indicators.
data(LUADdataset)
A list with the following components.
X_train : numeric matrix of training covariates.
X_test : numeric matrix of testing covariates.
Y_train : numeric vector of observed training survival times.
Y_test : numeric vector of observed testing survival times.
delta_train : integer vector of training censoring indicators.
delta_test : integer vector of testing censoring indicators.
Gene expression data (RNA-seq) were obtained from the LinkedOmics portal and processed to construct:
screened gene expression matrix X (samples × genes),
observed survival times Y (real scale),
censoring indicators delta (1 = event, 0 = censored).
The screening was performed using the BMD method (see VariableScreening)
focusing on disease-associated genes retrieved for doid = "DOID:1324"
via RepositoryDisease.
The dataset is pre-partitioned into a 70% training set for model estimation and a 30% testing set for validation.
https://linkedomics.org/data_download/TCGA-LUAD/
Computes a variety of performance metrics for survival models, supporting both real-data evaluation and simulation studies.
Metrics( Y_train = NULL, delta_train = NULL, X_test = NULL, Y_test = NULL, delta_test = NULL, beta_est, beta_true = NULL, model = NULL, p_active = NULL, times_auc = NULL, metrics = NULL )
Y_train |
Numeric vector of observed training survival times
(log-transformed under |
delta_train |
Integer vector of training censoring indicators (1 = event, 0 = censored). |
X_test |
Numeric matrix of testing covariates standardized using the training data. |
Y_test |
Numeric vector of observed testing survival times
(log-transformed under |
delta_test |
Integer vector of testing censoring indicators (1 = event, 0 = censored). |
beta_est |
Numeric vector of estimated regression coefficients obtained from the training set. |
beta_true |
Optional numeric vector of true regression coefficients. Required only for simulation-based metrics (FPR, FNR, PMSE). |
model |
Character string specifying the fitted survival model
( |
p_active |
Integer scalar specifying the number of truly active covariates,
required only when |
times_auc |
Optional numeric vector of time points at which the time-dependent AUC is evaluated.
If |
metrics |
Character vector specifying the performance measures to compute. Allowed values:
|
The predicted quantity depends on the model type:
For COXNet, PredRisk is the hazard ratio.
For AFTNet, PredRisk is proportional to the expected survival time.
Harrell's concordance index is computed using rcorr.cens.
The time-dependent AUC is computed using Uno's estimator via
AUC.uno at the specified time points.
The metrics FPR, FNR, and PMSE are defined only in
simulation settings because they require knowledge of the true regression
coefficients. When beta_true is not provided, these metrics are
returned as NA if requested.
All other metrics can be computed for both simulated and real datasets.
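To make the concordance index concrete, here is a hand-written base R sketch of Harrell's C for a risk score where higher values mean shorter survival. It is only a didactic illustration: the package is described as using rcorr.cens, and the function name harrell_c is hypothetical.

```r
# Harrell's C for right-censored data (higher risk = earlier event)
harrell_c <- function(time, status, risk) {
  concordant <- 0
  comparable <- 0
  n <- length(time)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # A pair is comparable when subject i has an observed event
      # strictly before the follow-up time of subject j
      if (status[i] == 1 && time[i] < time[j]) {
        comparable <- comparable + 1
        if (risk[i] > risk[j]) {
          concordant <- concordant + 1
        } else if (risk[i] == risk[j]) {
          concordant <- concordant + 0.5   # ties in risk count as half
        }
      }
    }
  }
  concordant / comparable
}

time   <- c(1, 2, 3, 4)
status <- c(1, 1, 0, 1)
c_perfect  <- harrell_c(time, status, risk = c(4, 3, 2, 1))
c_reversed <- harrell_c(time, status, risk = c(1, 2, 3, 4))
c_constant <- harrell_c(time, status, risk = rep(1, 4))
```

A value of 1 indicates perfect ranking, 0.5 is what an uninformative score achieves, and values below 0.5 indicate a reversed ranking.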
A named list containing the requested performance metrics.
Scalar metrics are returned as numeric values, PredRisk as
a numeric vector of predicted risk scores, and time-dependent AUC values
as separate list elements with names of the form "AUC_t_<time>".
Fits network-constrained penalized survival models (COXNet and AFTNet)
to identify prognostic signature genes and build a Prognostic Index (PI).
The model is trained on a training dataset by incorporating both Laplacian
constraints and LASSO regularization, with optional feature standardization.
The tuning parameters are jointly selected through cross-validation.
An optimal cutoff for the PI is estimated from the training data to enable
prognostic stratification. Predictive performance is subsequently evaluated
on an independent testing dataset. Model assessment includes survival curve
analyses and visualization. Predictive accuracy is quantified using selected metrics.
NetSurvProx( X_train, Y_train, delta_train, X_test, Y_test, delta_test, L = NULL, standardize_train = TRUE, standardize_test = TRUE, model = NULL, dist = NULL, select_lambda = TRUE, alpha_grid = c(0.3, 0.5, 0.7), nlambda = 50, lambda_ratio = 0.01, nfolds = 5, method = NULL, probs = seq(0.25, 0.8, by = 0.05), cutoffplot = FALSE, seed = 2026, value = 2, niter = 1000, conv = 0.001, parallel_cv = TRUE, plotCV = FALSE, colors_pcv = NULL, errorbar = FALSE, ncore_max = 5, p_active = NULL, times_auc = NULL, beta_true = NULL, metrics = NULL, verbose = FALSE, palette = NULL, plot_test = FALSE )
X_train |
Numeric matrix of training covariates standardized
(possibly screened using |
Y_train |
Numeric vector of observed training survival times (log-transformed under |
delta_train |
Integer vector of training censoring indicators (1 = event, 0 = censored). |
X_test |
Numeric matrix of testing covariates. |
Y_test |
Numeric vector of observed testing survival times (log-transformed under |
delta_test |
Integer vector of testing censoring indicators (1 = event, 0 = censored). |
L |
Optional positive semi-definite, symmetric, and diagonally dominant
Laplacian matrix encoding prior network information (see |
standardize_train |
Logical value indicating whether to standardize the training matrix:
if |
standardize_test |
Logical value indicating whether to standardize |
model |
Character string specifying the fitted survival model
( |
dist |
Character string specifying the |
select_lambda |
Logical value, if |
alpha_grid |
Numeric vector specifying the candidate values for |
nlambda |
Numeric value specifying the number of candidate values for
|
lambda_ratio |
Numeric value giving the ratio of minimum to maximum
|
nfolds |
Numeric value of folds performed for tuning optimal parameters ( |
method |
Character string specifying the cutoff selection method
( |
probs |
Vector of probabilities used when |
cutoffplot |
Logical value indicating whether survival curves should be produced
( |
seed |
Random seed for reproducibility ( |
value |
Numeric scalar greater than 1 specifying the multiplicative
factor used to increase the step-size constant during
backtracking line search in |
niter |
Maximum number of iterations for |
conv |
Convergence tolerance for ProxGDNet ( |
parallel_cv |
Logical value whether to use parallel processing for
|
plotCV |
Logical value indicating whether CV curves should be shown
( |
colors_pcv |
Optional named list of colors for CV plot (see |
errorbar |
Logical value, if |
ncore_max |
Maximum number of cores for parallel processing over CV ( |
p_active |
Numeric value indicating the number of truly active covariates (required for FPR/FNR computation in simulation settings). |
times_auc |
Numeric vector of time points for time-dependent AUC.
If |
beta_true |
Numeric vector of true coefficients (used only for simulated data). |
metrics |
Character vector specifying performance |
verbose |
Logical value, if |
palette |
Optional character vector of length 2 specifying colors used
for the survival curves. For |
plot_test |
Logical value, if |
An object of class NetSurvProx containing:
fit_training: training results (see NetSurvProx_Training).
fit_testing: testing results (see NetSurvProx_Testing).
# - Simulate 40 TFs, each regulating 10 targets with an independent structure -
targets <- 10
n <- 165
simul_data <- Simulations(
  n = n, r = 40, targets = targets, p_active = 40, rho = 0.70, rate = 0.50,
  b_true = c(0.8, 1.2, -1.2, -0.8), nsimul = 1, model = "AFTNet",
  baseline = "lognormal", sigma_true = 1, shared_scheme = NULL, choice = 1,
  save = FALSE, save_path = NULL, seed = 2026, verbose = TRUE)
X <- simul_data$X_list[[1]]
Y <- simul_data$time_list[[1]] # generated in log-scale
delta <- simul_data$delta_list[[1]]
L <- simul_data$L_list[[1]]
beta_true <- as.vector(unlist(simul_data$beta))

# - Split the dataset (training/testing sets) -
set.seed(2026)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
X_train <- X[train_idx,]
Y_train <- Y[train_idx]
delta_train <- delta[train_idx]
X_test <- X[-train_idx,]
Y_test <- Y[-train_idx]
delta_test <- delta[-train_idx]

# - Fitting LogNormal AFTNet -
out <- NetSurvProx(
  X_train, Y_train, delta_train, X_test, Y_test, delta_test, L = L,
  standardize_train = TRUE, standardize_test = TRUE,
  model = "AFTNet", dist = "lognormal", select_lambda = TRUE,
  alpha_grid = 0.5, nlambda = 50, lambda_ratio = 0.1, nfolds = 5,
  method = "minpvalue", probs = seq(0.25, 0.80, by = 0.05),
  cutoffplot = FALSE, seed = 2026, value = 2, niter = 1000, conv = 1e-3,
  parallel_cv = FALSE, plotCV = FALSE, colors_pcv = NULL, errorbar = FALSE,
  ncore_max = 1, p_active = 40, times_auc = NULL, beta_true = beta_true,
  metrics = "CIndex", verbose = FALSE, palette = NULL, plot_test = FALSE)

# - Results -
data.frame(out$fit_testing$performance)
Evaluates predictive performance of a fitted COXNet or AFTNet model
on an independent testing set. The function computes the Prognostic Index (PI)
using the selected signature genes and the optimal cutoff obtained from the
training phase, generates survival curves, PI distribution plots, and calculates
specified performance metrics.
NetSurvProx_Testing( X_train = NULL, standardize = TRUE, Y_train = NULL, delta_train = NULL, X_test, Y_test, delta_test, model = NULL, dist = NULL, beta, beta_true = NULL, opt_cutoff, p_active = NULL, times_auc = NULL, metrics = NULL, verbose = FALSE, plot = FALSE, palette = NULL )
X_train |
Numeric matrix of training covariates (used only to scale
|
standardize |
Logical value indicating whether to standardize |
Y_train |
Numeric vector of observed training survival times (log-transformed under |
delta_train |
Integer vector of training censoring indicators (1 = event, 0 = censored). Required only for time-dependent AUC computation. |
X_test |
Numeric matrix of testing covariates. |
Y_test |
Numeric vector of observed testing survival times (log-transformed under |
delta_test |
Integer vector of testing censoring indicators (1 = event, 0 = censored). |
model |
Character string specifying the fitted survival model
( |
dist |
Character string specifying the |
beta |
Numeric vector of regression coefficients estimated on the training set. |
beta_true |
Numeric vector of true coefficients (used only for simulated data). |
opt_cutoff |
Numeric value used to split the PI into two prognostic groups. |
p_active |
Numeric value indicating the number of truly active covariates (required for FPR/FNR computation in simulation settings). |
times_auc |
Numeric vector of time points for time-dependent AUC.
If |
metrics |
Character vector specifying performance metrics to compute.
For real datasets: |
verbose |
Logical value, if |
plot |
Logical value, if |
palette |
Optional character vector of length 2 specifying colors used
for the survival curves. For |
The testing set must be independent from the training set used in NetSurvProx_Training.
When standardize = TRUE, X_test is standardized using the mean and standard deviation
of X_train. Only covariates with non-zero coefficients in beta are retained for PI computation.
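The standardization and PI computation described above can be sketched in base R. The data are simulated, the cutoff value is arbitrary, and the direction of the grouping rule (PI above the cutoff labelled "High Risk") is an assumption for a COXNet-style fit.

```r
set.seed(2026)
X_train <- matrix(rnorm(30 * 5, mean = 2, sd = 3), nrow = 30, ncol = 5)
X_test  <- matrix(rnorm(10 * 5, mean = 2, sd = 3), nrow = 10, ncol = 5)
beta       <- c(0.7, 0, -0.4, 0, 0)   # hypothetical training estimates
opt_cutoff <- 0                        # hypothetical cutoff

# Standardize the test set with training means and standard deviations
mu   <- colMeans(X_train)
sdev <- apply(X_train, 2, sd)
X_test_std <- sweep(sweep(X_test, 2, mu, "-"), 2, sdev, "/")

# Keep only covariates with non-zero coefficients
active <- which(beta != 0)

# Prognostic Index as a linear predictor on the retained covariates
PI <- drop(X_test_std[, active, drop = FALSE] %*% beta[active])

# Assumed grouping rule for illustration
groupRisk <- ifelse(PI > opt_cutoff, "High Risk", "Low Risk")
```

Using the training moments rather than the test moments keeps the test set on the scale the coefficients were estimated on and avoids leaking test information into the preprocessing.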
Prognostic stratification is performed using ValidationPI, producing:
Kaplan–Meier curves and log-rank test for COXNet.
Parametric survival curves and likelihood ratio test for AFTNet.
PI distribution plots by prognostic group.
A list containing:
df: data frame with PI (computed for each subject),
Y, delta, and groupRisk
(prognostic group assigned based on opt_cutoff).
p_value: from the log-rank test (COXNet) or likelihood ratio test (AFTNet).
performance: named list with the requested performance metrics.
Metrics for available performance metrics options.
NetSurvProx_Training for training routine.
OptimalPICutoff for opt_cutoff estimation.
ValidationPI for PI validation and optional plot.
Trains penalized regression methods (COXNet or AFTNet) to incorporate
gene regulatory relationships and select signature genes using the training set.
Regularization parameters are selected via cross-validation, and an optimal
Prognostic Index (PI) cutoff is determined for risk stratification (COXNet) or
for survival time stratification (AFTNet). The procedure includes optional
feature standardization and simultaneous selection of the regularization
parameters for the Laplacian constraint and the Lasso penalty.
NetSurvProx_Training( X_train, Y_train, delta_train, L = NULL, model = NULL, dist = NULL, select_lambda = TRUE, alpha_grid = c(0.3, 0.5, 0.7), nlambda = 50, lambda_ratio = 0.01, nfolds = 5, method = NULL, probs = seq(0.25, 0.8, by = 0.05), cutoffplot = FALSE, seed = 2026, value = 2, niter = 1000, conv = 0.001, parallel = TRUE, plotCV = FALSE, colors_pcv = NULL, errorbar = FALSE, ncore_max = 5, standardize = TRUE, verbose = FALSE, palette = NULL )
X_train |
Numeric matrix of training covariates standardized
(possibly screened using |
Y_train |
Numeric vector of observed training survival times (log-transformed under |
delta_train |
Integer vector of training censoring indicators (1 = event, 0 = censored). |
L |
Optional positive semi-definite, symmetric, and diagonally dominant
Laplacian matrix encoding prior network information. If |
model |
Character string specifying the fitted survival model
( |
dist |
Character string specifying the |
select_lambda |
Logical value, if |
alpha_grid |
Numeric vector specifying the candidate values for |
nlambda |
Numeric value specifying the number of candidate values for
|
lambda_ratio |
Numeric value giving the ratio of minimum to maximum
|
nfolds |
Number of cross-validation folds ( |
method |
Character string specifying the cutoff selection method
( |
probs |
Vector of probabilities used when |
cutoffplot |
Logical value indicating whether survival curves should be produced
( |
seed |
Random seed for reproducibility ( |
value |
Numeric scalar greater than 1 specifying the multiplicative
factor used to increase the step-size constant during
backtracking line search ( |
niter |
Maximum number of iterations for ProxGDNet ( |
conv |
Convergence tolerance for ProxGDNet ( |
parallel |
Logical value whether to use parallel processing for CvNet ( |
plotCV |
Logical value indicating whether cross-validation curves should be shown
( |
colors_pcv |
Optional named list of colors:
If |
errorbar |
Logical value, if |
ncore_max |
Maximum number of cores for parallel processing over CV ( |
standardize |
Logical value indicating whether to standardize the input matrix:
if |
verbose |
Logical value, if |
palette |
Optional character vector of length 2 specifying colors used
for the survival curves. For |
The function performs joint tuning for regularization parameters:
a grid of alpha values in (0, 1) is constructed, and for each candidate alpha
the corresponding lambda grid is computed using the gradient of the negative
(partial for COXNet) log-likelihood and evaluated via cross-validation.
Parallel computation is supported to improve efficiency.
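One common way to build such grids is sketched below in base R. It assumes a glmnet-style construction in which the largest lambda is derived from the gradient at beta = 0 and the grid is log-spaced down to lambda_ratio times that value. Whether the package uses exactly this rule is an assumption, and the gradient vector is made up.

```r
alpha_grid   <- c(0.3, 0.5, 0.7)
nlambda      <- 50
lambda_ratio <- 0.01

# Hypothetical gradient of the negative log-likelihood at beta = 0
grad0 <- c(-1.2, 0.4, 0.9, -0.1)

# One descending, log-spaced lambda grid per candidate alpha
lambda_grids <- lapply(alpha_grid, function(alpha) {
  lambda_max <- max(abs(grad0)) / alpha   # assumed rule for the largest lambda
  exp(seq(log(lambda_max), log(lambda_max * lambda_ratio),
          length.out = nlambda))
})
names(lambda_grids) <- paste0("alpha_", alpha_grid)
```

Each grid would then be passed to cross-validation, and the (alpha, lambda) pair with the best criterion retained.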
A list containing:
alpha.opt: numeric value of optimal alpha.
lambda.opt: numeric value of optimal lambda.
beta: estimated regression coefficients.
index.nonzerobeta: index of non-zero beta.
lambda.min: value of lambda minimizing the CV error.
lambda.1se: largest lambda within one standard error of the minimum.
cutoff.opt: numeric value of optimal prognostic index cutoff.
lambda.grid: grid of regularization parameters values.
cv.err.linPred: cross-validated error for each value of lambda.
cv.err.obj: estimated standard error associated with each value of CV error.
full_summary: data.frame as summary of CV results for all tested alpha values.
CreateNetwork: for L matrix computation.
CvNet: for CV and parallel processing details.
PlotCvNet: for cross-validation plot.
OptimalPICutoff: for the optimal cutoff value to stratify observations.
ProxGDNet: for proximal network-penalized gradient descent algorithm details.
VariableScreening: for the screen_vars list.
Identifies the optimal cutoff value of a Prognostic Index (PI)
to stratify subjects into prognostic groups. It supports COXNet and AFTNet
models with several distributions.
OptimalPICutoff( X, Y, delta, beta, method = NULL, model = NULL, dist = NULL, probs = seq(0.25, 0.8, by = 0.05), plot = FALSE, palette = NULL )
X |
Numeric matrix of covariates. |
Y |
Numeric vector of observed survival times (log-transformed under |
delta |
Integer vector of censoring indicators (1 = event, 0 = censored). |
beta |
Numeric vector of estimated regression coefficients obtained from the training set. |
method |
Character string specifying the cutoff selection method
( |
model |
Character string specifying the fitted survival model
( |
dist |
Character string specifying the |
probs |
Vector of probabilities used when |
plot |
Logical value indicating whether survival curves should be produced
( |
palette |
Optional character vector of length 2 specifying colors used
for the survival curves. For |
The Prognostic Index (PI) is computed as a linear predictor. Two alternative strategies are available to define the cutoff.
Median-based cutoff: Subjects are dichotomized as follows:
COXNet: PI median is High Risk, otherwise Low Risk.
AFTNet: - PI median is Short Survival, otherwise Long Survival.
Minimum p-value approach: A grid of candidate cutoffs is generated from the quantiles of the PI. For each candidate:
The cohort is dichotomized according to the model-specific direction.
Two models are fitted (full model including the group indicator, and null model without the group indicator).
A likelihood ratio (LR) test is performed between the two models.
Model fitting is performed using survival::coxph() for COXNet, or survival::survreg() for AFTNet.
The raw p-values are adjusted for multiple testing using the Benjamini–Hochberg procedure. The optimal cutoff corresponds to the smallest adjusted p-value.
If plot = TRUE, survival curves are generated (Kaplan–Meier curves for COXNet,
parametric survival curves based on the selected distribution for AFTNet).
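The minimum p-value search described above can be sketched as follows for the Cox case. This is a minimal illustration on simulated data, not the package's internal code: the data-generating step, the strict inequality `PI > cutoff`, and all object names are assumptions made for the example. It relies only on survival::coxph() and stats::p.adjust().

```r
library(survival)

set.seed(1)
n <- 200
X <- matrix(rnorm(n * 3), n, 3)
beta <- c(0.8, -0.5, 0.3)                  # stand-in for training-set estimates
PI <- drop(X %*% beta)                     # prognostic index = linear predictor

# Simulated survival data (assumption: exponential times, higher PI = higher hazard)
time <- rexp(n, rate = 0.1 * exp(PI))
cens <- rexp(n, rate = 0.03)
Y <- pmin(time, cens)
delta <- as.integer(time <= cens)

# Candidate cutoffs taken from the quantiles of the PI
probs <- seq(0.25, 0.8, by = 0.05)
cutoffs <- quantile(PI, probs)

# For each candidate: dichotomize, fit the Cox model, LR test against the null model
pvals <- vapply(cutoffs, function(cval) {
  group <- as.integer(PI > cval)
  fit <- coxph(Surv(Y, delta) ~ group)
  lr <- 2 * diff(fit$loglik)               # loglik = c(null model, fitted model)
  pchisq(lr, df = 1, lower.tail = FALSE)
}, numeric(1))

# Benjamini-Hochberg adjustment, then pick the smallest adjusted p-value
padj <- p.adjust(pvals, method = "BH")
best <- which.min(padj)
res <- data.frame(quantile = probs, cutoff = unname(cutoffs),
                  p = unname(pvals), p_adj = unname(padj))
res[best, ]
```

For AFTNet the same loop would compare two survival::survreg() fits instead of using the two log-likelihoods stored by coxph().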
For method = "median", a list with
cutoff: numeric cutoff value.
PI.data: data frame containing the PI, survival time, status,
and group labels.
For method = "minpvalue", the list additionally contains:
summary: table of p-values across candidate quantiles.
optimal: optimal cutoff information (quantile, cutoff value,
raw and adjusted p-values).
Constructs interactive pathway analysis networks and generates an HTML dashboard from a list of genes. Pathways can be retrieved via KEGG database or provided through a custom file.
PathwayDashboard( genes_list, header = TRUE, useKeggAPI = TRUE, pathway_file = NULL, nodesCols = c("#5C7997", "#F5C59F"), diseaseNodes = FALSE, disease_file = NULL, top_percent = 20, batch_size = 10, background_genes = NULL, min_genes = 2, top_n = 10, db_name = "org.Hs.eg.db", organism = "hsa", out_dir = NULL, open_browser = TRUE, verbose = FALSE )
genes_list |
Character vector of gene symbols, a file path to a tab-delimited file, or a data frame where the first column contains gene symbols. |
header |
Logical value indicating whether the input file has a header (default: TRUE). |
useKeggAPI |
Logical value indicating whether to use the KEGG REST API
to retrieve pathways (default: TRUE). |
pathway_file |
Optional data frame or file path containing custom pathway data.
Required if useKeggAPI = FALSE. |
nodesCols |
Character vector of length 2 defining node colors.
First color for regular nodes, second for highlighted nodes (when |
diseaseNodes |
Logical value indicating whether to highlight
disease-associated nodes ( |
disease_file |
Optional file path or data frame containing disease-associated gene scores. Must have at least two columns: gene and score. |
top_percent |
Numeric value indicating the percentage of top genes
to highlight based on |
batch_size |
Numeric value indicating the batch size for KEGG API queries (default: 10). |
background_genes |
Optional vector of background genes for enrichment analysis. |
min_genes |
Numeric value indicating minimum number of genes in a pathway
to be considered (default: 2). |
top_n |
Numeric value indicating the number of top pathways to display
in the dashboard (default: 10). |
db_name |
Character string specifying the Bioconductor Annotation DB name for gene
mapping (default: "org.Hs.eg.db"). |
organism |
Character string specifying KEGG organism code (default: "hsa"). |
out_dir |
Character string specifying output directory for results. |
open_browser |
Logical value; if |
verbose |
Logical value, if |
Workflow implemented by the function:
Converts gene symbols to Entrez IDs for KEGG queries and maps back to gene symbols after pathway retrieval.
Retrieves pathways using KEGG API if useKeggAPI = TRUE, otherwise uses pathway_file.
Constructs a gene-pathway binary incidence matrix (genes as rows, pathways as columns).
Builds an igraph network where genes are nodes and edges link genes in the same pathways.
Assigns node colors based on connectivity and optional disease association.
Highlights top genes by connectivity or disease association using nodesCols and top_percent.
Saves network information in network_data.rds and optionally renders an interactive HTML dashboard
(Dashboard.html).
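The incidence-matrix and network-building steps of this workflow can be illustrated with base R alone. The sketch below uses a small hypothetical pathway table (the pathway labels and gene assignments are invented for the example) and a plain adjacency matrix in place of the igraph object the function builds.

```r
# Hypothetical pathway-to-gene table (illustrative data only)
pathway_df <- data.frame(
  pathway = c("P1", "P1", "P1", "P2", "P2", "P3"),
  gene    = c("TP53", "EGFR", "KRAS", "EGFR", "BRAF", "MYC"),
  stringsAsFactors = FALSE
)

# Binary incidence matrix: genes as rows, pathways as columns
inc <- unclass(table(pathway_df$gene, pathway_df$pathway))
inc[inc > 0] <- 1L

# Gene-gene matrix counting shared pathways; off-diagonal > 0 means an edge
co <- tcrossprod(inc)
diag(co) <- 0

# Edge list (each undirected edge once) and node connectivity
idx <- which(co > 0 & upper.tri(co), arr.ind = TRUE)
edge_df <- data.frame(
  from     = rownames(co)[idx[, "row"]],
  to       = colnames(co)[idx[, "col"]],
  n_shared = co[idx]
)
degree <- rowSums(co > 0)

edge_df
sort(degree, decreasing = TRUE)
```

Genes with the highest connectivity in `degree` correspond to the candidates that a top_percent rule would highlight; genes with degree zero (here MYC) are input genes that are not connected to any other gene.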
The network_data.rds object contains:
g: igraph object representing the network.
edge_info: data frame with edges, colors, and pathway labels.
legend_info: legend codes, colors, and counts for pathways.
all_genes, conn_genes: all input genes and connected genes.
node_colours: node colors and borders for plotting.
pathway_df: data frame of pathways and genes.
background, min_genes, top_n: parameters.
Saves:
network_data.rds: serialized network object for later use.
Dashboard.html: interactive dashboard showing network and enrichment panels.
If useKeggAPI = TRUE, the function queries the KEGG REST API to
retrieve pathway information. An active internet connection is required in this case.
Moreover, gene names conversion relies on local Bioconductor Annotation DBs (e.g., org.Hs.eg.db).
The function returns paths to generated files but does not print to console
or open files unless explicitly requested.
Enrichment for pathway enrichment results.
COXNet and AFTNet
Produces a ggplot2 visualization of the cross-validation curve obtained
from CvNet. The plot displays the CV error as a function of the regularization parameter lambda,
with optional error bars and reference lines for
lambda.min and lambda.1se.
PlotCvNet(cv.out, alpha = NULL, errorbar = FALSE, colors = NULL)
cv.out |
Object of class
|
alpha |
Numeric parameter controlling the convex combination of the two
penalty terms (value in |
errorbar |
Logical value, if |
colors |
Optional named list of colors:
If |
A ggplot2 object showing the CV-LP curve.
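To make the two reference lines concrete, the sketch below computes lambda.min and lambda.1se from a toy CV curve. It assumes the usual one-standard-error convention (the largest lambda whose mean CV error is within one standard error of the minimum); whether CvNet uses exactly this rule is an assumption, and the vectors `lambda`, `cvm`, and `cvsd` are hypothetical names and values.

```r
# Toy cross-validation summary (illustrative values only)
lambda <- c(1, 0.5, 0.25, 0.125, 0.0625)   # regularization grid
cvm    <- c(2.0, 1.5, 1.2, 1.1, 1.15)      # mean CV error per lambda
cvsd   <- c(0.2, 0.15, 0.12, 0.12, 0.13)   # standard error per lambda

# lambda.min: value of lambda with the smallest mean CV error
i_min <- which.min(cvm)
lambda.min <- lambda[i_min]

# lambda.1se: largest lambda whose error is within one SE of the minimum
threshold <- cvm[i_min] + cvsd[i_min]
lambda.1se <- max(lambda[cvm <= threshold])

c(lambda.min = lambda.min, lambda.1se = lambda.1se)
```

Choosing lambda.1se instead of lambda.min trades a small increase in CV error for a sparser model.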
COXNet and AFTNet
Estimates the regression coefficients in COXNet and AFTNet models
using a proximal gradient descent algorithm. The objective function combines
the normalized negative (partial) log-likelihood with an L1 (LASSO) penalty
and a Laplacian regularization term.
ProxGDNet( X, Y, delta, L = NULL, beta0, alpha, lambda, model = NULL, dist = NULL, sigma = NULL, value = 2, niter = 1000, conv = 0.001 )
X |
Numeric matrix of standardized covariates. |
Y |
Numeric vector of observed survival times (log-transformed under AFTNet). |
delta |
Integer vector of censoring indicators (1 = event, 0 = censored). |
L |
Optional positive semi-definite, symmetric, and diagonally dominant
Laplacian matrix encoding prior network information
(see |
beta0 |
Numeric vector of initial regression coefficients. |
alpha |
Numeric parameter controlling the convex combination of the two
penalty terms (value in |
lambda |
Non-negative regularization parameter. |
model |
Character string specifying the fitted survival model
("COXNet" or "AFTNet"). |
dist |
Character string specifying the error distribution in
|
sigma |
Positive numeric scalar representing the scale parameter of the
error distribution in |
value |
Numeric scalar greater than 1 specifying the multiplicative
factor used to increase the step-size constant during
backtracking line search (default: 2). |
niter |
Maximum number of iterations (default: 1000). |
conv |
Convergence tolerance (default: 0.001). |
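The algorithm minimizes an objective function of the form
$$F(\beta) = -\frac{1}{n}\,\ell(\beta) + \lambda\left[\alpha\,\|\beta\|_1 + (1-\alpha)\,\beta^\top L\,\beta\right],$$
where $\ell(\beta)$ is the log-likelihood (partial for COXNet),
$\|\beta\|_1$ is the LASSO penalty, and $\beta^\top L\,\beta$ is
the Laplacian constraint.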
At each iteration, the method runs a backtracking line search to enforce a sufficient decrease condition, adapting the gradient step size (initialized from the Lipschitz constant) as needed.
The algorithm stops when either the maximum number of iterations niter is reached
or the relative change in the objective function between consecutive iterations
falls below the specified tolerance conv.
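The iteration described above can be sketched in a few lines of R. To keep the example short and self-contained, a squared-error loss stands in for the negative (partial) log-likelihood; this substitution, the penalty scaling lambda * [alpha * L1 + (1 - alpha) * Laplacian], the initial step-size constant, and the function name `prox_gd_sketch` are all assumptions for illustration and not the package's implementation.

```r
# Proximal operator of the L1 norm (soft-thresholding)
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

prox_gd_sketch <- function(X, y, L, alpha, lambda, value = 2,
                           niter = 1000, conv = 1e-3) {
  n <- nrow(X)
  # Smooth part: stand-in loss plus Laplacian quadratic penalty
  smooth <- function(b) sum((y - X %*% b)^2) / (2 * n) +
    lambda * (1 - alpha) * drop(t(b) %*% L %*% b)
  grad <- function(b) drop(-crossprod(X, y - X %*% b)) / n +
    2 * lambda * (1 - alpha) * drop(L %*% b)
  objective <- function(b) smooth(b) + lambda * alpha * sum(abs(b))

  beta <- rep(0, ncol(X))
  Lc <- 1                                  # step-size constant; step = 1 / Lc
  obj_old <- objective(beta)
  for (k in seq_len(niter)) {
    g <- grad(beta)
    repeat {                               # backtracking line search
      cand <- soft_threshold(beta - g / Lc, lambda * alpha / Lc)
      d <- cand - beta
      # Sufficient decrease condition on the smooth part
      if (smooth(cand) <= smooth(beta) + sum(g * d) +
          Lc / 2 * sum(d^2) + 1e-12) break
      Lc <- Lc * value                     # shrink the step by the factor `value`
    }
    beta <- cand
    obj_new <- objective(beta)
    # Early stopping on the relative change in the objective
    if (abs(obj_old - obj_new) / max(abs(obj_old), 1e-12) < conv) break
    obj_old <- obj_new
  }
  list(beta = beta, objective = obj_new, iterations = k)
}

# Toy usage: chain-graph Laplacian linking neighbouring covariates
set.seed(123)
p <- 5
A <- matrix(0, p, p)
A[cbind(1:4, 2:5)] <- 1
A <- A + t(A)
L <- diag(rowSums(A)) - A
X <- scale(matrix(rnorm(100 * p), 100, p))
y <- drop(X %*% c(1, 1, 0, 0, 0)) + rnorm(100, sd = 0.5)
fit <- prox_gd_sketch(X, y, L, alpha = 0.5, lambda = 0.05)
fit$beta
```

The returned list mirrors the beta, objective, and iterations components documented for ProxGDNet.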
A list with the following components
beta: numeric vector of estimated regression coefficients.
objective: numeric scalar, the final value of the objective function.
iterations: number of iterations performed until convergence
(or until the maximum number of iterations niter is reached).
Download disease-associated gene predictions from the HumanBase resource. The function retrieves gene-level association scores for a given Disease Ontology ID (DOID) and returns a tidy data frame containing gene identifiers and scores.
RepositoryDisease( doid = NULL, cache = FALSE, cache_dir = NULL, verbose = FALSE )
doid |
Character string specifying Disease Ontology ID ( |
cache |
Logical value; if |
cache_dir |
Character string specifying a directory used to cache
downloaded HumanBase files (when |
verbose |
Logical value, if |
A data frame with three columns:
entrez_id: Entrez gene identifier.
standard_name: Gene symbol.
score: Association score from HumanBase.
An active internet connection is required.
# - Download disease-specific gene repository for Lung Adenocarcinoma -
disease_genes <- RepositoryDisease(
  doid = "DOID:1324", cache = FALSE, cache_dir = NULL, verbose = FALSE
)$standard_name
head(disease_genes)
Downloads the top edge gene interaction network for a specific human tissue from the HumanBase resource.
RepositoryTissue( tissue = NULL, cache = FALSE, cache_dir = NULL, verbose = FALSE )
tissue |
Character string specifying the name of the tissue to download. Spaces will automatically be converted to underscores. |
cache |
Logical value; if |
cache_dir |
Character string specifying a directory used to cache
downloaded HumanBase files (when |
verbose |
Logical value, if |
A data.frame with tissue-specific gene interactions (columns:
gene1, gene2, and score).
An active internet connection is required.
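An edge list with columns gene1, gene2, and score can be turned into a symmetric weighted adjacency matrix, which is the natural input for a network penalty. The sketch below uses a small invented edge list (gene names and scores are hypothetical) instead of a live download.

```r
# Hypothetical edge list in the documented gene1 / gene2 / score layout
edges <- data.frame(
  gene1 = c("A", "A", "B"),
  gene2 = c("B", "C", "C"),
  score = c(0.9, 0.4, 0.7),
  stringsAsFactors = FALSE
)

# Symmetric weighted adjacency matrix indexed by gene name
genes <- sort(unique(c(edges$gene1, edges$gene2)))
W <- matrix(0, length(genes), length(genes), dimnames = list(genes, genes))
W[cbind(edges$gene1, edges$gene2)] <- edges$score
W[cbind(edges$gene2, edges$gene1)] <- edges$score

W
```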
# - Download tissue-specific repository for Lung Adenocarcinoma -
tissue <- RepositoryTissue(
  tissue = "lung", cache = FALSE, cache_dir = NULL, verbose = FALSE
)
head(tissue)
Generates structured gene expression data based on TFs and their regulated
target genes, together with survival outcomes simulated from COXNet
or AFTNet models. The function supports both independent and interconnected
TF modules with user-defined shared targets via shared_scheme.
Simulations( n, r, targets, p_active, rho = 0.7, rate = 0.5, b_true = c(0.8, 1.2, -1.2, -0.8), nsimul = 10, model = NULL, baseline = NULL, phi = 0.1, sigma_true = 1, breaks = c(0, 6, 36, 60), hazards = c(0.15, 0.005, 0.1), shared_scheme = NULL, choice = 1, save = FALSE, save_path = NULL, seed = 2026, verbose = FALSE )
n |
Number of observations. |
r |
Number of transcription factors (TFs); for interconnected modules, at least 4 TFs are recommended. |
targets |
Number of target genes regulated by each TF. |
p_active |
Number of truly active predictors (non-zero coefficients). |
rho |
Correlation between each TF and its targets (default: 0.7). |
rate |
Desired censoring proportion (default: 0.5). |
b_true |
Numeric vector of length 4 |
nsimul |
Number of simulated datasets (default: 10). |
model |
Character string specifying the survival model used for simulation
("COXNet" or "AFTNet"). |
baseline |
Character string specifying baseline hazard distribution.
|
phi |
Numeric value of frailty parameter for |
sigma_true |
Positive numeric scalar representing the scale parameter of the
error distribution in |
breaks |
Numeric vector of time breakpoints for piecewise exponential hazards
(required if |
hazards |
Numeric vector of hazard rates corresponding to each interval in |
shared_scheme |
List defining interconnected TF modules. If
|
choice |
Value specifying the choice for the signs of the adjacency matrix
|
save |
Logical value, if |
save_path |
Character string specifying an existing directory used only when
|
seed |
Random seed for reproducibility ( |
verbose |
Logical value, if |
The total number of predictors is given by p = r × (targets + 1),
where each TF contributes one regulatory variable in addition to its associated
target genes.
The function supports two alternative network topologies:
Independent structure: each TF regulates its own targets independently.
Interconnected structure: TFs listed in the same shared_scheme entry
co-regulate the number of genes given in shared and additionally regulate their own
genes, as given in unique.
These regulatory relationships are encoded in the adjacency matrix, which exhibits a block-diagonal structure under independence, and introduces cross-connections between TFs and shared targets when modules are specified.
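For the independent structure, the block-diagonal adjacency matrix and a corresponding graph Laplacian can be sketched as below. The variable ordering (each TF immediately followed by its targets), the unweighted edges, and the unnormalized Laplacian D - A are assumptions made for the illustration; the function may order, weight, or normalize the network differently.

```r
r <- 3                      # number of TFs
targets <- 4                # target genes per TF
p <- r * (targets + 1)      # one variable per TF plus its targets

# Block-diagonal adjacency: each TF is linked to its own targets only
A <- matrix(0, p, p)
for (k in seq_len(r)) {
  tf <- (k - 1) * (targets + 1) + 1        # position of the k-th TF
  genes <- tf + seq_len(targets)           # positions of its targets
  A[tf, genes] <- 1
  A[genes, tf] <- 1
}

# Unnormalized graph Laplacian
L <- diag(rowSums(A)) - A

# No edges between different TF modules
A[1:(targets + 1), (targets + 2):p]
```

Under an interconnected scheme, shared targets would add non-zero entries outside these diagonal blocks.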
Survival times are generated according to the chosen baseline distribution and
linear predictors derived from the simulated gene expression data.
Optional frailty effects and censoring are included, with the censoring mechanism
calibrated to achieve the desired censoring proportion specified by rate.
The function also returns the true regression coefficients, allowing the user to evaluate variable selection performance using measures such as false positive and false negative rates.
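As a small illustration of that evaluation, the helper below compares a vector of true coefficients with an estimate and reports the false positive and false negative rates. The function name `selection_rates` and the example vectors are hypothetical and not part of the package.

```r
# False positive / false negative rates of a selected support
selection_rates <- function(beta_true, beta_hat, tol = 0) {
  truth  <- beta_true != 0                 # truly active predictors
  picked <- abs(beta_hat) > tol            # predictors selected by the model
  c(FPR = sum(picked & !truth) / sum(!truth),
    FNR = sum(!picked & truth) / sum(truth))
}

beta_true <- c(0.8, 1.2, -1.2, -0.8, 0, 0, 0, 0)
beta_hat  <- c(0.5, 0.9,  0,   -0.4, 0, 0.1, 0, 0)
rates <- selection_rates(beta_true, beta_hat)
rates
```

Here one of the four inactive predictors is wrongly selected and one of the four active predictors is missed, so both rates equal 0.25.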
A list with the following components:
X_list: list of simulated design matrices.
beta_list: list of true regression coefficient vectors.
time_list: list of observed survival times (log-transformed under AFTNet).
delta_list: list of censoring indicators (1 = event, 0 = censored).
L_list: list of Laplacian matrices representing the TF–gene regulatory network.
# - Simulate interconnected structure under Weibull-COXNet model -
targets <- 10
s1 <- 5
s2 <- 3
shared_scheme <- list(
  list(tfs = c(1, 3), shared = s1, unique = c(targets - s1, targets - s1)),
  list(tfs = c(2, 4), shared = s2, unique = c(targets - s2, targets - s2))
)
simul_data <- Simulations(
  n = 165, r = 40, targets = targets, p_active = 40,
  b_true = c(0.8, 1.2, -1.2, -0.8), rate = 0.3, nsimul = 1,
  model = "COXNet", baseline = "weibull",
  shared_scheme = shared_scheme, seed = 2026, verbose = FALSE
)
# Extract the Laplacian matrix (component L_list of the returned list)
L <- simul_data$L_list[[1]]
# This matrix uncovers the topological overlap between TFs:
# TF1 and TF3 co-regulate 5 genes, while TF2 and TF4 share 3 target genes.
Validates a Prognostic Index (PI) obtained from a fitted survival model
(COXNet or AFTNet) on an independent testing set.
Given the estimated regression coefficients, it computes the PI for each subject,
assigns prognostic groups using a pre-specified optimal cutoff, and evaluates
survival separation and statistical significance.
ValidationPI( X, Y, delta, beta, opt_cutoff, model = NULL, dist = NULL, plot = FALSE, palette = NULL )
X |
Numeric matrix of testing covariates scaled using the training data. |
Y |
Numeric vector of observed testing survival times (log-transformed under AFTNet). |
delta |
Integer vector of testing censoring indicators (1 = event, 0 = censored). |
beta |
Numeric vector of estimated regression coefficients obtained from the training set. |
opt_cutoff |
Numeric cutoff value used to split the PI into two prognostic groups. |
model |
Character string specifying the fitted survival model
("COXNet" or "AFTNet"). |
dist |
Character string specifying the |
plot |
Logical value, if |
palette |
Optional character vector of length 2 specifying colors used
for the survival curves. For |
For COXNet, Kaplan-Meier survival curves are computed, a log-rank test
is performed, and the PI is compared to opt_cutoff
to define High Risk and Low Risk groups.
For AFTNet, parametric survival curves are computed using the specified
distribution, a likelihood ratio test is performed, and the
PI is compared to opt_cutoff to define Short Survival and
Long Survival groups.
The function also produces:
Survival curves with group-specific colors,
Risk tables (number-at-risk) aligned with survival curves,
Distribution plots of the PI across groups.
A list containing:
df: data frame with columns PI (prognostic index for each subject), Y, delta,
groupRisk (assigned prognostic group based on opt_cutoff),
p_value: from the log-rank test (COXNet) or likelihood ratio test (AFTNet),
measuring survival separation between groups.
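The COXNet branch of this validation can be sketched with the survival package as follows. The simulated test data, the cutoff value, and the strict inequality `PI > opt_cutoff` are assumptions for the example; the log-rank test uses survival::survdiff().

```r
library(survival)

set.seed(7)
n <- 120
X_test <- matrix(rnorm(n * 3), n, 3)       # stand-in for scaled test covariates
beta <- c(0.8, -0.5, 0.3)                  # stand-in for training-set estimates
opt_cutoff <- 0                            # stand-in for the training-set cutoff

# Simulated test outcomes (assumption: exponential times, higher PI = higher hazard)
PI <- drop(X_test %*% beta)
time <- rexp(n, rate = 0.1 * exp(PI))
cens <- rexp(n, rate = 0.03)
Y <- pmin(time, cens)
delta <- as.integer(time <= cens)

# Assign prognostic groups with the pre-specified cutoff
groupRisk <- factor(ifelse(PI > opt_cutoff, "High Risk", "Low Risk"),
                    levels = c("Low Risk", "High Risk"))

# Log-rank test for survival separation between the two groups
lr <- survdiff(Surv(Y, delta) ~ groupRisk)
p_value <- pchisq(lr$chisq, df = 1, lower.tail = FALSE)

df <- data.frame(PI, Y, delta, groupRisk)
head(df)
p_value
```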
OptimalPICutoff for opt_cutoff value selection.
Reduces the high-dimensional feature space to a more manageable subset of variables by applying one of three screening strategies:
BMD (Biomedical-driven): selects covariates based on prior biomedical knowledge about their relevance to the disease under investigation,
DAD (Data-driven): selects features using component-wise estimators obtained from the chosen penalized model,
BMD+DAD: combines both biomedical knowledge and data-driven insights.
VariableScreening( X, Y, delta, disease_genes, screening = NULL, model = NULL, dist = NULL, rank_method = NULL, d = NULL, standardize = TRUE, verbose = FALSE )
X |
Numeric matrix of covariates. |
Y |
Numeric vector of observed survival times (log-transformed under AFTNet). |
delta |
Integer vector of censoring indicators (1 = event, 0 = censored). |
disease_genes |
Character vector containing the names of genes known to be associated with diseases. |
screening |
Character string specifying the screening method
( |
model |
Character string specifying the fitted survival model
("COXNet" or "AFTNet"). |
dist |
Character string specifying the AFTNet distribution.
Must be one of |
rank_method |
Character string specifying the ranking criterion for DAD-based screening:
|
d |
Numeric value representing the threshold for top-ranked features to select
in DAD-based screening ( |
standardize |
Logical value indicating whether to standardize the input matrix in DAD-based screening:
|
verbose |
Logical value, if |
The function uses marginal ranking approaches to select features based on their association with survival outcomes.
In the BMD approach, prior knowledge comes from literature or external biological databases such as HumanBase.
The DAD screening computes marginal regression coefficients to rank features according to their estimated importance under the selected model:
absmg: top d covariates by largest
absolute marginal coefficients.
mg: top d covariates by largest marginal
coefficients, preserving the direction.
mgpadj: top d covariates passing significance
thresholds based on adjusted p-values.
The BMD+DAD combines prior biological knowledge and data-driven selection for comprehensive feature screening.
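A data-driven ranking in the spirit of absmg can be sketched with one unpenalized Cox fit per covariate. This is an illustration only: the package computes its component-wise estimators under the chosen penalized model, whereas the sketch uses plain survival::coxph() fits, simulated data, and hypothetical gene names g1, ..., g20.

```r
library(survival)

set.seed(42)
n <- 150
p <- 20
X <- scale(matrix(rnorm(n * p), n, p))     # standardized covariates
colnames(X) <- paste0("g", seq_len(p))

# Simulated outcomes: only g1 and g2 affect the hazard (assumption)
time <- rexp(n, rate = 0.1 * exp(X[, 1] - X[, 2]))
cens <- rexp(n, rate = 0.05)
Y <- pmin(time, cens)
delta <- as.integer(time <= cens)

# Marginal coefficient of each covariate, fitted one at a time
marg <- vapply(colnames(X), function(j) {
  unname(coef(coxph(Surv(Y, delta) ~ X[, j])))
}, numeric(1))

# absmg-style rule: keep the d covariates with largest absolute coefficient
d <- 5
screen_vars <- names(sort(abs(marg), decreasing = TRUE))[seq_len(d)]
screen_vars
```

A BMD+DAD style combination could then merge `screen_vars` with the names in disease_genes, for example via union().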
A list containing selected variable names screen_vars.
CreateNetwork or RepositoryDisease for the disease_genes names.