We introduce a Pseudo-value Regression Approach for Network Analysis (PRANA) [1]. To our knowledge, this is the first attempt of utilizing a regression modeling for the differential network (DN) analysis by collective gene expression levels under two experimental conditions (e.g. ‘current’ vs. ‘non-current smokers’ or ‘high-risk’ vs. ‘low-risk’). We start from the mutual information (MI) criteria, followed by pseudo-value calculations, which are then entered into a robust regrssion model.
Please install these R packages prior to use PRANA. Note that minet package is available in Bioconductor. Please see below for further instruction. Of important note (as of July 2024), dnapath R package is archived in CRAN as of May 2024, which no longer becomes a requirement for the installation of PRANA. Instead, run_aracne() function is available in our package as of version 1.0.5 to obtain mutual information (MI) estimate via ARACNE for network estimation step.
# library(dplyr)
# library(parallel) # To use mclapply() when reestimating the association matrix.
# library(robustbase) # To fit a robust regression
# library(minet) # To estimate network via ARACNE
# Please run the following lines to install minet package
# from Bioconductor in your R console:
#if (!require("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
#BiocManager::install("minet")
In the original article [1], a real-data analysis was performed to showcase the utility of PRANA. This contains clinical and expression data for 406 samples and 28 COPD-related genes that were highlighted in a recent genome-wide association study [2]. The full data is available from the Gene Expression Database with accession number GSE158699 [3].
The main variable of our interest in this analysis is the current smoking status. We obtain the indices of subjects who are ‘current’ vs. ‘non-current smokers.’ These indices are used to dichotomize expression dataset into ‘current (Group B)’ and ‘non-current smokers (Group A).’ This is important as the we estimate the group-specific p × p association matrices.
# A gene expression data part of the downloaded data.
rnaseqdat <- combinedCOPDdat_RGO[ , 8:ncol(combinedCOPDdat_RGO)]
rnaseqdat <- as.data.frame(apply(rnaseqdat, 2, as.numeric))
# A clinical data with additional covariates sorted by current smoking groups:
# The first column is ID, so do not include.
phenodat <- combinedCOPDdat_RGO[order(combinedCOPDdat_RGO$currentsmoking), 2:7]
# Indices of non-current smoker (namely Group A)
index_grpA <- which(combinedCOPDdat_RGO$currentsmoking == 0)
# Indices of current smoker (namely Group B)
index_grpB <- which(combinedCOPDdat_RGO$currentsmoking == 1)
Once the data processing is done, we are off to apply the PRANA. In
PRANA
function, please make sure you specify the expression
and clinical data separately. Additionally, you will need to provide the
indices for each group of a binary indicator variable that is deemed as
a main predictor variable.
In this package, there are some additional R functions that can help you locating the name of genes that are significantly differentially connected (DC) and names and adjusted p-values via the empirical Bayes adjustment [4] for all variables and for a specific variable.
# This is useful when we want to create a table with adjusted p-values only.
adjpval(PRANAres)
#> currentsmoking packyrs age gender race FEV1perc
#> 10370 0.006 1.000 1.000 1 1.000 1
#> 10420 0.000 1.000 1.000 1 1.000 1
#> 1306 0.662 0.986 1.000 1 1.000 1
#> 155185 0.000 1.000 1.000 1 1.000 1
#> 158158 0.866 1.000 1.000 1 1.000 1
#> 1653 0.006 1.000 1.000 1 1.000 1
#> 1762 0.000 1.000 1.000 1 1.000 1
#> 23389 0.028 1.000 0.996 1 1.000 1
#> 253461 0.000 1.000 1.000 1 1.000 1
#> 26112 0.796 0.854 1.000 1 0.998 1
#> 27436 0.000 1.000 1.000 1 1.000 1
#> 3308 0.004 1.000 1.000 1 1.000 1
#> 3696 0.000 1.000 1.000 1 1.000 1
#> 374739 0.000 1.000 1.000 1 0.634 1
#> 3842 0.004 1.000 1.000 1 1.000 1
#> 406 0.000 1.000 1.000 1 1.000 1
#> 56986 0.000 1.000 1.000 1 1.000 1
#> 57188 0.000 1.000 1.000 1 1.000 1
#> 6239 0.464 1.000 1.000 1 1.000 1
#> 7067 0.000 0.996 1.000 1 0.326 1
#> 7871 0.000 1.000 0.994 1 1.000 1
#> 79961 0.006 1.000 1.000 1 1.000 1
#> 79991 0.000 1.000 1.000 1 1.000 1
#> 8224 0.000 1.000 1.000 1 1.000 1
#> 8853 0.000 1.000 1.000 1 1.000 1
#> 8870 0.000 1.000 1.000 1 1.000 1
#> 9258 0.000 1.000 1.000 1 0.992 1
#> 9686 0.866 0.974 1.000 1 0.998 1
Back to the description of the original article by Ahn et al., the main interest is in the
current smoking status to declare whether a gene is DC between current
and non-current smokers at the user-specified significance level. Thus,
we subset the vector of p-values for current smoking status from the
adjpvaltab
object, aka an object with adjusted p-values and
gene names using adjpval
function above.
# Create an object to keep the table with adjusted p-values using adjpval() function.
adjptab <- adjpval(PRANAres)
adjpval_specific_var
is a handy function to retrieve
adjusted p-values with gene names for a single variable of interest.
# NOTE: Please do NOT forget to provide a name of variable with the quotation marks!
adjpval_specific_var(adjptab = adjptab, varname = "currentsmoking")
#> currentsmoking
#> 10370 0.006
#> 10420 0.000
#> 1306 0.662
#> 155185 0.000
#> 158158 0.866
#> 1653 0.006
#> 1762 0.000
#> 23389 0.028
#> 253461 0.000
#> 26112 0.796
#> 27436 0.000
#> 3308 0.004
#> 3696 0.000
#> 374739 0.000
#> 3842 0.004
#> 406 0.000
#> 56986 0.000
#> 57188 0.000
#> 6239 0.464
#> 7067 0.000
#> 7871 0.000
#> 79961 0.006
#> 79991 0.000
#> 8224 0.000
#> 8853 0.000
#> 8870 0.000
#> 9258 0.000
#> 9686 0.866
If you would like to quickly know which genes are significantly DC
for your main binary grouping variable, please use
sigDCGtab
to proceed. This will return a table with
adjusted p-values and gene names.
# NOTE: Please do NOT forget to provide a name of variable with the quotation marks!
sigDCGtab <- sigDCGtab(adjptab = adjptab, groupvar = "currentsmoking", alpha = 0.05)
#> Warning: Using one column matrices in `filter()` was deprecated in dplyr 1.1.0.
#> ℹ Please use one dimensional logical vectors instead.
#> ℹ The deprecated feature was likely used in the PRANA package.
#> Please report the issue to the authors.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
sigDCGtab
#> currentsmoking
#> 10370 0.006
#> 10420 0.000
#> 155185 0.000
#> 1653 0.006
#> 1762 0.000
#> 23389 0.028
#> 253461 0.000
#> 27436 0.000
#> 3308 0.004
#> 3696 0.000
#> 374739 0.000
#> 3842 0.004
#> 406 0.000
#> 56986 0.000
#> 57188 0.000
#> 7067 0.000
#> 7871 0.000
#> 79961 0.006
#> 79991 0.000
#> 8224 0.000
#> 8853 0.000
#> 8870 0.000
#> 9258 0.000
Lastly, a function below, called sigDCGnames
, will be
useful when you just want to return the names of the significantly DC
genes from PRANA at the 0.05 significance level.
# NOTE: Please do NOT forget to provide a name of variable with the quotation marks!
sigDCGnames <- sigDCGnames(adjptab = adjptab, groupvar = "currentsmoking", alpha = 0.05)
sigDCGnames
#> [1] "10370" "10420" "155185" "1653" "1762" "23389" "253461" "27436"
#> [9] "3308" "3696" "374739" "3842" "406" "56986" "57188" "7067"
#> [17] "7871" "79961" "79991" "8224" "8853" "8870" "9258"
As an additional step, we can use rename_genes
function
to rename Entrez gene IDs into gene symbols. However, it is currently
not available since dnapath
package is currently archived.
A user may manually install the archived version of dnapath
from the CRAN, and follow the R code below.
[1] Ahn S, Grimes T, Datta S. (2023). A pseudo-value regression approach for differential network analysis of co-expression data. BMC Bioinformatics. 24(1), 8.
[2] Sakornsakolpat P, Prokopenko D, Lamontagne M, and et al. (2019). Genetic landscape of chronic obstructive pulmonary disease identifies heterogeneous cell-type and phenotype associations. Nature Genetics, 51(3), 494–505.
[3] Wang Z, Masoomi A, Xu Z, and et al. (2021). Improved prediction of smoking status via isoform-aware RNA-seq deep learning models. PLoS Computational Biology, 17(10), e1009433.
[4] Datta S, Datta S. (2005). Empirical Bayes screening of many p-values with applications to microarray studies. Bioinformatics, 21(9), 1987–1994.