Here, We present two possible bioinformatic pipelines for drug target identification. Both rely on the integration of RNA or protein expression data from a given model system with interactomic data from protein-protein interaction networks. In the first, we execute a two-step pipeline, where the PPI network is filtered based on RNA expression data. We then rank the nodes again based on betweenness centrality to identify the most critical hub genes of the network. In the second, we use in-silico repression to rank the nodes rather than a simple node scoring method after a filtration step.
For this example, we will rely on previously published gene
expression data from a Ewing Sarcoma cell line (A673). (described in
more detail here).
Ewing Sarcoma is a rare bone malignancy primarily seen in pediatric and
adolescent patients. We will first isolate the sample of interest and
log2 normalize the gene expression values. Then we will format the
expression values into a named numeric vector, the preferred format for
interacting with the crosstalkr API. In the code below, we load a
pre-computed PPI to save time but you can download directly from
stringdb or Biogrid using the load_ppi
method.
#Load PPI we will be using
load(url("https://github.com/DavisWeaver/disruptr/blob/main/inst/test_data/stringdb.Rda?raw=true"))
g_ppi <- g
#To download directly from stringdb - see load_ppi()
df <- df %>% select(1, '5_A673_DMSO') %>%
rename(expression = '5_A673_DMSO')
colnames(df)[1] <- 'gene_name'
df <- df %>%
filter(expression !=0) %>%
mutate(expression = log2(expression)) %>%
filter(!is.na(expression))
exp <- df$expression
names(exp) <- df$gene_name
print(head(df))
#> # A tibble: 6 × 2
#> gene_name expression
#> <chr> <dbl>
#> 1 A1BG 2.92
#> 2 A1BG-AS1 2.88
#> 3 A1CF -1.28
#> 4 A2M 2.21
#> 5 A2ML1 2.97
#> 6 A4GALT 2.84
First, we will use the gene expression values to reduce the size of the PPI network we need to analyze.
Under the hood, the following steps will happen:
load_ppi
)val
parameter)igraph::induced_subgraph
method to create a
new igraph object that contains only the genes from step 2.#start by getting the PPI since we don't want to download it twice
g <- gfilter(method = "value", g=g_ppi, cache = NULL, val=exp, val_name = "expression",
use_ppi = FALSE, desc = TRUE,n=100)
length(igraph::V(g))
#> [1] 100
head(igraph::get.vertex.attribute(g, name = "expression"))
#> [1] 3.846040 3.961776 3.980389 3.843208 3.982157 3.955569
Next, we’ll use gfilter again to score this smaller subgraph
according to the betweenness centrality and return only the top 5
proteins. It is worth noting that there are a number of potential
measures of centrality that could be used here, as well as many other
potentially useful node scoring methods provided in the
igraph
package.
g <- gfilter(g=g, igraph_method = "betweenness", n = 5, desc= TRUE, use_ppi=FALSE, val_name = "betweenness")
igraph::V(g)
#> + 5/5 vertices, named, from 82a9a8d:
#> [1] GAPDH HSP90AA1 EEF1A1 HNRNPC TPT1
As you can see, our not very thoughtful attempt at identifying drug targets in Ewing Sarcoma has turned up GAPDH, HSP90AA1, EEF1A1, HNRNPC, and TPT1 as the most likely candidates.
GAPDH is a glycolytic enzyme that has also been assigned a number of other functions, and is highly expressed in normal bone marrow. HSP90AA1 encodes for a heat shock protein and has been implicated in drug resistance development in cancer. EEF1A1 is involved in protein translation. HNRNPC is a ubiquitously expressed RNA binding protein. TPT1 is a regulator of cellular proliferation, and is causally implicated in cancer development.
Of our 5 candidate proteins, 4 are either ubiquitously expressed housekeeping genes or proteins involved in the generic cellular stress response. Surprisingly, our very simple pipeline (applied to just a single sample) identified a potentially promising candidate in TPT1. TPT1 was even recently associated with Ewing Sarcoma drug resistance development in the literature.
If we had a compelling reason to think these 5 genes were key
regulators in Ewing Sarcoma, we could use the
compute_crosstalk
function to identify a Ewing Sarcoma
specific subnetwork from the full protein-protein interaction
network.
out <- gfilter.ct(seeds = names(igraph::V(g)), g = g_ppi, use_ppi = FALSE,
return_df=TRUE, cache = NULL, seed_name = "vignette", n = 100,
agg_int= 10, ncores = 1)
plot_ct(out[[2]], out[[1]], prop_keep=0.2)
cdf <- out[[2]]
cdf %>%
select(-p_value, -nobs, -mean_p, -var_p) %>%
slice_max(affinity_score, n=10) %>%
knitr::kable(digits=3)
node | seed | affinity_score | Z | adj_p_value |
---|---|---|---|---|
GAPDH | yes | 0.122 | 695.818 | 0 |
EEF1A1 | yes | 0.122 | 643.376 | 0 |
HSP90AA1 | yes | 0.121 | 1053.037 | 0 |
TPT1 | yes | 0.121 | 539.553 | 0 |
HNRNPC | yes | 0.121 | 442.876 | 0 |
NUDCD2 | no | 0.049 | 1053.037 | 0 |
TSSK6 | no | 0.049 | 1053.037 | 0 |
RALYL | no | 0.048 | 442.876 | 0 |
POTEF | no | 0.025 | 535.537 | 0 |
TOMM34 | no | 0.024 | 485.623 | 0 |
We identified 341 proteins in the subnetwork defined by GAPDH, HSP90AA1, EEF1A1, HNRNPC, and TPT1. Next, we will re-analyze this data using a different combination of analytic steps to illustrate more features of crosstalkr.
In this pipeline, we will reduce the graph two times, first using degree (number of neighbors for a given node), and then using gene expression. The resulting graph has 500 nodes, all with both high connectivity in the PPI network and high expression in our Ewing Sarcoma cell line. There are 9624 edges (interactions) between these 500 nodes (proteins).
Next, we will demonstrate a method for node scoring that uses in-silico repression. In our context, in-silico repression refers to a 6-step process:
For this example, we will use network potential as the state function
(described here)
The resulting data structure is sparse matrix that shows the effect of
removing the node in a given column on the state function value for the
node represented in a given row. The global effect of removing a given
node is then the column sum.
dnp <- node_repression(g=g, v_rm = names(igraph::V(g)), exp = exp)
dnp <- Matrix::colSums(dnp)
dnp <- sort(dnp, decreasing = TRUE)
names(dnp)[1:5]
#> [1] "HSP90AA1" "RPS27A" "UBA52" "CDK1" "CCNB1"
Using this modified pipeline, we identified HSP90AA1, RPS27A, UBA52, CDK1, and CCNB1 as the most likely targets for therapy in this Ewing Sarcoma cell line. HSP90AA1 was also identified in the first instance. RPS27A is a ribosomal protein that is involved in RNA binding. UBA52 is a paralog of RPS27A and appears to be an unlikely candidate for cancer therapy. CDK1 is a cyclin dependent kinase, involved in cellular proliferation and causally associated with cancer in the literature. CCNB1 is a cyclin, a cellular regulator of mitosis. As with the method above, our bioinformatic pipeline identified a number of ubiquitiously expressed housekeeping genes as well as one or two proteins that are likely to be cancer-specific.
It is possible to modify these types of analysis to reduce the likelihood of finding heavily connected proteins that are involved in the generic stress response. For example, we could compute a null distribution for our state function of interest by generating hundreds or thousands of rewired graphs with preserved degree and then re-calculating our metric of interest. We support this capability for the network potential example (see compute_null_dnp). Supporting this feature for additional state functions is an area of development.