Using crosstalkr for drug target identification

Here, We present two possible bioinformatic pipelines for drug target identification. Both rely on the integration of RNA or protein expression data from a given model system with interactomic data from protein-protein interaction networks. In the first, we execute a two-step pipeline, where the PPI network is filtered based on RNA expression data. We then rank the nodes again based on betweenness centrality to identify the most critical hub genes of the network. In the second, we use in-silico repression to rank the nodes rather than a simple node scoring method after a filtration step.

library(crosstalkr)
library(dplyr)
#library(dnet)

The Data

For this example, we will rely on previously published gene expression data from a Ewing Sarcoma cell line (A673). (described in more detail here). Ewing Sarcoma is a rare bone malignancy primarily seen in pediatric and adolescent patients. We will first isolate the sample of interest and log2 normalize the gene expression values. Then we will format the expression values into a named numeric vector, the preferred format for interacting with the crosstalkr API. In the code below, we load a pre-computed PPI to save time but you can download directly from stringdb or Biogrid using the load_ppi method.

#Load PPI we will be using
load(url("https://github.com/DavisWeaver/disruptr/blob/main/inst/test_data/stringdb.Rda?raw=true"))
g_ppi <- g

#To download directly from stringdb - see load_ppi()
df <- df %>% select(1, '5_A673_DMSO') %>% 
  rename(expression = '5_A673_DMSO')
colnames(df)[1] <- 'gene_name'
df <- df %>% 
  filter(expression !=0) %>%
  mutate(expression = log2(expression)) %>% 
  filter(!is.na(expression))

exp <- df$expression
names(exp) <- df$gene_name
print(head(df))
#> # A tibble: 6 × 2
#>   gene_name expression
#>   <chr>          <dbl>
#> 1 A1BG            2.92
#> 2 A1BG-AS1        2.88
#> 3 A1CF           -1.28
#> 4 A2M             2.21
#> 5 A2ML1           2.97
#> 6 A4GALT          2.84

Pipeline One

Graph Filtering

First, we will use the gene expression values to reduce the size of the PPI network we need to analyze.

Under the hood, the following steps will happen:

  1. Download and modify PPI according to user-provided parameters (passed to load_ppi)
  2. Take the maximum or minimum 1000 genes according to the user provided named numeric vector (passed to the val parameter)
  3. Call the igraph::induced_subgraph method to create a new igraph object that contains only the genes from step 2.
  4. Add gene expression to the new graph object as an attribute to allow further manipulation.
#start by getting the PPI since we don't want to download it twice
g <- gfilter(method = "value", g=g_ppi, cache = NULL, val=exp, val_name = "expression", 
             use_ppi = FALSE, desc = TRUE,n=100)
length(igraph::V(g))
#> [1] 100
head(igraph::get.vertex.attribute(g, name = "expression"))
#> [1] 3.846040 3.961776 3.980389 3.843208 3.982157 3.955569

Node Scoring

Next, we’ll use gfilter again to score this smaller subgraph according to the betweenness centrality and return only the top 5 proteins. It is worth noting that there are a number of potential measures of centrality that could be used here, as well as many other potentially useful node scoring methods provided in the igraph package.

g <- gfilter(g=g, igraph_method = "betweenness", n = 5, desc= TRUE, use_ppi=FALSE, val_name = "betweenness")
igraph::V(g)
#> + 5/5 vertices, named, from 82a9a8d:
#> [1] GAPDH    HSP90AA1 EEF1A1   HNRNPC   TPT1

As you can see, our not very thoughtful attempt at identifying drug targets in Ewing Sarcoma has turned up GAPDH, HSP90AA1, EEF1A1, HNRNPC, and TPT1 as the most likely candidates.

GAPDH is a glycolytic enzyme that has also been assigned a number of other functions, and is highly expressed in normal bone marrow. HSP90AA1 encodes for a heat shock protein and has been implicated in drug resistance development in cancer. EEF1A1 is involved in protein translation. HNRNPC is a ubiquitously expressed RNA binding protein. TPT1 is a regulator of cellular proliferation, and is causally implicated in cancer development.

Of our 5 candidate proteins, 4 are either ubiquitously expressed housekeeping genes or proteins involved in the generic cellular stress response. Surprisingly, our very simple pipeline (applied to just a single sample) identified a potentially promising candidate in TPT1. TPT1 was even recently associated with Ewing Sarcoma drug resistance development in the literature.

Disease-Specific Subnetwork Identification

If we had a compelling reason to think these 5 genes were key regulators in Ewing Sarcoma, we could use the compute_crosstalk function to identify a Ewing Sarcoma specific subnetwork from the full protein-protein interaction network.

out <- gfilter.ct(seeds = names(igraph::V(g)), g = g_ppi, use_ppi = FALSE, 
                  return_df=TRUE, cache = NULL, seed_name = "vignette", n = 100, 
                  agg_int= 10, ncores = 1)

plot_ct(out[[2]], out[[1]], prop_keep=0.2)

cdf <- out[[2]] 
cdf %>% 
  select(-p_value, -nobs, -mean_p, -var_p) %>% 
  slice_max(affinity_score, n=10) %>%
  knitr::kable(digits=3)
node seed affinity_score Z adj_p_value
GAPDH yes 0.122 695.818 0
EEF1A1 yes 0.122 643.376 0
HSP90AA1 yes 0.121 1053.037 0
TPT1 yes 0.121 539.553 0
HNRNPC yes 0.121 442.876 0
NUDCD2 no 0.049 1053.037 0
TSSK6 no 0.049 1053.037 0
RALYL no 0.048 442.876 0
POTEF no 0.025 535.537 0
TOMM34 no 0.024 485.623 0

We identified 341 proteins in the subnetwork defined by GAPDH, HSP90AA1, EEF1A1, HNRNPC, and TPT1. Next, we will re-analyze this data using a different combination of analytic steps to illustrate more features of crosstalkr.

Pipeline Two

Graph Reduction

In this pipeline, we will reduce the graph two times, first using degree (number of neighbors for a given node), and then using gene expression. The resulting graph has 500 nodes, all with both high connectivity in the PPI network and high expression in our Ewing Sarcoma cell line. There are 9624 edges (interactions) between these 500 nodes (proteins).

g <- gfilter(g = g_ppi, use_ppi=FALSE, cache = NULL, n = 2000, igraph_method = igraph::degree, val_name = "degree")
g <- gfilter(g=g, use_ppi = FALSE, n = 500, method = "val", val_name = "expression", val = exp)
igraph::gsize(g)
#> [1] 9624

Node Scoring

Next, we will demonstrate a method for node scoring that uses in-silico repression. In our context, in-silico repression refers to a 6-step process:

  1. Calculation of a node score s for all nodes in the graph using some state function.
  2. Calculation of the total network state $S =\sum\limits_is_i$
  3. Removal of a given node v from the graph.
  4. Re-calculation of node score sv for all nodes. (In practice, we only re-calculate the node score of those nodes affected by the removal of node v)
  5. Re-calculation of total network state $S_v = \sum\limits_is_{vi}$
  6. Calculation of ΔSv = S − Sv where ΔSv is used to score and rank nodes.

For this example, we will use network potential as the state function (described here)
The resulting data structure is sparse matrix that shows the effect of removing the node in a given column on the state function value for the node represented in a given row. The global effect of removing a given node is then the column sum.

dnp <- node_repression(g=g, v_rm = names(igraph::V(g)), exp = exp)
dnp <- Matrix::colSums(dnp)
dnp <- sort(dnp, decreasing = TRUE)
names(dnp)[1:5]
#> [1] "HSP90AA1" "RPS27A"   "UBA52"    "CDK1"     "CCNB1"

Using this modified pipeline, we identified HSP90AA1, RPS27A, UBA52, CDK1, and CCNB1 as the most likely targets for therapy in this Ewing Sarcoma cell line. HSP90AA1 was also identified in the first instance. RPS27A is a ribosomal protein that is involved in RNA binding. UBA52 is a paralog of RPS27A and appears to be an unlikely candidate for cancer therapy. CDK1 is a cyclin dependent kinase, involved in cellular proliferation and causally associated with cancer in the literature. CCNB1 is a cyclin, a cellular regulator of mitosis. As with the method above, our bioinformatic pipeline identified a number of ubiquitiously expressed housekeeping genes as well as one or two proteins that are likely to be cancer-specific.

It is possible to modify these types of analysis to reduce the likelihood of finding heavily connected proteins that are involved in the generic stress response. For example, we could compute a null distribution for our state function of interest by generating hundreds or thousands of rewired graphs with preserved degree and then re-calculating our metric of interest. We support this capability for the network potential example (see compute_null_dnp). Supporting this feature for additional state functions is an area of development.