Title: | Analysis of Graph-Structured Data with a Focus on Protein-Protein Interaction Networks |
---|---|
Description: | Provides a general toolkit for drug target identification. We include functionality to reduce large graphs to subgraphs and prioritize nodes. In addition to being optimized for use with generic graphs, we also provides support to analyze protein-protein interactions networks from online repositories. For more details on core method, refer to Weaver et al. (2021) <https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008755>. |
Authors: | Davis Weaver [aut, cre] (0000-0003-3086-497X) |
Maintainer: | Davis Weaver <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.0.5 |
Built: | 2024-12-14 06:56:52 UTC |
Source: | CRAN |
attach expression values from user-provided expression vector to graph.
add_expression(exp, g)
add_expression(exp, g)
exp |
expression vector - assumed to be a named vector where the values are expression and the names are the gene name |
g |
igraph object - will be filtered so that only nodes found in both exp and g are kept |
subgraph of g containing only shared keys with exp and with expression attached.
Attach a generic user-provided value to graph
add_value(val, g, val_name = "value")
add_value(val, g, val_name = "value")
val |
named numeric vector where the names correspond to vertices in g |
g |
igraph object - will be filtered so that only nodes found in both exp and g are kept |
val_name |
str key for val |
subgraph of g containing only shared keys with val and val attached
Convert from most other representations of gene name to gene.symbol
as_gene_symbol(x, edb = NULL)
as_gene_symbol(x, edb = NULL)
x |
vector of ensemble.gene ids, ensemble.peptide ids, ensemble.transcript ids or entrez gene ids |
edb |
ensemble database object |
vector of gene symbols
#1) from numeric formatted entrez id as_gene_symbol(1956) #2) from character formatted entrez id as_gene_symbol("1956") #3) from ensemble gene id as_gene_symbol("ENSG00000146648") #4) From a vector of entrez ids as_gene_symbol(c("123", "1956", "2012"))
#1) from numeric formatted entrez id as_gene_symbol(1956) #2) from character formatted entrez id as_gene_symbol("1956") #3) from ensemble gene id as_gene_symbol("ENSG00000146648") #4) From a vector of entrez ids as_gene_symbol(c("123", "1956", "2012"))
This function will generate a bootstrapped null distribution to identify signficant vertices in a PPI given a set of user-defined seed proteins. Bootstrapping is done by performing random walk with repeats repeatedly over "random" sets of seed proteins. Degree distribution of user-provided seeds is used to inform sampling.
bootstrap_null( seed_proteins, g, n = 1000, agg_int = 100, gamma = 0.6, eps = 1e-10, tmax = 1000, norm = TRUE, set_seed = NULL, cache = NULL, seed_name = NULL, ncores = 1 )
bootstrap_null( seed_proteins, g, n = 1000, agg_int = 100, gamma = 0.6, eps = 1e-10, tmax = 1000, norm = TRUE, set_seed = NULL, cache = NULL, seed_name = NULL, ncores = 1 )
seed_proteins |
user defined seed proteins |
g |
igraph object |
n |
number of random walks with repeats to create null distribution |
agg_int |
number of runs before we need to aggregate the results - necessary to save memory. set at lower numbers to save even more memory. |
gamma |
restart probability |
eps |
maximum allowed difference between the computed probabilities at the steady state |
tmax |
the maximum number of iterations for the RWR |
norm |
if True, w is normalized by dividing each value by the column sum. |
set_seed |
integer to set random number seed - for reproducibility |
cache |
A filepath to a folder downloaded files should be stored |
seed_name |
Name to give the cached ngull distribution - must be a character string |
ncores |
Number of cores to use - defaults to 1. Significant speedup can be achieved by using multiple cores for computation. |
data frame containing mean/ standard deviation for null distribution
#g <- prep_biogrid() #bootstrap_null(seed_proteins = c("EGFR", "KRAS"), g= g, ncores = 1, n = 10)
#g <- prep_biogrid() #bootstrap_null(seed_proteins = c("EGFR", "KRAS"), g= g, ncores = 1, n = 10)
helper function to calculate dnp for one sample
calc_dnp_i(df, g, v_rm = NULL, keep_all = TRUE)
calc_dnp_i(df, g, v_rm = NULL, keep_all = TRUE)
df |
dataframe with one cell line + log expression |
g |
igraph object containing ppi info |
v_rm |
passed to |
keep_all |
logical flag denoting if we should keep genes that we didn't calculate dnp for |
same dataframe with dnp calculated for each gene.
calculate network potential for one node.
calc_np(c_i, c_j)
calc_np(c_i, c_j)
c_i |
expression for a given node. |
c_j |
vector of expressions for each neighbor of c_i |
function to calculate the network potential for each protein in a user-provided vector - cpp internal version
calc_np_all(exp, g, v = "default", neighbors = NULL)
calc_np_all(exp, g, v = "default", neighbors = NULL)
exp |
expression vector - assumed to be a named vector where the values are expression and the names are the gene name |
g |
igraph object - will be filtered so that only nodes found in both exp and g are kept |
v |
character vector of nodes over which to calculate network potential. |
neighbors |
named list containing the neighbors for each node of graph g. If not provided, it will be computed |
dataframe containing network potential for each of the inputed gene names.
Mostly just used to help debug the CPP version - not exported
calc_np_all_legacy( exp, g, v = as.character(names(igraph::V(g))), neighbors = NULL )
calc_np_all_legacy( exp, g, v = as.character(names(igraph::V(g))), neighbors = NULL )
exp |
expression vector - assumed to be a named vector where the values are expression and the names are the gene name |
g |
igraph object - will be filtered so that only nodes found in both exp and g are kept |
v |
character vector of nodes over which to calculate network potential. |
neighbors |
named list containing the neighbors for each node of graph g. If not provided, it will be computed |
dataframe containing network potential for each of the inputed gene names.
helper function to calculate np for one sample
calc_np_i(df, g)
calc_np_i(df, g)
df |
dataframe with one cell line + log expression |
g |
igraph object containing ppi info |
same dataframe with np calculated for each gene.
This function is a helper function for plot_ct
that verifies the input is a valid output of compute_crosstalk
check_crosstalk(crosstalk_df)
check_crosstalk(crosstalk_df)
crosstalk_df |
a dataframe containing the results of |
message if not correct object type, null otherwise
.combine function for compute_null foreach looping structure
combine_null(x)
combine_null(x)
x |
aggregated data structure |
data.frame
compute_crosstalk
returns a dataframe of proteins that are significantly
associated with user-defined seed proteins. These identified "crosstalkers"
can be combined with the user-defined seed proteins to identify functionally
relevant subnetworks. Affinity scores for every protein in the network are
calculated using a random-walk with repeats (sparseRWR
). Significance is
determined by comparing these affinity scores to a bootstrapped null distribution
(see bootstrap_null
). If using non-human PPI from string, refer to the stringdb documentation
for how to specify proteins
compute_crosstalk( seed_proteins, g = NULL, use_ppi = TRUE, ppi = "stringdb", species = "homo sapiens", n = 1000, union = FALSE, intersection = FALSE, gamma = 0.6, eps = 1e-10, tmax = 1000, norm = TRUE, set_seed, cache = NULL, min_score = 700, seed_name = NULL, ncores = 1, significance_level = 0.95, p_adjust = "bonferroni", agg_int = 100, return_g = FALSE )
compute_crosstalk( seed_proteins, g = NULL, use_ppi = TRUE, ppi = "stringdb", species = "homo sapiens", n = 1000, union = FALSE, intersection = FALSE, gamma = 0.6, eps = 1e-10, tmax = 1000, norm = TRUE, set_seed, cache = NULL, min_score = 700, seed_name = NULL, ncores = 1, significance_level = 0.95, p_adjust = "bonferroni", agg_int = 100, return_g = FALSE )
seed_proteins |
user defined seed proteins |
g |
igraph network object. |
use_ppi |
bool, should g be a protein-protein interaction network? If
false, user must provide an igraph object in |
ppi |
character string describing the ppi to use: currently only "stringdb" and "biogrid" are supported. |
species |
character string describing the species of interest.
For a list of supported species, see |
n |
number of random walks with repeats to create null distribution |
union |
bool, should we take the union of string db and biogrid to compute the PPI? Only applicable for the human PPI |
intersection |
bool, should we take the intersection of string db and biogrid to compute the PPI? Only applicable for the human PPI |
gamma |
restart probability |
eps |
maximum allowed difference between the computed probabilities at the steady state |
tmax |
the maximum number of iterations for the RWR |
norm |
if True, w is normalized by dividing each value by the column sum. |
set_seed |
integer to set random number seed - for reproducibility |
cache |
A filepath to a folder downloaded files should be stored |
min_score |
minimum connectivity score for each edge in the network. |
seed_name |
Name to give the cached ngull distribution - must be a character string |
ncores |
Number of cores to use - defaults to 1. Significant speedup can be achieved by using multiple cores for computation. |
significance_level |
user-defined signficance level for hypothesis testing |
p_adjust |
adjustment method to correct for multiple hypothesis testing:
defaults to "holm". see |
agg_int |
number of runs before we need to aggregate the results - necessary to save memory. set at lower numbers to save even more memory. |
return_g |
bool, should we return the graph used? mostly for internal use |
data frame containing affinity score, p-value, for all "crosstalkers" related to a given set of seeds
#1) easy to use for querying biological networks - n = 10000 is more appropriate for actual analyses #compute_crosstalk(c("EGFR", "KRAS"), n =10) #2) Also works for any other kind of graph- just specify g (must be igraph formatted as of now) g <- igraph::sample_gnp(n = 1000, p = 10/1000) compute_crosstalk(c(1,3,5,8,10), g = g, use_ppi = FALSE, n = 100)
#1) easy to use for querying biological networks - n = 10000 is more appropriate for actual analyses #compute_crosstalk(c("EGFR", "KRAS"), n =10) #2) Also works for any other kind of graph- just specify g (must be igraph formatted as of now) g <- igraph::sample_gnp(n = 1000, p = 10/1000) compute_crosstalk(c(1,3,5,8,10), g = g, use_ppi = FALSE, n = 100)
This function takes a tidy dataframe as input containing RNA sequencing data for one or more samples and conducts in-silico repression. Make sure to run with the same arguments for ppi and cache to maintain consistency for a given pipeline.
compute_dnp( cache = NULL, df, experiment_name, ppi, ncores = 1, min_score = NULL )
compute_dnp( cache = NULL, df, experiment_name, ppi, ncores = 1, min_score = NULL )
cache |
user-provided filepath for where to store data etc |
df |
dataframe output of compute_np |
experiment_name |
name of the experiment for saving output. |
ppi |
should we use biogrid or stringdb for the PPI |
ncores |
number of cores to use for calculations |
min_score |
if ppi is stringdb, which mininum score should we use to filter edges? |
data.frame
main function to compute np from a user-provided expression matrix.
compute_np( cache = NULL, experiment_name, ppi = "biogrid", min_score = NULL, exp_mat, mir_paper = TRUE, ncores = 1 )
compute_np( cache = NULL, experiment_name, ppi = "biogrid", min_score = NULL, exp_mat, mir_paper = TRUE, ncores = 1 )
cache |
user-provided filepath for where to store data etc |
experiment_name |
name of the experiment for saving output. |
ppi |
should we use biogrid or stringdb for the PPI |
min_score |
if ppi is stringdb, which mininum score should we use to filter edges? |
exp_mat |
expression matrix where columns are samples and rows are features |
mir_paper |
are we running this in the context of the mir paper? a few quirks of that data |
ncores |
number of cores to use for calculations |
tidy data frame with one column for expression and another for np
compute_null_dnp
calculates a null distribution for the change in network potential for
for each node in a cell signaling network.
compute_null_dnp( cache = NULL, df, ppi = "biogrid", n, n_genes = 50, experiment_name, ncores = 4, min_score = NULL )
compute_null_dnp( cache = NULL, df, ppi = "biogrid", n, n_genes = 50, experiment_name, ncores = 4, min_score = NULL )
cache |
user-provided filepath for where to store data etc |
df |
output of |
ppi |
should we use biogrid or stringdb for the PPI |
n |
number of permutations |
n_genes |
integer describing number of genes per sample that we will compute the null distribution for |
experiment_name |
name of the experiment for saving output. |
ncores |
number of cores to use for calculations |
min_score |
if ppi is stringdb, which mininum score should we use to filter edges? |
The input for this function will be the output of compute_dnp()
.
To compute the null distribution, the nodes in the provided cell signaling
network will be randomly permuted n
times, with dnp computed or each new
cell signaling network. The mean and standard error of dnp for this set of random
networks will constitute the null model that we will use for comparison.
Be warned that this operation is extremely expensive computationally. It is
recommended to either use a high-performance cluster or limit the computation of the
null distribution to a small number of nodes.
To distribute the workload over multiple cores, just specify ncores.
df, also saves to cache if specified
compute_dnp()
and compute_np()
compute_crosstalk
Useful if the user wants to carry out further analysis or design custom visualizations.
crosstalk_subgraph(crosstalk_df, g, seed_proteins, tg = TRUE)
crosstalk_subgraph(crosstalk_df, g, seed_proteins, tg = TRUE)
crosstalk_df |
a dataframe containing the results of |
g |
igraph network object. |
seed_proteins |
user defined seed proteins |
tg |
bool do we want to tidy the graph for plotting? |
a tidygraph structure containing information about the crosstalkr subgraph
## Not run: ct_df <- compute_crosstalk(c("EGFR", "KRAS")) g <- prep_biogrid() crosstalk_subgraph(ct_df, g = g, seed_proteins = c("EGFR", "KRAS")) ## End(Not run)
## Not run: ct_df <- compute_crosstalk(c("EGFR", "KRAS")) g <- prep_biogrid() crosstalk_subgraph(ct_df, g = g, seed_proteins = c("EGFR", "KRAS")) ## End(Not run)
crosstalkr provides a key user function, compute_crosstalk
as well
as several additional functions that assist in setup and visualization (under development).
compute_crosstalk
calculates affinity scores of all proteins in a network
relative to user-provided seed proteins. Users can use the human interactome or
provide a network represented as an igraph object.
sparseRWR
performs random walk with restarts on a sparse matrix.
Compared to dense matrix implementations, this should be extremely fast.
bootstrap_null
Generates a null distribution based on n calls to
sparseRWR
setup_init
manages download and storage of interactome data to
speed up future analysis
plot_ct
allows users to visualize the subnetwork identified in
compute_crosstalk
. This function relies on the ggraph framework.
Users are encouraged to use ggraph or other network visualization packages for
more customized figures.
crosstalk_subgraph
converts the output of compute_crosstalk
to a
tidygraph object containing only the identified nodes and their connections to the
user-provided seed_proteins. This function also adds degree, degree_rank, and
seed_label as attributes to the identified subgraph to assist in plotting.
Determine which format of gene is used to specify by user-defined seed proteins
detect_inputtype(x)
detect_inputtype(x)
x |
vector of gene symbols |
"gene_symbol", "entrez_id", "ensemble_id" or "other"
This function is called by the high-level function "bootstrap_null". Not expected to be used by end-users - we only export it so that environments inside foreach loops can find it.
dist_calc(df, seed_proteins)
dist_calc(df, seed_proteins)
df |
: numeric vector |
seed_proteins |
user defined seed proteins |
data.frame containing summary statistics for the computed null distribution
Determine if ensembl id is a Protein, gene, or transcript_id
ensembl_type(x)
ensembl_type(x)
x |
vector or single gene symbol |
character: "PROTEINID", "GENEID", "TRANSCRIPTID"
this is highly specific to the miR paper
experiment_breakout(df)
experiment_breakout(df)
df |
dataframe |
data.frame
Function to calculate the network potential for vertices v
fcalc_np_all(neighbors, vertices, v, exp)
fcalc_np_all(neighbors, vertices, v, exp)
neighbors |
list of neighbors for every node in the graph, type Rcpp::list |
vertices |
node list for graph, type Rcpp::StringVector |
v |
list of nodes for which we plan to calculate network potential |
exp |
named vector of expression for each node in vertices |
final .combine function to run in compute_null_dnp foreach looping structure
final_combine(x)
final_combine(x)
x |
aggregated info |
data.frame
This function is called by the high-level function "bootstrap_null".
final_dist_calc(df_list)
final_dist_calc(df_list)
df_list |
: list of dataframes from foreach loop in bootstrap_null |
data.frame
just a wrapper around igraph::neighbors()
for convenience
get_neighbors(gene, g)
get_neighbors(gene, g)
gene |
gene to grab neighbors from. |
g |
igraph object - will be filtered so that only nodes found in both exp and g are kept |
named numeric vector.
currently just a wrapper for igraph::rewire but may add more functionality in the future
get_random_graph(g)
get_random_graph(g)
g |
graph to be permuted |
igraph
Helper function for compute_null_dnp - returns the top n genes by dnp for each sample
get_topn(df, n_genes)
get_topn(df, n_genes)
df |
output of |
n_genes |
integer describing number of genes per sample that we will compute the null distribution for |
Generic function to filter either an igraph object or a PPI network
gfilter( method = NULL, g = NULL, val = NULL, use_ppi, igraph_method = NULL, n = 100, desc = TRUE, ... )
gfilter( method = NULL, g = NULL, val = NULL, use_ppi, igraph_method = NULL, n = 100, desc = TRUE, ... )
method |
str |
g |
igraph object |
val |
named numeric vector - some measure of node state (i.e. gene expression in the case of a PPI) |
use_ppi |
bool - should we use a ppi from online repository? |
igraph_method |
bool - is the user-provided method an igraph node scoring function? |
n |
int - number of nodes to include in the returned subgraph |
desc |
bool - do we want the top or bottom examples of the provided metric |
... |
additional params passed to |
igraph
gfilter.ct, gfilter.np, gfilter.igraph_method
Method to filter the graph based on parameters passed to compute_crosstalk
gfilter.ct(seeds, return_df = FALSE, ...)
gfilter.ct(seeds, return_df = FALSE, ...)
seeds |
vector (str or numeric) user provided vertex ids to use as seeds in the random walk with restarts' |
return_df |
bool should we return a list containing the filtered graph + the RWR output that was used to do the filtering? |
... |
additional arguments passed to |
igraph object
Method to filter graph based on any igraph method that scores verticies.
gfilter.igraph_method(g, use_ppi = TRUE, method, n = 500, desc, val_name, ...)
gfilter.igraph_method(g, use_ppi = TRUE, method, n = 500, desc, val_name, ...)
g |
igraph object |
use_ppi |
bool - should we use a ppi from online repository? |
method |
str |
n |
int - number of nodes to include in the returned subgraph |
desc |
bool - do we want the top or bottom examples of the provided metric |
val_name |
str |
... |
additional parameters passed to load_ppi |
igraph
convenience function - it just calls gfilter.value after computing np
gfilter.np(g, val, use_ppi = TRUE, n = 500, desc, ...)
gfilter.np(g, val, use_ppi = TRUE, n = 500, desc, ...)
g |
igraph object |
val |
named numeric vector - some measure of node state (i.e. gene expression in the case of a PPI) |
use_ppi |
bool - should we use a ppi from online repository? |
n |
int - number of nodes to include in the returned subgraph |
desc |
bool - do we want the top or bottom examples of the provided metric |
... |
additional params passed to |
For more information on network potential, see related paper
igraph
Method to filter graph based on user provided value
gfilter.value(g, val, use_ppi = TRUE, n = 500, val_name = "value", desc, ...)
gfilter.value(g, val, use_ppi = TRUE, n = 500, val_name = "value", desc, ...)
g |
igraph object |
val |
named numeric vector - some measure of node state (i.e. gene expression in the case of a PPI) |
use_ppi |
bool - should we use a ppi from online repository? |
n |
int - number of nodes to include in the returned subgraph |
val_name |
str |
desc |
bool - do we want the top or bottom examples of the provided metric |
... |
additional params passed to |
igraph
Determine if a character vector contains ensembl gene_ids
is_ensembl(x)
is_ensembl(x)
x |
vector or single gene symbol |
logical
Determine if a character vector contains entrez gene_ids
is_entrez(x)
is_entrez(x)
x |
vector or single gene symbol |
logical
Helper function to load requested PPI w/ parameters
load_ppi( cache = NULL, union = FALSE, intersection = FALSE, species = "9606", min_score = 0, ppi = "stringdb" )
load_ppi( cache = NULL, union = FALSE, intersection = FALSE, species = "9606", min_score = 0, ppi = "stringdb" )
cache |
A filepath to a folder downloaded files should be stored |
union |
bool |
intersection |
bool |
species |
species code either using latin species name or taxon id |
min_score |
minimum connectivity score for each edge in the network. |
ppi |
str |
igraph object
This function will generate n character vectors of seeds to be passed to sparseRWR as part of the construction of a boostrapped null distribution for significance testing.
match_seeds(g, seed_proteins, n, set_seed = NULL)
match_seeds(g, seed_proteins, n, set_seed = NULL)
g |
igraph object representing the network under study. specified by "ppi" in bootstrap_null |
seed_proteins |
user defined seed proteins |
n |
number of random walks with repeats to create null distribution |
set_seed |
integer to set random number seed - for reproducibility |
list of character vectors: randomly generated seed proteins with a similar degree distribution to parent seed proteins
this function is still under development.
node_repression( g, v_rm = as.character(names(igraph::V(g))), exp, state_function = calc_np_all, neighbors_only = TRUE, ... )
node_repression( g, v_rm = as.character(names(igraph::V(g))), exp, state_function = calc_np_all, neighbors_only = TRUE, ... )
g |
igraph network object |
v_rm |
index of vertices to remove |
exp |
expression vector for nodes in graph g |
state_function |
function to use to calculate network state before and after node_repression |
neighbors_only |
logical designating whether state function should be calculated for all nodes or just neighbors |
... |
additional parameters passed to state function. |
Function to normalize adjacency matrix by dividing each value by the colsum.
norm_colsum(w)
norm_colsum(w)
w |
The adjacency matrix of a given graph in sparse format - dgCMatrix |
input matrix, normalized by column sums
# 1) Normalize by column sum on a simple matrix v1 = (c(1,1,1,0)) v2 = c(0,0,0,1) v3 = c(1,1,1,0) v4 = c(0,0,0,1) w = matrix(data = c(v1,v2,v3,v4), ncol = 4, nrow = 4) norm_colsum(w)
# 1) Normalize by column sum on a simple matrix v1 = (c(1,1,1,0)) v2 = c(0,0,0,1) v3 = c(1,1,1,0) v4 = c(0,0,0,1) w = matrix(data = c(v1,v2,v3,v4), ncol = 4, nrow = 4) norm_colsum(w)
Convenience function for plotting crosstalkers - if you want to make more
customized/dynamic figures, there are lots of packages that can facilitate that,
including: visnetwork
, ggraph
, and even the base R plotting library
plot_ct(crosstalk_df, g, label_prop = 0.1, prop_keep = 0.4)
plot_ct(crosstalk_df, g, label_prop = 0.1, prop_keep = 0.4)
crosstalk_df |
a dataframe containing the results of |
g |
igraph network object. |
label_prop |
Proportion of nodes to label - based on degree |
prop_keep |
How many proteins do we want to keep in the visualization (as a proportion of total) - subsets on top x proteins ranked by affinity score |
NULL, draws the identified subgraph to device\
## Not run: ct_df <- compute_crosstalk(c("EGFR", "KRAS")) g <- prep_biogrid() plot_ct(ct_df, g = g) ## End(Not run)
## Not run: ct_df <- compute_crosstalk(c("EGFR", "KRAS")) g <- prep_biogrid() plot_ct(ct_df, g = g) ## End(Not run)
Function to allow users to choose the intersection of stringdb and biogrid Only works with the human PPI. min_score parameter only applies to strindb
ppi_intersection(cache = NULL, min_score = 800, edb = "default")
ppi_intersection(cache = NULL, min_score = 800, edb = "default")
cache |
A filepath to a folder downloaded files should be stored |
min_score |
minimum connectivity score for each edge in the network. |
edb |
ensemble database object |
igraph object corresponding to PPI following intersection
Function to allow users to choose the union of stringdb and biogrid Only works with the human PPI. min_score parameter only applies to strindb
ppi_union(cache = NULL, min_score = 0, edb = "default")
ppi_union(cache = NULL, min_score = 0, edb = "default")
cache |
A filepath to a folder downloaded files should be stored |
min_score |
minimum connectivity score for each edge in the network. |
edb |
ensemble database object |
igraph object corresponding to PPI following union
Prepare biogrid for use in analyses
prep_biogrid(cache = NULL)
prep_biogrid(cache = NULL)
cache |
A filepath to a folder downloaded files should be stored |
igraph object built from the adjacency matrix downloaded from thebiogrid.org.
Basically a wrapper around the get_graph method from the stringdb package
prep_stringdb( cache = NULL, edb = "default", min_score = 200, version = "11.5", species = "homo sapiens" )
prep_stringdb( cache = NULL, edb = "default", min_score = 200, version = "11.5", species = "homo sapiens" )
cache |
A filepath to a folder downloaded files should be stored |
edb |
ensemble database object |
min_score |
minimum connectivity score for each edge in the network. |
version |
stringdb version |
species |
species code either using latin species name or taxon id |
igraph object built from the adjacency matrix downloaded from stringdb.
This function borrows heavily from the RWR function in the RANKS package (cite here)
sparseRWR(seed_proteins, w, gamma = 0.6, eps = 1e-10, tmax = 1000, norm = TRUE)
sparseRWR(seed_proteins, w, gamma = 0.6, eps = 1e-10, tmax = 1000, norm = TRUE)
seed_proteins |
user defined seed proteins |
w |
The adjacency matrix of a given graph in sparse format - dgCMatrix |
gamma |
restart probability |
eps |
maximum allowed difference between the computed probabilities at the steady state |
tmax |
the maximum number of iterations for the RWR |
norm |
if True, w is normalized by dividing each value by the column sum. |
numeric vector, affinity scores for all nodes in graph relative to provided seeds
# 1) Run Random walk with restarts on a simple matrix v1 = (c(1,1,1,0)) v2 = c(0,0,0,1) v3 = c(1,1,1,0) v4 = c(0,0,0,1) w = matrix(data = c(v1,v2,v3,v4), ncol = 4, nrow = 4) sparseRWR(seed_proteins = c(1,3), w = w, norm = TRUE) # 2) Works just as well on a sparse matrix v1 = (c(1,1,1,0)) v2 = c(0,0,0,1) v3 = c(1,1,1,0) v4 = c(0,0,0,1) w = matrix(data = c(v1,v2,v3,v4), ncol = 4, nrow = 4) w = Matrix::Matrix(w, sparse = TRUE) sparseRWR(seed_proteins = c(1,4), w = w, norm = TRUE) #3) Sample workflow for use with human protein-protein interaction network #g <- prep_biogrid() #w <- igraph::as_adjacency_matrix(g) #sparseRWR(seed_proteins = c("EGFR", "KRAS"), w = w, norm = TRUE)
# 1) Run Random walk with restarts on a simple matrix v1 = (c(1,1,1,0)) v2 = c(0,0,0,1) v3 = c(1,1,1,0) v4 = c(0,0,0,1) w = matrix(data = c(v1,v2,v3,v4), ncol = 4, nrow = 4) sparseRWR(seed_proteins = c(1,3), w = w, norm = TRUE) # 2) Works just as well on a sparse matrix v1 = (c(1,1,1,0)) v2 = c(0,0,0,1) v3 = c(1,1,1,0) v4 = c(0,0,0,1) w = matrix(data = c(v1,v2,v3,v4), ncol = 4, nrow = 4) w = Matrix::Matrix(w, sparse = TRUE) sparseRWR(seed_proteins = c(1,4), w = w, norm = TRUE) #3) Sample workflow for use with human protein-protein interaction network #g <- prep_biogrid() #w <- igraph::as_adjacency_matrix(g) #sparseRWR(seed_proteins = c("EGFR", "KRAS"), w = w, norm = TRUE)
returns a dataframe with information on supported species
supported_species()
supported_species()
dataframe
helper function to convert expression matrix to tidy dataframe (if not already)
tidy_expression(df)
tidy_expression(df)
df |
dataframe |
data.frame
helper to convert user-inputs to ncbi reference taxonomy.
to_taxon_id(species)
to_taxon_id(species)
species |
user-inputted species |
string corresponding to taxon id