Title: | Computing Graph Structures on WikiPathways |
---|---|
Description: | Converts pathways from 'WikiPathways' GPML format into 'igraph' objects. Includes tools to find all cycles in the resulting graphs and determine which ones involve negative feedback (inhibition). |
Authors: | Kevin R. Coombes [aut, cre], Polina Bombina [aut] |
Maintainer: | Kevin R. Coombes <[email protected]> |
License: | Artistic-2.0 |
Version: | 0.1.2 |
Built: | 2024-11-07 06:35:00 UTC |
Source: | CRAN |
Tools to find and interpret cycles in graphs derived from pathways in WikiPathways.
findCycles(graph)
interpretCycle(v, graph)
cycleSubgraph(graph, cycles)
graph |
An igraph object. |
v |
One of the cycles produced by the findCycles function. |
cycles |
The list of cycles produced by the findCycles function. |
The implementation of the findCycles function, while provably correct, makes no concessions to efficiency. It is likely to be adequate for the vast majority of biological pathways present in WikiPathways, but may prove incapable of handling large, complex graphs in general. The algorithm is a straightforward double loop. It first iterates over all nodes and uses the neighbors function from igraph to find all nodes that are directly connected to the start node. It then iterates over those neighbors and uses the all_simple_paths function from igraph to find all paths that lead back to the starting node. One aspect of its lack of efficiency is that each cycle with N nodes is found N times, once for each place you could start traversing the cycle. Before returning the value to the user, it chooses a unique representative for each such cycle by finding the "earliest" node, based on its index, to start from.
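The double loop described above can be sketched roughly as follows. This is a hypothetical re-implementation for illustration, not the WayFindR source: the name findCyclesSketch and the toy graph are assumptions, edges are assumed to be directed, and self-loops are ignored.

```r
library(igraph)

# Hypothetical sketch of the double loop; not the actual findCycles code.
findCyclesSketch <- function(graph) {
  found <- list()
  for (start in seq_len(vcount(graph))) {
    # Outer loop: nodes reached from the start node by one outgoing edge.
    # setdiff() also drops the start node itself, so self-loops are ignored.
    nbrs <- setdiff(as.integer(neighbors(graph, start, mode = "out")), start)
    for (nb in nbrs) {
      # Inner loop: every simple path from the neighbor back to the start
      # node closes one cycle.
      paths <- all_simple_paths(graph, from = nb, to = start, mode = "out")
      for (p in paths) {
        ids <- as.integer(p)          # nb, ..., start
        found[[length(found) + 1]] <- c(start, ids[-length(ids)])
      }
    }
  }
  # Each cycle was recorded once per starting node; keep only the copy
  # that begins at its lowest vertex index.
  keep <- vapply(found, function(cyc) cyc[1] == min(cyc), logical(1))
  unique(lapply(found[keep], function(cyc) V(graph)$name[cyc]))
}

# Toy directed graph with two cycles: A -> B -> C -> A and D -> E -> D.
toyEdges <- data.frame(from = c("A", "B", "C", "C", "D", "E"),
                       to   = c("B", "C", "A", "D", "E", "D"))
toyNodes <- data.frame(name = c("A", "B", "C", "D", "E"))
toyGraph <- graph_from_data_frame(toyEdges, directed = TRUE, vertices = toyNodes)
toyCycles <- findCyclesSketch(toyGraph)
```

Filtering on the lowest vertex index is what removes the N duplicate copies of each N-node cycle.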
The findCycles function returns a list. Each element of the list is a cycle, represented by a named character vector specifying the nodes in the order that they can be traversed.
The interpretCycle function returns a matrix with two columns, genes and arrows. Each row contains the gene name (or label) of a node and the form of the interaction arrow relating it to the next node in the cycle. It is likely to be used most often inside a call to lapply in order to interpret all cycles at once.
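To make the shape of that return value concrete, here is a minimal sketch on a toy graph. It is not the package's implementation: the function interpretSketch, the vertex attribute label, the edge attribute MIM, and the interaction names are all assumptions chosen for illustration, and a cycle is assumed to be a character vector of node identifiers.

```r
library(igraph)

# Toy pathway; attribute names `label` and `MIM` are assumed, not confirmed.
toyNodes <- data.frame(name  = c("n1", "n2", "n3"),
                       label = c("TP53", "MDM2", "CDKN1A"),
                       stringsAsFactors = FALSE)
toyEdges <- data.frame(from = c("n1", "n2", "n1"),
                       to   = c("n2", "n1", "n3"),
                       MIM  = c("mim-stimulation", "mim-inhibition",
                                "mim-stimulation"),
                       stringsAsFactors = FALSE)
g <- graph_from_data_frame(toyEdges, directed = TRUE, vertices = toyNodes)

# Hypothetical stand-in for interpretCycle.
interpretSketch <- function(v, graph) {
  vd <- igraph::as_data_frame(graph, what = "vertices")
  ed <- igraph::as_data_frame(graph, what = "edges")
  nxt <- c(v[-1], v[1])                 # successor of each node, wrapping around
  # Locate the edge from each node to its successor in the cycle.
  idx <- match(paste(v, nxt), paste(ed$from, ed$to))
  cbind(genes = vd$label[match(v, vd$name)], arrows = ed$MIM[idx])
}

toyCycles <- list(c("n1", "n2"))
interpreted <- lapply(toyCycles, interpretSketch, graph = g)
```

With the real package, the analogous call would pass the list returned by findCycles as the first argument of lapply.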
The cycleSubgraph function returns an igraph object. This value represents the subgraph of the full graph defined by all nodes that are part of at least one cycle.
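One plausible way to build such a subgraph is the induced_subgraph function from igraph, applied to the union of all cycle nodes. Whether cycleSubgraph is implemented this way is an assumption; the toy graph and the hand-written cycle list below are illustrative only.

```r
library(igraph)

# Toy graph: cycles A -> B -> C -> A and D -> E -> D, plus a dangling node F.
toyEdges <- data.frame(from = c("A", "B", "C", "C", "D", "E", "E"),
                       to   = c("B", "C", "A", "D", "E", "D", "F"))
toyGraph <- graph_from_data_frame(toyEdges, directed = TRUE)

# Cycles written by hand, in the list-of-character-vectors form described above.
toyCycles <- list(c("A", "B", "C"), c("D", "E"))

# Union of all nodes that appear in at least one cycle.
inCycle <- unique(unname(unlist(toyCycles)))

# Subgraph induced by those nodes; edges between two cycle nodes are kept
# even when the edge itself lies on no cycle (here, C -> D).
toySub <- induced_subgraph(toyGraph, vids = inCycle)
```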
Kevin R. Coombes [email protected], Polina Bombina [email protected]
xmlfile <- system.file("pathways/WP3850.gpml", package = "WayFindR")
graf <- GPMLtoIgraph(xmlfile)
cyc <- findCycles(graf)
cyc
CS <- cycleSubgraph(graf, cyc)
plot(CS)
Our standard colors, line types (for edges), and shapes (for nodes) to display graph features from WikiPathways GPML files.
data(edgeColors)
data(edgeTypes)
data(nodeColors)
data(nodeShapes)
There are four different objects.
edgeColors
A named character vector defining the colors used to display 17 different kinds of edges.
edgeTypes
A named character vector defining the line types (solid, dashed, etc.) used to display 17 different kinds of edges.
nodeColors
A named character vector defining the colors used to display 16 different kinds of nodes.
nodeShapes
A named character vector defining the symbols used to display 16 different kinds of nodes.
We downloaded the complete set of 889 human (Homo sapiens) pathways from WikiPathways, in GPML format, at the end of March 2024. We wrote and ran Perl scripts to iterate over this set of pathways and extract all the node types and all the edge types used anywhere in the set. We added a handful of node types (Undefined, Shape, Label, EDGE) in order to enable almost all of the pathways to be converted into mathematical graphs. We then defined colors and shapes to be used to distinguish different types in plots.
The situation with edges was slightly more complicated. The pathway editor (PathVisio; https://pathvisio.org/) typically used to create pathways in WikiPathways supports two different biologically meaningful vocabularies for edges: Molecular Interaction Maps (MIM; https://discover.nci.nih.gov/mim/formal_mim_spec.pdf) and Systems Biology Graphical Notation (SBGN; https://sbgn.github.io/), along with a very simple description as "Arrows" or "TBars". To make things more consistent, we converted both the SBGN and simple systems to match the MIM specification. We then defined colors and edge types so we could distinguish different edge types in plots.
data(edgeColors)
data(edgeTypes)
data(nodeColors)
data(nodeShapes)
if (require(Polychrome)) {
  swatch(edgeColors)
  swatch(nodeColors)
}
plot(0,0, type = "n", xlab="", ylab = "")
legend("center", legend = names(edgeColors), lwd = 3, col = edgeColors, lty = edgeTypes)
num <- c(rectangle = 15, circle = 16)
plot(0,0, type = "n", xlab="", ylab = "")
legend("center", legend = names(nodeColors), cex = 1.5, col = nodeColors, pch = num[nodeShapes])
Extract entities of different types from GPML files in order to convert the pathway to a mathematical graph that we can compute on.
collectEdges(xmldoc)
collectNodes(xmldoc)
collectGroups(xmldoc, allnodes)
collectAnchors(xmldoc)
collectLabels(xmldoc)
collectShapes(xmldoc)
xmldoc |
Either the name of an XML file meeting the specifications of the Genomic Pathway Markup Language (GPML), or an already-parsed XML document object, such as the one returned by XML::xmlParseDoc. |
allnodes |
A data frame containing node information, in the format produced by the collectNodes function. |
These functions are primarily intended as utility functions that implement processes required by the main function in the package, GPMLtoIgraph. They have been made accessible to the end user for use in debugging problematic GPML files or to reuse the GPML files in contexts other than the one we focus on in this package.
While the meaning of nodes (known as DataNodes in GPML) and edges (known as Interactions in GPML) should be obvious, some of the other objects are less so. For example, an Anchor in GPML is an invisible object used to allow an edge to point to another edge instead of to a node. That structure isn't allowed in graphs in mathematics or computer science. WayFindR handles this by creating a new node type to represent the anchor position, breaking the target edge into two pieces separated by the anchor, and adding an edge from the source of the anchored edge to the new node.
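The anchor transformation can be illustrated on plain data frames. The sketch below is not the code used by collectAnchors: the identifiers, the "Anchor" node type, the anchor's label, and the choice to give both pieces of the split edge the original edge type are all assumptions made for illustration.

```r
# Toy node and edge tables in the Source/Target/MIM and GraphId/label/Type
# layouts used elsewhere on this page.
nodes <- data.frame(GraphId = c("A", "B", "C"),
                    label   = c("GeneA", "GeneB", "GeneC"),
                    Type    = "GeneProduct",
                    stringsAsFactors = FALSE)
edges <- data.frame(Source = c("A", "C"),
                    Target = c("B", "anc1"),   # C points at an anchor, not a node
                    MIM    = c("mim-conversion", "mim-catalysis"),
                    stringsAsFactors = FALSE)

# Hypothetical anchor record: "anc1" sits on the edge in row 1 (A -> B).
anchorId <- "anc1"
hostRow  <- 1

# 1. Create a node for the anchor position (type name assumed).
nodes2 <- rbind(nodes,
                data.frame(GraphId = anchorId, label = anchorId,
                           Type = "Anchor", stringsAsFactors = FALSE))

# 2. Break the host edge A -> B into A -> anc1 and anc1 -> B
#    (assumed here to inherit the original edge type).
host   <- edges[hostRow, ]
pieces <- data.frame(Source = c(host$Source, anchorId),
                     Target = c(anchorId, host$Target),
                     MIM    = host$MIM,
                     stringsAsFactors = FALSE)

# 3. The edge C -> anc1 already targets the new node, so it is kept as is.
edges2 <- rbind(edges[-hostRow, ], pieces)
rownames(edges2) <- NULL
```

After the transformation every edge endpoint is an ordinary node, so the tables describe a valid graph.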
In GPML, a Label is a text box allowing extra information to be placed on a pathway, and a Shape is a graphical display object. The document type definition (DTD) for GPML describes both of these entities as non-semantic, intending them for display purposes only. However, some authors of pathways in the WikiPathways database use such objects as the (usually final, or "leaf") targets of interaction edges. When that happens, the WayFindR package deals with it by creating actual nodes to represent such labels or shapes. Other labels and shapes are ignored.
GPML also uses the idea of a Group as a first-class object in its DTD. These are defined as "A collection of structurally or functionally similar or related pathway elements." The GPML file subclassifies some groups as "Complexes", indicating that they represent physical interactions and bindings between two or more molecules. Other groups may simply indicate that there is a related set of molecules (for example, STAT2 and STAT3) that play the same role at this point in the pathway. WayFindR deals with groups by creating a new node to represent the group as a whole and expanding the component genes into nodes with a single "contained" edge that points to the new group node.
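The group expansion can be sketched with the same data frame layouts. This is an illustration rather than the collectGroups source: the identifiers, the group label, and the "Group" node type are assumptions; only the "contained" edge type comes from the description above.

```r
# Hypothetical members of one GPML group and an identifier for the group.
members <- c("id_STAT2", "id_STAT3")
groupId <- "grp1"

# One new node stands for the group as a whole (label and Type are assumed).
groupNodes <- data.frame(GraphId = groupId,
                         label   = "STAT group",
                         Type    = "Group",
                         stringsAsFactors = FALSE)

# Each member gets a single "contained" edge pointing to the group node.
groupEdges <- data.frame(Source = members,
                         Target = groupId,
                         MIM    = "contained",
                         stringsAsFactors = FALSE)

# A list with nodes and edges components, mirroring the return shape
# documented for collectGroups.
groupInfo <- list(nodes = groupNodes, edges = groupEdges)
```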
The collectEdges function returns a data frame with three columns (Source, Target, and MIM), where each row describes one edge (or "Interaction" in the GPML terminology) of the pathway/graph. The Source and Target columns are the alphanumeric identifiers of items describing nodes. The MIM column is the edge type in GPML, which often contains terminology based on the "Molecular Interaction Map" standard. When creating an igraph object from a pathway, the first two columns are used as identifiers to define the nodes at each end of the edge.
The collectNodes function returns a data frame with three columns (GraphId, label, and Type), where each row describes one node (or vertex) of the pathway/graph. The GraphId column is a unique alphanumeric identifier. The label column is a human-readable name for the node, often the official gene symbol. When creating an igraph object from a pathway, the first column is used as the identifier to define the node. Also, the plot method for igraph objects recognizes the term label as a column that defines the text that should be displayed in a node.
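The way such tables map onto a graph can be seen with the graph_from_data_frame function from igraph, which treats the first two edge columns as endpoints and the first vertex column as the vertex name. Whether WayFindR calls this particular function is an assumption; the toy tables below only mimic the column layout described above.

```r
library(igraph)

# Toy tables in the layouts returned by collectNodes and collectEdges.
nodes <- data.frame(GraphId = c("a1", "b2", "c3"),
                    label   = c("EGFR", "KRAS", "BRAF"),
                    Type    = "GeneProduct",
                    stringsAsFactors = FALSE)
edges <- data.frame(Source = c("a1", "b2"),
                    Target = c("b2", "c3"),
                    MIM    = c("mim-stimulation", "mim-stimulation"),
                    stringsAsFactors = FALSE)

# First two edge columns define the endpoints; the first vertex column
# becomes the vertex name. Remaining columns become attributes.
g <- graph_from_data_frame(edges, directed = TRUE, vertices = nodes)

V(g)$label   # vertex attribute that plot() uses as the displayed text
E(g)$MIM     # edge type carried along as an edge attribute
```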
The collectAnchors function returns a list containing a nodes element (in the same format returned by collectNodes) and an edges element (in the same format returned by collectEdges). The collectGroups function returns a list with nodes and edges components, just like the one from collectAnchors.
Both the collectLabels and collectShapes functions return the same kind of data frame that is returned by collectNodes.
xmlfile <- system.file("pathways/WP3850.gpml", package = "WayFindR")
xmldoc <- XML::xmlParseDoc(xmlfile)
edges <- collectEdges(xmldoc)
nodes <- collectNodes(xmldoc)
anchors <- collectAnchors(xmldoc)
labels <- collectLabels(xmldoc)
shapes <- collectShapes(xmldoc)
groups <- collectGroups(xmldoc, nodes)
Takes a GPML file from WikiPathways, extracts the entities therein, and makes minor adjustments necessary to convert it into an igraph object. Along the way, it assigns a consistent set of colors, line types, and shapes.
GPMLtoIgraph(xmldoc, returnLists = FALSE, debug = FALSE)
nodeLegend(x, graph)
edgeLegend(x, graph)
xmldoc |
Either the name of an XML file meeting the specifications of the Genomic Pathway Markup Language (GPML), or an already-parsed XML document object, such as the one returned by XML::xmlParseDoc. |
returnLists |
A logical value; should the return value include the node list and edge list matrices? |
debug |
A logical value; should debugging progress information be printed? Probably best to leave it equal to FALSE. |
x |
A character string, such as "topleft" indicating where to place the legend. |
graph |
An igraph object. |
GPMLtoIgraph is the main function of the WayFindR package. It achieves the primary goal of converting pathways from the GPML file format to a mathematical graph, in the format defined by the igraph package, at which point we can apply a wide variety of graph algorithms from computer science in order to "compute on biological pathways".
The implementation of GPMLtoIgraph relies on the utility functions described in gpml-utility. Briefly, the algorithm starts by collecting all nodes (DataNodes in GPML) and edges (Interactions in GPML) from the GPML input file. However, GPML includes two other structures with (semantic) biological meaning. First, the GPML description includes the idea of an (invisible) "Anchor" that allows one edge to point to another edge. We expand those invisible target locations into full-fledged nodes in the final graph. Second, GPML includes "Groups" that represent protein complexes or sets of closely related genes. In WayFindR, we represent such groups as their own nodes in the final graph, and add "contained" edges linking in the group members. The transformations of Anchors and Groups do not change the fundamental topology (in particular, the existence of cycles) of the resulting graph.
Finally, GPML includes non-semantic features (including "Labels" and "Shapes") that are (mis)used by some pathway authors as the targets of edges. WayFindR converts any targeted non-semantic features into nodes in order to preserve as much information as possible from the original pathway in WikiPathways.
The GPMLtoIgraph function usually returns an igraph object that represents the pathway defined by the input xmldoc. If the argument returnLists = TRUE, then it returns a list containing three components: graph is the igraph object, nodes is a data frame containing node information where each row is a node, and edges is a matrix containing edge information where each row is an edge. The node and edge information can be used to reproduce the graph in any network or graph visualization tool that accepts such matrices to describe the graph. The nodes data frame includes columns for color and shape, and the edges data frame includes columns for color and lty that are recognized and used by the plot.igraph function.
Both nodeLegend and edgeLegend invisibly return the same value that is returned by the legend function that is used in the implementation.
xmlfile <- system.file("pathways/WP3850.gpml", package = "WayFindR")
graf <- GPMLtoIgraph(xmlfile)
set.seed(13579)
L <- igraph::layout_with_graphopt
plot(graf, layout=L)
nodeLegend("topleft", graf)
edgeLegend("bottomright", graf)