Title: | Permutation Testing Network Module Preservation Across Datasets
---|---
Description: | Functions for assessing the replication/preservation of a network module's topology across datasets through permutation testing; Ritchie et al. (2016) <doi:10.1016/j.cels.2016.06.012>.
Authors: | Scott Ritchie [aut, cre] (ORCID: 0000-0002-8454-9548)
Maintainer: | Scott Ritchie <[email protected]>
License: | GPL-2
Version: | 1.2.7
Built: | 2024-11-11 07:28:20 UTC
Source: | CRAN
A 'disk.matrix' contains a file path to a matrix stored on disk, along with metadata describing how to read that file. This allows NetRep to load datasets into RAM only when required, i.e. one at a time, which significantly reduces R's memory usage when analysing large datasets. 'disk.matrix' objects may be supplied instead of 'matrix' objects in the input list arguments 'network', 'data', and 'correlation', which are common to most of NetRep's functions.
attach.disk.matrix(file, serialized = TRUE, ...)

serialize.table(file, ...)

is.disk.matrix(x)

as.disk.matrix(x, file, serialize = TRUE)

## S4 method for signature 'disk.matrix'
as.disk.matrix(x, file, serialize = TRUE)

## S4 method for signature 'matrix'
as.disk.matrix(x, file, serialize = TRUE)

## S4 method for signature 'ANY'
as.disk.matrix(x, file, serialize = TRUE)

## S4 method for signature 'disk.matrix'
as.matrix(x)

## S4 method for signature 'disk.matrix'
show(object)
file | for attach.disk.matrix and serialize.table, the path to the matrix file on disk; for as.disk.matrix, the path to save the matrix to.
serialized | determines how the matrix will be loaded from disk into R by as.matrix: readRDS if TRUE, read.table otherwise.
... | further arguments to be used by the function that reads in or writes out the matrix.
x | for as.disk.matrix, the matrix to store on disk; for is.disk.matrix, the object to check.
serialize | determines how the matrix is saved to disk by as.disk.matrix: saveRDS if TRUE, write.table otherwise.
object | a disk.matrix object.
Matrices may be stored either as regular table files that can be read by read.table, or as serialized R objects that can be read by readRDS. Serialized objects are much faster to load, but cannot be read by other programs.

The attach.disk.matrix function creates a disk.matrix object from a file path. The as.matrix function loads the data from disk into the R session as a regular matrix object.
The as.disk.matrix function converts a matrix into a disk.matrix by saving its contents to the specified file. The serialize argument determines whether the data is stored as a serialized R object or as a tab-separated file (i.e. sep="\t"). We recommend storing the matrix as a serialized R object unless disk space is a concern. More control over the storage format can be obtained by using saveRDS or write.table directly.
The serialize.table function converts a file in table format to a serialized R object with the same file name, but with the ".rds" extension.
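A minimal sketch of this workflow, using the example matrices from data("NetRep") (the file path here is illustrative):

library(NetRep)
data("NetRep")

# Save the matrix as a serialized R object and get back a 'disk.matrix'
# that holds only the file path:
dm <- as.disk.matrix(discovery_network, file = "discovery_network.rds")

# Load the data back into the R session when needed:
net <- as.matrix(dm)

# Or attach an already-saved file without loading it; note that
# attach.disk.matrix does not check that the file is readable:
dm2 <- attach.disk.matrix("discovery_network.rds")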
Either a disk.matrix object (attach.disk.matrix, as.disk.matrix), a matrix (as.matrix), the file path to a serialized matrix (serialize.table), or TRUE or FALSE indicating whether an object is a disk.matrix (is.disk.matrix).
file | the name of the file where the matrix is saved.
read.func | either "read.table" or "readRDS".
func.args | a list of arguments to be supplied to the 'read.func'.
attach.disk.matrix does not check whether the specified file can be read into R; as.matrix will fail and throw an error if the file cannot be read.
Example gene coexpression networks inferred from two independent datasets to demonstrate the usage of package functions.
data("NetRep")
data("NetRep")
discovery_network | a matrix with 150 columns and 150 rows containing the network edge weights encoding the interaction strength between each pair of genes in the discovery dataset.
discovery_data | a matrix with 150 columns (genes) and 30 rows (samples) whose entries correspond to the expression level of each gene in each sample in the discovery dataset.
discovery_correlation | a matrix with 150 columns and 150 rows containing the correlation coefficients between each pair of genes, calculated from the "discovery_data" matrix.
module_labels | a named vector with 150 entries containing the module assignment for each gene as identified in the discovery dataset.
test_network | a matrix with 150 columns and 150 rows containing the network edge weights encoding the interaction strength between each pair of genes in the test dataset.
test_data | a matrix with 150 columns (genes) and 30 rows (samples) whose entries correspond to the expression level of each gene in each sample in the test dataset.
test_correlation | a matrix with 150 columns and 150 rows containing the correlation coefficients between each pair of genes, calculated from the "test_data" matrix.
The preservation of network modules in a second
dataset is quantified by measuring the preservation of topological
properties between the discovery and test datasets. These
properties are calculated not only from the interaction networks inferred
in each dataset, but also from the data used to infer those networks (e.g.
gene expression data) as well as the correlation structure between
variables/nodes. Thus, all functions in the NetRep
package have the
following arguments:
network: a list of interaction networks, one for each dataset.

data: a list of data matrices used to infer those networks, one for each dataset.

correlation: a list of matrices containing the pairwise correlation coefficients between variables/nodes in each dataset.

moduleAssignments: a list of vectors, one for each discovery dataset, containing the module assignments for each node in that dataset.

modules: a list of vectors, one for each discovery dataset, containing the names of the modules from that dataset to analyse.

discovery: a vector indicating the names or indices of the previous arguments' lists to use as the discovery dataset(s) for the analyses.

test: a list of vectors, one vector for each discovery dataset, containing the names or indices of the network, data, and correlation argument lists to use as the test dataset(s) for the analysis of each discovery dataset.
This data is used to provide concrete examples of the usage of these arguments in each package function.
The discovery gene expression dataset ("discovery_data"
)
containing 30 samples and 150 genes was simulated to contain four distinct
modules of sizes 20, 25, 30, and 35 genes. Data for each module were
simulated as:
\[ g_{ij} = s_i \, r_j + \sqrt{1 - r_j^2} \, \epsilon_{ij} \]

where \(s\) is the simulated module's summary vector (one entry per sample \(i\)), \(r\) is the simulated module's node contributions (one entry per gene \(j\)), and \(\epsilon_{ij}\) is the error term drawn from a standard normal distribution. \(s\) and \(r\) were simulated by bootstrapping (sampling with replacement) samples and genes from the corresponding vectors in modules 63, 51, 57, and 50 discovered in the liver tissue gene expression data from a publicly available mouse dataset (see reference (1) for details on the dataset and network discovery). The remaining 40 genes that were not part of any module were simulated by randomly selecting 40 liver genes, bootstrapping 30 samples, and adding the noise term \(\epsilon\). A vector of module assignments was created ("module_labels") in which each gene was labelled with a number 1-4 corresponding to the module it was simulated to be coexpressed with, or with 0 for the 40 "background" genes not participating in any module. The correlation structure ("discovery_correlation") was calculated as the Pearson correlation coefficient between genes (cor(discovery_data)). Edge weights in the interaction network ("discovery_network") were calculated as the absolute value of the correlation coefficient exponentiated to the power 5 (abs(discovery_correlation)^5).
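These two steps can be reproduced directly from the example data (a short sketch of the construction just described):

data("NetRep")
disc_cor <- cor(discovery_data)          # Pearson correlation between genes
disc_net <- abs(disc_cor)^5              # edge weights as |r|^5
all.equal(disc_net, discovery_network)   # should be TRUE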
An independent test dataset ("test_data") containing the same 150 genes as the discovery dataset but 30 different samples was simulated as above. Modules 1 and 4 (containing 20 and 35 genes respectively) were simulated to be preserved using the same equation above, where the summary vector \(s\) was bootstrapped from the same liver modules (modules 63 and 50) as in the discovery dataset, with node contributions \(r\) identical to those in the discovery dataset. Genes in modules 2 and 3 were simulated as "background" genes, i.e. not preserved, as described above. The correlation structure between genes in the test dataset ("test_correlation") and the interaction network ("test_network") were calculated in the same way as in the discovery dataset.
The random seed used for the simulations was 37.
Ritchie, S.C., et al., A scalable permutation approach reveals replication and preservation patterns of network modules in large datasets. Cell Systems. 3, 71-82 (2016).
modulePreservation, plotModule, and networkProperties.
Quantify the preservation of network modules (sub-graphs) in an independent dataset through permutation testing on module topology. Seven network statistics (see details) are calculated for each module and then tested by comparing to distributions generated from their calculation on random subsets in the test dataset.
modulePreservation(
  network, data, correlation, moduleAssignments,
  modules = NULL, backgroundLabel = "0", discovery = 1, test = 2,
  selfPreservation = FALSE, nThreads = NULL, nPerm = NULL,
  null = "overlap", alternative = "greater", simplify = TRUE,
  verbose = TRUE
)
network | a list of interaction networks, one for each dataset. Each entry of the list should be a matrix of edge weights between each pair of nodes in the inferred network for that dataset.
data | a list of matrices, one for each dataset. Each entry of the list should be the data used to infer the interaction network for that dataset, with samples as rows and variables (nodes) as columns.
correlation | a list of matrices, one for each dataset. Each entry of the list should be a matrix of pairwise correlation coefficients between nodes in that dataset.
moduleAssignments | a list of vectors, one for each discovery dataset, containing the module assignments for each node in that dataset.
modules | a list of vectors, one for each discovery dataset, containing the names of the modules from that dataset to analyse.
backgroundLabel | a single label given to nodes that do not belong to any module in the moduleAssignments argument.
discovery | a vector of names or indices denoting the discovery dataset(s) in the network, data, correlation, moduleAssignments, modules, and test lists.
test | a list of vectors, one for each discovery dataset, containing the names or indices of the network, data, and correlation lists to use as the test dataset(s) for each discovery dataset.
selfPreservation | logical; if FALSE (default), module preservation analysis is not performed within a dataset (i.e. where the discovery and test datasets are the same).
nThreads | number of threads to parallelise the calculation of network properties over. Automatically determined as the number of cores - 1 if not specified.
nPerm | number of permutations to use. If not specified, the number of permutations will be automatically determined (see details). When set to 0 the permutation procedure will be skipped and the observed module preservation will be returned without p-values.
null | variables to include when generating the null distributions. Must be either "overlap" or "all" (see details).
alternative | the type of module preservation test to perform. Must be one of "greater" (default), "less", or "two.sided" (see details).
simplify | logical; if TRUE (default), the structure of the returned list is simplified where possible (see return value).
verbose | logical; should progress be reported? Default is TRUE.
The preservation of network modules in a second dataset is quantified by
measuring the preservation of topological properties between the
discovery and test datasets. These properties are calculated
not only from the interaction networks inferred in each dataset, but also
from the data used to infer those networks (e.g. gene expression data) as
well as the correlation structure between variables/nodes. Thus, all
functions in the NetRep
package have the following arguments:
network: a list of interaction networks, one for each dataset.

data: a list of data matrices used to infer those networks, one for each dataset.

correlation: a list of matrices containing the pairwise correlation coefficients between variables/nodes in each dataset.

moduleAssignments: a list of vectors, one for each discovery dataset, containing the module assignments for each node in that dataset.

modules: a list of vectors, one for each discovery dataset, containing the names of the modules from that dataset to analyse.

discovery: a vector indicating the names or indices of the previous arguments' lists to use as the discovery dataset(s) for the analyses.

test: a list of vectors, one vector for each discovery dataset, containing the names or indices of the network, data, and correlation argument lists to use as the test dataset(s) for the analysis of each discovery dataset.
The formatting of these arguments is not strict: each function will attempt to make sense of the user input. For example, if there is only one discovery dataset, then input to the moduleAssignments and test arguments may be vectors, rather than lists.
Matrices in the network, data, and correlation lists can be supplied as disk.matrix objects. This class allows matrix data to be kept on disk and loaded as required by NetRep. This dramatically decreases memory usage: the matrices for only one dataset will be kept in RAM at any point in time.
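For example, the input lists might be built from files previously saved with as.disk.matrix (the file names here are hypothetical):

network_list <- list(
  discovery = attach.disk.matrix("discovery_network.rds"),
  test = attach.disk.matrix("test_network.rds")
)
data_list <- list(
  discovery = attach.disk.matrix("discovery_data.rds"),
  test = attach.disk.matrix("test_data.rds")
)
correlation_list <- list(
  discovery = attach.disk.matrix("discovery_correlation.rds"),
  test = attach.disk.matrix("test_correlation.rds")
)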
Additional memory usage of the permutation procedure is directly proportional to the sum of module sizes squared multiplied by the number of threads. Very large modules may result in significant additional memory usage per core due to extraction of the correlation coefficient sub-matrix at each permutation.
Module preservation is assessed through seven module preservation statistics, each of which captures a different aspect of a module's topology; i.e. the structure of the relationships between its nodes (1,2). Below is a description of each statistic, what they seek to measure, and where their interpretation may be inappropriate.
The module coherence ('coherence'), average node contribution ('avg.contrib'), and concordance of node contribution ('cor.contrib') are all calculated from the data used to infer the network (provided in the 'data' argument). They are calculated from the module's summary profile: the first eigenvector from a principal component analysis of the module's data, summarising every node composing the module across all observations. For gene coexpression modules this can be interpreted as a "summary expression profile". It is typically referred to as the "module eigengene" in the weighted gene coexpression network analysis literature (4).
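Conceptually, the summary profile corresponds to the first principal component of the module's (scaled) data. A rough sketch for module 1 of the example data follows; NetRep's internal calculation may differ in detail:

data("NetRep")
mod_data <- scale(discovery_data[, module_labels == 1])
pca <- prcomp(mod_data)
summary_profile <- pca$x[, 1]                  # the "module eigengene"
node_contrib <- cor(mod_data, summary_profile) # node contributions (see below)
coherence <- pca$sdev[1]^2 / sum(pca$sdev^2)   # proportion of variance explained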
The module coherence ('coherence') quantifies the proportion of module variance explained by the module's "summary profile". The higher this value, the more "coherent" the data is, i.e. the more similar the observations are across nodes for each sample. With the default alternate hypothesis, a small permutation P-value indicates that the module is more coherent than expected by chance.
The average node contribution ('avg.contrib') and concordance of node contribution ('cor.contrib') are calculated from the node contribution, which quantifies how similar each node is to the module's summary profile. It is calculated as the Pearson correlation coefficient between each node and the module summary profile. In the weighted gene coexpression network literature it is typically called the "module membership" (2).
The average node contribution ('avg.contrib') quantifies how similar nodes are to the module summary profile in the test dataset. Nodes detract from this score where the sign of their node contribution flips between the discovery and test datasets, e.g. in the case of differential gene expression across conditions. A high average node contribution with a small permutation P-value indicates that the module remains coherent in the test dataset, and that the nodes are acting together in a similar way.
The concordance of node contribution ('cor.contrib') measures whether the relative rank of nodes (in terms of their node contribution) is preserved across datasets. If a module is coherent enough that all nodes contribute strongly, then this statistic will not be meaningful, as its value will be heavily influenced by tiny variations in node rank. This can be assessed through visualisation of the module topology (see plotContribution). Similarly, a strong 'cor.contrib' is unlikely to be meaningful if the 'avg.contrib' is not significant.
The concordance of correlation structure ('cor.cor') and density of correlation structure ('avg.cor') are calculated from the user-provided correlation structure between nodes (provided in the 'correlation' argument). This is referred to as "coexpression" when calculated on gene expression data.
The 'avg.cor' measures how strongly nodes within a module are correlated on average in the test dataset. This average depends on the correlation coefficients in the discovery dataset: the score is penalised where correlation coefficients change in sign between datasets. A high 'avg.cor' with a small permutation P-value indicates that the module is (a) more strongly correlated than expected by chance for a module of the same size, and (b) more consistently correlated with respect to the discovery dataset than expected by chance.
The 'cor.cor' measures how similar the correlation coefficients are across the two datasets. A high 'cor.cor' with a small permutation P-value indicates that the correlation structure within a module is more similar across datasets than expected by chance. If all nodes within a module are very similarly correlated then this statistic will not be meaningful, as its value will be heavily influenced by tiny, non-meaningful variations in correlation strength. This can be assessed through visualisation of the module topology (see plotCorrelation). Similarly, a strong 'cor.cor' is unlikely to be meaningful if the 'avg.cor' is not significant.
The average edge weight ('avg.weight') and concordance of weighted degree ('cor.degree') are both calculated from the interaction network (provided as adjacency matrices to the 'network' argument).

The 'avg.weight' measures the average connection strength between nodes in the test dataset. In the weighted gene coexpression network literature this is typically called the "module density" (2). A high 'avg.weight' with a small permutation P-value indicates that the module is more strongly connected in the test dataset than expected by chance.
The 'cor.degree' calculates whether the relative rank of each node's weighted degree is similar across datasets. The weighted degree is calculated as the sum of a node's edge weights to all other nodes in the module. In the weighted gene coexpression network literature this is typically called the "intramodular connectivity" (2). This statistic will not be meaningful where all nodes are connected to each other with similar strength, as its value will be heavily influenced by tiny, non-meaningful variations in weighted degree. This can be assessed through visualisation of the module topology (see plotDegree).
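As an illustration, the weighted degree underlying 'cor.degree' (and the module density underlying 'avg.weight') can be computed by hand from the example data; a sketch for module 1, which assumes self-edges are ignored:

data("NetRep")
mod_nodes <- names(module_labels)[module_labels == 1]
sub_net <- discovery_network[mod_nodes, mod_nodes]
diag(sub_net) <- 0                                # drop self-edges (an assumption of this sketch)
wdeg <- rowSums(sub_net)                          # weighted within-module degree
avg_weight <- mean(sub_net[upper.tri(sub_net)])   # cf. 'avg.weight' (module density)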
Both the 'avg.weight' and 'cor.degree' assume edges are weighted, and that the network is densely connected. Note that for sparse networks, edges with zero weight are included when calculating both statistics. Only the magnitude of the weights, not their sign, contributes to the score. If the network is unweighted, i.e. edges indicate the presence or absence of a relationship, then the 'avg.weight' will be the proportion of the number of edges to the total number of possible edges, while the weighted degree simply becomes the degree. A high 'avg.weight' in this case measures how interconnected a module is in the test dataset. A high degree indicates that a node is connected to many other nodes. The interpretation of the 'cor.degree' remains unchanged between weighted and unweighted networks. If the network is directed, the interpretation of the 'avg.weight' remains unchanged, while the 'cor.degree' will measure the concordance of the node in-degree in the test network. To measure the out-degree instead, the adjacency matrices provided to the 'network' argument should be transposed.
Caution should be used when running NetRep on sparse data (i.e. where there are many zero values in the data used to infer the network). For this data, the average node contribution ('avg.contrib'), concordance of node contribution ('cor.contrib'), and module coherence ('coherence') will all be systematically underestimated due to their reliance on the Pearson correlation coefficient to calculate the node contribution. Care should also be taken to use appropriate methods for inferring the correlation structure when the data is sparse, for the same reason.
Caution should also be used when running NetRep on proportional data (i.e. where observations across samples all sum to the same value, e.g. 1). For this data, the average node contribution ('avg.contrib'), concordance of node contribution ('cor.contrib'), and module coherence ('coherence') will all be systematically overestimated due to their reliance on the Pearson correlation coefficient to calculate the node contribution. Care should also be taken to use appropriate methods for inferring the correlation structure from proportional data, for the same reason.
Three alternative hypotheses are available. "greater", the default, tests whether each module preservation statistic is larger than expected by chance. "less" tests whether each module preservation statistic is smaller than expected by chance, which may be useful for identifying modules that are extremely different in the test dataset. "two.sided" can be used to test both alternate hypotheses.
To determine whether a module preservation statistic deviates from chance, a
permutation procedure is employed. Each statistic is calculated between the
module in the discovery dataset and nPerm
random subsets of
the same size in the test dataset in order to assess the distribution
of each statistic under the null hypothesis.
Two models for the null hypothesis are available. With "overlap", the default, only nodes that are present in both the discovery and test networks are used when generating null distributions. This is appropriate under an assumption that nodes present in the test dataset, but not present in the discovery dataset, are unobserved: that is, they may fall in the module(s) of interest in the discovery dataset if they were to be measured there. Alternatively, "all" will use all nodes in the test network when generating the null distributions.
The number of permutations required for any given significance threshold is approximately 1 / the desired significance for one-sided tests, and double that for two-sided tests. This can be calculated with requiredPerms. When nPerm is not specified, the number of permutations is automatically calculated as the number required for a Bonferroni-corrected significance threshold adjusting for the total number of tests for each statistic, i.e. the total number of modules to be analysed multiplied by the number of test datasets each module is tested in. Although assessing the replication of a small number of modules calls for very few permutations, we recommend using no fewer than 1,000, as fewer permutations are unlikely to generate representative null distributions.
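For example, with the four modules in the example data tested in a single test dataset at alpha = 0.05, the Bonferroni-adjusted number of permutations can be computed with requiredPerms, then floored at 1,000 as recommended above:

alpha <- 0.05 / (4 * 1)                   # 4 modules x 1 test dataset
nPerm <- max(requiredPerms(alpha), 1000)  # use no fewer than 1,000 permutations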
Note: the assumption used by requiredPerms to determine the correct number of permutations breaks down when assessing the preservation of modules in a very small dataset (e.g. gene sets in a dataset with fewer than 100 genes total). However, the reported p-values will still be accurate (see permutationTest) (3).
A nested list structure. At the top level, the list has one element per 'discovery' dataset. Each of these elements is a list that has one element per 'test' dataset analysed for that 'discovery' dataset. Each of these elements is also a list, containing the following objects:
observed: a matrix of the observed values for the module preservation statistics. Rows correspond to modules, and columns to the module preservation statistics.

nulls: a three-dimensional array containing the values of the module preservation statistics evaluated on random permutations of module assignment in the test network. Rows correspond to modules, columns to the module preservation statistics, and the third dimension to the permutations.

p.values: a matrix of p-values for the observed module preservation statistics, as evaluated through a permutation test using the corresponding values in nulls.

nVarsPresent: a vector containing the number of variables that are present in the test dataset for each module.

propVarsPresent: a vector containing the proportion of variables present in the test dataset for each module. Modules where this is less than 1 should be investigated further before making judgements about preservation to ensure that the missing variables are not the most connected ones.

contingency: if moduleAssignments are present for both the discovery and test datasets, a contingency table showing the overlap between modules across datasets. Rows correspond to modules in the discovery dataset, columns to modules in the test dataset.
When simplify = TRUE the simplest possible structure will be returned, e.g. if module preservation is tested in only one dataset, then the returned list will have only the above elements. When simplify = FALSE a nested list of datasets will always be returned, i.e. each element at the top level and second level corresponds to a dataset, e.g. results[["Dataset1"]][["Dataset2"]] indicates an analysis where modules discovered in "Dataset1" are assessed for preservation in "Dataset2". Dataset comparisons which have not been assessed will contain NULL.
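A common follow-up (a sketch assuming the `preservation` object produced in the examples below) is to call a module preserved only when all seven statistics have small permutation p-values:

# Maximum permutation p-value across the seven statistics for each module:
max_pval <- apply(preservation$p.values, 1, max)
# Modules where every statistic is significant at 0.01:
names(max_pval)[max_pval < 0.01]
# The observed statistics themselves:
preservation$observed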
Ritchie, S.C., et al., A scalable permutation approach reveals replication and preservation patterns of network modules in large datasets. Cell Systems. 3, 71-82 (2016).
Langfelder, P., Luo, R., Oldham, M. C. & Horvath, S. Is my network module preserved and reproducible? PLoS Comput. Biol. 7, e1001057 (2011).
Phipson, B. & Smyth, G. K. Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn. Stat. Appl. Genet. Mol. Biol. 9, Article39 (2010).
Langfelder, P. & Horvath, S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9, 559 (2008).
Functions for: visualising network modules, calculating module topology, calculating permutation test P-values, and splitting computation over multiple machines.
# load in example data, correlation, and network matrices for a discovery and test dataset:
data("NetRep")

# Set up input lists for each input matrix type across datasets. The list
# elements can have any names, so long as they are consistent between the
# inputs.
network_list <- list(discovery=discovery_network, test=test_network)
data_list <- list(discovery=discovery_data, test=test_data)
correlation_list <- list(discovery=discovery_correlation, test=test_correlation)
labels_list <- list(discovery=module_labels)

# Assess module preservation: you should run at least 10,000 permutations
preservation <- modulePreservation(
  network=network_list, data=data_list, correlation=correlation_list,
  moduleAssignments=labels_list, nPerm=1000, discovery="discovery",
  test="test", nThreads=2
)
Functions for assessing the replication/preservation of a network module's topology across datasets through permutation testing. This is suitable for networks that can be meaningfully inferred from multiple datasets. These include gene coexpression networks, protein-protein interaction networks, and microbial interaction networks. Modules within these networks consist of groups of nodes that are particularly interesting: for example a group of tightly connected genes associated with a disease, groups of genes annotated with the same term in the Gene Ontology database, or groups of interacting microbial species, i.e. communities. Application of this method can answer questions such as: (1) do the relationships between genes in a module replicate in an independent cohort? (2) are these gene coexpression modules preserved across tissues or tissue specific? (3) are these modules conserved across species? (4) are microbial communities preserved across multiple spatial locations?
The main function for this package is modulePreservation. Several functions for downstream analysis are also provided: networkProperties for calculating the topological properties of a module, and plotModule for visualising a module.
Maintainer: Scott Ritchie [email protected] (0000-0002-8454-9548)
Useful links:
Report bugs at https://github.com/sritchie73/NetRep/issues
Calculates the network properties used to assess module preservation for one or more modules in a user specified dataset.
networkProperties(
  network, data, correlation, moduleAssignments = NULL,
  modules = NULL, backgroundLabel = "0", discovery = NULL,
  test = NULL, simplify = TRUE, verbose = TRUE
)
network | a list of interaction networks, one for each dataset. Each entry of the list should be a matrix of edge weights between each pair of nodes in the inferred network for that dataset.
data | a list of matrices, one for each dataset. Each entry of the list should be the data used to infer the interaction network for that dataset, with samples as rows and variables (nodes) as columns.
correlation | a list of matrices, one for each dataset. Each entry of the list should be a matrix of pairwise correlation coefficients between nodes in that dataset.
moduleAssignments | a list of vectors, one for each discovery dataset, containing the module assignments for each node in that dataset.
modules | a list of vectors, one for each discovery dataset, containing the names of the modules from that dataset to analyse.
backgroundLabel | a single label given to nodes that do not belong to any module in the moduleAssignments argument.
discovery | a vector of names or indices denoting the discovery dataset(s) in the network, data, correlation, moduleAssignments, modules, and test lists.
test | a list of vectors, one for each discovery dataset, containing the names or indices of the network, data, and correlation lists to use as the test dataset(s) for each discovery dataset.
simplify | logical; if TRUE (default), the structure of the returned list is simplified where possible (see return value).
verbose | logical; should progress be reported? Default is TRUE.
The preservation of network modules in a second
dataset is quantified by measuring the preservation of topological
properties between the discovery and test datasets. These
properties are calculated not only from the interaction networks inferred
in each dataset, but also from the data used to infer those networks (e.g.
gene expression data) as well as the correlation structure between
variables/nodes. Thus, all functions in the NetRep
package have the
following arguments:
network: a list of interaction networks, one for each dataset.

data: a list of data matrices used to infer those networks, one for each dataset.

correlation: a list of matrices containing the pairwise correlation coefficients between variables/nodes in each dataset.

moduleAssignments: a list of vectors, one for each discovery dataset, containing the module assignments for each node in that dataset.

modules: a list of vectors, one for each discovery dataset, containing the names of the modules from that dataset to analyse.

discovery: a vector indicating the names or indices of the previous arguments' lists to use as the discovery dataset(s) for the analyses.

test: a list of vectors, one vector for each discovery dataset, containing the names or indices of the network, data, and correlation argument lists to use as the test dataset(s) for the analysis of each discovery dataset.
The formatting of these arguments is not strict: each function will attempt to make sense of the user input. For example, if there is only one discovery dataset, then input to the moduleAssignments and test arguments may be vectors, rather than lists. If the networkProperties are being calculated within the discovery or test datasets, then the discovery and test arguments do not need to be specified, and the input matrices for the network, data, and correlation arguments do not need to be wrapped in a list.
Matrices in the network, data, and correlation lists can be supplied as disk.matrix objects. This class allows matrix data to be kept on disk and loaded as required by NetRep. This dramatically decreases memory usage: the matrices for only one dataset will be kept in RAM at any point in time.
A nested list structure. At the top level, the list has one element per 'discovery' dataset. Each of these elements is a list that has one element per 'test' dataset analysed for that 'discovery' dataset. Each of these elements is a list that has one element per 'modules' specified. Each of these is a list containing the following objects:
'degree': the weighted within-module degree: the sum of edge weights for each node in the module.

'avgWeight': the average edge weight within the module.

If the 'data' used to infer the 'test' network is provided then the following are also returned:

'summary': a vector summarising the module across each sample. This is calculated as the first eigenvector of the module from a principal component analysis.

'contribution': the node contribution: the similarity between each node and the module summary profile ('summary').

'coherence': the proportion of module variance explained by the 'summary' vector.
When simplify = TRUE the simplest possible structure will be returned, e.g. if the network properties are requested for only one module in only one dataset, then the returned list will have only the above elements. When simplify = FALSE a nested list of datasets will always be returned, i.e. each element at the top level and second level corresponds to a dataset, and each element at the third level corresponds to modules discovered in the dataset specified at the top level, if module labels are provided in the corresponding moduleAssignments list element. E.g. results[["Dataset1"]][["Dataset2"]][["module1"]] will contain the properties of "module1" as calculated in "Dataset2", where "module1" was identified in "Dataset1". Modules and datasets for which calculation of the network properties has not been requested will contain NULL.
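For example (a sketch assuming the `props` object produced in the example below, with simplify = TRUE), the returned properties can be used to rank nodes by importance:

m1 <- props[["1"]]                                      # properties of module "1"
node_order <- names(sort(m1$degree, decreasing = TRUE)) # nodes by weighted degree
m1$coherence                                            # variance explained by 'summary'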
Getting nodes ordered by degree, and ordering samples by module summary.
# load in example data, correlation, and network matrices for a discovery and test dataset:
data("NetRep")

# Set up input lists for each input matrix type across datasets. The list
# elements can have any names, so long as they are consistent between the
# inputs.
network_list <- list(discovery=discovery_network, test=test_network)
data_list <- list(discovery=discovery_data, test=test_data)
correlation_list <- list(discovery=discovery_correlation, test=test_correlation)
labels_list <- list(discovery=module_labels)

# Calculate the topological properties of all network modules in the discovery dataset
props <- networkProperties(
  network=network_list, data=data_list, correlation=correlation_list,
  moduleAssignments=labels_list
)

# Calculate the topological properties in the test dataset for the same modules
test_props <- networkProperties(
  network=network_list, data=data_list, correlation=correlation_list,
  moduleAssignments=labels_list, discovery="discovery", test="test"
)
Plot the correlation structure, network edges, scaled weighted degree, node contribution, module data, and module summary vectors of one or more network modules.
Individual components of the module plot can be plotted using plotCorrelation, plotNetwork, plotDegree, plotContribution, plotData, and plotSummary.
plotModule(
  network, data, correlation, moduleAssignments = NULL, modules = NULL,
  backgroundLabel = "0", discovery = NULL, test = NULL, verbose = TRUE,
  orderSamplesBy = NULL, orderNodesBy = NULL, orderModules = TRUE,
  plotNodeNames = TRUE, plotSampleNames = TRUE, plotModuleNames = NULL,
  main = "Module Topology", main.line = 1, drawBorders = FALSE, lwd = 1,
  naxt.line = -0.5, saxt.line = -0.5, maxt.line = NULL, xaxt.line = -0.5,
  xaxt.tck = -0.025, xlab.line = 2.5, yaxt.line = 0, yaxt.tck = -0.15,
  ylab.line = 2.5, laxt.line = 2.5, laxt.tck = 0.04, cex.axis = 0.8,
  legend.main.line = 1.5, cex.lab = 1.2, cex.main = 2, dataCols = NULL,
  dataRange = NULL, corCols = correlation.palette(), corRange = c(-1, 1),
  netCols = network.palette(), netRange = c(0, 1), degreeCol = "#feb24c",
  contribCols = c("#A50026", "#313695"), summaryCols = c("#1B7837", "#762A83"),
  naCol = "#bdbdbd", dryRun = FALSE
)
network | a list of interaction networks, one for each dataset. Each entry of the list should be a matrix of edge weights between each pair of nodes in the inferred network for that dataset.
data | a list of matrices, one for each dataset. Each entry of the list should be the data used to infer the interaction network for that dataset, with samples as rows and variables (nodes) as columns.
correlation | a list of matrices, one for each dataset. Each entry of the list should be a matrix of pairwise correlation coefficients between nodes in that dataset.
moduleAssignments | a list of vectors, one for each discovery dataset, containing the module assignments for each node in that dataset.
modules | a list of vectors, one for each discovery dataset, containing the names of the modules from that dataset to analyse.
backgroundLabel | a single label given to nodes that do not belong to any module in the moduleAssignments argument.
discovery | a vector of names or indices denoting the discovery dataset(s) in the network, data, correlation, moduleAssignments, modules, and test lists.
test | a list of vectors, one for each discovery dataset, containing the names or indices of the network, data, and correlation lists to use as the test dataset(s) for each discovery dataset.
verbose | logical; should progress be reported? Default is TRUE.
orderSamplesBy | the name or index of a dataset to order samples by their module summary vector in, or NA to suppress sample ordering (see details).
orderNodesBy | the name(s) or index (indices) of the dataset(s) to order nodes by their weighted degree in, or NA to suppress node ordering (see details).
orderModules | logical; if TRUE (default), multiple modules are ordered by the similarity of their summary vectors (see details).
plotNodeNames | logical; controls whether the node names are drawn on the bottom axis.
plotSampleNames | logical; controls whether the sample names are drawn on the left axis.
plotModuleNames | logical; controls whether module names are drawn. The default is for module names to be drawn when multiple modules are plotted.
main | title for the plot.
main.line | the number of lines into the top margin at which the plot title will be drawn.
drawBorders | logical; if TRUE, borders are drawn around the weighted degree, node contribution, and module summary bar plots.
lwd | line width for borders and axes.
naxt.line | the number of lines into the bottom margin at which the node names will be drawn.
saxt.line | the number of lines into the left margin at which the sample names will be drawn.
maxt.line | the number of lines into the bottom margin at which the module names will be drawn.
xaxt.line | the number of lines into the bottom margin at which the x-axis tick labels will be drawn on the module summary bar plot.
xaxt.tck | the size of the x-axis ticks for the module summary bar plot.
xlab.line | the number of lines into the bottom margin at which the x-axis label on the module summary bar plot(s) will be drawn.
yaxt.line | the number of lines into the left margin at which the y-axis tick labels will be drawn on the weighted degree and node contribution bar plots.
yaxt.tck | the size of the y-axis ticks for the weighted degree and node contribution bar plots.
ylab.line | the number of lines into the left margin at which the y-axis labels on the weighted degree and node contribution bar plots will be drawn.
laxt.line | the distance from the legend to draw the legend axis labels, as a multiple of laxt.tck.
laxt.tck | size of the ticks on each axis legend relative to the size of the correlation, edge weight, and data matrix heatmaps.
cex.axis | relative size of the node and sample names.
legend.main.line | the distance from the legend to draw the legend title.
cex.lab | relative size of the module names and legend titles.
cex.main | relative size of the plot titles.
dataCols | a character vector of colors to create a gradient from for the data heatmap (see details). Automatically determined if not specified.
dataRange | the range of values to map to the dataCols gradient (see details).
corCols | a character vector of colors to create a gradient from for the correlation structure heatmap (see details).
corRange | the range of values to map to the corCols gradient (see details).
netCols | a character vector of colors to create a gradient from for the network edge weight heatmap (see details).
netRange | the range of values to map to the netCols gradient (see details).
degreeCol | color to use for the weighted degree bar plot.
contribCols | color(s) to use for the node contribution bar plot (see details).
summaryCols | color(s) to use for the module summary bar plot (see details).
naCol | color to use for missing nodes and samples on the data, correlation structure, and network edge weight heatmaps.
dryRun | logical; if TRUE, only the axes, legends, labels, and title are drawn (see details).
The preservation of network modules in a second
dataset is quantified by measuring the preservation of topological
properties between the discovery and test datasets. These
properties are calculated not only from the interaction networks inferred
in each dataset, but also from the data used to infer those networks (e.g.
gene expression data) as well as the correlation structure between
variables/nodes. Thus, all functions in the NetRep
package have the
following arguments:
network: a list of interaction networks, one for each dataset.

data: a list of data matrices used to infer those networks, one for each dataset.

correlation: a list of matrices containing the pairwise correlation coefficients between variables/nodes in each dataset.

moduleAssignments: a list of vectors, one for each discovery dataset, containing the module assignments for each node in that dataset.

modules: a list of vectors, one for each discovery dataset, containing the names of the modules from that dataset to analyse.

discovery: a vector indicating the names or indices of the previous arguments' lists to use as the discovery dataset(s) for the analyses.

test: a list of vectors, one vector for each discovery dataset, containing the names or indices of the network, data, and correlation argument lists to use as the test dataset(s) for the analysis of each discovery dataset.
The formatting of these arguments is not strict: each function will attempt to make sense of the user input. For example, if there is only one discovery dataset, then input to the moduleAssignments and test arguments may be vectors, rather than lists. If the node and sample ordering is being calculated within the same dataset being visualised, then the discovery and test arguments do not need to be specified, and the input matrices for the network, data, and correlation arguments do not need to be wrapped in a list.
Matrices in the network, data, and correlation lists can be supplied as disk.matrix objects. This class allows matrix data to be kept on disk and loaded as required by NetRep. This dramatically decreases memory usage: the matrices for only one dataset will be kept in RAM at any point in time.
By default, nodes are ordered in decreasing order of weighted degree in the discovery dataset (see nodeOrder). Missing nodes are colored in grey. This facilitates the visual comparison of modules across datasets, as the node ordering will be preserved. Alternatively, a vector containing the names or indices of one or more datasets can be provided to the orderNodesBy argument. If a single dataset is provided, then nodes will be ordered in decreasing order of weighted degree in that dataset. Only nodes that are present in this dataset will be drawn when ordering nodes by a dataset that is not the discovery dataset for the requested module(s).
If multiple datasets are provided then the weighted degree will be averaged across these datasets (see nodeOrder for more details). This is useful for obtaining a robust ordering of nodes by relative importance, assuming the modules displayed are preserved in those datasets.
Ordering of nodes by weighted degree can be suppressed by setting orderNodesBy to NA, in which case nodes will be ordered as in the matrices provided in the network, data, and correlation arguments.
When multiple modules are drawn, modules are ordered by the similarity of their summary vectors in the dataset(s) specified in the orderNodesBy argument. If multiple datasets are provided to the orderNodesBy argument then the module summary vectors are concatenated across datasets.
By default, samples in the data heatmap and accompanying module summary bar plot are ordered in descending order of module summary in the drawn dataset (specified by the test argument). If multiple modules are drawn, samples are ordered as per the left-most module on the plot. Alternatively, a vector containing the name or index of another dataset may be provided to the orderSamplesBy argument. In this case, samples will be ordered in descending order of module summary in the specified dataset. This is useful when comparing different measurements across samples, for example, gene expression data obtained from multiple tissue samples across the same individuals. If the dataset specified is the discovery dataset, then missing samples will be displayed as horizontal grey bars. If the dataset specified is one of the other datasets, samples present in both the specified dataset and the test dataset will be displayed first in order of the specified dataset, then samples present only in the test dataset will be displayed underneath a horizontal black line, ordered by their module summary vector in the test dataset.
Ordering of samples by module summary can be suppressed by setting orderSamplesBy to NA, in which case samples will be ordered as in the matrix provided to the data argument for the drawn dataset.
When drawn on a plot, the weighted degree of each node is scaled to the maximum weighted degree within its module. The scaled weighted degree is a measure of each node's relative importance to its module. This makes visualisation of multiple modules with different sizes and densities possible. However, the scaled weighted degree should only be interpreted for groups of nodes that have an apparent module structure.
For optimal results we recommend viewing single modules on a PNG device with a width of 1500, a height of 2700, and a nominal resolution of 300 (png(filename, width=5*300, height=9*300, res=300)).
Warning: PDF and other vectorized devices should not be used when plotting more than a hundred nodes. Large files will be generated which may cause image editing programs such as Inkscape or Illustrator to crash when polishing figures for publication.
When dryRun
is TRUE
only the axes, legends, labels, and
title will be drawn, allowing for quick iteration of customisable
parameters to get the plot layout correct.
If axis labels or legends are drawn off screen then the margins of the
plot should be adjusted prior to plotting using the
par
command to increase the margin size
(see the "mar" option in the par
help page).
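For instance, the margins can be widened and the layout checked cheaply before a full render (a sketch assuming the input lists set up as in the examples below):

par(mar = c(6, 6, 6, 6))   # widen all four margins
plotModule(
  network=network_list, data=data_list, correlation=correlation_list,
  moduleAssignments=labels_list, modules=1, dryRun=TRUE
)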
The size of text labels can be modified by increasing or decreasing the cex.main, cex.lab, and cex.axis arguments:

cex.main: controls the size of the plot title (specified in the main argument).

cex.lab: controls the size of the axis labels on the weighted degree, node contribution, and module summary bar plots, as well as the size of the module labels and the heatmap legend titles.

cex.axis: controls the size of the axis tick labels, including the node and sample labels.
The position of these labels can be changed through the following arguments:

xaxt.line: controls the distance from the plot that the x-axis tick labels are drawn on the module summary bar plot.

xlab.line: controls the distance from the plot that the x-axis label is drawn on the module summary bar plot.

yaxt.line: controls the distance from the plot that the y-axis tick labels are drawn on the weighted degree and node contribution bar plots.

ylab.line: controls the distance from the plot that the y-axis label is drawn on the weighted degree and node contribution bar plots.

main.line: controls the distance from the plot that the title is drawn.

naxt.line: controls the distance from the plot that the node labels are drawn.

saxt.line: controls the distance from the plot that the sample labels are drawn.

maxt.line: controls the distance from the plot that the module labels are drawn.

laxt.line: controls the distance from the heatmap legends that the gradient legend labels are drawn.

legend.main.line: controls the distance from the heatmap legends that the legend title is drawn.
The rendering of node, sample, and module names can be disabled by setting plotNodeNames, plotSampleNames, and plotModuleNames to FALSE.
The size of the axis ticks can be changed by increasing or decreasing the following arguments:

xaxt.tck: size of the x-axis ticks as a multiple of the height of the module summary bar plot.

yaxt.tck: size of the y-axis ticks as a multiple of the width of the weighted degree or node contribution bar plots.

laxt.tck: size of the heatmap legend axis ticks as a multiple of the width of the data, correlation structure, or network edge weight heatmaps.
The drawBorders
argument controls whether borders are drawn around
the weighted degree, node contribution, or module summary bar plots. The
lwd
argument controls the thickness of these borders, as well as
the thickness of axes and axis ticks.
The dataCols
and dataRange
arguments control the appearance
of the data heatmap (see plotData
). The gradient of colors
used on the heatmap can be changed by specifying a vector of colors to
interpolate between in dataCols
and dataRange
specifies the
range of values that maps to this gradient. Values outside of the
specified dataRange
will be rendered with the colors used at either
extreme of the gradient. The default gradient is determined based on the
data
shown on the plot. If all values in the data
matrix are
positive, then the gradient is interpolated between white and green, where
white is used for the smallest value and green for the largest. If all
values are negative, then the gradient is interpolated between purple and
white, where purple is used for the smallest value and white for the value
closest to zero. If the data contains both positive and negative values,
then the gradient is interpolated between purple, white, and green, where
white is used for values of zero. In this case the range shown is always
centered at zero, with the values at either extreme determined by the
value in the rendered data
with the strongest magnitude (the
maximum of the absolute value).
The corCols and corRange arguments control the appearance of the correlation structure heatmap (see plotCorrelation). The gradient of colors used on the heatmap can be changed by specifying a vector of colors to interpolate between in corCols. By default, strong negative correlations are shown in blue, strong positive correlations in red, and weak correlations as white. corRange controls the range of values that this gradient maps to, by default -1 to 1. Changing this may be useful for showing differences where the range of correlation coefficients is small.
The netCols
and netRange
arguments control the appearance of
the network edge weight heatmap (see plotNetwork
). The
gradient of colors used on the heatmap can be changed by specifying a
vector of colors to interpolate between in netCols
. By default,
weak or non-edges are shown in white, while strong edges are shown in red.
The netRange
controls the range of values this gradient maps to,
by default, 0 to 1. If netRange
is set to NA
, then the
gradient will be mapped to values between 0 and the maximum edge weight of
the shown network.
The degreeCol argument controls the color of the weighted degree bar plot (see plotDegree).
The contribCols argument controls the color of the node contribution bar plot (see plotContribution). This can be specified as a single value to be used for all nodes, or as two colors: one to use for nodes with positive contributions and one to use for nodes with negative contributions.
The summaryCols argument controls the color of the module summary bar plot (see plotSummary). This can be specified as a single value to be used for all samples, or as two colors: one to use for samples with a positive module summary value and one for samples with a negative module summary value.
The naCol argument controls the color of missing nodes and samples on the data, correlation structure, and network edge weight heatmaps.
The chunk option fig.keep="last"
should be set to avoid an empty
plot being embedded above the plot generated by plotModule
. This
empty plot is generated so that an error will be thrown as early as
possible if the margins are too small to be displayed. Normally, these
are drawn over with the actual plot components when drawing the plot on
other graphical devices.
plotCorrelation, plotNetwork, plotDegree, plotContribution, plotData, and plotSummary.
# load in example data, correlation, and network matrices for a discovery
# and test dataset:
data("NetRep")

# Set up input lists for each input matrix type across datasets. The list
# elements can have any names, so long as they are consistent between the
# inputs.
network_list <- list(discovery=discovery_network, test=test_network)
data_list <- list(discovery=discovery_data, test=test_data)
correlation_list <- list(discovery=discovery_correlation, test=test_correlation)
labels_list <- list(discovery=module_labels)

# Plot modules 1, 2 and 4 in the discovery dataset
plotModule(
  network=network_list, data=data_list, correlation=correlation_list,
  moduleAssignments=labels_list, modules=c(1, 2, 4)
)

# Now plot them in the test dataset (module 2 does not replicate)
plotModule(
  network=network_list, data=data_list, correlation=correlation_list,
  moduleAssignments=labels_list, modules=c(1, 2, 4),
  discovery="discovery", test="test"
)

# Plot modules 1 and 4, which replicate, in the test dataset, ordering nodes
# by weighted degree averaged across the two datasets
plotModule(
  network=network_list, data=data_list, correlation=correlation_list,
  moduleAssignments=labels_list, modules=c(1, 4),
  discovery="discovery", test="test", orderNodesBy=c("discovery", "test")
)
How many permutations do I need to test at my desired significance level?
requiredPerms(alpha, alternative = "greater")
alpha | desired significance threshold.
alternative | a character string specifying the alternative hypothesis; must be one of "greater" (default), "less", or "two.sided". You can specify just the initial letter.
The minimum number of permutations required to detect any significant associations at the provided alpha. The minimum p-value will always be smaller than alpha.
data("NetRep") # Set up input lists for each input matrix type across datasets. The list # elements can have any names, so long as they are consistent between the # inputs. network_list <- list(discovery=discovery_network, test=test_network) data_list <- list(discovery=discovery_data, test=test_data) correlation_list <- list(discovery=discovery_correlation, test=test_correlation) labels_list <- list(discovery=module_labels) # How many permutations are required to Bonferroni adjust for the 4 modules # in the example data? nPerm <- requiredPerms(0.05/4) # Note that we recommend running at least 10,000 permutations to make sure # that the null distributions are representative. preservation <- modulePreservation( network=network_list, data=data_list, correlation=correlation_list, moduleAssignments=labels_list, nPerm=nPerm, discovery="discovery", test="test" )
data("NetRep") # Set up input lists for each input matrix type across datasets. The list # elements can have any names, so long as they are consistent between the # inputs. network_list <- list(discovery=discovery_network, test=test_network) data_list <- list(discovery=discovery_data, test=test_data) correlation_list <- list(discovery=discovery_correlation, test=test_correlation) labels_list <- list(discovery=module_labels) # How many permutations are required to Bonferroni adjust for the 4 modules # in the example data? nPerm <- requiredPerms(0.05/4) # Note that we recommend running at least 10,000 permutations to make sure # that the null distributions are representative. preservation <- modulePreservation( network=network_list, data=data_list, correlation=correlation_list, moduleAssignments=labels_list, nPerm=nPerm, discovery="discovery", test="test" )