High-throughput interaction data from novel chromosome interaction assays has become a staple in genomics research. Methods such as high-throughput chromosome conformation capture (Hi-C) and chromatin interaction analysis by paired-end tag sequencing (ChIA-PET) provide researchers with a way of quantifying three-dimensional chromatin architecture, while also gaining insights into which regions in the genome are interacting frequently. A common downstream data-type from these experimental methods is chromatin loop data. Loops are inferred by marking regions in the genome with high frequency of interaction compared to a background. Typically, we are interested in loops because they provide an insulated environment for interaction of genomic regions, as well as a direct mode of contact for regions near the loop anchors. A classic canonical chromatin loop interaction is one that involves enhancer-promoter interactions in the regulation of gene expression. Therefore, a very common workflow involving chromatin loop data is the integration and concurrent analysis of genomic element data.
However, this poses two current challenges that are not readily solved by packages available in CRAN or Bioconductor:
Genomic interaction data involves two sets of coordinates at the same time, and current widely used tools such as the GenomicRanges package do not have readily available functions to handle this data-type.
Typical analysis workflows involving chromatin loop and genomic element data will incorporate multiple looping and element datasets at a time. This introduces another challenge for available software.
LoopRig leverages the GenomicRanges and IRanges packages to standardize workflows utilizing chromatin loop and element data. This vignette covers all functional aspects of LoopRig by going over a workflow of a canonical mechanism of genomic interaction - enhancer <-> promoter interactions.
Tab-delimited element data files in BED3..n format and loop data
files in BEDPE format are required, meaning that the data files cannot
have headers. Element data files can have any numbers of extra metadata
columns apart from the required 3 columns (chr, start, end). Similarly,
BEDPE interaction files have no limit on their metadata columns but must
contain the required 6 columns at a minimum (chr1, start1, end1, chr2,
start2, end2). The number of extra columns must be specified when
calling the LoopsToRanges()
and
ElementsToRanges()
using the custom_cols
parameter. Additionally, metadata from the BED files can be attached to
the GRanges objects to be created by specifying which BED file
columns to include using the custom_mcols
parameter. The
example files included with this package comprise of three looping
datasets with no extra columns, and two element datasets with one extra
column.
LoopRig has two central S3 classes of objects - LoopRanges
objects for storing loop data, and ElementRanges objects for
storing element data. Each class comprises of lists of S4
GRanges objects from the GenomicRanges package. Multiple
element and looping data can be stored into both object types using the
LoopsToRanges()
and ElementsToRanges()
functions. The following examples go over creating these data objects
using example chromatin loop data of three datasets (ovary, pancreas,
spleen) from Salameh et al. [1], and enhancer and promoter element data
from the Pan-Cancer Analysis of Whole Genomes Consortium (PCAWG) [2].
LoopRanges objects can be created using
LoopsToRanges()
:
library(LoopRig)
# Load example files for chromatin loop data. These are BEDPE files with no extra columns.
ovary_loops <- system.file("extdata/loops", "ovary_hg19.bedpe", package = "LoopRig", mustWork = TRUE)
pancreas_loops <- system.file("extdata/loops", "pancreas_hg19.bedpe", package = "LoopRig", mustWork = TRUE)
spleen_loops <- system.file("extdata/loops", "spleen_hg19.bedpe", package = "LoopRig", mustWork = TRUE)
# Call LoopsToRanges() on all files at once. Since there are custom columns, we indicate this by custom_cols = 0.
loops <- LoopsToRanges(ovary_loops, pancreas_loops, spleen_loops, custom_cols = 0, loop_names = c("ovary", "pancreas", "spleen"))
# View LoopRanges object for first loop dataset (ovary).
head(loops, 1)
#> $ovary
#> GRangesList object of length 2:
#> $`Anchor 1`
#> GRanges object with 5648 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr5 14140109-14150109 *
#> [2] chr5 29850107-29860107 *
#> [3] chr5 158137008-158147008 *
#> [4] chr5 158197008-158207008 *
#> [5] chr5 158167008-158177008 *
#> ... ... ... ...
#> [5644] chr4 78621154-78631154 *
#> [5645] chr4 33601622-33611622 *
#> [5646] chr4 41452017-41462017 *
#> [5647] chr4 99321151-99331151 *
#> [5648] chr4 169501151-169511151 *
#> -------
#> seqinfo: 23 sequences from an unspecified genome; no seqlengths
#>
#> $`Anchor 2`
#> GRanges object with 5648 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr5 14270109-14280109 *
#> [2] chr5 29940107-29950107 *
#> [3] chr5 158507008-158517008 *
#> [4] chr5 158507008-158517008 *
#> [5] chr5 158447008-158457008 *
#> ... ... ... ...
#> [5644] chr4 78771154-78781154 *
#> [5645] chr4 33771622-33781622 *
#> [5646] chr4 41622017-41632017 *
#> [5647] chr4 99551151-99561151 *
#> [5648] chr4 169741151-169751151 *
#> -------
#> seqinfo: 23 sequences from an unspecified genome; no seqlengths
# We can see that each element in the LoopRanges list contains the loops from a given dataset, in this case the ovarian loops.
# We can confirm the class of our object by using the class() function.
class(loops)
#> [1] "LoopRanges"
Each chromatin loop data set is stored as a GRangesList
object within the LoopRanges object. Similarly for elements
using ElementsToRanges()
:
# Load example files for genomic element data. These are BED4 files, indicating they have one extra column for metadata.
enhancers <- system.file("extdata/elements", "enhancers.bed", package = "LoopRig", mustWork = TRUE)
promoters <- system.file("extdata/elements", "promoters.bed", package = "LoopRig", mustWork = TRUE)
# Call ElementsToRanges() on all files at once. We must specify that there is one extra column by custom_cols = 1. We want to use this column as metadata for our object, so we indicate this by custom_mcols = 4, meaning that the metadata columns for our ElementRanges object will be from column 4 of the original BED files.
element_ranges <- ElementsToRanges(enhancers, promoters, element_names = c("enhancers", "promoters"), custom_cols = 1, custom_mcols = 4)
# View ElementRanges object for first element type (enhancers).
head(element_ranges, 1)
#> $enhancers
#> GRanges object with 1000 ranges and 1 metadata column:
#> seqnames ranges strand | mcols
#> <Rle> <IRanges> <Rle> | <character>
#> [1] chr2 48752479-48807772 * | E573
#> [2] chrX 9000916-9002368 * | E1383
#> [3] chr4 84406018-84406908 * | E6724
#> [4] chr15 90773255-90784140 * | E3155
#> [5] chr1 112525348-112535556 * | E8752
#> ... ... ... ... . ...
#> [996] chr12 50781274-50836617 * | E9077
#> [997] chr6 150067458-150068014 * | E1642
#> [998] chr17 34263213-34274392 * | E6285
#> [999] chr3 192954838-192959007 * | E5603
#> [1000] chr1 204432559-204468852 * | E281
#> -------
#> seqinfo: 25 sequences from an unspecified genome; no seqlengths
# As we can see, we have metadata for the GRanges object that stores element information.
# Confirm object class.
class(element_ranges)
#> [1] "ElementRanges"
Similar to LoopRanges, ElementRanges objects comprise of a list of GRanges objects storing genomic element coordinates. These two objects can easily be subset, and users experienced in the GenomicRanges package can use the wealth of functions it provides. However, this is not advised unless there is a non-trivial analysis that must be performed which LoopRig does not address.
There’s a wealth of publicly available chromatin loop data, and one
of the most important challenges is extracting high-confidence
interactions. LoopRig provides a function,
ConsensusLoops()
, which can be used on objects of
LoopRanges class. This function has a number of options and
consensus can be determined using a variety of methods. The most basic
methods comprises of determining which loops in one dataset overlap with
those in the other dataset, and those that meet the
stringency
threshold parameter are considered consensus
loops. Another metric central to ConsensusLoops()
and most
other functions in LoopRig is the overlap_threshold
parameter, which defines how many nucleotide positions must be
overlapping between two loci to constitute a ‘hit’. In the case of
ConsensusLoops()
, this indicates the threshold for
nucleotide overlap of both chromatin loop anchors for loops to be
considered overlapping with others across datasets. To determine a set
of consensus loops from our LoopRanges object created earlier,
we can call ConsensusLoops()
with minimal parameters:
# Call the ConsensusLoops() function. Stringency indicates how many datasets a loop must be present in to be considered a consensus loop, and the overlap threshold (nucleotides) defines a *hit* across datasets. In this case, a loop must be present in at least 2 datasets based on overlap with another of 1 nucleotide.
consensus_loops <- ConsensusLoops(loop_ranges = loops, stringency = 2, overlap_threshold = 1)
# View consensus_loops.
head(consensus_loops, 1)
#> $Consensus
#> GRangesList object of length 2:
#> $`Anchor 1`
#> GRanges object with 68 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr11 120530709-120540709 *
#> [2] chr7 95139312-95149312 *
#> [3] chr20 36698402-36708402 *
#> [4] chr17 17773314-17783314 *
#> [5] chr8 121702240-121712240 *
#> ... ... ... ...
#> [64] chr4 185251153-185261153 *
#> [65] chr4 26741622-26751622 *
#> [66] chr12 64943780-64953780 *
#> [67] chr11 111500724-111510724 *
#> [68] chr7 121420054-121430054 *
#> -------
#> seqinfo: 23 sequences from an unspecified genome; no seqlengths
#>
#> $`Anchor 2`
#> GRanges object with 68 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr11 120620709-120630709 *
#> [2] chr7 95309312-95319312 *
#> [3] chr20 36788402-36798402 *
#> [4] chr17 17933314-17943314 *
#> [5] chr8 121812240-121822240 *
#> ... ... ... ...
#> [64] chr4 185371154-185381154 *
#> [65] chr4 26831622-26841622 *
#> [66] chr12 65083780-65093780 *
#> [67] chr11 111630724-111640724 *
#> [68] chr7 121510054-121520054 *
#> -------
#> seqinfo: 23 sequences from an unspecified genome; no seqlengths
# As we can see, we still have a LoopRanges object, but it only comprises of one GRangesList object containing the consensus loop data. We can double-check this using the length() function.
length(loops)
#> [1] 3
length(consensus_loops)
#> [1] 1
# We can further check class to ensure LoopRanges object is intact.
class(consensus_loops)
#> [1] "LoopRanges"
The ConsensusLoops()
function still returns an object of
LoopRanges class, which can be used in many subsequent
functions for analysis. The length of this LoopRanges object is
1, as it now only contains one dataset comprising of subset loops based
on consensus.
LoopRig also provides another function for manipulating loop data in
LoopRanges form - DropLoops()
. This function
allows LoopRanges objects to be subset before or calling
ConsensusLoops()
(i.e. it can work with LoopRanges
objects of any length except 0). Loops can be subset based on anchor
size or total loop size, which is specified using the type
parameter. We can consider an example of subsetting our consensus loops
object for loop sizes that are between 100-200 Kb:
# Call DropLoops() and indicate the 'type' parameter to subset loops by loop size and use the 'size' parameter to specify lengths to keep (0-10 Kb).
consensus_loops_dropped <- DropLoops(loop_ranges = consensus_loops, type = "loop_size", size = c(100000, 200000))
# We can view the change by determining how many loops are in each object. The first index must be specified because we are still using LoopRanges list objects, and we must index an anchor to determine a count.
length(consensus_loops[[1]][[1]])
#> [1] 68
length(consensus_loops_dropped[[1]][[1]])
#> [1] 41
# Our total number of loops reduced from 68 to 41, and these 41 loops all have total loop sizes (anchor end-to-end) between 100-200 Kb.
# Recheck class of subset object.
class(consensus_loops_dropped)
#> [1] "LoopRanges"
Similar to ConsensusLoops()
, DropLoops()
returns object of LoopRanges class, which can be subset
repeatedly using these two functions as much as necessary.
LoopRig provides three functions for linking elements datasets based
on chromatin loops: LinkedElements()
,
StackedElements()
, and ScaffoldElements()
.
Each of these functions links elements based on chromatin loop overlap
in different ways, but they all share the same parameters: input of a
LoopRanges object of length 1 for the loop_ranges
parameter (i.e. must be subset or have ConsensusLoops()
called on it), input of two GRanges element objects for the
element_ranges_x
and element_ranges_y
parameters (subset ElementRanges object using list indexing),
an overlap_threshold
parameter to determine degree of
nucleotide overlap that counts as a hit, and range_out_x
and range_out_y
parameters which can be used to override
the default dataframe outputs and instead output the subset elements as
ElementRanges objects.
LinkedElements()
The LinkedElements()
function considers which elements
from two datasets are linked by chromatin loop anchors. Based on input
chromatin loops, if an element from one set coincides with a loop anchor
and an element from the other set coincides with the opposite loop
anchor, those two elements are considered linked. We can determine if
any if our enhancers and promoters are linked in this manner using the
consensus set of chromatin loops:
# Call LinkedElements() for the promoter and enhancer ranges in element_ranges and use the default output parameters (dataframe). We must subset the ElementRanges object we created initially, as it has both the enhancer (index 1) and promoter (index 2) data in one object.
linked_elements <- LinkedElements(loop_ranges = consensus_loops, element_ranges_x = element_ranges[[1]], element_ranges_y = element_ranges[[2]])
linked_elements
#> Anchor_1 Anchor_2
#> 1 E3762 KSR1
# Using our consensus set of loops, enhancer data, and promoter data, we find that the only linked elements are Enhancer 3762 and the promoter corresponding to the KSR1 gene.
# We can do the exact same analysis, but this time output the promoters subset by this linkage instead of the dataframe indicating the links (i.e. the range corresponding to the KSR1 promoter).
linked_elements_promoters <- LinkedElements(loop_ranges = consensus_loops, element_ranges_x = element_ranges[[1]], element_ranges_y = element_ranges[[2]], range_out_y = TRUE)
linked_elements_promoters
#> $el_y_linked
#> GRanges object with 1 range and 1 metadata column:
#> seqnames ranges strand | mcols
#> <Rle> <IRanges> <Rle> | <character>
#> [1] chr17 25783469-25951167 * | KSR1
#> -------
#> seqinfo: 25 sequences from an unspecified genome; no seqlengths
#>
#> attr(,"class")
#> [1] "ElementRanges"
class(linked_elements_promoters)
#> [1] "ElementRanges"
# As we can see, the output this time is an ElementRanges object containing the promoter data.
StackedElements()
The StackedElements()
function determines which elements
are both overlapping with each other and with a chromatin loop anchor.
Therefore, it considers which elements are ‘stacked’ on top of loop
anchors. Using our enhancer and promoter element data and the consensus
set of loops determined previously, we can determine if any of these
elements are stacked on loop anchors:
# Call StackedElements() for the promoter and enhancer data by subsetting the element_ranges object and using the consensus_loops object for the looping data. The default output will be a dataframe object.
stacked_elements <- StackedElements(loop_ranges = consensus_loops, element_ranges_x = element_ranges[[1]], element_ranges_y = element_ranges[[2]])
stacked_elements
#> Element_1 Element_2
#> 1 E2564 RAPGEF4
#> 2 E3762 KSR1
#> 4 E5878 TGM2
# From here, we can see that we have three instances of stacked promoters and enhancers on our consensus loop anchors.
# We can output which enhancers correspond to these stacks by using the range_out_x parameter. This will output these enhancers as an ElementRanges object.
stacked_elements_enhancers <- StackedElements(loop_ranges = consensus_loops, element_ranges_x = element_ranges[[1]], element_ranges_y = element_ranges[[2]], range_out_x = TRUE)
stacked_elements_enhancers
#> $el_x_stacked
#> GRanges object with 3 ranges and 1 metadata column:
#> seqnames ranges strand | mcols
#> <Rle> <IRanges> <Rle> | <character>
#> [1] chr2 173600324-173896190 * | E2564
#> [2] chr17 25783469-25951167 * | E3762
#> [3] chr20 36788672-36799980 * | E5878
#> -------
#> seqinfo: 25 sequences from an unspecified genome; no seqlengths
#>
#> attr(,"class")
#> [1] "ElementRanges"
class(stacked_elements_enhancers)
#> [1] "ElementRanges"
ScaffoldElements()
The least intuitive of the linkage functions is
ScaffoldElements()
. This function aims to determine a
subset of the input loops which have anchors overlapping with the first
element set (element_ranges_x
), and then determines which
elements in the second set (element_ranges_y
) overlap with
those subset loops. The ‘links’ that are returned indicate the loop
number, element from the first set that ‘scaffolds’ that loop, and
element in the second set that overlaps with that loop. This analysis is
particularly useful when considering elements that may stabilize loops,
such as CTCF sites. For demonstration purposes, we will consider our
enhancers to be the scaffolds and determine which promoters are linked
to the scaffolded loops using our consensus loop set:
# Call ScaffoldElements() for the promoter and enhancer data using indexing and use the consensus_loops object for the loop data. Use default options so the output is a dataframe object.
scaffold_elements <- ScaffoldElements(loop_ranges = consensus_loops, element_ranges_x = element_ranges[[1]], element_ranges_y = element_ranges[[2]])
scaffold_elements
#> Element_1_Scaffold Element_2_Link
#> 1 E5878 TGM2
#> 2 E2564 RAPGEF4
#> 3 E3762 KSR1
# Similar to LinkedElements() and StackedElements(), we can output either the linked enhancers or promoters as ElementRanges objects.
# Let's output the promoters using the range_out_y parameter.
scaffold_elements_promoters <- ScaffoldElements(loop_ranges = consensus_loops, element_ranges_x = element_ranges[[1]], element_ranges_y = element_ranges[[2]], range_out_y = TRUE)
scaffold_elements_promoters
#> $el_y_scaffold_linked
#> GRanges object with 3 ranges and 1 metadata column:
#> seqnames ranges strand | mcols
#> <Rle> <IRanges> <Rle> | <character>
#> [1] chr17 25783469-25951167 * | KSR1
#> [2] chr20 36788672-36799980 * | TGM2
#> [3] chr2 173600324-173896190 * | RAPGEF4
#> -------
#> seqinfo: 25 sequences from an unspecified genome; no seqlengths
#>
#> attr(,"class")
#> [1] "ElementRanges"
class(scaffold_elements_promoters)
#> [1] "ElementRanges"
Once all relevant analysis has been done, LoopRanges and
ElementRanges objects can be exported into BEDPE and BED files
respectively using the ExportBED()
function. The index of
the object must be specified using the index
parameter, and
the first metadata column of the object can be optionally appended to
the exported BED file by indicating mcol = TRUE
. For
example, we can export our promoters linked to enhancers determined by
the LinkedElements()
function:
Salameh TJ, Wang X, Song F, Zhang B, Wright SM. A supervised learning framework for chromatin loop detection in genome-wide contact maps. bioRxiv. 2019;1–25.
Rheinbay E, Nielsen MM, Abascal F, Tiao G, Hornshøj H, Hess JM, et al. Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes. bioRxiv 2017;237313.