gprofiler2 provides an R interface to the widely used web toolset g:Profiler (https://biit.cs.ut.ee/gprofiler) [1].
The toolset performs functional enrichment analysis and visualization of gene lists, converts gene/protein/SNP identifiers to numerous namespaces, and maps orthologous genes across species. g:Profiler relies on Ensembl databases as the primary data source and follows their release cycle for updates.
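The main tools in g:Profiler are g:GOSt (functional enrichment analysis of gene lists), g:Convert (identifier conversion across namespaces), g:Orth (orthology search across species) and g:SNPense (mapping of SNP rs-IDs to chromosome positions, genes and variant effects).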
The input for any of the tools can consist of mixed types of gene identifiers, SNP rs-IDs, chromosomal intervals or term IDs. The gene IDs from chromosomal regions are retrieved automatically; a gene does not need to lie entirely within the region. The format for chromosomal regions is chr:region_start:region_end, e.g. X:1:2000000. For term IDs like GO:0007507 (heart development), g:Profiler uses all the genes annotated to that term as the input (in this case about six hundred human genes associated with heart development). Fully numeric identifiers need to be prefixed with the corresponding namespace. g:Profiler automatically prefixes all detected numeric IDs using the prefix determined by the selected numeric namespace parameter.
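As a small illustration of the chr:region_start:region_end format, the following sketch builds region strings with base R. The helper name make_region is hypothetical and not part of gprofiler2; integer formatting is used so that large coordinates are not written in scientific notation.

```r
# Hypothetical helper (not part of gprofiler2): build "chr:start:end" strings.
make_region <- function(chr, start, end) {
  stopifnot(all(start <= end))
  # %d with integers avoids scientific notation such as "2e+06"
  sprintf("%s:%d:%d", chr, as.integer(start), as.integer(end))
}

regions <- make_region(c("X", "Y"), c(1, 1), c(2000000, 10000000))
regions
```

The resulting strings can be mixed with gene identifiers in the same query vector.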
Corresponding functions in the gprofiler2 R package are gost, gconvert, gorth and gsnpense.
gprofiler2 uses the publicly available APIs of the g:Profiler web tool, which ensures that the results from all of the interfaces are consistent.
The package corresponds to the 2019 update of g:Profiler and provides access for versions e94_eg41_p11 and higher. The older versions are available from the previous R package gProfileR.
gost performs functional profiling of gene
lists. The function performs statistical enrichment analysis to find
over-representation of functions from Gene Ontology, biological pathways
like KEGG and Reactome, human disease annotations, etc. This is done
with the hypergeometric test followed by correction for multiple
testing.
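The hypergeometric over-representation test can be sketched with base R. The counts below are purely illustrative assumptions, and the p-value is unadjusted, so it would not match the multiple-testing-corrected p_value that gost reports.

```r
# Illustrative counts (assumptions, not g:Profiler output)
domain_size       <- 20000  # genes in the statistical domain
term_size         <- 50     # genes annotated to the term
query_size        <- 60     # genes in the query
intersection_size <- 10     # query genes annotated to the term

# P(X >= intersection_size) when drawing query_size genes without replacement
p_raw <- phyper(q = intersection_size - 1,
                m = term_size,
                n = domain_size - term_size,
                k = query_size,
                lower.tail = FALSE)
p_raw
```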
A standard input of the gost function is a (named) list
of gene identifiers. The list can consist of mixed types of identifiers
(proteins, transcripts, microarray IDs, etc.), SNP IDs, chromosomal
intervals or functional term IDs.
The parameter organism defines the
source organism for the gene list. The organism names are
usually constructed by concatenating the first letter of the genus name and
the species name, e.g. human (Homo sapiens) is hsapiens. If some of the input
gene identifiers are fully numeric, the parameter
numeric_ns defines the corresponding namespace.
See section Supported
organisms and identifier namespaces for links to supported organisms
and namespaces.
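The usual naming convention can be sketched as follows. The helper organism_id is hypothetical; since the names are only usually built this way, the result should be checked against the list of supported organisms.

```r
# Hypothetical helper: first letter of the first word + second word, lower case
organism_id <- function(scientific_name) {
  parts <- strsplit(scientific_name, " ", fixed = TRUE)
  vapply(parts,
         function(p) tolower(paste0(substr(p[1], 1, 1), p[2])),
         character(1))
}

ids <- organism_id(c("Homo sapiens", "Mus musculus"))
ids
```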
If the input genes are decreasingly ordered based on some biological
importance, then ordered_query = TRUE will take this into
account. For instance, the genes can be ordered according to
differential expression or absolute expression values. In this case,
incremental enrichment testing is performed with increasingly larger
numbers of genes starting from the top of the list. Note that with this
parameter, the query size might be different for every functional
term.
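The idea of incremental enrichment testing can be sketched in base R as follows. This is a simplified illustration, not g:Profiler's implementation: the gene names, the term annotation and the domain size are made up, and the p-values are unadjusted.

```r
# Toy data (assumptions): a ranked gene list and one functional term
ranked_genes <- paste0("g", 1:20)   # most important gene first
term_genes   <- c("g1", "g2", "g3", "g5", "g40", "g41")
domain_size  <- 1000

# Hypergeometric p-value for every prefix (top 1, top 2, ..., top 20 genes)
prefix_p <- vapply(seq_along(ranked_genes), function(k) {
  hits <- sum(ranked_genes[seq_len(k)] %in% term_genes)
  phyper(hits - 1, length(term_genes), domain_size - length(term_genes), k,
         lower.tail = FALSE)
}, numeric(1))

# Prefix length that gives the strongest enrichment for this term
best_k <- which.min(prefix_p)
c(query_size = best_k, p_value = prefix_p[best_k])
```

In this toy case the best prefix is shorter than the full list, which is why the query size can differ from term to term.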
The parameter significant indicates whether all results
(significant = FALSE) or only the statistically significant results (significant = TRUE) should be returned.
For Gene Ontology (GO), exclude_iea = TRUE
excludes the electronic GO annotations from the data source before
testing. These are the annotations with the IEA evidence code, which indicates that
they were assigned to genes using in silico curation
methods.
In order to measure under-representation instead of
over-representation set
measure_underrepresentation = TRUE.
By default, user_threshold = 0.05, which defines a
custom p-value significance threshold for the results. Results with
a smaller p-value are tagged as significant. We do not recommend setting it
higher than 0.05.
In order to reduce the amount of false positives, a multiple
testing correction method is applied to the enrichment p-values. By
default, our tailor-made algorithm g:SCS is used
(correction_method = "gSCS" with synonyms
g_SCS and analytical), but there are also
options to apply the Bonferroni correction
(correction_method = "bonferroni") or FDR
(correction_method = "fdr"). The adjusted p-values are
reported in the results.
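For comparison, the Bonferroni and FDR corrections can be reproduced on a vector of raw p-values with p.adjust from base R. This sketch assumes that FDR refers to the Benjamini-Hochberg procedure; the raw p-values are made up. g:SCS is specific to g:Profiler and has no base R equivalent, so it is not shown.

```r
# Made-up raw p-values for five terms
p_raw <- c(0.0001, 0.004, 0.02, 0.03, 0.2)

adjusted <- data.frame(
  p_raw      = p_raw,
  bonferroni = p.adjust(p_raw, method = "bonferroni"),
  fdr        = p.adjust(p_raw, method = "BH")  # assumption: FDR = Benjamini-Hochberg
)

# Tag results in the same way as user_threshold = 0.05 would
adjusted$significant_bonferroni <- adjusted$bonferroni < 0.05
adjusted
```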
The parameter domain_scope defines how the statistical
domain size is calculated. This is one of the parameters in the
hypergeometric probability function. If
domain_scope = "annotated" then only the genes with at
least one annotation are considered to be part of the full domain. If
domain_scope = "known" then all the genes of the
given organism are considered to be part of the domain.
Depending on the research question, it is sometimes advisable
to limit the domain/background set. For example, one may use a custom
background to compare a gene list with a custom list of
expressed genes. gost provides the means to define a custom
background as a (mixed) vector of gene identifiers with the parameter
custom_bg. If this parameter is used, then the domain scope
is set to domain_scope = "custom". It is also possible to
set this parameter to domain_scope = "custom_annotated"
which will use the set of genes that are annotated in the data source
and are also included in the user provided background list.
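How the domain choices relate to each other can be illustrated with plain set operations. The gene sets below are invented for the example, and the sketch only mirrors the description above rather than g:Profiler's server-side implementation.

```r
# Invented gene sets (assumptions)
known_genes     <- paste0("gene", 1:10)  # all genes of the organism
annotated_genes <- paste0("gene", 1:6)   # genes with at least one annotation
custom_bg       <- paste0("gene", 4:9)   # user-provided background

domains <- list(
  known            = known_genes,
  annotated        = annotated_genes,
  custom           = custom_bg,
  custom_annotated = intersect(custom_bg, annotated_genes)
)

# Domain size under each domain_scope setting
lengths(domains)
```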
The parameter sources selects the data sources
of interest. By default, all the sources are analysed. The available
data sources and their abbreviations are listed under section Data
sources. For example, if sources = c("GO:MF", "REAC") then
only the results from the molecular function branch of Gene Ontology and
the pathways from Reactome are returned. One can also upload their own
annotation data, which is further described in the section Custom data sources
with upload_GMT_file.
The parameter highlight adds a logical (TRUE/FALSE)
column called “highlighted” to the analysis results that
marks the
driver terms in GO. This option works starting from Ensembl version
108 (e108). The option doesn’t work with custom GMT files.
library(gprofiler2)
gostres <- gost(query = c("X:1000:1000000", "rs17396340", "GO:0005005", "ENSG00000156103", "NLRP1"),
                organism = "hsapiens", ordered_query = FALSE,
                multi_query = FALSE, significant = TRUE, exclude_iea = FALSE,
                measure_underrepresentation = FALSE, evcodes = FALSE,
                user_threshold = 0.05, correction_method = "g_SCS",
                domain_scope = "annotated", custom_bg = NULL,
                numeric_ns = "", sources = NULL, as_short_link = FALSE, highlight = TRUE)
The result is a named list where
“result” is a data.frame with the
enrichment analysis results and “meta” is a
named list with all the metadata for the query.
head(gostres$result, 3)
#> query significant p_value term_size query_size intersection_size
#> 1 query_1 TRUE 2.490324e-02 3 16 2
#> 2 query_1 TRUE 4.966900e-02 4 16 2
#> 3 query_1 TRUE 6.988992e-139 50 59 50
#> precision recall term_id source
#> 1 0.1250000 0.6666667 CORUM:6586 CORUM
#> 2 0.1250000 0.5000000 CORUM:1185 CORUM
#> 3 0.8474576 1.0000000 GO:0036323 GO:BP
#> term_name
#> 1 VEcad-VEGFR complex
#> 2 EGFR-containing signaling complex
#> 3 vascular endothelial growth factor receptor-1 signaling pathway
#> effective_domain_size source_order parents highlighted
#> 1 3383 2236 CORUM:0000000 FALSE
#> 2 3383 662 CORUM:0000000 FALSE
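#> 3 20972 8941 GO:0048010 TRUE
The result data.frame contains the following
columns:
- query: the name of the input query, by default the order of the query with the prefix “query_”; this can be changed by using a named list input
- significant: indicator for statistically significant results
- p_value: hypergeometric p-value after correction for multiple testing
- term_size: number of genes annotated to the term
- query_size: number of genes included in the query; this might differ from the size of the original list, for example if ordered_query = TRUE and the optimal cutoff for the term was found before the end of the query, or if domain_scope was set to “annotated” or “custom”
- intersection_size: number of genes in the input query that are annotated to the term
- precision: intersection_size / query_size
- recall: intersection_size / term_size
- term_id: unique term identifier (e.g. GO:0036323)
- source: abbreviation of the data source of the term (e.g. GO:BP)
- term_name: short name of the term
- effective_domain_size: total number of genes in the domain used in the hypergeometric test
- source_order: numeric order of the term within its data source; this is used for drawing the results with gostplot (see below)
- parents: term IDs that are hierarchically directly above the term
- highlighted: TRUE/FALSE indicator of driver terms in GO (present when highlight = TRUE)
names(gostres$meta)
#> [1] "query_metadata" "result_metadata" "genes_metadata" "timestamp"
#> [5] "version"
The query parameters are listed in the “query_metadata” part of the metadata object. The “result_metadata” includes the statistics of the data sources that are used in the enrichment testing: the “domain_size” showing the number of genes annotated to this domain, the “number_of_terms” indicating the number of terms g:Profiler has in the database for this source, and the nominal significance “threshold” for this source. The “genes_metadata” shows the specifics of the query genes (failed, ambiguous or duplicate inputs) and their mappings to the ENSG namespace. In addition, the query time and the g:Profiler data version used are shown in the metadata.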
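As a quick consistency check, the precision and recall columns of the result can be recomputed from the size columns. The sketch below uses a small data.frame copied by hand from the first two rows of the output above instead of calling gost.

```r
# Values copied from the first two result rows shown above
res <- data.frame(
  term_id           = c("CORUM:6586", "CORUM:1185"),
  term_size         = c(3, 4),
  query_size        = c(16, 16),
  intersection_size = c(2, 2)
)

# Share of query genes annotated to the term
res$precision <- res$intersection_size / res$query_size
# Share of term genes recovered by the query
res$recall    <- res$intersection_size / res$term_size
res
```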
The parameter evcodes = TRUE adds the evidence codes
to the results. In addition, a column “intersection”
is added to the results showing the input gene IDs that intersect
with the corresponding functional term. Note that this parameter can
decrease the performance and make the query slower.
gostres2 <- gost(query = c("X:1000:1000000", "rs17396340", "GO:0005005", "ENSG00000156103", "NLRP1"),
                 organism = "hsapiens", ordered_query = FALSE,
                 multi_query = FALSE, significant = TRUE, exclude_iea = FALSE,
                 measure_underrepresentation = FALSE, evcodes = TRUE,
                 user_threshold = 0.05, correction_method = "g_SCS",
                 domain_scope = "annotated", custom_bg = NULL,
                 numeric_ns = "", sources = NULL, highlight = TRUE)
head(gostres2$result, 3)
#> query significant p_value term_size query_size intersection_size
#> 1 query_1 TRUE 2.490324e-02 3 16 2
#> 2 query_1 TRUE 4.966900e-02 4 16 2
#> 3 query_1 TRUE 6.988992e-139 50 59 50
#> precision recall term_id source
#> 1 0.1250000 0.6666667 CORUM:6586 CORUM
#> 2 0.1250000 0.5000000 CORUM:1185 CORUM
#> 3 0.8474576 1.0000000 GO:0036323 GO:BP
#> term_name
#> 1 VEcad-VEGFR complex
#> 2 EGFR-containing signaling complex
#> 3 vascular endothelial growth factor receptor-1 signaling pathway
#> effective_domain_size source_order parents
#> 1 3383 2236 CORUM:0000000
#> 2 3383 662 CORUM:0000000
#> 3 20972 8941 GO:0048010
#> evidence_codes
#> 1 CORUM,CORUM
#> 2 CORUM,CORUM
#> 3 IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IDA IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA,IEA
#> intersection
#> 1 ENSG00000037280,ENSG00000128052
#> 2 ENSG00000141736,ENSG00000146648
#> 3 ENSG00000027644,ENSG00000030304,ENSG00000037280,ENSG00000044524,ENSG00000047936,ENSG00000062524,ENSG00000065361,ENSG00000066056,ENSG00000066468,ENSG00000068078,ENSG00000070886,ENSG00000077782,ENSG00000080224,ENSG00000092445,ENSG00000102755,ENSG00000105976,ENSG00000113721,ENSG00000116106,ENSG00000120156,ENSG00000122025,ENSG00000128052,ENSG00000133216,ENSG00000134853,ENSG00000135333,ENSG00000140443,ENSG00000140538,ENSG00000141736,ENSG00000142627,ENSG00000145242,ENSG00000146648,ENSG00000146904,ENSG00000148053,ENSG00000153208,ENSG00000154928,ENSG00000157404,ENSG00000160867,ENSG00000162733,ENSG00000164078,ENSG00000165731,ENSG00000167601,ENSG00000169071,ENSG00000171094,ENSG00000171105,ENSG00000178568,ENSG00000182578,ENSG00000182580,ENSG00000183317,ENSG00000196411,ENSG00000198400,ENSG00000204580
#> highlighted
#> 1 FALSE
#> 2 FALSE
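#> 3 TRUE
The result data.frame will include the additional
columns evidence_codes (the evidence codes of the annotations, separated by commas) and intersection (a comma-separated list of the input gene IDs that intersect with the term).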
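Because the intersection column holds comma-separated strings, a split is usually needed before further processing. A minimal sketch on values copied from the output above:

```r
# Intersection strings of the first two terms, copied from the output above
intersection <- c("CORUM:6586" = "ENSG00000037280,ENSG00000128052",
                  "CORUM:1185" = "ENSG00000141736,ENSG00000146648")

# One character vector of gene IDs per term
genes_per_term <- strsplit(intersection, ",", fixed = TRUE)
lengths(genes_per_term)
```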
The query results can also be gathered into a short-link to the
g:Profiler web tool. For that, set the parameter
as_short_link = TRUE. In this case, the function
gost() returns only the web tool link to the results as a
character string. This is useful, for example, when you discover an
interesting result that you want to share with your colleagues right away:
you can generate the short-link programmatically and send it
to them.
gostres_link <- gost(query = c("X:1000:1000000", "rs17396340", "GO:0005005", "ENSG00000156103", "NLRP1"),
                     as_short_link = TRUE)
This query returns a short-link of the form https://biit.cs.ut.ee/gplink/l/HfapQyB5TJ.
The function gost can also perform enrichment analysis on
multiple input gene lists. Multiple queries are automatically detected
if the input query is a list of vectors with
gene identifiers, and the results are combined into the same kind of
data.frame as for a single query.
multi_gostres1 <- gost(query = list("chromX" = c("X:1000:1000000", "rs17396340",
                                                 "GO:0005005", "ENSG00000156103", "NLRP1"),
                                    "chromY" = c("Y:1:10000000", "rs17396340",
                                                 "GO:0005005", "ENSG00000156103", "NLRP1")),
                       multi_query = FALSE)
head(multi_gostres1$result, 3)
#> query significant p_value term_size query_size intersection_size
#> 1 chromX TRUE 2.490324e-02 3 16 2
#> 2 chromX TRUE 4.966900e-02 4 16 2
#> 3 chromX TRUE 6.988992e-139 50 59 50
#> precision recall term_id source
#> 1 0.1250000 0.6666667 CORUM:6586 CORUM
#> 2 0.1250000 0.5000000 CORUM:1185 CORUM
#> 3 0.8474576 1.0000000 GO:0036323 GO:BP
#> term_name
#> 1 VEcad-VEGFR complex
#> 2 EGFR-containing signaling complex
#> 3 vascular endothelial growth factor receptor-1 signaling pathway
#> effective_domain_size source_order parents
#> 1 3383 2236 CORUM:0000000
#> 2 3383 662 CORUM:0000000
#> 3 20972 8941 GO:0048010
The column “query” in the result
data.frame will now contain the corresponding name for the
query. If no name is specified, then the query name is defined as the
order of the query with the prefix “query_”.
Another option for multiple gene lists is setting the parameter
multi_query = TRUE. Then the results from all of the input
queries are grouped according to term IDs for better comparison.
multi_gostres2 <- gost(query = list("chromX" = c("X:1000:1000000", "rs17396340",
                                                 "GO:0005005", "ENSG00000156103", "NLRP1"),
                                    "chromY" = c("Y:1:10000000", "rs17396340",
                                                 "GO:0005005", "ENSG00000156103", "NLRP1")),
                       multi_query = TRUE)
head(multi_gostres2$result, 3)
#> term_id p_values significant term_size query_sizes
#> 1 GO:0005005 6.517769e-148, 4.029814e-131 TRUE, TRUE 53 60, 90
#> 2 GO:0005003 3.518396e-146, 2.172179e-129 TRUE, TRUE 54 60, 90
#> 3 GO:0036323 6.988992e-139, 6.485400e-124 TRUE, TRUE 50 59, 88
#> intersection_sizes source
#> 1 53, 53 GO:MF
#> 2 53, 53 GO:MF
#> 3 50, 50 GO:BP
#> term_name
#> 1 transmembrane-ephrin receptor activity
#> 2 ephrin receptor activity
#> 3 vascular endothelial growth factor receptor-1 signaling pathway
#> effective_domain_size source_order parents
#> 1 20208 1201 GO:0005003
#> 2 20208 1199 GO:0004714
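#> 3 20972 8941 GO:0048010
The result data.frame contains the following
columns: term_id, p_values, significant, term_size, query_sizes, intersection_sizes, source, term_name, effective_domain_size, source_order and parents. The columns p_values, significant, query_sizes and intersection_sizes hold one value per input query, in the order of the input queries.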
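To compare queries row by row, the per-query values can be unpacked into a long table. The sketch below assumes that the per-query columns are stored as list columns, as the printed output suggests; it uses a hand-made toy table with values copied from the output above rather than a real gost result.

```r
# Toy multi-query result (assumption: p_values is a list column)
multi_res <- data.frame(term_id = c("GO:0005005", "GO:0036323"))
multi_res$p_values <- list(c(6.517769e-148, 4.029814e-131),
                           c(6.988992e-139, 6.485400e-124))
query_names <- c("chromX", "chromY")

# One row per term and query
long <- data.frame(
  term_id = rep(multi_res$term_id, each = length(query_names)),
  query   = rep(query_names, times = nrow(multi_res)),
  p_value = unlist(multi_res$p_values)
)
long
```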
A major update in this package is the functionality to produce visualizations similar to those available in the web tool.
The enrichment results are visualized with a Manhattan-like-plot
using the function gostplot and the previously found
gost results gostres:
gostplot(gostres, capped = TRUE, interactive = TRUE)
The x-axis represents the functional terms that are grouped and
color-coded according to data sources and positioned according to the
fixed “source_order”. The order is defined in a way
that terms that are closer to each other in the source hierarchy are
also next to each other in the Manhattan plot. The source colors are
adjustable with the parameter pal that defines the color
map with a named list.
The y-axis shows the adjusted p-values in negative log10 scale. Every
circle is one term and is sized according to the term size, i.e. larger
terms have larger circles. If interactive = TRUE, then an
interactive plot is returned using the plotly package.
Hovering over the circle will show the corresponding information.
The parameter capped = TRUE indicates that
-log10(p-value) values larger than 16 are capped at 16. This fixes the
scale of the y-axis to keep Manhattan plots from different queries
comparable and is also intuitive as, statistically, p-values smaller
than that can all be summarised as highly significant.
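The capping itself is a one-line transformation. A minimal sketch on the three adjusted p-values shown in the results above (the variable names are illustrative):

```r
# Adjusted p-values copied from the result table above
p_values <- c(2.490324e-02, 4.966900e-02, 6.988992e-139)

neg_log10 <- -log10(p_values)
# Values above 16 are drawn at 16 when capped = TRUE
capped <- pmin(neg_log10, 16)
data.frame(p_values, neg_log10, capped)
```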
If interactive = FALSE, then the function returns a
static ggplot object.
The function publish_gostplot takes the static plot
object as an input and highlights a selection of interesting
terms from the results with numbers and a table of results. These can be
set with parameter highlight_terms listing the term IDs in
a vector or as a data.frame with column
“term_id” such as a subset of the result
data.frame.
# static plot object used as the input of publish_gostplot
p <- gostplot(gostres, capped = FALSE, interactive = FALSE)
pp <- publish_gostplot(p, highlight_terms = c("GO:0048013", "REAC:R-HSA-3928663"),
                       width = NA, height = NA, filename = NULL)
The function can also save the result into an image file in
PNG, PDF, JPEG, TIFF or BMP format with the parameter filename. The
plot width and height can be adjusted with the corresponding parameters
width and height.
If additional graphical parameters like increased resolution are
needed, then the plot objects can easily be saved using the function
ggsave from the package ggplot2.
The gost results can also be visualized with a table.
The publish_gosttable will create a nice-looking table with
the result statistics for the highlight_terms from the
result data.frame. The highlight_terms can be
a vector of term IDs or a subset of the results.
publish_gosttable(gostres, highlight_terms = gostres$result[c(1:2,10,120),],
                  use_colors = TRUE,
                  show_columns = c("source", "term_name", "term_size", "intersection_size"),
                  filename = NULL)
#> The input 'highlight_terms' is a data.frame. The column 'term_id' will be used.
The parameter use_colors = FALSE indicates that the
p-values column should not be highlighted with background colors. The
parameter show_columns lists the names of the
columns to show in the table in addition to
“term_id” and “p_value”.
The same functions also work for multi-query results, showing multiple Manhattan plots on top of each other.
Note that if a term is clicked on one of the Manhattan plots, it is also highlighted in the others (if it is present), which makes it possible to compare the multiple queries. The insignificant terms are shown in a lighter color.
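publish_gosttable(multi_gostres1,
                  highlight_terms = multi_gostres1$result[c(1, 82, 176),],
                  use_colors = TRUE,
                  show_columns = c("source", "term_name", "term_size"),
                  filename = NULL)
#> The input 'highlight_terms' is a data.frame. The column 'term_id' will be used.
Available data sources and their abbreviations are:
- GO:MF, GO:CC, GO:BP: Gene Ontology molecular function, cellular component and biological process
- KEGG: Kyoto Encyclopedia of Genes and Genomes
- REAC: Reactome
- WP: WikiPathways
- TF: Transfac
- MIRNA: miRTarBase
- HPA: Human Protein Atlas
- CORUM: CORUM protein complexes
- HP: Human Phenotype Ontology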
The function get_version_info returns the full
metadata about the versions of the different data sources for a given
organism.
get_version_info(organism = "hsapiens")
#> $biomart
#> [1] "Ensembl"
#>
#> $biomart_version
#> [1] "114"
#>
#> $display_name
#> [1] "Human"
#>
#> $genebuild
#> [1] "GRCh38.p14"
#>
#> $gprofiler_version
#> [1] "e114_eg62_p19_27110d83"
#>
#> $organism
#> [1] "hsapiens"
#>
#> $sources
#> $sources$CORUM
#> $sources$CORUM$name
#> [1] "CORUM protein complexes"
#>
#> $sources$CORUM$version
#> [1] "28.11.2022 Corum 4.1"
#>
#>
#> $sources$`GO:BP`
#> $sources$`GO:BP`$name
#> [1] "biological process"
#>
#> $sources$`GO:BP`$version
#> [1] "annotations: BioMart\nclasses: releases/2026-01-23"
#>
#>
#> $sources$`GO:CC`
#> $sources$`GO:CC`$name
#> [1] "cellular component"
#>
#> $sources$`GO:CC`$version
#> [1] "annotations: BioMart\nclasses: releases/2026-01-23"
#>
#>
#> $sources$`GO:MF`
#> $sources$`GO:MF`$name
#> [1] "molecular function"
#>
#> $sources$`GO:MF`$version
#> [1] "annotations: BioMart\nclasses: releases/2026-01-23"
#>
#>
#> $sources$HP
#> $sources$HP$name
#> [1] "Human Phenotype Ontology"
#>
#> $sources$HP$version
#> [1] "annotations: 03.2026\nclasses: None"
#>
#>
#> $sources$HPA
#> $sources$HPA$name
#> [1] "Human Protein Atlas"
#>
#> $sources$HPA$version
#> [1] "annotations: HPA website: 25-11-06\nclasses: script: 26-01-20"
#>
#>
#> $sources$KEGG
#> $sources$KEGG$name
#> [1] "Kyoto Encyclopedia of Genes and Genomes"
#>
#> $sources$KEGG$version
#> [1] "KEGG FTP Release 2026-03-15"
#>
#>
#> $sources$MIRNA
#> $sources$MIRNA$name
#> [1] "miRTarBase"
#>
#> $sources$MIRNA$version
#> [1] "Release 10.0"
#>
#>
#> $sources$REAC
#> $sources$REAC$name
#> [1] "Reactome"
#>
#> $sources$REAC$version
#> [1] "annotations: BioMart\nclasses: 2026-3-20"
#>
#>
#> $sources$TF
#> $sources$TF$name
#> [1] "Transfac"
#>
#> $sources$TF$version
#> [1] "annotations: TRANSFAC Release 2025.2\nclasses: v2"
#>
#>
#> $sources$WP
#> $sources$WP$name
#> [1] "WikiPathways"
#>
#> $sources$WP$version
#> [1] "20260310"
#>
#>
#>
#> $taxonomy_id
#> [1] "9606"
Custom data sources with upload_GMT_file
In addition to the available GO, KEGG, etc. data sources, users can upload their own custom data source using the Gene Matrix Transposed file format (GMT). The file format is described here. The users can compose the files themselves or use pre-compiled gene sets from dedicated websites like the Molecular Signatures Database (MSigDB). The GMT files for the g:Profiler default sources (except for KEGG and Transfac, as we are restricted by data source licenses) are downloadable from the Data sources section in g:Profiler.
upload_GMT_file uploads GMT file(s). The input
gmtfile is the filename of the GMT file together with the
path to the file. The input can also be several GMT files compressed
into a ZIP file. The file extension should be .gmt or
.zip in case of multiple GMT files. The uploaded
filename is used to define the source name in the enrichment
results.
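A small GMT file can also be composed by hand, as sketched below. Each line holds a term ID, a description and the member genes, separated by tabs. The term names, descriptions and file name here are made up for illustration.

```r
# Made-up gene sets and descriptions (assumptions)
gene_sets <- list(
  TOY_PATHWAY_A = c("MAPK3", "HRAS", "RAF1"),
  TOY_PATHWAY_B = c("PIK3R1", "PLCG1")
)
descriptions <- c(TOY_PATHWAY_A = "toy pathway A",
                  TOY_PATHWAY_B = "toy pathway B")

# One tab-separated line per term: ID, description, genes
gmt_lines <- vapply(names(gene_sets), function(id) {
  paste(c(id, descriptions[[id]], gene_sets[[id]]), collapse = "\t")
}, character(1))

gmt_file <- file.path(tempdir(), "toy_sets.gmt")
writeLines(gmt_lines, gmt_file)
readLines(gmt_file)
```

The path stored in gmt_file could then be passed as gmtfile to upload_GMT_file.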
For example, the BioCarta gene sets downloaded from the MSigDB Collections can be uploaded as follows.
download.file(url = "http://software.broadinstitute.org/gsea/resources/msigdb/7.0/c2.cp.biocarta.v7.0.symbols.gmt", destfile = "extdata/biocarta.gmt")
upload_GMT_file(gmtfile = "extdata/biocarta.gmt")
The result is a string that denotes the unique ID of the uploaded data source in the g:Profiler database. In this example, the ID is gp__TEXF_hZLM_d18.
After the upload, this ID can be used as a value for the parameter
organism in the gost function. The input
query should consist of identifiers that are available in
the GMT file. Note that all the genes in the GMT file define the domain
size and therefore it is not sufficient to include only a selection of
interesting terms in the file.
custom_gostres <- gost(query = c("MAPK3", "PIK3C2G", "HRAS", "PIK3R1", "MAP2K1",
                                 "RAF1", "PLCG1", "GNAQ", "MAPK1", "PRKCB", "CRK", "BCAR1", "NFKB1"),
                       organism = "gp__TEXF_hZLM_d18")
#> Detected custom GMT source request
head(custom_gostres$result, 3)
#> query significant p_value term_size query_size intersection_size
#> 1 query_1 TRUE 5.324995e-26 19 13 13
#> 2 query_1 TRUE 1.690996e-12 27 13 9
#> 3 query_1 TRUE 8.094988e-12 19 13 8
#> precision recall term_id source
#> 1 1.0000000 0.6842105 BIOCARTA_CXCR4_PATHWAY biocarta
#> 2 0.6923077 0.3333333 BIOCARTA_PYK2_PATHWAY biocarta
#> 3 0.6153846 0.4210526 BIOCARTA_CCR3_PATHWAY biocarta
#> term_name
#> 1 http://www.gsea-msigdb.org/gsea/msigdb/cards/BIOCARTA_CXCR4_PATHWAY
#> 2 http://www.gsea-msigdb.org/gsea/msigdb/cards/BIOCARTA_PYK2_PATHWAY
#> 3 http://www.gsea-msigdb.org/gsea/msigdb/cards/BIOCARTA_CCR3_PATHWAY
#> effective_domain_size source_order parents
#> 1 1476 47 NULL
#> 2 1476 109 NULL
#> 3 1476 31 NULL
There is no need to upload the same GMT file(s) before every enrichment analysis. A file only needs to be uploaded once, and the ID can then be used in any further enrichment analyses that are based on that custom source. The same ID can also be used in the web tool as a token under the Custom GMT options. For example, the same query in the web tool is available from https://biit.cs.ut.ee/gplink/l/jh3HdbUWQZ.
Generic Enrichment Map (GEM) is a file format that can be used as an input for Cytoscape EnrichmentMap application. In EnrichmentMap you can set the Analysis Type parameter as Generic/gProfiler and upload the required files: GEM file with enrichment results (input field Enrichments) and GMT file that defines the annotations (input field GMT).
For a single query, the GEM file can be generated and saved using the following commands:
gostres <- gost(query = c("X:1000:1000000", "rs17396340", "GO:0005005", "ENSG00000156103", "NLRP1"),
                evcodes = TRUE, multi_query = FALSE,
                sources = c("GO", "REAC", "MIRNA", "CORUM", "HP", "HPA", "WP"))
gem <- gostres$result[, c("term_id", "term_name", "p_value", "intersection")]
colnames(gem) <- c("GO.ID", "Description", "p.Val", "Genes")
gem$FDR <- gem$p.Val
gem$Phenotype <- "+1"
gem <- gem[, c("GO.ID", "Description", "p.Val", "FDR", "Phenotype", "Genes")]
head(gem, 3)
#> GO.ID Description
#> 1 CORUM:6586 VEcad-VEGFR complex
#> 2 CORUM:1185 EGFR-containing signaling complex
#> 3 GO:0036323 vascular endothelial growth factor receptor-1 signaling pathway
#> p.Val FDR Phenotype
#> 1 2.490324e-02 2.490324e-02 +1
#> 2 4.966900e-02 4.966900e-02 +1
#> 3 6.988992e-139 6.988992e-139 +1
#> Genes
#> 1 ENSG00000037280,ENSG00000128052
#> 2 ENSG00000141736,ENSG00000146648
#> 3 ENSG00000027644,ENSG00000030304,ENSG00000037280,ENSG00000044524,ENSG00000047936,ENSG00000062524,ENSG00000065361,ENSG00000066056,ENSG00000066468,ENSG00000068078,ENSG00000070886,ENSG00000077782,ENSG00000080224,ENSG00000092445,ENSG00000102755,ENSG00000105976,ENSG00000113721,ENSG00000116106,ENSG00000120156,ENSG00000122025,ENSG00000128052,ENSG00000133216,ENSG00000134853,ENSG00000135333,ENSG00000140443,ENSG00000140538,ENSG00000141736,ENSG00000142627,ENSG00000145242,ENSG00000146648,ENSG00000146904,ENSG00000148053,ENSG00000153208,ENSG00000154928,ENSG00000157404,ENSG00000160867,ENSG00000162733,ENSG00000164078,ENSG00000165731,ENSG00000167601,ENSG00000169071,ENSG00000171094,ENSG00000171105,ENSG00000178568,ENSG00000182578,ENSG00000182580,ENSG00000183317,ENSG00000196411,ENSG00000198400,ENSG00000204580
Here you can replace the query parameter with your own
input. The parameter evcodes = TRUE is necessary as it
returns the column intersection with the corresponding gene
IDs that are annotated to the term.
Saving the file before uploading to Cytoscape:
Here the parameter file should be the character string
naming the file together with the path you want to save it to.
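The save step could look like the following sketch, which mirrors the write.table call used for multiple queries later in this section. A one-row toy gem table and a temporary file name are used here as placeholders for the real table and path.

```r
# Toy GEM table standing in for the gem object built above
gem <- data.frame(
  GO.ID       = "CORUM:6586",
  Description = "VEcad-VEGFR complex",
  p.Val       = 2.490324e-02,
  FDR         = 2.490324e-02,
  Phenotype   = "+1",
  Genes       = "ENSG00000037280,ENSG00000128052"
)

# Placeholder path; replace with the location you want to save the file to
gem_file <- file.path(tempdir(), "gProfiler_gem.txt")
write.table(gem, file = gem_file, sep = "\t", quote = FALSE, row.names = FALSE)
```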
In addition to the GEM file, EnrichmentMap requires also the data
source description GMT file as an input. For example, if you are using
g:Profiler default data sources and your input query consists of human
ENSG identifiers, then the required GMT file is available from https://biit.cs.ut.ee/gprofiler/static/gprofiler_full_hsapiens.ENSG.gmt.
Note that this file does not include annotations from KEGG and Transfac
as we are restricted by data source licenses that do not allow us to
share these two data sources with our users. This means that the
enrichment results in the GEM file cannot include results from these
resources, otherwise you will get an error from the Cytoscape
application. This can be ensured by setting appropriate values for the
sources parameter in the gost() function.
For other organisms, the GMT files are downloadable from the g:Profiler web page under the Data sources section, after setting a suitable value for the organism. If you are using a custom GMT file for your analysis, then this file should be uploaded to EnrichmentMap.
In case you want to compare multiple queries in EnrichmentMap you could generate individual GEM files for each of the queries and upload these as separate Data sets. This EnrichmentMap option enables you to browse, edit and compare multiple networks simultaneously by color-coding different uploaded Data sets.
For example, these files can be generated with the following commands
(note that the parameter is still set to
multi_query = FALSE):
# enrichment for two input gene lists
multi_gostres <- gost(query = list("chromX" = c("X:1000:1000000", "rs17396340",
                                                "GO:0005005", "ENSG00000156103", "NLRP1"),
                                   "chromY" = c("Y:1:10000000", "rs17396340",
                                                "GO:0005005", "ENSG00000156103", "NLRP1")),
                      evcodes = TRUE, multi_query = FALSE,
                      sources = c("GO", "REAC", "MIRNA", "CORUM", "HP", "HPA", "WP"))
# format to GEM
gem <- multi_gostres$result[,c("query", "term_id", "term_name", "p_value", "intersection")]
colnames(gem) <- c("query", "GO.ID", "Description", "p.Val", "Genes")
gem$FDR <- gem$p.Val
gem$Phenotype = "+1"
# write separate files for queries
# install.packages("dplyr")
library(dplyr)
gem %>% group_by(query) %>%
  group_walk(~
    write.table(data.frame(.x[, c("GO.ID", "Description", "p.Val", "FDR", "Phenotype", "Genes")]),
                file = paste0("gProfiler_", unique(.y$query), "_gem.txt"),
                sep = "\t", quote = FALSE, row.names = FALSE))
gconvert maps between genes, proteins,
microarray probes, common names, various database identifiers, etc., from
numerous databases
and for many species.
gconvert(query = c("GO:0005030", "rs17396340", "NLRP1"), organism = "hsapiens",
         target = "ENSG", mthreshold = Inf, filter_na = TRUE)
#> input_number input target_number target name
#> 1 1 GO:0005030 1.1 ENSG00000027644 INSRR
#> 2 1 GO:0005030 1.10 ENSG00000068078 FGFR3
#> 3 1 GO:0005030 1.11 ENSG00000070886 EPHA8
#> 4 1 GO:0005030 1.12 ENSG00000077782 FGFR1
#> 5 1 GO:0005030 1.13 ENSG00000080224 EPHA6
#> 6 1 GO:0005030 1.14 ENSG00000092445 TYRO3
#> 7 1 GO:0005030 1.15 ENSG00000102755 FLT1
#> 8 1 GO:0005030 1.16 ENSG00000105976 MET
#> 9 1 GO:0005030 1.17 ENSG00000113721 PDGFRB
#> 10 1 GO:0005030 1.18 ENSG00000116106 EPHA4
#> 11 1 GO:0005030 1.19 ENSG00000120156 TEK
#> 12 1 GO:0005030 1.2 ENSG00000030304 MUSK
#> 13 1 GO:0005030 1.20 ENSG00000122025 FLT3
#> 14 1 GO:0005030 1.21 ENSG00000128052 KDR
#> 15 1 GO:0005030 1.22 ENSG00000133216 EPHB2
#> 16 1 GO:0005030 1.23 ENSG00000134243 SORT1
#> 17 1 GO:0005030 1.24 ENSG00000134853 PDGFRA
#> 18 1 GO:0005030 1.25 ENSG00000135333 EPHA7
#> 19 1 GO:0005030 1.26 ENSG00000140443 IGF1R
#> 20 1 GO:0005030 1.27 ENSG00000140538 NTRK3
#> 21 1 GO:0005030 1.28 ENSG00000141736 ERBB2
#> 22 1 GO:0005030 1.29 ENSG00000142627 EPHA2
#> 23 1 GO:0005030 1.3 ENSG00000037280 FLT4
#> 24 1 GO:0005030 1.30 ENSG00000145242 EPHA5
#> 25 1 GO:0005030 1.31 ENSG00000146648 EGFR
#> 26 1 GO:0005030 1.32 ENSG00000146904 EPHA1
#> 27 1 GO:0005030 1.33 ENSG00000148053 NTRK2
#> 28 1 GO:0005030 1.34 ENSG00000151892 GFRA1
#> 29 1 GO:0005030 1.35 ENSG00000153208 MERTK
#> 30 1 GO:0005030 1.36 ENSG00000154928 EPHB1
#> 31 1 GO:0005030 1.37 ENSG00000157404 KIT
#> 32 1 GO:0005030 1.38 ENSG00000160867 FGFR4
#> 33 1 GO:0005030 1.39 ENSG00000162733 DDR2
#> 34 1 GO:0005030 1.4 ENSG00000044524 EPHA3
#> 35 1 GO:0005030 1.40 ENSG00000164078 MST1R
#> 36 1 GO:0005030 1.41 ENSG00000165731 RET
#> 37 1 GO:0005030 1.42 ENSG00000167601 AXL
#> 38 1 GO:0005030 1.43 ENSG00000169071 ROR2
#> 39 1 GO:0005030 1.44 ENSG00000171094 ALK
#> 40 1 GO:0005030 1.45 ENSG00000171105 INSR
#> 41 1 GO:0005030 1.46 ENSG00000178568 ERBB4
#> 42 1 GO:0005030 1.47 ENSG00000182578 CSF1R
#> 43 1 GO:0005030 1.48 ENSG00000182580 EPHB3
#> 44 1 GO:0005030 1.49 ENSG00000183317 EPHA10
#> 45 1 GO:0005030 1.5 ENSG00000047936 ROS1
#> 46 1 GO:0005030 1.50 ENSG00000196411 EPHB4
#> 47 1 GO:0005030 1.51 ENSG00000198400 NTRK1
#> 48 1 GO:0005030 1.52 ENSG00000204580 DDR1
#> 49 1 GO:0005030 1.6 ENSG00000062524 LTK
#> 50 1 GO:0005030 1.7 ENSG00000065361 ERBB3
#> 51 1 GO:0005030 1.8 ENSG00000066056 TIE1
#> 52 1 GO:0005030 1.9 ENSG00000066468 FGFR2
#> 53 2 1:10226118:10226118 2.1 ENSG00000054523 KIF1B
#> 54 3 NLRP1 3.1 ENSG00000091592 NLRP1
#> description
#> 1 insulin receptor related receptor [Source:HGNC Symbol;Acc:HGNC:6093]
#> 2 fibroblast growth factor receptor 3 [Source:HGNC Symbol;Acc:HGNC:3690]
#> 3 EPH receptor A8 [Source:HGNC Symbol;Acc:HGNC:3391]
#> 4 fibroblast growth factor receptor 1 [Source:HGNC Symbol;Acc:HGNC:3688]
#> 5 EPH receptor A6 [Source:HGNC Symbol;Acc:HGNC:19296]
#> 6 TYRO3 protein tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:12446]
#> 7 fms related receptor tyrosine kinase 1 [Source:HGNC Symbol;Acc:HGNC:3763]
#> 8 MET proto-oncogene, receptor tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:7029]
#> 9 platelet derived growth factor receptor beta [Source:HGNC Symbol;Acc:HGNC:8804]
#> 10 EPH receptor A4 [Source:HGNC Symbol;Acc:HGNC:3388]
#> 11 TEK receptor tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:11724]
#> 12 muscle associated receptor tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:7525]
#> 13 fms related receptor tyrosine kinase 3 [Source:HGNC Symbol;Acc:HGNC:3765]
#> 14 kinase insert domain receptor [Source:HGNC Symbol;Acc:HGNC:6307]
#> 15 EPH receptor B2 [Source:HGNC Symbol;Acc:HGNC:3393]
#> 16 sortilin 1 [Source:HGNC Symbol;Acc:HGNC:11186]
#> 17 platelet derived growth factor receptor alpha [Source:HGNC Symbol;Acc:HGNC:8803]
#> 18 EPH receptor A7 [Source:HGNC Symbol;Acc:HGNC:3390]
#> 19 insulin like growth factor 1 receptor [Source:HGNC Symbol;Acc:HGNC:5465]
#> 20 neurotrophic receptor tyrosine kinase 3 [Source:HGNC Symbol;Acc:HGNC:8033]
#> 21 erb-b2 receptor tyrosine kinase 2 [Source:HGNC Symbol;Acc:HGNC:3430]
#> 22 EPH receptor A2 [Source:HGNC Symbol;Acc:HGNC:3386]
#> 23 fms related receptor tyrosine kinase 4 [Source:HGNC Symbol;Acc:HGNC:3767]
#> 24 EPH receptor A5 [Source:HGNC Symbol;Acc:HGNC:3389]
#> 25 epidermal growth factor receptor [Source:HGNC Symbol;Acc:HGNC:3236]
#> 26 EPH receptor A1 [Source:HGNC Symbol;Acc:HGNC:3385]
#> 27 neurotrophic receptor tyrosine kinase 2 [Source:HGNC Symbol;Acc:HGNC:8032]
#> 28 GDNF family receptor alpha 1 [Source:HGNC Symbol;Acc:HGNC:4243]
#> 29 MER proto-oncogene, tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:7027]
#> 30 EPH receptor B1 [Source:HGNC Symbol;Acc:HGNC:3392]
#> 31 KIT proto-oncogene, receptor tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:6342]
#> 32 fibroblast growth factor receptor 4 [Source:HGNC Symbol;Acc:HGNC:3691]
#> 33 discoidin domain receptor tyrosine kinase 2 [Source:HGNC Symbol;Acc:HGNC:2731]
#> 34 EPH receptor A3 [Source:HGNC Symbol;Acc:HGNC:3387]
#> 35 macrophage stimulating 1 receptor [Source:HGNC Symbol;Acc:HGNC:7381]
#> 36 ret proto-oncogene [Source:HGNC Symbol;Acc:HGNC:9967]
#> 37 AXL receptor tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:905]
#> 38 receptor tyrosine kinase like orphan receptor 2 [Source:HGNC Symbol;Acc:HGNC:10257]
#> 39 ALK receptor tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:427]
#> 40 insulin receptor [Source:HGNC Symbol;Acc:HGNC:6091]
#> 41 erb-b2 receptor tyrosine kinase 4 [Source:HGNC Symbol;Acc:HGNC:3432]
#> 42 colony stimulating factor 1 receptor [Source:HGNC Symbol;Acc:HGNC:2433]
#> 43 EPH receptor B3 [Source:HGNC Symbol;Acc:HGNC:3394]
#> 44 EPH receptor A10 [Source:HGNC Symbol;Acc:HGNC:19987]
#> 45 ROS proto-oncogene 1, receptor tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:10261]
#> 46 EPH receptor B4 [Source:HGNC Symbol;Acc:HGNC:3395]
#> 47 neurotrophic receptor tyrosine kinase 1 [Source:HGNC Symbol;Acc:HGNC:8031]
#> 48 discoidin domain receptor tyrosine kinase 1 [Source:HGNC Symbol;Acc:HGNC:2730]
#> 49 leukocyte receptor tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:6721]
#> 50 erb-b2 receptor tyrosine kinase 3 [Source:HGNC Symbol;Acc:HGNC:3431]
#> 51 tyrosine kinase with immunoglobulin like and EGF like domains 1 [Source:HGNC Symbol;Acc:HGNC:11809]
#> 52 fibroblast growth factor receptor 2 [Source:HGNC Symbol;Acc:HGNC:3689]
#> 53 kinesin family member 1B [Source:HGNC Symbol;Acc:HGNC:16636]
#> 54 NLR family pyrin domain containing 1 [Source:HGNC Symbol;Acc:HGNC:14374]
#> namespace
#> 1 GO
#> 2 GO
#> 3 GO
#> 4 GO
#> 5 GO
#> 6 GO
#> 7 GO
#> 8 GO
#> 9 GO
#> 10 GO
#> 11 GO
#> 12 GO
#> 13 GO
#> 14 GO
#> 15 GO
#> 16 GO
#> 17 GO
#> 18 GO
#> 19 GO
#> 20 GO
#> 21 GO
#> 22 GO
#> 23 GO
#> 24 GO
#> 25 GO
#> 26 GO
#> 27 GO
#> 28 GO
#> 29 GO
#> 30 GO
#> 31 GO
#> 32 GO
#> 33 GO
#> 34 GO
#> 35 GO
#> 36 GO
#> 37 GO
#> 38 GO
#> 39 GO
#> 40 GO
#> 41 GO
#> 42 GO
#> 43 GO
#> 44 GO
#> 45 GO
#> 46 GO
#> 47 GO
#> 48 GO
#> 49 GO
#> 50 GO
#> 51 GO
#> 52 GO
#> 53
#> 54 ENTREZGENE,GENECARDS,HGNC,UNIPROT_GN,WIKIGENE
The default target database is Ensembl ENSG (target = "ENSG"), but
gconvert also supports other major naming conventions like
UniProt, RefSeq, Entrez, HUGO, HGNC and many more. In addition, a large
variety of microarray platforms like Affymetrix, Illumina and Celera are
available.
The parameter mthreshold sets the maximum number of
results per initial alias. By default, all results are shown. The parameter
filter_na = TRUE will exclude the results without any
corresponding targets.
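As an illustration of these two parameters, the following minimal sketch wraps gconvert in a small helper that keeps at most one target per input and retains inputs without a match. The helper name convert_top_hit is hypothetical and not part of gprofiler2; calling it requires network access to g:Profiler.

```r
library(gprofiler2)

# Hypothetical helper (not part of gprofiler2): convert identifiers to
# Ensembl gene IDs, keeping at most one target per input.
convert_top_hit <- function(ids, organism = "hsapiens") {
  gconvert(
    query = ids,
    organism = organism,
    target = "ENSG",     # default target namespace
    mthreshold = 1,      # at most one result per input (default: Inf)
    filter_na = FALSE    # keep inputs that have no corresponding target
  )
}
```

For example, convert_top_hit(c("GO:0005030", "NLRP1")) should return at most one row per input instead of the full expansion of the GO term shown above.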
The result is a data.frame with columns:
target namespace
gsnpense
gsnpense converts a list of SNP rs-codes
(e.g. rs11734132) to chromosomal coordinates, gene names and predicted
variant effects. Mapping is only available for variants that overlap
with at least one protein-coding Ensembl gene.
gsnpense(query = c("rs11734132", "rs4305276", "rs17396340", "rs3184504"),
filter_na = TRUE)
#> rs_id chromosome start end strand ensgs
#> 1 rs11734132 4 6889792 6889792 + ENSG00000301554
#> 2 rs4305276 2 240555596 240555596 + ENSG00000144504
#> 3 rs17396340 1 10226118 10226118 + ENSG00000054523
#> 4 rs3184504 12 111446804 111446804 + ENSG00000111252
#> gene_names variants.intron_variant variants.missense_variant
#> 1 ENSG00000301554 1 0
#> 2 ANKMY1 21 0
#> 3 KIF1B 6 0
#> 4 SH2B3 0 3
#> variants.non_coding_transcript_variant
#> 1 1
#> 2 0
#> 3 0
#> 4 0
The parameter filter_na = TRUE will exclude the results
without any corresponding target genes.
gsnpense(query = c("rs11734132", "rs4305276", "rs17396340", "rs3184504"),
filter_na = FALSE)
#> rs_id chromosome start end strand ensgs
#> 1 rs11734132 4 6889792 6889792 + ENSG00000301554
#> 2 rs4305276 2 240555596 240555596 + ENSG00000144504
#> 3 rs17396340 1 10226118 10226118 + ENSG00000054523
#> 4 rs3184504 12 111446804 111446804 + ENSG00000111252
#> gene_names variants.intron_variant variants.missense_variant
#> 1 ENSG00000301554 1 0
#> 2 ANKMY1 21 0
#> 3 KIF1B 6 0
#> 4 SH2B3 0 3
#> variants.non_coding_transcript_variant
#> 1 1
#> 2 0
#> 3 0
#> 4 0
The result is a data.frame with columns:
rs_id
chromosome
start
end
strand
ensgs: list with multiple values
gene_names: list with multiple values
variants: data.frame with corresponding variant effects
set_base_url
You can change the underlying tool version to beta with:
You can check the current version with:
Similarly, for the archived versions:
Note that the gprofiler2 package is only compatible with versions e94_eg41_p11 and higher.
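The three steps above can be sketched as follows. The functions set_base_url and get_base_url are exported by gprofiler2; the beta and archive URLs are assumptions based on the layout of the g:Profiler site, and the archive version e95_eg42_p13 is only an example that should be replaced with the release you need.

```r
library(gprofiler2)

# Switch to the beta release of g:Profiler (URL is an assumption)
set_base_url("http://biit.cs.ut.ee/gprofiler_beta")

# Check which base URL subsequent calls will use
get_base_url()

# Switch to an archived release
# (assumed example path; substitute the version you need)
set_base_url("http://biit.cs.ut.ee/gprofiler_archive3/e95_eg42_p13")
```

The setting applies to the current R session only, so a script that depends on a specific release should call set_base_url before any gost, gconvert, gorth or gsnpense call.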
The gprofiler2 package supports all the same organisms, namespaces and data sources as the web tool. The list of organisms and corresponding data sources is available here.
The full list of namespaces that g:Profiler recognizes is available here.
If you use the R package gprofiler2 in published research, please cite:
and
If you have questions or issues, please write to [email protected]