Title: | miRNA Text Mining in Abstracts |
---|---|
Description: | Providing tools for microRNA (miRNA) text mining. miRetrieve summarizes miRNA literature by extracting, counting, and analyzing miRNA names, thus aiming at gaining biological insights into a large amount of text within a short period of time. To do so, miRetrieve uses regular expressions to extract miRNAs and tokenization to identify meaningful miRNA associations. In addition, miRetrieve uses the latest miRTarBase version 8.0 (Hsi-Yuan Huang et al. (2020) "miRTarBase 2020: updates to the experimentally validated microRNA–target interaction database" <doi:10.1093/nar/gkz896>) to display field-specific miRNA-mRNA interactions. The most important functions are available as a Shiny web application under <https://miretrieve.shinyapps.io/miRetrieve/>. |
Authors: | Julian Friedrich [aut, cre], Hans-Peter Hammes [aut], Guido Krenning [aut] |
Maintainer: | Julian Friedrich |
License: | GPL-3 |
Version: | 1.3.4 |
Built: | 2024-11-15 06:53:43 UTC |
Source: | CRAN |
Add topic column to a data frame.
add_col_topic(df, col.topic = "Topic", topic.name = "Topic1")
df |
Data frame which the topic column is added to. |
col.topic |
String. Name of the topic column to be created. |
topic.name |
String. Topic name to be contained in |
Add a topic column to a data frame. This topic column is named col.topic and contains the string topic.name.
Data frame with a topic column added.
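The behaviour described above can be sketched in base R. This is a minimal illustration, not the package's implementation: add_topic_column() is a hypothetical stand-in for add_col_topic(), and the example data frame is made up.

```r
# Hypothetical stand-in for add_col_topic(), written in base R.
add_topic_column <- function(df, col.topic = "Topic", topic.name = "Topic1") {
  # Every row receives the same topic label.
  df[[col.topic]] <- rep(topic.name, nrow(df))
  df
}

# Made-up example data.
abstracts <- data.frame(
  PMID = c(1, 2),
  Abstract = c("miR-21 in colorectal cancer.", "miR-155 in colorectal cancer."),
  stringsAsFactors = FALSE
)

labelled <- add_topic_column(abstracts, topic.name = "CRC")
print(labelled)
```

Labelling each data frame this way before combining several of them keeps the origin of every abstract visible in later comparisons.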
Keywords to identify abstracts using animal models.
animal_keywords
An object of class character of length 12.
Assign topics to abstracts based on precalculated scores.
assign_topic( df, col.topic, threshold, topic.names = NULL, col.topic.name = "Topic", col.pmid = "PMID", discard = FALSE )
df |
Data frame containing precalculated topic scores and PubMed-IDs. |
col.topic |
Character vector. Vector with column names containing precalculated topic scores. |
threshold |
Integer vector. Vector containing thresholds for topic
columns. Positions in |
topic.names |
Character vector. Optional. Vector containing names of new
topics. Positions in |
col.topic.name |
String. Name of the new topic column. |
col.pmid |
String. Column containing PubMed-IDs. |
discard |
Boolean. If |
Assign topics to abstracts based on precalculated scores. assign_topic() compares different precalculated topic scores and assigns the abstract to the topic with the highest score. If there is a tie between topic scores, the abstract is assigned to all topics in question. If an abstract matches no topic, it is assigned to the topic "Unknown".
Data frame with topics based on precalculated topic scores.
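The assignment rule (highest score wins, ties go to all tied topics, no match gives "Unknown") can be sketched in base R. This is only an illustration under assumptions: assign_by_score() is a hypothetical helper, a score is assumed to count only if it reaches its threshold, and the topic label is simply the score column name.

```r
# Hypothetical helper illustrating the assignment rule; not the package code.
assign_by_score <- function(df, col.topic, threshold, col.pmid = "PMID") {
  rows <- lapply(seq_len(nrow(df)), function(i) {
    # Scores of abstract i, one per topic column.
    scores <- vapply(col.topic, function(col) df[[col]][i], numeric(1))
    # Assumption: a topic is eligible only if its score reaches its threshold.
    eligible <- scores >= threshold
    if (!any(eligible)) {
      topics <- "Unknown"
    } else {
      best <- max(scores[eligible])
      # Ties yield one row per tied topic.
      topics <- col.topic[eligible & scores == best]
    }
    data.frame(PMID = df[[col.pmid]][i], Topic = topics, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Made-up scores for three abstracts.
scores <- data.frame(
  PMID = c(11, 12, 13),
  Animal_score = c(5, 2, 0),
  Patient_score = c(1, 2, 0)
)

assigned <- assign_by_score(scores, c("Animal_score", "Patient_score"), threshold = c(1, 1))
print(assigned)
```

In this sketch abstract 12 appears twice because both scores tie, and abstract 13 is labelled "Unknown" because neither score reaches its threshold.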
calculate_score_topic(), plot_score_topic(), add_col_topic()
Other score functions: calculate_score_animals(), calculate_score_biomarker(), calculate_score_patients(), calculate_score_topic(), plot_score_animals(), plot_score_biomarker(), plot_score_patients(), plot_score_topic()
Assign topics to abstracts based on an LDA model.
assign_topic_lda(df, lda_model, topic.names, col.pmid = PMID)
df |
Data frame to assign topics to. Should be the same data frame that the LDA model was fitted on. |
lda_model |
LDA-model. |
topic.names |
Character vector. Vector containing names of the
new topics. Must have the same length as the number of topics |
col.pmid |
Symbol. Column containing PubMed-IDs. |
Assign topics to abstracts based on an LDA model. To identify the subject of a topic, use plot_lda_term().
Data frame with topics assigned to each abstract based on an LDA model.
fit_lda(), plot_lda_term(), assign_topic()
Other LDA functions: fit_lda(), plot_lda_term(), plot_perplexity()
Keywords to identify abstracts reporting about miRNAs as biomarkers.
biomarker_keywords
An object of class character of length 18.
Calculate animal model score for each abstract to indicate possible use of animal models.
calculate_score_animals( df, keywords = animal_keywords, case = FALSE, threshold = NULL, indicate = FALSE, discard = FALSE, col.abstract = Abstract )
df |
Data frame containing abstracts. |
keywords |
Character vector. Vector containing keywords. The score is
calculated based on these keywords. How much weight a keyword in |
case |
Boolean. If |
threshold |
Integer. Optional. Threshold to decide if an abstract is
considered to use animal models or not. If |
indicate |
Boolean. If |
discard |
Boolean. If |
col.abstract |
Symbol. Column containing abstracts. |
Calculate animal model score for each abstract to indicate possible use of animal models. This score is added to the data frame as an additional column Animal_score, containing the calculated animal model score.
To decide which abstracts are considered to contain animal models, a threshold can be set via the threshold argument. Furthermore, an additional column can be added, verbally indicating the use of animal models in an abstract.
Choosing the right threshold can be facilitated using plot_score_animals().
Data frame with calculated animal model scores. If discard = FALSE, adds extra columns to the original data frame with the calculated animal model scores. If discard = TRUE, only abstracts with animal models are kept.
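A keyword score of this kind can be sketched in base R as follows. The sketch makes several assumptions that the documentation does not confirm: keyword_score() is a hypothetical helper, every keyword occurrence counts with equal weight, the two keywords are made up rather than taken from animal_keywords, and an abstract is kept when its score is at least the threshold.

```r
# Hypothetical keyword scoring; equal weight per keyword hit is an assumption.
keyword_score <- function(abstracts, keywords, case = FALSE) {
  if (!case) {
    # Case-insensitive matching by lowercasing both sides.
    abstracts <- tolower(abstracts)
    keywords <- tolower(keywords)
  }
  vapply(abstracts, function(text) {
    hits <- vapply(keywords, function(kw) {
      m <- gregexpr(kw, text, fixed = TRUE)[[1]]
      sum(m > 0)  # gregexpr() returns -1 when there is no match
    }, numeric(1))
    sum(hits)
  }, numeric(1), USE.NAMES = FALSE)
}

# Made-up abstracts and keywords.
abstracts <- data.frame(
  PMID = c(1, 2),
  Abstract = c(
    "miR-21 was knocked down in mice; treated mice survived longer.",
    "Serum miR-21 was measured in patients."
  ),
  stringsAsFactors = FALSE
)
abstracts$Animal_score <- keyword_score(abstracts$Abstract, c("mice", "mouse"))

# Analogue of discard = TRUE with threshold = 1 (assumed to mean "score >= 1").
kept <- abstracts[abstracts$Animal_score >= 1, ]
print(kept)
```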
Other score functions: assign_topic(), calculate_score_biomarker(), calculate_score_patients(), calculate_score_topic(), plot_score_animals(), plot_score_biomarker(), plot_score_patients(), plot_score_topic()
Calculate biomarker score for each abstract to indicate possible use of miRNAs as biomarker.
calculate_score_biomarker( df, keywords = biomarker_keywords, case = FALSE, threshold = NULL, indicate = FALSE, discard = FALSE, col.abstract = Abstract )
df |
Data frame containing abstracts. |
keywords |
Character vector. Vector containing keywords. The score is
calculated based on these keywords. How much weight a keyword in |
case |
Boolean. If |
threshold |
Integer. Optional. Threshold to decide if use of miRNAs as
biomarker are present in an abstract or not. If |
indicate |
Boolean. If |
discard |
Boolean. If |
col.abstract |
Symbol. Column containing abstracts. |
Calculate biomarker score for each abstract to indicate possible use of miRNAs as biomarkers. This score is added to the data frame as an additional column Biomarker_score, containing the calculated biomarker score.
To decide which abstracts are considered to contain use of miRNAs as biomarkers, a threshold can be set via the threshold argument. Furthermore, an additional column can be added, verbally indicating the general use of miRNAs as biomarkers in an abstract.
Choosing the right threshold can be facilitated using plot_score_biomarker().
Data frame with calculated biomarker scores. If discard = FALSE, adds extra columns to the original data frame with the calculated biomarker scores. If discard = TRUE, only abstracts with miRNAs as biomarkers are kept.
Other score functions: assign_topic(), calculate_score_animals(), calculate_score_patients(), calculate_score_topic(), plot_score_animals(), plot_score_biomarker(), plot_score_patients(), plot_score_topic()
Calculate patients score for each abstract to indicate possible use of patient material.
calculate_score_patients( df, keywords = patients_keywords, case = FALSE, threshold = NULL, indicate = FALSE, discard = FALSE, col.abstract = Abstract )
df |
Data frame containing abstracts. |
keywords |
Character vector. Vector containing keywords. The score is
calculated based on these keywords. How much weight a keyword in |
case |
Boolean. If |
threshold |
Integer. Optional. Threshold to decide if use of patient tissue is
present in an abstract or not. If |
indicate |
Boolean. If |
discard |
Boolean. If |
col.abstract |
Symbol. Column containing abstracts. |
Calculate patient score for each abstract to indicate possible use of patient material. This score is added to the data frame as an additional column Patient_score, containing the calculated patient score.
To decide which abstracts are considered to contain patient material, a threshold can be set via the threshold argument. Furthermore, an additional column can be added, verbally indicating the general use of patient material.
Choosing the right threshold can be facilitated using plot_score_patients().
Data frame with calculated patient scores. If discard = FALSE, adds extra columns to the original data frame with the calculated patient scores. If discard = TRUE, only abstracts with use of patient tissue are kept.
Other score functions: assign_topic(), calculate_score_animals(), calculate_score_biomarker(), calculate_score_topic(), plot_score_animals(), plot_score_biomarker(), plot_score_patients(), plot_score_topic()
Calculate score of a self-chosen topic for each abstract to identify abstracts possibly corresponding to the topic of interest.
calculate_score_topic( df, keywords, case = FALSE, col.score = "topic_score", col.indicate = NULL, threshold = NULL, discard = FALSE, col.abstract = Abstract )
df |
Data frame containing abstracts. |
keywords |
Character vector. Vector containing keywords. The score is
calculated based on these keywords. How much weight a keyword in |
case |
Boolean. If |
col.score |
String. Name of |
col.indicate |
String. Optional. Name of indicating column. If a string
is provided, an extra column is added to |
threshold |
Integer. Optional. Threshold to decide if abstract
corresponds to topic of interest. If |
discard |
Boolean. If |
col.abstract |
Symbol. Column containing abstracts. |
Calculate score of a self-chosen topic for each abstract to identify abstracts possibly corresponding to the topic of interest. This score is added to the data frame as an additional column, usually called topic_score, containing the calculated topic score. If there is more than one topic of interest, the column topic_score should be appropriately renamed.
To decide which abstracts are considered to correspond to the topic of interest, a threshold can be set via the threshold argument. Furthermore, an additional column can be added, verbally indicating if the abstract corresponds to the topic.
Choosing the right threshold can be facilitated using plot_score_topic().
Data frame with calculated topic scores. If discard = FALSE, adds extra columns to the original data frame with the calculated topic scores. If discard = TRUE, only abstracts corresponding to the topic of interest are kept.
assign_topic(), plot_score_topic()
Other score functions: assign_topic(), calculate_score_animals(), calculate_score_biomarker(), calculate_score_patients(), plot_score_animals(), plot_score_biomarker(), plot_score_patients(), plot_score_topic()
Combine data frames into one data frame.
combine_df(...)
... |
Data frames to combine into one data frame. Data frames must have the same number of columns and the same column names. |
Combine data frames into one data frame. combine_df() accepts several data frames that are combined into one data frame. Data frames to be combined must have the same number of columns and the same column names.
Combined data frame.
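The column requirement can be illustrated with a small base R sketch. combine_frames() is a hypothetical stand-in, not the package's implementation, and the row-binding via rbind() is an assumption about how such a combination might work.

```r
# Hypothetical stand-in for combine_df(); checks columns, then stacks rows.
combine_frames <- function(...) {
  frames <- list(...)
  reference <- names(frames[[1]])
  same_columns <- vapply(frames, function(f) {
    ncol(f) == length(reference) && setequal(names(f), reference)
  }, logical(1))
  # Stop early if any data frame has a different set of columns.
  stopifnot(all(same_columns))
  do.call(rbind, frames)
}

# Made-up example data.
crc <- data.frame(PMID = c(1, 2), Abstract = c("a", "b"), stringsAsFactors = FALSE)
panc <- data.frame(PMID = 3, Abstract = "c", stringsAsFactors = FALSE)

combined <- combine_frames(crc, panc)
print(combined)
```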
Other combine functions: combine_mir()
Combine miRNA vectors into one.
combine_mir(...)
... |
Character vectors. Character vectors containing miRNA names. |
Combine miRNA vectors into one. miRNA names occurring more than once are reduced to one instance.
Combined character vector containing miRNA names.
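The deduplication described above corresponds to a one-liner in base R. combine_names() is a hypothetical stand-in used for illustration only.

```r
# Hypothetical stand-in for combine_mir(): concatenate, then drop repeats.
combine_names <- function(...) unique(c(...))

combined <- combine_names(c("miR-21", "miR-155"), c("miR-155", "miR-34"))
print(combined)
```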
Other combine functions: combine_df()
Combine data frames containing stop words into one data frame.
combine_stopwords(...)
... |
Data frames with stop words. Data frames must have two columns named "word" and "lexicon". |
Combine data frames containing stop words into one data frame. Provided data frames must have two columns named "word" and "lexicon".
Combined data frame with stop words.
generate_stopwords(), stopwords_miretrieve, tidytext::stop_words
Other stopword functions: generate_stopwords()
Compare count of miRNA names between different topics.
compare_mir_count( df, mir, topic = NULL, normalize = TRUE, col.topic = Topic, col.mir = miRNA, col.pmid = PMID, title = NULL )
df |
Data frame containing columns for miRNA names, topics, and PubMed-IDs. |
mir |
Character vector. Vector specifying which miRNA names to compare. |
topic |
Character vector. Optional. Vector specifying which topics to compare. |
normalize |
Boolean. If |
col.topic |
Symbol. Column containing topic names. |
col.mir |
Symbol. Column containing miRNA names. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
Compare count of miRNA names between different topics by plotting the number of abstracts mentioning the miRNA in a topic. This count can either be normalized, thus plotting the proportion of abstracts mentioning a miRNA name compared to all abstracts of a topic, or it can be not normalized, thus plotting the absolute number of abstracts mentioning a miRNA per topic.
Bar plot comparing the count of miRNA names between different topics.
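The difference between the normalized and the absolute count can be sketched in base R. count_per_topic() is a hypothetical helper that returns the numbers rather than a plot, and the data are made up; the package's own counting may differ in detail.

```r
# Hypothetical helper: abstracts mentioning `mir` per topic.
count_per_topic <- function(df, mir, normalize = TRUE) {
  # Count each abstract-miRNA-topic combination only once.
  df <- unique(df[, c("PMID", "miRNA", "Topic")])
  topics <- sort(unique(df$Topic))
  vapply(topics, function(tp) {
    in_topic <- df[df$Topic == tp, ]
    mentions <- length(unique(in_topic$PMID[in_topic$miRNA == mir]))
    # Normalized: share of the topic's abstracts that mention the miRNA.
    if (normalize) mentions / length(unique(in_topic$PMID)) else mentions
  }, numeric(1))
}

# Made-up data: two abstracts per topic.
mirs <- data.frame(
  PMID = c(1, 1, 2, 3, 4, 4),
  miRNA = c("miR-21", "miR-155", "miR-21", "miR-21", "miR-155", "miR-34"),
  Topic = c("CRC", "CRC", "CRC", "Panc", "Panc", "Panc"),
  stringsAsFactors = FALSE
)

relative <- count_per_topic(mirs, "miR-21")
absolute <- count_per_topic(mirs, "miR-21", normalize = FALSE)
print(relative)
print(absolute)
```

Normalizing matters when topics differ in size, because a larger topic would otherwise dominate the absolute counts.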
compare_mir_count_log2(), compare_mir_count_unique()
Other compare functions: compare_mir_count_log2(), compare_mir_count_unique(), compare_mir_terms_log2(), compare_mir_terms_scatter(), compare_mir_terms_unique(), compare_mir_terms()
Compare log2-frequency count of miRNA names between two topics
compare_mir_count_log2( df, mir, topic = NULL, normalize = TRUE, col.topic = Topic, col.mir = miRNA, col.pmid = PMID, title = NULL )
df |
Data frame containing miRNA names, topics, and PubMed-IDs. |
mir |
Character vector. Vector specifying which miRNA names to compare. |
topic |
Character vector. Optional. Vector specifying which
topics to compare. If |
normalize |
Boolean. If |
col.topic |
Symbol. Column containing topics. |
col.mir |
Symbol. Column containing miRNA names. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
Compare log2-frequency count of miRNA names between two topics by plotting the log2-ratio of the miRNA count in two topics. The miRNA count per topic can either be normalized, thus taking the proportion of abstracts mentioning a miRNA name compared to all abstracts in a topic, or not normalized, thus taking the absolute number of abstracts mentioning a miRNA in a topic. The log2-plot is greatly inspired by the book “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” by Silge and Robinson.
List containing bar plot comparing the log2-frequency count of miRNA names between two topics and its corresponding data frame.
Silge, Julia, and David Robinson. 2016. “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” JOSS 1 (3). The Open Journal. https://doi.org/10.21105/joss.00037.
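As a worked illustration of a log2-ratio, the following base R sketch uses made-up normalized counts for two topics. Which topic ends up in the numerator is an assumption here; the documentation does not state the direction used by the function.

```r
# Made-up proportions of abstracts mentioning each miRNA, per topic.
crc  <- c("miR-21" = 0.40, "miR-155" = 0.10)
panc <- c("miR-21" = 0.20, "miR-155" = 0.40)

# Positive values: relatively more frequent in CRC; negative: more frequent in Panc.
log2_ratio <- log2(crc / panc[names(crc)])
print(log2_ratio)
```

A value of 1 means the miRNA is mentioned twice as often in the first topic, and a value of -2 means it is mentioned four times as often in the second.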
compare_mir_count(), compare_mir_count_unique()
Other compare functions: compare_mir_count_unique(), compare_mir_count(), compare_mir_terms_log2(), compare_mir_terms_scatter(), compare_mir_terms_unique(), compare_mir_terms()
Compare top count of unique miRNA names per topic
compare_mir_count_unique( df, top = 5, topic = NULL, normalize = TRUE, colour = "steelblue3", col.topic = Topic, col.mir = miRNA, col.pmid = PMID, title = NULL )
df |
Data frame containing miRNA names, topics, and PubMed-IDs. |
top |
Integer. Specifies number of top unique miRNAs to plot. |
topic |
Character vector. Optional. Vector specifying which
topics to compare. If |
normalize |
Boolean. If |
colour |
String. Colour of bar plot. |
col.topic |
Symbol. Column containing topics. |
col.mir |
Symbol. Column containing miRNA names. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
Compare top count of unique miRNA names per topic by plotting the miRNA count of unique miRNAs per topic. Per topic, the unique miRNAs are identified and their count is plotted. The miRNA count can either be normalized, thus taking the proportion of abstracts mentioning a miRNA name compared to all abstracts in a topic, or not normalized, thus taking the absolute number of abstracts mentioning a miRNA in a topic.
Bar plot comparing frequency of unique miRNA count per topic.
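Identifying miRNAs that are unique to a topic amounts to a set difference, sketched here in base R. unique_mirs() is a hypothetical helper, and "unique" is assumed to mean "mentioned in this topic and in no other topic of the data frame".

```r
# Hypothetical helper: miRNAs mentioned in one topic only.
unique_mirs <- function(df) {
  by_topic <- lapply(split(df$miRNA, df$Topic), unique)
  out <- lapply(names(by_topic), function(tp) {
    # All miRNAs mentioned in any other topic.
    others <- unlist(by_topic[names(by_topic) != tp], use.names = FALSE)
    setdiff(by_topic[[tp]], others)
  })
  names(out) <- names(by_topic)
  out
}

# Made-up data.
mirs <- data.frame(
  miRNA = c("miR-21", "miR-155", "miR-21", "miR-155", "miR-34"),
  Topic = c("CRC", "CRC", "Panc", "Panc", "Panc"),
  stringsAsFactors = FALSE
)

unique_by_topic <- unique_mirs(mirs)
print(unique_by_topic)
```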
compare_mir_count(), compare_mir_count_log2()
Other compare functions: compare_mir_count_log2(), compare_mir_count(), compare_mir_terms_log2(), compare_mir_terms_scatter(), compare_mir_terms_unique(), compare_mir_terms()
Compare count of top terms associated with a miRNA name over various topics.
compare_mir_terms( df, mir, top = 20, token = "words", ..., topic = NULL, shared = TRUE, normalize = TRUE, stopwords = stopwords_miretrieve, stopwords_ngram = TRUE, position = "dodge", col.mir = miRNA, col.abstract = Abstract, col.topic = Topic, col.pmid = PMID, title = NULL )
df |
Data frame containing miRNA names, abstracts, topics, and PubMed-IDs. |
mir |
String. miRNA name of interest. |
top |
Integer. Number of top terms to plot. |
token |
String. Specifies how abstracts shall be split up. Taken from
|
... |
Additional arguments for tokenization, if necessary. |
topic |
Character vector. Optional. Specifies topics to plot.
If |
shared |
Boolean. If |
normalize |
Boolean. If |
stopwords |
Data frame containing stop words. |
stopwords_ngram |
Boolean. Specifies if stop words shall be removed
from abstracts when using ngrams. Only applied when |
position |
Character vector. Vector containing either "dodge" or "facet". Determines if bar plots are on top of or next to each other. |
col.mir |
Symbol. Column containing miRNA names. |
col.abstract |
Symbol. Column containing abstracts. |
col.topic |
Symbol. Column containing topic names. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
Compare count of top terms associated with a miRNA name over various topics. miRNA names and topics must be in a data frame df, while terms are taken from abstracts contained in df.
Number of top terms to plot is regulated by top. Terms can either be evaluated as their raw count, i.e. in how many abstracts they are mentioned in conjunction with the miRNA name, or as their relative count, i.e. in how many abstracts containing the miRNA they are mentioned compared to all abstracts containing the miRNA.
compare_mir_terms() is based on the tools available in the tidytext package.
Bar plot comparing the count of terms associated with a miRNA name over two topics.
compare_mir_terms_log2(), compare_mir_terms_scatter()
Other compare functions: compare_mir_count_log2(), compare_mir_count_unique(), compare_mir_count(), compare_mir_terms_log2(), compare_mir_terms_scatter(), compare_mir_terms_unique()
Compare log2-frequency count of terms associated with a miRNA name over two topics.
compare_mir_terms_log2( df, mir, top = 20, token = "words", ..., topic = NULL, shared = TRUE, normalize = TRUE, stopwords = stopwords_miretrieve, stopwords_ngram = TRUE, col.mir = miRNA, col.abstract = Abstract, col.topic = Topic, col.pmid = PMID, title = NULL )
df |
Data frame containing miRNA names, abstracts, topics, and PubMed-IDs. |
mir |
String. miRNA name of interest. |
top |
Integer. Number of top terms to plot. |
token |
String. Specifies how abstracts shall be split up. Taken from
|
... |
Additional arguments for tokenization, if necessary. |
topic |
Character vector. Optional. Specifies which topics to plot.
Must have length two.
If |
shared |
Boolean. If |
normalize |
Boolean. If |
stopwords |
Data frame containing stop words. |
stopwords_ngram |
Boolean. Specifies if stop words shall be removed
from abstracts when using ngrams. Only applied when |
col.mir |
Symbol. Column containing miRNA names. |
col.abstract |
Symbol. Column containing abstracts. |
col.topic |
Symbol. Column containing topic names. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
Compare log2-frequency count of terms associated with a miRNA name over two topics by plotting the log2-ratio of the term count associated with a miRNA name over two topics. miRNA names and topics must be in a data frame df, while terms are taken from abstracts contained in df.
Number of top terms to plot is regulated by top. Terms can either be evaluated as their raw count, i.e. in how many abstracts they are mentioned in conjunction with the miRNA name, or as their relative count, i.e. in how many abstracts containing the miRNA they are mentioned compared to all abstracts containing the miRNA.
compare_mir_terms_log2() is based on the tools available in the tidytext package. The log2-plot is greatly inspired by “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” by Silge and Robinson.
List containing bar plot comparing the log2-frequency of terms associated with a miRNA over two topics and its corresponding data frame.
Silge, Julia, and David Robinson. 2016. “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” JOSS 1 (3). The Open Journal. https://doi.org/10.21105/joss.00037.
compare_mir_terms(), compare_mir_terms_scatter()
Other compare functions: compare_mir_count_log2(), compare_mir_count_unique(), compare_mir_count(), compare_mir_terms_scatter(), compare_mir_terms_unique(), compare_mir_terms()
Compare shared terms associated with a miRNA name over two topics.
compare_mir_terms_scatter( df, mir, top = 1000, token = "words", ..., topic = NULL, stopwords = stopwords_miretrieve, stopwords_ngram = TRUE, html = TRUE, colour.point = "red", colour.term = "black", col.mir = miRNA, col.abstract = Abstract, col.topic = Topic, col.pmid = PMID, title = NULL )
df |
Data frame containing miRNA names, abstracts, topics, and PubMed-IDs. |
mir |
String. miRNA name of interest. |
top |
Integer. Number of top terms to plot. |
token |
String. Specifies how abstracts shall be split up. Taken from
|
... |
Additional arguments for tokenization, if necessary. |
topic |
Character vector. Optional. Specifies which topics to plot.
Must have length two.
If |
stopwords |
Data frame containing stop words. |
stopwords_ngram |
Boolean. Specifies if stop words shall be removed
from abstracts when using ngrams. Only applied when |
html |
Boolean. Specifies if plot is returned as an HTML-widget or static. |
colour.point |
String. Colour of points for scatter plot. |
colour.term |
String. Colour of terms for scatter plot. |
col.mir |
Symbol. Column containing miRNAs. |
col.abstract |
Symbol. Column containing abstracts. |
col.topic |
Symbol. Column containing topic names. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
Compare shared terms associated with a miRNA name over two topics. These terms are displayed as a scatter plot, which is either interactive as an HTML-widget, or static. This is regulated via the html argument.
miRNA names and topics must be in a data frame df, while terms are taken from abstracts contained in df.
Number of top terms to choose is regulated by top. Terms are evaluated as their raw count and plotted on a log10-scale.
compare_mir_terms_scatter() is based on the tools available in the tidytext package. The term-plot is greatly inspired by “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” by Silge and Robinson.
Scatter plot comparing shared terms of a miRNA between two topics.
Silge, Julia, and David Robinson. 2016. “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” JOSS 1 (3). The Open Journal. https://doi.org/10.21105/joss.00037.
compare_mir_terms(), compare_mir_terms_log2()
Other compare functions: compare_mir_count_log2(), compare_mir_count_unique(), compare_mir_count(), compare_mir_terms_log2(), compare_mir_terms_unique(), compare_mir_terms()
Compare terms uniquely associated with a miRNA name over topics.
compare_mir_terms_unique( df, mir, top = 20, token = "words", ..., topic = NULL, stopwords = stopwords_miretrieve, stopwords_ngram = TRUE, normalize = TRUE, colour = "steelblue3", col.mir = miRNA, col.abstract = Abstract, col.topic = Topic, col.pmid = PMID, title = NULL )
df |
Data frame containing miRNA names, abstracts, topics, and PubMed-IDs. |
mir |
String. miRNA name of interest. |
top |
Integer. Number of top terms to plot. |
token |
String. Specifies how abstracts shall be split up. Taken from
|
... |
Additional arguments for tokenization, if necessary. |
topic |
Character vector. Optional. Specifies which topics to plot.
If |
stopwords |
Data frame containing stop words. |
stopwords_ngram |
Boolean. Specifies if stop words shall be removed
from abstracts when using ngrams. Only applied when |
normalize |
Boolean. If |
colour |
String. Colour of bar plot. |
col.mir |
Symbol. Column containing miRNAs. |
col.abstract |
Symbol. Column containing abstracts. |
col.topic |
Symbol. Column containing topic names. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
Compare terms uniquely associated with a miRNA name over topics. miRNA names and topics must be in a data frame df, while terms are taken from abstracts contained in df.
Number of top terms to choose is regulated by top. Terms are evaluated either as the number of times they are mentioned in all abstracts with the miRNA name of interest, or as the number of times they are mentioned relative to all abstracts with the miRNA name of interest.
compare_mir_terms_unique() is based on the tools available in the tidytext package.
Bar plot containing unique miRNA-term associations per topic.
compare_mir_terms(), compare_mir_terms_log2(), compare_mir_terms_scatter()
Other compare functions: compare_mir_count_log2(), compare_mir_count_unique(), compare_mir_count(), compare_mir_terms_log2(), compare_mir_terms_scatter(), compare_mir_terms()
Count occurrence of miRNA names in a data frame.
count_mir(df, col.mir = miRNA)
df |
Data frame containing miRNA names. |
col.mir |
Symbol. Column containing miRNA names. |
Count occurrence of miRNA names in a data frame. The count of miRNA names is returned as a separate data frame, only listing the miRNA names and their respective frequency.
Data frame. Data frame containing miRNA names and their respective frequency.
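A frequency table of this shape can be produced in base R with table(). This is a sketch only: the column name Mentions is made up, and counting one mention per row of the input is an assumption about how the package counts.

```r
# Made-up data: one row per miRNA mention.
mirs <- data.frame(
  PMID = c(1, 1, 2, 3),
  miRNA = c("miR-21", "miR-155", "miR-21", "miR-21"),
  stringsAsFactors = FALSE
)

# Separate data frame listing each miRNA name and how often it occurs.
counts <- as.data.frame(
  table(miRNA = mirs$miRNA),
  responseName = "Mentions",  # hypothetical column name
  stringsAsFactors = FALSE
)
# Most frequent miRNA first.
counts <- counts[order(-counts$Mentions), ]
print(counts)
```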
plot_mir_count(), count_mir_threshold(), plot_mir_count_threshold()
Other count functions: count_mir_threshold(), count_snp(), plot_mir_count_threshold(), plot_mir_count()
Count occurrence of miRNA names above a threshold.
count_mir_threshold(df, threshold = 1, col.mir = miRNA, col.pmid = PMID)
df |
Data frame containing miRNA names and PubMed-IDs. |
threshold |
Integer or float. If |
col.mir |
Symbol. Column containing miRNA names. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
Count occurrence of miRNA names above a threshold. This threshold can either be an absolute value, e.g. 3, or a float between 0 and 1, e.g. 0.2.
If threshold is an absolute value, the number of distinct miRNA names mentioned in at least threshold abstracts is returned.
If threshold is a float between 0 and 1, the number of distinct miRNA names mentioned in at least that fraction of all abstracts in df is returned.
Integer with the number of distinct miRNA names in df that meet the threshold.
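The two threshold modes can be sketched in base R. count_above_threshold() is a hypothetical helper; treating a value of exactly 1 as an absolute threshold is an assumption, as is the use of "at least" (>=) for the comparison.

```r
# Hypothetical helper: number of distinct miRNAs meeting a threshold.
count_above_threshold <- function(df, threshold = 1) {
  # Count each miRNA at most once per abstract.
  pairs <- unique(df[, c("PMID", "miRNA")])
  abstracts_per_mir <- table(pairs$miRNA)
  if (threshold > 0 && threshold < 1) {
    # Fractional threshold: convert to a minimum number of abstracts.
    minimum <- threshold * length(unique(df$PMID))
  } else {
    minimum <- threshold
  }
  sum(abstracts_per_mir >= minimum)
}

# Made-up data: four abstracts in total.
mirs <- data.frame(
  PMID = c(1, 1, 2, 3, 4),
  miRNA = c("miR-21", "miR-155", "miR-21", "miR-21", "miR-34"),
  stringsAsFactors = FALSE
)

n_absolute <- count_above_threshold(mirs, threshold = 2)    # in at least 2 abstracts
n_relative <- count_above_threshold(mirs, threshold = 0.5)  # in at least 50% of abstracts
n_default  <- count_above_threshold(mirs)                   # in at least 1 abstract
```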
plot_mir_count_threshold(), count_mir(), plot_mir_count()
Other count functions: count_mir(), count_snp(), plot_mir_count_threshold(), plot_mir_count()
Count occurrence of SNPs in a data frame.
count_snp(df, col.snp = SNPs, col.pmid = PMID)
df |
Data frame containing SNPs and PubMed IDs. |
col.snp |
Symbol. Column containing SNPs. |
col.pmid |
Symbol. Column containing PubMed IDs. |
Count occurrence of SNPs in a data frame. The count of SNPs is returned as a separate data frame, only listing the SNPs and their respective frequency.
Data frame. Data frame containing SNPs and their respective frequency.
extract_snp(), get_snp(), subset_snp()
Other count functions: count_mir_threshold(), count_mir(), plot_mir_count_threshold(), plot_mir_count()
Count occurrence of targets in a data frame.
count_target(df, col.target = Target, add.df = TRUE)
df |
Data frame containing a column with targets. |
col.target |
Symbol. Column containing targets. |
add.df |
Boolean. If |
Count occurrence of targets in a data frame. The count of targets can either be returned as a separate data frame, only listing the targets and their respective frequency, or it can be added to the data frame provided as an extra column.
Data frame, either with the targets and their frequency as a new data frame, or with the frequency of targets added as a new column to the input data frame df.
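The two return shapes can be sketched in base R. count_targets() is a hypothetical helper, the column name Mentions is made up, and the mapping of add.df = TRUE to "add a column to df" is an assumption inferred from the argument name.

```r
# Hypothetical helper illustrating the two return shapes.
count_targets <- function(df, add.df = TRUE) {
  freq <- table(df$Target)
  if (add.df) {
    # Look up each row's target frequency and append it as a new column.
    df$Mentions <- as.integer(freq[as.character(df$Target)])
    return(df)
  }
  # Otherwise return a separate frequency table.
  out <- as.data.frame(freq, responseName = "Mentions", stringsAsFactors = FALSE)
  names(out)[1] <- "Target"
  out
}

# Made-up miRNA-target pairs.
targets <- data.frame(
  miRNA = c("miR-21", "miR-155", "miR-34"),
  Target = c("PTEN", "PTEN", "TP53"),
  stringsAsFactors = FALSE
)

with_counts <- count_targets(targets)
only_counts <- count_targets(targets, add.df = FALSE)
print(with_counts)
print(only_counts)
```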
join_targets(), plot_target_count()
Other target functions: join_mirtarbase(), join_targets(), plot_target_count(), plot_target_mir_scatter()
A dataset of PubMed abstracts on miRNAs in colorectal cancer.
df_crc
A data frame.
https://pubmed.ncbi.nlm.nih.gov/
The most recent miRTarBase version 8.0, containing miRNA stem, capitalized targets, and PMIDs.
df_mirtarbase
A data frame with the columns "miRNA_tarbase", "Target", and "PMID".
miRTarBase was published in: Hsi-Yuan Huang, Yang-Chi-Dung Lin, Jing Li, et al., miRTarBase 2020: updates to the experimentally validated microRNA–target interaction database, Nucleic Acids Research, Volume 48, Issue D1, 08 January 2020, Pages D148–D154, https://doi.org/10.1093/nar/gkz896
https://miRTarBase.cuhk.edu.cn:443/
A dataset of PubMed abstracts on miRNAs in pancreatic cancer.
df_panc
A data frame.
https://pubmed.ncbi.nlm.nih.gov/
Test dataset of 20 PubMed abstracts.
df_test
A data frame.
https://pubmed.ncbi.nlm.nih.gov/
Extract miRNA names from abstracts in a data frame.
extract_mir_df( df, threshold = 1, col.abstract = Abstract, extract_letters = FALSE )
df |
Data frame containing abstracts. |
threshold |
Integer. Specifies how often a miRNA must be mentioned in an abstract to be extracted. |
col.abstract |
Symbol. Column containing abstracts. |
extract_letters |
Boolean. If |
Extract miRNA names from abstracts in a data frame. miRNA names can
either be extracted with their stem only, e.g. miR-23, or with their trailing
letter, e.g. miR-23a. miRNA names are adapted to the most recent miRBase
version (e.g. miR-97, miR-102, miR-180(a/b) become miR-30a, miR-29a,
and miR-172(a/b), respectively). Additionally, how often a miRNA must be
mentioned in an
abstract to be extracted can be regulated via the threshold
argument.
Ultimately, abstracts not containing any miRNA names
are silently dropped.
As many abstracts do not adhere to the miRNA nomenclature,
it is recommended to extract only the miRNA stem with
extract_letters = FALSE
.
Data frame with miRNA names extracted from abstracts.
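As a rough illustration of stem extraction with a mention threshold, the following base R sketch applies a regular expression to a single string. The helper `extract_stems()` and its pattern are hypothetical and much simpler than what miRetrieve does; in particular, the sketch does not rename outdated miRNA names and does not cover families such as let-7.

```r
# Hypothetical helper, not part of miRetrieve: extract miRNA stems from one string.
extract_stems <- function(text, threshold = 1) {
  # Match e.g. "miR-23", "miR-23a", or "microRNA-21", ignoring case.
  hits <- regmatches(
    text,
    gregexpr("(?i)\\b(?:mir|microrna)-?([0-9]+)[a-z]?", text, perl = TRUE)
  )[[1]]
  if (length(hits) == 0) return(character(0))
  # Reduce each hit to its stem: unify the prefix and drop the trailing letter.
  stems <- paste0(
    "miR-",
    sub("(?i)^(?:mir|microrna)-?([0-9]+).*$", "\\1", hits, perl = TRUE)
  )
  # Keep only stems mentioned at least `threshold` times.
  counts <- table(stems)
  names(counts)[as.vector(counts) >= threshold]
}

text <- "miR-23a and miR-23b were upregulated, whereas microRNA-21 was not."
stems_once  <- extract_stems(text, threshold = 1)
stems_twice <- extract_stems(text, threshold = 2)
```

Here miR-23a and miR-23b collapse into the stem miR-23, so only that stem survives a threshold of 2.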
Other extract functions:
extract_mir_string()
,
extract_snp()
Extract miRNA names from a string.
extract_mir_string(string, threshold = 1, extract_letters = FALSE)
string |
String. String to search for miRNA names. |
threshold |
Integer. Specifies how often a miRNA must be mentioned in |
extract_letters |
Boolean. If |
Extract miRNA names from a string. miRNA names can either be extracted with their stem only, e.g. miR-23, or with their trailing letter, e.g. miR-23a. Furthermore, miRNA names are adapted to the most recent miRBase version (e.g. miR-97, miR-102, miR-180(a/b) become miR-30a, miR-29a, and miR-172(a/b), respectively).
Character vector containing miRNA names, if miRNA names are present in the string. If no miRNA names are present in the string, a message is returned saying "No miRNA found.".
Other extract functions:
extract_mir_df()
,
extract_snp()
Extract SNPs from abstracts in a data frame.
extract_snp( df, pattern = snp_pattern, col.abstract = Abstract, indicate = FALSE, discard = FALSE )
df |
Data frame containing abstracts. |
pattern |
String. Regex pattern to identify SNPs. |
col.abstract |
Symbol. Column containing abstracts. |
indicate |
Boolean. If |
discard |
Boolean. If |
Extract SNPs from abstracts in a data frame. SNPs are added to the data frame in a separate column. Furthermore, an optional column can indicate if SNPs are generally present in an abstract.
Data frame. If discard = FALSE
, return the data frame with
an additional column for SNPs.
If discard = TRUE
, return only abstracts containing SNPs.
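The behaviour described above can be approximated in base R. This is a sketch under stated assumptions: the pattern `rs[0-9]+` stands in for the package's `snp_pattern` (whose actual value is not shown here), and the column names `SNPs` and `SNP_present` as well as the toy abstracts are made up for illustration.

```r
# Toy abstracts; the text is invented for illustration.
abstracts <- data.frame(
  PMID = c("1", "2"),
  Abstract = c(
    "The rs2910164 polymorphism in miR-146a was genotyped; rs2910164 was associated with risk.",
    "miR-21 was upregulated in tumor tissue."
  ),
  stringsAsFactors = FALSE
)

# Assumed pattern for dbSNP-style identifiers; not necessarily snp_pattern.
assumed_pattern <- "rs[0-9]+"

# Collect the unique SNPs per abstract into one comma-separated string.
hits <- regmatches(abstracts$Abstract, gregexpr(assumed_pattern, abstracts$Abstract))
abstracts$SNPs <- vapply(
  hits,
  function(x) paste(unique(x), collapse = ", "),
  character(1)
)

# Optional indicator column, comparable in spirit to indicate = TRUE.
abstracts$SNP_present <- ifelse(nzchar(abstracts$SNPs), "Yes", "No")

# Keeping only abstracts with SNPs, comparable in spirit to discard = TRUE.
only_snp <- abstracts[nzchar(abstracts$SNPs), ]
```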
count_snp()
,
get_snp()
,
subset_snp()
Other extract functions:
extract_mir_df()
,
extract_mir_string()
Fit LDA-model with k
topics.
fit_lda( df, k, stopwords = stopwords_miretrieve, method = "gibbs", control = NULL, seed = 42, col.abstract = Abstract, col.pmid = PMID )
df |
Data frame containing abstracts and PubMed-IDs. |
k |
Integer. Number of topics to fit. Must be >=2. |
stopwords |
Data frame containing stop words. |
method |
String. Either |
control |
Control parameters for LDA modeling. For more information,
see the documentation of the |
seed |
Integer. Seed for reproducibility. |
col.abstract |
Column containing abstracts. |
col.pmid |
Column containing PubMed-ID. |
Fit LDA-model with k
topics from a data frame.
fit_lda()
is based on LDA()
from the package
topicmodels.
LDA-model.
Other LDA functions:
assign_topic_lda()
,
plot_lda_term()
,
plot_perplexity()
Generate a data frame containing stop words.
generate_stopwords(stopwords, combine_with = NULL)
stopwords |
Character vector. Vector containing stop words. |
combine_with |
Data frame containing stop words. Optional.
Data frame provided here must have only two columns, namely
|
Generate data frame containing stop words from a character vector. This data
frame consists of two columns, namely word
, containing the stop words, and
lexicon
, containing the string "self-defined".
Additionally, the created data frame can be combined with other stop words
containing data frames, e.g. tidytext::stop_words
or
stopwords_miretrieve
.
Data frame containing stop words.
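The shape of the resulting data frame can be reproduced by hand in base R. This is a sketch of the documented structure, not the package's code; the second data frame `other_stopwords` is a made-up stand-in for a stop word table such as tidytext::stop_words.

```r
# Self-defined stop words as a character vector.
my_words <- c("patient", "cell")

# Two columns as documented: `word` and `lexicon` ("self-defined").
self_defined <- data.frame(
  word    = my_words,
  lexicon = "self-defined",
  stringsAsFactors = FALSE
)

# Hypothetical second stop word table with the same two columns.
other_stopwords <- data.frame(
  word    = c("the", "of"),
  lexicon = "example",
  stringsAsFactors = FALSE
)

# Combining both tables row-wise.
combined <- rbind(self_defined, other_stopwords)
```

Because both tables share the same two columns, they can be stacked directly.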
Silge, Julia, and David Robinson. 2016. “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” JOSS 1 (3). The Open Journal. https://doi.org/10.21105/joss.00037.
combine_stopwords()
, stopwords_miretrieve, tidytext::stop_words
Other stopword functions:
combine_stopwords()
Identify top miRNA names distinct for one topic compared to another topic in a data frame.
get_distinct_mir_df( df, distinct, top = 5, topic = NULL, col.topic = Topic, col.mir = miRNA, col.pmid = PMID )
df |
Data frame containing at least two topics and miRNA names. |
distinct |
String. Name of the topic for which the top distinct miRNAs shall be identified. |
top |
Integer. Number of top miRNA names to extract for both topics. |
topic |
String. Vector of strings containing topic names to compare
miRNA names for. If |
col.topic |
Symbol. Column containing topic names. |
col.mir |
Symbol. Column containing miRNA names. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
Get top distinct miRNA names of one topic compared to another topic in a
data frame.
get_distinct_mir_df()
compares the top miRNA names of two topics and
returns the miRNA names that are exclusive for distinct
.
Character vector containing miRNA names distinct for distinct
compared to the second topic provided in topic
.
Other get functions:
get_distinct_mir_vec()
,
get_mir()
,
get_pmid()
,
get_shared_mir_df()
,
get_shared_mir_vec()
,
get_snp()
Identify miRNA names distinct for one vector compared to another vector.
get_distinct_mir_vec(mirna.vec.1, mirna.vec.2)
mirna.vec.1 |
Character vector. First vector containing miRNA names. |
mirna.vec.2 |
Character vector. Second vector containing miRNA names. |
Get distinct miRNA names of one vector compared to another vector.
get_distinct_mir_vec()
compares two vectors containing miRNA names and
returns the miRNA names that are exclusive for mirna.vec.1
.
Character vector containing miRNA names distinct for mirna.vec.1
compared to mirna.vec.2
.
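Conceptually, this corresponds to a set difference. A minimal base R sketch with made-up miRNA names, assuming the comparison is a plain exact match on names:

```r
mirna.vec.1 <- c("miR-21", "miR-126", "miR-34")
mirna.vec.2 <- c("miR-126", "miR-155")

# miRNA names present in the first vector but absent from the second.
distinct_for_first <- setdiff(mirna.vec.1, mirna.vec.2)
```

Note that the operation is not symmetric: swapping the two vectors returns the names exclusive to the second vector instead.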
Other get functions:
get_distinct_mir_df()
,
get_mir()
,
get_pmid()
,
get_shared_mir_df()
,
get_shared_mir_vec()
,
get_snp()
Get miRNA names from a data frame. These miRNA names can either be the most frequent ones, or the ones exceeding a threshold.
get_mir( df, top = NULL, threshold = NULL, topic = NULL, col.mir = miRNA, col.pmid = PMID, col.topic = Topic )
df |
Data frame containing miRNA names. If |
top |
Integer. Optional. Specifies number of most frequent miRNA names
to return. If neither |
threshold |
Integer or float. Optional. If |
topic |
String. Optional. Character vector specifying which topics to obtain miRNA names from. |
col.mir |
Symbol. Column containing miRNA names. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
col.topic |
Symbol. Column containing topic names. |
Get miRNA names from a data frame. These miRNA names can either be the most
frequent ones, or the ones exceeding a threshold. Furthermore, if the data
frame contains abstracts of different topics, only the miRNA names of
specific topics can be obtained by setting the topic
argument.
To get the most frequent miRNA names, set the top
argument. top
determines how many most frequent miRNA names are returned, according to their
rank. Ties among the most
frequently mentioned miRNAs are treated as
the same rank, e.g. if miR-126, miR-34, and miR-29 were all mentioned
the most often with the same frequency, they would all be returned by
specifying top = 1
, top = 2
, and top = 3
.
To get the miRNA names exceeding a threshold, set the threshold
argument.
threshold
can either be an absolute value, e.g. 3, or a float between 0 and 1,
e.g. 0.2.
If threshold
is an absolute value, get_mir()
returns only the miRNA names
mentioned in at least threshold
abstracts.
If threshold
is a float between 0 and 1, get_mir()
returns
only miRNA names mentioned in at least the fraction threshold
of all abstracts, e.g. 0.2 corresponds to at least 20% of the abstracts. threshold
requires the data frame to have a column with
PubMed IDs.
If neither top
nor threshold
is set, top
is automatically set to 5
.
Character vector containing miRNA names.
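The tie handling and the two kinds of threshold can be illustrated in base R. This is a sketch of the described behaviour, not miRetrieve's code; the helpers `top_mir()` and `mir_above()` and the toy data are hypothetical, and the use of minimum ranks for ties is an assumption chosen because it reproduces the example above.

```r
# Toy data: one row per miRNA mention in an abstract.
df <- data.frame(
  PMID  = c("1", "1", "2", "2", "3", "3", "4"),
  miRNA = c("miR-126", "miR-34", "miR-126", "miR-34", "miR-29", "miR-21", "miR-29"),
  stringsAsFactors = FALSE
)

# Number of distinct abstracts mentioning each miRNA.
abstracts_per_mir <- table(unique(df)$miRNA)

# Hypothetical helper: tied miRNAs share the same (minimum) rank.
top_mir <- function(counts, top) {
  ranks <- rank(-as.vector(counts), ties.method = "min")
  names(counts)[ranks <= top]
}

# Hypothetical helper: a threshold between 0 and 1 is read as a fraction
# of all abstracts, any other value as an absolute number of abstracts.
mir_above <- function(counts, threshold, n_abstracts) {
  values <- as.vector(counts)
  if (threshold > 0 && threshold < 1) values <- values / n_abstracts
  names(counts)[values >= threshold]
}

top_1 <- top_mir(abstracts_per_mir, top = 1)
top_3 <- top_mir(abstracts_per_mir, top = 3)
frequent <- mir_above(abstracts_per_mir, 0.5, length(unique(df$PMID)))
```

In this toy data, miR-126, miR-34, and miR-29 are each mentioned in two abstracts, so `top = 1` and `top = 3` return the same three names.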
Other get functions:
get_distinct_mir_df()
,
get_distinct_mir_vec()
,
get_pmid()
,
get_shared_mir_df()
,
get_shared_mir_vec()
,
get_snp()
Get PubMed-IDs of a data frame.
get_pmid(df, col.pmid = PMID, copy = TRUE)
df |
Data frame containing PubMed-IDs. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
copy |
Boolean. If |
Get PubMed-IDs of a data frame. get_pmid
returns either a character
vector, containing PubMed-IDs, or copies PubMed-IDs to clipboard. If PubMed-IDs
are copied to the clipboard, they can be used e.g. to search for abstracts on
PubMed.
Copy to clipboard or character vector.
If copy = TRUE
, get_pmid()
copies
PubMed-IDs to clipboard.
If copy = FALSE
, get_pmid()
returns a character
vector, containing PubMed-IDs.
Other get functions:
get_distinct_mir_df()
,
get_distinct_mir_vec()
,
get_mir()
,
get_shared_mir_df()
,
get_shared_mir_vec()
,
get_snp()
Get SNPs from a data frame.
get_snp(df, row = NULL, top = NULL, col.snp = SNPs, col.pmid = PMID)
df |
Data frame containing SNPs. If |
row |
Integer. Optional. Specifies row from which SNP shall be obtained. Works best
with a data frame listing counts only as from |
top |
Integer. Optional. Specifies number of most frequent SNPs to return. |
col.snp |
Symbol. Column containing SNPs. |
col.pmid |
Symbol. Column containing PubMed IDs. Necessary if the data frame provided is not a count data frame. |
Get SNPs from a data frame.
If a data frame containing SNP counts as from count_snp()
is provided,
these SNPs are specified by the row they are listed in. To get the SNPs by
row, set the row
argument.
If a data frame with PubMed IDs is provided, these SNPs are specified by
their top occurrence. To get the SNPs by frequency, set the top
argument.
If neither row
nor top
is provided, row
is automatically set to 1
.
String or character vector containing SNPs.
extract_snp()
,
count_snp()
,
subset_snp()
Other get functions:
get_distinct_mir_df()
,
get_distinct_mir_vec()
,
get_mir()
,
get_pmid()
,
get_shared_mir_df()
,
get_shared_mir_vec()
Indicate if a miRNA name is contained in an abstract with "Yes"/"No".
indicate_mir(df, indicate.mir, col.mir = miRNA)
df |
Data frame containing miRNA names. |
indicate.mir |
Character vector. Vector containing miRNA names to indicate. |
col.mir |
Symbol. Column containing miRNA names. |
Indicate if a miRNA name is contained in an abstract with "Yes"/"No".
This requires miRNA names already to be extracted, e.g. with extract_mir_df()
,
and to be stored in a separate column, specified by col.mir
.
indicate_mir()
adds another column to a data frame which bears the name
of the miRNA(s) of interest. Within this column, a "Yes" or "No" specifies
if this miRNA name is contained in the corresponding abstract.
Data frame with as many columns added as miRNA names given
in indicate.mir
.
Per column, a "Yes" or "No" indicates if the miRNA name of interest
is present in the
corresponding abstract.
extract_mir_df()
, indicate_term()
Other indicate functions:
indicate_term()
Indicate if a term is contained in abstracts.
indicate_term( df, term, threshold = 1, case = FALSE, discard = FALSE, col.abstract = Abstract )
df |
Data frame containing abstracts. |
term |
Character vector. Vector containing terms to indicate. |
threshold |
Integer. Sets how often a term must be in an abstract to be considered "present". |
case |
Boolean. If |
discard |
Boolean. If |
col.abstract |
Symbol. Column containing abstracts. |
Indicate if a term is contained in an abstract. Terms provided can either
be case sensitive or insensitive. Per term, a new column is added to the data
frame indicating if the term is present in an abstract. Furthermore, whether a term
is considered "present" in an abstract can be regulated via the threshold
argument. threshold
determines how often a term must be in an abstract
to be considered "present".
Data frame. If discard = FALSE
, the original data frame with additional
columns per term is returned. If discard = TRUE
, only abstracts containing the
terms in term
are returned.
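The interplay of case sensitivity and threshold can be sketched in base R. The helper `count_term()`, the toy abstracts, and the "Yes"/"No" coding of the new column are assumptions for illustration; the sketch also treats the term as a regular expression, which may differ from the package.

```r
# Toy abstracts; the text is invented for illustration.
abstracts <- data.frame(
  PMID = c("1", "2"),
  Abstract = c(
    "Atherosclerosis is driven by inflammation. miR-126 limits atherosclerosis.",
    "miR-21 promotes fibrosis."
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count how often a term occurs in each text.
count_term <- function(text, term, case = FALSE) {
  hits <- gregexpr(term, text, ignore.case = !case)
  # gregexpr() returns -1 when there is no match, so count positive positions.
  vapply(hits, function(h) sum(h > 0), integer(1))
}

term <- "atherosclerosis"
n_insensitive <- count_term(abstracts$Abstract, term, case = FALSE)
n_sensitive   <- count_term(abstracts$Abstract, term, case = TRUE)

# With a threshold of 2, only the first abstract counts as containing the term.
abstracts[[term]] <- ifelse(n_insensitive >= 2, "Yes", "No")

# Keeping only matching abstracts, comparable in spirit to discard = TRUE.
kept <- abstracts[n_insensitive >= 2, ]
```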
Other indicate functions:
indicate_mir()
Add miRNA targets from miRTarBase version 8.0 to a data frame.
join_mirtarbase( df, col.pmid.df = PMID, col.topic.df = NULL, filter_na = TRUE, reduce = FALSE )
df |
Data frame containing PubMed-IDs that the miRNA targets shall be joined to. |
col.pmid.df |
Symbol. Column containing PubMed-IDs in |
col.topic.df |
Symbol. Optional. Only important if |
filter_na |
Boolean. If |
reduce |
Boolean. If |
Add miRNA targets from miRTarBase version 8.0 to a data frame.
join_mirtarbase()
can return two different data frames, regulated by reduce
:
If reduce = FALSE
, join_mirtarbase()
adds targets from miRTarBase 8.0
to the data frame in a new column. These targets then correspond
to the targets determined in the research paper, but do not necessarily correspond
to the miRNA names mentioned in the abstract.
If reduce = TRUE
, join_mirtarbase()
adds targets from
miRTarBase 8.0 to the data frame in a new column. However, an
altered data frame is returned, containing the PubMed-IDs, targets, and
miRNAs from miRTarBase 8.0.
miRTarBase was published in
Hsi-Yuan Huang, Yang-Chi-Dung Lin, Jing Li, et al., miRTarBase 2020: updates to the experimentally validated microRNA–target interaction database, Nucleic Acids Research, Volume 48, Issue D1, 08 January 2020, Pages D148–D154, https://doi.org/10.1093/nar/gkz896
Data frame containing miRNA targets.
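The join by PubMed-ID can be pictured with base R's `merge()`. This is an illustrative sketch, not the package's implementation: the toy tables and their values are made up, the target table only mimics the documented columns of df_mirtarbase, and dropping rows without a target is an assumed reading of filter_na.

```r
# Toy data frame of abstracts with extracted miRNA names.
df <- data.frame(
  PMID  = c("101", "102", "103"),
  miRNA = c("miR-21", "miR-126", "miR-34"),
  stringsAsFactors = FALSE
)

# Toy target table with the columns documented for df_mirtarbase.
targets <- data.frame(
  miRNA_tarbase = c("miR-21", "miR-155"),
  Target        = c("PTEN", "SOCS1"),
  PMID          = c("101", "102"),
  stringsAsFactors = FALSE
)

# Left join: every row of df is kept, targets are matched by PubMed-ID.
joined <- merge(df, targets, by = "PMID", all.x = TRUE)

# Assumed effect of filtering: drop abstracts without a known target.
joined_no_na <- joined[!is.na(joined$Target), ]

# Reduced view holding only PubMed-IDs, targets, and miRNAs from the target table.
reduced <- joined_no_na[, c("PMID", "Target", "miRNA_tarbase")]
```

PubMed-ID 102 shows why joined targets do not necessarily match the miRNA in the abstract: the abstract mentions miR-126, while the target entry for that paper belongs to miR-155.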
Other target functions:
count_target()
,
join_targets()
,
plot_target_count()
,
plot_target_mir_scatter()
Add miRNA targets from an external xlsx-file to a data frame.
join_targets( df, excel_file, col.pmid.excel, col.target.excel, col.mir.excel = NULL, col.pmid.df = PMID, col.topic.df = NULL, filter_na = TRUE, stem_mir_excel = TRUE, reduce = FALSE )
df |
Data frame containing PubMed-IDs that the miRNA targets shall be joined to. |
excel_file |
xlsx-file. xlsx-file containing miRNA targets and PubMed-IDs. |
col.pmid.excel |
String. Column containing PubMed-IDs of the
|
col.target.excel |
String. Column containing targets of the
|
col.mir.excel |
String. Optional. Column containing miRNAs of the
|
col.pmid.df |
Symbol. Column containing PubMed-IDs in |
col.topic.df |
Symbol. Optional. Only important if |
filter_na |
Boolean. If |
stem_mir_excel |
Boolean. If |
reduce |
Boolean. If |
Add miRNA targets from an external xlsx-file to a data frame. To add the targets to the
data frame, the xlsx-file and the data frame need to have one column in
common, such as PubMed-IDs.
join_targets()
can return two different data frames, regulated by reduce
:
If reduce = FALSE
, join_targets()
adds targets from an
xlsx-file to the data frame in a new column. These targets then correspond
to the targets determined in the research paper, but do not necessarily correspond
to the miRNA names mentioned in the abstract.
If reduce = TRUE
, join_targets()
adds targets from an
xlsx-file to the data frame in a new column. However, an
altered data frame is returned, containing the PubMed-IDs, targets, and
miRNAs from the xlsx-file. For reduce = TRUE
to work, the xlsx-file provided
must contain a column with miRNA names.
Data frame containing miRNA targets.
Other target functions:
count_target()
,
join_mirtarbase()
,
plot_target_count()
,
plot_target_mir_scatter()
Vector containing stop words for n-grams, based on tidytext::stop_words.
ngram_stopwords
Character vector.
tidytext::stop_words
Keywords to identify abstracts investigating miRNAs in patients.
patients_keywords
An object of class character
of length 10.
Plot terms associated with LDA-fitted topics.
plot_lda_term(lda_model, top.terms = 10, title = NULL)
lda_model |
LDA-model. |
top.terms |
Integer. Top terms to plot per topic. |
title |
String. Plot title. |
Plot terms associated with LDA-fitted topics. For each topic in the LDA-model,
the top terms are plotted. Plotting top.terms
for each topic can help
identify its subject.
Bar plot with top terms per topic.
Other LDA functions:
assign_topic_lda()
,
fit_lda()
,
plot_perplexity()
Plot count of most frequently mentioned miRNA names in a data frame.
plot_mir_count( df, top = 10, colour = "steelblue3", col.mir = miRNA, title = NULL )
df |
Data frame containing miRNA names. |
top |
Integer. Specifies number of most frequent miRNA names to plot. |
colour |
String. Colour of bar plot. |
col.mir |
Symbol. Column containing miRNA names. |
title |
String. Plot title. |
Plot count of most frequently mentioned miRNA names in a data frame. How many
most frequently mentioned miRNAs are plotted is determined via the top
argument. Ties among the most frequently mentioned miRNAs are treated as
the same rank, e.g. if miR-126, miR-34, and miR-29 were all mentioned
the most often, they would all be plotted by specifying top = 1
, top = 2
,
or top = 3
.
Bar plot with the most frequently mentioned miRNA names in df
.
count_mir()
, count_mir_threshold()
, plot_mir_count_threshold()
Other count functions:
count_mir_threshold()
,
count_mir()
,
count_snp()
,
plot_mir_count_threshold()
Plot occurrence count of distinct miRNA names over different thresholds.
plot_mir_count_threshold( df, start = 1, end = 5, bins = NULL, colour = "steelblue3", col.mir = miRNA, col.pmid = PMID, title = NULL )
df |
Data frame containing columns with miRNAs and PubMed-IDs. |
start |
Integer or float. Must be greater than 0 and smaller than
|
end |
Integer or float. Must be greater than 0 and greater than
|
bins |
Integer. Optional. Only necessary if |
colour |
String. Colour of bar plot. |
col.mir |
Symbol. Column containing miRNAs. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
Plot occurrence of distinct miRNA names over different thresholds.
These thresholds can either be absolute values or floating values between 0
and 1.
If the thresholds are absolute values, the number of distinct miRNA names
mentioned in at least n abstracts is plotted, where n
is the range of thresholds defined by start
and end
.
If the thresholds are floating values, bins
must be specified as well.
Then the number of distinct miRNA names
mentioned in at least n abstracts over bins
is plotted, where n is the
range of thresholds
between start
and end
.
Overall, plotting can help in identifying if the abstracts
at hand mention different miRNAs in a balanced way, or if there are few miRNAs
dominating the field.
Bar plot counting the occurrence of miRNA names above different thresholds.
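The quantity behind the plot can be computed in base R. This sketch uses made-up data and illustrates the counting only; how miRetrieve maps floating thresholds onto bins is not reproduced here.

```r
# Toy data: one row per miRNA mention in an abstract.
df <- data.frame(
  PMID  = c("1", "2", "3", "1", "2", "3"),
  miRNA = c("miR-21", "miR-21", "miR-21", "miR-126", "miR-126", "miR-34"),
  stringsAsFactors = FALSE
)

# Number of distinct abstracts mentioning each miRNA.
abstracts_per_mir <- table(unique(df[, c("PMID", "miRNA")])$miRNA)

# Absolute thresholds: how many distinct miRNAs are mentioned in at least n abstracts?
thresholds <- 1:3
n_mirnas <- vapply(
  thresholds,
  function(n) sum(abstracts_per_mir >= n),
  integer(1)
)
names(n_mirnas) <- thresholds

# Floating threshold: share of all abstracts mentioning each miRNA.
share <- as.vector(abstracts_per_mir) / length(unique(df$PMID))
n_at_least_half <- sum(share >= 0.5)
```

A steep drop from one threshold to the next suggests that a few miRNAs dominate the abstracts at hand.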
count_mir_threshold()
, count_mir()
, plot_mir_count()
Other count functions:
count_mir_threshold()
,
count_mir()
,
count_snp()
,
plot_mir_count()
Plot development of miRNA name mentioning over time.
plot_mir_development( df, mir, start = NULL, end = NULL, linetype = "miRNA", alpha = 0.8, width = 0.3, col.mir = miRNA, col.year = Year, title = NULL )
df |
Data frame containing miRNA names and publication years. |
mir |
Character vector. Vector containing miRNA names to plot. |
start |
Numeric. Optional. Specifies start year. If |
end |
Numeric. Optional. Specifies end year. If |
linetype |
String. Specifies linetype. |
alpha |
Float. Opacity of lines. |
width |
Float. Width of dodging lines. |
col.mir |
Symbol. Column containing miRNA names. |
col.year |
Symbol. Column containing year. |
title |
String. Plot title. |
Plot how often a miRNA name was mentioned per year.
Line plot displaying how often a miRNA name was mentioned per year.
Other miR development functions:
plot_mir_new()
Plot number of newly mentioned miRNA names/year.
plot_mir_new( df, threshold = 1, start = NULL, end = NULL, colour = "steelblue3", col.mir = miRNA, col.year = Year, title = NULL )
df |
Data frame containing miRNA names and publication years. |
threshold |
Integer. Specifies how often a miRNA must be mentioned in a year to be considered "mentioned". |
start |
Integer. Optional. Beginning of publication period.
If |
end |
Integer. Optional. End of publication period.
If |
colour |
String. Colour of bar plot. |
col.mir |
Symbol. Column containing miRNA names. |
col.year |
Symbol. Column containing publication year. |
title |
String. Plot title. |
Plot how many miRNAs are mentioned for the first time in different years.
Whether a miRNA is considered to be "mentioned" in a year can be regulated
via the threshold
argument. If, for example, threshold
is set to 3, but a
miRNA is mentioned only twice in a year, it is not considered
to be "mentioned" for this year.
Bar plot displaying the number of newly mentioned miRNA names/year.
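The effect of the threshold on the year of first mention can be sketched in base R. The helper `new_per_year()` and the toy data are hypothetical; the sketch assumes one row per mention and is not the package's implementation.

```r
# Toy data: one row per miRNA mention with its publication year.
df <- data.frame(
  Year  = c(2010, 2010, 2010, 2011, 2011, 2011, 2012),
  miRNA = c("miR-21", "miR-21", "miR-126", "miR-126", "miR-126", "miR-21", "miR-34"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of miRNAs first "mentioned" in each year.
new_per_year <- function(df, threshold = 1) {
  # Mentions per year and miRNA.
  counts <- as.data.frame(
    table(Year = df$Year, miRNA = df$miRNA),
    stringsAsFactors = FALSE
  )
  # A miRNA only counts as mentioned in a year if it reaches the threshold.
  counts <- counts[counts$Freq >= threshold, ]
  counts$Year <- as.integer(counts$Year)
  # First year in which each miRNA reaches the threshold.
  first_year <- tapply(counts$Year, counts$miRNA, min)
  table(as.vector(first_year))
}

new_threshold_1 <- new_per_year(df, threshold = 1)
new_threshold_2 <- new_per_year(df, threshold = 2)
```

With a threshold of 2, miR-126 moves from 2010 to 2011, and miR-34 is never counted because it is mentioned only once.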
Other miR development functions:
plot_mir_development()
Plot count of top terms associated with a miRNA name.
plot_mir_terms( df, mir, top = 20, tf.idf = FALSE, token = "words", ..., stopwords = stopwords_miretrieve, stopwords_ngram = TRUE, normalize = TRUE, colour = "steelblue3", col.mir = miRNA, col.abstract = Abstract, col.pmid = PMID, title = NULL )
df |
Data frame containing miRNA names, abstracts, and PubMed-IDs. |
mir |
String. miRNA name of interest. |
top |
Integer. Number of top terms to plot. |
tf.idf |
Boolean. If |
token |
String. Specifies how abstracts shall be split up. Taken from
|
... |
Additional arguments for tokenization, if necessary. |
stopwords |
Data frame containing stop words. |
stopwords_ngram |
Boolean. Specifies if stop words shall be removed
from abstracts when using ngrams. Only applied when |
normalize |
Boolean. If |
colour |
String. Colour of bar plot. |
col.mir |
Symbol. Column containing miRNA names. |
col.abstract |
Symbol. Column containing abstracts. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
Plot count of top terms associated with a miRNA name.
Top terms associated with mir
have to be in df
as abstracts.
Number of top terms to plot is regulated via the top
argument.
Terms can either be evaluated as their count or in a tf-idf fashion.
If terms are evaluated as their count, they can either be
evaluated as their raw count, i.e. in how many abstracts they are mentioned
in conjunction with the miRNA name, or as their relative count, i.e.
in how many abstracts containing the miRNA they are mentioned compared to all
abstracts containing the miRNA.
If terms are evaluated in a tf-idf fashion, miRNA names are considered as
separate documents and
terms often associated with one miRNA, but not with other miRNAs, get
more weight.
plot_mir_terms()
is based on the tools available in the tidytext package.
Bar plot displaying the count of the top terms associated with a miRNA name.
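To make the tf-idf idea concrete, the following base R sketch treats each miRNA as one document and uses the common definition tf × ln(N / number of documents containing the term). The term counts are made up, and whether miRetrieve uses exactly this weighting is an assumption.

```r
# Toy term counts per miRNA; each miRNA is treated as one document.
term_counts <- data.frame(
  miRNA = c("miR-21", "miR-21", "miR-126", "miR-126"),
  word  = c("apoptosis", "tumor", "angiogenesis", "tumor"),
  n     = c(4, 6, 5, 5),
  stringsAsFactors = FALSE
)

# Term frequency: share of a term among all terms of the same miRNA.
total_per_mir <- tapply(term_counts$n, term_counts$miRNA, sum)
tf <- term_counts$n / total_per_mir[term_counts$miRNA]

# Inverse document frequency: terms shared by all miRNAs get weight 0.
n_docs <- length(unique(term_counts$miRNA))
docs_per_word <- tapply(
  term_counts$miRNA,
  term_counts$word,
  function(x) length(unique(x))
)
idf <- log(n_docs / docs_per_word[term_counts$word])

term_counts$tf_idf <- as.vector(tf * idf)
```

"tumor" occurs with both miRNAs and therefore receives a tf-idf of 0, whereas "apoptosis" and "angiogenesis" are specific to one miRNA and receive a positive weight.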
plot_wordcloud()
, tidytext::unnest_tokens()
Other miR term functions:
plot_wordcloud()
Plot perplexity score of various LDA models.
plot_perplexity( df, start = 2, end = 5, stopwords = stopwords_miretrieve, method = "gibbs", control = NULL, col.abstract = Abstract, col.pmid = PMID, title = NULL )
df |
Data frame containing abstracts and PubMed-IDs. |
start |
Integer. Minimum amount of |
end |
Integer. Maximum amount of |
stopwords |
Data frame containing stop words. |
method |
String. Either |
control |
Control parameters for LDA modeling. For more information,
see the documentation of the |
col.abstract |
Column containing abstracts. |
col.pmid |
Column containing PubMed-ID. |
title |
String. Plot title. |
Plot perplexity score of various LDA models. plot_perplexity()
fits
different LDA models for k
topics in the range
between start
and end
. For each
LDA model, the perplexity score is plotted against the corresponding value of
k
.
Plotting the perplexity score of various LDA models
can help in identifying the optimal number of topics to fit an LDA model for.
plot_perplexity()
is based on LDA()
from the package
topicmodels.
Elbow plot displaying perplexity scores of different LDA models.
Other LDA functions:
assign_topic_lda()
,
fit_lda()
,
plot_lda_term()
Plot frequency of animal model scores in abstracts.
plot_score_animals( df, keywords = animal_keywords, case = FALSE, bins = NULL, colour = "steelblue3", col.abstract = Abstract, col.pmid = PMID, title = NULL )
df |
Data frame containing abstracts. |
keywords |
Character vector. Vector containing keywords. The animal
model score is calculated based on these keywords. How much weight a keyword
in |
case |
Boolean. If |
bins |
Integer. Specifies how many bins are used to plot
the distribution. If |
colour |
String. Colour of histogram. |
col.abstract |
Symbol. Column containing abstracts. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
Plots a frequency distribution of animal model scores in abstracts of a
data frame. The animal model score is influenced by the choice of
terms in keywords
.
Plotting the distribution can help in deciding whether the
terms are well-chosen, or in choosing the right threshold to decide
which abstracts are considered to contain animal models.
Histogram displaying the distribution of animal scores in abstracts.
Other score functions:
assign_topic()
,
calculate_score_animals()
,
calculate_score_biomarker()
,
calculate_score_patients()
,
calculate_score_topic()
,
plot_score_biomarker()
,
plot_score_patients()
,
plot_score_topic()
Plot frequency of biomarker scores in abstracts.
plot_score_biomarker( df, keywords = biomarker_keywords, case = FALSE, bins = NULL, colour = "steelblue3", col.abstract = Abstract, col.pmid = PMID, title = NULL )
df |
Data frame containing abstracts. |
keywords |
Character vector. Vector containing keywords. The biomarker
score is calculated based on these keywords. How much weight a keyword
in |
case |
Boolean. If |
bins |
Integer. Specifies how many bins are used to plot
the distribution. If |
colour |
String. Colour of histogram. |
col.abstract |
Symbol. Column containing abstracts. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
Plots a frequency distribution of biomarker scores in abstracts of a
data frame. The biomarker score is influenced by the choice of
terms in keywords
.
Plotting the distribution can help in deciding whether the
terms are well-chosen, or in choosing the right threshold to decide
which abstracts are considered to describe the use of miRNAs as biomarkers.
Histogram displaying the distribution of biomarker scores in abstracts.
Other score functions:
assign_topic()
,
calculate_score_animals()
,
calculate_score_biomarker()
,
calculate_score_patients()
,
calculate_score_topic()
,
plot_score_animals()
,
plot_score_patients()
,
plot_score_topic()
Plot frequency of patient scores in abstracts.
plot_score_patients( df, keywords = patients_keywords, case = FALSE, bins = NULL, colour = "steelblue3", col.abstract = Abstract, col.pmid = PMID, title = NULL )
df |
Data frame containing abstracts. |
keywords |
Character vector. Vector containing keywords. The score is
calculated based on these keywords. How much weight a keyword in |
case |
Boolean. If |
bins |
Integer. Specifies how many bins are used to plot
the distribution. If |
colour |
String. Colour of histogram. |
col.abstract |
Symbol. Column containing abstracts. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
Plots a frequency distribution of patient scores in abstracts of a
data frame. The patient score is influenced by the choice of
terms in keywords
.
Plotting the distribution can help in deciding whether the
terms are well-chosen, or in choosing the right threshold to decide
which abstracts are considered to contain patient material.
Histogram displaying the distribution of patient scores in abstracts.
Other score functions:
assign_topic()
,
calculate_score_animals()
,
calculate_score_biomarker()
,
calculate_score_patients()
,
calculate_score_topic()
,
plot_score_animals()
,
plot_score_biomarker()
,
plot_score_topic()
Plot frequency of self-chosen topic scores in abstracts.
plot_score_topic( df, keywords, case = FALSE, name.topic = "TOPIC", bins = NULL, colour = "steelblue3", col.abstract = Abstract, col.pmid = PMID, title = NULL )
df |
Data frame containing abstracts. |
keywords |
Character vector. Vector containing keywords. How much weight
a keyword in |
case |
Boolean. If |
name.topic |
String. Name of the topic. |
bins |
Integer. Specifies how many bins are used to plot
the distribution. If |
colour |
String. Colour of histogram. |
col.abstract |
Symbol. Column containing abstracts. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
Plots a frequency distribution of self-chosen topic scores in abstracts of a
data frame. The topic score is influenced by the choice of
terms in keywords
. Plotting the distribution can help in choosing the right
threshold to decide which abstracts correspond to the self-chosen
topic.
Histogram displaying the distribution of self-chosen topic scores in abstracts.
calculate_score_topic()
, assign_topic()
Other score functions:
assign_topic()
,
calculate_score_animals()
,
calculate_score_biomarker()
,
calculate_score_patients()
,
calculate_score_topic()
,
plot_score_animals()
,
plot_score_biomarker()
,
plot_score_patients()
Plot count of miRNA targets.
plot_target_count( df, top = NULL, threshold = NULL, colour = "steelblue3", col.target = Target, title = NULL )
df |
Data frame with miRNA targets. |
top |
Numeric. Specifies number of top targets to be plotted. |
threshold |
Numeric. Specifies how often a target must be in |
colour |
String. Colour of bar plot. |
col.target |
Symbol. Column containing miRNA targets. |
title |
String. Plot title. |
Plot count of miRNA targets as a bar plot. How many targets are plotted is determined by either the top or the threshold argument.
If top is given, the targets with the highest counts are plotted. Targets tied for a count share the same rank, e.g. if PTEN, AKT, and VEGFA all had the highest count, all three would be plotted with top = 1, top = 2, or top = 3.
If threshold is given, only targets with a count of at least threshold are plotted.
If neither top nor threshold is given, top is automatically set to 5.
Bar plot with target counts.
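The tie handling described above can be sketched in base R. The helper top_with_ties and the toy target vector are hypothetical and only mimic the documented behaviour; they are not taken from the package source.

```r
# Toy target mentions, invented for illustration only.
targets <- c("PTEN", "PTEN", "AKT", "AKT", "VEGFA", "VEGFA", "TP53")
counts <- sort(table(targets), decreasing = TRUE)

# Hypothetical helper: keep every target whose count is at least as high as
# the count found at position `top`, so tied targets are never split.
top_with_ties <- function(counts, top) {
  cutoff <- counts[[min(top, length(counts))]]
  names(counts)[as.vector(counts) >= cutoff]
}

# Hypothetical helper: keep targets with a count of at least `threshold`.
above_threshold <- function(counts, threshold) {
  names(counts)[as.vector(counts) >= threshold]
}

print(top_with_ties(counts, 1))    # the three tied targets
print(top_with_ties(counts, 3))    # still the same three targets
print(above_threshold(counts, 2))  # targets counted at least twice
```

Because the three tied targets share the first rank, any top value from 1 to 3 selects the same set; only top = 4 would add the next target.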
count_target(), join_targets()
Other target functions: count_target(), join_mirtarbase(), join_targets(), plot_target_mir_scatter()
Plot targets and corresponding miRNAs as a scatter plot.
plot_target_mir_scatter( df, mir = NULL, target = NULL, top = NULL, threshold = NULL, filter_for = "target", col.target = Target, col.mir = miRNA, col.topic = Topic, col.pmid = PMID, title = NULL, height = 0.05, width = 0.05, alpha = 0.6 )
df |
Data frame containing targets and miRNA names. |
mir |
String or character vector. Specifies which miRNAs to plot. |
target |
String or character vector. Specifies which targets to plot. |
top |
Numeric. Specifies number of top targets/miRNA names to be plotted. |
threshold |
Numeric. Specifies how often a target/miRNA name must be in
|
filter_for |
String. Must either be |
col.target |
Symbol. Column containing miRNA targets. |
col.mir |
Symbol. Column containing miRNA names. |
col.topic |
Symbol. Column containing topic names. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
height |
Double. Specifies height of jitter. |
width |
Double. Specifies width of jitter. |
alpha |
Double. Specifies opacity of points. |
Plot targets and corresponding miRNAs as a scatter plot. The filter_for argument determines whether the plot focuses on the top targets and their corresponding miRNAs, or on the top miRNA names and their corresponding targets. What counts as "top targets" or "top miRNA names" is set via the top and threshold arguments.
If top is given, df is filtered for the most frequent targets/miRNA names.
If threshold is given, df is filtered for all targets/miRNA names mentioned at least threshold times.
If neither top nor threshold is given, top is automatically set to 5.
Plotting miRNAs against their targets shows whether one miRNA regulates many targets, or whether one target is regulated by many miRNAs. Furthermore, the miRNA-target interactions are labelled according to their topic in col.topic, which facilitates comparison of miRNA-target interactions across different topics.
Scatter plot with targets and corresponding miRNAs.
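To make the two possible viewpoints concrete, here is a minimal base-R sketch that filters miRNA-target pairs either by frequent targets or by frequent miRNA names. The helper filter_pairs, its focus argument and its values, and the toy data are all hypothetical; they are not the accepted values of filter_for and not the package's implementation.

```r
# Toy miRNA-target pairs, invented for illustration only.
pairs <- data.frame(
  miRNA  = c("miR-21", "miR-21", "miR-155", "miR-21", "miR-126"),
  Target = c("PTEN", "PDCD4", "PTEN", "PTEN", "VEGFA"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep pairs whose target (or miRNA name) is mentioned
# at least `threshold` times, depending on the chosen focus.
filter_pairs <- function(df, focus = c("target", "mirna"), threshold = 2) {
  focus <- match.arg(focus)
  col <- if (focus == "target") "Target" else "miRNA"
  freq <- table(df[[col]])
  keep <- names(freq)[as.vector(freq) >= threshold]
  df[df[[col]] %in% keep, , drop = FALSE]
}

by_target <- filter_pairs(pairs, "target")  # frequent targets, their miRNAs
by_mirna  <- filter_pairs(pairs, "mirna")   # frequent miRNAs, their targets
print(by_target)
print(by_mirna)
```

With the focus on targets, the result shows one target regulated by several miRNAs; with the focus on miRNA names, it shows one miRNA regulating several targets.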
Other target functions: count_target(), join_mirtarbase(), join_targets(), plot_target_count()
Create wordcloud of terms associated with a miRNA name.
plot_wordcloud( df, mir, min.freq = 1, max.terms = 20, tf.idf = FALSE, token = "words", ..., stopwords = stopwords_miretrieve, stopwords_ngram = TRUE, colours = "black", random.colour = TRUE, ordered.colour = FALSE, col.mir = miRNA, col.abstract = Abstract, col.pmid = PMID )
df |
Data frame containing miRNA names, abstracts, and PubMed-IDs. |
mir |
String. miRNA name of interest. |
min.freq |
Integer. Specifies least number of times a term must be associated with
|
max.terms |
Integer. Maximum number of terms to plot. |
tf.idf |
Boolean. If |
token |
String. Specifies how abstracts shall be split up. Taken from
|
... |
Additional arguments for tokenization, if necessary. |
stopwords |
Data frame containing stop words. |
stopwords_ngram |
Boolean. Specifies if stop words shall be removed
from abstracts when using ngrams. Only applied when |
colours |
Vector of strings. Colours for wordcloud. |
random.colour |
Boolean. Taken from |
ordered.colour |
Boolean. Taken from |
col.mir |
Symbol. Column containing miRNA names. |
col.abstract |
Symbol. Column containing abstracts. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
Create wordcloud of terms associated with a miRNA name. miRNA names must be in a data frame df, while terms are taken from abstracts contained in df. The number of terms to plot is regulated by max.terms, while min.freq regulates the least number of times a term must be mentioned to be plotted.
Terms can either be evaluated by their raw count, i.e. how often they are mentioned in conjunction with the miRNA of interest, or weighted in a tf-idf fashion. If tf.idf = TRUE, miRNA names are considered as separate documents, and terms often associated with one miRNA but not with other miRNAs get more weight.
plot_wordcloud() is based on the tools available in the wordcloud package.
Wordcloud of terms associated with a miRNA name.
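The tf-idf weighting with miRNA names as documents can be illustrated with a small base-R calculation. The term counts below are invented, and the formula shown (term frequency times the natural log of the inverse document frequency) is the common textbook definition; it is assumed, not confirmed, that plot_wordcloud() computes its weights the same way.

```r
# Toy term counts per miRNA, invented for illustration only.
term_counts <- data.frame(
  miRNA = c("miR-21", "miR-21", "miR-155", "miR-155"),
  term  = c("fibrosis", "cancer", "inflammation", "cancer"),
  n     = c(4, 6, 5, 5),
  stringsAsFactors = FALSE
)

# Term frequency: share of a term among all terms counted for one miRNA.
tf <- term_counts$n / ave(term_counts$n, term_counts$miRNA, FUN = sum)

# Inverse document frequency: log(number of miRNAs / miRNAs using the term).
# Each row is a unique miRNA-term pair, so rows per term = miRNAs per term.
n_docs <- length(unique(term_counts$miRNA))
docs_with_term <- ave(rep(1, nrow(term_counts)), term_counts$term, FUN = sum)
idf <- log(n_docs / docs_with_term)

term_counts$tf_idf <- tf * idf
print(term_counts)
```

In this toy data "cancer" is mentioned with both miRNAs and therefore receives a weight of zero, while "fibrosis" and "inflammation" are specific to one miRNA each and receive positive weights.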
plot_mir_terms(), wordcloud::wordcloud(), tidytext::unnest_tokens()
Other miR term functions: plot_mir_terms()
Convert PubMed-file from PubMed into a data frame.
read_pubmed(pubmed_file, topic = NULL)
pubmed_file |
PubMed-file as .txt, downloaded from PubMed. |
topic |
String. Optional. If provided, adds a "Topic" column containing
|
Convert a PubMed-file from PubMed into a data frame. The PubMed-file should contain PubMed-IDs, abstracts from research articles, abstract title, publication year, abstract language, and article type. The data frame created holds at least six columns, namely
PMID, containing the PubMed-ID,
Year, containing the publication year,
Title, containing the title of the abstracts,
Abstract, containing the actual abstract,
Language, containing the language(s) of the paper,
Type, containing the article type.
If topic is provided, a "Topic" column is added, assigning all abstracts in the data frame to topic.
read_pubmed() is faster than read_pubmed_jats() and thus recommended.
Data frame containing PubMed-IDs, abstracts, abstract titles, publication years, languages, and article types.
Other external data functions: read_pubmed_jats(), save_excel(), save_plot()
Convert JATS-file from PubMed into a data frame.
read_pubmed_jats(jats_file, topic = NULL)
jats_file |
JATS-file, downloaded from PubMed. |
topic |
String. Optional. If provided, adds a "Topic" column containing
|
Converts a JATS-file from PubMed into a data frame. The JATS-file should contain PubMed-IDs, abstracts from research articles, abstract title, publication year, abstract language, and article type. The data frame created holds at least six columns, namely
PMID, containing the PubMed-ID,
Year, containing the publication year,
Title, containing the title of the abstracts,
Abstract, containing the actual abstract,
Language, containing the language(s) of the paper,
Type, containing the article type.
If topic is provided, a "Topic" column is added, assigning all abstracts in the data frame to topic.
read_pubmed() is faster than read_pubmed_jats() and thus recommended.
Data frame containing PubMed-IDs, abstracts, abstract titles, publication years, languages, and article types.
Other external data functions: read_pubmed(), save_excel(), save_plot()
Save data frame(s) locally as an xlsx-file.
save_excel(..., excel_file = "miRetrieve_data.xlsx")
... |
Data frame(s) to save. |
excel_file |
String. File name that |
Saves one or more data frames locally as an xlsx-file. If more than one data frame is provided, they are saved in a single xlsx-file with one sheet per data frame.
Wrapper function of write.xlsx() from openxlsx.
xlsx-file, locally saved.
Other external data functions: read_pubmed_jats(), read_pubmed(), save_plot()
Save the last generated figure locally.
save_plot( plot_file, width = NULL, height = NULL, units = "in", dpi = 300, device = NULL )
plot_file |
String. File name that the figure
shall be saved to. Can end in either ".png", ".tiff",
".pdf", ".jpeg", or ".bmp". For more information, see the documentation
of |
width |
Integer. Optional. Plot width. If |
height |
Integer. Optional. Plot height. If |
units |
String. Units for |
dpi |
Integer. Resolution for raster graphics such as .png-files. |
device |
String or function. Specifies which device to use (such as
"pdf" or |
Saves the last generated figure locally. Wrapper function of ggsave() from ggplot2. For further details, please see ?ggplot2::ggsave.
Plot, locally saved.
Other external data functions: read_pubmed_jats(), read_pubmed(), save_excel()
Data frame containing PubMed 2-gram stop words, manually curated from PubMed abstracts.
stopwords_2gram
Tibble.
word: Column containing stop words. Pulled from various PubMed abstracts.
lexicon: Column specifying lexicon.
Manually created from various PubMed abstracts.
Data frame containing English stop words, PubMed stop words, and common 2-gram stop words. English stop words are based on tidytext::stop_words, while PubMed stop words are manually curated from PubMed abstracts.
stopwords_miretrieve
Tibble.
word: Column containing stop words. Pulled from various PubMed abstracts.
lexicon: Column specifying lexicon.
tidytext::stop_words; manually created from various PubMed abstracts.
Data frame containing PubMed stop words, manually curated from PubMed abstracts.
stopwords_pubmed
Tibble.
word: Column containing stop words. Pulled from various PubMed abstracts.
lexicon: Column specifying lexicon.
Manually created from various PubMed abstracts.
Subset data frame for a term in a specified column.
subset_df(df, col.filter, filter_for = "Yes")
df |
Data frame to subset. |
col.filter |
String. Name of column to filter. |
filter_for |
String. Term to filter for. |
Subset data frame for a term in a specified column.
subset_df() filters a data frame for a certain term in a specified column. All rows containing the term in the specified column are kept, while the other rows are silently dropped. Here, col.filter is a string rather than a symbol to facilitate filtering in columns that carry special characters such as '-' in their name.
Data frame, subset for rows where filter_for was present in col.filter.
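Why a string column name helps with names such as "miR-21" can be seen in a short base-R sketch. The helper subset_rows is hypothetical and assumes exact matching of the term; subset_df() itself may match differently.

```r
# Toy data frame with a column name that contains a '-'.
df <- data.frame(PMID = 1:4, stringsAsFactors = FALSE)
df[["miR-21"]] <- c("Yes", "No", "Yes", "No")

# Hypothetical helper: keep rows whose value in `col.filter` equals
# `filter_for`. Passing the column name as a string avoids the need for
# backticks around names that are not syntactically valid R symbols.
subset_rows <- function(df, col.filter, filter_for = "Yes") {
  df[df[[col.filter]] == filter_for, , drop = FALSE]
}

kept <- subset_rows(df, "miR-21")
print(kept)
```

With a bare symbol, miR-21 would be parsed as miR minus 21, which is why a string is the more convenient interface for such columns.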
indicate_term(), indicate_mir(), extract_snp()
Other subset functions: subset_mir_threshold(), subset_mir(), subset_research(), subset_review(), subset_snp(), subset_year()
Subset data frame for specific miRNA names only.
subset_mir(df, mir.retain, col.mir = miRNA)
df |
Data frame containing miRNA names. |
mir.retain |
Character vector. Vector specifying which miRNA names to keep.
miRNA names in |
col.mir |
Symbol. Column containing miRNA names. |
Subset data frame for specific miRNA names only.
Data frame containing only specified miRNA names.
If no miRNA name in mir.retain matches a miRNA name in col.mir, subset_mir() stops with a warning saying "No miRNA name in 'mir.retain' matches a miRNA name in 'col.mir'. Could not filter for miRNA name.".
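The retain-or-stop behaviour can be sketched in base R as follows. The helper keep_mirs is hypothetical, assumes exact matching of miRNA names, and signals an error with stop(); how subset_mir() matches names and raises its message internally is not shown in this documentation.

```r
# Toy data frame, invented for illustration only.
df <- data.frame(
  PMID  = c(1, 2, 3),
  miRNA = c("miR-21", "miR-155", "miR-21"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose miRNA name is in `mir.retain`,
# and stop if nothing matches at all.
keep_mirs <- function(df, mir.retain) {
  hits <- df$miRNA %in% mir.retain
  if (!any(hits)) {
    stop("No miRNA name in 'mir.retain' matches a miRNA name in 'col.mir'.")
  }
  df[hits, , drop = FALSE]
}

kept <- keep_mirs(df, "miR-21")
print(kept)

# A name that is absent from the data triggers the message instead.
msg <- tryCatch(keep_mirs(df, "miR-999"), error = function(e) conditionMessage(e))
print(msg)
```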
get_mir(), subset_mir_threshold()
Other subset functions: subset_df(), subset_mir_threshold(), subset_research(), subset_review(), subset_snp(), subset_year()
Subset data frame for miRNA names whose frequency exceeds a threshold.
subset_mir_threshold(df, threshold = 1, col.mir = miRNA, col.pmid = PMID)
df |
Data frame containing miRNA names and PubMed-IDs. |
threshold |
Integer or float. If |
col.mir |
Symbol. Column containing miRNA names. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
Subset data frame for miRNA names whose frequency exceeds a threshold. This threshold can either be an absolute value, e.g. 3, or a float between 0 and 1, e.g. 0.2.
If threshold is an absolute value, subset_mir_threshold() retains miRNA names mentioned in at least threshold abstracts.
If threshold is a float between 0 and 1, subset_mir_threshold() retains miRNA names mentioned in at least the fraction threshold of all abstracts in df, e.g. threshold = 0.2 retains miRNA names mentioned in at least 20% of all abstracts.
Data frame, subset for miRNA names whose frequency exceeds a threshold.
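The difference between an absolute and a fractional threshold can be sketched in base R. The helper keep_frequent and the toy data are hypothetical; in particular, treating every value below 1 as a fraction of all distinct abstracts is an assumption made for this illustration.

```r
# Toy data: five abstracts (PMIDs 1 to 5), invented for illustration only.
df <- data.frame(
  PMID  = c(1, 1, 2, 3, 4, 5),
  miRNA = c("miR-21", "miR-155", "miR-21", "miR-21", "miR-126", "miR-155"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep miRNA names mentioned in enough abstracts.
keep_frequent <- function(df, threshold) {
  pairs <- unique(df[, c("PMID", "miRNA")])
  n_abstracts <- tapply(pairs$PMID, pairs$miRNA, length)
  if (threshold < 1) {
    # A value below 1 is read as a fraction of all distinct abstracts.
    threshold <- threshold * length(unique(df$PMID))
  }
  keep <- names(n_abstracts)[as.vector(n_abstracts) >= threshold]
  df[df$miRNA %in% keep, , drop = FALSE]
}

absolute <- keep_frequent(df, 3)    # at least 3 abstracts
relative <- keep_frequent(df, 0.3)  # at least 30% of the 5 abstracts
print(unique(absolute$miRNA))
print(unique(relative$miRNA))
```

Here miR-21 appears in three abstracts and miR-155 in two, so the absolute threshold of 3 keeps only miR-21, whereas the fraction 0.3 (1.5 of 5 abstracts) keeps both.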
Other subset functions: subset_df(), subset_mir(), subset_research(), subset_review(), subset_snp(), subset_year()
Subset data frame for abstracts of research articles only.
subset_research(df, col.type = Type)
df |
Data frame containing article types. |
col.type |
Symbol. Column containing article types. |
Subset data frame for abstracts of research articles only. At the same time, abstracts from other article types such as Review, Letter, etc. are dropped.
Data frame containing abstracts of research articles only.
subset_review(), subset_year()
Other subset functions: subset_df(), subset_mir_threshold(), subset_mir(), subset_review(), subset_snp(), subset_year()
Subset data frame for abstracts of review articles only.
subset_review(df, col.type = Type)
df |
Data frame containing article types. |
col.type |
Symbol. Column containing article types. |
Subset data frame for abstracts of review articles only. At the same time, abstracts from other article types such as Journal Article, Letter, etc. are dropped.
Data frame containing abstracts of review articles only.
subset_research(), subset_year()
Other subset functions: subset_df(), subset_mir_threshold(), subset_mir(), subset_research(), subset_snp(), subset_year()
Subset data frame for specific SNPs only.
subset_snp(df, snp.retain, col.snp = SNPs)
df |
Data frame containing SNPs. |
snp.retain |
Character vector. Vector specifying which SNPs to keep.
SNPs in |
col.snp |
Symbol. Column containing SNPs. |
Subset data frame for specific SNPs only.
Data frame containing only specified SNPs.
If no SNP in snp.retain matches a SNP in col.snp, subset_snp() stops with a warning saying "No SNP in 'snp.retain' matches a SNP in 'col.snp'. Could not filter for SNP.".
extract_snp(), count_snp(), get_snp()
Other subset functions: subset_df(), subset_mir_threshold(), subset_mir(), subset_research(), subset_review(), subset_year()
Subset data frame for abstracts published in a specific period only.
subset_year(df, col.year = Year, start = NULL, end = NULL)
df |
Data frame containing publication years. |
col.year |
Symbol. Column containing publication years. |
start |
Integer. Optional. Beginning of
publication period.
If |
end |
Integer. Optional. End of
publication period.
If |
Subset data frame for abstracts published in a specific period only. All abstracts published outside this period are silently dropped.
Data frame containing abstracts published in a specific period only.
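Filtering by publication period can be sketched in base R. The helper keep_years is hypothetical; it assumes that both bounds are inclusive and that a missing bound means no restriction on that side, neither of which is confirmed by the text above.

```r
# Toy data frame, invented for illustration only.
df <- data.frame(
  PMID = c(1, 2, 3),
  Year = c(2010, 2015, 2020)
)

# Hypothetical helper: keep abstracts published between `start` and `end`.
# Assumptions: bounds are inclusive; a NULL bound is left open.
keep_years <- function(df, start = NULL, end = NULL) {
  years <- df$Year
  if (is.null(start)) start <- min(years)
  if (is.null(end)) end <- max(years)
  df[years >= start & years <= end, , drop = FALSE]
}

since_2012 <- keep_years(df, start = 2012)
window     <- keep_years(df, start = 2012, end = 2016)
print(since_2012$Year)
print(window$Year)
```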
subset_research(), subset_review()
Other subset functions: subset_df(), subset_mir_threshold(), subset_mir(), subset_research(), subset_review(), subset_snp()