Title: | Text Mining of PubMed Abstracts |
---|---|
Description: | Text mining of PubMed Abstracts (text and XML) from <https://pubmed.ncbi.nlm.nih.gov/>. |
Authors: | Jyoti Rani [aut], S.Ramachandran [aut], Ab Rauf Shah [aut], S. Ramachandran [cre] |
Maintainer: | S. Ramachandran <[email protected]> |
License: | GPL-3 |
Version: | 1.0.21 |
Built: | 2025-03-08 06:24:18 UTC |
Source: | CRAN |
"Abstracts"
Abstract Class
S4 class with three slots (Journal, Abstract, PMID) to store abstracts from PubMed.
Objects can be created by calls of the form new("Abstracts", ...).
Journal: Object of class "character" to store the journals of the abstracts from PubMed
Abstract: Object of class "character" to store the abstracts from PubMed
PMID: Object of class "numeric" to store the PMIDs of the abstracts from PubMed
No methods defined with class "Abstracts" in the signature.
S.Ramachandran, Ab Rauf Shah
searchabsL getabs contextSearch Genewise
Yearwise combineabs subabs subsetabs readabs
showClass("Abstracts")
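To make the slot layout concrete, the following is a minimal sketch of a stand-in class with the same three slots. The class name AbstractsDemo and the sample values are hypothetical; the real "Abstracts" class is defined by the package itself and should be used in practice.

```r
library(methods)

# Hypothetical stand-in mirroring the three documented slots;
# it is NOT the package's own class definition.
setClass("AbstractsDemo",
         representation(Journal = "character",
                        Abstract = "character",
                        PMID = "numeric"))

# Create an object with new(), as described for the real class.
demo_abs <- new("AbstractsDemo",
                Journal = c("J Demo", "J Demo"),
                Abstract = c("Insulin resistance was studied.",
                             "Candidate genes were screened."),
                PMID = c(1, 2))

# Slots are accessed with the @ operator.
length(demo_abs@PMID)
```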
additional_info extracts the sentences containing one or more query terms from a large corpus of abstracts.
additional_info(abs, pmid, keywords)
abs |
|
pmid |
Vector of PMIDs from abstracts |
keywords |
Character Vector of Terms |
It will return a matrix object containing PMID, keywords and sentences
Surabhi Seth
## Not run: additional_info(abs = Abstract, pmid = "26564970", keywords = "text-mining")
alias_fn
This function returns the sentences containing aliases of genes together with the user-given terms from the abstracts, using the HGNC gene data table.
In this sense, this function is a two-dimensional search.
alias_fn(genes, data, abs, filename, terms)
genes |
|
data |
|
abs |
|
filename |
|
terms |
|
An output file containing sentences with aliases of genes. For convenience, both the official symbol and the corresponding alias are written in the output. The PMID of the abstract containing the extracted sentence appears just before the sentence. Note that multiple sentences from different abstracts are grouped together under the gene alias that appears in those sentences.
S.Ramachandran
## Not run: alias_fn(genes, data, myabs, "nephro_", c("diabetic nephropathy", "kidney disease"))
## 'genes' is the output of gene_atomization()
This function retrieves the alternative names of genes from UniProt using HGNC gene symbols.
altnamesfun(m)
m |
is a character vector of HGNC official gene symbols. |
It returns a list of alternative names for the given gene symbols.
S.Ramachandran
UniProt Consortium. "The universal protein resource (UniProt)." Nucleic acids research 36.suppl 1 (2008): D190-D195. http://www.uniprot.org/
uniprotfun
## Not run: test = altnamesfun(c("ADIPOQ", "BDNF"))
## here "ADIPOQ" is an HGNC gene symbol for which alternative name(s) are required.
This function is used to obtain the Buzz word index value for the terms.
BWI(current, previous, n, N)
current |
|
previous |
|
n |
|
N |
|
It returns a list containing BWI value for the given word.
S.Ramachandran
Jensen, Lars Juhl, Jasmin Saric, and Peer Bork. "Literature mining for the biologist: from information retrieval to biological discovery." Nature reviews genetics 7.2 (2006): 119-129.
## Not run: result = BWI(mycurrentabs, mypreviousabs, "insulin", "inflammation")
## BWI for the term "insulin"; the theme is "inflammation".
## Note that the years in 'previous' start one before the current year, 2015.
## 'current' is an S4 object containing the output from currentabs_fn().
## 'previous' is an S4 object containing the output from previousabs_fn().
## 'n' and 'N' are the query term whose BWI is sought and the theme, respectively.
It will remove the 'NONE' abstracts from the result of searchabsL.
cleanabs(object)
object |
an S4 object of class Abstracts. |
an S4 object of class Abstracts.
Jyoti Rani
## Not run: test1 = searchabsL(abs, include = c("term1", "term2")); test2 = cleanabs(test1)
## End(Not run)
## here 'abs' is an S4 object of class Abstracts
## 'term1', 'term2' are the search terms
## test1 is an S4 object containing abstracts for the given terms
## and test2 is an S4 object of class Abstracts containing the cleaned abstracts of searchabsL
cleanabs
To clean 'NONE' part of searchabsL output.
signature(object = "Abstracts")
From an S4 object of class 'Abstracts' the cleanabs function is able to clean the output of searchabsL by removing the 'NONE' part of resultant abstracts.
Function for finding the word (term) of highest frequency within clusters.
cluster_words(wordscluster, n)
wordscluster |
an R object containing the output of wordscluster() |
n |
a numeric vector containing cluster numbers |
a list containing cluster and its highest frequency word
S. Ramachandran
## Not run: test = cluster_words(wordscluster, 5)
## wordscluster is the R object returned by wordscluster()
## 5 is the cluster number
## End(Not run)
Extracts single or multiple sentences with co-occurrence of given terms
co_occurrence_advance(abstract, term1, term2, n)
abstract |
an S4 object of class Abstracts |
term1 |
a character vector of terms |
term2 |
a character vector of terms |
n |
A numeric value, which can be 0,1,2. |
Sentences with co-occurrence of two terms will be extracted along with the corresponding PMIDs. The output will be a data frame. In regard to the argument n, when the value is 0 then the co-occurrence is sought in the same sentence. When the value is 1, then the co-occurrence is sought in two consecutive sentences, namely, first term in the first sentence and second term in the next sentence. When the value is 2, then the co-occurrence is sought in two sentences separated by a sentence without either term1 or term2.
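The three settings of n can be illustrated with a small base R sketch. It is not the package's implementation: the helper name cooccur_sketch is hypothetical, the abstract is assumed to be already split into sentences, and matching is done with a simple case-insensitive grepl().

```r
# Hypothetical helper: find term1 in sentence i and term2 in sentence i + n.
# For n = 2 the sentence in between must contain neither term.
cooccur_sketch <- function(sentences, term1, term2, n = 0) {
  has1 <- grepl(term1, sentences, ignore.case = TRUE)
  has2 <- grepl(term2, sentences, ignore.case = TRUE)
  hits <- character(0)
  last <- length(sentences) - n
  for (i in seq_len(max(last, 0))) {
    j <- i + n
    if (!has1[i] || !has2[j]) next
    if (n == 2 && (has1[i + 1] || has2[i + 1])) next
    hits <- c(hits, paste(sentences[i:j], collapse = " "))
  }
  hits
}

# Made-up sentences for illustration only.
s <- c("Resistance genes were mapped.",
       "Resistance was observed.",
       "Several genes were involved.",
       "Resistance rose.",
       "Samples were pooled.",
       "Two genes changed.")

cooccur_sketch(s, "resistance", "genes", 0)  # same sentence
cooccur_sketch(s, "resistance", "genes", 1)  # consecutive sentences
cooccur_sketch(s, "resistance", "genes", 2)  # separated by a term-free sentence
```

The real function additionally reports PMIDs and term pairs in a data frame; the sketch returns only the matched sentence spans.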
It will return a data frame object containing PMID, sentences and the term pairs.
Shashwat Badoni, Surabhi Seth
## Not run: co_occurrence_advance(myabs, "resistance", c("genes", "genetic"), 2)
co_occurrence_fn will automatically extract sentences with co-occurrence of two sets of terms.
co_occurrence_fn(terms1, abs, filename, terms2)
terms1 |
a character vector of terms. |
abs |
an S4 object of class Abstracts |
filename |
a single character string giving the file name |
terms2 |
a character vector of terms. |
Sentences with co-occurrence of two terms will be extracted along with the corresponding PMIDs. The data will be written in a text file with the user given filename and the word co_occurrence will be suffixed to it.
A text file.
S.Ramachandran
## Not run: co_occurrence_fn("resistance", myabs, "resistance_genetic", c("genes", "genetic"))
combineabs will automatically combine the abstracts of two objects.
combineabs(object1, object2)
object1 |
An S4 object of class Abstracts |
object2 |
An S4 object of class Abstracts |
Two objects of class 'Abstracts' are combined to return non-redundant combined abstracts. It can be used sequentially to combine many objects of class 'Abstracts'. It will also write the number of combined abstracts into a text file named "data_out.txt"
An R object containing the combined abstracts, and a text file named "data_out.txt" containing the number of abstracts combined together
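A rough sketch of what a non-redundant combination can look like in base R is given below. Plain lists stand in for the two 'Abstracts' objects, and redundancy is assumed to be judged by PMID; the manual does not state which criterion the package uses, so this is an illustration only.

```r
# Hypothetical stand-ins: plain lists play the role of two 'Abstracts' objects.
x <- list(PMID = c(101, 102),
          Abstract = c("First abstract.", "Second abstract."))
y <- list(PMID = c(102, 103),
          Abstract = c("Second abstract.", "Third abstract."))

# Concatenate, then drop records whose PMID was already seen (assumed criterion).
pmid <- c(x$PMID, y$PMID)
keep <- !duplicated(pmid)
combined <- list(PMID = pmid[keep],
                 Abstract = c(x$Abstract, y$Abstract)[keep])

length(combined$PMID)  # number of abstracts after combining
```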
S.Ramachandran, Jyoti Rani
## Not run: res1 = combineabs(x,y) ## here 'x', 'y' are the S4 objects of class 'Abstracts'.
Abstracts
Method to Combine Abstracts
combineabs is a method to combine the abstracts. object1 and object2 are from the Abstracts class.
signature(object1 = "Abstracts")
An S4 object of class "Abstracts"
signature(object2 = "Abstracts")
An S4 object of class "Abstracts"
This dataset is used to remove common words from the abstracts. This step is used for size reduction for further data mining.
data(common_words_new)
The format is: chr "common_words_new"
The dataset contains common words, which are removed from the text for size reduction.
https://en.wikipedia.org/wiki/Most_common_words_in_English
data(common_words_new)
contextSearch is a method to extract the sentences containing a given query term.
contextSearch(object, y)
object |
An S4 object of Class Abstracts containing text abstracts |
y |
a character vector of term(s) |
It takes an object of class Abstracts and query term(s) as arguments and writes a text file and a LaTeX file of the sentences containing the query term. The LaTeX file can be converted into PDF from within R using the system command, i.e. system("pdflatex filename.tex"). pdflatex is a shell command in Linux that converts a LaTeX file into PDF. In the PDF file the terms are written in bold face to ease reading.
contextSearch() will write two files: a text file named "companion.txt" and a LaTeX file. If a single term is given in the query, the LaTeX file is named after that term. If multiple terms are used, the file name will be "combined.tex".
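The idea of pulling out matching sentences and marking the term in bold for LaTeX can be sketched in base R as follows. This is an illustration under simplifying assumptions (a naive sentence splitter, a single made-up abstract, a temporary output file) and not the package's own code.

```r
# Made-up abstract text for illustration.
abstract <- "Diabetes is common. Diet matters. Type 2 diabetes responds to exercise."

# Naive sentence split: break on whitespace that follows ., ! or ?
sentences <- unlist(strsplit(abstract, "(?<=[.!?])\\s+", perl = TRUE))

# Keep only the sentences that mention the query term.
hits <- grep("diabetes", sentences, ignore.case = TRUE, value = TRUE)

# Wrap each occurrence of the term in \textbf{} so it prints in bold.
bold <- gsub("(diabetes)", "\\\\textbf{\\1}", hits, ignore.case = TRUE)

# Write the marked-up sentences to a temporary .tex fragment.
out_file <- tempfile(fileext = ".tex")
writeLines(bold, out_file)
```

A complete LaTeX document would also need a preamble before pdflatex could process it.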
Dr.S.Ramachandran, Jyoti Rani
## Not run: contextSearch(x, "diabetes") ## here 'x' is S4 object of class 'Abstracts', and query term is 'diabetes'.
contextSearch will search the sentences for the given term(s).
signature(object = "Abstracts")
The object from where it will search should be an S4 object of class Abstracts
cos_sim_calc calculates the cosine measure of similarity between pairs of terms from a corpus.
cos_sim_calc(nummatrix)
nummatrix |
A numeric matrix, e.g. a term-document matrix (output from tdm_for_lsa) |
The term-document matrix is taken as input and cosine measures of similarity between all pairs of terms are calculated.
A tab-delimited text file containing the similarity values between all pairs of terms.
This file can be input to Cytoscape directly.
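For readers who want to see the calculation itself, here is a minimal base R sketch of cosine similarity between term rows of a toy term-document matrix, written out as a tab-delimited edge list. The matrix values, the column names of the edge list and the output file are assumptions for illustration; the exact layout written by cos_sim_calc may differ.

```r
# Toy term-document matrix: rows are terms, columns are documents (made-up counts).
tdm <- matrix(c(2, 0, 1,
                1, 0, 2,
                0, 3, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("insulin", "glucose", "malaria"),
                              paste0("doc", 1:3)))

# Cosine similarity: dot products of term vectors divided by the product of their norms.
norms <- sqrt(rowSums(tdm^2))
cos_sim <- (tdm %*% t(tdm)) / (norms %o% norms)

# Keep each unordered pair of terms once.
idx <- which(upper.tri(cos_sim), arr.ind = TRUE)
edges <- data.frame(term1 = rownames(cos_sim)[idx[, 1]],
                    term2 = colnames(cos_sim)[idx[, 2]],
                    similarity = cos_sim[idx])

# Write a tab-delimited file of term pairs and their similarity values.
write.table(edges, file = tempfile(fileext = ".txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)
```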
S. Ramachandran
https://en.wikipedia.org/wiki/Cosine_similarity
## Not run: x = cos_sim_calc(nummatrix) ## here nummatrix is the 'Term Document Matrix' generated from tdm_for_lsa()
cos_sim_calc_boot allows bootstrap analysis. This function should be used as the argument for 'statistic' in the boot function of the 'boot' package.
cos_sim_calc_boot(data, indices)
data |
Term Document Matrix generated from |
indices |
index of matrix. |
When calling this function, the input term-document matrix must be transposed; the number of replicates can also be set. The boot package is required to call this function.
It will return a matrix containing the cosine similarity of pairs of terms in the abstracts. This object is in same format as returned by the 'boot' function of 'boot' package.
Dr.S.Ramachandran
## Not run: test_boot = boot(data = t(nummatrix), statistic = cos_sim_calc_boot, R = 2)
## here 'nummatrix' is a Term Document Matrix and boot is the function of the boot package.
## R is the number of replicates, here 2. The user can increase this number.
This function is used to extract the abstracts for the year we want to study. Its output is used as input in other functions like BWI() and genes_BWI().
currentabs_fn(yr_to_include, theme, parentabs)
yr_to_include |
|
theme |
|
parentabs |
|
It returns an S4 object containing the abstracts of the given year.
S.Ramachandran
## Not run: test = currentabs_fn("2015", "atherosclerosis", diabetesabs)
## here "2015" is the year for which we wish to extract the abstracts on the theme "atherosclerosis"
## from the large corpus of diabetes, i.e. diabetesabs.
This function is designed for user convenience, so that the user can get the conclusion from the abstract(s) without reading the whole abstract(s).
Find_conclusion(y)
y |
An S4 object of class 'Abstracts'. |
A list containing conclusions of given abstract(s)
S.Ramachandran, Jyoti Rani
## Not run: res1 = Find_conclusion(y)
## here 'y' is an S4 object of class Abstracts.
It helps to fetch the introduction and conclusion parts from the abstracts.
find_intro_conc_html(y, themes, all)
y |
an S4 object of class Abstracts |
themes |
a character vector containing terms to be searched in the abstracts |
all |
logical; if TRUE, the title and authors are included, otherwise only the abstracts are considered. |
find_intro_conc_html provides an HTML file containing the space-separated introduction and conclusion parts of the abstracts for the given query term, along with a direct link to PubMed for each resulting PMID.
an HTML file.
S.Ramachandran, Jyoti Rani
input_for_find_intro_conc_html
## Not run: test = find_intro_conc_html(abs, "diet", all = FALSE)
## here 'abs' is an S4 object of class Abstracts
## and 'diet' is a term to be searched in the abstracts
## this function works for a small corpus, say about 30-40 abstracts
gene_atomization will automatically fetch the genes (HGNC approved symbols) from the text and report their frequencies. Presently only HGNC approved symbols are used.
gene_atomization(m)
m |
An S4 object of class Abstracts |
The function writes a text file with the file name "data_table.txt". The function gene_atomization() is used to obtain the names of genes along with their frequencies of occurrence.
A tab delimited table containing gene name and their frequencies of occurrence.
S.Ramachandran, Jyoti Rani
## Not run: gene_atomization(myabs)
## here myabs is an S4 object of class 'Abstracts' containing the abstracts
## uses an older version of HGNC data (https://www.genenames.org/) by default.
## users may also use other functions such as official_fn and the related
## family of functions for deeper data mining.
This function provides the Buzz word index for each gene. The theme is the context in which the gene is studied, e.g. atherosclerosis. Using this function, the user can identify abstracts with emphasis on a given gene.
genes_BWI(currentabs, previousabs, theme, genes)
currentabs |
|
previousabs |
|
theme |
|
genes |
|
It returns a data frame containing genes with their corresponding BWI values.
S.Ramachandran
## Not run: test = genes_BWI(currentabs, previousabs, theme, genes)
## currentabs is an S4 object containing the abstracts for the year we want to study.
## previousabs is an S4 object containing the abstracts for the years before
## our query year, e.g. before 2015.
## theme is a character value specifying the search.
## genes is a character vector of gene symbols.
This dataset is used in the DAVID_info function of the package. It contains the Entrez IDs for the respective genes; these Entrez IDs are used to get information about human genes.
data(GeneToEntrez)
The format is: chr "GeneToEntrez"
data(GeneToEntrez)
Genewise reports the number of abstracts for the given gene name(s).
Genewise(object, gene)
object |
An S4 object of class Abstracts |
gene |
a character input of gene name (HGNC approved symbol) |
This function will report the number of abstracts containing the query gene term(s) [HGNC approved symbols], and the result is saved in a text file "dataout.txt". Genewise() will report numbers of abstracts only. The abstracts themselves for the corresponding gene names can be obtained using searchabsL() and searchabsT().
Genewise will return an R object containing the abstracts for given gene, and a text file named "dataout.txt" containing the number of abstracts
S. Ramachandran, Jyoti Rani
## Not run: Genewise(x, "TLR4") ## here 'x' contains the S4 object of Abstracts.
Genewise
The method Genewise will automatically report the numbers of abstracts for a given gene. It will write the result in the text file named "dataout.txt"
signature(object = "Abstracts")
This method will search in an S4 object containing abstracts. It will write a text file named "dataout.txt", containing the number of abstracts for the query gene terms.
get_DOIs is used to extract DOIs of papers.
get_DOIs(abs)
abs |
An S4 object of class Abstracts |
get_DOIs allows users to get DOIs for individual papers.
It returns a list object containing DOIs. This is useful for further extraction of papers.
S.Ramachandran
## Not run: test = get_DOIs(vitiligoabs)
get_gene_sentences is used to extract the exact sentence in which the query gene is discussed.
get_gene_sentences(genes, abs, filename)
genes |
|
abs |
|
filename |
|
an output file containing the sentences for given gene.
S.Ramachandran
## Not run: get_gene_sentences("RBP4", abstracts, "RBP4_sentence.txt")
This function gets the summary from MedlinePlus.
get_MedlinePlus(x)
x |
|
It returns an HTML file named result_Medline_plus.html, which can be opened with any browser.
S.Ramachandran
www.medlineplus.gov, Conuel T. Finding answers in a beauty shop. NIH MedlinePlus: the magazine [Internet]. 2012 Fall [cited 2013 Feb 9]; 7(3):24-26. Available from: https://medlineplus.gov/magazine/issues/fall12/articles/fall12pg24-26.html
## Not run: get_MedlinePlus("malaria")
get_NMids fetches the NM IDs from NCBI for the corresponding gene(s), so that the sequences of those gene(s) can then be fetched.
get_NMids(x)
x |
|
It returns a list object containing corresponding NM id from NCBI.
S.Ramachandran
http://www.ncbi.nlm.nih.gov/gene
## Not run: get_NMids("5950")
## 5950 is the Locus ID of the RBP4 gene.
get_original_term is used to get the exact term as it is present in the corpus. This function is no longer recommended.
get_original_term(m, n)
m |
an S4 object of class Abstracts containing the corpus. |
n |
a list object output from the function cluster_words |
a list object containing the terms.
S.Ramachandran, Jyoti Rani
## Not run: test = get_original_term(abs, words)
## here abs is an S4 object of class Abstracts
## words is the output object of cluster_words()
get_original_term2 is used to get the exact term as it is present in the corpus. It takes one term at a time. For multiple terms we can use lapply.
get_original_term2(x, y)
x |
|
y |
|
It returns a list object containing accurate term.
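The one-term-at-a-time pattern combined with lapply can be sketched in base R as below. The helper original_form and the toy corpus are hypothetical, and picking the most frequent spelling is an assumption made for this illustration; the manual does not describe how the package chooses the form it returns.

```r
# Made-up sentences standing in for a corpus of abstracts.
corpus <- c("HbA1c was measured.",
            "Levels of hba1c and TNF-alpha rose.",
            "HbA1c fell.")

# Hypothetical helper: return the spelling of 'term' that occurs most often
# in 'texts', matching case-insensitively (assumed selection rule).
original_form <- function(term, texts) {
  m <- unlist(regmatches(texts, gregexpr(term, texts, ignore.case = TRUE)))
  names(sort(table(m), decreasing = TRUE))[1]
}

# One term per call; lapply handles several terms.
res <- lapply(c("hba1c", "tnf-alpha"), original_form, texts = corpus)
```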
Jyoti Rani, S.Ramachandran.
## Not run: test = get_original_term2("hba1c", diababs)
## here it will return the exact form of hba1c, i.e. HbA1c, from diababs.
get_PMCIDS is used to fetch the PMC IDs of the abstracts from the corpus.
get_PMCIDS(abs)
abs |
|
It returns a list containing PMC Ids.
S.Ramachandran
## Not run: get_PMCIDS(abstracts)
get_PMCtable is used to extract the full text article for a given query PMC ID. Deprecated.
get_PMCtable(url)
url |
|
It will return a full text article.
S.Ramachandran
http://www.ncbi.nlm.nih.gov/pmc/
## Not run: get_PMCtable("http://www.ncbi.nlm.nih.gov/pmc/?term=4039032")
get_Sequences is used to fetch the sequences of genes using NM IDs.
get_Sequences(x, filename)
x |
NM Id of the sequence. |
filename |
|
It will return a text file containing sequence.
S.Ramachandran
get_NMids
## Not run: get_Sequences("NM_012238.4", "SIRT1")
getabs will automatically fetch the abstracts containing the query term. A base function of the package pubmed.mineR.
getabs(object, x, y)
object |
An S4 object of class Abstracts |
x |
A character string for the term |
y |
logical, if TRUE, search will be case sensitive |
getabs() is used to find and extract the abstracts for any given term from a large corpus of abstracts. It uses a regexpr-based search strategy.
An S4 object of class 'Abstracts', containing the result abstracts for the given term.
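A regexpr-based filter with an optional case-sensitivity switch can be sketched in base R as follows. The helper find_abs and the sample texts are hypothetical and operate on a plain character vector rather than an 'Abstracts' object; they only illustrate the search strategy named above.

```r
# Made-up abstract texts for illustration.
abstracts <- c("Insulin resistance in mice.",
               "Role of insulin signalling.",
               "Malaria vaccine trial.")

# Hypothetical helper: keep the texts in which regexpr() finds the term.
# regexpr() returns -1 for elements without a match.
find_abs <- function(texts, term, case_sensitive = FALSE) {
  pos <- regexpr(term, texts, ignore.case = !case_sensitive)
  texts[pos != -1]
}

find_abs(abstracts, "insulin")                          # first two abstracts
find_abs(abstracts, "insulin", case_sensitive = TRUE)   # second abstract only
```

Note that the term is interpreted as a regular expression here, so special characters in a query would need escaping.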
Dr.S.Ramachandran
## Not run: getabs(x, "term")
## x is an S4 object of class Abstracts containing the abstracts.
getabs
To Get abstracts for a term
getabs will search for the abstracts of a given term. It is case sensitive.
signature(object = "Abstracts")
This method takes three arguments: 'object', containing the data to be searched; 'x', the term to be searched; and 'y', a logical which, if set to TRUE, makes the search consider the case of the text.
getabsT will automatically fetch the abstracts containing the query term.
getabsT(object, x, y)
object |
An S4 object of class Abstracts |
x |
A character string for the term |
y |
is logical, if set TRUE, search will be case sensitive. |
getabsT() is similar to getabs(), but it performs more specific search.
An object of class 'Abstracts', containing the resulting abstracts for the term.
S.Ramachandran
## Not run: getabsT(diabdata, "term")
getabsT will automatically return the abstracts of a term from the data.
signature(object = "Abstracts")
getabsT will search for the abstracts of a term in the data, and will automatically write the number of abstracts into a text file named "dataout.txt".
Give_Sentences will help to extract the sentences containing the query term(s) from a large corpus.
Give_Sentences(m, abs)
m |
|
abs |
|
It will return a list object containing sentences
S.Ramachandran
## Not run: Give_Sentences("diabetes", Abstracts)
Give_Sentences_PMC is used to extract the sentences from the full text article of the given PMC ID(s).
Give_Sentences_PMC(PMCID, term)
PMCID |
|
term |
|
It will return a list object containing the sentences for query term from the given article.
S.Ramachandran
## Not run: Give_Sentences_PMC(PMC4039032, "atherosclerosis")
head_abbrev is used to find the expansion for which an abbreviation is used.
It will help to find the falsely matching abbreviations from the abstracts.
head_abbrev(limits, term, pmid, abs)
limits |
|
term |
|
pmid |
|
abs |
|
It will return a list.
S.Ramachandran
## Not run: head_abbrev(50, "AR", "16893912", myabs)
"HGNC"
Objects can be created by calls of the form new("HGNC", ...).
HGNCID: Object of class "character"
ApprovedSymbol: Object of class "character"
ApprovedName: Object of class "character"
Status: Object of class "character"
PreviousSymbols: Object of class "character"
Aliases: Object of class "character"
Chromosome: Object of class "character"
AccessionNumbers: Object of class "character"
RefSeqIDs: Object of class "character"
Dr.S.Ramachandran, Ab Rauf Shah
showClass("HGNC")
This dataset contains HGNC2UniprotID from UniProt and is used in the uniprotfun() function of this package to get the information of a gene from UniProt.
data(HGNC2UniprotID)
The format is: chr "HGNC2UniprotID"
The dataset contains HGNC2UniprotID
UniProt Consortium. "The universal protein resource (UniProt)." Nucleic acids research 36.suppl 1 (2008): D190-D195. http://www.uniprot.org/
data(HGNC2UniprotID)
This dataset contains data from the HUGO Gene Nomenclature Committee (HGNC), i.e. HGNC ID, HGNC approved symbol, approved name, gene synonyms, chromosome number, accession numbers and RefSeq IDs.
data(HGNCdata)
The format is: chr "HGNCdata"
The dataset contains HGNCdata
Povey, Sue, et al. "The HUGO gene nomenclature committee (HGNC)." Human genetics 109.6 (2001): 678-680. http://www.genenames.org/
data(HGNCdata)
It helps in searching and fetching the abstracts from E-utilities using PMIDs.
input_for_find_intro_conc_html(y, all)
y |
an S4 object of class Abstracts |
all |
logical; if TRUE, the title and authors are also included. |
It takes an S4 object as input and uses its PMIDs to fetch the abstracts from E-utilities. The output is used as input for find_intro_conc_html, as it contains clean data, i.e. abstracts only.
a list containing abstracts and PMID
S.Ramachandran, Jyoti Rani
http://eutils.ncbi.nlm.nih.gov/
## Not run: test = input_for_find_intro_conc_html(abs)
## here 'abs' is an S4 object of class Abstracts.
It is an auxiliary function for altnamesfun.
local_uniprotfun(y)
y |
|
It writes an output file named "x.txt" which will be used as input in altnamesfun().
S.Ramachandran, Jyoti Rani
## Not run: local_uniprotfun("TLR4")
## here it will generate an output file named "x.txt" containing
## the result for TLR4.
names_fn matches the gene symbols to gene names and extracts them from HGNC.
names_fn(genes, data, abs, filename, terms)
genes |
|
data |
|
abs |
|
filename |
|
terms |
|
It returns an output file containing genes with their corresponding gene names and sentences with co-occurrences if any.
S.Ramachandran
## Not run: names_fn(genes, data, diabetes_abs, "names", c("diabetic nephropathy", "DN"))
## End(Not run)
## 'genes' is the output of gene_atomization()
new_xmlreadabs is a modified form of xmlreadabs; it reads the abstracts downloaded or saved in XML format from PubMed. This function should be used for the recent XML format from PubMed.
new_xmlreadabs(file)
file |
an XML file saved from PubMed. |
an S4 object of class Abstracts containing journals, abstracts and PMID.
This function is useful with the recent format of XML files from PubMed. The older xmlreadabs will not work with the recent format.
S.Ramachandran
## Not run: xmlabs = new_xmlreadabs("easyPubMed_00001.txt")
## here "easyPubMed_00001.txt" is an XML file obtained from PubMed using the package easyPubMed
official_fn is used to fetch the sentences containing official gene symbols from HGNC.
official_fn(genes, abs, filename, terms)
genes |
|
abs |
|
filename |
|
terms |
|
It will return a text file containing corresponding official gene symbol.
S.Ramachandran
## Not run: official_fn(genes, diabetes_abs, "genes", c("diabetic nephropathy", "DN"))
## End(Not run)
## 'genes' is the output of gene_atomization()
pmids_to_abstracts is used to extract the abstract(s) of the query PMID(s).
pmids_to_abstracts(x, abs)
x |
|
abs |
|
It will return an S4 object of class Abstracts containing the abstracts for the query PMIDs.
S.Ramachandran
## Not run: pmids_to_abstracts(26878666, abs)
This function is used to extract the abstracts from the large corpus, excluding the given years, under a given theme. Its output is used in other functions like BWI() and genes_BWI().
previousabs_fn(yrs_to_exclude, theme, parentabs)
yrs_to_exclude |
|
theme |
|
parentabs |
|
It returns an S4 object containing the abstracts from the years other than those excluded.
S.Ramachandran
## Not run: test = previousabs_fn(as.character(2015:2010), "atherosclerosis", diabetesabs)
## here we will get the abstracts before 2010 for 'atherosclerosis'
## from the large corpus diabetesabs.
prevsymbol_fn will return the sentences containing previous symbols of the genes from the abstracts using HGNC data.
prevsymbol_fn(genes, data, abs, filename, terms)
genes |
|
data |
|
abs |
|
filename |
|
terms |
|
It returns a text file containing gene symbol with corresponding previous symbols.
S.Ramachandran
## Not run:
prevsymbol_fn(genes, data, diabetes_abs, "prevsym", c("diabetic nephropathy", "DN"))
## End(Not run)
It gives an overview of the abstracts in an S4 object of class Abstracts.
printabs(object)
object |
An S4 object of class Abstracts. |
prints the total number of abstracts in an S4 object with additional information.
S.Ramachandran
## Not run: printabs(myabs) ## here myabs is an S4 object of class Abstracts.
pubtator_function
is used to extract specific information from an abstract, such as genes, chemicals and diseases. Deprecated.
pubtator_function(x)
x |
numeric value PMID. |
pubtator_function
allows users to get information about 'Gene', 'Chemical' and 'Disease' for a given PMID. It uses the online tool PubTator from the R platform and removes redundancy from the output. It takes one PMID at a time; for multiple PMIDs, the user can use the lapply() function.
It returns a list object containing Gene, Chemical, Disease and PMID. The corresponding concept id numbers are joined by a '>' character. This is useful for further data mining.
S.Ramachandran, Jyoti Rani
Wei CH et. al., PubTator: a Web-based text mining tool for assisting Biocuration, Nucleic acids research, 2013, 41 (W1): W518-W522. doi: 10.1093/nar/gkt44
Wei CH et. al., Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts, Database (Oxford), bas041, 2012
Wei CH et. al., PubTator: A PubMed-like interactive curation system for document triage and literature curation, in Proceedings of BioCreative 2012 workshop, Washington DC, USA, 145-150, 2012
## Not run: test = pubtator_function(17922911) ## here pubtator_function() will extract the information from this given pmid.
pubtator_function_JSON
is used to extract specific information from an abstract, such as genes, chemicals and diseases.
pubtator_function_JSON(x)
x |
numeric value PMID. |
pubtator_function_JSON
allows users to get information about 'Gene', 'Chemical' and 'Disease' for a given PMID. It uses the online tool PubTator from the R platform and removes redundancy from the output. It takes one PMID at a time; for multiple PMIDs, the user can use the lapply() function.
It returns a list object containing Gene, Chemical, Disease and PMID. The corresponding concept id numbers are joined by a '>' character. This is useful for further data mining.
S.Ramachandran, Jyoti Rani
Wei CH et. al., PubTator: a Web-based text mining tool for assisting Biocuration, Nucleic acids research, 2013, 41 (W1): W518-W522. doi: 10.1093/nar/gkt44
Wei CH et. al., Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts, Database (Oxford), bas041, 2012
Wei CH et. al., PubTator: A PubMed-like interactive curation system for document triage and literature curation, in Proceedings of BioCreative 2012 workshop, Washington DC, USA, 145-150, 2012
pubtator_function()
## Not run: test = pubtator_function_JSON(17922911) ## here pubtator_function_JSON() will extract the information from ## this given pmid.
This function is used to collect the outputs of pubtator_function() after using lapply() over multiple PMIDs. It converts them into a table for easy reading and further analysis.
pubtator_result_list_to_table(x)
x |
here x is list output of pubtator_function(). |
It returns a table built from the pubtator_function output.
S.Ramachandran, Jyoti Rani
## Not run: test = pubtator_result_list_to_table(x) ##here x is the output of pubtator_function
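As a rough illustration of the list-to-table step described above, the sketch below stacks a list of per-PMID results into a data frame using base R only. The input is mock data with placeholder names and ids; the element names (Gene, Chemical, Disease, PMID) follow the return value described for pubtator_function(), but the exact structure of the real output and the helper name list_to_table_sketch are assumptions.

```r
# Mock results shaped like a list of per-PMID lists (assumed structure,
# placeholder names and ids, not real annotations).
results <- list(
  list(Gene = c("GENE_A>1", "GENE_B>2"), Chemical = "CHEM_C>3",
       Disease = "DIS_D>4", PMID = 101),
  list(Gene = "GENE_E>5", Chemical = character(0),
       Disease = "DIS_F>6", PMID = 102)
)

# Hypothetical helper: collapse each field to one string and stack the rows.
list_to_table_sketch <- function(x) {
  rows <- lapply(x, function(r) {
    data.frame(
      PMID     = r$PMID,
      Gene     = paste(r$Gene, collapse = "; "),
      Chemical = paste(r$Chemical, collapse = "; "),
      Disease  = paste(r$Disease, collapse = "; "),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

tab <- list_to_table_sketch(results)
print(tab)
```

Collapsing multiple annotations into one cell keeps one row per PMID; a long format with one row per annotation would be an equally reasonable choice.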
pubtator3_function
is used to extract specific information from an abstract, such as genes, chemicals and diseases.
pubtator3_function(x)
x |
numeric value PMID. |
pubtator3_function
allows users to get information about 'Gene', 'Chemical' and 'Disease' for a given PMID. It uses the online tool PubTator from the R platform and removes redundancy from the output. It takes one PMID at a time; for multiple PMIDs, the user can use the lapply() function.
It returns a list object containing Gene, Chemical, Disease and PMID. The corresponding concept id numbers are joined by a '>' character. This is useful for further data mining.
S.Ramachandran, Jyoti Rani
Wei, Chih-Hsuan, et al. "PubTator 3.0: an AI-powered literature resource for unlocking biomedical knowledge." Nucleic Acids Research (2024): gkae235.
Wei CH et. al., PubTator: a Web-based text mining tool for assisting Biocuration, Nucleic acids research, 2013, 41 (W1): W518-W522. doi: 10.1093/nar/gkt44
Wei CH et. al., Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts, Database (Oxford), bas041, 2012
Wei CH et. al., PubTator: A PubMed-like interactive curation system for document triage and literature curation, in Proceedings of BioCreative 2012 workshop, Washington DC, USA, 145-150, 2012
## Not run:
test = pubtator3_function(17922911)
## Here pubtator3_function() will extract the information for the given PMID.
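The one-PMID-at-a-time pattern with lapply() can be sketched as follows. To keep the example runnable offline, fetch_one_sketch is a hypothetical stand-in that returns fixed mock values instead of calling PubTator; the PMIDs and annotations are placeholders. The assumption that the text before '>' is the mention and the text after it is the concept id is ours, not confirmed by the package.

```r
# Stand-in for a per-PMID lookup; returns fixed mock values, no network call.
fetch_one_sketch <- function(pmid) {
  list(Gene = "GENE_A>1", Chemical = "CHEM_B>2", Disease = "DIS_C>3", PMID = pmid)
}

pmids <- c(101, 102, 103)               # placeholder PMIDs, not real records
res <- lapply(pmids, fetch_one_sketch)  # one call per PMID

# Split a '>'-joined string into its two parts.
parts <- strsplit(res[[1]]$Gene, ">", fixed = TRUE)[[1]]
print(parts)
```

With the real function, the same lapply() call would be wrapped around pubtator3_function() instead of the stand-in.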
readabs
will automatically read the abstracts from the PubMed file.
readabs(x)
x |
Text file of PubMed abstracts. (Abstracts downloaded from PubMed) |
The saved file from a general pubmed search as text file is read via readabs().
An S4 object of class "Abstracts", and a text file with tab delimited headers Journal, Abstract, PMID written with file name "newabs.txt".
S.Ramachandran
## Not run: readabs("pubmed_result.txt") ##here pubmed_result.txt is the text file of abstracts saved from PubMed.
readabsnew
will automatically read the abstracts from the pubmed text file.
readabsnew(x)
x |
Text file of PubMed abstracts. (Abstracts downloaded from PubMed) |
The saved file from a general pubmed search as text file is read via readabsnew().
An S4 object of class "Abstracts" and a text file with tab delimited headers Journal, Abstract, PMID written with file name "newabs.txt".
S.Ramachandran
## Not run: readabsnew("pubmed_result.txt") ##here pubmed_result.txt is the text file of abstracts saved from PubMed.
ready
will initiate the classes necessary for other functions.
ready()
This function is necessary to initiate the classes which are needed for the implementation of other functions.
classes
S. Ramachandran
## Not run: ready()
removeabs
will remove the abstracts from a corpus for a given term.
removeabs(object, x, y)
object |
An S4 object of class Abstracts |
x |
A character value |
y |
logical; if set to 'TRUE', the search will be case sensitive |
removeabs() finds the abstracts for the given term and removes them from the large set of abstracts. A text file named "dataout.txt" will be written containing the number of abstracts removed.
An S4 object of class Abstracts and a text file named "dataout.txt"
S.Ramachandran, Jyoti Rani
## Not run: removeabs(myabs, "atherosclerosis", TRUE)
removeabs
To remove abstracts of a term from the data.
removeabs
This function will search for the abstracts containing the given term and remove them from the data.
signature(object = "Abstracts")
As its name indicates, this method removes the abstracts from the data; the number of abstracts removed is written to the text file named "dataout.txt".
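The removal step can be pictured with a small base R sketch on a plain character vector. This is only an illustration under our own assumptions: the helper remove_term_sketch and the mock abstracts are hypothetical, matching is by literal substring, and nothing is written to "dataout.txt" here.

```r
abstracts <- c("Atherosclerosis in diabetes.",
               "Insulin signalling.",
               "Lipids and atherosclerosis.")

# Hypothetical helper: drop abstracts containing the term and count them.
remove_term_sketch <- function(text, term, case_sensitive = TRUE) {
  if (case_sensitive) {
    hit <- grepl(term, text, fixed = TRUE)
  } else {
    hit <- grepl(tolower(term), tolower(text), fixed = TRUE)
  }
  list(kept = text[!hit], removed = sum(hit))
}

strict <- remove_term_sketch(abstracts, "atherosclerosis", TRUE)
loose  <- remove_term_sketch(abstracts, "atherosclerosis", FALSE)
print(strict$removed)
print(loose$removed)
```

The case-sensitive call misses the capitalised "Atherosclerosis" in the first abstract, which is the practical difference the logical argument controls.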
searchabsL
will search for abstracts for the given term(s). Multiple combinations are allowed.
searchabsL(object, yr, include, restrict, exclude)
object |
An S4 object of class Abstracts |
yr |
character vector specifying the year(s) of the search. |
include |
character vector specifying the terms that the abstracts should contain. |
restrict |
character vector specifying the terms to which the search should be restricted. |
exclude |
character vector specifying the terms whose abstracts should be excluded from the search results. |
All arguments except object have "NONE" as their default. To export or write the result of searchabsL(), use the sendabs() function.
An object of class Abstracts satisfying the term combinations. In addition, a text file named "out.txt" is written, reporting the number of abstracts for the given query term combinations.
S.Ramachandran
## Not run:
searchabsL(myabs, include="term")
searchabsL(myabs, yr="2013")
searchabsL(myabs, restrict="term")
searchabsL(myabs, exclude="term")
searchabsL(myabs, include="term", exclude="term2")
## End(Not run)
## Here myabs is the object of class Abstracts containing data,
## "term" is the query term to be searched.
searchabsL
will automatically search the abstracts from the data for the given terms or their combination of several terms.
signature(object = "Abstracts")
searchabsL will search the abstracts for the given term or combinations of several terms. In this method the argument "include" uses the boolean operator 'OR' and is liberal, whereas 'restrict' and 'exclude' use the boolean operator 'AND' to specify additional filters. If restriction to individual terms is desired, the terms can be searched individually and the resulting sets of abstracts combined using the combineabs() function.
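One way to read the OR/AND description above is sketched below on a plain character vector. This is our interpretation, not the package's code: an abstract is kept if it matches any 'include' term, and the 'restrict' and 'exclude' filters are then applied on top. The helper names has_any and search_sketch and the mock abstracts are hypothetical.

```r
abstracts <- c(
  "Insulin resistance in obesity.",
  "Inflammation and atherosclerosis in mice.",
  "Obesity, inflammation and insulin signalling.",
  "A study of bone density."
)

# TRUE where an abstract contains ANY of the terms (case-insensitive, literal).
has_any <- function(text, terms) {
  hits <- lapply(terms, function(t) grepl(tolower(t), tolower(text), fixed = TRUE))
  Reduce(`|`, hits)
}

# include is OR-ed across its terms; restrict and exclude are AND-ed filters.
search_sketch <- function(text, include, restrict = "NONE", exclude = "NONE") {
  keep <- has_any(text, include)
  if (!identical(restrict, "NONE")) keep <- keep & has_any(text, restrict)
  if (!identical(exclude, "NONE"))  keep <- keep & !has_any(text, exclude)
  text[keep]
}

hits <- search_sketch(abstracts, include = c("insulin", "inflammation"),
                      exclude = "mice")
print(hits)
```

Using "NONE" as the sentinel default mirrors the argument defaults described for searchabsL().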
searchabsT
It is similar to searchabsL() but performs more specific search. It performs case sensitive search.
searchabsT(object, yr, include, restrict, exclude)
object |
An S4 object of class Abstracts |
yr |
character vector specifying the year(s) of the search. |
include |
character vector specifying the term(s) for which abstracts are to be searched. |
restrict |
character vector specifying the term(s) to which the search should be restricted. |
exclude |
character vector specifying the term(s) whose abstracts should be excluded from the search results. |
All arguments except object have "NONE" as their default. Use the sendabs() function to write the results to a tab delimited text file.
An object of class Abstracts meeting the term and the term combinations. A text file reporting the number of abstracts for the query terms and their combinations is also written with the filename "out.txt".
Dr.S.Ramachandran
## Not run:
searchabsT(myabs, yr="2013")
searchabsT(myabs, include="term")
searchabsT(myabs, restrict="term")
searchabsT(myabs, exclude="term")
searchabsT(myabs, yr="2013", include="term")
## End(Not run)
## Here myabs is an S4 object of class Abstracts containing the abstracts to search,
## "term" is the query term to be searched.
searchabsT
Searching abstracts
searchabsT
will perform a specific search for the given term.
signature(object = "Abstracts")
It is similar to the searchabsL method but more specific: it is case sensitive, whereas searchabsL is not.
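The effect of case sensitivity is easy to see with base R's grepl() on two mock sentences. The sketch only illustrates the general idea; it does not show how searchabsT or searchabsL match terms internally.

```r
abstracts <- c("The CAT gene encodes catalase.", "A cat model of asthma.")

# Case-sensitive literal match: only the upper-case gene symbol is found.
sensitive <- grepl("CAT", abstracts, fixed = TRUE)

# Case-insensitive match: both abstracts are found.
insensitive <- grepl("cat", tolower(abstracts), fixed = TRUE)

print(sensitive)
print(insensitive)
```

For gene symbols that are also common words, the case-sensitive form avoids a large number of irrelevant hits.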
sendabs
will send the abstracts into a tab delimited text file with the fields Journal, Abstract, and PMID.
sendabs(object, x)
object |
An S4 object of class 'Abstracts' |
x |
"filename.txt" to write the abstracts |
A general writing function for object of class 'Abstracts'
A tab delimited text file with headers Journal, Abstract, PMID.
S.Ramachandran, Jyoti Rani
## Not run:
sendabs(myabs, "myabs.txt")
## Here myabs is the S4 object of class 'Abstracts' and
## 'myabs.txt' is the file where abstracts will be written.
sendabs
will write the data of an object of class 'Abstracts' into a tab delimited text file with headers Journal, Abstract, and PMID.
signature(object = "Abstracts")
sendabs will send the data into a text file. It writes a tab delimited text file for PubMed abstracts containing Journal, Abstract, and PMID.
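A tab delimited file with the three headers can be produced in base R as sketched below. The data frame stands in for the slots of an 'Abstracts' object and holds mock values; the use of write.table() is our assumption about a reasonable way to do this, not a statement about how sendabs() is implemented.

```r
# Mock data standing in for the Journal, Abstract and PMID slots.
abs_df <- data.frame(
  Journal  = c("J Example. 2013;1(1):1-2.", "J Sample. 2014;2(3):4-5."),
  Abstract = c("First mock abstract.", "Second mock abstract."),
  PMID     = c(101, 102),
  stringsAsFactors = FALSE
)

out_file <- tempfile(fileext = ".txt")   # temporary path for the sketch
write.table(abs_df, file = out_file, sep = "\t",
            row.names = FALSE, quote = FALSE)

# Read the file back to confirm the three tab-separated columns survive.
back <- read.delim(out_file, stringsAsFactors = FALSE, quote = "")
print(names(back))
```

Disabling quoting keeps the file readable in a spreadsheet, but it assumes the abstracts themselves contain no tab characters.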
SentenceToken
will tokenize abstracts into individual sentences.
SentenceToken(x)
x |
is a character string; could be an output from paste |
This function is necessary for extracting sentences from abstracts and is used by the contextSearch function. The tokenization principle follows the overall strategy described in contextSearch.
A character vector of sentences
S.Ramachandran
## Not run: SentenceToken(x)
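A very simple form of sentence tokenization can be sketched with a regular expression. The rule used here (split after '.', '!' or '?' when followed by whitespace and an upper-case letter) is our own simplification and not the strategy described in contextSearch; the helper name sentence_split_sketch is hypothetical, and abbreviations such as "e.g. The" would be split incorrectly.

```r
sentence_split_sketch <- function(x) {
  # Mark a break after ., ! or ? when whitespace and an upper-case letter follow.
  marked <- gsub("([.!?])\\s+(?=[A-Z])", "\\1\n", x, perl = TRUE)
  parts <- strsplit(marked, "\n", fixed = TRUE)[[1]]
  parts[nzchar(parts)]
}

sents <- sentence_split_sketch(
  "Insulin lowers glucose. It acts on muscle. Is that all?"
)
print(sents)
```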
space_quasher
will automatically remove extra spaces between words, so that only one space is left between any pair of words.
space_quasher(x)
x |
x is a text with single or multiple sentences given within double quotes. |
The extra spaces between words in sentences are quashed to one via space_quasher().
Sentence(s) in which extra spaces between any pair of words are quashed to one.
S.Ramachandran
## Not run:
space_quasher("I am a ghostbuster. I have the tools required to hunt ghosts")
## Extra spaces between words, if any, are reduced to one.
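The space-collapsing idea can be sketched in base R with gsub(). The helper name squash_spaces_sketch is hypothetical, and trimming leading and trailing whitespace is an extra step we add for tidiness; whether space_quasher() does the same is not stated above.

```r
squash_spaces_sketch <- function(x) {
  x <- gsub("[ \t]+", " ", x)   # collapse runs of spaces or tabs to one space
  trimws(x)                     # drop leading and trailing whitespace
}

clean <- squash_spaces_sketch("I  am a   ghostbuster.  I have the    tools. ")
print(clean)
```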
subabs
will automatically extract the sub-abstracts from large set of abstracts.
subabs(object, start, end)
object |
An S4 object of class Abstracts |
start |
integer, specifies starting limit of the range to perform search |
end |
integer, specifies end limit of the range to perform search |
Use this function to extract a subset of abstracts from a large number of abstracts into a separate object.
An R object of class 'Abstracts' containing the extracted abstracts meeting a given range.
Jyoti Rani, S.Ramachandran
## Not run:
subabs(myabs, 1, 5)
## Here 'myabs' is an S4 object of class 'Abstracts',
## 1 and 5 are the start and end respectively.
subabs
subabs will extract the sub abstracts corresponding to a given range, from the whole data.
signature(object = "Abstracts")
From an S4 object of class 'Abstracts' the subabs function is able to extract the abstracts corresponding to a given range.
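Range extraction from an S4 object with parallel slots can be sketched as below. To avoid clashing with the package's own class, the sketch defines a hypothetical class AbsSketch with the same three slot names (Journal, Abstract, PMID); the helper sub_range_sketch and the mock corpus are likewise illustrative only.

```r
library(methods)

# Hypothetical stand-in class with the same slot names as 'Abstracts'.
setClass("AbsSketch", representation(Journal = "character",
                                     Abstract = "character",
                                     PMID = "numeric"))

# Take the same index range from every slot so the records stay aligned.
sub_range_sketch <- function(object, start, end) {
  idx <- seq(start, end)
  new("AbsSketch",
      Journal  = object@Journal[idx],
      Abstract = object@Abstract[idx],
      PMID     = object@PMID[idx])
}

corpus <- new("AbsSketch",
              Journal  = paste("Journal", 1:6),
              Abstract = paste("Abstract text", 1:6),
              PMID     = as.numeric(101:106))

first3 <- sub_range_sketch(corpus, 1, 3)
print(first3@PMID)
```

Indexing all slots with the same vector is what keeps each journal, abstract and PMID together after subsetting.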
It is used to extract a given range of abstracts from the large corpus.
subsetabs(object, indices)
object |
|
indices |
|
It returns an S4 object of extracted Abstracts.
S. Ramachandran.
## Not run:
test = subsetabs(diabetesabs, 1:50)
## Here we want to extract abstracts 1 to 50
## from the large corpus of diabetes.
subsetabs
is used to extract a subset of Abstracts from the large corpus. Its output is used in other functions like currentabs_fn and previousabs_fn.
signature(object = "Abstracts")
subsetabs will divide the large corpus into subset.
The lsa package takes a "Term Document Matrix" as input, so a 'tdm' must be created for the Abstracts.
tdm_for_lsa
does this: it finds the frequency of each given term in each abstract, with each abstract considered a separate document, and prepares a term document matrix of the terms in the 'Abstracts' corpus.
tdm_for_lsa(object, y)
object |
An S4 object of class 'Abstracts' |
y |
a character vector specifying the terms |
A Term Document Matrix (numerical matrix) containing the raw frequencies of the given terms in each abstract.
Jyoti Rani
## Not run:
y = c("insulin", "inflammation", "obesity")
tdm_for_lsa(myabs, y)
## End(Not run)
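The shape of such a term document matrix can be sketched in base R: one row per term, one column per abstract, each cell holding a raw count. The sketch uses mock abstracts and counts literal, case-insensitive substring matches, which is our assumption; the helper count_term is hypothetical, and a substring count would also pick up longer words that contain the term.

```r
abstracts <- c("Insulin and obesity. Insulin resistance.",
               "Inflammation in obesity.",
               "Bone density study.")
terms <- c("insulin", "inflammation", "obesity")

# Count occurrences of one term in every abstract.
count_term <- function(term, text) {
  m <- gregexpr(term, tolower(text), fixed = TRUE)
  vapply(m, function(hit) sum(hit > 0), integer(1))
}

# vapply gives abstracts x terms; transpose to terms x documents.
tdm <- t(vapply(terms, count_term, integer(length(abstracts)), text = abstracts))
colnames(tdm) <- paste0("doc", seq_along(abstracts))
print(tdm)
```

A matrix in this orientation (terms in rows, documents in columns) is the layout that latent semantic analysis routines typically expect.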
uniprotfun
will access the UniProt data for a given gene as per HGNC approved gene symbols. Deprecated.
uniprotfun(y)
y |
a HGNC approved gene symbol as character |
This function retrieves data from the UniProt. At present uniprotfun() works with only HGNC approved gene symbols.
A text file written with filename as the 'query' name suffixed with .txt
S.Ramachandran
## Not run: uniprotfun("SIRT1")
whichcluster
is used to get the cluster in which a given word (term) occurs.
whichcluster(clusterobject, y)
clusterobject |
an R object containing the clusters of words output by |
y |
a character string of query terms. |
a list containing the number of cluster under which given term occurs.
S.Ramachandran
## Not run:
test <- whichcluster(x, "diabetes")
## Here x is an R object output from the wordscluster function
## and "diabetes" is the term for which the cluster number is to be searched.
## End(Not run)
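Looking up the cluster that holds a term can be sketched as a membership test over a list. The sketch assumes the clusters are stored as a list of character vectors, which may differ from the actual wordscluster output; the helper which_cluster_sketch and the mock clusters are hypothetical.

```r
# Mock clusters: each list element is one cluster of related words.
clusters <- list(c("diabet", "diabetes", "diabetic"),
                 c("insulin", "insulinemia"),
                 c("obese", "obesity"))

# Return the position(s) of the cluster(s) that contain the term.
which_cluster_sketch <- function(cl, term) {
  which(vapply(cl, function(words) term %in% words, logical(1)))
}

hit <- which_cluster_sketch(clusters, "diabetes")
print(hit)
```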
word_associations
will automatically extract associated words for a given word, namely the words immediately to the left and to the right. The given word is usually in the middle, except where it occurs at the start or the end of the sentence.
word_associations(term, abs)
term |
is a single word |
abs |
an S4 object of class Abstracts |
Certain words are qualified by authors in various ways, for example physical therapy, gene therapy, etc. This function is useful for extracting these qualified words in the form of available associated words. It is useful for preparing terms to be given to co_occurrence_fn(). There could be other uses as well.
comp1 |
A list of all the word pairs in a given set of abstracts. |
S. Ramachandran
Rani J, Shah AB, Ramachandran S. pubmed.mineR: an R package with text-mining algorithms to analyse PubMed abstracts. J Biosci. 2015 Oct;40(4):671-82. PubMed PMID: 26564970.
Give_Sentences
## Not run: word_associations("therapy", myabs)
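The idea of collecting the immediate left and right neighbours of a word can be sketched on a single sentence. The helper neighbours_sketch is hypothetical; it lower-cases the text and strips punctuation, which are our simplifications, and it returns NA where the word sits at the start or end of the sentence.

```r
neighbours_sketch <- function(term, sentence) {
  # Remove punctuation, lower-case, and split on whitespace.
  words <- strsplit(tolower(gsub("[[:punct:]]", "", sentence)), "\\s+")[[1]]
  pos <- which(words == term)
  n <- length(words)
  data.frame(
    left  = ifelse(pos > 1, words[pmax(pos - 1, 1)], NA_character_),
    term  = words[pos],
    right = ifelse(pos < n, words[pmin(pos + 1, n)], NA_character_),
    stringsAsFactors = FALSE
  )
}

res <- neighbours_sketch("therapy",
                         "Gene therapy and physical therapy improved outcomes.")
print(res)
```

Here the left-hand neighbours ("gene", "physical") are the qualifiers of interest, which is the kind of pairing the text above describes.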
word_atomizations
will automatically break the whole text into words and rank them according to their frequency of occurrence.
word_atomizations(m)
m |
An S4 object of class Abstracts |
word_atomizations() will break down the whole text into words after removing extra white space, punctuation marks and very common English words.
A text file containing words with their frequencies
S. Ramachandran, Jyoti Sharma
## Not run: word_atomizations(myabs) ## here myabs is the object containing abstracts.
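The steps described above (strip punctuation, drop common words, count and rank) can be sketched in base R. The stop-word list here is a tiny illustrative one chosen by us, not the list the package uses, and the mock text is not real data.

```r
text <- c("Insulin resistance and obesity.", "Obesity and the insulin pathway.")
stop_words <- c("and", "the", "of", "in", "a")   # tiny illustrative list

# Replace punctuation with spaces, lower-case, and split into words.
words <- unlist(strsplit(tolower(gsub("[[:punct:]]", " ", text)), "\\s+"))

# Drop empty strings and common words, then rank by frequency.
words <- words[nzchar(words) & !(words %in% stop_words)]
freq <- sort(table(words), decreasing = TRUE)
print(freq)
```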
wordscluster
is used to cluster words, using the Levenshtein distance concept, that occur together in combination with 'prefixes', 'suffixes' or other compound words. The first word, usually the shortest and in many cases a drastically 'stemmed' word, is taken as the representative of that cluster.
wordscluster(lower, upper)
lower |
lower limit for characters in word. Default = 5. |
upper |
upper limit of characters in word. Default = 30 |
This function is useful for dampening the 'explosion' of words output from word_atomizations. This step enables easy examination of the terms.
A list object of words clustered together and a text file named "resulttable.txt" with the columns cluster number, cluster size and representatives of clusters.
The function may run faster when the lower limit is reduced, but this 'risks' producing plenty of 'decoy' situations, although these are very rare. Decoy situations: some 'words' that are partly identical to other, smaller words will run away with the smaller words. This creates an unfavorable situation in which the generated 'clusters' of words become difficult to interpret. It can be minimized by increasing the lower limit of word length, however at the cost of lowering computational speed. An example: the word hypercholesterolemia runs away with the smaller word 'lester', which could be another name. In this instance increasing the lower limit will be more useful. Words longer than 30 characters are usually names of chemical compounds in the IUPAC system of nomenclature.
S.Ramachandran, Jyoti Rani
whichcluster word_atomizations
## Not run:
test = wordscluster(5, 10)
## Here it will start clustering words with a minimum length of 5 characters
## and a maximum of 10 characters.
## End(Not run)
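The clustering idea and the 'decoy' effect can be illustrated with base R's utils::adist() in partial-matching mode. This greedy sketch is not the package's algorithm: the helper cluster_words_sketch, its max_dist parameter and the word list are our own assumptions. It takes the shortest remaining word as representative and pulls in every word that contains an approximate match to it.

```r
cluster_words_sketch <- function(words, lower = 5, upper = 30, max_dist = 0) {
  # Keep words within the length limits, shortest first.
  words <- unique(words[nchar(words) >= lower & nchar(words) <= upper])
  words <- words[order(nchar(words), words)]
  clusters <- list()
  while (length(words) > 0) {
    rep_word <- words[1]   # representative: shortest remaining word
    # Edit distance from rep_word to the best-matching substring of each word.
    d <- utils::adist(rep_word, words, partial = TRUE)[1, ]
    members <- words[d <= max_dist]
    clusters[[rep_word]] <- members
    words <- setdiff(words, members)
  }
  clusters
}

words <- c("diabetes", "diabetic", "diabet", "prediabetes",
           "lester", "hypercholesterolemia", "insulin")
cl  <- cluster_words_sketch(words)             # 'lester' captures the long word
cl7 <- cluster_words_sketch(words, lower = 7)  # 'lester' is too short to enter
print(cl)
```

With the default lower limit the short word 'lester' absorbs 'hypercholesterolemia', reproducing the decoy described above; raising the lower limit to 7 removes 'lester' and the long word forms its own cluster.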
wordsclusterview
is used to view the words in the clusters formed by the wordscluster function.
wordsclusterview(words_cluster, all)
words_cluster |
an R object containing output of wordscluster |
all |
is logical and default is FALSE, if set to TRUE includes those with one member word. |
The first 5 words, 5 words near the median and 5 words at the tail end are shown for clusters with more than 15 members. For clusters with fewer than 15 members, all the words are written to the output.
It returns a text file named word_cluster_view.txt
S.Ramachandran, Jyoti Rani
## Not run:
test = wordsclusterview(cluster)
## Here cluster is the output from wordscluster.
## End(Not run)
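The head, median and tail preview described above can be sketched as an index calculation. The helper preview_cluster_sketch is hypothetical; how the package centres the five 'median' words and how it treats a cluster of exactly 15 members are not stated, so the choices below (two words either side of the middle position, and showing a 15-member cluster in full) are assumptions.

```r
preview_cluster_sketch <- function(words) {
  n <- length(words)
  if (n <= 15) return(words)          # small clusters are shown in full
  mid <- ceiling(n / 2)               # middle position of the cluster
  idx <- c(1:5, (mid - 2):(mid + 2), (n - 4):n)
  words[idx]
}

big <- paste0("w", 1:21)              # mock cluster with 21 member words
shown <- preview_cluster_sketch(big)
print(shown)
```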
xmlgene_atomizations
is used to fetch the list of genes from the xml abstracts. Deprecated.
xmlgene_atomizations(m)
m |
an S4 object of class Abstracts, output from xmlreadabs. |
a list containing genes from the text with their frequency of occurrence.
S.Ramachandran, Jyoti Sharma
## Not run: test = xmlgene_atomizations(xmlabs) ## xmlabs is an S4 object of class Abstracts i.e. output of xmlreadabs
xmlgene_atomizations_new
is used to fetch the list of genes from the xml abstracts.
xmlgene_atomizations_new(m)
m |
an S4 object of class Abstracts, output from xmlreadabs. |
a list containing genes from the text with their frequency of occurrence.
S.Ramachandran, Jyoti Sharma
## Not run:
test = xmlgene_atomizations_new(xmlabs)
## xmlabs is an S4 object of class Abstracts, i.e. the output of xmlreadabs.
xmlreadabs
is a modified form of readabs that reads the abstracts downloaded/saved in XML format from PubMed. This helps to give cleaner and better results after preprocessing, i.e. word_atomizations, wordscluster, etc.
xmlreadabs(file)
file |
an XML file saved from PubMed. |
an S4 object of class Abstracts containing journals, abstracts and PMID.
S.Ramachandran
## Not run: xmlabs = xmlreadabs("pubmed_result.xml") ## here "pubmed_result.xml" is an xml format file downloaded from PubMed.
xmlword_atomizations
is used to process the abstracts from PubMed in XML format.
xmlword_atomizations(m)
m |
an S4 object of class Abstracts resulted from xmlreadabs. |
a list containing words from the text with their frequencies.
xmlword_atomizations
cannot work on output of readabs.
S. Ramachandran
## Not run: test = xmlword_atomizations(xmlabs) ## here xmlabs is an S4 object i.e. output of xmlreadabs
Yearwise
reports the no. of abstracts in a year.
Yearwise(object, year)
object |
An S4 object of class Abstracts. |
year |
a character vector specifies the year. |
Yearwise() is useful to find the no. of abstracts for the given year.
A text file containing the no. of abstracts for given Year(s)
Dr.S.Ramachandran
## Not run:
Yearwise(myabs, "2011")
## or
Yearwise(myabs, c("2011", "2013", "2009"))
## End(Not run)
## Here myabs is the object containing PubMed abstracts.
Yearwise
Year wise extraction of Abstracts
Yearwise
will report the abstracts for given year(s).
signature(object = "Abstracts")
This method "Yearwise" is written to fetch the abstracts yearly.
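Counting abstracts per year can be sketched as below, under the assumption that the publication year appears in each journal citation string; where Yearwise() actually reads the year from is not stated above. The journal strings are mock values, and a plain substring match like this one would also count a year-like number that appears elsewhere in the citation, such as a page range.

```r
# Mock journal citation strings, one per abstract.
journals <- c("J Example. 2011 Jan;1(1):1-2.",
              "J Sample. 2013 Mar;2(3):4-5.",
              "J Example. 2011 Jun;1(6):7-8.",
              "J Other. 2009 Dec;5(2):9-10.")
years <- c("2011", "2013", "2009")

# For each year, count the citations that mention it.
counts <- vapply(years,
                 function(y) sum(grepl(y, journals, fixed = TRUE)),
                 integer(1))
print(counts)
```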