Title: | Wrappers Around Stanford CoreNLP Tools |
---|---|
Description: | Provides a minimal interface for applying annotators from the 'Stanford CoreNLP' Java library. Methods are provided for tasks such as tokenisation, part-of-speech tagging, lemmatisation, named entity recognition, coreference detection and sentiment analysis. |
Authors: | Taylor Arnold, Lauren Tilton |
Maintainer: | Taylor Arnold <[email protected]> |
License: | GPL-2 |
Version: | 0.4-3 |
Built: | 2024-11-28 06:50:57 UTC |
Source: | CRAN |
Parsed via the Stanford CoreNLP Java Library
annoEtranger
an annotation object
Taylor Arnold, 2015-06-03
Parsed via the Stanford CoreNLP Java Library
annoHp
an annotation object
Taylor Arnold, 2015-06-03
Runs the CoreNLP annotators for the text contained in a given file.
The details for which annotators to run and how to run them are
specified in the properties file loaded in via the initCoreNLP
function (which must be run prior to any annotation).
annotateFile(file, format = c("obj", "xml", "text"), outputFile = NA, includeXSL = FALSE)
file |
a string giving the location of the file to be loaded. |
format |
the desired output format. Option |
outputFile |
character string indicating where to put the output. If set to NA, the output will be returned by the function. |
includeXSL |
boolean. Whether the xml style sheet should be included
in the output. Only used if format is |
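A minimal sketch of calling annotateFile on a small text file, assuming the coreNLP package is installed and the CORENLP_HOME environment variable points at the downloaded Java libraries. The guard variable canRun and the temporary file are illustrative only; when the prerequisites are missing, the sketch skips the annotation instead of failing.

```r
# Write a short text to a temporary file so the example is self-contained.
txtFile <- tempfile(fileext = ".txt")
writeLines("Mother died today. Or, maybe, yesterday; I can't be sure.", txtFile)

# Assumption: the annotation is only attempted when the package is installed
# and CORENLP_HOME is set; 'canRun' is a hypothetical guard, not package API.
canRun <- requireNamespace("coreNLP", quietly = TRUE) &&
  nzchar(Sys.getenv("CORENLP_HOME"))

if (canRun) {
  coreNLP::initCoreNLP()  # must be run prior to any annotation
  anno <- coreNLP::annotateFile(txtFile, format = "obj")
  print(anno)
} else {
  message("coreNLP or CORENLP_HOME not available; skipping the annotation")
}
```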
Runs the CoreNLP annotators over a given string of text. The details
for which annotators to run and how to run them are specified in the
properties file loaded in via the initCoreNLP function (which must be
run prior to any annotation).
annotateString(text, format = c("obj", "xml", "text"), outputFile = NA, includeXSL = FALSE)
text |
a vector of strings for which an annotation is desired. Will be collapsed to length 1 using new line characters prior to the annotation. |
format |
the desired output format. Option |
outputFile |
character string indicating where to put the output. If set to NA, the output will be returned by the function. |
includeXSL |
boolean. Whether the xml style sheet should be included
in the output. Only used if format is |
## Not run:
initCoreNLP()
sIn <- "Mother died today. Or, maybe, yesterday; I can't be sure."
annoObj <- annotateString(sIn)
## End(Not run)
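The text argument is described as being collapsed to length 1 using new line characters. The base R sketch below shows what that collapsing presumably looks like; it is an assumption about equivalent behaviour, not the package's internal code.

```r
# A character vector with one element per sentence.
text <- c("Mother died today.", "Or, maybe, yesterday; I can't be sure.")

# Assumed equivalent of the collapsing step: join elements with "\n".
collapsed <- paste(text, collapse = "\n")

length(collapsed)  # a single string is what would be annotated
cat(collapsed, "\n")
```

Passing either text or collapsed to annotateString should therefore describe the same input, if the assumption above holds.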
The coreNLP package does not supply the raw Java files provided by the Stanford NLP Group, as they are quite large. This function downloads the libraries for you, by default into the directory where the package was installed.
downloadCoreNLP(outputLoc, type = c("base", "chinese", "english", "french", "german", "spanish"))
outputLoc |
a string giving the location where the files are to be downloaded. If missing, the function will try to download the files into the directory where the package was originally installed. |
type |
type of files to download. The base package, installed by default, is required. Other jars include chinese, german, and spanish. These will be installed in addition to the base package. |
If you want to download the files manually, simply unzip them and
place them in the directory given by system.file("extdata", package = "coreNLP").
## Not run:
downloadCoreNLP()
downloadCoreNLP(type = "spanish")
## End(Not run)
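For a manual install, it can help to check where the package's extdata directory lives and whether any jar files have landed there. The sketch below uses only base R; the variable names extDir and jars are illustrative, and the expectation that the libraries appear as .jar files somewhere under that directory is an assumption.

```r
# Locate the package's extdata directory; system.file() returns ""
# when the package (or the directory) cannot be found.
extDir <- system.file("extdata", package = "coreNLP")

if (nzchar(extDir)) {
  # Assumption: the unzipped libraries show up as .jar files below extdata.
  jars <- list.files(extDir, pattern = "\\.jar$", recursive = TRUE)
  message("Found ", length(jars), " jar file(s) under ", extDir)
} else {
  message("coreNLP extdata directory not found; is the package installed?")
}
```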
Returns a data frame containing all coreferences detected in the text.
getCoreference(annotation)
annotation |
an annotation object |
getCoreference(annoHp)
Returns a data frame of the dependencies of an annotation.
getDependency(annotation, type = c("CCprocessed", "basic", "collapsed"))
annotation |
an annotation object |
type |
the class of dependency desired |
getDependency(annoEtranger)
getDependency(annoHp)
Returns a data frame containing all OpenIE triples.
getOpenIE(annotation)
annotation |
an annotation object |
getOpenIE(annoHp)
Returns a character vector of the parse trees.
Mostly used for visualization; the output of getToken will generally
be more convenient for manipulating in R.
getParse(annotation)
annotation |
an annotation object |
getParse(annoEtranger)
Returns a data frame of the sentiment scores from an annotation
getSentiment(annotation)
annotation |
an annotation object |
getSentiment(annoEtranger)
getSentiment(annoHp)
Returns a data frame of the tokens from an annotation object.
getToken(annotation)
annotation |
an annotation object |
getToken(annoEtranger)
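Because getToken returns an ordinary data frame, the usual base R tools apply to it. The sketch below works on a small hand-written stand-in rather than real package output: the POS column is the one used elsewhere in this manual, while the sentence and token column names and all the values are assumptions made for illustration.

```r
# Hypothetical stand-in for the result of getToken(); values are hand-written.
tok <- data.frame(
  sentence = c(1L, 1L, 1L, 1L),
  token    = c("Mother", "died", "today", "."),
  POS      = c("NN", "VBD", "NN", "."),
  stringsAsFactors = FALSE
)

# Frequency of each part-of-speech tag.
posCounts <- table(tok$POS)
print(posCounts)

# Tokens carrying one of the Penn Treebank noun tags.
nouns <- tok$token[tok$POS %in% c("NN", "NNS", "NNP", "NNPS")]
print(nouns)
```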
This must be run prior to calling any other CoreNLP functions. It may be called multiple times in order to specify a different parameter set, but note that if you use a different configuration during the same R session it must have a unique name.
initCoreNLP(libLoc, type = c("english", "english_all", "english_fast", "arabic", "chinese", "french", "german", "spanish"), parameterFile = NULL, mem = "4g")
libLoc |
a string giving the location of the CoreNLP java files. This should point to a directory which contains, for example the file "stanford-corenlp-*.jar", where "*" is the version number. If missing, the function will try to find the library in the environment variable CORENLP_HOME, and otherwise will fail. |
type |
type of model to load. Ignored if parameterFile is set. |
parameterFile |
the path to a parameter file. See the CoreNLP documentation for an extensive list of options. If missing, the package will simply specify a list of standard annotators and otherwise only use default values. |
mem |
a string giving the amount of memory to be assigned to the rJava
engine. For example, "6g" assigns 6 gigabytes of memory. At least
2 gigabytes are recommended for running the CoreNLP package. On a
32-bit machine, where this is not possible, setting "1800m" may also
work. This option will only have an effect the first
time |
## Not run:
initCoreNLP()
sIn <- "Mother died today. Or, maybe, yesterday; I can't be sure."
annoObj <- annotateString(sIn)
## End(Not run)
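The libLoc argument falls back to the CORENLP_HOME environment variable when it is missing, and fails otherwise. The sketch below illustrates that lookup order in base R. The helper resolveLibLoc is hypothetical and is not part of the package, and both paths are placeholders.

```r
# Hypothetical helper mirroring the documented fallback; not package API.
resolveLibLoc <- function(libLoc = NULL) {
  if (!is.null(libLoc)) {
    return(libLoc)                      # an explicit location wins
  }
  home <- Sys.getenv("CORENLP_HOME")    # "" when the variable is unset
  if (!nzchar(home)) {
    stop("libLoc was not given and CORENLP_HOME is not set")
  }
  home
}

# Placeholder location, used only to exercise the fallback.
Sys.setenv(CORENLP_HOME = file.path(tempdir(), "stanford-corenlp"))
resolveLibLoc()                 # taken from CORENLP_HOME
resolveLibLoc("/opt/corenlp")   # explicit argument takes precedence
```

Setting CORENLP_HOME once, for instance in a startup file, would let initCoreNLP() be called without arguments in later sessions.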
Loads a properly formatted XML file output by the CoreNLP
library into an annotation object in R.
loadXMLAnnotation(file, encoding = "unknown")
file |
connection or character string giving the file name to load |
encoding |
encoding to be assumed for input strings. It is used to mark
character strings as known to be in Latin-1 or UTF-8: it is
not used to re-encode the input. Passed to |
Returns an annotation object from a character vector containing
the XML. Not exported; use loadXMLAnnotation instead.
parseAnnoXML(xml)
xml |
character vector containing the xml file from an annotation |
Print a summary of an annotation object
## S3 method for class 'annotation'
print(x, ...)
x |
an annotation object |
... |
other arguments. Currently unused. |
print(annoEtranger)
Maps a character vector of English Penn Treebank part-of-speech tags to the universal tagset codes. This provides a reduced set of 12 tags and a better cross-linguistic model of parts of speech.
universalTagset(pennPOS)
pennPOS |
a character vector of penn tags to match |
tok <- getToken(annoEtranger)
cbind(tok$POS, universalTagset(tok$POS))
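To make the idea of a tag mapping concrete, the sketch below implements a small lookup from Penn Treebank tags to universal tags in base R. The table pennToUniversal is a partial, illustrative subset, the helper mapTags is hypothetical, and sending unlisted tags to "X" is an assumption; none of this is the package's own table or code.

```r
# Partial, illustrative Penn Treebank -> universal tag table (not exhaustive).
pennToUniversal <- c(
  NN = "NOUN", NNS = "NOUN", NNP = "NOUN",
  VB = "VERB", VBD = "VERB", VBZ = "VERB",
  JJ = "ADJ",  RB = "ADV",   PRP = "PRON",
  DT = "DET",  IN = "ADP",   CD = "NUM",
  CC = "CONJ", RP = "PRT",   "." = "."
)

# Hypothetical helper: look up each tag, falling back to "X" when unlisted.
mapTags <- function(pennPOS) {
  out <- unname(pennToUniversal[pennPOS])
  out[is.na(out)] <- "X"
  out
}

penn <- c("NN", "VBD", "NN", ".", "UH")
cbind(penn, universal = mapTags(penn))
```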