Title: | Text Mining Distributed Corpus Plug-in |
---|---|
Description: | A plug-in for the text mining framework tm to support text mining in a distributed way. The package provides a convenient interface for handling distributed corpus objects based on distributed list objects. |
Authors: | Ingo Feinerer [aut], Stefan Theussl [aut, cre] |
Maintainer: | Stefan Theussl <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.2-10 |
Built: | 2025-01-03 07:05:57 UTC |
Source: | CRAN |
Data structures and operators for distributed corpora.
DCorpus(x,
        readerControl = list(reader = reader(x), language = "en"),
        storage = NULL, keep = TRUE, ...)

## S3 method for class 'DCorpus'
as.VCorpus(x)

as.DCorpus(x, storage = NULL, ...)
x |
for |
readerControl |
A list with the named components |
storage |
The storage subsystem to use with the DCorpus. Currently two types of storage are supported: local disk storage using the Local File System (LFS) and the Hadoop Distributed File System (HDFS). Default: 'LFS'. |
keep |
Should revisions be used when operating on the
|
... |
Optional arguments for the |
When constructing a distributed corpus, the input source is
extracted via the supplied reader and stored on the given file
system (argument storage). While the data set resides on the
corresponding storage (e.g., HDFS), only a symbolic representation is
held in R (a so-called DList), which allows access to the corpus via
the corresponding DList methods. Since the memory available to the
distributed corpus is restricted only by the disk space available in
the given storage (and not by main memory, as in a standard tm
corpus), a set of so-called revisions, i.e., stages of the (processed)
corpus, is also stored by default. Revisions can be turned off later
using the keepRevisions() replacement function.
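The idea of holding only a symbolic representation in R while the documents live on storage can be sketched in base R. This is a toy illustration under stated assumptions, not how tm.plugin.dc or its DList backend is implemented: the helpers store_docs() and fetch_doc() and the class name toy_handle are hypothetical, and a temporary directory stands in for the storage.

```r
# Toy illustration only: store_docs() and fetch_doc() are hypothetical
# helpers, not functions of tm.plugin.dc.
store_docs <- function(docs, dir = tempfile("toy_storage_")) {
  dir.create(dir)
  paths <- file.path(dir, paste0(names(docs), ".txt"))
  names(paths) <- names(docs)
  # Write every document to its own file on "storage".
  for (id in names(docs)) writeLines(docs[[id]], paths[[id]])
  # Only the file locations stay in memory, not the text itself.
  structure(list(paths = paths), class = "toy_handle")
}

# Read a single document back from storage on demand.
fetch_doc <- function(handle, id) readLines(handle$paths[[id]])

docs <- list(doc1 = "Crude oil prices rose.", doc2 = "OPEC met in Vienna.")
h <- store_docs(docs)
fetch_doc(h, "doc2")
```

The handle stays small regardless of corpus size, which is why disk space rather than main memory becomes the limiting resource in this kind of design.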
The constructed corpus object inherits from a tm Corpus and has
several slots containing meta information:
meta
Corpus Meta Data contains corpus specific meta data in the form of tag-value pairs.
dmeta
Document Meta Data of class data.frame contains document specific
meta data for the corpus. This is mainly available to be compatible
with standard tm corpus definitions but is not yet actually used in
the distributed scenario.
keep
A logical indicating whether revisions representing stages, e.g., in a preprocessing chain, should be kept or not.
An object inheriting from DCorpus and Corpus.
Ingo Feinerer and Stefan Theussl
Corpus for basic information on the corpus infrastructure
employed by package tm.
## Similar to example in package 'tm'
reut21578 <- system.file("texts", "crude", package = "tm")
dc <- DistributedCorpus(DirSource(reut21578),
                        readerControl = list(reader = readReut21578XMLasPlain))
dc

## Coercion
data("crude")
as.DistributedCorpus(crude)
as.VCorpus(dc)
Each modification of the documents in the corpus results in a new
stage, i.e., a revision of the corpus. To allow fast switching
between multiple revisions, all modifications may be kept on the file
system. The function setRevision() allows going back to any
stage in the history of the corpus. The function keepRevisions()
shows whether revisions are turned on or off; the corresponding
replacement function is used to set the desired behavior.
getRevisions(corpus)
removeRevision(corpus, revision)
setRevision(corpus, revision)
keepRevisions(corpus)
`keepRevisions<-`(corpus, value)
corpus |
A distributed corpus of class |
revision |
The revision which is to be set as active or removed. |
value |
A logical indicating whether revisions should be kept or not. |
Whereas getRevisions() returns a list of character strings naming all
available revisions, setRevision() returns the distributed
corpus with the given revision marked as active. The function
keepRevisions() returns a logical indicating whether revisions
are used or not.
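The revision mechanism described above can be mimicked with a small base R model. This is a conceptual sketch only: toy_corpus(), toy_map(), toy_set_revision() and toy_active_docs() are hypothetical names, the revision labels and their ordering are arbitrary choices, and nothing here reflects how tm.plugin.dc stores revisions internally.

```r
# Toy model of a revision history (hypothetical helpers, base R only).
toy_corpus <- function(docs) {
  list(revisions = list(rev1 = docs), active = "rev1", keep = TRUE, n = 1L)
}

# Apply FUN to the active stage and record the result as a new revision.
toy_map <- function(corpus, FUN) {
  docs <- lapply(corpus$revisions[[corpus$active]], FUN)
  corpus$n <- corpus$n + 1L
  name <- paste0("rev", corpus$n)
  # When revisions are not kept, earlier stages are discarded.
  if (!corpus$keep) corpus$revisions <- list()
  corpus$revisions[[name]] <- docs
  corpus$active <- name
  corpus
}

# Mark an existing revision as active again.
toy_set_revision <- function(corpus, revision) {
  stopifnot(revision %in% names(corpus$revisions))
  corpus$active <- revision
  corpus
}

toy_active_docs <- function(corpus) corpus$revisions[[corpus$active]]

tc <- toy_corpus(list("Crude Oil", "OPEC Meeting"))
tc <- toy_map(tc, tolower)
names(tc$revisions)                  # "rev1" "rev2"
tc <- toy_set_revision(tc, "rev1")
toy_active_docs(tc)[[1]]             # "Crude Oil"
```

Keeping every stage costs storage but makes going back cheap; setting keep to FALSE in the toy model trades that history for space, which mirrors the trade-off the keepRevisions() switch exposes.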
## provide data on storage
data("crude")
dc <- as.DCorpus(crude)

## do some preprocessing
dc <- tm_map(dc, content_transformer(tolower))

## retrieve available revisions
revs <- getRevisions(dc)
revs

## go back to original revision
setRevision(dc, revs[2])
keepRevisions(dc)
keepRevisions(dc) <- FALSE
Constructs a term-document matrix given a distributed corpus.
## S3 method for class 'DCorpus'
TermDocumentMatrix(x, control = list())
x |
A distributed corpus. |
control |
A named list of control options. The component
|
An object of class TermDocumentMatrix containing a sparse
term-document matrix. The attribute Weighting contains the
weighting applied to the matrix.
The documentation of termFreq gives an extensive list of
possible options.
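For readers unfamiliar with the structure, a term-document matrix has one row per term and one column per document, with cells holding term counts (before any weighting). The base R sketch below builds a small dense one by hand. It is an illustration only: the three sample documents are made up, tokenization is a naive split on non-letters, and the package itself returns a sparse matrix and relies on termFreq for tokenization and control options.

```r
# Made-up sample documents for illustration.
docs <- c(d1 = "oil prices rise",
          d2 = "oil output falls",
          d3 = "prices fall as output rises")

# Naive tokenization: lower-case, then split on anything that is not a letter.
tokens <- strsplit(tolower(docs), "[^a-z]+")
names(tokens) <- names(docs)

# Vocabulary: every distinct term, sorted.
terms <- sort(unique(unlist(tokens)))

# Count each term per document; rows are terms, columns are documents.
tdm <- vapply(tokens,
              function(tok) tabulate(match(tok, terms), nbins = length(terms)),
              integer(length(terms)))
rownames(tdm) <- terms
tdm
```

Note that "rise" and "rises" end up in separate rows here because no stemming is applied.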
data("crude")
tdm <- TermDocumentMatrix(as.DCorpus(crude),
                          list(stopwords = TRUE, weighting = weightTfIdf))
inspect(tdm[149:152, 1:5])
Interface to apply transformation functions to distributed
corpora. See tm_map in tm for more information.
## S3 method for class 'DCorpus'
tm_map(x, FUN, ...)
x |
A distributed corpus of class |
FUN |
a transformation function taking a text document as input and
returning a text document. The function |
... |
arguments to |
A DCorpus with FUN applied to each document in x. If revisions are
enabled, the original documents contained in x can be retrieved by
going back to the corresponding revision using the function
setRevision().
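As a hedged sketch of passing a custom transformation, the following wraps an anonymous function with tm's content_transformer() and applies it through tm_map(). It assumes that tm and tm.plugin.dc are installed and that the default local storage is usable; the name strip_digits is made up for illustration.

```r
library(tm)
library(tm.plugin.dc)

data("crude")
dc <- as.DCorpus(crude)

# Hypothetical helper: drop all digits from the document text.
strip_digits <- content_transformer(function(x) gsub("[[:digit:]]+", "", x))

# Apply the transformation to every document in the distributed corpus.
res <- tm_map(dc, strip_digits)
inherits(res, "DCorpus")
```

If revisions are enabled, the untransformed stage should remain reachable through getRevisions() and setRevision().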
getTransformations for available transformations in package tm.
data("crude")
tm_map(as.DCorpus(crude), content_transformer(tolower))