Title: | 'Rcpp' Bindings for the 'Corpus Workbench' ('CWB') |
---|---|
Description: | 'Rcpp' Bindings for the C code of the 'Corpus Workbench' ('CWB'), an indexing and query engine to efficiently analyze large corpora (<https://cwb.sourceforge.io>). 'RcppCWB' is licensed under the GNU GPL-3, in line with the GPL-3 license of the 'CWB' (<https://www.r-project.org/Licenses/GPL-3>). The 'CWB' relies on 'pcre2' (BSD license, see <http://www.pcre.org/licence.txt>) and 'GLib' (LGPL license, see <https://www.gnu.org/licenses/lgpl-3.0.en.html>). See the file LICENSE.note for further information. The package includes modified code of the 'rcqp' package (GPL-2, see <https://cran.r-project.org/package=rcqp>). The original work of the authors of the 'rcqp' package is acknowledged with great respect, and they are listed as authors of this package. To achieve cross-platform portability (including Windows), using 'Rcpp' for wrapper code is the approach used by 'RcppCWB'. |
Authors: | Andreas Blaette [aut, cre], Bernard Desgraupes [aut], Sylvain Loiseau [aut], Oliver Christ [ctb], Bruno Maximilian Schulze [ctb], Stephanie Evert [ctb], Arne Fitschen [ctb], Jeroen Ooms [ctb], Marius Bertram [ctb], Tomas Kalibera [ctb] |
Maintainer: | Andreas Blaette <[email protected]> |
License: | GPL-3 |
Version: | 0.6.5 |
Built: | 2024-12-23 06:21:43 UTC |
Source: | CRAN |
The RcppCWB package is a wrapper library to expose core functions of
the Open Corpus Workbench (CWB). This includes the low-level
functionality of the Corpus Library (CL) as well as capacities to use
the query syntax of the Corpus Query Processor (CQP).
The Open Corpus Workbench (CWB) is an indexing and querying engine
popular in corpus-assisted research. Its core aim is to support working
efficiently with large, structurally and linguistically annotated corpora.
First, the CWB includes tools to index and compress corpora. Second,
the Corpus Library (CL) offers low-level functionality to retrieve
information from CWB indexed corpora. Third, the Corpus Query
Processor (CQP) offers a syntax that allows users to perform anything from
simple to complex queries, using different annotation layers of corpora.
The CWB is a classical tool which has inspired a set of developments. A persisting advantage of the CWB is its mature, open source code base that is actively maintained by a community of developers. It is used as a robust and efficient backend for widely used tools such as TXM (https://txm.gitpages.huma-num.fr/textometrie/) or CQPweb (https://cwb.sourceforge.io/cqpweb.php). Its uncompromising C implementation guarantees speed and, at the same time, makes it well suited for integration with R.
The package RcppCWB is a follow-up on the rcqp package, which pioneered
exposing CWB functionality from within R. Indeed, the rcqp package,
published on CRAN in 2015, offers robust access to CWB functionality.
However, the "pure C" implementation of the rcqp package makes it
difficult to port the package to Windows. The primary purpose of the
RcppCWB package is to reimplement a wrapper library for the CWB using a
design that makes it easier to achieve cross-platform portability.
Even though RcppCWB functions may be used directly, the package is
designed to serve as an interface to CWB indexed corpora in packages with
higher-level functionality. In this regard, RcppCWB is the backend
of the polmineR package. It is deliberately open to be used in other
contexts. The package may stimulate using linguistically annotated, indexed
and compressed corpora on all platforms. The paradigm of working with text
as linguistic data may benefit from RcppCWB.
When building the package, the first step is to compile the relevant parts
of the CWB on Linux and macOS machines. On Windows, cross-compiled binaries
are downloaded from a GitHub repository of the PolMine Project
(https://github.com/PolMine/libcl). Second, Rcpp wrappers are
compiled and make the relevant functions of the Corpus Library and CQP
accessible. In addition to genuine CWB functions, RcppCWB offers a
set of higher-level functions implemented using Rcpp for common
performance-critical tasks.
To understand the data storage model of the CWB, in particular the notions
of positional and structural attributes (p- and s-attributes), the vignette
of the rcqp package is a very good starting point (see references).
The CWB 'Corpus Encoding Tutorial' explains how to create your own corpus; the 'CQP Query Language Tutorial' introduces the syntax of CQP (see references).
The RcppCWB package includes a sample corpus (REUTERS, the data also
included in the tm package). The examples in the documentation
of the functions may be a good starting point to understand how to use
RcppCWB.
The original paper of Christ (1994) explains the design choices of the CWB. The indexing and compression techniques of the CWB (Huffman coding) are explained in Witten et al. (1999).
The work of all the developers of the CWB is gratefully acknowledged. There
is a particular intellectual debt to Bernard Desgraupes and Sylvain
Loiseau, and the rcqp package they developed as the original R
wrapper to expose the functionality of the CWB.
Andreas Blaette ([email protected])
Christ, O. 1994. "A modular and flexible architecture for an integrated corpus query system", in: Proceedings of COMPLEX '94, pp. 23-32. Budapest. Available online at https://cwb.sourceforge.io/files/Christ1994.pdf
Desgraupes, B.; Loiseau, S. 2012. Introduction to the rcqp package. Vignette of the rcqp package. Available at the CRAN archive at https://cran.r-project.org/src/contrib/Archive/rcqp/
Evert, S. 2005. The CQP Query Language Tutorial. Available online at https://cwb.sourceforge.io/files/CWB_Encoding_Tutorial.pdf
Evert, S. 2005. The IMS Open Corpus Workbench (CWB). Corpus Encoding Tutorial. Available online at https://cwb.sourceforge.io/files/CWB_Encoding_Tutorial.pdf
Open Corpus Workbench (https://cwb.sourceforge.io)
Witten, I.H.; Moffat, A.; Bell, T.C. (1999). Managing Gigabytes. Morgan Kaufmann Publishing, San Francisco, 2nd edition.
# functions of the corpus library (starting with cl) expose the low-level
# access to the CWB corpus library (CL)
ids <- cl_cpos2id("REUTERS", cpos = 1:20, p_attribute = "word", registry = get_tmp_registry())
tokens <- cl_id2str("REUTERS", id = ids, p_attribute = "word", registry = get_tmp_registry())
print(paste(tokens, collapse = " "))
# To use the corpus query processor (CQP) and its syntax, it is necessary first
# to initialize CQP (example: get concordances of 'oil')
cqp_query("REUTERS", query = '[]{5} "oil" []{5}')
cpos_matrix <- cqp_dump_subcorpus("REUTERS")
concordances_oil <- apply(
  cpos_matrix, 1,
  function(row){
    ids <- cl_cpos2id("REUTERS", p_attribute = "word", cpos = row[1]:row[2], get_tmp_registry())
    tokens <- cl_id2str("REUTERS", p_attribute = "word", id = ids, get_tmp_registry())
    paste(tokens, collapse = " ")
  }
)
Rcpp wrappers for CWB Corpus Library functions
attribute_size(corpus, attribute, attribute_type, registry)
cpos2str(corpus, p_attribute, registry, cpos)
cpos2id(corpus, p_attribute, registry, cpos)
struc2cpos(corpus, s_attribute, registry, struc)
id2str(corpus, p_attribute, registry, id)
corpus |
The ID of a CWB corpus. |
attribute |
Either a positional, or a structural attribute. |
attribute_type |
Either "p" (positional attribute) or "s" (structural attribute). |
registry |
Path to the corpus registry. |
p_attribute |
A positional attribute. |
cpos |
An integer vector of corpus positions. |
s_attribute |
A structural attribute. |
struc |
An integer value with struc. |
id |
An |
A set of functions to check whether the input values passed to the Rcpp wrappers for the C functions of the Corpus Workbench are valid, as invalid input may cause crashes. These auxiliary functions are called by the cl_ and cqp_ functions.
check_registry(registry)
check_corpus(corpus, registry, cl = TRUE, cqp = TRUE)
check_s_attribute(
  s_attribute,
  corpus,
  registry = Sys.getenv("CORPUS_REGISTRY")
)
check_p_attribute(
  p_attribute,
  corpus,
  registry = Sys.getenv("CORPUS_REGISTRY")
)
check_strucs(corpus, s_attribute, strucs, registry)
check_region_matrix(region_matrix)
check_query(query)
check_cpos(
  corpus,
  p_attribute = "word",
  cpos,
  registry = Sys.getenv("CORPUS_REGISTRY")
)
check_id(corpus, p_attribute, id, registry = Sys.getenv("CORPUS_REGISTRY"))
registry |
path to registry directory |
corpus |
name of a CWB corpus |
cl |
A |
cqp |
A |
s_attribute |
a structural attribute |
p_attribute |
a positional attribute |
strucs |
strucs (indices of structural attributes) |
region_matrix |
a region matrix |
query |
a CQP query |
cpos |
vector of corpus positions |
id |
id (encoded p-attribute), integer value |
Check Paths in Registry Files
check_pkg_registry_files(pkg = system.file(package = "RcppCWB"), set = FALSE)
pkg |
Full path to package directory |
set |
Logical, whether |
Logical value, whether home directories are set correctly.
Use cl_attribute_size() to get the total number of values of a positional
attribute (param attribute_type = "p"), or structural attribute (param
attribute_type = "s"). Note that indices are zero-based, i.e. the maximum
position of a positional / structural attribute is attribute size minus 1
(see examples).
cl_attribute_size(
  corpus,
  attribute,
  attribute_type,
  registry = Sys.getenv("CORPUS_REGISTRY")
)
corpus |
name of a CWB corpus (upper case) |
attribute |
name of a p- or s-attribute |
attribute_type |
either "p" or "s", for positional/structural attribute |
registry |
path to the registry directory, defaults to the value of the environment variable CORPUS_REGISTRY |
token_no <- cl_attribute_size(
  "REUTERS",
  attribute = "word",
  attribute_type = "p",
  registry = get_tmp_registry()
)
corpus_positions <- seq.int(from = 0, to = token_no - 1)
cl_cpos2id(
  "REUTERS",
  "word",
  cpos = corpus_positions,
  registry = get_tmp_registry()
)
places_no <- cl_attribute_size(
  "REUTERS",
  attribute = "places",
  attribute_type = "s",
  registry = get_tmp_registry()
)
strucs <- seq.int(from = 0, to = places_no - 1)
cl_struc2str(
  "REUTERS",
  "places",
  struc = strucs,
  registry = get_tmp_registry()
)
The encoding of a corpus is declared in the registry file (corpus property
"charset"). Once a corpus is loaded, this information is available without
parsing the registry file again and again. The cl_charset_name() function
offers quick access to this information.
cl_charset_name(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
corpus |
Name of a CWB corpus (upper case). |
registry |
Path to the registry directory, defaults to the value of the environment variable CORPUS_REGISTRY |
cl_charset_name(
  corpus = "REUTERS",
  registry = system.file(package = "RcppCWB", "extdata", "cwb", "registry")
)
Remove a corpus from the list of loaded corpora of the corpus library (CL).
cl_delete_corpus(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
corpus |
name of a CWB corpus (upper case) |
registry |
path to the registry directory, defaults to the value of the environment variable CORPUS_REGISTRY |
The corpus library (CL) internally maintains a list of corpora including
information on positional and structural attributes so that the registry file
need not be parsed again and again. However, when an attribute has been
added to the corpus, it will not yet be visible, because it is not part of
the data that has been loaded. The cl_delete_corpus() function exposes a
CL function of the same name. Deleting the corpus forces it to be reloaded,
which will include parsing the updated registry file.
An integer value 1 is returned invisibly if a previously loaded
corpus has been deleted, or 0 if the corpus has not been loaded and has not
been deleted.
cl_attribute_size("UNGA", attribute = "word", attribute_type = "p")
corpus_is_loaded("UNGA")
cl_delete_corpus("UNGA")
corpus_is_loaded("UNGA")
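A sketch of the delete-and-reload cycle using the REUTERS sample corpus, assuming that the temporary registry returned by get_tmp_registry() (used in other examples of this documentation) is available. The variable names are illustrative only.

```r
library(RcppCWB)

registry <- get_tmp_registry()

# make sure the corpus is on the CL's list of loaded corpora
cl_load_corpus("REUTERS", registry = registry)

# remove the corpus from the list; the value is returned invisibly,
# so it is assigned here to make it available for inspection
deleted <- cl_delete_corpus("REUTERS", registry = registry)

# loading the corpus again includes parsing the registry file anew
reloaded <- cl_load_corpus("REUTERS", registry = registry)
loaded_again <- corpus_is_loaded("REUTERS", registry = registry)
```

After adding an attribute to a corpus, this cycle is one way to make the new attribute visible without restarting the R session.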
Load corpus.
cl_find_corpus(corpus, registry)
corpus |
name of a CWB corpus (upper case) |
registry |
path to the registry directory, defaults to the value of the environment variable CORPUS_REGISTRY |
An externalptr referencing the C representation of the corpus.
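A minimal sketch of obtaining such a pointer for the REUTERS sample corpus, assuming the temporary registry returned by get_tmp_registry() is available. The registry is passed explicitly because the usage above shows no default value.

```r
library(RcppCWB)

# look up the corpus and keep the pointer to its C representation
corpus_ptr <- cl_find_corpus("REUTERS", registry = get_tmp_registry())

# inspect the type of the returned object
ptr_type <- typeof(corpus_ptr)
print(ptr_type)
```

The pointer is only meaningful within the running R session and is not suited for being saved and restored.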
Get the total number of unique tokens/ids of a positional attribute. Note
that token ids are zero-based, i.e. when iterating through tokens, start at
0, the maximum will be cl_lexicon_size() minus 1.
cl_lexicon_size(corpus, p_attribute, registry = Sys.getenv("CORPUS_REGISTRY"))
corpus |
name of a CWB corpus (upper case) |
p_attribute |
name of positional attribute |
registry |
path to the registry directory, defaults to the value of the environment variable CORPUS_REGISTRY |
lexicon_size <- cl_lexicon_size(
  "REUTERS",
  p_attribute = "word",
  registry = get_tmp_registry()
)
token_ids <- seq.int(from = 0, to = lexicon_size - 1)
cl_id2str(
  "REUTERS",
  p_attribute = "word",
  id = token_ids,
  registry = get_tmp_registry()
)
Show CL corpora
cl_list_corpora()
A character vector.
cl_list_corpora()
Load corpus
cl_load_corpus(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
corpus |
A length-one |
registry |
A length-one |
TRUE if corpus could be loaded and FALSE if not.
cl_load_corpus("REUTERS")
Wrappers for CWB Corpus Library functions suited for writing performance-critical code.
s_attr(corpus, s_attribute, registry)
p_attr(corpus, p_attribute, registry)
p_attr_size(p_attr)
s_attr_size(s_attr)
p_attr_lexicon_size(p_attr)
cpos_to_struc(s_attr, cpos)
cpos_to_str(p_attr, cpos)
cpos_to_id(p_attr, cpos)
struc_to_cpos(s_attr, struc)
struc_to_str(s_attr, struc)
regex_to_id(p_attr, regex)
str_to_id(p_attr, str)
id_to_freq(p_attr, id)
id_to_cpos(p_attr, id)
cpos_to_lbound(s_attr, cpos)
cpos_to_rbound(s_attr, cpos)
corpus |
ID of a CWB corpus (length-one |
s_attribute |
A structural attribute (length-one |
registry |
Registry directory. |
p_attribute |
A positional attribute (length-one |
p_attr |
A |
s_attr |
A |
cpos |
An |
struc |
A length-one |
regex |
A regular expression. |
str |
A |
id |
An |
The default cl_* R wrappers for the functions of the CWB Corpus Library
involve a lookup of a corpus and its p- or s-attributes (using the corpus ID,
registry and attribute indicated by length-one character vectors) every time
one of these functions is called. It is more efficient to look up an
attribute only once. This set of functions passes "externalptr" objects to
reference attributes that have been looked up. A relevant scenario is writing
functions with a C++ implementation that are compiled and linked using
Rcpp::cppFunction() or Rcpp::sourceCpp().
library(Rcpp)
cppFunction(
  'Rcpp::StringVector get_str(
     SEXP corpus,
     SEXP p_attribute,
     SEXP registry,
     Rcpp::IntegerVector cpos
   ){
     SEXP attr;
     Rcpp::StringVector result;
     attr = RcppCWB::p_attr(corpus, p_attribute, registry);
     result = RcppCWB::cpos_to_str(attr, cpos);
     return(result);
   }',
  depends = "RcppCWB"
)
result <- get_str("REUTERS", "word", RcppCWB::get_tmp_registry(), 0:50)
Structural attributes do not necessarily have values. Some structural attributes (such as annotations of sentences or paragraphs) may just define regions of corpus positions. Use this function to test whether an attribute has values.
cl_struc_values(corpus, s_attribute, registry = Sys.getenv("CORPUS_REGISTRY"))
corpus |
Corpus ID, a length-one |
s_attribute |
Structural attribute to check, a length-one |
registry |
The registry directory of the corpus. |
TRUE if the attribute has values and FALSE if not. NA if the structural
attribute is not available.
cl_struc_values("REUTERS", "places") # TRUE - attribute has values
cl_struc_values("REUTERS", "date") # NA - attribute does not exist
CWB indexed corpora store the text of a corpus as numbers: Every token in the token stream of the corpus is identified by a unique corpus position. The string value of every token is identified by a unique integer id. The corpus library (CL) offers a set of functions to make the transitions between corpus positions, token ids, and the character string of tokens.
cl_cpos2str(
  corpus,
  p_attribute,
  registry = Sys.getenv("CORPUS_REGISTRY"),
  cpos
)
cl_cpos2id(corpus, p_attribute, registry = Sys.getenv("CORPUS_REGISTRY"), cpos)
cl_id2str(corpus, p_attribute, registry = Sys.getenv("CORPUS_REGISTRY"), id)
cl_regex2id(
  corpus,
  p_attribute,
  regex,
  registry = Sys.getenv("CORPUS_REGISTRY")
)
cl_str2id(corpus, p_attribute, str, registry = Sys.getenv("CORPUS_REGISTRY"))
cl_id2freq(corpus, p_attribute, id, registry = Sys.getenv("CORPUS_REGISTRY"))
cl_id2cpos(corpus, p_attribute, id, registry = Sys.getenv("CORPUS_REGISTRY"))
corpus |
name of a CWB corpus (upper case) |
p_attribute |
a p-attribute (positional attribute) |
registry |
path to the registry directory, defaults to the value of the environment variable CORPUS_REGISTRY |
cpos |
corpus positions (integer vector) |
id |
id of a token |
regex |
a regular expression |
str |
a character string |
# cpos_total will be needed in the examples
cpos_total <- cl_attribute_size(
  corpus = "REUTERS", attribute = "word",
  attribute_type = "p", registry = get_tmp_registry()
)
# decode the token stream of the corpus (the quick way)
token_stream_str <- cl_cpos2str(
  corpus = "REUTERS", p_attribute = "word",
  cpos = seq.int(from = 0, to = cpos_total - 1),
  registry = get_tmp_registry()
)
# decode the token stream (cpos2id first, then id2str)
token_stream_ids <- cl_cpos2id(
  corpus = "REUTERS", p_attribute = "word",
  cpos = seq.int(from = 0, to = cpos_total - 1),
  registry = get_tmp_registry()
)
token_stream_str <- cl_id2str(
  corpus = "REUTERS", p_attribute = "word",
  id = token_stream_ids, registry = get_tmp_registry()
)
# get corpus positions of a token
token_to_get <- "oil"
id_oil <- cl_str2id(
  corpus = "REUTERS", p_attribute = "word",
  str = token_to_get, registry = get_tmp_registry()
)
# assign the result only (the original chained assignment also
# overwrote the name cl_id2cpos in the workspace)
cpos_oil <- cl_id2cpos(
  corpus = "REUTERS", p_attribute = "word",
  id = id_oil, registry = get_tmp_registry()
)
# get frequency of token
oil_freq <- cl_id2freq(
  corpus = "REUTERS", p_attribute = "word",
  id = id_oil, registry = get_tmp_registry()
)
length(cpos_oil) # needs to be the same as oil_freq
# use regular expressions
ids <- cl_regex2id(
  corpus = "REUTERS", p_attribute = "word",
  regex = "M.*", registry = get_tmp_registry()
)
m_words <- cl_id2str(
  corpus = "REUTERS", p_attribute = "word",
  id = ids, registry = get_tmp_registry()
)
Structural attributes store the metadata of texts in a CWB corpus and/or any kind of annotation of a region of text. The fundamental units are so-called strucs, i.e. indices of regions identified by a left and a right corpus position. The corpus library (CL) offers a set of functions to translate between corpus positions (cpos) and strucs (struc).
cl_cpos2struc(
  corpus,
  s_attribute,
  cpos,
  registry = Sys.getenv("CORPUS_REGISTRY")
)
cl_struc2cpos(
  corpus,
  s_attribute,
  registry = Sys.getenv("CORPUS_REGISTRY"),
  struc
)
cl_struc2str(
  corpus,
  s_attribute,
  struc,
  registry = Sys.getenv("CORPUS_REGISTRY")
)
cl_cpos2lbound(corpus, s_attribute, cpos, registry = NULL)
cl_cpos2rbound(corpus, s_attribute, cpos, registry = NULL)
corpus |
name of a CWB corpus (upper case) |
s_attribute |
name of structural attribute (character vector) |
cpos |
An |
registry |
path to the registry directory, defaults to the value of the environment variable CORPUS_REGISTRY |
struc |
a struc identifying a region |
cl_cpos2rbound() and cl_cpos2lbound() return NA for values of
cpos that are outside a struc for the structural attribute given.
# get metadata for matches of token
# scenario: id of the texts with occurrence of 'oil'
token_to_get <- "oil"
token_id <- cl_str2id("REUTERS", p_attribute = "word", str = token_to_get, registry = get_tmp_registry())
token_cpos <- cl_id2cpos("REUTERS", p_attribute = "word", id = token_id, registry = get_tmp_registry())
strucs <- cl_cpos2struc("REUTERS", s_attribute = "id", cpos = token_cpos, registry = get_tmp_registry())
strucs_unique <- unique(strucs)
text_ids <- cl_struc2str("REUTERS", s_attribute = "id", struc = strucs_unique, registry = get_tmp_registry())
# get the full text of the first text with match for 'oil'
left_cpos <- cl_cpos2lbound(
  "REUTERS", s_attribute = "id", cpos = min(token_cpos),
  registry = get_tmp_registry()
)
right_cpos <- cl_cpos2rbound(
  "REUTERS", s_attribute = "id", cpos = min(token_cpos),
  registry = get_tmp_registry()
)
txt <- cl_cpos2str(
  "REUTERS", p_attribute = "word", cpos = left_cpos:right_cpos,
  registry = get_tmp_registry()
)
fulltext <- paste(txt, collapse = " ")
# alternative approach to achieve the same result
first_struc_match_oil <- cl_cpos2struc(
  "REUTERS", s_attribute = "id", cpos = min(token_cpos),
  registry = get_tmp_registry()
)
cpos_struc <- cl_struc2cpos(
  "REUTERS", s_attribute = "id", struc = first_struc_match_oil,
  registry = get_tmp_registry()
)
txt <- cl_cpos2str(
  "REUTERS", p_attribute = "word", cpos = cpos_struc[1]:cpos_struc[2],
  registry = get_tmp_registry()
)
fulltext <- paste(txt, collapse = " ")
Extract information from the internal C representation of registry data.
corpus_data_dir(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
corpus_info_file(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
corpus_full_name(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
corpus_p_attributes(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
corpus_s_attributes(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
corpus_properties(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
corpus_property(corpus, registry = Sys.getenv("CORPUS_REGISTRY"), property)
corpus_registry_dir(corpus)
corpus |
A length-one |
registry |
A length-one |
property |
A corpus property defined in the registry file (. |
corpus_data_dir() will return the data directory (class fs_path)
where the binary files of a corpus are kept (a directory also known as
'home' directory).
corpus_info_file() will return the path to the info file for a
corpus (class fs_path object). If the info file does not exist or the INFO
line is missing in the registry file, NA is returned.
corpus_full_name() will return the full name of the corpus defined
in the registry file.
corpus_p_attributes() returns a character vector with the
positional attributes of a corpus.
corpus_s_attributes() returns a character vector with the
structural attributes of a corpus.
corpus_properties() returns a character vector with the corpus
properties defined in the registry file. If the corpus cannot be located,
NA is returned.
corpus_property() returns the value of a corpus property defined
in the registry file, or NA if the corpus does not exist, is not loaded,
or if the property requested is undefined.
corpus_registry_dir() will extract the registry directory with the
registry file defining a corpus from the internal C representation of
loaded corpora. The character vector that is returned may have a length > 1
if there are several corpora with the same id defined in registry files in
different (registry) directories. If the corpus is not found, NA is returned.
corpus_data_dir("REUTERS", registry = get_tmp_registry())
corpus_info_file("REUTERS", registry = get_tmp_registry())
corpus_full_name("REUTERS", registry = get_tmp_registry())
corpus_p_attributes("REUTERS", registry = get_tmp_registry())
corpus_s_attributes("REUTERS", registry = get_tmp_registry())
corpus_properties("REUTERS", registry = get_tmp_registry())
corpus_property(
  "REUTERS",
  registry = get_tmp_registry(),
  property = "language"
)
corpus_registry_dir("REUTERS")
corpus_registry_dir("FOO") # NA returned
Check whether corpus is loaded
corpus_is_loaded(corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
corpus |
A length-one |
registry |
A length-one |
TRUE if corpus is loaded and FALSE if not.
CQP needs to know where to look for CWB indexed corpora. To initialize CQP, call cqp_initialize. To reset the registry, use the function cqp_reset_registry. To get the registry used by CQP, use cqp_get_registry. To get the initialization status, use cqp_is_initialized.
cqp_initialize(registry = Sys.getenv("CORPUS_REGISTRY"))
cqp_is_initialized()
cqp_verbosity(silent, verbose)
cqp_get_registry()
cqp_reset_registry(registry = Sys.getenv("CORPUS_REGISTRY"))
cqp_load_corpus(corpus, registry)
registry |
the registry directory |
silent |
A single |
verbose |
A single |
corpus |
ID of a CWB corpus (length-one |
cqp_load_corpus will return a logical value: TRUE if corpus has been loaded successfully, FALSE if not.
Andreas Blaette, Bernard Desgraupes, Sylvain Loiseau
cqp_is_initialized() # check initialization status
if (!cqp_is_initialized()) cqp_initialize()
cqp_is_initialized() # check initialization status (TRUE now?)
cqp_get_registry() # get registry dir used by CQP
cqp_list_corpora() # get list of corpora
List the corpora described by the registry files in the registry directory that is currently set.
cqp_list_corpora()
Andreas Blaette, Bernard Desgraupes, Sylvain Loiseau
cqp_list_corpora()
Using CQP queries requires a two-step procedure: First, you execute a query using cqp_query. Then, cqp_dump_subcorpus will return a matrix with the regions of the matches for the query.
cqp_query(corpus, query, subcorpus = "QUERY")
cqp_dump_subcorpus(corpus, subcorpus = "QUERY")
cqp_subcorpus_size(corpus, subcorpus = "QUERY")
cqp_list_subcorpora(corpus)
cqp_drop_subcorpus(corpus)
corpus |
a CWB corpus |
query |
a CQP query |
subcorpus |
subcorpus name |
The cqp_query function executes a CQP query. The cqp_subcorpus_size function returns the number of matches for the CQP query. The cqp_dump_subcorpus function will return a two-column matrix with the left and right corpus positions of the matches for the CQP query.
Andreas Blaette, Bernard Desgraupes, Sylvain Loiseau
Evert, S. 2005. The CQP Query Language Tutorial. Available online at https://cwb.sourceforge.io/files/CWB_Encoding_Tutorial.pdf
cqp_query(corpus = "REUTERS", query = '"oil";')
cqp_subcorpus_size("REUTERS")
cqp_dump_subcorpus("REUTERS")
cqp_query(corpus = "REUTERS", query = '"crude" "oil";')
cqp_subcorpus_size("REUTERS", subcorpus = "QUERY")
cqp_dump_subcorpus("REUTERS")
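The two-column matrix returned by cqp_dump_subcorpus can be processed further with base R. The following is a minimal sketch that does not call RcppCWB: the matrix `dump` is a made-up stand-in for a query result (not actual output for REUTERS), and it assumes that both the left and the right corpus position are inclusive.

```r
# Hypothetical stand-in for a cqp_dump_subcorpus() result: one row per match,
# column 1 = left corpus position, column 2 = right corpus position.
dump <- matrix(c(10L, 11L, 25L, 26L, 40L, 40L), ncol = 2, byrow = TRUE)

# The number of matches is the number of rows.
n_matches <- nrow(dump)

# Length of each match in tokens (assumes inclusive positions on both ends).
match_length <- dump[, 2] - dump[, 1] + 1L

n_matches
match_length
```

A single-token match has identical left and right positions, which is why one is added when computing the length.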
The function returns a character vector with character sets (charsets) supported by the Corpus Workbench (CWB). The vector is derived from the CorpusCharset object defined in the header file of the corpus library (CL).
cwb_charsets()
Early versions of the CWB were developed for "latin1"; "utf8" support was introduced with CWB v3.2. Note that RcppCWB is tested only for "latin1" and "utf8", and that R uses "UTF-8" rather than "utf8" (CWB) by convention.
cwb_charsets()
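Because the CWB writes "utf8" where R expects "UTF-8", a small translation step can help when passing encodings between the two. The sketch below is an illustration only: `cwb_charset_to_r()` is a hypothetical helper, not part of RcppCWB, and it covers only the two charsets the note above describes as tested.

```r
# Hypothetical helper (not part of RcppCWB): translate a CWB charset name
# into the encoding name that R functions such as iconv() expect.
cwb_charset_to_r <- function(charset) {
  lookup <- c(latin1 = "latin1", utf8 = "UTF-8")
  if (!charset %in% names(lookup)) {
    stop("charset not covered by this sketch: ", charset)
  }
  lookup[[charset]]
}

# Convert a UTF-8 string to the encoding of a (hypothetical) latin1 corpus
x <- "caf\u00e9"
y <- iconv(x, from = "UTF-8", to = cwb_charset_to_r("latin1"))
```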
Wrappers for the CWB tools cwb-makeall, cwb-huffcode and cwb-compress-rdx. Unlike the 'original' command line tools, these wrappers will always perform a specific indexing/compression step on one positional attribute, and produce all components.
cwb_encode(
  corpus,
  registry = Sys.getenv("CORPUS_REGISTRY"),
  data_dir,
  vrt_dir,
  encoding = "utf8",
  p_attributes = c("word", "pos", "lemma"),
  s_attributes = list(),
  skip_blank_lines = TRUE,
  strip_whitespace = TRUE,
  xml = TRUE,
  quietly = FALSE,
  verbose = FALSE
)
cwb_makeall(
  corpus,
  p_attribute,
  registry = Sys.getenv("CORPUS_REGISTRY"),
  quietly = FALSE,
  logfile
)
cwb_huffcode(
  corpus,
  p_attribute,
  registry = Sys.getenv("CORPUS_REGISTRY"),
  quietly = FALSE,
  logfile,
  delete = TRUE
)
cwb_compress_rdx(
  corpus,
  p_attribute,
  registry = Sys.getenv("CORPUS_REGISTRY"),
  quietly = FALSE,
  logfile,
  delete = TRUE
)
corpus |
Name of a CWB corpus (upper case). |
registry |
Path to the registry directory, defaults to the value of the environment variable CORPUS_REGISTRY. |
data_dir |
The data directory where |
vrt_dir |
Directory with input corpus files (verticalised format / file
ending *.vrt). Tilde expansion is performed on |
encoding |
The encoding of the files to be encoded. Needs to be an
encoding supported by CWB, see |
p_attributes |
Positional attributes (p-attributes) to be declared. |
s_attributes |
A |
skip_blank_lines |
A |
strip_whitespace |
A |
xml |
A |
quietly |
A |
verbose |
A |
p_attribute |
Name of p-attribute. |
logfile |
Redirect messages of |
delete |
A |
Running cwb_huffcode() and cwb_compress_rdx() is optional. Corpora can be fully used without compression. Compression is recommended when reducing the size of corpus data has relevant benefits, e.g. for sharing data. On Windows, compression is not stable and not recommended. A respective warning is issued when running cwb_huffcode() and cwb_compress_rdx() on Windows.
data_dir <- file.path(tempdir(), "bt_data_dir")
dir.create(data_dir)

cwb_encode(
  corpus = "BTMIN",
  registry = Sys.getenv("CORPUS_REGISTRY"),
  vrt_dir = system.file(package = "RcppCWB", "extdata", "vrt"),
  data_dir = data_dir,
  p_attributes = c("word", "pos", "lemma"),
  s_attributes = list(
    plenary_protocol = c(
      "lp", "protocol_no", "date", "year", "birthday", "version",
      "url", "filetype"
    ),
    speaker = c(
      "id", "type", "lp", "protocol_no", "date", "year", "ai_no", "ai_id",
      "ai_type", "who", "name", "parliamentary_group", "party", "role"
    ),
    p = character()
  )
)

# clean up (recursive = TRUE is needed to remove a directory)
unlink(data_dir, recursive = TRUE)
unlink(file.path(Sys.getenv("CORPUS_REGISTRY"), "btmin"))

# The package includes an 'unfinished' corpus of debates in the UN General
# Assembly ("UNGA"), i.e. it does not yet include the reverse index, and it
# is not compressed.
#
# The first step in the following example is to copy the raw
# corpus to a temporary place.

home_dir <- system.file(
  package = "RcppCWB",
  "extdata", "cwb", "indexed_corpora", "unga"
)

tmp_data_dir <- file.path(tempdir(), "indexed_corpora")
tmp_unga_dir <- file.path(tmp_data_dir, "unga2")
if (!file.exists(tmp_data_dir)) dir.create(tmp_data_dir)
if (!file.exists(tmp_unga_dir)){
  dir.create(tmp_unga_dir)
} else {
  file.remove(list.files(tmp_unga_dir, full.names = TRUE))
}

regfile <- readLines(
  system.file(package = "RcppCWB", "extdata", "cwb", "registry", "unga")
)
regfile[grep("^HOME", regfile)] <- sprintf('HOME "%s"', tmp_unga_dir)
regfile[grep("^ID", regfile)] <- "ID unga2"
writeLines(text = regfile, con = file.path(get_tmp_registry(), "unga2"))
for (x in list.files(home_dir, full.names = TRUE)){
  file.copy(from = x, to = tmp_unga_dir)
}

# perform cwb_makeall (equivalent to cwb-makeall command line utility)
cwb_makeall(
  corpus = "UNGA2",
  p_attribute = "word",
  registry = get_tmp_registry()
)
cl_load_corpus("UNGA2", registry = get_tmp_registry())
cqp_load_corpus("UNGA2", registry = get_tmp_registry())

# see whether it works
ids_sentence_1 <- cl_cpos2id(
  corpus = "UNGA2",
  p_attribute = "word",
  registry = get_tmp_registry(),
  cpos = 0:83
)
tokens_sentence_1 <- cl_id2str(
  corpus = "UNGA2",
  p_attribute = "word",
  registry = get_tmp_registry(),
  id = ids_sentence_1
)
sentence <- gsub(
  "\\s+([\\.,])", "\\1",
  paste(tokens_sentence_1, collapse = " ")
)

# perform cwb_huffcode (equivalent to cwb-huffcode command line utility)
cwb_huffcode(
  corpus = "UNGA2",
  p_attribute = "word",
  registry = get_tmp_registry()
)

# perform cwb_compress_rdx (equivalent to cwb-compress-rdx command line utility)
cwb_compress_rdx(
  corpus = "UNGA2",
  p_attribute = "word",
  registry = get_tmp_registry()
)
Get the CWB version used and available when compiling the source code.
cwb_version()
A numeric_version object.
cwb_version()
Get matrix with moving windows. Negative integer values indicate absence of a token at the respective position.
get_cbow_matrix(
  corpus,
  p_attribute,
  registry = Sys.getenv("CORPUS_REGISTRY"),
  matrix,
  window
)
corpus |
a CWB corpus |
p_attribute |
a positional attribute |
registry |
the registry directory |
matrix |
a matrix |
window |
window size |
m <- get_region_matrix(
  corpus = "REUTERS",
  s_attribute = "places",
  strucs = 0L:5L,
  registry = get_tmp_registry()
)
windowsize <- 3L
m2 <- get_cbow_matrix(
  corpus = "REUTERS",
  p_attribute = "word",
  registry = get_tmp_registry(),
  matrix = m,
  window = windowsize
)
colnames(m2) <- c(-windowsize:-1, "node", 1:windowsize)
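To illustrate what a matrix with moving windows looks like, here is a base R sketch that does not call RcppCWB. The function `window_matrix_sketch()` is hypothetical, handles the token ids of a single region only, and assumes that -1 marks a missing token; the description above only states that the marker is a negative integer.

```r
# Sketch: context windows for the token ids of ONE region.
# Assumption: -1 is used as the "no token at this position" marker.
window_matrix_sketch <- function(ids, window) {
  n <- length(ids)
  offsets <- -window:window
  # For each token (row), the index of every window slot (column)
  idx <- outer(seq_len(n), offsets, `+`)
  out <- matrix(-1L, nrow = n, ncol = length(offsets))
  # Only slots that fall inside the region receive a token id
  inside <- idx >= 1L & idx <= n
  out[inside] <- ids[idx[inside]]
  colnames(out) <- c(-window:-1, "node", 1:window)
  out
}

m <- window_matrix_sketch(ids = c(7L, 3L, 9L), window = 1L)
m
```

Each row holds one node token with its left and right neighbours; the first and last rows contain the negative marker where the window reaches beyond the region.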
The return value is an integer vector. The length of the vector is the number of unique tokens in the corpus, i.e. the number of unique ids. The counts are ordered by id: the first element is the count for id 0, the second for id 1, and so on.
get_count_vector(corpus, p_attribute, registry = Sys.getenv("CORPUS_REGISTRY"))
corpus |
a CWB corpus |
p_attribute |
a positional attribute |
registry |
registry directory |
an integer vector
y <- get_count_vector(
  corpus = "REUTERS",
  p_attribute = "word",
  registry = get_tmp_registry()
)
df <- data.frame(token_id = 0:(length(y) - 1), count = y)
df[["token"]] <- cl_id2str(
  "REUTERS",
  p_attribute = "word",
  id = df[["token_id"]],
  registry = get_tmp_registry()
)
df <- df[,c("token", "token_id", "count")] # reorder columns
df <- df[order(df[["count"]], decreasing = TRUE),]
head(df)
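As the example above shows, token ids start at 0 while R vectors are indexed from 1. A short base R sketch of the lookup, using a made-up count vector and a hypothetical helper `count_for_id()` that is not part of RcppCWB:

```r
# Hypothetical count vector for a lexicon of four types (ids 0 to 3)
counts <- c(120L, 45L, 7L, 1L)

# Hypothetical helper: look up the counts for zero-based token ids
count_for_id <- function(counts, id) {
  stopifnot(all(id >= 0L), all(id < length(counts)))
  counts[id + 1L] # shift by one because R vectors are one-based
}

count_for_id(counts, c(0L, 2L))
```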
Get Registry Directory Within Package
get_pkg_registry(pkgname = "RcppCWB")
pkgname |
Name of package (character vector) |
The return value is an integer matrix with the left and right corpus positions of the strucs in columns one and two, respectively. For negative struc values in the input vector, the matrix reports NA values.
get_region_matrix(
  corpus,
  s_attribute,
  strucs,
  registry = Sys.getenv("CORPUS_REGISTRY")
)
corpus |
A CWB corpus (length-one |
s_attribute |
A structural attribute (length-one |
strucs |
Integer vector with strucs. |
registry |
Registry directory with registry file. |
A matrix with integer values indicating left and right corpus positions (columns 1 and 2, respectively).
y <- get_region_matrix(
  corpus = "REUTERS",
  s_attribute = "id",
  strucs = 0L:5L,
  registry = get_tmp_registry()
)
The return value is a two-column integer matrix. Column one represents the unique ids of the input vector, column two the respective number of occurrences / counts.
ids_to_count_matrix(ids)
ids |
a vector of ids (integer values) |
ids <- c(1L, 5L, 5L, 7L, 7L, 7L, 7L)
ids_to_count_matrix(ids)
table(ids) # alternative to get a similar result
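For comparison, the same kind of two-column result can be sketched in base R. `count_matrix_sketch()` is a hypothetical function, not the package's implementation, and sorting the unique ids in ascending order is an assumption of this sketch.

```r
# Base R sketch of a two-column count matrix:
# column 1 = unique ids, column 2 = number of occurrences.
count_matrix_sketch <- function(ids) {
  unique_ids <- sort(unique(ids)) # ascending order is an assumption
  # match() maps each id to its slot; tabulate() counts occurrences per slot
  n <- tabulate(match(ids, unique_ids), nbins = length(unique_ids))
  cbind(unique_ids, n)
}

cm <- count_matrix_sketch(c(1L, 5L, 5L, 7L, 7L, 7L, 7L))
cm
```

Unlike table(), which returns a named table object, this keeps ids and counts as integer columns of a matrix.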
Create CWB subcorpus from matrix with regions.
matrix_to_subcorpus(region_matrix, corpus, subcorpus)
region_matrix |
A two-column |
corpus |
A |
subcorpus |
A length-one |
## Not run: 
# First we generate a subcorpus from a query result
oil_context <- cqp_query(
  "REUTERS",
  subcorpus = "OIL",
  query = '[]{3}"oil" []{3}'
)
m <- subcorpus_get_ranges(oil_context)

reuters <- cl_find_corpus("REUTERS", registry = get_tmp_registry())
p <- matrix_to_subcorpus(
  subcorpus = "OIL2",
  corpus = reuters,
  region_matrix = m
)
cqp_list_subcorpora("REUTERS")
x <- cqp_query("REUTERS:OIL2", query = '"crude";', subcorpus = "CRUDEOIL")
subcorpus_get_ranges(x)

# clean up
cqp_drop_subcorpus("REUTERS:OIL")
cqp_drop_subcorpus("REUTERS:OIL2")
cqp_drop_subcorpus("REUTERS:CRUDEOIL")

## End(Not run)
Usually the default p-attribute will be "word". Use this function to avoid a hard-coded solution. Extracts the default attribute defined in the CWB source code.
p_attr_default()
A length-one character vector.
Get IDs and Counts for Region Matrices.
region_matrix_to_ids(
  corpus,
  p_attribute,
  registry = Sys.getenv("CORPUS_REGISTRY"),
  matrix
)
region_matrix_to_count_matrix(
  corpus,
  p_attribute,
  registry = Sys.getenv("CORPUS_REGISTRY"),
  matrix
)
region_matrix_context(
  corpus,
  registry = Sys.getenv("CORPUS_REGISTRY"),
  matrix,
  p_attribute,
  s_attribute,
  boundary,
  left,
  right
)
ranges_to_cpos(ranges)
corpus |
a CWB corpus |
p_attribute |
a positional attribute |
registry |
registry directory |
matrix |
a regions matrix |
s_attribute |
If not |
boundary |
Structural attribute (length-one |
left |
An |
right |
An |
ranges |
A two-column integer |
ranges_to_cpos() will turn a matrix of ranges into an integer vector with the individual corpus positions covered by the ranges.
# Scenario 1: Get full text for a subcorpus defined by regions
m <- get_region_matrix(
  corpus = "REUTERS",
  s_attribute = "places",
  strucs = 4L:5L,
  registry = get_tmp_registry()
)
ids <- region_matrix_to_ids(
  corpus = "REUTERS",
  p_attribute = "word",
  registry = get_tmp_registry(),
  matrix = m
)
tokenstream <- cl_id2str(
  corpus = "REUTERS",
  p_attribute = "word",
  registry = get_tmp_registry(),
  id = ids
)
txt <- paste(tokenstream, collapse = " ")
txt

# Scenario 2: Get data.frame with counts for region matrix
y <- region_matrix_to_count_matrix(
  corpus = "REUTERS",
  p_attribute = "word",
  registry = get_tmp_registry(),
  matrix = m
)
df <- as.data.frame(y)
colnames(df) <- c("token_id", "count")
df[["token"]] <- cl_id2str(
  "REUTERS",
  p_attribute = "word",
  registry = get_tmp_registry(),
  id = df[["token_id"]]
)
# assign the sorted result so that head() shows the most frequent tokens
df <- df[order(df[["count"]], decreasing = TRUE),]
head(df)
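The expansion that ranges_to_cpos() is described as performing can be sketched in base R as follows. `ranges_to_cpos_sketch()` is a hypothetical stand-in written for illustration, not the package's implementation, and the input matrix is made up.

```r
# Sketch: expand a two-column matrix of ranges into individual positions.
ranges_to_cpos_sketch <- function(ranges) {
  stopifnot(is.matrix(ranges), ncol(ranges) == 2L)
  # seq.int() expands each (left, right) pair; unlist() concatenates them
  unlist(Map(seq.int, ranges[, 1], ranges[, 2]))
}

ranges <- matrix(c(0L, 2L, 10L, 11L), ncol = 2, byrow = TRUE)
cpos <- ranges_to_cpos_sketch(ranges)
cpos
```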
Look up the minimum and maximum struc of a s-attribute within a region, including the scenario of nested s-attributes. If there are no regions of the s-attribute within the region, NA values are returned.
region_matrix_to_struc_matrix(
  corpus,
  s_attribute,
  region_matrix,
  registry = NULL
)
region_to_strucs(corpus, s_attribute, region, registry = NULL)
corpus |
ID of a CWB corpus. |
s_attribute |
Name of structural attribute. The attribute may be nested. |
region_matrix |
A two-column |
registry |
Path of the registry directory. If |
region |
Vector with left and right corpus position of region. |
Depending on whether the input is a vector (argument region) or a matrix (argument region_matrix), the return value is a vector or a matrix.
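The lookup of a minimum and maximum struc can be illustrated with a base R sketch. Everything here is an assumption made for illustration: `region_to_strucs_sketch()` is hypothetical, the regions of the s-attribute are passed in as a matrix (row i describing struc i - 1), and a region counts as "within" only if it lies fully inside the query region.

```r
# Sketch: minimum and maximum struc of an s-attribute within a query region.
# `regions` holds all regions of the s-attribute, one row per struc.
region_to_strucs_sketch <- function(regions, region) {
  # Assumption: only regions lying fully inside the query region count
  hits <- which(regions[, 1] >= region[1] & regions[, 2] <= region[2])
  if (length(hits) == 0L) return(c(NA_integer_, NA_integer_))
  # strucs are zero-based, row indices are one-based
  range(hits) - 1L
}

regions <- matrix(c(0L, 4L, 5L, 9L, 10L, 14L), ncol = 2, byrow = TRUE)
found <- region_to_strucs_sketch(regions, region = c(5L, 14L))
none <- region_to_strucs_sketch(regions, region = c(1L, 3L))
```

The second call covers no complete region, so both values are NA, matching the behaviour described above for regions without matching strucs.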
The data format of the Corpus Workbench (CWB) allows nested XML as import data. Auxiliary functions assist detecting whether two structural attributes are nested or at the same level (i.e. defining the same regions).
s_attr_is_descendent(
  x,
  y,
  corpus,
  registry = Sys.getenv("CORPUS_REGISTRY"),
  sample = NULL
)
s_attr_is_sibling(x, y, corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
s_attr_relationship(x, y, corpus, registry = Sys.getenv("CORPUS_REGISTRY"))
x |
A structural attribute, stated as length-one |
y |
Another structural attribute, stated as length-one |
corpus |
A corpus ID (length-one |
registry |
The directory with the registry file for the corpus. |
sample |
An |
s_attr_is_descendent() will evaluate whether s_attribute x is a child of s_attribute y. The return value is TRUE (a single logical value) if all regions defined by x are within the regions defined by y. If not, FALSE is returned. The return value is also FALSE if all regions of x and y are identical. Attributes will be siblings in this case, and not in an ancestor-descendant relationship.
s_attr_is_sibling() will test whether the regions defined for structural attribute x and structural attribute y are identical. If yes, TRUE is returned, assuming that both attributes are at the same level (siblings). If not, FALSE is returned.
s_attr_relationship() will return 0 if s-attributes x and y are siblings in the sense that they define identical regions. The return value is 0 if x is an ancestor of y and 1 if x is a descendent of y.
s_attr_is_descendent("id", "places", corpus = "REUTERS", registry = get_tmp_registry())
s_attr_is_sibling(x = "id", y = "places", corpus = "REUTERS", registry = get_tmp_registry())
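The sibling and descendant tests described above boil down to comparing two region matrices. The following base R sketch shows that logic on made-up data; `is_sibling_sketch()` and `is_descendent_sketch()` are hypothetical functions that take region matrices directly, not the package's implementation.

```r
# Siblings: both attributes define exactly the same regions
is_sibling_sketch <- function(rx, ry) {
  identical(dim(rx), dim(ry)) && all(rx == ry)
}

# Descendant: every region of x lies inside some region of y,
# and the two attributes are not siblings
is_descendent_sketch <- function(rx, ry) {
  if (is_sibling_sketch(rx, ry)) return(FALSE)
  inside <- vapply(
    seq_len(nrow(rx)),
    function(i) any(ry[, 1] <= rx[i, 1] & ry[, 2] >= rx[i, 2]),
    logical(1)
  )
  all(inside)
}

# Made-up regions: three paragraphs nested in two articles
paragraphs <- matrix(c(0L, 4L, 5L, 9L, 10L, 19L), ncol = 2, byrow = TRUE)
articles <- matrix(c(0L, 9L, 10L, 19L), ncol = 2, byrow = TRUE)

is_descendent_sketch(paragraphs, articles)
is_descendent_sketch(articles, paragraphs)
is_sibling_sketch(articles, articles)
```

Checking for siblings first reproduces the rule that identical regions do not count as a descendant relationship.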
Get all regions defined by a structural attribute. Unlike get_region_matrix(), which returns a region matrix for a defined subset of strucs, all regions are returned. As it is the fastest option, the function reads the binary *.rng file for the structural attribute directly. The corpus library (CL) is not used in this case.
s_attr_regions(
  corpus,
  s_attr,
  registry = Sys.getenv("CORPUS_REGISTRY"),
  data_dir = corpus_data_dir(corpus = corpus, registry = registry)
)
corpus |
A length-one |
s_attr |
A length-one |
registry |
A length-one |
data_dir |
The data directory of the corpus. |
A two-column matrix with the regions defined by the structural attribute: Column 1 defines left corpus positions and column 2 right corpus positions of regions.
s_attr_regions("REUTERS", s_attr = "id", registry = get_tmp_registry())
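Reading a binary file of regions directly can be sketched with base R's readBin(). The sketch assumes that a *.rng file is a plain sequence of 4-byte big-endian integers in which each consecutive pair is the left and right corpus position of one region; this layout is an assumption of the example, and `read_rng_sketch()` is a hypothetical function. To stay self-contained, the example writes a small made-up file first.

```r
# Sketch: read pairs of 4-byte big-endian integers into a two-column matrix.
# The file layout is an assumption of this example.
read_rng_sketch <- function(rng_file) {
  n_bytes <- file.info(rng_file)$size
  cpos <- readBin(
    rng_file, what = integer(), size = 4L,
    n = n_bytes / 4L, endian = "big"
  )
  matrix(cpos, ncol = 2, byrow = TRUE)
}

# Write a made-up file with two regions (0-4 and 5-9), then read it back
tmp <- tempfile(fileext = ".rng")
writeBin(c(0L, 4L, 5L, 9L), con = tmp, size = 4L, endian = "big")
rng <- read_rng_sketch(tmp)
rng
```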
Get data.frame with left and right corpus positions (cpos) for structural attributes and values.
s_attribute_decode(
  corpus,
  data_dir,
  s_attribute,
  encoding = NULL,
  registry = Sys.getenv("CORPUS_REGISTRY"),
  method = c("R", "Rcpp")
)
corpus |
A CWB corpus (ID in upper case). |
data_dir |
The data directory where the binary files of the corpus are stored. |
s_attribute |
A structural attribute (length 1 |
encoding |
Encoding of the values ("latin-1" or "utf-8") |
registry |
The CWB registry directory. |
method |
A length-one |
Two approaches are implemented: A pure R solution will decode the files directly in the directory specified by data_dir. An implementation using Rcpp will use the registry file for corpus to find the data directory.
A data.frame with three columns if the s-attribute has values, or two columns if not. Column cpos_left contains the start corpus positions of a structural annotation, cpos_right the end corpus positions. Column value is the value of the annotation.
# pure R implementation (Rcpp implementation fails on Windows in vanilla mode)
b <- s_attribute_decode(
  corpus = "REUTERS",
  data_dir = system.file(package = "RcppCWB", "extdata", "cwb", "indexed_corpora", "reuters"),
  registry = get_tmp_registry(),
  s_attribute = "places",
  method = "R"
)

# Using Rcpp wrappers for CWB C code
b <- s_attribute_decode(
  corpus = "REUTERS",
  data_dir = system.file(package = "RcppCWB", "extdata", "cwb", "indexed_corpora", "reuters"),
  s_attribute = "places",
  method = "Rcpp",
  registry = get_tmp_registry()
)
Get ranges of subcorpus
subcorpus_get_ranges(subcorpus_pointer)
subcorpus_pointer |
A pointer (class |
Use and get temporary registry directory to describe and access the corpora in a package.
use_tmp_registry(pkg = system.file(package = "RcppCWB"))
get_tmp_registry()
pkg |
Full path to a package. |