Title: | N-Gram Based Text Categorization |
---|---|
Description: | Text categorization based on n-grams. |
Authors: | Kurt Hornik [aut, cre] , Johannes Rauch [aut], Christian Buchta [aut], Ingo Feinerer [aut] |
Maintainer: | Kurt Hornik <[email protected]> |
License: | GPL-2 |
Version: | 1.0-9 |
Built: | 2024-11-25 16:43:57 UTC |
Source: | CRAN |
ECI/MCI n-Gram Profiles

n-gram profile db for 26 languages based on the European Corpus Initiative Multilingual Corpus I.
ECIMCI_profiles
This profile db was built by Johannes Rauch, using the ECI/MCI corpus (http://www.elsnet.org/eci.html) and the default options employed by package textcat, with all text documents encoded in UTF-8.
The category ids used for the db are the respective IETF language tags (see parse_IETF_language_tag in package NLP), using the ISO 639-2 Part B language subtags and, for Serbian, the script employed (i.e., "scc-Cyrl" and "scc-Latn" for Serbian written in Cyrillic and Latin script, respectively; all other languages in the profile db are written in Latin script).
S. Armstrong-Warwick, H. S. Thompson, D. McKelvie and D. Petitpierre (1994), Data in Your Language: The ECI Multilingual Corpus 1. In “Proceedings of the International Workshop on Sharable Natural Language Resources” (Nara, Japan), 97–106. https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.44.950
## Languages in the ECI/MCI profile db:
names(ECIMCI_profiles)
## Key options used for the profile:
attr(ECIMCI_profiles, "options")[c("n", "size", "reduce", "useBytes")]
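A profile db like this one can also be supplied as the p argument of textcat() (documented further below). The following is a minimal sketch with an illustrative input sentence, not an example from the package itself:

## Categorize a text against the ECI/MCI profiles; the result is the
## IETF language tag of the closest category profile (or NA if the
## closest profile is not unique).
textcat("Ceci est une phrase écrite en français.", p = ECIMCI_profiles)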
TextCat n-Gram Profiles

TextCat n-gram byte and character profile dbs for language identification.
TC_char_profiles TC_byte_profiles
TextCat (https://www.let.rug.nl/vannoord/TextCat/) is a Perl implementation of the Cavnar and Trenkle "n-Gram-Based Text Categorization" technique by Gertjan van Noord, which was subsequently integrated into SpamAssassin. It provides byte n-gram profiles for 74 "languages" (more precisely, language/encoding combinations). The wiseguys C library reimplementation libtextcat adds one more non-empty profile (see https://wiki.documentfoundation.org/Libexttextcat).
TC_byte_profiles provides these byte profiles. TC_char_profiles provides a subset of 56 character profiles obtained by converting the byte sequences to UTF-8 strings where possible.
The category ids are unchanged from the original, and give the full (English) name of the language, optionally combined with the name of the encoding or script. Note that 'scots' indicates Scots, the Germanic language variety historically spoken in Lowland Scotland and parts of Ulster, to be distinguished from Scottish Gaelic (named 'scots_gaelic' in the profiles), the Celtic language variety spoken in most of the western Highlands and in the Hebrides (see https://en.wikipedia.org/wiki/Scots_language).
## Languages in the TC byte profiles:
names(TC_byte_profiles)
## Languages only in the TC byte profiles:
setdiff(names(TC_byte_profiles), names(TC_char_profiles))
## Key options used for the profiles:
attr(TC_byte_profiles, "options")[c("n", "size", "reduce", "useBytes")]
attr(TC_char_profiles, "options")[c("n", "size", "reduce", "useBytes")]
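Either db can be passed as the p argument of textcat() (documented below). The following minimal sketch uses an illustrative input string and is not taken from the package documentation:

## Categorize a text against the byte profiles; the result is one of the
## language/encoding category ids listed above.
textcat("This is yet another English sentence.", p = TC_byte_profiles)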
n-Gram Based Text Categorization

Categorize texts by computing their n-gram profiles, and finding the closest category n-gram profile.
textcat(x, p = textcat::TC_char_profiles, method = "CT", ..., options = list())
x | a character vector of texts, or an R object which can be coerced to this using as.character. |
p | a textcat profile db. By default, the TextCat character profiles are used (see TC_char_profiles). |
method | a character string specifying a built-in method, or a user-defined function for computing distances between n-gram profiles (see textcat_xdist). |
... | options to be passed to the method for computing distances between profiles. |
options | a list of such options. |
For each given text, its n-gram profile is computed using the options in the category profile db. Then, the distance between this profile and the category profiles is computed, and the text is categorized into the category of the closest profile (if this is not unique, NA is obtained).

Unless the profile db uses bytes rather than characters, the texts in x should be encoded in UTF-8.
W. B. Cavnar and J. M. Trenkle (1994), N-Gram-Based Text Categorization. In "Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval", 161–175. https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.53.9367

K. Hornik, P. Mair, J. Rauch, W. Geiger, C. Buchta and I. Feinerer (2013). The textcat Package for n-Gram Based Text Categorization in R. Journal of Statistical Software, 52/6, 1–17. doi:10.18637/jss.v052.i06.
textcat(c("This is an english sentence.", "Das ist ein deutscher satz."))
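As a further, hedged illustration of the method argument (not part of the original examples), one can select a different profile distance, e.g. the rank-based variant described for textcat_xdist below:

## Use the rank-based distance instead of the default Cavnar-Trenkle
## out-of-place measure ("CT"); the default character profiles are used.
textcat("Das ist noch ein deutscher Satz.", method = "ranks")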
Get and set options used for n-gram based text categorization.
textcat_options(option, value)
option | character string indicating the option to get or set (see Details). Can be abbreviated. If missing, all options are returned as a list. |
value | value to be set. If omitted, the current value of the given option is returned. |
Currently, the following options are available:

profile_method: a character string or function specifying a method for computing n-gram profiles (see textcat_profile_db). Default: "textcat".

profile_options: a list of options to be passed to the method for computing profiles. Default: none (empty list).

xdist_method: a character string or function specifying a method for computing distances between n-gram profiles (see textcat_xdist). Default: "CT", giving the Cavnar-Trenkle out-of-place measure.

xdist_options: a list of options to be passed to the method for computing distances between profiles. Default: none (empty list).
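A brief usage sketch (an assumed interactive session, not output from the package documentation):

## Query the current default distance method, change it to the
## rank-based variant, and list all current option settings.
textcat_options("xdist_method")
textcat_options("xdist_method", "ranks")
textcat_options()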
Create n-gram profile dbs for text categorization.
textcat_profile_db(x, id = NULL, method = NULL, ..., options = list(), profiles = NULL)
x | a character vector of text documents, or an R object of text documents extractable via as.character. |
id | a character vector giving the categories of the texts, to be recycled to the length of x. |
method | a character string specifying a built-in method, or a user-defined function for computing n-gram profiles (see Details). |
... | options to be passed to the method for creating profiles. |
options | a list of such options. |
profiles | a textcat profile db object. |
The text documents are split according to the given categories, and n-gram profiles are computed using the specified method, with options either those used for creating profiles if this is not NULL, or obtained by combining the options given in ... and options and merging with the default profile options specified by the textcat option profile_options using exact name matching. The method and options employed for building the db are stored in the db as attributes "method" and "options", respectively.

There is a c method for combining profile dbs provided that these have identical options. There are also a [ method for subscripting and as.matrix and as.simple_triplet_matrix methods to "export" the profiles to a dense matrix or the sparse simple triplet matrix representation provided by package slam, respectively.
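The following sketch illustrates these methods on two tiny, made-up profile dbs (the texts and category ids are illustrative, not from the package):

## Two single-category dbs built with identical (default) options.
p1 <- textcat_profile_db(c("This is English text.",
                           "Some more English text."),
                         id = "english")
p2 <- textcat_profile_db(c("Das ist deutscher Text.",
                           "Noch etwas deutscher Text."),
                         id = "german")
## Combine the dbs, subscript a single category, and export to a matrix.
p <- c(p1, p2)
p["german"]
as.matrix(p)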
Currently, the only available built-in method is "textcnt", which has the following options:

n: a numeric vector giving the numbers of characters or bytes in the n-gram profiles. Default: 1 : 5.

split: the regular expression pattern to be used in word splitting. Default: "[[:space:][:punct:][:digit:]]+".

perl: a logical indicating whether to use Perl-compatible regular expressions in word splitting. Default: FALSE.

tolower: a logical indicating whether to transform texts to lower case (after word splitting). Default: TRUE.

reduce: a logical indicating whether a representation of n-grams more efficient than the one used by Cavnar and Trenkle should be employed. Default: TRUE.

useBytes: a logical indicating whether to use byte n-grams rather than character n-grams. Default: FALSE.

ignore: a character vector of n-grams to be ignored when computing n-gram profiles. Default: "_" (corresponding to a word boundary).

size: the maximal number of n-grams used for a profile. Default: 1000L.
This method uses textcnt in package tau for computing n-gram profiles, with n, split, perl and useBytes corresponding to the respective textcnt arguments, and option reduce setting argument marker as needed. n-grams listed in option ignore are removed, and only the most frequent remaining ones are retained, with the maximal number given by option size.

Unless the profile db uses bytes rather than characters (i.e., option useBytes is TRUE), text documents in x containing non-ASCII characters must declare their encoding (see Encoding), and will be re-encoded to UTF-8.

Note that option n specifies all numbers of characters or bytes to be used in the profiles, and not just the maximal number: e.g., taking n = 3 will create profiles only containing tri-grams.
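For instance (an illustrative sketch, not from the package examples), compare a tri-gram-only profile with one using uni-, bi- and tri-grams:

## Only tri-grams ...
head(textcat_profile_db("The quick brown fox jumps over the lazy dog",
                        id = "en", options = list(n = 3))[["en"]])
## ... versus all n-grams for n = 1, 2, 3.
head(textcat_profile_db("The quick brown fox jumps over the lazy dog",
                        id = "en", options = list(n = 1 : 3))[["en"]])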
## Obtain the texts of the standard licenses shipped with R.
files <- dir(file.path(R.home("share"), "licenses"), "^[A-Z]",
             full.names = TRUE)
texts <- sapply(files, function(f) paste(readLines(f), collapse = "\n"))
names(texts) <- basename(files)
## Build a profile db using the same method and options as for building
## the ECIMCI character profiles.
profiles <- textcat_profile_db(texts, profiles = ECIMCI_profiles)
## Inspect the 10 most frequent n-grams in each profile.
lapply(profiles, head, 10L)
## Combine into one frequency table.
tab <- as.matrix(profiles)
tab[, 1 : 10]
## Determine languages.
textcat(profiles, ECIMCI_profiles)
Cross-Distances Between n-Gram Profiles

Compute cross-distances between collections of n-gram profiles.
textcat_xdist(x, p = NULL, method = "CT", ..., options = list())
x | a textcat profile db (see textcat_profile_db), or an R object of text documents (see Details). |
p | as for x; by default (NULL), x is used, giving all cross-distances between the profiles in x. |
method | a character string specifying a built-in method, or a user-defined function for computing distances between n-gram profiles (see Details). |
... | options to be passed to the method for computing distances. |
options | a list of such options. |
If x (or p) is not a profile db, the n-gram profiles of the individual text documents extracted from it are computed using the profile method and options in p if this is a profile db, and using the current textcat profile method and options otherwise.
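A minimal sketch of this behavior (with illustrative input texts): when x is a character vector rather than a profile db, document profiles are computed on the fly using the profile method and options stored in p.

## Distances between two ad-hoc document profiles and the ECI/MCI
## category profiles.
textcat_xdist(c("This is an English sentence.",
                "Das ist ein deutscher Satz."),
              p = ECIMCI_profiles)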
Currently, the following distance methods for n-gram profiles are available:

"CT": the out-of-place measure of Cavnar and Trenkle.

"ranks": a variant of the Cavnar/Trenkle measure based on the aggregated absolute difference of the ranks of the combined n-grams in the two profiles.

"ALPD": the sum of the absolute differences in n-gram log frequencies.

"KLI": the Kullback-Leibler I-divergence sum_i p_i log(p_i / q_i) of the n-gram frequency distributions p and q of the two profiles.

"KLJ": the Kullback-Leibler J-divergence sum_i (p_i - q_i) log(p_i / q_i), the symmetrized variant KLI(p, q) + KLI(q, p) of the I-divergences.

"JS": the Jensen-Shannon divergence between the n-gram frequency distributions.

"cosine": the cosine dissimilarity between the profiles, i.e., one minus the inner product of the frequency vectors normalized to Euclidean length one (and filled with zeros for entries missing in one of the vectors).

"Dice": the Dice dissimilarity, i.e., the fraction of n-grams present in only one of the two profiles.
For the measures based on distances of frequency distributions, the n-grams of the two profiles are combined, and missing n-grams are given a small positive absolute frequency, which can be controlled by option eps and defaults to 1e-6.

Options given in ... and options are combined, and merged with the default xdist options specified by the textcat option xdist_options using exact name matching.
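As a hedged illustration (with illustrative parameter values, not from the package examples), the distance method and the eps smoothing constant can be selected via the method and options arguments:

## Jensen-Shannon divergences between the TextCat character profiles,
## with a smaller frequency assigned to missing n-grams.
textcat_xdist(TC_char_profiles, method = "JS", options = list(eps = 1e-9))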
## Compute cross-distances between the TextCat byte profiles using the
## CT out-of-place measure.
d <- textcat_xdist(TC_byte_profiles)
## Visualize results of hierarchical cluster analysis on the distances.
plot(hclust(as.dist(d)), cex = 0.7)