Title: | Unicode Data and Utilities |
---|---|
Description: | Data from Unicode 15.1.0 and related utilities. |
Authors: | Kurt Hornik [aut, cre] |
Maintainer: | Kurt Hornik <[email protected]> |
License: | GPL-2 |
Version: | 15.1.0-1 |
Built: | 2024-11-18 06:37:28 UTC |
Source: | CRAN |
Default Unicode algorithms for case conversion.

    u_to_lower_case(x)
    u_to_upper_case(x)
    u_to_title_case(x)
    u_case_fold(x)

x | R objects (see Details). |
These functions are generic functions, with methods for the Unicode character classes (u_char, u_char_range, and u_char_seq) which suitably apply the case mappings to the Unicode characters given by x, and a default method which treats x as a vector of “Unicode strings”, and returns a vector of UTF-8 encoded character strings with the results of the case conversion of the elements of x.
Currently, only the unconditional case maps are available for conversion to lower, upper or title case: other variants may be added eventually.
Currently, conversion to title case is only available for u_char objects. Other methods will be added eventually (once the Unicode text segmentation algorithm is implemented for detecting word boundaries).
Currently, u_case_fold only performs full case folding using the Unicode case mappings with status “C” and “F”: other variants will be added eventually.
The methods for the Unicode character classes return a u_char_seq vector of Unicode character sequences with the conversions of the characters in x.
The default method returns a vector of UTF-8 encoded character strings with the results of the case conversions of the elements of x.

    ## Latin upper case letters A to Z:
    x <- as.u_char(as.u_char_range("0041..005A"))
    ## In case we did not know the code points, we could use e.g.
    x <- as.u_char(utf8ToInt(paste(LETTERS, collapse = "")))
    vapply(x, intToUtf8, "")
    ## Unicode character method:
    vapply(u_to_lower_case(x), intToUtf8, "")
    ## Default method:
    u_to_lower_case(LETTERS)
    u_case_fold("Hi Dave.")
    ## More interesting stuff: sharp s.
    u_to_upper_case("heiß")
    ## Note that the default full upper case mapping of U+00DF (LATIN SMALL
    ## LETTER SHARP S) is *not* to U+1E9E (LATIN CAPITAL LETTER SHARP S).
    u_case_fold("heiß")

Compute the number of Unicode characters (code points) in sequences or ranges of Unicode characters.

    n_of_u_chars(x)

x | a vector of Unicode characters, character ranges, or character sequences. |
An integer vector with the numbers of Unicode characters specified by the elements of x.

    ## How many code points are assigned to the Latin and Cyrillic scripts?
    x <- u_scripts(c("Latn", "Cyrl"))
    ## Numbers in the respective ranges:
    n <- lapply(x, n_of_u_chars)
    n
    ## Total number:
    vapply(n, sum, 0)

A simple Unicode alphabetic tokenizer.

    Unicode_alphabetic_tokenizer(x)

x | a character vector. |
Tokenization first replaces the elements of x by their Unicode character sequences. Then, the non-alphabetic characters (i.e., the ones which do not have the Alphabetic property) are replaced by blanks, and the corresponding strings are split according to the blanks.
A character vector with the tokenized strings.
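The tokenization steps described above can be tried out with a short call. This is only a sketch: the input strings are made up for illustration, and because the tokens depend on which code points carry the Alphabetic property, no output is shown.

```r
library(Unicode)

## Hypothetical input strings. Punctuation, digits and spaces do not have
## the Alphabetic property, so they are expected to act as separators.
x <- c("Hello, world!", "Gr\u00fc\u00df Gott 2024")

tokens <- Unicode_alphabetic_tokenizer(x)
tokens
```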
Unicode blocks.

    u_blocks(x)

x | a character vector with the names of Unicode blocks. |
If x is missing, a list of the Unicode blocks given as u_char_range Unicode character ranges, with the (full) block names as names.
If x is given, a (sub)list of the specific Unicode blocks.
Unicode Character Database (https://www.unicode.org/ucd/)
See u_char_property to find the block (property) of Unicode characters.
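As a small illustration of the two calling forms, the sketch below lists all blocks and then selects one by name. To avoid hard-coding a block name, it takes the name from the full list; the variable names are chosen for this example only.

```r
library(Unicode)

## All blocks, as a named list of u_char_range objects.
blocks <- u_blocks()
head(names(blocks))

## Select a single block by name, using a name taken from the full list.
first_name <- names(blocks)[1L]
first <- u_blocks(first_name)
first

## Number of code points covered by that block.
n_of_u_chars(first[[1L]])
```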
Data structures and basic methods for Unicode character data.

    as.u_char(x)
    as.u_char_range(x)
    as.u_char_seq(x, sep = NA_character_)

x | R objects coercible to the respective Unicode character data types, see Details. |
sep | a character string. |
Package Unicode provides three basic classes for representing Unicode characters: u_char for vectors of Unicode characters, u_char_range for vectors of Unicode character ranges, and u_char_seq for vectors of Unicode character sequences. Objects from these classes are created via the respective coercion functions.
as.u_char knows to coerce integers or hex strings (with or without a leading ‘0x’ or the ‘U+’ typically used for Unicode characters) giving the corresponding code points. It can also handle Unicode character ranges, flattening them out into the corresponding vector of Unicode characters. To “coerce” a UTF-8 encoded R character string to the corresponding Unicode character object, first obtain the integer code points via utf8ToInt and then coerce the result.
as.u_char_range knows to coerce character strings of single Unicode characters or a Unicode range expression with the hex codes of two Unicode characters collapsed by ‘..’ (currently, hard-wired). It can also handle u_char objects, coercing them to ranges of single code points.
as.u_char_seq knows to coerce character strings with the hex codes of Unicode characters collapsed by a non-empty sep. The default corresponds to using ‘,’ if the strings use surrounding angle brackets, and ‘ ’ otherwise. If sep is empty or has length zero, the character strings are used as is, re-encoded in UTF-8 if necessary, and mapped to the corresponding Unicode character sequences using utf8ToInt. as.u_char_seq can also handle Unicode character ranges (giving the corresponding flattened out Unicode character sequences), or lists of objects coercible to Unicode characters via as.u_char.
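The coercion rules above can be illustrated with a few calls. This is a sketch based on the description in this section; the specific code points are arbitrary, and the angle-bracket input assumes the default sep behaves as described.

```r
library(Unicode)

## The same code point given as an integer and as hex strings
## (bare, with a leading '0x', and with a leading 'U+').
a <- as.u_char(65L)
a
as.u_char(c("0041", "0x0041", "U+0041"))

## A range expression is flattened into its individual code points.
abc <- as.u_char(as.u_char_range("0041..0043"))
abc

## Sequences given by hex codes. With the default sep, ',' is assumed to be
## used for the angle-bracket form and ' ' for the plain form.
as.u_char_seq("<0041,0042,0043>")
as.u_char_seq("0041 0042 0043")

## An empty sep maps the string itself to its code points.
as.u_char_seq("ABC", "")
```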
All classes currently have as.character, as.data.frame, c, format, print, rep, unique and [ subscript methods. More methods will be added eventually.
as.u_char returns a u_char object giving a vector of Unicode characters.
as.u_char_range returns a u_char_range object giving a vector of Unicode character ranges.
as.u_char_seq returns a u_char_seq object giving a vector of Unicode character sequences.
Unicode Character Database (https://www.unicode.org/ucd/), https://en.wikipedia.org/wiki/Unicode

    x <- as.u_char_range(c("00AA..00AC", "01CC"))
    x
    ## Corresponding Unicode character sequence object:
    as.u_char_seq(x)
    ## Corresponding Unicode character object with all code points:
    as.u_char(x)
    ## Inspect all Unicode characters in the range:
    u_char_inspect(x)
    ## Turning R character strings into the respective Unicode character
    ## sequences:
    as.u_char_seq(c("Austria", "Trantor"), "")
    ## which can then be subscripted "as usual", e.g.:
    x <- as.u_char_seq(c("Austria", "Trantor"), "")[[1L]][c(3L, 5L)]
    x
    ## To reassemble the character strings:
    intToUtf8(x)

Inspect Unicode characters.

    u_char_inspect(x)

x | an R object which can be coerced to a u_char object. |
A data frame with variables Code, Name and Char, giving the code and name of the given characters and the R character vectors corresponding to the code points.

    ## Who has ever seen a capital sharp s?
    x <- u_char_from_name(c("LATIN SMALL LETTER SHARP S",
                            "LATIN CAPITAL LETTER SHARP S"))
    u_char_inspect(x)
    ## (Does this display anything useful?)

Match Unicode characters to Unicode character ranges.

    u_char_match(x, table, nomatch = NA_integer_)
    x %uin% table

x | an R object which can be coerced to a u_char object via as.u_char. |
table | an R object coercible to a u_char_range object via as.u_char_range. |
nomatch | the value to be returned (after coercion to integer) in the case when no match is found. |
u_char_match returns a vector of the positions of the (first) matches of the Unicode characters given by x (after coercion via as.u_char) to the Unicode character ranges given by table (after coercion via as.u_char_range).
%uin% returns a logical vector indicating if there was a match or not.
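A short sketch of both forms follows, matching three characters against two ranges. The ranges and characters are chosen for illustration only; the variable names tab and hits are not part of the package.

```r
library(Unicode)

## Two ranges: Latin capital letters and Latin small letters.
tab <- as.u_char_range(c("0041..005A", "0061..007A"))

## 'A' (U+0041), 'z' (U+007A) and the digit '1' (U+0031).
x <- as.u_char(c("0041", "007A", "0031"))

## Position of the first matching range for each character, or the
## nomatch value if the character lies in none of the ranges.
u_char_match(x, tab)
u_char_match(x, tab, nomatch = 0L)

## Logical membership test.
hits <- x %uin% tab
hits
```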
Find the names or labels of Unicode characters, or Unicode characters by their name.

    u_char_name(x)
    u_char_from_name(x, type = c("exact", "grep"), ...)
    u_char_label(x)

x | an R object which can be coerced to a u_char object (for u_char_name and u_char_label), or a character vector of names (for u_char_from_name). |
type | one of "exact" or "grep". |
... | arguments to be passed to grepl. |
The Unicode Standard provides a convention for labeling code points that do not have character names (control, reserved, noncharacter, private-use and surrogate code points). These labels can be obtained by u_char_label.
By default, exact matching is used for finding Unicode characters by name. When type = "grep", grepl is used for matching x against the Unicode character names; for now, Hangul syllable and CJK Unified Ideograph names are ignored in this case.
u_char_name and u_char_label return a character vector with the names or labels, respectively, of the corresponding Unicode characters.
u_char_from_name returns a u_char object giving the Unicode characters whose names match the given names (exactly, unless type = "grep").

    x <- as.u_char(utf8ToInt("Austria"))
    u_char_name(x)
    ## Derived Hangul syllable character names are also supported for
    ## finding characters by exact matching:
    x <- u_char_name("0xAC00")
    x
    u_char_from_name(x)
    ## Find all Unicode characters with name matching 'DIGIT ONE'.
    x <- u_char_from_name("\\bDIGIT ONE\\b", "g")
    ## And show their names.
    u_char_name(x)

Get the properties of Unicode characters.

    u_char_info(x)
    u_char_properties(x, which)
    u_char_property(x, which)

x | an R object which can be coerced to a u_char object. |
which | a character vector or string (for u_char_property) naming the properties. |
u_char_info returns a data frame with variables giving the code (Code) and the ‘basic’ Unicode variables Name, General Category, Canonical Combining Class, Bidi Class, Decomposition, Numeric Value Decimal Digit, Numeric Value Digit, Numeric Value, Bidi Mirrored, Unicode 1 Name, ISO Comment, Simple Uppercase Mapping, Simple Lowercase Mapping, and Simple Titlecase Mapping, with names obtained by replacing white spaces by underscores (e.g., Bidi_Class).
u_char_properties returns a data frame with the values of the specified properties, or, if no arguments were given, a character vector with the names of all currently available Unicode character properties.
u_char_property returns the values of the specified property.
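The three return forms described above can be compared side by side. The sketch below uses three arbitrary code points and the "Age" property, which also appears in the package's own example; other property names should be checked against the output of u_char_properties() before use.

```r
library(Unicode)

## Three arbitrary characters: 'A', sharp s and the Euro sign.
x <- as.u_char(c("0041", "00DF", "20AC"))

## Basic properties as a data frame.
info <- u_char_info(x)
names(info)

## Values of selected properties as a data frame.
props <- u_char_properties(x, "Age")
props

## Values of a single property.
u_char_property(x, "Age")
```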
Currently, only the property values of a subset of all Unicode character properties can be obtained.
Unicode Character Database (https://www.unicode.org/ucd/)

    ## When was the Euro sign added to Unicode?
    x <- u_char_from_name("EURO SIGN")
    u_char_property(x, "Age")
    ## List the currently available Unicode character properties.
    u_char_properties()

Unicode named sequences.

    u_named_sequences()

A data frame with elements Name and Sequence giving the names and the corresponding Unicode character sequences.
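A brief sketch of how the returned data frame might be explored. It relies only on the two documented elements, Name and Sequence; the variable name ns is chosen for this example.

```r
library(Unicode)

ns <- u_named_sequences()

## Number of named sequences in the data.
nrow(ns)

## Names of the first few named sequences.
head(ns$Name)

## Code points making up the first named sequence.
ns$Sequence[1L]
```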
Unicode scripts.

    u_scripts(x)

x | a character vector with the names of Unicode scripts. |
If x is missing, a list of the Unicode scripts given as u_char_range Unicode character ranges, with the script names as names.
If x is given, a (sub)list of the specific Unicode scripts.
Unicode Character Database (https://www.unicode.org/ucd/)
See u_char_property to find the script (property) of Unicode characters.

    scripts <- u_scripts()
    names(scripts)
    ## Total number of code points assigned to the scripts:
    sort(vapply(scripts, function(s) sum(n_of_u_chars(s)), 0),
         decreasing = TRUE)
