| Title: | A Text Mining Toolkit for Chinese |
|---|---|
| Description: | A text mining toolkit for Chinese, with facilities for Chinese string processing, Chinese NLP support, and encoding detection and conversion. It also provides functions that support the 'tm' package for Chinese text. |
| Authors: | Jian Li |
| Maintainer: | Jian Li <[email protected]> |
| License: | LGPL |
| Version: | 0.2-13 |
| Built: | 2026-06-03 08:24:41 UTC |
| Source: | https://github.com/cran/tmcn |
Print the UTF-8 codes of a string.
catUTF8(string, file = "")
string |
A character vector. |
file |
A |
No results.
Jian Li <[email protected]>
catUTF8("hello")
Create a Chinese term-document matrix or a document-term matrix.
createDTM(string, language = c("zh", "en"), tokenize = NULL, removePunctuation = TRUE, removeNumbers = TRUE, removeStopwords = TRUE)
createTDM(string, language = c("zh", "en"), tokenize = NULL, removePunctuation = TRUE, removeNumbers = TRUE, removeStopwords = TRUE)
string |
A character vector. |
language |
The language type; 'zh' means Chinese and 'en' means English. |
tokenize |
A tokenizer function. |
removePunctuation |
Whether to remove punctuation. |
removeNumbers |
Whether to remove numbers. |
removeStopwords |
Whether to remove the stop words. |
Package "tm" is required.
An object of class TermDocumentMatrix or class DocumentTermMatrix.
Jian Li <[email protected]>
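The two constructors can be sketched as follows. This is a minimal illustration that relies only on the signatures listed above; the sample sentences in `docs` are made up for the example, and the calls are guarded so the snippet is skipped when 'tmcn' or 'tm' is not installed.

```r
# Only run when both packages are available ('tm' is required by these functions).
has_pkgs <- requireNamespace("tmcn", quietly = TRUE) &&
  requireNamespace("tm", quietly = TRUE)

if (has_pkgs) {
  # Hypothetical English documents, one per element
  docs <- c("text mining with R",
            "mining Chinese text",
            "R for text analysis")

  # Documents in rows, terms in columns
  dtm <- tmcn::createDTM(docs, language = "en")
  # Terms in rows, documents in columns
  tdm <- tmcn::createTDM(docs, language = "en")

  print(dim(dtm))
  print(dim(tdm))
}
```

With the defaults, punctuation, numbers and stop words are removed before counting, so the number of terms can be smaller than the number of distinct words in the input.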
Create a word frequency data.frame.
createWordFreq(obj, onlyCN = TRUE, nosymbol = TRUE, stopwords = NULL, useStopDic = FALSE)
obj |
A character vector or |
onlyCN |
Whether to keep only Chinese words. |
nosymbol |
Whether to keep symbols. |
stopwords |
A character vector of stop words. |
useStopDic |
Whether to use the default stop words. |
A data.frame.
Jian Li <[email protected]>
createWordFreq(c("a", "a", "b", "c"), onlyCN = FALSE, nosymbol = TRUE, useStopDic = FALSE)
GBK character set including some useful information.
data(GBK)
A data frame with 8 columns.
GBK |
Chinese characters in UTF-8. |
py0 |
Unique Pinyin of each character. |
py |
Pinyin string of each character. |
Radical |
The radical of the character ('Bu Shou' in Chinese). |
Stroke_Num_Radical |
The number of strokes ('Bi Hua' in Chinese). |
Stroke_Order |
The stroke order ('Bi Shun' in Chinese). |
Structure |
The structure of the character ('Zi Ti Jie Gou' in Chinese). |
Freq |
Frequency of the character in the Sogou news corpus, collected from all sites between June and July 2012. |
Jian Li <[email protected]>
Get the current encoding of the locale.
getCharset()
A character string giving the encoding.
Jian Li <[email protected]>
getCharset()
Indicate whether the encoding of input string is BIG5.
isBIG5(string, combine = FALSE)
string |
A character vector. |
combine |
Whether to combine all the strings. |
Logical value.
Jian Li <[email protected]>
isBIG5("hello")
Indicate whether the encoding of input string is GB18030.
isGB18030(string, combine = FALSE)
string |
A character vector. |
combine |
Whether to combine all the strings. |
Logical value.
Jian Li <[email protected]>
isGB18030("hello")
Indicate whether the encoding of input string is GB2312.
isGB2312(string, combine = FALSE)
string |
A character vector. |
combine |
Whether to combine all the strings. |
Logical value.
Jian Li <[email protected]>
isGB2312("hello")
Indicate whether the encoding of input string is GBK.
isGBK(string, combine = FALSE)
string |
A character vector. |
combine |
Whether to combine all the strings. |
Logical value.
Jian Li <[email protected]>
isGBK("hello")
Indicate whether the encoding of input string is UTF-8.
isUTF8(string, combine = FALSE)
string |
A character vector. |
combine |
Whether to combine all the strings. |
Logical value.
Jian Li <[email protected]>
isUTF8("hello")
Extract the left or right substrings in a character vector.
left(string, n)
right(string, n)
string |
A character vector. |
n |
The number of characters to extract. |
A character vector.
Jian Li <[email protected]>
left("hello", 3)
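To make the intended behaviour concrete, here is a base R sketch of what `left()` and `right()` presumably do. It is not the package's implementation: `left_sketch()` and `right_sketch()` are hypothetical helpers, written on the assumption that the functions return the first or last `n` characters of each element.

```r
# Hypothetical base R equivalents; assumes left()/right() take the
# first/last n characters of each element of a character vector.
left_sketch <- function(string, n) {
  substr(string, 1, n)
}

right_sketch <- function(string, n) {
  len <- nchar(string)
  # Start n - 1 positions before the last character
  substr(string, len - n + 1, len)
}

left_sketch("hello", 3)   # "hel"
right_sketch("hello", 3)  # "llo"
```

Both helpers are vectorised over `string`, because `substr()` and `nchar()` are.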
National Taiwan University Semantic Dictionary.
data(NTUSD)
A list with 4 components.
positive_chs |
Positive words in simplified Chinese. |
negative_chs |
Negative words in simplified Chinese. |
positive_cht |
Positive words in traditional Chinese. |
negative_cht |
Negative words in traditional Chinese. |
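One possible use of the four components is a simple polarity count over tokenized text. The sketch below is only an illustration under stated assumptions: the `tokens` vector is hypothetical and is built from the dictionary's own entries to avoid hard-coding Chinese words, and the scoring rule (positive hits minus negative hits) is a common baseline, not something the package prescribes.

```r
if (requireNamespace("tmcn", quietly = TRUE)) {
  data(NTUSD, package = "tmcn")

  # Hypothetical tokenized document: two positive entries and one negative entry
  tokens <- c(head(NTUSD$positive_chs, 2), head(NTUSD$negative_chs, 1))

  # Count dictionary hits on each side
  n_pos <- sum(tokens %in% NTUSD$positive_chs)
  n_neg <- sum(tokens %in% NTUSD$negative_chs)

  # Naive polarity: positive hits minus negative hits
  polarity <- n_pos - n_neg
  print(c(positive = n_pos, negative = n_neg, polarity = polarity))
}
```

For traditional Chinese text, the same counts would use `positive_cht` and `negative_cht` instead.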
Revert a UTF-8 code string to Chinese characters.
revUTF8(string, utype = "R")
string |
A character vector. |
utype |
UTF-8 string type, the default is R type, such as "<U+XXXX>". |
A character vector.
Jian Li <[email protected]>
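A minimal usage sketch, based only on the signature above and on the "<U+XXXX>" form mentioned for the default `utype`. The input string is an assumption chosen for illustration (U+4E2D and U+6587 are two common Chinese characters), and the call is guarded so it is skipped when 'tmcn' is not installed.

```r
if (requireNamespace("tmcn", quietly = TRUE)) {
  # R-style code points for two Chinese characters (U+4E2D, U+6587)
  codes <- "<U+4E2D><U+6587>"

  # Convert the escaped form back to characters
  chars <- tmcn::revUTF8(codes, utype = "R")
  print(chars)
}
```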
Set the locale to Simplified Chinese, Traditional Chinese, or UK English.
setchs(rev = FALSE)
setcht(rev = FALSE)
setuk(rev = FALSE)
rev |
Whether to set the locale back. |
No results.
Jian Li <[email protected]>
setchs()
setchs(rev = TRUE)
Dictionary of simplified and traditional Chinese.
data(SIMTRA)
A data frame with 2 columns.
Sim |
A simplified Chinese string. |
Tra |
A traditional Chinese string. |
Sport news.
data(SPORT)
A data frame with 6 columns.
id |
ID of the news. |
time |
Time of the news. |
title |
Title of the news. |
class |
Class of the news; 'B' means Basketball, 'F' means Football. |
abstract |
Abstract of the news. |
content |
Content of the news. |
Dictionary of Chinese stop words.
data(STOPWORDS)
A data frame with 1 column.
word |
A string vector of the stop words. |
Return Chinese stop words.
stopwordsCN(stopwords = NULL, useStopDic = TRUE)
stopwords |
A character vector of stop words. |
useStopDic |
Whether to use the default stop words. |
A vector of stop words.
Jian Li <[email protected]>
stopwordsCN("yes", useStopDic = FALSE)
Capitalize the first letter of every word.
strcap(string, strict = FALSE)
string |
A character vector. |
strict |
Whether strict. |
A character vector with the first letter of each word capitalized.
Jian Li <[email protected]>
strcap("the quick red fox jumps over the lazy brown dog")
Extract matched substrings by regular expression.
strextract(string, pattern, invert = FALSE, ignore.case = FALSE, perl = FALSE, useBytes = FALSE)
string |
A character vector. |
pattern |
A character string containing a regular expression to be matched in the given character vector. |
invert |
A logical value: if TRUE, extract the non-matched substrings. |
ignore.case |
If FALSE, the pattern matching is case sensitive and if TRUE, case is ignored during matching. |
perl |
A logical value. Should perl-compatible regexps be used? |
useBytes |
A logical value. If TRUE the matching is done byte-by-byte rather than character-by-character. |
A character vector with the matched or non-matched substrings.
Jian Li <[email protected]>
txt1 <- c("\t(x1)a(aa2)a ", " bb(bb)")
strextract(txt1, "\\([^)]*\\)")
txt2 <- c(" Ben Franklin and Jefferson Davis", "\tMillard Fillmore")
strextract(txt2, "(?<first>[[:upper:]][[:lower:]]+)", perl = TRUE)
Pad a string to a specified length with a padding character.
strpad(string, width = 0, side = c("left", "right", "both"), pad = " ")
string |
A character vector. |
width |
The number of characters of the string after padding. |
side |
Which side to pad. |
pad |
The padding character. |
A character vector after padding.
Jian Li <[email protected]>
strpad(1:5, width = 4, pad = "0")
Trim whitespace from a string.
strstrip(string, side = c("both", "left", "right"))
string |
A character vector. |
side |
Which side of the string to trim: 'both', 'left' or 'right'. |
The trimmed character vector.
Jian Li <[email protected]>
strstrip(c("\taaaa ", " bbbb "))
Convert Chinese text to Pinyin.
toPinyin(string, capitalize = FALSE)
string |
A character vector. |
capitalize |
Whether to capitalize the first letter of each word. |
A character vector in pinyin format.
Jian Li <[email protected]>
toPinyin("the quick red fox jumps over the lazy brown dog")
Convert Chinese text from simplified to traditional characters, or vice versa.
toTrad(string, rev = FALSE)
string |
A Chinese string vector. |
rev |
Whether to reverse the conversion. TRUE converts traditional to simplified. The default is FALSE. |
The converted character vector.
Jian Li <[email protected]>
toTrad("hello")
Convert encoding of Chinese string to UTF-8.
toUTF8(cnstring)
cnstring |
A Chinese string vector. |
The converted character vector.
Jian Li <[email protected]>
toUTF8("hello")
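The encoding checks and `toUTF8()` can be combined into a small "detect, then convert" step. The following is a sketch that uses only the signatures documented above; the input vector `x` is a placeholder, and the guard skips the snippet when 'tmcn' is not installed.

```r
if (requireNamespace("tmcn", quietly = TRUE)) {
  # Placeholder input; in practice this would be text read from a file
  x <- c("hello", "world")

  # Check the encoding of the input strings
  ok <- tmcn::isUTF8(x)

  # Convert only when at least one element is not reported as UTF-8
  if (!all(ok)) {
    x <- tmcn::toUTF8(x)
  }
  print(x)
}
```

Wrapping the check in `all()` keeps the condition a single logical value whether `isUTF8()` reports one result per string or one combined result.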