Title: | A Text Mining Toolkit for Chinese |
---|---|
Description: | A text mining toolkit for Chinese, which includes facilities for Chinese string processing, Chinese NLP support, and encoding detection and conversion. It also provides functions to support the 'tm' package for Chinese text. |
Authors: | Jian Li |
Maintainer: | Jian Li <[email protected]> |
License: | LGPL |
Version: | 0.2-13 |
Built: | 2024-12-01 08:51:13 UTC |
Source: | CRAN |
Print the UTF-8 codes of a string.
catUTF8(string, file = "")
string |
A character vector. |
file |
A |
No results.
Jian Li <[email protected]>
catUTF8("hello")
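To make "UTF-8 codes" concrete, here is a minimal base-R sketch that lists the Unicode code points of a string. It does not call `catUTF8` and is not the package's implementation; the output format of `catUTF8` may differ, and the variable names are illustrative only.

```r
# Base-R sketch (assumption: illustrates code points, not catUTF8's exact output).
s <- "\u4e2d\u6587"                  # two Chinese characters, written as escapes
codes <- utf8ToInt(s)                # integer Unicode code points
labels <- sprintf("U+%04X", codes)   # format each code point as U+XXXX
print(labels)
```

Writing the characters as `\uXXXX` escapes keeps the script itself pure ASCII, which avoids depending on the encoding of the source file.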
Create a Chinese term-document matrix or a document-term matrix.
createDTM(string, language = c("zh", "en"), tokenize = NULL, removePunctuation = TRUE, removeNumbers = TRUE, removeStopwords = TRUE)
createTDM(string, language = c("zh", "en"), tokenize = NULL, removePunctuation = TRUE, removeNumbers = TRUE, removeStopwords = TRUE)
string |
A character vector. |
language |
The language type, 'zh' means Chinese. |
tokenize |
A tokenizer function. |
removePunctuation |
Whether to remove punctuation. |
removeNumbers |
Whether to remove numbers. |
removeStopwords |
Whether to remove stop words. |
Package "tm" is required.
An object of class TermDocumentMatrix or class DocumentTermMatrix.
Jian Li <[email protected]>
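The difference between the two matrix layouts can be sketched in base R without the 'tm' package. The example below is only an illustration under simplifying assumptions: the documents are already segmented into space-separated tokens, and the helper `count_terms` is hypothetical. It is not how `createDTM` or `createTDM` are implemented, and it returns a plain matrix rather than a 'tm' object.

```r
# Base-R sketch of a document-term matrix (assumption: pre-tokenized toy input).
docs <- c("a b a", "b c")
tokens <- strsplit(docs, " ", fixed = TRUE)      # one token vector per document
terms <- sort(unique(unlist(tokens)))            # the vocabulary

# Hypothetical helper: count every vocabulary term in one document.
count_terms <- function(tk) as.integer(table(factor(tk, levels = terms)))

# Rows are documents, columns are terms.
dtm <- t(vapply(tokens, count_terms, integer(length(terms))))
dimnames(dtm) <- list(paste0("doc", seq_along(docs)), terms)

# The term-document matrix is the transpose: rows are terms, columns are documents.
tdm <- t(dtm)
print(dtm)
print(tdm)
```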
Create a word frequency data.frame.
createWordFreq(obj, onlyCN = TRUE, nosymbol = TRUE, stopwords = NULL, useStopDic = FALSE)
obj |
A character vector or |
onlyCN |
Whether to keep only Chinese words. |
nosymbol |
Whether to remove symbols. |
stopwords |
A character vector of stop words. |
useStopDic |
Whether to use the default stop words. |
A data.frame.
Jian Li <[email protected]>
createWordFreq(c("a", "a", "b", "c"), onlyCN = FALSE, nosymbol = TRUE, useStopDic = FALSE)
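For comparison, a word-frequency table can be sketched with base R alone. This is an illustrative assumption about the general shape of such a table, not the package's implementation; the column names `word` and `Freq` are chosen here and may not match what `createWordFreq` returns.

```r
# Base-R sketch (assumption: column names are illustrative, not the package's).
words <- c("a", "a", "b", "c")
freq <- as.data.frame(table(word = words), stringsAsFactors = FALSE)

# Sort by descending frequency, breaking ties alphabetically.
freq <- freq[order(-freq$Freq, freq$word), ]
rownames(freq) <- NULL
print(freq)
```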
GBK character set including some useful information.
data(GBK)
A data frame with 8 columns.
GBK
Chinese characters in UTF-8.
py0
Unique Pinyin of each character.
py
Pinyin string of each character.
Radical
In Chinese, it means 'Bu Shou'.
Stroke_Num_Radical
In Chinese, it means the number of 'Bi Hua'.
Stroke_Order
In Chinese, it means 'Bi Shun'.
Structure
In Chinese, it means 'Zi Ti Jie Gou'.
Freq
Frequency of the character in Sogou news corpus from all sites between June and July 2012.
Jian Li <[email protected]>
Get the current encoding of the locale.
getCharset()
A character string giving the encoding.
Jian Li <[email protected]>
getCharset()
Indicate whether the encoding of the input string is BIG5.
isBIG5(string, combine = FALSE)
string |
A character vector. |
combine |
Whether to combine all the strings. |
Logical value.
Jian Li <[email protected]>
isBIG5("hello")
Indicate whether the encoding of the input string is GB18030.
isGB18030(string, combine = FALSE)
string |
A character vector. |
combine |
Whether to combine all the strings. |
Logical value.
Jian Li <[email protected]>
isGB18030("hello")
Indicate whether the encoding of the input string is GB2312.
isGB2312(string, combine = FALSE)
string |
A character vector. |
combine |
Whether to combine all the strings. |
Logical value.
Jian Li <[email protected]>
isGB2312("hello")
Indicate whether the encoding of the input string is GBK.
isGBK(string, combine = FALSE)
string |
A character vector. |
combine |
Whether to combine all the strings. |
Logical value.
Jian Li <[email protected]>
isGBK("hello")
Indicate whether the encoding of the input string is UTF-8.
isUTF8(string, combine = FALSE)
string |
A character vector. |
combine |
Whether to combine all the strings. |
Logical value.
Jian Li <[email protected]>
isUTF8("hello")
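Encoding detection works on bytes, so a pure ASCII string such as "hello" is a valid byte sequence in UTF-8, GBK, GB18030 and Big5 alike and is not a discriminating test case. The base-R sketch below uses `validUTF8()` (available since R 3.3.0) to show a byte sequence that is valid GBK but not valid UTF-8. It does not call the package's `is*` functions, and it makes no claim about how they decide.

```r
# Base-R sketch (assumption: illustrates byte-level validity, not the package's detector).
ascii <- "hello"
utf8  <- "\u4e2d\u6587"                       # Chinese text stored as UTF-8
gbk   <- rawToChar(as.raw(c(0xD6, 0xD0)))     # GBK bytes for U+4E2D; not valid UTF-8

res <- validUTF8(c(ascii, utf8, gbk))
print(res)
```

The GBK bytes fail the check because 0xD6 announces a two-byte UTF-8 sequence but 0xD0 is not a legal continuation byte.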
Extract the left or right substrings in a character vector.
left(string, n)
right(string, n)
string |
A character vector. |
n |
How many characters. |
A character vector.
Jian Li <[email protected]>
left("hello", 3)
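The behaviour described here can be approximated with base R's `substr()`. The functions `left_sketch` and `right_sketch` below are hypothetical names for a minimal sketch, assuming `n` counts characters rather than bytes; they are not the package's own definitions.

```r
# Base-R sketch (assumption: hypothetical helpers, counting characters, not bytes).
left_sketch <- function(string, n) substr(string, 1, n)

right_sketch <- function(string, n) {
  len <- nchar(string)                         # length in characters
  substr(string, pmax(len - n + 1, 1), len)    # clamp the start at position 1
}

l <- left_sketch("hello", 3)
r <- right_sketch("hello", 3)
print(c(l, r))
```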
National Taiwan University Sentiment Dictionary.
data(NTUSD)
A list with 4 components.
positive_chs
Positive words in simplified Chinese
negative_chs
Negative words in simplified Chinese
positive_cht
Positive words in traditional Chinese
negative_cht
Negative words in traditional Chinese
Convert UTF-8 code strings back to Chinese characters.
revUTF8(string, utype = "R")
string |
A character vector. |
utype |
UTF-8 string type, the default is R type, such as "<U+XXXX>". |
A character vector.
Jian Li <[email protected]>
Set locale to Simplified Chinese/Traditional Chinese/UK.
setchs(rev = FALSE)
setcht(rev = FALSE)
setuk(rev = FALSE)
rev |
Whether to set the locale back. |
No results.
Jian Li <[email protected]>
setchs()
setchs(rev = TRUE)
Dictionary of simplified and traditional Chinese.
data(SIMTRA)
A data frame with 2 columns.
Sim
a simplified Chinese string.
Tra
a traditional Chinese string.
Sport news.
data(SPORT)
A data frame with 6 columns.
id
ID of the news.
time
Time of the news.
title
Title of the news.
class
Class of the news, 'B' means Basketball, 'F' means Football.
abstract
Abstract of the news.
content
Content of the news.
Dictionary of Chinese stop words.
data(STOPWORDS)
A data frame with 1 column.
word
A character vector of stop words.
Return Chinese stop words.
stopwordsCN(stopwords = NULL, useStopDic = TRUE)
stopwords |
A character vector of stop words. |
useStopDic |
Whether to use the default stop words. |
A vector of stop words.
Jian Li <[email protected]>
stopwordsCN("yes", useStopDic = FALSE)
Capitalize the first letter of every word.
strcap(string, strict = FALSE)
string |
A character vector. |
strict |
Whether strict. |
A character vector with the first letter of each word capitalized.
Jian Li <[email protected]>
strcap("the quick red fox jumps over the lazy brown dog")
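Word capitalization of this kind can be sketched in base R with a Perl-style replacement, where `\\U` upper-cases the text that follows it. This is an assumption about the general technique, not the package's implementation, and it ignores the `strict` argument.

```r
# Base-R sketch (assumption: not strcap's implementation; `strict` is not modelled).
x <- "the quick red fox jumps over the lazy brown dog"

# Upper-case a lowercase letter at the start of the string or after whitespace.
capped <- gsub("(^|\\s)([a-z])", "\\1\\U\\2", x, perl = TRUE)
print(capped)
```

Anchoring on whitespace rather than a word boundary keeps the letter after an apostrophe, as in "don't", in lower case.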
Extract matched substrings by regular expression.
strextract(string, pattern, invert = FALSE, ignore.case = FALSE, perl = FALSE, useBytes = FALSE)
string |
A character vector. |
pattern |
A character string containing a regular expression to be matched in the given character vector. |
invert |
A logical value: if TRUE, extract the non-matched substrings. |
ignore.case |
If FALSE, the pattern matching is case sensitive and if TRUE, case is ignored during matching. |
perl |
A logical value. Should perl-compatible regexps be used? |
useBytes |
A logical value. If TRUE the matching is done byte-by-byte rather than character-by-character. |
A character vector with the matched or non-matched substrings.
Jian Li <[email protected]>
txt1 <- c("\t(x1)a(aa2)a ", " bb(bb)")
strextract(txt1, "\\([^)]*\\)")
txt2 <- c(" Ben Franklin and Jefferson Davis", "\tMillard Fillmore")
strextract(txt2, "(?<first>[[:upper:]][[:lower:]]+)", perl = TRUE)
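The same kind of extraction can be done in base R with `gregexpr()` and `regmatches()`. The sketch below is for comparison only; whether `strextract` is built on these functions, and whether it returns the same list structure, is an assumption not confirmed here.

```r
# Base-R sketch (assumption: comparable technique, not necessarily strextract's internals).
txt1 <- c("\t(x1)a(aa2)a ", " bb(bb)")

# Locate every parenthesised group, then pull out the matched substrings.
m <- gregexpr("\\([^)]*\\)", txt1)
matched <- regmatches(txt1, m)
print(matched)

# invert = TRUE returns the pieces between the matches instead.
unmatched <- regmatches(txt1, m, invert = TRUE)
print(unmatched)
```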
Pad a string to a specified length with a padding character.
strpad(string, width = 0, side = c("left", "right", "both"), pad = " ")
string |
A character vector. |
width |
The number of characters of the string after padding. |
side |
Which side to pad. |
pad |
The padding character. |
A character vector after padding.
Jian Li <[email protected]>
strpad(1:5, width = 4, pad = "0")
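Left padding can be sketched in base R with `strrep()` (available since R 3.3.0). The function `pad_left` is a hypothetical helper that assumes a single-character `pad` and counts characters, not display width; it is not the package's implementation and covers only the left side.

```r
# Base-R sketch (assumption: hypothetical helper, left padding only).
pad_left <- function(string, width, pad = " ") {
  string <- as.character(string)
  missing_n <- pmax(width - nchar(string), 0)   # never pad by a negative amount
  paste0(strrep(pad, missing_n), string)
}

padded <- pad_left(1:5, width = 4, pad = "0")
print(padded)
```

Because `nchar()` counts characters, a Chinese character counts as one even though it usually occupies two columns on screen.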
Trim whitespace from a string.
strstrip(string, side = c("both", "left", "right"))
string |
A character vector. |
side |
Which side of the string to trim: 'both', 'left' or 'right'. |
The trimmed character vector.
Jian Li <[email protected]>
strstrip(c("\taaaa ", " bbbb "))
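Base R offers comparable trimming through `trimws()`, whose `which` argument plays the role of `side`. The sketch below shows the base function for comparison and assumes nothing about how `strstrip` is implemented.

```r
# Base-R sketch (assumption: comparable base function, not strstrip itself).
x <- c("\taaaa ", " bbbb ")

both <- trimws(x)                   # strip leading and trailing whitespace
left_only <- trimws(x, which = "left")
print(both)
print(left_only)
```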
Convert Chinese text to pinyin.
toPinyin(string, capitalize = FALSE)
string |
A character vector. |
capitalize |
Whether to capitalize the first letter of each word. |
A character vector in pinyin format.
Jian Li <[email protected]>
toPinyin("the quick red fox jumps over the lazy brown dog")
Convert Chinese text from simplified to traditional characters and vice versa.
toTrad(string, rev = FALSE)
string |
A Chinese string vector. |
rev |
Reverse. TRUE means traditional to simplified. Default is FALSE. |
Converted vectors.
Jian Li <[email protected]>
toTrad("hello")
Convert encoding of Chinese string to UTF-8.
toUTF8(cnstring)
cnstring |
A Chinese string vector. |
Converted vectors.
Jian Li <[email protected]>
toUTF8("hello")
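When the source encoding is already known, the conversion can be sketched with base R's `iconv()`. The example assumes the platform's iconv supports the name "GBK" (`iconvlist()` shows what is available). Unlike `toUTF8`, it does not detect the encoding, and it is not the package's implementation.

```r
# Base-R sketch (assumption: source encoding is known and "GBK" is supported by iconv).
x <- "\u4e2d\u6587"                                # Chinese text stored as UTF-8

gbk  <- iconv(x, from = "UTF-8", to = "GBK")       # simulate legacy GBK input
back <- iconv(gbk, from = "GBK", to = "UTF-8")     # convert it back to UTF-8

print(charToRaw(gbk))    # the GBK byte sequence
print(charToRaw(back))   # the UTF-8 byte sequence
```

`iconv()` returns `NA` for input it cannot convert, so checking for `NA` is a simple way to catch a wrong guess about the source encoding.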