Package 'tmcn' reference manual

Title:	A Text Mining Toolkit for Chinese
Description:	A text mining toolkit for Chinese that includes facilities for Chinese string processing, Chinese NLP support, and encoding detection and conversion. It also provides functions that support the 'tm' package for Chinese text.
Authors:	Jian Li
Maintainer:	Jian Li <[email protected]>
License:	LGPL
Version:	0.2-13
Built:	2025-01-30 07:40:12 UTC
Source:	CRAN

Print the UTF-8 codes of a string.

Description

Print the UTF-8 codes of a string.

Usage

catUTF8(string, file = "")

Arguments

`string`	A character vector.
`file`	A `connection`, or a character string naming the file to print to. If "" (the default), output goes to the standard output connection, which is the console unless redirected by `sink`.

Value

No results.

Author(s)

Jian Li <[email protected]>

Examples

catUTF8("hello")
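
The notion of a character's "UTF-8 code" can be illustrated in base R. The sketch below is not `catUTF8` itself: it only prints a Unicode code point in the `<U+XXXX>` notation mentioned under `revUTF8`, and whether `catUTF8` prints exactly this format is an assumption.

```r
# Base R only; tmcn is not needed for this illustration.
ch <- "\u4e2d"                    # one Chinese character, written as an escape
cp <- utf8ToInt(ch)               # integer Unicode code point (20013)
code <- sprintf("<U+%04X>", cp)   # hexadecimal form: "<U+4E2D>"
cat(code, "\n")
```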

Create a Chinese term-document matrix or a document-term matrix.

Description

Create a Chinese term-document matrix or a document-term matrix.

Usage

createDTM(string, language = c("zh", "en"), tokenize = NULL, removePunctuation = TRUE, 
  removeNumbers = TRUE, removeStopwords = TRUE)
createTDM(string, language = c("zh", "en"), tokenize = NULL, removePunctuation = TRUE, 
  removeNumbers = TRUE, removeStopwords = TRUE)

Arguments

`string`	A character vector.
`language`	The language type: 'zh' for Chinese, 'en' for English.
`tokenize`	A tokenizer function.
`removePunctuation`	Whether to remove punctuation.
`removeNumbers`	Whether to remove the numbers.
`removeStopwords`	Whether to remove the stop words.

Details

Package "tm" is required.

Value

An object of class TermDocumentMatrix or class DocumentTermMatrix.

Author(s)

Jian Li <[email protected]>

Create a word frequency data.frame.

Description

Create a word frequency data.frame.

Usage

createWordFreq(obj, onlyCN = TRUE, nosymbol = TRUE, stopwords = NULL,
  useStopDic = FALSE)

Arguments

`obj`	A character vector or `DocumentTermMatrix` from which to calculate word frequencies.
`onlyCN`	Whether to keep only Chinese words.
`nosymbol`	Whether to remove symbols.
`stopwords`	A character vector of stop words.
`useStopDic`	Whether to use the default stop words.

Value

A data.frame.

Author(s)

Jian Li <[email protected]>

Examples

createWordFreq(c("a", "a", "b", "c"), onlyCN = FALSE, nosymbol = TRUE, useStopDic = FALSE)
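
As a rough base-R illustration of what a word frequency data.frame looks like, the sketch below counts the same input with `table`. It is not the implementation of `createWordFreq`; the column names `word` and `freq` are chosen here for illustration and may differ from what the package returns.

```r
# Base R only: count words and sort by descending frequency.
words <- c("a", "a", "b", "c")
tab <- table(word = words)
freq <- as.data.frame(tab, responseName = "freq", stringsAsFactors = FALSE)
freq <- freq[order(-freq$freq), ]   # most frequent word first
print(freq)
```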

GBK character set

Description

GBK character set including some useful information.

Usage

data(GBK)

Format

A data frame with 8 columns.

GBK: Chinese characters in UTF-8.
py0: Unique Pinyin of each character.
py: Pinyin string of each character.
Radical: In Chinese, it means 'Bu Shou'.
Stroke_Num_Radical: In Chinese, it means the number of 'Bi Hua'.
Stroke_Order: In Chinese, it means 'Bi Shun'.
Structure: In Chinese, it means 'Zi Ti Jie Gou'.
Freq: Frequency of the character in Sogou news corpus from all sites between June and July 2012.

Author(s)

Jian Li <[email protected]>

Get the current encoding of the locale.

Description

Get the current encoding of the locale.

Usage

getCharset()

Value

A character string naming the encoding.

Author(s)

Jian Li <[email protected]>

Examples

getCharset()
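
For comparison, base R exposes locale and encoding information directly. The sketch below uses only base functions; how `getCharset` derives its result from the locale is not stated here, so treat any correspondence as an assumption.

```r
# Base R only: inspect the character-type locale and its encoding flags.
loc <- Sys.getlocale("LC_CTYPE")     # platform-dependent locale string
info <- l10n_info()                  # list with MBCS, UTF-8 and Latin-1 flags
is_utf8 <- isTRUE(info[["UTF-8"]])   # TRUE when the session locale is UTF-8
print(loc)
print(is_utf8)
```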

Indicate whether the encoding of input string is BIG5.

Description

Indicate whether the encoding of input string is BIG5.

Usage

isBIG5(string, combine = FALSE)

Arguments

`string`	A character vector.
`combine`	Whether to combine all the strings.

Value

Logical value.

Author(s)

Jian Li <[email protected]>

Examples

isBIG5("hello")

Indicate whether the encoding of input string is GB18030.

Description

Indicate whether the encoding of input string is GB18030.

Usage

isGB18030(string, combine = FALSE)

Arguments

`string`	A character vector.
`combine`	Whether to combine all the strings.

Value

Logical value.

Author(s)

Jian Li <[email protected]>

Examples

isGB18030("hello")

Indicate whether the encoding of input string is GB2312.

Description

Indicate whether the encoding of input string is GB2312.

Usage

isGB2312(string, combine = FALSE)

Arguments

`string`	A character vector.
`combine`	Whether to combine all the strings.

Value

Logical value.

Author(s)

Jian Li <[email protected]>

Examples

isGB2312("hello")

Indicate whether the encoding of input string is GBK.

Description

Indicate whether the encoding of input string is GBK.

Usage

isGBK(string, combine = FALSE)

Arguments

`string`	A character vector.
`combine`	Whether to combine all the strings.

Value

Logical value.

Author(s)

Jian Li <[email protected]>

Examples

isGBK("hello")

Indicate whether the encoding of input string is UTF-8.

Description

Indicate whether the encoding of input string is UTF-8.

Usage

isUTF8(string, combine = FALSE)

Arguments

`string`	A character vector.
`combine`	Whether to combine all the strings.

Value

Logical value.

Author(s)

Jian Li <[email protected]>

Examples

isUTF8("hello")
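
Pure ASCII input such as "hello" is byte-for-byte valid in UTF-8 as well as in the GB and BIG5 families, so it is not a very telling test case. The base-R sketch below uses `validUTF8` (not a tmcn function) to show how a byte-level validity check separates UTF-8 text from GBK bytes; the detection logic inside `isUTF8` itself is not documented here and may differ.

```r
# Base R only: byte-level UTF-8 validity check.
ascii <- "hello"
utf8 <- "\u4e2d\u6587"                          # two Chinese characters in UTF-8
gbk_bytes <- rawToChar(as.raw(c(0xD6, 0xD0)))   # U+4E2D encoded as GBK bytes
valid <- validUTF8(c(ascii, utf8, gbk_bytes))
print(valid)   # TRUE TRUE FALSE
```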

Extract the left or right substrings in a character vector.

Description

Extract the left or right substrings in a character vector.

Usage

left(string, n)
right(string, n)

Arguments

`string`	A character vector.
`n`	The number of characters to extract.

Value

A character vector.

Author(s)

Jian Li <[email protected]>

Examples

left("hello", 3)
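
The same behaviour can be sketched in base R with `substr` and `nchar`. The helper names `left_sketch` and `right_sketch` are hypothetical and are not part of tmcn; they only illustrate the presumed semantics of taking `n` characters from either end.

```r
# Base R only: hypothetical helpers mirroring left() and right().
left_sketch <- function(string, n) substr(string, 1, n)
right_sketch <- function(string, n) {
  len <- nchar(string)
  substr(string, pmax(len - n + 1, 1), len)   # clamp the start at 1 for short strings
}
left_sketch("hello", 3)    # "hel"
right_sketch("hello", 3)   # "llo"
```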

National Taiwan University Semantic Dictionary

Description

National Taiwan University Semantic Dictionary.

Usage

data(NTUSD)

Format

A list with 4 components.

positive_chs: Positive words in simplified Chinese
negative_chs: Negative words in simplified Chinese
positive_cht: Positive words in traditional Chinese
negative_cht: Negative words in traditional Chinese

References

http://nlg.csie.ntu.edu.tw

Revert UTF-8 string to Chinese character.

Description

Revert UTF-8 string to Chinese character.

Usage

revUTF8(string, utype = "R")

Arguments

`string`	A character vector.
`utype`	UTF-8 string type, the default is R type, such as "<U+XXXX>".

Value

A character vector.

Author(s)

Jian Li <[email protected]>
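
No example accompanies this entry, so here is a base-R sketch of the conversion the description suggests: turning an R-style `<U+XXXX>` token back into the character it denotes. It handles a single token only and is not the implementation of `revUTF8`.

```r
# Base R only: decode one "<U+XXXX>" token into its character.
code <- "<U+4E2D>"
hex <- sub("^<U\\+([0-9A-Fa-f]+)>$", "\\1", code)   # keep the hex digits: "4E2D"
ch <- intToUtf8(strtoi(hex, base = 16L))            # code point 20013 as a character
print(ch)
```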

Set locale to Simplified Chinese/Traditional Chinese/UK.

Description

Set locale to Simplified Chinese/Traditional Chinese/UK.

Usage

setchs(rev = FALSE)
setcht(rev = FALSE)
setuk(rev = FALSE)

Arguments

`rev`	Whether to set the locale back.

Value

No results.

Author(s)

Jian Li <[email protected]>

Examples

setchs()
setchs(rev = TRUE)

Dictionary of simplified and traditional Chinese

Description

Dictionary of simplified and traditional Chinese.

Usage

data(SIMTRA)

Format

A data frame with 2 columns.

Sim: a simplified Chinese string.
Tra: a traditional Chinese string.

Sport news.

Description

Sport news.

Usage

data(SPORT)

Format

A data frame with 6 columns.

id: ID of the news.
time: Time of the news.
title: Title of the news.
class: Class of the news, 'B' means Basketball, 'F' means Football.
abstract: Abstract of the news.
content: Content of the news.

Dictionary of Chinese stop words

Description

Dictionary of Chinese stop words.

Usage

data(STOPWORDS)

Format

A data frame with 1 column.

word: a character vector of the stop words.

Return Chinese stop words.

Description

Return Chinese stop words.

Usage

stopwordsCN(stopwords = NULL, useStopDic = TRUE)

Arguments

`stopwords`	A character vector of stop words.
`useStopDic`	Whether to use the default stop words.

Value

A vector of stop words.

Author(s)

Jian Li <[email protected]>

Examples

stopwordsCN("yes", useStopDic = FALSE)

Mixed case capitalizing.

Description

Capitalize the first letter of every word.

Usage

strcap(string, strict = FALSE)

Arguments

`string`	A character vector.
`strict`	Whether strict.

Value

A character vector with the first letter of each word capitalized.

Author(s)

Jian Li <[email protected]>

Examples

strcap("the quick red fox jumps over the lazy brown dog")
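
A comparable result can be sketched in base R with a Perl-style regular expression. This is an illustration of the capitalizing idea only; it does not reproduce the `strict` argument, whose exact effect is not described above.

```r
# Base R only: upper-case the first letter of each word.
x <- "the quick red fox jumps over the lazy brown dog"
capped <- gsub("\\b([a-z])", "\\U\\1", x, perl = TRUE)
print(capped)   # "The Quick Red Fox Jumps Over The Lazy Brown Dog"
```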

Extract matched substrings by regular expression.

Description

Extract matched substrings by regular expression.

Usage

strextract(string, pattern, invert = FALSE, ignore.case = FALSE,
  perl = FALSE, useBytes = FALSE)

Arguments

`string`	A character vector.
`pattern`	A character string containing a regular expression to be matched in the given character vector.
`invert`	A logical value: if TRUE, extract the non-matched substrings.
`ignore.case`	If FALSE, the pattern matching is case sensitive and if TRUE, case is ignored during matching.
`perl`	A logical value. Should perl-compatible regexps be used?
`useBytes`	A logical value. If TRUE the matching is done byte-by-byte rather than character-by-character.

Value

A character vector with the matched or non-matched substrings.

Author(s)

Jian Li <[email protected]>

Examples

txt1 <- c("\t(x1)a(aa2)a ", " bb(bb)")
strextract(txt1, "\\([^)]*\\)")
txt2 <- c("  Ben Franklin and Jefferson Davis", "\tMillard Fillmore")
strextract(txt2, "(?<first>[[:upper:]][[:lower:]]+)", perl = TRUE)
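
The arguments mirror those of base R's `gregexpr` and `regmatches`, so the first example can be sketched with them directly. That `strextract` wraps these functions is an assumption; the sketch only shows the base-R route to the same kind of result.

```r
# Base R only: extract every parenthesised group from each string.
txt1 <- c("\t(x1)a(aa2)a ", " bb(bb)")
m <- gregexpr("\\([^)]*\\)", txt1)                # match positions per element
matched <- regmatches(txt1, m)                    # list: one character vector per input
unmatched <- regmatches(txt1, m, invert = TRUE)   # the pieces between the matches
print(matched)
```

The result is a list with one element per input string, which keeps the matches of different inputs apart.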


Pad a string to a specified length with a padding character.

Description

Pad a string to a specified length with a padding character.

Usage

strpad(string, width = 0, side = c("left", "right", "both"),
  pad = " ")

Arguments

`string`	A character vector.
`width`	The number of characters of the string after padding.
`side`	Which side to pad.
`pad`	The padding character.

Value

A character vector after padding.

Author(s)

Jian Li <[email protected]>

Examples

strpad(1:5, width = 4, pad = "0")
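
Left padding, the first choice listed for `side`, can be sketched in base R as follows. The helper `pad_left_sketch` is hypothetical and covers only the left-side case; it is not the package's implementation.

```r
# Base R only: hypothetical left-padding helper.
pad_left_sketch <- function(string, width, pad = " ") {
  string <- as.character(string)
  missing_chars <- pmax(width - nchar(string), 0)   # never a negative repeat count
  paste0(strrep(pad, missing_chars), string)
}
padded <- pad_left_sketch(1:5, width = 4, pad = "0")
print(padded)   # "0001" "0002" "0003" "0004" "0005"
```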


Trim space of a string.

Description

Trim space of a string.

Usage

strstrip(string, side = c("both", "left", "right"))

Arguments

`string`	A character vector.
`side`	Which side of the string to trim: 'both', 'left' or 'right'.

Value

The trimmed character vector.

Author(s)

Jian Li <[email protected]>

Examples

strstrip(c("\taaaa ", " bbbb    "))
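
Base R offers `trimws`, whose `which` argument takes the same three values as `side`. The sketch below applies it to the example input; whether `strstrip` treats exactly the same set of whitespace characters is an assumption.

```r
# Base R only: trim whitespace with trimws().
x <- c("\taaaa ", " bbbb    ")
both <- trimws(x)                        # "aaaa" "bbbb"
left_only <- trimws(x, which = "left")   # trailing spaces are kept
print(both)
```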


Convert a Chinese text to pinyin format.

Description

Convert a Chinese text to pinyin format.

Usage

toPinyin(string, capitalize = FALSE)

Arguments

`string`	A character vector.
`capitalize`	Whether to capitalize the first letter of each word.

Value

A character vector in pinyin format.

Author(s)

Jian Li <[email protected]>

Examples

toPinyin("the quick red fox jumps over the lazy brown dog")


Convert a Chinese text from simplified to traditional characters and vice versa.

Description

Convert a Chinese text from simplified to traditional characters and vice versa.

Usage

toTrad(string, rev = FALSE)

Arguments

`string`	A Chinese string vector.
`rev`	Reverse. TRUE means traditional to simplified. Default is FALSE.

Value

The converted character vector.

Author(s)

Jian Li <[email protected]>

Examples

toTrad("hello")

Convert encoding of Chinese string to UTF-8.

Description

Convert encoding of Chinese string to UTF-8.

Usage

toUTF8(cnstring)

Arguments

`cnstring`	A Chinese string vector.

Value

The converted character vector.

Author(s)

Jian Li <[email protected]>

Examples

toUTF8("hello")
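
When the source encoding is already known, base R's `iconv` performs the conversion explicitly. The sketch below converts GBK bytes to UTF-8; it assumes the platform's iconv supports the "GBK" name, and it says nothing about how `toUTF8` detects the input encoding.

```r
# Base R only: convert a GBK-encoded string to UTF-8.
gbk <- rawToChar(as.raw(c(0xD6, 0xD0)))          # U+4E2D encoded as GBK bytes
utf8 <- iconv(gbk, from = "GBK", to = "UTF-8")   # NA if the conversion fails
cp <- utf8ToInt(utf8)                            # 20013 when the conversion succeeds
print(utf8)
```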
