Package 'PersianStemmer'

Title: Persian Stemmer for Text Analysis
Description: Allows users to stem Persian texts for text analysis.
Authors: Roozbeh Safshekan and Rich Nielsen
Maintainer: Roozbeh Safshekan <[email protected]>
License: GPL (>= 2)
Version: 1.0
Built: 2024-10-31 22:21:02 UTC
Source: CRAN

A package for stemming Persian for text analysis.

Description

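This package stems Persian text for text analysis. It also provides helper functions for cleaning text, stemming verbs and Arabic broken plurals, removing prefixes, suffixes and stopwords, and transliterating between Persian and Latin characters.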

Details

Use the PerStem function. Its arguments switch the individual cleaning, stemming and transliteration steps on or off.

Author(s)

Roozbeh Safshekan <[email protected]> and Rich Nielsen <[email protected]>

See Also

PerStem

Examples

# Load data
data(UniversityofTehran)

# Stem and transliterate the text
PerStem(UniversityofTehran, NoEnglish = TRUE, NoNumbers = TRUE,
        NoStopwords = TRUE, NoPunctuation = TRUE,
        StemVerbs = TRUE, NoPreSuffix = TRUE, Context = TRUE,
        StemBrokenPlurals = TRUE, Transliteration = TRUE)

Stems Arabic broken plurals

Description

Stems Arabic broken plurals and returns singulars.

Usage

FixBrokenPlurals(texts)

Arguments

texts

A string with Arabic broken plurals that should be stemmed.

Value

FixBrokenPlurals returns a string with Arabic broken plurals stemmed.

Author(s)

Safshekan, Nielsen

Examples

# Create string with Arabic broken plurals
x <- '\u0645\u0635\u0627\u062F\u06CC\u0642 
\u0648\u0632\u0631\u0627 
\u062D\u062F\u0648\u062F'

# Remove new line characters and fix half-spaces in a string.
x <- RemNewlineHalfspace(x)

# Remove all characters that are not Latin, Persian or punctuation, 
# and standardize Persian characters.
x <- RefineChars(x)

# Stem Arabic broken plurals
FixBrokenPlurals(x)
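
The StemBrokenPlurals argument of PerStem is documented as using a default list of Arabic broken plurals. A list-based replacement can be illustrated with a minimal base R sketch. It is not the package's implementation: the three-entry table and the name fix_plurals_sketch are hypothetical, and the package's own list and output format may differ.

```r
# Hypothetical lookup table (an assumption for illustration only):
# position i of `plurals` maps to position i of `singulars`.
plurals <- c("\u0645\u0635\u0627\u062F\u06CC\u0642",  # masadiq
             "\u0648\u0632\u0631\u0627",              # vozara
             "\u062D\u062F\u0648\u062F")              # hodud
singulars <- c("\u0645\u0635\u062F\u0627\u0642",      # mesdaq
               "\u0648\u0632\u06CC\u0631",            # vazir
               "\u062D\u062F")                        # hadd

fix_plurals_sketch <- function(text) {
  # Split on whitespace so each word can be looked up on its own
  tokens <- strsplit(text, "[[:space:]]+")[[1]]
  hit <- match(tokens, plurals)   # NA where a token is not in the table
  found <- !is.na(hit)
  tokens[found] <- singulars[hit[found]]
  paste(tokens, collapse = " ")
}

fix_plurals_sketch(paste(plurals, collapse = " "))
```

Words that are not in the table pass through unchanged, so the quality of this kind of stemming depends entirely on the coverage of the list.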

Stems verbs

Description

Stems verbs and returns past and present roots.

Usage

FixVerbs(texts, Context)

Arguments

texts

A Persian string in Unicode.

Context

If TRUE, the function stems past-root+'he' only if other verbs with the same past root exist in the text. If FALSE, the function stems verbs without considering other words in the text.

Value

FixVerbs returns a string with verbs stemmed.

Author(s)

Safshekan, Nielsen

Examples

# Create string with Persian verbs
x <- '\u0646\u0648\u0634\u062A\u0647 \u0634\u062F\u0647 
\u0628\u0648\u062F\u0647 \u0627\u0633\u062A - \u0646\u0648\u0634\u062A\u0645 - 
\u062F\u0627\u0631\u06CC\u0645 \u0645\u06CC\u0631\u0648\u06CC\u0645 - 
\u062E\u0648\u0627\u0646\u062F\u0647 \u0645\u06CC\u0634\u0648\u0646\u062F - 
\u062E\u0648\u0627\u0647\u062F \u06AF\u0641\u062A - 
\u0628\u0631\u062F\u0647 \u0627\u0633\u062A - 
\u0645\u06CC\u06AF\u0648\u06CC\u06CC\u0645'

# Remove new line characters and fix half-spaces in a string.
x <- RemNewlineHalfspace(x)

# Remove all characters that are not Latin, Persian or punctuation, 
# and standardize Persian characters.
x <- RefineChars(x)

# Stem verbs
y <- FixVerbs(x, Context = TRUE)
z <- FixVerbs(x, Context = FALSE)

# Remove the numeric signifiers (digits 0 to 5) that are used in the PerStem function.
gsub("[0-5]", "", y)
gsub("[0-5]", "", z)

Persian Stemmer for Text Analysis

Description

Stems Persian texts for text analysis.

Usage

PerStem(dat, NoEnglish = TRUE, NoNumbers = TRUE, 
	NoStopwords = TRUE, NoPunctuation = TRUE, 
	StemVerbs = TRUE, NoPreSuffix = TRUE, 
	Context = TRUE, StemBrokenPlurals = TRUE, 
	Transliteration = TRUE)

Arguments

dat

The original data.

NoEnglish

If TRUE, removes English characters.

NoNumbers

If TRUE, removes numbers.

NoStopwords

If TRUE, removes stopwords using the default stopword list.

NoPunctuation

If TRUE, the function removes punctuation. If FALSE, it fixes punctuation for text analysis by inserting spaces before and after each punctuation character.

StemVerbs

If TRUE, stems verbs and returns the past or present root of each verb.

NoPreSuffix

If TRUE, stems words by removing prefixes and suffixes.

Context

If TRUE, the function stems a word only if its stem exists in the text. If FALSE, the function stems words without considering other words in the text.

StemBrokenPlurals

If TRUE, stems Arabic broken plurals and returns their singulars, using the default list of Arabic broken plurals.

Transliteration

If TRUE, transliterates Persian Unicode characters into Latin characters using a transliteration system developed by Roozbeh Safshekan and Rich Nielsen.

Details

PerStem prepares texts in Persian for text analysis by stemming.

Value

PerStem returns the stemmed Persian text.

Author(s)

Roozbeh Safshekan, Richard Nielsen

Examples

# Load data
data(UniversityofTehran)

# Stem and transliterate the text
PerStem(UniversityofTehran, NoEnglish = TRUE, NoNumbers = TRUE,
        NoStopwords = TRUE, NoPunctuation = TRUE,
        StemVerbs = TRUE, NoPreSuffix = TRUE, Context = TRUE,
        StemBrokenPlurals = TRUE, Transliteration = TRUE)
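
Stemmed text is usually tokenized and counted before further analysis. The following base R sketch shows one way this might be done. The string stemmed is a placeholder rather than real PerStem output, and term_frequencies is a hypothetical helper that is not part of the package; it assumes the stemmed text is a single whitespace-separated string.

```r
# Placeholder standing in for stemmed, transliterated text (not real PerStem() output)
stemmed <- "danWGah thran danWGah"

term_frequencies <- function(text) {
  # Split on runs of whitespace and drop empty tokens
  tokens <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  # Count each distinct token, most frequent first
  sort(table(tokens), decreasing = TRUE)
}

freq <- term_frequencies(stemmed)
freq
```

Because stemming maps inflected forms to a shared stem, counts like these group together words that would otherwise be treated as distinct terms.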

Removes all characters that are not Latin, Persian or punctuation, and standardizes Persian characters.

Description

Removes all unicode characters except Latin, Persian or General Punctuation characters and standardizes Persian characters.

Usage

RefineChars(texts)

Arguments

texts

A string from which all characters that are not Latin, Persian or punctuation should be removed, and in which Persian characters should be standardized.

Value

RefineChars returns a string with only Latin, standardized Persian or general punctuation characters.

Author(s)

Safshekan, Nielsen

Examples

# Create string with Latin, Persian, Japanese, non-standardized Persian and punctuation characters.
x <- '\u062F\u0627\u0646\u0634\u06AF\u0627\u0647\u064A \u060C 
\u0641\u06CC\u0632\u06CC\u0643 university 
\u65E5\u672C \u0664\u0665\u0666'

# Remove new line characters and fix half-spaces in a string.
x <- RemNewlineHalfspace(x)

# Remove all characters that are not Latin, Persian or punctuation, 
# and standardize Persian characters.
RefineChars(x)
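
The two steps described above, standardizing and filtering, can be sketched in base R. This is an illustration under stated assumptions, not the package's code: refine_chars_sketch is a hypothetical name, standardization is limited to the Arabic yeh and kaf that appear in the example string, and the kept ranges (printable ASCII, the Unicode Arabic block U+0600 to U+06FF, and the General Punctuation block U+2000 to U+206F) are a guess at what "Latin, Persian or punctuation" covers.

```r
refine_chars_sketch <- function(text) {
  # Standardize: map Arabic yeh and kaf to their Persian counterparts
  text <- gsub("\u064A", "\u06CC", text, fixed = TRUE)  # Arabic yeh -> Persian yeh
  text <- gsub("\u0643", "\u06A9", text, fixed = TRUE)  # Arabic kaf -> Persian kaf
  # Filter: drop every character outside the assumed ranges
  # (this also drops control characters such as new lines)
  gsub("[^ -~\u0600-\u06FF\u2000-\u206F]", "", text, perl = TRUE)
}

x <- "\u0641\u06CC\u0632\u06CC\u0643 university \u65E5\u672C"
refine_chars_sketch(x)
```

In this sketch the Arabic-Indic digits in the example string would survive, because they fall inside the Arabic block; how RefineChars itself treats them is not stated here.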

Removes new line characters and fixes half-spaces

Description

Removes new line characters and fixes half-spaces in a string.

Usage

RemNewlineHalfspace(texts)

Arguments

texts

A string whose new line characters should be removed and whose half-spaces should be fixed.

Value

RemNewlineHalfspace returns a string with new line characters removed and half-spaces fixed.

Author(s)

Safshekan, Nielsen

Examples

# Create a Persian string with new line characters and half-spaces
x <- '\u062F\u0627\u0646\u0634\u06AF\u0627\u0647\u200C\u0647\u0627\u06CC
\u062A\u0647\u0631\u0627\u0646'

# Remove new line characters (e.g. \n, \r, \t, \f, \v) and fix half-spaces
RemNewlineHalfspace(x)
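
A rough base R sketch of this kind of whitespace normalization follows. It rests on two assumptions that are not confirmed here: that new line characters are replaced by a single space, and that "fixing" a half-space means deleting the zero-width non-joiner (U+200C) so that the joined word form remains. The function name rem_newline_halfspace_sketch is hypothetical.

```r
rem_newline_halfspace_sketch <- function(text) {
  # Assumption: runs of new line and similar control characters become one space
  text <- gsub("[\n\r\t\f\v]+", " ", text)
  # Assumption: the half-space (zero-width non-joiner, U+200C) is simply deleted
  gsub("\u200C", "", text, fixed = TRUE)
}

x <- "\u062F\u0627\u0646\u0634\u06AF\u0627\u0647\u200C\u0647\u0627\u06CC\n\u062A\u0647\u0631\u0627\u0646"
rem_newline_halfspace_sketch(x)
```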

Remove English characters

Description

Removes English characters from a string.

Usage

RemoveEnglish(texts)

Arguments

texts

A string from which English characters should be removed.

Value

RemoveEnglish returns a string with English characters removed.

Author(s)

Safshekan, Nielsen

Examples

# Create string with English characters
x <- '\u062F\u0627\u0646\u0634\u06AF\u0627\u0647 University'

# Remove English characters
RemoveEnglish(x)
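
For comparison, a similar effect can be approximated in base R with a single regular expression. This sketch assumes that "English characters" means the unaccented letters A to Z in either case; remove_english_sketch is a hypothetical name and the package may define the set differently.

```r
remove_english_sketch <- function(text) {
  # perl = TRUE keeps the A-Z and a-z ranges independent of the locale's collation
  gsub("[A-Za-z]", "", text, perl = TRUE)
}

remove_english_sketch("\u062F\u0627\u0646\u0634\u06AF\u0627\u0647 University")
```

Note that the space that separated the two words is left in place; only the letters are deleted.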

Remove numerals

Description

Removes numerals from a string.

Usage

RemoveNumbers(texts)

Arguments

texts

A string from which numerals should be removed.

Value

RemoveNumbers returns a string with numerals removed.

Author(s)

Safshekan, Nielsen

Examples

# Create string with Persian characters and numerals
x <- '\u0633\u0627\u0644 \u06F1\u06F3\u06F9\u06F8'

# Remove Numbers
RemoveNumbers(x)
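
Persian text can contain digits from more than one Unicode range, which a plain [0-9] pattern would miss. The base R sketch below covers three ranges: ASCII digits, Arabic-Indic digits (U+0660 to U+0669) and Extended Arabic-Indic digits (U+06F0 to U+06F9). Which of these RemoveNumbers itself handles is an assumption, and remove_numbers_sketch is a hypothetical name.

```r
remove_numbers_sketch <- function(text) {
  # ASCII digits, Arabic-Indic digits, and Extended Arabic-Indic (Persian) digits
  gsub("[0-9\u0660-\u0669\u06F0-\u06F9]", "", text, perl = TRUE)
}

x <- "\u0633\u0627\u0644 \u06F1\u06F3\u06F9\u06F8"
remove_numbers_sketch(x)
```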

Remove Persian prefixes and suffixes

Description

Removes Persian prefixes and suffixes from a Unicode string using the default list of Persian prefixes and suffixes.

Usage

RemovePreSuffix(texts, Context)

Arguments

texts

A Persian string in Unicode.

Context

If TRUE, the function removes the prefixes and suffixes of a word only if its stem exists in the text. If FALSE, the function removes prefixes and suffixes without considering other words in the text.

Value

RemovePreSuffix returns a string with Persian prefixes and suffixes removed.

Author(s)

Safshekan, Nielsen

Examples

# Create string with Persian characters
x <- '\u0627\u0628\u0631\u0642\u062F\u0631\u062A\u0647\u0627\u06CC\u06CC 
\u06A9\u062A\u0627\u0628\u0647\u0627\u06CC\u0645 \u06A9\u062A\u0627\u0628'

# Remove new line characters and fix half-spaces in a string.
x <- RemNewlineHalfspace(x)

# Remove all characters that are not Latin, Persian or punctuation, 
# and standardize Persian characters.
x <- RefineChars(x)

# Remove Prefixes and Suffixes
RemovePreSuffix(x, Context = TRUE)
RemovePreSuffix(x, Context = FALSE)
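
The effect of the Context argument can be illustrated with a small base R sketch that handles suffixes only. It is a simplified illustration, not the package's algorithm: the three-entry suffix list and the name strip_suffix_sketch are hypothetical, and prefixes are ignored.

```r
# Hypothetical suffix list (an assumption), longest first: -hayam, -hayi, -ha
suffixes <- c("\u0647\u0627\u06CC\u0645", "\u0647\u0627\u06CC\u06CC", "\u0647\u0627")

strip_suffix_sketch <- function(text, context = TRUE) {
  tokens <- strsplit(text, "[[:space:]]+")[[1]]
  stemmed <- vapply(tokens, function(tok) {
    for (suf in suffixes) {
      if (endsWith(tok, suf) && nchar(tok) > nchar(suf)) {
        stem <- substr(tok, 1, nchar(tok) - nchar(suf))
        # With context = TRUE, strip only if the bare stem occurs in the text
        if (!context || stem %in% tokens) return(stem)
      }
    }
    tok  # no applicable suffix: keep the word as it is
  }, character(1), USE.NAMES = FALSE)
  paste(stemmed, collapse = " ")
}

x <- paste("\u0627\u0628\u0631\u0642\u062F\u0631\u062A\u0647\u0627\u06CC\u06CC",
           "\u06A9\u062A\u0627\u0628\u0647\u0627\u06CC\u0645",
           "\u06A9\u062A\u0627\u0628")
strip_suffix_sketch(x, context = TRUE)
strip_suffix_sketch(x, context = FALSE)
```

In this sketch the second word is stemmed in both modes because its stem appears as the third word, while the first word is stemmed only when context is FALSE. The context check trades recall for precision: it avoids cutting a word whose ending merely resembles a suffix.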

Remove Persian stopwords

Description

Defines a list of Persian stopwords and removes them from a string.

Usage

RemoveStopwords(texts)

Arguments

texts

A string from which Persian stopwords should be removed.

Value

RemoveStopwords returns a string with Persian stopwords removed.

Author(s)

Safshekan, Nielsen

Examples

# Create Persian string with stopwords
x <- '\u0627\u0632 
\u062F\u0627\u0646\u0634\u06AF\u0627\u0647 
\u0622\u0645\u062F'

# Remove new line characters and fix half-spaces in a string.
x <- RemNewlineHalfspace(x)

# Remove all characters that are not Latin, Persian or punctuation, 
# and standardize Persian characters.
x <- RefineChars(x)

# Remove stopwords
RemoveStopwords(x)
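
Stopword removal of this kind amounts to filtering tokens against a list. The base R sketch below uses a hypothetical four-word list (az, be, dar, ke) and a hypothetical function name; the package's default stopword list is not reproduced here and is certainly longer.

```r
# Hypothetical stopword list (an assumption): az, be, dar, ke
stopwords_sketch <- c("\u0627\u0632", "\u0628\u0647", "\u062F\u0631", "\u06A9\u0647")

remove_stopwords_sketch <- function(text, stopwords = stopwords_sketch) {
  tokens <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  # Keep only the tokens that are not in the stopword list
  paste(tokens[!tokens %in% stopwords], collapse = " ")
}

remove_stopwords_sketch("\u0627\u0632 \u062F\u0627\u0646\u0634\u06AF\u0627\u0647 \u0622\u0645\u062F")
```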

Transliterate Latin characters into Persian unicode characters

Description

Transliterates Latin characters into Persian Unicode characters using a transliteration system developed by Roozbeh Safshekan and Rich Nielsen.

Usage

ReverseTransliterate(texts)

Arguments

texts

A string in Latin characters to be transliterated into Persian characters.

Value

ReverseTransliterate returns a string in Persian characters.

Author(s)

Safshekan, Nielsen

Examples

# Create Latin string 
x <- 'danWGah thran'

# Convert Latin characters into Persian Unicode characters
ReverseTransliterate(x)

Remove or fix punctuation

Description

Removes punctuation characters or inserts spaces before and after them so that they can be used in text analysis as separate units.

Usage

RFPunctuation(texts, NoPunctuation)

Arguments

texts

A string with punctuation which should be removed or fixed.

NoPunctuation

If TRUE, the function removes punctuation. If FALSE, the function inserts spaces before and after punctuation.

Value

RFPunctuation returns a string with punctuation removed or fixed for text analysis.

Author(s)

Safshekan, Nielsen

Examples

# Create string with Persian characters and punctuation
x <- '\u062F\u0627\u0646\u0634\u06AF\u0627\u0647\u060C \u062A\u0647\u0631\u0627\u0646\u061F'

# Remove punctuation
RFPunctuation(x, NoPunctuation = TRUE)  

# Fix punctuation
RFPunctuation(x, NoPunctuation = FALSE)
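
The two modes can be approximated in base R as follows. This is a sketch under assumptions: rf_punctuation_sketch is a hypothetical name, the punctuation set is ASCII punctuation plus the Arabic comma, semicolon and question mark, and the collapsing of repeated spaces in the "fix" mode is a choice made here that the package may not share.

```r
rf_punctuation_sketch <- function(text, no_punctuation = TRUE) {
  # ASCII punctuation plus Arabic comma (U+060C), semicolon (U+061B), question mark (U+061F)
  punct <- "([[:punct:]\u060C\u061B\u061F])"
  if (no_punctuation) {
    # Remove mode: delete every punctuation character
    return(gsub(punct, "", text, perl = TRUE))
  }
  # Fix mode: surround each punctuation character with spaces,
  # then collapse repeated spaces and trim the ends
  spaced <- gsub(punct, " \\1 ", text, perl = TRUE)
  trimws(gsub(" +", " ", spaced))
}

x <- "\u062F\u0627\u0646\u0634\u06AF\u0627\u0647\u060C \u062A\u0647\u0631\u0627\u0646\u061F"
rf_punctuation_sketch(x, no_punctuation = TRUE)
rf_punctuation_sketch(x, no_punctuation = FALSE)
```

Separating punctuation with spaces lets a whitespace tokenizer treat each mark as its own token instead of attaching it to the neighbouring word.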

Transliterate Persian unicode characters into Latin characters

Description

Transliterates Persian Unicode characters into Latin characters using a transliteration system developed by Roozbeh Safshekan and Rich Nielsen.

Usage

Transliterate(texts)

Arguments

texts

A string in Persian characters to be transliterated into Latin characters.

Value

Transliterate returns a string in Latin characters.

Author(s)

Safshekan, Nielsen

Examples

# Create Persian string
x <- '\u062F\u0627\u0646\u0634\u06AF\u0627\u0647 \u062A\u0647\u0631\u0627\u0646'

# Transliterate Persian characters into Latin characters
Transliterate(x)

Persian texts

Description

Persian text from the University of Tehran website.

Usage

data("UniversityofTehran")

Format

Persian text data

Source

https://ut.ac.ir/fa/page/200

Examples

# Load data
data(UniversityofTehran)