Title: | Persian Stemmer for Text Analysis |
---|---|
Description: | Allows users to stem Persian texts for text analysis. |
Authors: | Roozbeh Safshekan and Rich Nielsen |
Maintainer: | Roozbeh Safshekan |
License: | GPL (>= 2) |
Version: | 1.0 |
Built: | 2024-10-31 22:21:02 UTC |
Source: | CRAN |
This package is a Persian stemmer for text analysis. Use the PerStem function.
Roozbeh Safshekan and Rich Nielsen
    # Load data
    data(UniversityofTehran)

    # Stem and transliterate the text
    PerStem(UniversityofTehran, NoEnglish = TRUE, NoNumbers = TRUE,
            NoStopwords = TRUE, NoPunctuation = TRUE, StemVerbs = TRUE,
            NoPreSuffix = TRUE, Context = TRUE, StemBrokenPlurals = TRUE,
            Transliteration = TRUE)
Stems Arabic broken plurals and returns singulars.
FixBrokenPlurals(texts)
texts |
A string with Arabic broken plurals that should be stemmed. |
FixBrokenPlurals returns a string with Arabic broken plurals stemmed.
Safshekan, Nielsen
    # Create string with Arabic broken plurals
    x <- '\u0645\u0635\u0627\u062F\u06CC\u0642 \u0648\u0632\u0631\u0627 \u062D\u062F\u0648\u062F'

    # Remove newline characters and fix half-spaces
    x <- RemNewlineHalfspace(x)

    # Remove all characters that are not Latin, Persian or punctuation,
    # and standardize Persian characters
    x <- RefineChars(x)

    # Stem Arabic broken plurals
    FixBrokenPlurals(x)
Stems verbs and returns past and present roots.
FixVerbs(texts, Context)
texts |
A Persian string in unicode. |
Context |
If TRUE, the function stems past-root+'he' only if other verbs with the same past-root exist in text. If FALSE, the function stems verbs without considering other words in text. |
FixVerbs returns a string with verbs stemmed.
Safshekan, Nielsen
    # Create string with Persian verbs
    x <- '\u0646\u0648\u0634\u062A\u0647 \u0634\u062F\u0647 \u0628\u0648\u062F\u0647 \u0627\u0633\u062A - \u0646\u0648\u0634\u062A\u0645 - \u062F\u0627\u0631\u06CC\u0645 \u0645\u06CC\u0631\u0648\u06CC\u0645 - \u062E\u0648\u0627\u0646\u062F\u0647 \u0645\u06CC\u0634\u0648\u0646\u062F - \u062E\u0648\u0627\u0647\u062F \u06AF\u0641\u062A - \u0628\u0631\u062F\u0647 \u0627\u0633\u062A - \u0645\u06CC\u06AF\u0648\u06CC\u06CC\u0645'

    # Remove newline characters and fix half-spaces
    x <- RemNewlineHalfspace(x)

    # Remove all characters that are not Latin, Persian or punctuation,
    # and standardize Persian characters
    x <- RefineChars(x)

    # Stem verbs with and without context
    y <- FixVerbs(x, Context = TRUE)
    z <- FixVerbs(x, Context = FALSE)

    # Remove the numeric signifiers that are used inside the PerStem function
    gsub("0|1|2|3|4|5", "", y)
    gsub("0|1|2|3|4|5", "", z)
Stems Persian texts for text analysis.
PerStem(dat, NoEnglish = TRUE, NoNumbers = TRUE, NoStopwords = TRUE, NoPunctuation = TRUE, StemVerbs = TRUE, NoPreSuffix = TRUE, Context = TRUE, StemBrokenPlurals = TRUE, Transliteration = TRUE)
dat |
The original data. |
NoEnglish |
Removes English characters. |
NoNumbers |
Removes numbers. |
NoStopwords |
Removes stopwords by using the default stopword list. |
NoPunctuation |
If TRUE the function removes punctuation. If FALSE, it fixes punctuation for text analysis. |
StemVerbs |
Performs stemming on verbs and returns past or present root of the verb. |
NoPreSuffix |
Performs stemming by removing prefixes and suffixes. |
Context |
If TRUE, the function performs stemming on a word only if its stem exists in text. If FALSE, the function performs stemming without considering other words in text. |
StemBrokenPlurals |
Performs stemming on Arabic broken plurals and returns singulars by using the default Arabic broken plurals list. |
Transliteration |
Transliterates Persian unicode characters into Latin characters using a transliteration system developed by Roozbeh Safshekan and Rich Nielsen. |
PerStem prepares texts in Persian for text analysis by stemming.
PerStem returns the stemmed Persian text.
Roozbeh Safshekan, Richard Nielsen
    # Load data
    data(UniversityofTehran)

    # Stem and transliterate the text
    PerStem(UniversityofTehran, NoEnglish = TRUE, NoNumbers = TRUE,
            NoStopwords = TRUE, NoPunctuation = TRUE, StemVerbs = TRUE,
            NoPreSuffix = TRUE, Context = TRUE, StemBrokenPlurals = TRUE,
            Transliteration = TRUE)
Removes all unicode characters except Latin, Persian or General Punctuation characters and standardizes Persian characters.
RefineChars(texts)
texts |
A string from which all characters that are not Latin, Persian or punctuation should be removed, or in which Persian characters should be standardized. |
RefineChars returns a string with only Latin, standardized Persian or general punctuation characters.
Safshekan, Nielsen
    # Create string with Latin, Persian, Japanese, non-standardized Persian
    # and punctuation characters
    x <- '\u062F\u0627\u0646\u0634\u06AF\u0627\u0647\u064A \u060C \u0641\u06CC\u0632\u06CC\u0643 university \u65E5\u672C \u0664\u0665\u0666'

    # Remove newline characters and fix half-spaces
    x <- RemNewlineHalfspace(x)

    # Remove all characters that are not Latin, Persian or punctuation,
    # and standardize Persian characters
    RefineChars(x)
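As a rough illustration of what "standardizing Persian characters" can involve, the sketch below maps two Arabic-script letter variants that appear in the example string (Arabic yeh U+064A and Arabic kaf U+0643) to the code points normally used in Persian text. The helper `toy_standardize` is hypothetical and written in base R; it is not the package's implementation, and RefineChars may normalize more characters than these two.

```r
# Hypothetical helper, not part of the package: replace Arabic yeh and kaf
# with their Persian counterparts. Only these two mappings are assumed here.
toy_standardize <- function(text) {
  text <- gsub("\u064A", "\u06CC", text, fixed = TRUE)  # Arabic yeh -> Persian yeh
  text <- gsub("\u0643", "\u06A9", text, fixed = TRUE)  # Arabic kaf -> Persian kaf
  text
}

x <- "\u062F\u0627\u0646\u0634\u06AF\u0627\u0647\u064A \u0641\u06CC\u0632\u06CC\u0643"
y <- toy_standardize(x)
```

After the call, `y` contains only the Persian forms of the two letters, so identical words typed on Arabic and Persian keyboards compare as equal.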
Removes new line characters and fixes half-spaces in a string.
RemNewlineHalfspace(texts)
texts |
A string whose newline characters should be removed and whose half-spaces should be fixed. |
RemNewlineHalfspace returns a string with newline characters removed and half-spaces fixed.
Safshekan, Nielsen
    # Create a Persian string containing a half-space (zero-width non-joiner, \u200C)
    x <- '\u062F\u0627\u0646\u0634\u06AF\u0627\u0647\u200C\u0647\u0627\u06CC \u062A\u0647\u0631\u0627\u0646'

    # Remove newline characters (e.g. \n, \r, \t, \f, \v) and fix half-spaces
    RemNewlineHalfspace(x)
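The kind of cleanup described here can be sketched in base R. The helper `toy_clean_whitespace` below is hypothetical. It replaces the control characters listed above with a single space and simply drops the zero-width non-joiner; dropping it is an assumption made for the sketch, since the documentation does not say how RemNewlineHalfspace "fixes" half-spaces.

```r
# Hypothetical helper, not part of the package.
toy_clean_whitespace <- function(text) {
  # Turn runs of newline-like control characters into one space
  text <- gsub("[\n\r\t\f\v]+", " ", text, perl = TRUE)
  # Assumption: treat the half-space (U+200C) by removing it
  text <- gsub("\u200C", "", text, fixed = TRUE)
  # Collapse repeated spaces and trim the ends
  trimws(gsub(" +", " ", text))
}

a <- toy_clean_whitespace("first line\n\tsecond line")
b <- toy_clean_whitespace("\u062F\u0627\u0646\u0634\u06AF\u0627\u0647\u200C\u0647\u0627\u06CC")
```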
Removes English characters from a string.
RemoveEnglish(texts)
texts |
A string from which English characters should be removed. |
RemoveEnglish returns a string with English characters removed.
Safshekan, Nielsen
    # Create string with English characters
    x <- '\u062F\u0627\u0646\u0634\u06AF\u0627\u0647 University'

    # Remove English characters
    RemoveEnglish(x)
Removes numerals from a string.
RemoveNumbers(texts)
texts |
A string from which numerals should be removed. |
RemoveNumbers returns a string with numerals removed.
Safshekan, Nielsen
    # Create string with Persian characters and a number
    x <- '\u0633\u0627\u0644 \u06F1\u06F3\u06F9\u06F8'

    # Remove numbers
    RemoveNumbers(x)
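For readers who want to see the idea without the package, numeral removal can be approximated with a single regular expression. The helper `toy_remove_digits` is hypothetical; the choice of digit ranges (ASCII, Persian U+06F0 to U+06F9, Arabic-Indic U+0660 to U+0669) and the whitespace cleanup are assumptions of this sketch, not a description of RemoveNumbers.

```r
# Hypothetical helper, not part of the package.
toy_remove_digits <- function(text) {
  # Delete ASCII, Persian and Arabic-Indic digits
  no_digits <- gsub("[0-9\u06F0-\u06F9\u0660-\u0669]", "", text, perl = TRUE)
  # Collapse the spaces left behind and trim the ends
  trimws(gsub(" +", " ", no_digits))
}

x <- "\u0633\u0627\u0644 \u06F1\u06F3\u06F9\u06F8 2019"
y <- toy_remove_digits(x)
```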
Removes Persian prefixes and suffixes from a unicode string using the default list of Persian prefixes and suffixes.
RemovePreSuffix(texts, Context)
texts |
A Persian string in Unicode. |
Context |
If TRUE, the function removes prefixes and suffixes of a word only if its stem exists in text. If FALSE, the function removes prefixes and suffixes without considering other words in text. |
RemovePreSuffix returns a string with Persian prefixes and suffixes removed.
Safshekan, Nielsen
    # Create string with Persian characters
    x <- '\u0627\u0628\u0631\u0642\u062F\u0631\u062A\u0647\u0627\u06CC\u06CC \u06A9\u062A\u0627\u0628\u0647\u0627\u06CC\u0645 \u06A9\u062A\u0627\u0628'

    # Remove newline characters and fix half-spaces
    x <- RemNewlineHalfspace(x)

    # Remove all characters that are not Latin, Persian or punctuation,
    # and standardize Persian characters
    x <- RefineChars(x)

    # Remove prefixes and suffixes with and without context
    RemovePreSuffix(x, Context = TRUE)
    RemovePreSuffix(x, Context = FALSE)
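The effect of the Context argument can be illustrated with a toy suffix stripper in base R. The helper `toy_strip_suffix` is hypothetical and handles a single suffix passed by the caller (here the plural ending \u0647\u0627); the package uses its own prefix and suffix lists and rules, which this sketch does not reproduce. With `context = TRUE` a word is stemmed only when the resulting stem also occurs as a word in the same text.

```r
# Hypothetical helper, not part of the package.
toy_strip_suffix <- function(text, suffix, context = TRUE) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  # Words that end in the suffix and are longer than it
  has_suffix <- endsWith(words, suffix) & nchar(words) > nchar(suffix)
  stems <- ifelse(has_suffix,
                  substr(words, 1, nchar(words) - nchar(suffix)),
                  words)
  if (context) {
    # Undo the stemming when the stem is not attested elsewhere in the text
    unattested <- has_suffix & !(stems %in% words)
    stems[unattested] <- words[unattested]
  }
  paste(stems, collapse = " ")
}

x <- "\u06A9\u062A\u0627\u0628\u0647\u0627 \u06A9\u062A\u0627\u0628 \u062F\u0631\u0647\u0627"
with_context    <- toy_strip_suffix(x, "\u0647\u0627", context = TRUE)
without_context <- toy_strip_suffix(x, "\u0647\u0627", context = FALSE)
```

In this input the first word loses its suffix in both modes because its stem is the second word, while the third word is stemmed only when `context = FALSE`. The context check trades recall for fewer false stems.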
Defines a list of Persian stopwords and removes them from a string.
RemoveStopwords(texts)
texts |
A string from which Persian stopwords should be removed. |
RemoveStopwords returns a string with Persian stopwords removed.
Safshekan, Nielsen
    # Create Persian string with stopwords
    x <- '\u0627\u0632 \u062F\u0627\u0646\u0634\u06AF\u0627\u0647 \u0622\u0645\u062F'

    # Remove newline characters and fix half-spaces
    x <- RemNewlineHalfspace(x)

    # Remove all characters that are not Latin, Persian or punctuation,
    # and standardize Persian characters
    x <- RefineChars(x)

    # Remove stopwords
    RemoveStopwords(x)
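Stopword removal by list lookup can be sketched as follows. The helper `toy_remove_stopwords` and its three-word list (the prepositions \u0627\u0632, \u0628\u0647 and \u062F\u0631) are illustrative assumptions; the package ships its own default stopword list, which is not shown in this documentation.

```r
# Hypothetical helper and stopword list, not taken from the package.
toy_stopwords <- c("\u0627\u0632", "\u0628\u0647", "\u062F\u0631")

toy_remove_stopwords <- function(text, stopwords = toy_stopwords) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  # Keep only the words that are not in the stopword list
  paste(words[!(words %in% stopwords)], collapse = " ")
}

x <- "\u0627\u0632 \u062F\u0627\u0646\u0634\u06AF\u0627\u0647 \u0622\u0645\u062F"
y <- toy_remove_stopwords(x)
```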
Transliterates Latin characters into Persian unicode characters using a transliteration system developed by Roozbeh Safshekan and Rich Nielsen.
ReverseTransliterate(texts)
texts |
A string in Latin characters to be transliterated into Persian characters. |
ReverseTransliterate returns a string in Persian characters.
Safshekan, Nielsen
    # Create Latin string
    x <- 'danWGah thran'

    # Convert Latin characters into Persian Unicode characters
    ReverseTransliterate(x)
Removes punctuation characters or inserts spaces before and after them so that they can be used in text analysis as separate units.
RFPunctuation(texts, NoPunctuation)
texts |
A string with punctuation which should be removed or fixed. |
NoPunctuation |
If TRUE, the function removes punctuation. If FALSE, the function inserts spaces before and after punctuation. |
RFPunctuation returns a string with punctuation removed or fixed for text analysis.
Safshekan, Nielsen
    # Create string with Persian characters and punctuation
    x <- '\u062F\u0627\u0646\u0634\u06AF\u0627\u0647\u060C \u062A\u0647\u0631\u0627\u0646\u061F'

    # Remove punctuation
    RFPunctuation(x, NoPunctuation = TRUE)

    # Fix punctuation
    RFPunctuation(x, NoPunctuation = FALSE)
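The two behaviours selected by NoPunctuation can be imitated in base R. The helper `toy_punctuation` is hypothetical. The set of punctuation marks it recognizes (a few ASCII marks plus the Arabic comma U+060C, semicolon U+061B and question mark U+061F) and its whitespace normalization are assumptions of this sketch, so its output may differ from that of RFPunctuation.

```r
# Hypothetical helper, not part of the package.
toy_punctuation <- function(text, remove = TRUE) {
  pattern <- "([.,;:!?\u060C\u061B\u061F])"
  # Either drop each mark or surround it with spaces so it becomes its own token
  replacement <- if (remove) " " else " \\1 "
  spaced <- gsub(pattern, replacement, text, perl = TRUE)
  # Collapse repeated spaces and trim the ends
  trimws(gsub(" +", " ", spaced))
}

x <- "\u062F\u0627\u0646\u0634\u06AF\u0627\u0647\u060C \u062A\u0647\u0631\u0627\u0646\u061F"
removed <- toy_punctuation(x, remove = TRUE)
spaced  <- toy_punctuation(x, remove = FALSE)
```

Separating punctuation with spaces lets a whitespace tokenizer treat each mark as a unit instead of leaving it attached to the preceding word.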
Transliterates Persian unicode characters into Latin characters using a transliteration system developed by Roozbeh Safshekan and Rich Nielsen.
Transliterate(texts)
texts |
A string in Persian characters to be transliterated into Latin characters. |
Transliterate returns a string in Latin characters.
Safshekan, Nielsen
    # Create Persian string
    x <- '\u062F\u0627\u0646\u0634\u06AF\u0627\u0647 \u062A\u0647\u0631\u0627\u0646'

    # Transliterate Persian into Latin characters
    Transliterate(x)
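Character-by-character transliteration of this kind can be sketched as a lookup table. The helper `toy_transliterate` and its eight-letter mapping are hypothetical: the mapping was chosen only to be consistent with the Latin string 'danWGah thran' used in the ReverseTransliterate example, and it is not the transliteration table shipped with the package.

```r
# Hypothetical mapping, not the package's table: a few Persian letters and
# the Latin characters assumed for them in this sketch.
toy_persian <- c("\u062F", "\u0627", "\u0646", "\u0634",
                 "\u06AF", "\u0647", "\u062A", "\u0631")
toy_latin   <- c("d", "a", "n", "W", "G", "h", "t", "r")

toy_transliterate <- function(text) {
  chars <- strsplit(text, "")[[1]]
  idx <- match(chars, toy_persian)
  # Characters without a mapping (such as spaces) are passed through unchanged
  paste(ifelse(is.na(idx), chars, toy_latin[idx]), collapse = "")
}

x <- "\u062F\u0627\u0646\u0634\u06AF\u0627\u0647 \u062A\u0647\u0631\u0627\u0646"
y <- toy_transliterate(x)
```

A one-to-one table like this is reversible, which is what makes a function pair such as Transliterate and ReverseTransliterate possible.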
Persian text from the University of Tehran website
data("UniversityofTehran")
Persian text data
https://ut.ac.ir/fa/page/200
    # Load data
    data(UniversityofTehran)