---
title: "Using clean_strings"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using clean_strings}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(fedmatch)
```

# Using clean_strings

`clean_strings` is the way to prepare strings for name matching, either on its own or within `tier_match` (see the `Using-tier-match` vignette). It has several useful options that allow for many different cleaning routines. Here's the example vector of strings we'll be using:

```{r}
name_vec <- corp_data1[, Company]
name_vec
```

First, we can use the basic string cleaning defaults:

```{r}
clean_strings(name_vec)
```

Without any additional arguments, `clean_strings` does the following:

- Makes everything lowercase
- Replaces the special characters &, @, %, and $ with their word equivalents
- Removes all other special characters (e.g. commas, periods)
- Converts tabs to spaces
- Removes extra spaces

Then, we have a few different options we can use.

## sp_char_words

`sp_char_words` is a data.frame with 2 columns: the first column holds the symbols to replace, and the second holds their replacements. `fedmatch` has a built-in set of symbols:

```{r}
print(sp_char_words)
```

But you can use any data.frame you'd like, to make whatever replacements you'd like:

```{r}
new_sp_char <- data.table::data.table(
  character = c("o"),
  replacement = c("apple")
)
clean_strings(name_vec, sp_char_words = new_sp_char)
```

## common_words

`common_words` is similar, but it respects word boundaries (so that, for example, replacing 'corp' with 'corporation' doesn't also alter a 'corp' that appears inside a longer word). `fedmatch` has a built-in set of 54 words and their replacements:

```{r}
print(corporate_words[1:5])
```

But, you can use whatever words you'd like:

```{r}
clean_strings(
  name_vec,
  common_words = data.table::data.table(
    word = c("general", "almart"),
    replacement = c("bananas", "oranges")
  )
)
```

(bananas motors sounds like a lovely place to work). Note that the 'almart' in 'walmart' didn't get replaced, because `common_words` respects word boundaries.

You can also use a related function, `word_frequency`, to look for the most common strings in your data:

```{r}
word_frequency(sample(c("hi", "Hello", "bye "), 1e4, replace = TRUE))
```

## Remove characters and words

`remove_words` is a boolean: if `TRUE`, the words in `common_words` are simply removed rather than replaced. Similarly, `remove_char` takes a vector of characters to remove outright rather than replace.

```{r}
clean_strings(name_vec, sp_char_words = new_sp_char, remove_char = c("a", "c"))
clean_strings(
  name_vec,
  common_words = data.table::data.table(
    word = c("general", "company"),
    replacement = c("bananas", "oranges")
  ),
  remove_words = TRUE
)
```

## stem

`stem` is a boolean that lets you stem words, using `SnowballC::wordStem`. 'Stemming' a word means removing common suffixes:

```{r}
clean_strings(c("call", "calling", "called"), stem = TRUE)
```

See the documentation for `SnowballC::wordStem` for details.
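
## Combining options

All of the options above are arguments to the same function, so they can be combined in a single call. Here's a minimal sketch (reusing the custom `common_words` table from the previous section; see `?clean_strings` for the full argument list) that replaces special characters with the built-in table, drops two common words, and stems what remains:

```{r}
clean_strings(
  name_vec,
  # the built-in special-character table (this is the default,
  # passed explicitly here for clarity)
  sp_char_words = sp_char_words,
  # same table as in the previous section; with remove_words = TRUE,
  # the words are dropped rather than replaced
  common_words = data.table::data.table(
    word = c("general", "company"),
    replacement = c("bananas", "oranges")
  ),
  remove_words = TRUE,
  stem = TRUE
)
```

Which combination works best depends on your data; `word_frequency` is a good way to find the words worth adding to your own `common_words` table.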