Using clean_strings

library(fedmatch)

clean_strings prepares strings for name matching, either on its own or within tier_match (see the Using-tier-match vignette). It has several options that control how the strings are cleaned.

Here’s the example vector of company names we’ll be using:

name_vec <- corp_data1[, Company]
name_vec
#>  [1] "Walmart"            "Bershire Hataway"   "Apple"             
#>  [4] "Exxon Mobile"       "McKesson "          "UnitedHealth Group"
#>  [7] "CVS Health"         "General Motors"     "AT&T"              
#> [10] "Ford Motor Company"

First, we can use the basic string cleaning defaults:

clean_strings(name_vec)
#>  [1] "walmart"            "bershire hataway"   "apple"             
#>  [4] "exxon mobile"       "mckesson"           "unitedhealth group"
#>  [7] "cvs health"         "general motors"     "atandt"            
#> [10] "ford motor company"

Without any additional arguments, clean_strings does the following:

  • Make everything lowercase
  • Replace the special characters &, @, %, $ with their word equivalents
  • Remove all other special characters (e.g. commas, periods)
  • Convert tabs to spaces
  • Remove extra spaces
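
As a rough illustration of those steps, here is a base R sketch. It is not fedmatch’s actual implementation: the function name clean_sketch is made up for this example, and the order of operations inside clean_strings is an assumption.

```r
# Hypothetical helper that approximates the default cleaning steps in base R.
clean_sketch <- function(x) {
  x <- tolower(x)                              # make everything lowercase
  x <- gsub("&", "and", x, fixed = TRUE)       # special characters -> words
  x <- gsub("@", "at", x, fixed = TRUE)
  x <- gsub("%", "percent", x, fixed = TRUE)
  x <- gsub("$", "dollar", x, fixed = TRUE)
  x <- gsub("\t", " ", x, fixed = TRUE)        # tabs -> spaces
  x <- gsub("[^a-z0-9 ]", " ", x)              # other special characters -> space
  x <- gsub(" +", " ", x)                      # collapse runs of spaces
  trimws(x)                                    # drop leading/trailing spaces
}

cleaned <- clean_sketch(c("AT&T", "McKesson ", "Ford  Motor, Co."))
cleaned  # "atandt", "mckesson", "ford motor co"
```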

Then, we have a few different options we can use.

sp_char_words

sp_char_words is a data.frame with 2 columns: the first column holds the symbols to replace, and the second holds their replacements. fedmatch has a built-in set of symbols:

print(sp_char_words)
#>    character replacement
#>       <char>      <char>
#> 1:       \\&         and
#> 2:       \\$      dollar
#> 3:       \\%     percent
#> 4:       \\@          at

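You can also supply any data.frame you’d like to make other replacements. A custom table takes the place of the built-in one, so in the output below the & in ‘AT&T’ becomes a space instead of ‘and’: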

new_sp_char <- data.table::data.table(character = c("o"), replacement = c("apple"))
clean_strings(name_vec, sp_char_words = new_sp_char)
#>  [1] "walmart"                            "bershire hataway"                  
#>  [3] "apple"                              "exxapplen mapplebile"              
#>  [5] "mckessapplen"                       "unitedhealth grappleup"            
#>  [7] "cvs health"                         "general mappletapplers"            
#>  [9] "at t"                               "fapplerd mappletappler capplempany"

common_words

common_words is similar, but it respects word boundaries: ‘corp’ is replaced with ‘corporation’ only where it appears as a whole word, not where it is part of a longer word. fedmatch has a built-in set of 54 words and their replacements:

print(corporate_words[1:5])
#>      abbr     long.names
#>    <char>         <char>
#> 1:  accep     acceptance
#> 2:   amer        america
#> 3:  assoc     associates
#> 4:     cl company listed
#> 5:  cmnty      community

But, you can use whatever words you’d like:

clean_strings(name_vec, common_words = data.table::data.table(word = c("general", "almart"),
                                                              replacement = c("bananas", "oranges")))
#>  [1] "walmart"            "bershire hataway"   "apple"             
#>  [4] "exxon mobile"       "mckesson"           "unitedhealth group"
#>  [7] "cvs health"         "bananas motors"     "atandt"            
#> [10] "ford motor company"

(‘bananas motors’ sounds like a lovely place to work.) Note that the ‘almart’ in ‘walmart’ didn’t get replaced, because common_words respects word boundaries.
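
The difference between the two kinds of replacement can be sketched with base R regular expressions. This assumes nothing about how fedmatch implements them internally; the \\b word-boundary pattern is just one common way to get the behavior shown above.

```r
x <- c("walmart", "general motors")

# Plain substring replacement (like the sp_char_words example): matches anywhere.
substring_result <- gsub("almart", "oranges", x, fixed = TRUE)

# Whole-word replacement (like common_words): \\b marks a word boundary.
whole_word_result <- gsub("\\balmart\\b", "oranges", x, perl = TRUE)
whole_word_result <- gsub("\\bgeneral\\b", "bananas", whole_word_result, perl = TRUE)

substring_result   # "woranges", "general motors"
whole_word_result  # "walmart", "bananas motors"
```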

You can also use a related function, word_frequency, to find the most common words in your data:

word_frequency(sample(c("hi", "Hello", "bye    "), 1e4, replace = TRUE))
#>      Word Count
#>    <char> <int>
#> 1:  hello  3376
#> 2:    bye  3323
#> 3:     hi  3301
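
For intuition, a similar count can be put together in base R. This is only a sketch of the idea (lowercase, split on spaces, tabulate), not how word_frequency is implemented, and the input vector is made up:

```r
strings <- c("hi", "Hello", "bye    ", "hi there")

# Lowercase, trim, and split each string into words.
words <- unlist(strsplit(trimws(tolower(strings)), " +"))

# Tabulate the words and sort from most to least frequent.
counts <- sort(table(words), decreasing = TRUE)
counts[["hi"]]  # 2
```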

Remove characters and words

remove_words is a boolean that removes the words in ‘common_words’ rather than replacing them. remove_char is a character vector that specifies a set of characters to remove.

clean_strings(name_vec, sp_char_words = new_sp_char, remove_char = c("a", "c"))
#>  [1] "w lm rt"                           "bershire h t w y"                 
#>  [3] "pple"                              "exxapplen mapplebile"             
#>  [5] "m kessapplen"                      "unitedhe lth grappleup"           
#>  [7] "vs he lth"                         "gener l mappletapplers"           
#>  [9] "t t"                               "fapplerd mappletappler applemp ny"
clean_strings(name_vec, common_words = data.table::data.table(word = c("general", "company"),
                                                              replacement = c("bananas", "oranges")),
              remove_words = TRUE)
#>  [1] "walmart"            "bershire hataway"   "apple"             
#>  [4] "exxon mobile"       "mckesson"           "unitedhealth group"
#>  [7] "cvs health"         "motors"             "atandt"            
#> [10] "ford motor"
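
The remove_words result above can be approximated in base R by replacing each whole word with nothing and then tidying the spaces. This is a sketch that assumes matching on word boundaries, not a description of fedmatch’s internals; the variable names are made up for the example.

```r
x <- c("general motors", "ford motor company")
drop_words <- c("general", "company")

# Build one pattern that matches any of the words as a whole word.
pattern <- paste0("\\b(", paste(drop_words, collapse = "|"), ")\\b")

# Delete the matches, then collapse and trim the leftover spaces.
removed <- gsub(pattern, "", x, perl = TRUE)
removed <- trimws(gsub(" +", " ", removed))
removed  # "motors", "ford motor"
```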

stem

stem is a boolean that lets you stem words using SnowballC::wordStem. Stemming a word means removing common suffixes:

clean_strings(c("call", "calling", "called"), stem = TRUE)
#> [1] "call" "call" "call"

See the documentation in SnowballC::wordStem for details.