Title: | Legacy Scottish Post Office Directories Cleaner |
---|---|
Description: | Attempts to clean optical character recognition (OCR) errors in legacy Scottish Post Office Directories. Further attempts to match records from trades and general directories. |
Authors: | Olivier Bautheac [aut, cre], University of Strathclyde [cph, fnd] |
Maintainer: | Olivier Bautheac <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.1.2 |
Built: | 2024-12-07 06:43:10 UTC |
Source: | CRAN |
Attempts to separate attached words in provided address entry(/ies).
clean_address_attached_words(addresses)
addresses |
A character string vector of address(es). |
A character string vector of address(es) cleaned of attached words.
Attempts to clean body of address entry(/ies) provided.
clean_address_body(addresses)
addresses |
A character string vector of address(es). |
A character string vector of address(es) with cleaned bodies.
Attempts to clean ends in provided address entry(/ies).
clean_address_ends(addresses)
addresses |
A character string vector of address(es). |
A character string vector of address(es) with clean ends.
Attempts to standardise "Mac" prefix in provided address entry(/ies).
clean_address_mac(addresses)
addresses |
A character string vector of address(es). |
A character string vector of addresses with clean "Mac" prefix(es).
Attempts to clean place names in provided address entry(/ies).
clean_address_names(addresses)
addresses |
A character string vector of address(es). |
A character string vector of address(es) with clean name(s).
Attempts to clean number of address entry(/ies) provided.
clean_address_number(addresses)
addresses |
A character string vector of address(es). |
A character string vector of address(es) with cleaned numbers.
Carries out miscellaneous cleaning operations in provided address entry(/ies).
clean_address_others(addresses)
addresses |
A character string vector of address(es). |
A character string vector of clean address(es).
Attempts to clean places in provided address entry(/ies): street, road, place, quay, etc.
clean_address_places(addresses)
addresses |
A character string vector of address(es). |
A character string vector of address(es) with clean place name(s).
Attempts to standardise possessives in provided address entry(/ies).
clean_address_possessives(addresses)
addresses |
A character string vector of address(es). |
A character string vector of address(es) with clean possessive(s).
Performs post-cleaning operations on provided address entry(/ies).
clean_address_post_clean(addresses)
addresses |
A character string vector of address(es). |
A character string vector with address(es) cleaner than those provided in addresses.
Performs pre-cleaning operations on provided address entry(/ies).
clean_address_pre_clean(addresses)
addresses |
A character string vector of address(es). |
A character string vector with address(es) cleaner than those provided in addresses.
Attempts to clean "Saint" prefix in provided address entry(/ies).
clean_address_saints(addresses)
addresses |
A character string vector of address(es). |
A character string vector of address(es) with clean "Saint" prefix(es).
Attempts to clean unwanted suffixes in provided address entry(/ies).
clean_address_suffixes(addresses)
addresses |
A character string vector of address(es). |
A character string vector of address(es) with unwanted suffix(es) removed.
Attempts to clean worksites in provided address entry(/ies).
clean_address_worksites(addresses)
addresses |
A character string vector of address(es). |
A character string vector of address(es) with clean worksite name(s).
Attempts to clean provided forename.
clean_forename(names)
names |
A character string vector of forename(s). |
A character string vector of cleaned forename(s).
Single-letter forenames are standardised to the most frequent forename in the dataset starting with that letter, e.g. A. -> Alexander, B. -> Bernard, C. -> Colin, D. -> David, etc.
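The standardisation of single-letter forenames described above can be illustrated with a minimal base R sketch. The helper `expand_initials()` is hypothetical and is not part of the package; it only assumes that initials look like a capital letter optionally followed by a full stop, and it derives the most frequent forename per letter from the input vector itself.

```r
# Hypothetical helper, not the package's implementation: expand single-letter
# forenames to the most frequent full forename starting with that letter.
expand_initials <- function(forenames) {
  # Assumption: an initial is one capital letter, optionally followed by ".".
  is_initial <- grepl("^[A-Z]\\.?$", forenames)
  full <- forenames[!is_initial]

  # Most frequent full forename for each starting letter.
  lookup <- vapply(
    split(full, substr(full, 1L, 1L)),
    function(x) names(which.max(table(x))),
    character(1L)
  )

  letters_found <- substr(forenames[is_initial], 1L, 1L)
  replacement <- unname(lookup[letters_found])

  # Initials with no candidate in the data are left untouched.
  forenames[is_initial] <- ifelse(
    is.na(replacement), forenames[is_initial], replacement
  )
  forenames
}

forenames <- c("Alexander", "Alexander", "Andrew", "A.", "B.", "David", "D")
expanded <- expand_initials(forenames)
expanded
```

In this toy input "A." and "D" are expanded, while "B." stays as it is because no full forename starting with "B" is available.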
Attempts to standardise punctuation in provided forename entry(/ies).
clean_forename_punctuation(forenames)
forenames |
A character string vector of forename(s). |
A character string vector of forename(s) with clean punctuation.
Attempts to separate double-barrelled forename(s) in provided forename entry(/ies).
clean_forename_separate_words(forenames)
forenames |
A character string vector of forename(s). |
A character string vector of forename(s) with clean double-barrelled forename(s).
Attempts to clean spelling in provided forename entry(/ies).
clean_forename_spelling(forenames)
forenames |
A character string vector of forename(s). |
A character string vector of forename(s) with clean forename(s) spelling.
Attempts to standardise "Mac" prefix in provided name entry(/ies).
clean_mac(names)
names |
A character string vector of name(s). |
A character string vector of name(s) with clean "Mac" prefix(es).
Attempts to clean ends in provided name entry(/ies).
clean_name_ends(names)
names |
A character string vector of names |
A character string vector of names with clean ends.
Attempts to clean provided occupation.
clean_occupation(occupations)
occupations |
A character string vector of occupation(s). |
A character string vector of cleaned occupation(s).
Attempts to clean entry(/ies) of unwanted information displayed in brackets.
clean_parentheses(x)
x |
A character string vector. |
A character string vector with content within brackets removed.
Attempts to clean entry(/ies) of unwanted special character(s).
clean_specials(x)
x |
A character string vector. |
A character string vector with special character(s) removed.
Attempts to clean ends of strings provided.
clean_string_ends(strings)
strings |
A character string vector. |
A character string vector with clean entry(/ies) ends.
Attempts to clean provided surname.
clean_surname(names)
names |
A character string vector of surname(s). |
A character string vector of cleaned surname(s).
Names with multiple spellings are standardised to the spelling of the capital letter header in the general directory, e.g. Abercrombie, Abercromby -> Abercromby; Bayne, Baynes -> Bayne; Beattie, Beatty -> Beatty; etc.
Attempts to standardise punctuation in provided surname entry(/ies).
clean_surname_punctuation(surnames)
surnames |
A character string vector of surname(s). |
A character string vector of surname(s) with clean punctuation.
Attempts to clean spelling in provided surname entry(/ies).
clean_surname_spelling(surnames)
surnames |
A character string vector of surnames. |
A character string vector of surnames with clean spelling.
Attempts to clean titles attached to names provided: Captain, Major, etc.
clean_title(names)
names |
A character string vector of name(s). |
A character string vector of name(s) with cleaned title(s).
Identifies the type of the house address column provided: number or body.
combine_get_address_house_type(column)
column |
A character string: ends in "house.number" or "house.body". |
A character string: "number" or "body".
Provided with two equal length vectors, returns TRUE for indexes where both entries are "NA" and FALSE otherwise.
combine_has_match_failed(number, body)
number |
A vector of address number(s). Integer or character string. |
body |
A character string vector of address body(/ies). |
A boolean vector: TRUE for indexes where both number and body are "NA", FALSE otherwise.
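The check described above amounts to an element-wise test on two vectors of equal length. The sketch below uses a hypothetical helper, `has_match_failed()`, and assumes that "NA" refers to R's missing value `NA` rather than the literal string; it is an illustration, not the package's code. It also shows one possible use of the resulting boolean vector, namely flagging the unmatched rows with placeholder labels.

```r
# Hypothetical helper mirroring the documented behaviour; not the package's code.
# Assumption: "NA" in the documentation means R's missing value NA.
has_match_failed <- function(number, body) {
  stopifnot(length(number) == length(body))
  is.na(number) & is.na(body)
}

directory <- data.frame(
  surname = c("Abbott", "Abercromby", "Blair"),
  address.house.number = c("136", NA, NA),
  address.house.body = c("Queen Square", "Anderston Quay", NA),
  stringsAsFactors = FALSE
)

failed <- has_match_failed(
  directory$address.house.number, directory$address.house.body
)
failed

# One possible use: flag the failed rows with placeholder labels.
directory$address.house.number[failed] <- ""
directory$address.house.body[failed] <- "Failed to match with general directory"
directory
```

Only the third row is flagged here, since the second row still has an address body.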
Labels failed matches as such in the provided Scottish post office directory data.frame.
combine_label_failed_matches(directory)
directory |
A Scottish post office directory in the form of a data.frame
or other object that inherits from the data.frame class such as a
|
A data.frame of the same class as the one provided in directory. Columns include address.house.number, address.house.body. For entries for which both address.house.number and address.house.body are NA, address.house.number and address.house.body are labelled as "" and "Failed to match with general directory" respectively.
Labels failed matches as such.
combine_label_if_match_failed(type = c("number", "body"), ...)
type |
A Character string, one of: "number" or "body". Type of column to label. |
... |
Further arguments to be passed down to
|
A character string vector: address(es) "number" or "body" as specified in type if match succeeded, "" (type = "number") or "Failed to match with general directory" (type = "body") otherwise.
Creates a 'match.string' column in the provided Scottish post office directory data.frame, composed of each entry's full name and trade address pasted together. Missing trade address entry(/ies) are replaced with a randomly generated string.
combine_make_match_string(directory)
directory |
A Scottish post office directory in the form of a data.frame
or other object that inherits from the data.frame class such as a
|
A data.frame of the same class as the one provided in directory; columns include at least forename, surname, address.trade.number, address.trade.body, match.string.
The purpose of the 'match.string' column is to facilitate the matching of the general to trades directory down the line. It makes it possible to calculate a string distance metric between each pair of entries and match those falling below a specified threshold.
See combine_match_general_to_trades for the matching of the general to trades directory.
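The idea of matching records through a string distance on 'match.string' can be sketched in base R. The example below is an illustration under stated assumptions rather than the package's method: the layout of the match strings is made up, the threshold `max_dist` is a hypothetical name, and `utils::adist()` (generalised Levenshtein distance) is swapped in for whichever metric the package actually uses.

```r
# Toy data; the layout of match.string is an assumption made for illustration.
trades <- data.frame(
  match.string = c(
    "William Abbott - 18, 20, London Road",
    "John Hugh Blair - 280, High Street"
  ),
  stringsAsFactors = FALSE
)
general <- data.frame(
  match.string = c(
    "William Abbot - 18, 20, London Road",
    "Alexander Abercromby - Dixon Place"
  ),
  address.house.body = c("Queen Square", "Anderston Quay"),
  stringsAsFactors = FALSE
)

max_dist <- 5L  # hypothetical threshold

# Distance matrix: one row per trades entry, one column per general entry.
distances <- utils::adist(trades$match.string, general$match.string)

# Closest general entry for each trades entry, and the distance to it.
best <- apply(distances, 1L, which.min)
best_dist <- distances[cbind(seq_along(best), best)]

# Keep the house address only when the closest entry is within the threshold.
trades$address.house.body <- ifelse(
  best_dist <= max_dist, general$address.house.body[best], NA_character_
)
trades
```

The first trades entry differs from its general counterpart by a single character and is matched; the second has no close counterpart and is left as missing.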
Attempts to complement Scottish post office trades directory data.frame with house address information from the Scottish post office general directory data.frame provided by matching records from the two datasets using the distance metric specified.
combine_match_general_to_trades( trades_directory, general_directory, progress = TRUE, verbose = FALSE, distance = TRUE, matches = TRUE, ... )
trades_directory |
A Scottish post office trades directory in the form
of a data.frame or other object that inherits from the data.frame class
such as a |
general_directory |
A Scottish post office general directory in the form
of a data.frame or other object that inherits from the data.frame class
such as a |
progress |
Whether progress should be shown ( |
verbose |
Whether the function should be executed silently ( |
distance |
Whether ( |
matches |
Whether ( |
... |
Further arguments to be passed down to
|
A tibble; columns include at least surname, forename, address.trade.number, address.trade.body, address.house.number, address.house.body.
trades_directory <- tibble::tibble(
  page = rep("71", 3L),
  rank = c("135", "326", "586"),
  surname = c("Abbott", "Abercromby", "Blair"),
  forename = c("William", "Alexander", "John Hugh"),
  occupation = c("Wine and spirit merchant", "Baker", "Victualler"),
  type = rep("OWN ACCOUNT", 3L),
  address.trade.number = c("18, 20", "12", "280"),
  address.trade.body = c("London Road", "Dixon Place", "High Street")
)
general_directory <- tibble::tibble(
  page = rep("71", 2L),
  surname = c("Abbott", "Abercromby"),
  forename = c("William", "Alexander"),
  occupation = c("Wine and spirit merchant", "Baker"),
  address.trade.number = c("18, 20", ""),
  address.house.number = c("136", "29"),
  address.trade.body = c("London Road", "Dixon Place"),
  address.house.body = c("Queen Square", "Anderston Quay")
)
combine_match_general_to_trades(
  trades_directory, general_directory,
  progress = TRUE, verbose = FALSE, distance = TRUE,
  method = "osa", max_dist = 5
)
Attempts to complement Scottish post office trades directory data.frame with house address information from the Scottish post office general directory data.frame provided by matching records from the two datasets using the distance metric specified.
combine_match_general_to_trades_plain( trades_directory, general_directory, verbose, matches, ... )
trades_directory |
A Scottish post office trades directory in the form
of a data.frame or other object that inherits from the data.frame class
such as a |
general_directory |
A Scottish post office general directory in the form
of a data.frame or other object that inherits from the data.frame class
such as a |
verbose |
Whether the function should be executed silently ( |
matches |
Whether ( |
... |
Further arguments to be passed down to
|
A data.frame of the same class as that of the one provided in trades_directory and/or general_directory. Should trades_directory and general_directory be provided as objects of different classes, the class of the returned data.frame will be that of the parent class, i.e. if trades_directory and general_directory are provided as a pure data.frame and a tibble respectively, a pure data.frame is returned. Columns include at least surname, forename, address.trade.number, address.trade.body, address.house.number, address.house.body.
See combine_match_general_to_trades.
Attempts to complement Scottish post office trades directory data.frame with house address information from the Scottish post office general directory data.frame provided by matching records from the two datasets using the distance metric specified. Shows a progress bar indicating function progression.
combine_match_general_to_trades_progress( trades_directory, general_directory, verbose, matches, ... )
trades_directory |
A Scottish post office trades directory in the form
of a data.frame or other object that inherits from the data.frame class
such as a |
general_directory |
A Scottish post office general directory in the form
of a data.frame or other object that inherits from the data.frame class
such as a |
verbose |
Whether the function should be executed silently ( |
matches |
Whether ( |
... |
Further arguments to be passed down to
|
A data.frame of the same class as that of the one provided in trades_directory and/or general_directory. Should trades_directory and general_directory be provided as objects of different classes, the class of the returned data.frame will be that of the parent class, i.e. if trades_directory and general_directory are provided as a pure data.frame and a tibble respectively, a pure data.frame is returned. Columns include at least surname, forename, address.trade.number, address.trade.body, address.house.number, address.house.body.
See combine_match_general_to_trades.
Replaces missing trade address(es) in the provided Scottish post office directory data.frame with random string(s). Random string(s) only show(s) in body of trade address entry(/ies).
combine_no_trade_address_to_random_string(directory)
directory |
A Scottish post office directory in the form of a data.frame
or other object that inherits from the data.frame class such as a
|
A data.frame of the same class as the one provided in directory; columns include at least address.trade.
Prevents unwarranted matches when matching general to trades directory: unrelated records with similar names and a trade address labelled as missing would otherwise be matched.
Returns a 22 character long random string if address provided is labelled as missing ("No trade/house address found").
combine_random_string_if_no_address(address)
address |
A character string. |
A length 1 character string vector: 22 character long random string if address is labelled as missing ("No trade/house address found"), address otherwise.
Searches for the specified pattern in the provided string; if found, returns a 22 character long random string, otherwise returns the original string.
combine_random_string_if_pattern(string, regex)
string |
A character string. |
regex |
Character string regex specifying the pattern to look for in
|
A length 1 character string vector: 22 character long random string if regex is found in string, string otherwise.
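The random-string substitution described for the two functions above can be sketched as follows. Both helpers are hypothetical stand-ins written in base R, and the alphabet used for the random string (letters and digits) is an assumption; the documentation only states the length of 22 characters.

```r
# Hypothetical helpers, not the package's implementation.
# Assumption: the random string is drawn from letters and digits.
random_string <- function(n = 22L) {
  paste(sample(c(letters, LETTERS, 0:9), n, replace = TRUE), collapse = "")
}

random_string_if_pattern <- function(string, regex) {
  if (grepl(regex, string)) random_string() else string
}

missing_regex <- "^No (trade|house) address found$"

replaced <- random_string_if_pattern("No trade address found", missing_regex)
kept <- random_string_if_pattern("18 London Road", missing_regex)

nchar(replaced)
kept
```

Because each missing address receives a different random string, two records that both lack a trade address no longer look alike to a string distance metric.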
Attempts to clean the provided Scottish post office general directory data.frame.
general_clean_directory(directory, progress = TRUE, verbose = FALSE)
directory |
A Scottish post office general directory in the form
of a data.frame or other object that inherits from the data.frame class
such as a |
progress |
Whether progress should be shown ( |
verbose |
Whether the function should be executed silently ( |
A tibble; columns include at least forename, surname, occupation, address.trade.number, address.trade.body, address.house.number and address.house.body.
"house" suffix in occupation column is moved to addresses, occupation information is repatriated from addresses to occupation column; addresses is split into trade and house address columns; additional records are created for each extra trade address identified. Entries are further cleaned of optical character recognition (OCR) errors and subject to a number of standardisation operations.
pages <- rep("71", 2L)
surnames <- c("ABOT", "ABRCROMBIE")
forenames <- c("Wm.", "Alex")
occupations <- c("Wine and spirit mercht - See Advertisement in Appendix.", "")
addresses = c(
  "1S20 Londn rd; ho. 13<J Queun sq",
  "Bkr; I2 Dixon Street, & 29 Auderstn Qu.; res 2G5 Argul st."
)
directory <- tibble::tibble(
  page = pages,
  surname = surnames,
  forename = forenames,
  occupation = occupations,
  addresses = addresses
)
general_clean_directory(directory, progress = TRUE, verbose = FALSE)
Attempts to clean the provided Scottish post office general directory data.frame.
general_clean_directory_plain(directory, verbose)
directory |
A Scottish post office general directory in the form
of a data.frame or other object that inherits from the data.frame class
such as a |
verbose |
Whether the function should be executed silently ( |
A data.frame of the same class as the one provided in directory; columns include at least forename, surname, occupation, address.trade.number, address.trade.body, address.house.number and address.house.body. "house" suffix in occupation column is moved to addresses, occupation information is repatriated from addresses to occupation column; addresses is split into trade and house address columns; additional records are created for each extra trade address identified. Entries are further cleaned of optical character recognition (OCR) errors and subject to a number of standardisation operations.
Attempts to clean the provided Scottish post office general directory data.frame. Shows a progress bar indicating the progression of the function.
general_clean_directory_progress(directory, verbose)
directory |
A Scottish post office general directory in the form
of a data.frame or other object that inherits from the data.frame class
such as a |
verbose |
Whether the function should be executed silently ( |
A data.frame of the same class as the one provided in directory; columns include at least forename, surname, occupation, address.trade.number, address.trade.body, address.house.number and address.house.body. "house" suffix in occupation column is moved to addresses, occupation information is repatriated from addresses to occupation column; addresses is split into trade and house address columns; additional records are created for each extra trade address identified. Entries are further cleaned of optical character recognition (OCR) errors and subject to a number of standardisation operations.
Attempts to clean entries of the Scottish post office general directory data.frame provided.
general_clean_entries(directory, verbose)
directory |
A Scottish post office general directory in the form
of a data.frame or other object that inherits from the data.frame class
such as a |
verbose |
Whether the function should be executed silently ( |
A data.frame of the same class as the one provided in directory; columns include the same as those in the data.frame provided in directory. Entries are cleaned of optical character recognition (OCR) errors and subject to a number of standardisation operations.
Attempts to fix the structure of the raw Scottish post office general
directory data.frame provided. For each entry, general_fix_structure
attempts to fix parsing errors by moving pieces of information provided to
the right columns; further attempts to separate trade from house address,
separate multiple trade addresses as well as separate number from
address body.
general_fix_structure(directory, verbose)
directory |
A Scottish post office general directory in the form
of a data.frame or other object that inherits from the data.frame class
such as a |
verbose |
Whether the function should be executed silently ( |
A data.frame of the same class as the one provided in directory; columns include at least occupation, address.trade.number, address.trade.body, address.house.number and address.house.body.
"house" suffix in occupation column is moved to addresses, occupation information is repatriated from addresses to occupation column; addresses is split into trade and house address columns; additional records are created for each extra trade address identified.
For some raw Scottish post office general directory entries, the word "house" referring to address type lives in the occupation column as a result of parsing errors. general_move_house_to_address attempts to move this information to the appropriate destination: the addresses column.
general_move_house_to_address(directory, regex)
directory |
A Scottish post office general directory in the form
of a data.frame or other object that inherits from the data.frame class
such as a |
regex |
Regex to use for the task provided as a character string. |
A data.frame of the same class as the one provided in directory; columns include at least occupation and addresses. Entries in the occupation column are cleaned of "house" suffix; entries showing "house" suffix in occupation column see "house, " pasted as prefix to corresponding addresses column content.
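The move of a trailing "house" from the occupation column to the addresses column can be illustrated with a short base R sketch. The function name `move_house_to_address()` and its default regex are assumptions made for this example; the regex the package itself passes in is not shown in this documentation.

```r
# Hypothetical sketch; the default regex is an assumption, not the package's.
move_house_to_address <- function(directory,
                                  regex = "[,;\\s]*\\bhouse\\.?\\s*$") {
  # Entries whose occupation ends with the word "house".
  has_house <- grepl(regex, directory$occupation, ignore.case = TRUE, perl = TRUE)

  # Strip the suffix from occupation ...
  directory$occupation <- sub(
    regex, "", directory$occupation, ignore.case = TRUE, perl = TRUE
  )
  # ... and paste "house, " in front of the corresponding address.
  directory$addresses[has_house] <- paste0(
    "house, ", directory$addresses[has_house]
  )
  directory
}

directory <- data.frame(
  occupation = c("Baker; house", "Grocer"),
  addresses = c("29 Anderston Quay", "12 Dixon Place"),
  stringsAsFactors = FALSE
)
moved <- move_house_to_address(directory)
moved
```

The word boundary in the regex keeps occupations that merely end in "house" as part of a longer word, such as "Storehouse", from being altered.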
For some raw Scottish post office general directory entries, occupation information lives in the addresses column as a result of parsing errors. general_repatriate_occupation_from_address attempts to move this information to the appropriate destination: the occupation column.
general_repatriate_occupation_from_address(directory, regex)
directory |
A Scottish post office general directory in the form
of a data.frame or other object that inherits from the data.frame class
such as a |
regex |
Regex to use for the task provided as a character string. |
A data.frame of the same class as the one provided in directory; columns include at least occupation and addresses.
Attempts to separate number from body of address entries in the Scottish post office general directory data.frame provided.
general_split_address_numbers_bodies( directory, regex_split_address_numbers, regex_split_address_body, regex_split_address_empty, ignore_case_filter, ignore_case_match )
directory |
A Scottish post office general directory in the form
of a data.frame or other object that inherits from the data.frame class
such as a |
regex_split_address_numbers |
Regex to use to match address number(s). |
regex_split_address_body |
Regex to use to match address body(/ies). |
regex_split_address_empty |
Regex to use to match empty address entries. |
ignore_case_filter |
Boolean specifying whether case should be ignored
( |
ignore_case_match |
Boolean specifying whether case should be ignored
( |
A data.frame of the same class as the one provided in directory; columns include at least address.trade.number, address.trade.body, address.house.number and address.house.body.
Attempts to separate multiple trade addresses in the Scottish post office general directory data.frame provided for entries for which more than one are provided.
general_split_trade_addresses( directory, regex_split, ignore_case_split, regex_filter, ignore_case_filter, regex_match, ignore_case_match )
directory |
A Scottish post office general directory in the form
of a data.frame or other object that inherits from the data.frame class
such as a |
regex_split |
Regex to use to split addresses. |
ignore_case_split |
Boolean specifying whether case should be ignored
( |
regex_filter |
Regex to use to search for address entries with post-split undesired leftovers. |
ignore_case_filter |
Boolean specifying whether case should be ignored
( |
regex_match |
Regex to use to clear address entries from post-split undesired leftovers. |
ignore_case_match |
Boolean specifying whether case should be ignored
( |
A data.frame of the same class as the one provided in directory; columns include at least address.trade. Multiple trade addresses are separated for entries for which more than one are provided. Each trade address identified lives on an individual row with information in the other columns duplicated.
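The one-row-per-trade-address result described above can be sketched in base R with `strsplit()` and row replication. The function below is a hypothetical illustration; the splitting regex (semicolons and ampersands) is an assumption and the filter and match steps of the real function are left out.

```r
# Hypothetical sketch; the splitting regex is an assumption.
split_trade_addresses <- function(directory, regex_split = "\\s*[;&]\\s*") {
  parts <- strsplit(directory$address.trade, regex_split, perl = TRUE)
  n <- lengths(parts)

  # Repeat each row once per address found, duplicating the other columns.
  out <- directory[rep(seq_len(nrow(directory)), n), , drop = FALSE]
  out$address.trade <- unlist(parts)
  rownames(out) <- NULL
  out
}

directory <- data.frame(
  surname = c("Abercromby", "Blair"),
  address.trade = c("12 Dixon Place & 29 Anderston Quay", "280 High Street"),
  stringsAsFactors = FALSE
)
split_rows <- split_trade_addresses(directory)
split_rows
```

The first record yields two rows sharing the same surname, while the second record is returned unchanged.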
Attempts to separate house address from trade address(es) in the Scottish post office general directory data.frame provided for entries for which a house address is provided along trade address(es).
general_split_trade_house_addresses(directory, regex, verbose)
directory |
A Scottish post office general directory in the form
of a data.frame or other object that inherits from the data.frame class
such as a |
regex |
Regex to use for the task provided as a character string. |
verbose |
Whether the function should be executed silently ( |
A data.frame of the same class as the one provided in directory; columns include at least addresses.trade and address.house. Trade addresses are separated from house address for entries for which a house address is provided along trade address(es).
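Separating a house address from the trade address that precedes it can be sketched with `regexpr()` and `substr()`. The helper name, the output column names and the separator regex (a semicolon followed by "ho.", "house" or "res") are assumptions chosen for illustration; they are not taken from the package.

```r
# Hypothetical sketch; the separator regex is an assumption.
split_trade_house <- function(addresses,
                              regex = ";\\s*(?:ho\\.?|house|res\\.?)\\s+") {
  m <- regexpr(regex, addresses, perl = TRUE)
  start <- as.integer(m)            # -1 where the separator is absent
  len <- attr(m, "match.length")
  found <- start > 0L

  data.frame(
    # Text before the separator, or the whole entry when there is none.
    address.trade = ifelse(found, substr(addresses, 1L, start - 1L), addresses),
    # Text after the separator, or NA when no house address is given.
    address.house = ifelse(
      found, substr(addresses, start + len, nchar(addresses)), NA_character_
    ),
    stringsAsFactors = FALSE
  )
}

addresses <- c("18 London Road; ho. 136 Queen Square", "12 Dixon Place")
split_addresses <- split_trade_house(addresses)
split_addresses
```

Entries without a house marker keep their full content in the trade column and get a missing house address.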
A dataset containing regular expressions meant to match commonly (OCR) misread place names in directory address entries. For each place name, a replacement pattern is provided for use in substitution operations, as well as a boolean operator indicating whether the corresponding regex is case sensitive or not.
globals_address_names
A data frame with 3 variables:
regex for place name matching
replacement pattern for substitution operations
boolean operator indicating whether the corresponding regex is case sensitive or not.
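A lookup table of this shape (regex, replacement, case flag) lends itself to a simple substitution loop. The sketch below is an illustration only: the column names `pattern`, `replacement` and `ignore_case`, the toy rows and the helper `apply_lookup()` are all assumptions, and the polarity of the boolean flag in the real dataset may differ.

```r
# Toy lookup table; column names and rows are assumptions for illustration.
lookup <- data.frame(
  pattern = c("\\bLondn\\b", "\\bQueun\\b", "\\bargul\\b"),
  replacement = c("London", "Queen", "Argyle"),
  ignore_case = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical helper: apply each substitution in turn to the whole vector.
apply_lookup <- function(x, lookup) {
  for (i in seq_len(nrow(lookup))) {
    x <- gsub(
      lookup$pattern[i], lookup$replacement[i], x,
      ignore.case = lookup$ignore_case[i], perl = TRUE
    )
  }
  x
}

cleaned <- apply_lookup(
  c("20 Londn Road", "136 Queun Square", "265 Argul Street"), lookup
)
cleaned
```

Rows are applied in order, so an earlier substitution can affect what a later pattern sees; the ordering of rows in such a table therefore matters.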
A dataset containing regular expressions meant to match common (OCR) errors in reading the ampersand character: "&" in directory entries. For each error pattern, a replacement pattern is provided for use in substitution operations, as well as a boolean operator indicating whether the corresponding regex is case sensitive or not.
globals_ampersand
A data frame with 3 variables:
regex for ampersand reading error matching
replacement pattern for substitution operations
boolean operator indicating whether the corresponding regex is case sensitive or not.
A character vector of regular expressions to match common (OCR) errors in reading the ampersand character: "&" in directory entries.
globals_ampersand_vector
A character string vector.
A character vector of regular expressions to match common (OCR) errors in reading the ampersand character: "&" in directory entries.
globals_and_double_quote
A character string vector.
Some regexes contain the double quote character: '"'.
A character vector of regular expressions to match common (OCR) errors in reading the ampersand character: "&" in directory entries.
globals_and_single_quote
A character string vector.
Some regexes contain the single quote character: "'".
A dataset containing regular expressions meant to match commonly (OCR) misread forenames in directory name entries. For each forename, a replacement pattern is provided for use in substitution operations, as well as a boolean operator indicating whether the corresponding regex is case sensitive or not.
globals_forenames
A data frame with 3 variables:
regex for forename matching
replacement pattern for substitution operations
boolean operator indicating whether the corresponding regex is case sensitive or not.
A character vector of column names for general directories.
globals_general_colnames
A character string vector.
A dataset containing regular expressions meant to match commonly (OCR) misread "Mac" prefixes in directory name entries. For each "Mac" prefix, a replacement pattern is provided for use in substitution operations, as well as a boolean operator indicating whether the corresponding regex is case sensitive or not.
globals_macs
A data frame with 3 variables:
regex for "Mac" prefix matching
replacement pattern for substitution operations
boolean operator indicating whether the corresponding regex is case sensitive or not.
A dataset containing regular expressions meant to match commonly (OCR) misread numbers in directory address entries. For each number, a replacement pattern is provided for use in substitution operations, as well as a boolean operator indicating whether the corresponding regex is case sensitive or not.
globals_numbers
globals_numbers
A data frame with 3 variables:
regex for number matching
replacement pattern for substitution operations
boolean operator indicating whether the corresponding regex is case sensitive or not.
A dataset containing regular expressions meant to match occupations commonly misread by OCR in directory entries. For each occupation, a replacement pattern is provided for use in substitution operations, along with a Boolean value indicating whether the corresponding regex is case sensitive.
globals_occupations
A data frame with 3 variables:
regex for occupation matching
replacement pattern for substitution operations
boolean operator indicating whether the corresponding regex is case sensitive or not.
A character vector of common place types found in directory address entries.
globals_places_raw
A character string vector.
A dataset containing regular expressions meant to match place types commonly misread by OCR in directory address entries. For each place type, a replacement pattern is provided for use in substitution operations, along with a Boolean value indicating whether the corresponding regex is case sensitive.
globals_places_regex
A data frame with 3 variables:
regex for place type matching
replacement pattern for substitution operations
boolean operator indicating whether the corresponding regex is case sensitive or not.
Regular expression used in the making of the match.string that eventually enables the matching of general and trades directory records.
globals_regex_address_house_body_number
A character string vector.
Regular expression used to remove undesired prefixes in general directory address records.
globals_regex_address_prefix
A character string vector.
Regular expression used to filter on the word "and" in a filtering operation that is part of a mutate operation on the general directory provided.
globals_regex_and_filter
A character string vector.
Regular expression used to match the word "and" in a filtering operation that is part of a mutate operation on the general directory provided.
globals_regex_and_match
A character string vector.
Regular expression used in the making of the match.string that eventually enables the matching of general and trades directory records.
globals_regex_get_address_house_type
A character string vector.
See also: combine_get_address_house_type
Regular expression used to separate trade addresses from house addresses in the general directory.
globals_regex_house_split_trade
A character string vector.
See also: general_split_trade_house_addresses
Regular expression used to move the word "house" from the occupation column to the addresses column in the general directory.
globals_regex_house_to_address
A character string vector.
Regular expression used to match irrelevant information in the directory dataset provided.
globals_regex_irrelevants
A character string vector.
Regular expression used to repatriate occupations from the address column in the general directory.
globals_regex_occupation_from_address
A character string vector.
See also: general_repatriate_occupation_from_address
Regular expression used to separate numbers from body in provided general directory address entries.
globals_regex_split_address_body
A character string vector.
See also: general_split_address_numbers_bodies
Regular expression used to separate numbers from body in provided general directory address entries.
globals_regex_split_address_empty
A character string vector.
See also: general_split_address_numbers_bodies
Regular expression used to separate numbers from body in provided general directory address entries.
globals_regex_split_address_numbers
A character string vector.
See also: general_split_address_numbers_bodies
Regular expression used to split trade addresses when more than one is provided.
globals_regex_split_trade_addresses
A character string vector.
Regular expression used to match titles in provided directory name entries.
globals_regex_titles
A character string vector.
A dataset containing regular expressions meant to match names of saints commonly misread by OCR in directory address names. For each saint, a replacement pattern is provided for use in substitution operations, along with a Boolean value indicating whether the corresponding regex is case sensitive.
globals_saints
A data frame with 3 variables:
regex for Saint name matching
replacement pattern for substitution operations
boolean operator indicating whether the corresponding regex is case sensitive or not.
A dataset containing regular expressions meant to match suffixes commonly misread by OCR in directory address entries. For each suffix, a replacement pattern is provided for use in substitution operations, along with a Boolean value indicating whether the corresponding regex is case sensitive.
globals_suffixes
A data frame with 3 variables:
regex for suffix matching
replacement pattern for substitution operations
boolean operator indicating whether the corresponding regex is case sensitive or not.
A dataset containing regular expressions meant to match surnames commonly misread by OCR in directory name entries. For each surname, a replacement pattern is provided for use in substitution operations, along with a Boolean value indicating whether the corresponding regex is case sensitive.
globals_surnames
A data frame with 3 variables:
regex for surname matching
replacement pattern for substitution operations
boolean operator indicating whether the corresponding regex is case sensitive or not.
A dataset containing regular expressions meant to match titles commonly misread by OCR in directory name records. For each title, a replacement pattern is provided for use in substitution operations, along with a Boolean value indicating whether the corresponding regex is case sensitive.
globals_titles
A data frame with 3 variables:
regex for title matching
replacement pattern for substitution operations
boolean operator indicating whether the corresponding regex is case sensitive or not.
A character vector of column names for trades directories.
globals_trades_colnames
A character string vector.
A character vector of column names for the dataset where general directory records are matched to trades directory records.
globals_union_colnames
A character string vector.
A dataset containing regular expressions meant to match worksite names commonly misread by OCR in directory address entries. For each worksite, a replacement pattern is provided for use in substitution operations, along with a Boolean value indicating whether the corresponding regex is case sensitive.
globals_worksites
A data frame with 3 variables:
regex for worksite name matching
replacement pattern for substitution operations
boolean operator indicating whether the corresponding regex is case sensitive or not.
Attempts to clean the provided Scottish post office trades directory data.frame.
trades_clean_directory(directory, progress = TRUE, verbose = FALSE)
directory |
A Scottish post office trades directory in the form
of a data.frame or other object that inherits from the data.frame class
such as a |
progress |
Whether progress should be shown ( |
verbose |
Whether the function should be executed silently ( |
A data.frame of the same class as the one provided in directory; columns include at least forename, surname, occupation, address.trade.number and address.trade.body. Entries are cleaned of optical character recognition (OCR) errors and subject to a number of standardisation operations.
pages <- rep("71", 2L)
ranks <- c("135", "326")
surnames <- c("ABOT", "ABRCROMBIE")
forenames <- c("Wm.", "Alex")
occupations <- c(
  "Wine and spirit mercht - See Advertisement in Appendix.",
  "Bkr"
)
types <- rep("OWN ACCOUNT", 2L)
numbers <- c("1S20", "I2")
bodies <- c("Londn rd.", "Dixen pl")
directory <- tibble::tibble(
  page = pages, rank = ranks, surname = surnames, forename = forenames,
  occupation = occupations, type = types,
  address.trade.number = numbers,
  address.trade.body = bodies
)
trades_clean_directory(directory, progress = TRUE, verbose = FALSE)
Attempts to clean the provided Scottish post office trades directory data.frame.
trades_clean_directory_plain(directory, verbose)
directory |
A Scottish post office trades directory in the form
of a data.frame or other object that inherits from the data.frame class
such as a |
verbose |
Whether the function should be executed silently ( |
A data.frame of the same class as the one provided in directory; columns include at least forename, surname, occupation, address.trade.number and address.trade.body. Entries are cleaned of optical character recognition (OCR) errors and subject to a number of standardisation operations.
Attempts to clean the provided Scottish post office trades directory data.frame. Shows a progress bar indicating function progression.
trades_clean_directory_progress(directory, verbose)
directory |
A Scottish post office trades directory in the form
of a data.frame or other object that inherits from the data.frame class
such as a |
verbose |
Whether the function should be executed silently ( |
A data.frame of the same class as the one provided in directory; columns include at least forename, surname, occupation, address.trade.number and address.trade.body. Entries are cleaned of optical character recognition (OCR) errors and subject to a number of standardisation operations.
Attempts to clean entries of the provided Scottish post office trades directory data.frame.
trades_clean_entries(directory, verbose)
directory |
A Scottish post office trades directory in the form
of a data.frame or other object that inherits from the data.frame class
such as a |
verbose |
Whether the function should be executed silently ( |
A data.frame of the same class as the one provided in directory; its columns are the same as those of the data.frame provided in directory. Entries are cleaned of optical character recognition (OCR) errors and subject to a number of standardisation operations.
Clean address entries in the provided directory dataframe.
utils_clean_address(directory, type = c("body", "number", "ends"))
directory |
A directory dataframe. |
type |
A character string: "body", "number" or "ends". Specifies the type
of address cleaning to be performed. For "body", "number" and "ends"
|
A dataframe.
## Not run:
directory <- data.frame(
  page = c("71", "71"),
  surname = c("ABOT", "ABRCROMBIE"),
  forename = c("Wm.", "Alex"),
  occupation = c("Wine and spirit merchant", "Baker"),
  address.number = c(" -; 1820", ",,12"),
  address.body = c(
    "London st. ; house, Mary hill.*",
    "&;Dixon st.; residence, Craigrownie, Cove.$"
  ),
  stringsAsFactors = FALSE
)
utils_clean_address(directory, "body")
utils_clean_address(directory, "number")
## End(Not run)
Clean body record of provided address(es).
utils_clean_address_body(addresses)
addresses |
A character string vector of address(es). |
A vector of character strings.
## Not run:
utils_clean_address_body(
  c("London st.", "Mary hill.*", "&;Dixon st.", "Craigrownie, Cove.$")
)
## End(Not run)
Clean beginning and end of the provided address entries.
utils_clean_address_ends(addresses)
addresses |
A character string vector of address(es). |
A vector of character strings.
## Not run:
utils_clean_address_ends(
  c(
    " -; 18, 20 London st.; house, Mary hill.*",
    ",,12 &;Dixon st.; residence, Craigrownie, Cove.$"
  )
)
## End(Not run)
Clean number record of provided address(es).
utils_clean_address_number(addresses)
addresses |
A character string vector of address(es). |
A vector of character strings.
## Not run:
utils_clean_address_number(c(" -; 1820", ",,12"))
## End(Not run)
Clean all address records in provided directory dataframe.
utils_clean_addresses(directory)
directory |
A directory dataframe. Columns must include
|
A dataframe.
## Not run:
directory <- data.frame(
  page = c("71", "71", "71"),
  surname = c("ABOT", "ABRCROMBIE", "BLAI"),
  forename = c("Wm.", "Alex", "Jn Huh"),
  occupation = c("Wine and spirit merchant", "Baker", "Victualer"),
  address.trade.number = c(" -; 1820", "", "280"),
  address.trade.body = c("London st. ; house, Mary hill.*", "", "High stret"),
  stringsAsFactors = FALSE
)
utils_clean_addresses(directory)
## End(Not run)
Clean entry ends for the specified columns in the directory dataframe provided.
utils_clean_ends(directory, ...)
directory |
A directory dataframe. |
... |
Columns to clean provided as expressions. |
A dataframe.
## Not run:
directory <- data.frame(
  page = c("71", "71", "71"),
  surname = c("ABOT", "ABRCROMBIE", "BLAI"),
  forename = c("Wm.", "Alex", "Jn Huh"),
  occupation = c("Wine and spirit merchant", "Baker", "Victualer"),
  address.trade.number = c(" -; 1820", "", "280"),
  address.trade.body = c("London st. ; house, Mary hill.*", "", "High stret"),
  stringsAsFactors = FALSE
)
utils_clean_ends(directory, address.trade.number, address.trade.body)
## End(Not run)
Clean name columns (forename & surname) of provided directory dataframe.
utils_clean_names(directory)
directory |
A directory dataframe. |
A dataframe.
## Not run:
directory <- data.frame(
  page = c("71", "71"),
  surname = c("ABOT", "ABRCROMBIE"),
  forename = c("Wm.", "Alex"),
  occupation = c("Wine and spirit merchant", "Baker"),
  address.number = c(" -; 1820", ",,12"),
  address.body = c(
    "London st. ; house, Mary hill.*",
    "&;Dixon st.; residence, Craigrownie, Cove.$"
  ),
  stringsAsFactors = FALSE
)
utils_clean_names(directory)
## End(Not run)
Clean "occupation" column of provided directory dataframe.
utils_clean_occupations(directory)
directory |
A directory dataframe. |
A dataframe.
## Not run:
directory <- data.frame(
  page = c("71", "71"),
  surname = c("ABOT", "ABRCROMBIE"),
  forename = c("Wm.", "Alex"),
  occupation = c("wine and spirit mercht", "bkr"),
  address.number = c(" -; 1820", ",,12"),
  address.body = c(
    "London st. ; house, Mary hill.*",
    "&;Dixon st.; residence, Craigrownie, Cove.$"
  ),
  stringsAsFactors = FALSE
)
utils_clean_occupations(directory)
## End(Not run)
Clears the provided string of the content specified as a regex.
utils_clear_content(string_search, regex_content, ignore_case)
string_search |
Character string to search for match(es). |
regex_content |
PCRE type regex provided as a character string of match(es) to search for. |
ignore_case |
Boolean specifying whether case should be ignored ( |
A character string.
## Not run:
utils_clear_content("glasgow-entrepreneurs", "^.+-", TRUE)
## End(Not run)
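The behaviour described for this helper can be approximated in base R with gsub and an empty replacement. The sketch below is an illustration only: clear_content_sketch is a hypothetical name and the package's actual implementation may differ.

```r
# Hypothetical stand-in: remove every match of a PCRE regex from a string.
clear_content_sketch <- function(string_search, regex_content, ignore_case) {
  gsub(regex_content, "", string_search, ignore.case = ignore_case, perl = TRUE)
}

cleared <- clear_content_sketch("glasgow-entrepreneurs", "^.+-", TRUE)
print(cleared)
```

Because "^.+-" is greedy, everything up to and including the last hyphen is removed.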
Attempts to get rid of irrelevant information in all columns of the provided directory dataframe.
utils_clear_irrelevants(directory, ...)
directory |
A directory dataframe. |
... |
Further arguments to be passed down to |
A dataframe.
## Not run:
directory <- data.frame(
  page = c("71", "71"),
  surname = c("ABOT", "ABRCROMBIE"),
  forename = c("Wm.", "Alex"),
  occupation = c("Wine and spirit merchant — See Advertisement in Appendix.", "Baker"),
  address.trade.number = c("18, 20", "12"),
  address.house.number = c("136", "265"),
  address.trade.body = c("London Street.", "Dixon Street."),
  address.house.body = c("Queen Street.", "Argyle Street"),
  stringsAsFactors = FALSE
)
utils_clear_irrelevants(directory, globals_regex_irrelevants, ignore_case = TRUE)
## End(Not run)
Executes the function provided. Execution can be silenced via the verbose parameter.
utils_execute(verbose, fun, ...)
verbose |
Boolean specifying whether to silence the function execution
( |
fun |
Function to execute provided as an expression. |
... |
Argument(s) to be passed to the function above for execution. |
Whatever the provided function returns.
## Not run:
utils_execute(TRUE, message, "I'm showing in console")
## End(Not run)
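One plausible way to obtain this kind of switch in base R is to wrap the call in suppressMessages when verbose is FALSE. The sketch below assumes that only messages need silencing; execute_sketch and noisy_sum are hypothetical names, not package functions.

```r
# Hypothetical stand-in: run fun(...) as is when verbose, muted otherwise.
execute_sketch <- function(verbose, fun, ...) {
  if (verbose) fun(...) else suppressMessages(fun(...))
}

# A toy function that emits a message and returns a value.
noisy_sum <- function(...) {
  message("summing")
  sum(...)
}

loud <- execute_sketch(TRUE, noisy_sum, 1, 2)   # the message is emitted
quiet <- execute_sketch(FALSE, noisy_sum, 1, 2) # the message is suppressed
```

In both cases the return value of the wrapped function is passed through unchanged.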
Takes a raw directory dataframe (just loaded), adds a column with the corresponding directory name, replaces all NA entries with an empty string, clears all entries of unwanted blank characters, formats page numbers as integers, and returns the output with the directory name column in first position.
utils_format_directory_raw(df, name)
df |
A raw directory dataframe as output by
|
name |
Directory name provided as a character string. |
A dataframe.
## Not run:
directory <- data.frame(
  page = c("71", "71"),
  surname = c("ABOT ", " ABRCROMBIE"),
  forename = c("Wm.", "Alex"),
  occupation = c("wine and spirit mercht", " bkr"),
  addresses = c(
    "depot -; 1820 London st. ; house, Mary hill.*",
    "workshop,,12 &;Dixon st.; residence, Craigrownie, Cove.$ "
  ),
  stringsAsFactors = FALSE
)
utils_format_directory_raw(directory, "1861-1862")
## End(Not run)
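The formatting steps listed in the description can be sketched in base R as follows. This is an illustration under assumptions: format_raw_sketch is a hypothetical name, the new column is assumed to be called directory, and "unwanted blank characters" is read here as leading, trailing and repeated whitespace.

```r
# Hypothetical stand-in for the formatting steps described above.
format_raw_sketch <- function(df, name) {
  df[] <- lapply(df, function(col) {
    col <- as.character(col)
    col[is.na(col)] <- ""           # NA entries become empty strings
    trimws(gsub("\\s+", " ", col))  # collapse and trim blank characters
  })
  df$page <- as.integer(df$page)    # page numbers as integers
  # Directory name column goes first.
  data.frame(directory = name, df, stringsAsFactors = FALSE)
}

raw <- data.frame(
  page = c("71", "71"),
  surname = c("ABOT ", " ABRCROMBIE"),
  occupation = c("wine and spirit mercht", NA),
  stringsAsFactors = FALSE
)
formatted <- format_raw_sketch(raw, "1861-1862")
str(formatted)
```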
Searches for the specified pattern in the provided character string vector. If found, substitutes all occurrences of an alternative pattern in an alternative character string and returns the output. If not, returns the default character string provided.
utils_gsub_if_found(regex_filter, string_filter, regex_search, string_replace, string_search, default, ignore_case_filter, ignore_case_search)
regex_filter |
Pattern to look for provided as a character string regex. |
string_filter |
Character string vector to search into for the pattern
provided in |
regex_search |
Alternative pattern provided as a character string regex
to look in the alternative character string provided in |
string_replace |
Substitution character string for matches of
|
string_search |
Alternative character string to search into for the
pattern provided in |
default |
Character string returned if pattern provided in |
ignore_case_filter |
Boolean specifying whether case should be ignored ( |
ignore_case_search |
Boolean specifying whether case should be ignored ( |
A character string vector.
## Not run:
utils_gsub_if_found(
  "^glasgow", c("glasgow-entrepreneurs", "aberdeen-entrepreneurs"),
  "(?<=-).+$", "merchant", "edinburgh-entrepreneurs",
  "pattern not found", TRUE, TRUE
)
## End(Not run)
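Read literally, the description amounts to a filter (grepl) deciding, element by element, between a substituted string (gsub) and a default. The base R sketch below follows that reading; gsub_if_found_sketch is a hypothetical name and the package function may behave differently in edge cases.

```r
# Hypothetical stand-in: substitute in string_search where the filter matches,
# return the default elsewhere.
gsub_if_found_sketch <- function(regex_filter, string_filter, regex_search,
                                 string_replace, string_search, default,
                                 ignore_case_filter, ignore_case_search) {
  found <- grepl(
    regex_filter, string_filter,
    ignore.case = ignore_case_filter, perl = TRUE
  )
  replaced <- gsub(
    regex_search, string_replace, string_search,
    ignore.case = ignore_case_search, perl = TRUE
  )
  ifelse(found, replaced, default)
}

result <- gsub_if_found_sketch(
  "^glasgow", c("glasgow-entrepreneurs", "aberdeen-entrepreneurs"),
  "(?<=-).+$", "merchant", "edinburgh-entrepreneurs",
  "pattern not found", TRUE, TRUE
)
print(result)
```

With these inputs the sketch yields the substituted string for the first element, where the filter matches, and the default for the second.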
Loads an object saved as an .rds file back into memory.
utils_IO_load(...)
... |
Destination parameters to be passed to |
R object from destination .rds file.
## Not run:
utils_IO_load("home/projects", "glasgow-entrepreneurs")
## End(Not run)
Pastes the provided directory path and file name together using '/' as separator.
utils_IO_path(directory_path, ..., extension)
directory_path |
Path to directory where |
... |
File name components provided as character strings to be passed
down to |
extension |
File extension as character string |
Path to destination file as a character string.
## Not run:
utils_IO_path("home/projects", "glasgow-entrepreneurs", "csv")
## End(Not run)
Saves the provided object to the specified path as an .rds file.
utils_IO_write(data, ...)
data |
R object to save. |
... |
Destination parameters to be passed to |
No return value, called for side effects.
## Not run:
utils_IO_write(mtcars, "home/projects", "mtcars")
## End(Not run)
Checks whether or not for each address in the evaluation environment, body and number are filled/not empty.
utils_is_address_missing(type)
type |
A character string: "house" or "trade", specifying the type of address to check. |
A Boolean vector: TRUE if both number and body are empty.
The function is primarily for use in the utils_label_address_if_missing function called by utils_label_missing_addresses, where it provides a filtering vector used for labelling missing addresses. utils_is_address_missing creates an expression and further evaluates it two levels up in the environment tree, in other words in the directory dataframe eventually passed down to utils_label_missing_addresses.
If the address is empty, labels the body accordingly: "no house/trade address found".
utils_label_address_if_missing()
A character string vector of address bodies, unchanged if provided, labelled as missing otherwise.
The function is primarily for use in the utils_label_missing_addresses function, where it provides a vector of address bodies. utils_label_address_if_missing creates an expression and further evaluates it one level up in the environment tree, in other words in the directory dataframe eventually passed down to utils_label_missing_addresses.
Labels empty address bodies as "no house/trade address found" in the provided directory dataframe.
utils_label_missing_addresses(directory)
directory |
A directory dataframe. Columns must include
|
A dataframe.
## Not run:
directory <- data.frame(
  page = c("71", "71"),
  surname = c("ABOT", "ABRCROMBIE"),
  forename = c("Wm.", "Alex"),
  occupation = c("Wine and spirit merchant", "Baker"),
  address.number = c(" -; 1820", ""),
  address.body = c(
    "London st. ; house, Mary hill.*",
    ""
  ),
  stringsAsFactors = FALSE
)
utils_label_missing_addresses(directory)
## End(Not run)
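The labelling rule described above (an address counts as missing when both its number and its body are empty) can be sketched in base R. Everything here is an assumption made for illustration: the helper name label_missing_sketch, the address.trade.number and address.trade.body column names, and the exact wording and capitalisation of the label.

```r
# Hypothetical stand-in: label the body when number and body are both empty.
label_missing_sketch <- function(directory, type = "trade") {
  number_col <- paste0("address.", type, ".number")
  body_col <- paste0("address.", type, ".body")
  missing <- directory[[number_col]] == "" & directory[[body_col]] == ""
  directory[[body_col]][missing] <- paste("no", type, "address found")
  directory
}

directory <- data.frame(
  surname = c("ABOT", "ABRCROMBIE"),
  address.trade.number = c("18, 20", ""),
  address.trade.body = c("London Street", ""),
  stringsAsFactors = FALSE
)
labelled <- label_missing_sketch(directory, "trade")
print(labelled$address.trade.body)
```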
Loads the specified directory "csv" file(s) into memory. Stacks individual directories into a single dataframe and further passes the output down to utils_format_directory_raw for initial formatting.
utils_load_directories_csv(type = c("general", "trades"), directories, path, verbose)
type |
A character string: "general" or "trades". Refers to the type of directory to be loaded. |
directories |
A character string vector providing the name(s) of the directory(/ies) to load. |
path |
A character string specifying the path to the folder where the directory(/ies) live as ".csv" file(s). |
verbose |
Whether the function should be executed silently ( |
A dataframe.
## Not run:
utils_load_directories_csv(
  "general", "1861-1862",
  "home/projects/glasgow-entrepreneurs/data/general-directories",
  FALSE
)
## End(Not run)
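The load-and-stack step can be pictured with base R alone. The sketch below writes two tiny files to a temporary folder and stacks them; load_sketch, the one-file-per-directory naming scheme ("<name>.csv") and the per-file directory column are assumptions for this example rather than the package's documented layout.

```r
# Create two small, made-up directory files in a temporary folder.
dir_path <- file.path(tempdir(), "general-directories")
dir.create(dir_path, showWarnings = FALSE)
write.csv(
  data.frame(page = "71", surname = "ABOT"),
  file.path(dir_path, "1861-1862.csv"), row.names = FALSE
)
write.csv(
  data.frame(page = "72", surname = "BLAI"),
  file.path(dir_path, "1862-1863.csv"), row.names = FALSE
)

# Hypothetical stand-in: read each file as character data and stack the results.
load_sketch <- function(directories, path) {
  frames <- lapply(directories, function(name) {
    df <- read.csv(
      file.path(path, paste0(name, ".csv")),
      colClasses = "character", stringsAsFactors = FALSE
    )
    data.frame(directory = name, df, stringsAsFactors = FALSE)
  })
  do.call(rbind, frames)
}

stacked <- load_sketch(c("1861-1862", "1862-1863"), dir_path)
print(stacked)
```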
Pastes the arguments provided together using '-'. Appends result string with the extension provided.
utils_make_file(..., extension)
... |
File name component(s) as character string(s). |
extension |
File extension as character string |
File name as a character string.
utils_make_file("glasgow", "entrepreneurs", extension = "csv")
Pastes the arguments provided together using '/' as separator.
utils_make_path(...)
... |
Path components as character string(s). |
Path to last element provided as a character string.
utils_make_path("home", "projects", "glasgow-entrepreneurs.csv")
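The two naming helpers above reduce to paste calls with different separators. The base R sketch below shows one way to get the same shape of output; make_file_sketch and make_path_sketch are hypothetical names, and the dot placed before the extension is an assumption, since the description only says the extension is appended.

```r
# Hypothetical stand-ins for the file name and path helpers.
make_file_sketch <- function(..., extension) {
  paste0(paste(..., sep = "-"), ".", extension)
}
make_path_sketch <- function(...) {
  paste(..., sep = "/")
}

file_name <- make_file_sketch("glasgow", "entrepreneurs", extension = "csv")
full_path <- make_path_sketch("home", "projects", file_name)
print(full_path)
```

Base R's file.path offers the same '/'-separated joining and is a reasonable alternative to a custom path helper.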
Applies provided function across specified column(s) in provided dataframe.
utils_mutate_across(df, columns, fun, ...)
df |
A dataframe. |
columns |
Vector of expression(s) or character string(s) specifying the columns to apply the function below to in the provided dataframe. |
fun |
Function to execute provided as an expression. |
... |
Argument(s) to be passed to the function above for execution. |
A dataframe.
## Not run:
df <- data.frame(
  location = "glasgow",
  occupation = "wine merchant",
  stringsAsFactors = FALSE
)
utils_mutate_across(df, c("location", "occupation"), paste0, "!")
## End(Not run)
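Applying a function across selected columns can be sketched in base R with lapply over a column subset. This illustration only handles columns given as character strings; mutate_across_sketch is a hypothetical name and not the package's implementation.

```r
# Hypothetical stand-in: apply fun(column, ...) to each selected column.
mutate_across_sketch <- function(df, columns, fun, ...) {
  df[columns] <- lapply(df[columns], fun, ...)
  df
}

df <- data.frame(
  location = "glasgow",
  occupation = "wine merchant",
  stringsAsFactors = FALSE
)
out <- mutate_across_sketch(df, c("location", "occupation"), paste0, "!")
print(out)
```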
Executes the function provided while silencing any messages related to its execution.
utils_mute(fun, ...)
fun |
Function to execute as an expression. |
... |
Argument(s) to be passed to the function above for execution. |
Whatever the provided function in fun returns.
## Not run:
utils_mute(message, "I'm not showing in console")
## End(Not run)
Searches for the specified pattern in the provided character string. Returns the provided character string(s) pasted together if found, or the provided default character string if not.
utils_paste_if_found(regex_filter, string_filter, default, ignore_case, ...)
regex_filter |
Pattern to look for provided as a character string regex. |
string_filter |
Character string vector to search into for the pattern
provided in |
default |
Character string returned if pattern provided in |
ignore_case |
Boolean specifying whether case should be ignored ( |
... |
Character string(s) to be pasted together using a space as separator
and returned if pattern provided in |
A character string vector.
## Not run:
utils_paste_if_found(
  "^glasgow", c("glasgow-entrepreneurs", "aberdeen-entrepreneurs"),
  "pattern not found", TRUE, "pattern", "found"
)
## End(Not run)
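The find-then-paste behaviour described above can be approximated in base R with grepl and ifelse. The sketch is illustrative; paste_if_found_sketch is a hypothetical name and the package function may differ in detail.

```r
# Hypothetical stand-in: paste the strings in ... where the filter matches,
# return the default elsewhere.
paste_if_found_sketch <- function(regex_filter, string_filter, default,
                                  ignore_case, ...) {
  found <- grepl(
    regex_filter, string_filter,
    ignore.case = ignore_case, perl = TRUE
  )
  ifelse(found, paste(...), default)
}

result <- paste_if_found_sketch(
  "^glasgow", c("glasgow-entrepreneurs", "aberdeen-entrepreneurs"),
  "pattern not found", TRUE, "pattern", "found"
)
print(result)
```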
Searches for the specified pattern in the provided character string vector. If found, searches for an alternative pattern in an alternative character string and returns any match, or an empty string if none. If the original pattern is not found, returns the default character string provided.
utils_regmatches_if_found(string_filter, regex_filter, string_search, regex_search, default, ignore_case_filter, ignore_case_match, not)
string_filter |
Character string vector to search into for the pattern
provided in |
regex_filter |
Pattern to look for provided as a character string regex. |
string_search |
Alternative character string to search into for the
pattern provided in |
regex_search |
Alternative pattern provided as a character string regex
to look for in the alternative character string provided in |
default |
Character string returned if pattern provided in |
ignore_case_filter |
Boolean specifying whether case should be ignored
( |
ignore_case_match |
Boolean specifying whether case should be ignored
( |
not |
Boolean specifying whether to negate the |
A character string vector.
## Not run:
utils_regmatches_if_found(
  c("glasgow-entrepreneurs", "aberdeen-entrepreneurs"),
  "^glasgow",
  "edinburgh-entrepreneurs",
  "^.+(?=-)",
  "merchant",
  TRUE,
  TRUE,
  FALSE
)
## End(Not run)
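The two-stage logic described for utils_regmatches_if_found (filter first, then extract from an alternative string) might look roughly like this in base R. This is a hedged sketch only: regmatches_if_found_sketch is a hypothetical name, string_search is assumed to be a single string, and the not argument is deliberately left out because its exact semantics are not spelled out here.

```r
# Hypothetical base-R sketch, not the package's own implementation.
regmatches_if_found_sketch <- function(string_filter, regex_filter,
                                       string_search, regex_search, default,
                                       ignore_case_filter, ignore_case_match) {
  # Stage 1: which elements of string_filter match the filter pattern?
  found <- grepl(regex_filter, string_filter,
                 ignore.case = ignore_case_filter, perl = TRUE)
  # Stage 2: extract the first match of regex_search from string_search.
  m <- regexpr(regex_search, string_search,
               ignore.case = ignore_case_match, perl = TRUE)
  matched <- regmatches(string_search, m)
  # regmatches() returns character(0) when nothing matches; use "" instead.
  if (length(matched) == 0L) matched <- ""
  # Matching elements get the extracted text, the others get the default.
  ifelse(found, matched, default)
}

result <- regmatches_if_found_sketch(
  c("glasgow-entrepreneurs", "aberdeen-entrepreneurs"),
  "^glasgow",
  "edinburgh-entrepreneurs",
  "^.+(?=-)",
  "merchant",
  TRUE,
  TRUE
)
print(result)
```

Here the lookahead pattern "^.+(?=-)" extracts "edinburgh" from the alternative string, so the sketch returns "edinburgh" for the first element and "merchant" for the second.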
Searches for non-empty strings in the provided character string vector. If found, searches for an alternative pattern in an alternative character string and returns any match, or an empty string if there is none.
utils_regmatches_if_not_empty(string_filter, string_search, regex_search, ignore_case_search)
string_filter |
A character string vector. |
string_search |
Alternative character string to search into for the
pattern provided in |
regex_search |
Alternative pattern provided as a character string regex
to look for in the alternative character string provided in |
ignore_case_search |
Boolean specifying whether case should be ignored
( |
A list of character string vectors.
## Not run:
utils_regmatches_if_not_empty(
  c("glasgow-entrepreneurs", "", "aberdeen-entrepreneurs"),
  "edinburgh-entrepreneurs",
  "^edinburgh",
  TRUE
)
## End(Not run)
Clears address entries in the provided directory dataframe of undesired prefixes such as "depot", "office", "store", "works" or "workshops".
utils_remove_address_prefix(directory, regex, ignore_case)
directory |
A directory dataframe with an |
regex |
Regex character string to be used for matching. |
ignore_case |
Boolean specifying whether case should be ignored
( |
A dataframe.
## Not run:
directory <- data.frame(
  page = c("71", "71"),
  surname = c("ABOT", "ABRCROMBIE"),
  forename = c("Wm.", "Alex"),
  occupation = c("Wine and spirit merchant", "Baker"),
  addresses = c(
    "depot -; 1820 London st. ; house, Mary hill.*",
    "workshop,,12 &;Dixon st.; residence, Craigrownie, Cove.$ "
  ),
  stringsAsFactors = FALSE
)
regex <- globals_regex_address_prefix
utils_remove_address_prefix(directory, regex, TRUE)
## End(Not run)
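To illustrate the kind of prefix removal described for utils_remove_address_prefix, here is a minimal base-R sketch. Everything in it is an assumption made for illustration: remove_address_prefix_sketch and prefix_regex are hypothetical, the pattern is not the package's globals_regex_address_prefix, and the column name addresses is simply taken from the example above.

```r
# Hypothetical base-R sketch, not the package's own implementation.
remove_address_prefix_sketch <- function(directory, regex, ignore_case) {
  # Strip the first occurrence of the prefix pattern from each address.
  directory$addresses <- sub(regex, "", directory$addresses,
                             ignore.case = ignore_case, perl = TRUE)
  directory
}

# Illustrative pattern (an assumption): a prefix word at the start of the
# entry, followed by any run of punctuation and white space.
prefix_regex <- "^\\s*(depot|office|store|works(hops?)?)\\b[[:punct:][:space:]]*"

directory_sketch <- data.frame(
  surname = c("ABOT", "ABRCROMBIE"),
  addresses = c(
    "depot -; 1820 London st. ; house, Mary hill.*",
    "workshop,,12 &;Dixon st.; residence, Craigrownie, Cove.$ "
  ),
  stringsAsFactors = FALSE
)

cleaned <- remove_address_prefix_sketch(directory_sketch, prefix_regex, TRUE)
print(cleaned$addresses)
```

With this illustrative pattern, the first address starts at "1820 London st." and the second at "12 &;Dixon st."; the remaining OCR noise is left untouched, since other cleaning functions are meant to handle it.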
Splits the provided string according to the specified pattern. Organises the output as a tibble.
utils_split_and_name(string, pattern, num_col, colnames)
string |
Character string to be split. |
pattern |
Pattern to split on as character string (can be a regex). |
num_col |
Number of parts to split the string into as integer. |
colnames |
Column names for the output tibble. |
A tibble.
## Not run:
utils_split_and_name("glasgow-entrepreneurs", "-", 2, c("location", "occupation"))
## End(Not run)
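The split-and-name idea can be sketched with base R alone. This is an approximation under stated assumptions: split_and_name_sketch is a hypothetical name, it returns a plain data.frame rather than a tibble to avoid extra dependencies, and it pads with NA or drops surplus parts to reach num_col columns, which may differ from what the package does.

```r
# Hypothetical base-R sketch, not the package's own implementation.
split_and_name_sketch <- function(string, pattern, num_col, colnames) {
  # Split the single input string on the pattern.
  parts <- strsplit(string, pattern, perl = TRUE)[[1]]
  # Assumption: pad with NA or truncate so exactly num_col parts remain.
  length(parts) <- num_col
  # One row, one column per part (a data.frame stands in for a tibble).
  out <- data.frame(matrix(parts, nrow = 1), stringsAsFactors = FALSE)
  names(out) <- colnames
  out
}

res <- split_and_name_sketch(
  "glasgow-entrepreneurs", "-", 2, c("location", "occupation")
)
print(res)
```

The result is a one-row table with "glasgow" in the location column and "entrepreneurs" in the occupation column.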
Removes blanks (white spaces and tabs) at the beginning and end of all entries of the provided dataframe. Converts all series of white space and/or tab(s) in the body of all dataframe entries into a single white space.
utils_squish_all_columns(df)
df |
A dataframe. |
A dataframe.
## Not run:
df <- data.frame(
  location = " glasgow ",
  occupation = "wine merchant",
  stringsAsFactors = FALSE
)
df <- utils_squish_all_columns(df)
## End(Not run)
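The squishing described above (trim both ends, collapse inner runs of spaces and tabs) can be reproduced in base R. This is a rough sketch only: squish_all_columns_sketch is a hypothetical name, and leaving non-character columns untouched is an assumption rather than documented behaviour.

```r
# Hypothetical base-R sketch, not the package's own implementation.
squish_all_columns_sketch <- function(df) {
  df[] <- lapply(df, function(col) {
    # Assumption: only character columns are modified.
    if (!is.character(col)) return(col)
    # Collapse runs of spaces/tabs to one space, then trim both ends.
    trimws(gsub("[ \t]+", " ", col))
  })
  df
}

df_sketch <- data.frame(
  location = "  glasgow \t",
  occupation = "wine \t  merchant",
  page = 71L,
  stringsAsFactors = FALSE
)

squished <- squish_all_columns_sketch(df_sketch)
print(squished)
```

After the call, location holds "glasgow" and occupation holds "wine merchant", while the integer page column is returned unchanged.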