Generator from American Community Survey (ACS) Geodatabases

Introduction

The American Community Survey (ACS) offers geodatabases with geographic information and associated data of interest to researchers in the area. The goal of geogenr is to facilitate access to this information through functions that allow us to select the geodatabases that interest us, download them, access the information they contain, filter it and export it in various formats so that we can process it with other tools if required.

Other packages are available that are very useful to access the same data, such as tidycensus, which works in an integrated way with tigris. The main characteristics of geogenr that distinguish it from other proposals are the following:

  • it works locally, once available geodatabases are downloaded (can be downloaded using the package);

  • supports access at the level of group of variables integrated in a layer, instead of at the level of variable or vector of variables;

  • decomposes ACS composite variables into structured fields;

  • allows to directly integrate variables of several years;

  • and It allows us to export the data to other formats.

The rest of this document is structured as follows: First, the starting data are presented. Then, an illustrative example of how the package works is developed. Finally, the document ends with conclusions.

American Community Survey 5-Year Estimates

The package is based on the geodatabases available on the TIGER/Line with Selected Demographic and Economic Data web page. For each year (as of 2012) a list of geodatabases appears under two sections:

  • Legal and Administrative Areas;

  • Statistical Areas.

These geodatabases bring together geography from the TIGER/Line Shapefiles and data from the American Community Survey (ACS) 5-year estimates.

Each ACS geodatabase is structured in layers: a geographic layer, a metadata layer, and the rest are data layers. The data layers have a matrix form, the rows are indexed by instances of the geographic layer, the columns by variables defined in the metadata layer, the cells are numeric values. Here we have an example:

GEOID B01001e1 B01001m1 B01001e2 B01001m2 B01001e3 B01001m3 B01001e4 B01001m4 ...

16000US0100100 218 165 92 114 10 16 18 30

16000US0100124 2582 24 1313 98 45 37 14 19

16000US0100460 4374 24 1963 158 144 76 105 68

16000US0100484 641 159 326 89 10 17 16 11

16000US0100676 295 102 143 55 7 11 14 17

16000US0100820 32878 57 16236 453 1159 257 1151 209

...

Some of the defined variables are shown below.

Short_Name Full_Name

B01001e1 SEX BY AGE: Total: Total Population -- (Estimate)

B01001m1 SEX BY AGE: Total: Total Population -- (Margin of Error)

B01001e2 SEX BY AGE: Male: Total Population -- (Estimate)

B01001m2 SEX BY AGE: Male: Total Population -- (Margin of Error)

B01001e3 SEX BY AGE: Male: Under 5 years: Total Population -- (Estimate)

B01001m3 SEX BY AGE: Male: Under 5 years: Total Population -- (Margin of Error)

B01001e4 SEX BY AGE: Male: 5 to 9 years: Total Population -- (Estimate)

B01001m4 SEX BY AGE: Male: 5 to 9 years: Total Population -- (Margin of Error)

...

Each variable (Short_Name) corresponds to combinations of various field values separated by a separator (:), forming a string (Full_Name). The field name of each value is not available but the topics included are detailed on the web page Subjects Included in the Survey. There are tens of thousands of variables of these characteristics that, in addition to the metadata layer, can be found on the TIGER/Line with Selected Demographic and Economic Data Record Layouts web page. For each combination of values, one variable associated with the estimate and another with the margin of error are defined. Within each layer, variables can be considered in groups, defined by the first part of the Full Name (for example UNWEIGHTED SAMPLE HOUSING UNITS and SEX BY AGE).

A module of geogenr package analyses the components of Full_name, structuring them in fields; and it allows access to variables in groups.

An illustrative example

To work with this data with the geogenr package we distinguish three phases:

  • obtaining the data,

  • data selection,

  • and generation of results,

which are developed below.

Once the result structure is generated, we can export it or define queries on it.

Obtaining the data

The data is available in the form of a geodatabase. One geodatabase for each area in each of the two area groups.

In this example, first, we select and download the ACS geodatabases using the functions offered by the package. When we have them in the same folder (they can be distributed in subfolders), we can access the information they contain to select the one that interests us.

We create an object of class acs_5yr, indicating a folder where we will download the geodatabases.

library(geogenr)

dir <- system.file("extdata/acs_5yr", package = "geogenr")

ac <- acs_5yr(dir)

We can query the available geodatabases by group, area, subject and year using the methods offered by the object.

ac |>
  get_area_groups()
#> [1] "Legal and Administrative Areas" "Statistical Areas"
ac |>
  get_areas(group = "Legal and Administrative Areas")
#>  [1] "American Indian/Alaska Native/Native Hawaiian Area"
#>  [2] "Alaska Native Regional Corporation"                
#>  [3] "Congressional District (116th Congress)"           
#>  [4] "County"                                            
#>  [5] "Place"                                             
#>  [6] "Elementary School District"                        
#>  [7] "Secondary School District"                         
#>  [8] "Unified School District"                           
#>  [9] "State"                                             
#> [10] "State Legislative Districts Upper Chamber"         
#> [11] "State Legislative Districts Lower Chamber"         
#> [12] "Code Tabulation Area"

ac |>
  get_area_years(area = "Alaska Native Regional Corporation")
#> [1] "2013" "2014" "2015" "2016" "2017" "2018" "2019" "2020" "2021"

Once we decide the area to work with, we download the files if they are not already downloaded. There are also functions available to consult previously downloaded data.

ac <- ac |>
  select_area_files("Alaska Native Regional Corporation", 2020:2021)

files <- ac |>
  download_selected_files(unzip = FALSE)
#> [1] TRUE TRUE

The data is downloaded to the working folder associated with the object. We can indicate that they are classified by year or by area. We can also indicate that it will be automatically unzipped once downloaded. In this case they are explicitly unzipped.

files <- ac |>
  unzip_files()

With these operations we already have the data available locally.

Data selection

Once unzipped, we can consult the available areas and years.

ac |>
  get_available_areas()
#> [1] "Alaska Native Regional Corporation"

ac |>
  get_available_area_years(area = "Alaska Native Regional Corporation")
#> [1] "2020" "2021"

For each area we can query the topics of the available reports.

ac |>
  get_available_area_topics("Alaska Native Regional Corporation")
#>  [1] "X01 Age And Sex"                     "X02 Race"                           
#>  [3] "X03 Hispanic Or Latino Origin"       "X04 Ancestry"                       
#>  [5] "X05 Foreign Born Citizenship"        "X06 Place Of Birth"                 
#>  [7] "X07 Migration"                       "X08 Commuting"                      
#>  [9] "X09 Children Household Relationship" "X10 Grandparents Grandchildren"     
#> [11] "X11 Household Family Subfamilies"    "X12 Marital Status And History"     
#> [13] "X13 Fertility"                       "X14 School Enrollment"              
#> [15] "X15 Educational Attainment"          "X16 Language Spoken At Home"        
#> [17] "X17 Poverty"                         "X18 Disability"                     
#> [19] "X19 Income"                          "X20 Earnings"                       
#> [21] "X21 Veteran Status"                  "X22 Food Stamps"                    
#> [23] "X23 Employment Status"               "X24 Industry Occupation"            
#> [25] "X25 Housing Characteristics"         "X26 Group Quarters"                 
#> [27] "X27 Health Insurance"                "X28 Computer And Internet Use"      
#> [29] "X99 Imputation"

We can select one or more topics to obtain the reports they contain.To select them, we create an object of class acs_5yr_topic. We can also select the years for which reports are included; by default all available years are considered.

act <- ac |>
  as_acs_5yr_topic("Alaska Native Regional Corporation",
                   topic = "X01 Age And Sex")

Once a topic has been selected, we can consult the available reports or subreports.

act |>
  get_report_names()
#> [1] "B01001-Sex By Age"        "B01002-Median Age By Sex"
#> [3] "B01003-Total Population"

We can focus on a report or subreport, we can also work with all the reports contained in the topic.

In this case we are going to work with the entire topic.

Generation of results

Once we have obtained a group of reports, we can obtain the associated data in various formats.

acs_5yr_geo class format

In this format, three layers are defined:

  • data with geographical information and and the names and values of the variables for each instance,
  • metadata with the description of the variables,
  • origin with the metadata about the origin of the data.
geo <- act |>
  as_acs_5yr_geo()

This format allows us to perform simple queries using the metadata and the geographic layer.

We obtain and consult the content of the metadata: structured description of the variables.

metadata <- geo |>
  get_metadata()

metadata
#> # A tibble: 1,436 × 12
#>    variable year  Short_Name Full_Name   report subreport report_var report_desc
#>    <chr>    <chr> <chr>      <chr>       <chr>  <chr>          <int> <chr>      
#>  1 V0001    2020  B01001Ae1  Sex By Age… B01001 A                  1 Sex By Age…
#>  2 V0002    2020  B01001Ae10 Sex By Age… B01001 A                 10 Sex By Age…
#>  3 V0003    2020  B01001Ae11 Sex By Age… B01001 A                 11 Sex By Age…
#>  4 V0004    2020  B01001Ae12 Sex By Age… B01001 A                 12 Sex By Age…
#>  5 V0005    2020  B01001Ae13 Sex By Age… B01001 A                 13 Sex By Age…
#>  6 V0006    2020  B01001Ae14 Sex By Age… B01001 A                 14 Sex By Age…
#>  7 V0007    2020  B01001Ae15 Sex By Age… B01001 A                 15 Sex By Age…
#>  8 V0008    2020  B01001Ae16 Sex By Age… B01001 A                 16 Sex By Age…
#>  9 V0009    2020  B01001Ae17 Sex By Age… B01001 A                 17 Sex By Age…
#> 10 V0010    2020  B01001Ae18 Sex By Age… B01001 A                 18 Sex By Age…
#> # ℹ 1,426 more rows
#> # ℹ 4 more variables: measure <chr>, item1 <chr>, item2 <chr>, group <chr>

We filter the metadata using the functions of the dplyr package.

metadata <-
  dplyr::filter(
    metadata,
    item2 == "Female" &
      group == "People Who Are American Indian And Alaska Native Alone" &
      measure == "estimate"
  )

We obtain a new object whose data corresponds to the metadata that we have obtained as a result of the selection operations.

geo2 <- geo |>
  set_metadata(metadata)

geo2 |>
  get_metadata()
#> # A tibble: 2 × 12
#>   variable year  Short_Name Full_Name    report subreport report_var report_desc
#>   <chr>    <chr> <chr>      <chr>        <chr>  <chr>          <int> <chr>      
#> 1 V0671    2020  B01002Ce3  Median Age … B01002 C                  3 Median Age…
#> 2 V1389    2021  B01002Ce3  Median Age … B01002 C                  3 Median Age…
#> # ℹ 4 more variables: measure <chr>, item1 <chr>, item2 <chr>, group <chr>

We work with the new object, for example, by defining a new variable and graphing it using the associated geographic data.

geo_layer <- geo2 |> 
  get_geo_layer()

geo_layer$faiana21vs20 <- 100 * (geo_layer$V1389 - geo_layer$V0671) / geo_layer$V0671
plot(sf::st_shift_longitude(geo_layer[, "faiana21vs20"]))

Additionally, we can export any of these objects in GeoPackage format to use other query tools, such as QGIS.

dir <- tempdir()
file <- geo |>
  as_GeoPackage(dir)

sf::st_layers(file)
#> Driver: GPKG 
#> Available layers:
#>   layer_name geometry_type features fields crs_name
#> 1       data Multi Polygon       12   1453    NAD83
#> 2   metadata            NA     1436     12     <NA>
#> 3     origin            NA        2      6     <NA>

rolap package objects

We can export reports from a theme as flat_table or star_database objects from the rolap package, which allow us to enrich them with additional information or work directly with OLAP tools.

We are going to export them as an object of the star_database class.

st <- act |>
  as_star_database()

Below are the tables that make up the star ROLAP design obtained.

st_dm <- st |>
  rolap::as_dm_class(pk_facts = FALSE)
st_dm |> 
  dm::dm_draw(rankdir = "LR", view_type = "all")

We also show the number of instances of each table.

l_db <- st |>
  rolap::as_tibble_list()

names <- sort(names(l_db))
for (name in names){
  cat(sprintf("name: %s, %d rows\n", name, nrow(l_db[[name]])))
}
#> name: anrc, 8572 rows
#> name: dim_what, 359 rows
#> name: dim_when, 2 rows
#> name: dim_where, 12 rows

We can also export it to any RDBMS.

Conclusions

The American Community Survey (ACS) offers geodatabases with geographic information and associated data of interest to researchers in the area. These data can be accessed through various alternatives in which we must indicate the year and variable names. Due to the large number of variables and their structure, this operation is not easy.

The geogenr package offers an alternative that allows us to download the geodatabases that are considered necessary and access the variables by selecting data layers and logical groups of variables. Additionally, it allows us to obtain data in various formats to consult them directly or use other tools.