---
title: "Introduction to the IPUMS API for R Users"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to the IPUMS API for R Users}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r, echo=FALSE, results="hide"}
library(vcr)

vcr_dir <- "fixtures"

have_api_access <- TRUE

if (!nzchar(Sys.getenv("IPUMS_API_KEY"))) {
  if (dir.exists(vcr_dir) && length(dir(vcr_dir)) > 0) {
    # Fake API token to fool ipumsr API functions
    Sys.setenv("IPUMS_API_KEY" = "foobar")
  } else {
    # If there are no mock files nor API token, can't run API tests
    have_api_access <- FALSE
  }
}

vcr_configure(
  filter_sensitive_data = list(
    "<<<IPUMS_API_KEY>>>" = Sys.getenv("IPUMS_API_KEY")
  ),
  write_disk_path = vcr_dir,
  dir = vcr_dir
)

check_cassette_names()

modify_ready_extract_cassette_file <- function(cassette_file_name,
                                               fixture_path = NULL,
                                               n_requests = 1) {
  fixture_path <- fixture_path %||% vcr::vcr_test_path("fixtures")

  ready_extract_cassette_file <- file.path(
    fixture_path, cassette_file_name
  )

  ready_lines <- readLines(ready_extract_cassette_file)
  request_lines <- which(grepl("^- request:", ready_lines))

  start_line <- request_lines[length(request_lines) - n_requests + 1]

  writeLines(
    c(
      ready_lines[[1]],
      ready_lines[start_line:length(ready_lines)]
    ),
    con = ready_extract_cassette_file
  )
}
```

The IPUMS API provides two asset types, both of which are supported by
ipumsr:

-   **IPUMS extract** endpoints can be used to submit extract requests
    for processing and download completed extract files.
-   **IPUMS metadata** endpoints can be used to discover and explore
    available IPUMS data as well as retrieve codes, names, and other
    extract parameters necessary to form extract requests.

Use of the IPUMS API enables the adoption of a programmatic workflow
that can help users to:

-   Precisely recreate the specifications of previous extract requests,
    making analysis scripts reproducible and self-contained
-   Save extract request definitions that can be shared with others
    without violating IPUMS conditions
-   Integrate the extract download process with functions to load data
    into R
-   Quickly identify and explore available IPUMS data sources

The basic workflow for interacting with the IPUMS API is as follows:

1.  [Define](#define) the parameters of an extract request
2.  [Submit](#submit) the extract request to the IPUMS API
3.  [Wait](#wait) for an extract to complete
4.  [Download](#download) a completed extract

Before getting started, we'll load the necessary packages for the
examples in this vignette:

```{r, message=FALSE}
library(ipumsr)
library(dplyr)
library(purrr)
```

## API availability

IPUMS **extract** support is currently available via API for the
following collections:

-   IPUMS USA
-   IPUMS CPS
-   IPUMS International
-   IPUMS Time Use (ATUS, AHTUS, MTUS)
-   IPUMS Health Surveys (NHIS, MEPS)
-   IPUMS NHGIS

Note that this support only includes data available via a collection's
extract engine. Many collections provide additional data via direct
download, but these products are not supported by the IPUMS API.

IPUMS **metadata** support is currently available via API for the
following collections:

-   IPUMS NHGIS

API support will continue to be added for more collections in the
future. You can check general API availability for all IPUMS collections
with `ipums_data_collections()`.

```{r}
ipums_data_collections()
```

Note that the tools in ipumsr may not necessarily support all the functionality
currently supported by the IPUMS API. See the [API
documentation](https://developer.ipums.org/docs/apiprogram/) for more
information about its latest features.

## Set up your API key {#set-key}

To interact with the IPUMS API, you'll need to register for access with
the IPUMS project you'll be using. If you have not yet registered, you
can find the link to register for each project at the top of its website, which 
can be accessed from [ipums.org](https://www.ipums.org).

Once you're registered, you'll be able to [create an API
key](https://account.ipums.org/api_keys).

By default, ipumsr API functions assume that your key is stored in the
`IPUMS_API_KEY` environment variable. You can also provide your key
directly to these functions, but storing it in an environment variable
saves you some typing and helps prevent you from inadvertently sharing
your key with others (for instance, on GitHub).

You can save your API key to the `IPUMS_API_KEY` environment variable
with `set_ipums_api_key()`. To save your key for use in future sessions,
set `save = TRUE`. This will add your API key to your `.Renviron` file
in your user home directory.

```{r, eval=FALSE}
# Save key in .Renviron for use across sessions
set_ipums_api_key("paste-your-key-here", save = TRUE)
```

The rest of this vignette assumes you have obtained an API key and
stored it in the `IPUMS_API_KEY` environment variable.

## Define an extract request {#define}

ipumsr contains two extract definition functions used to specify the parameters
of a new extract request from scratch.

  -   For microdata projects (USA, CPS, International, Time Use, 
      and Health Surveys), use `define_extract_micro()`.
  -   For NHGIS, use `define_extract_nhgis()`.

When you define an extract request, you can specify the data to be
included in the extract and indicate the desired format and layout.

For instance, the following defines a simple IPUMS USA extract request
for the `AGE`, `SEX`, `RACE`, `STATEFIP`, and `MARST` variables from the
2018 and 2019 American Community Survey (ACS):

```{r}
usa_extract_definition <- define_extract_micro(
  collection = "usa",
  description = "USA extract for API vignette",
  samples = c("us2018a", "us2019a"),
  variables = c("AGE", "SEX", "RACE", "STATEFIP", "MARST")
)

usa_extract_definition
```

The exact extract definition options vary across collections, but all
collections can be used with the same general workflow. For more details
on the available extract definition options, see the associated
[microdata](ipums-api-micro.html) and [NHGIS](ipums-api-nhgis.html)
vignettes.

For the purposes of demonstrating the overall workflow, we will continue
to work with the sample IPUMS USA extract definition created above.

### Extract request objects

`define_extract_micro()` and `define_extract_nhgis()` always produce an 
`ipums_extract` object, which can be handled by other API functions 
(see `?ipums_extract`). Furthermore, these objects will have a subclass for 
the particular collection with which they are associated.

```{r}
class(usa_extract_definition)
```

Many of the specifications for a given extract request object can be
accessed by indexing the object:

```{r}
names(usa_extract_definition$samples)
names(usa_extract_definition$variables)
usa_extract_definition$data_format
```

`ipums_extract` objects also contain information about the extract
request's processing status and its assigned extract number, which
serves as an identifier for the extract request. Since this extract
request is still unsubmitted, it has no request number:

```{r}
usa_extract_definition$status
usa_extract_definition$number
```

To obtain the data requested in the extract definition, we must first
submit it to the IPUMS API for processing.

## Submit an extract request {#submit}

```{r, include=FALSE}
insert_cassette("submit-placeholder-extract-usa")

# We submit these extracts so that the output of requests for things like
# `get_extract_history` below is more "natural".
# Otherwise the output is a bunch of duplicate extract requests for the
# primary extract in this vignette, as this vignette usually gets rebuilt
# several times during editing before it is complete.
submit_extract(
  define_extract_micro(
    "usa",
    description = "Data from 2017 PRCS",
    samples = "us2017b",
    variables = c("RACE", "YEAR")
  )
)

submit_extract(
  define_extract_micro(
    "usa",
    description = "Data from long ago",
    samples = "us1880a",
    variables = c("SEX", "AGE", "LABFORCE")
  )
)

eject_cassette("submit-placeholder-extract-usa")
```

```{r, echo=FALSE, results="hide", message=FALSE}
insert_cassette("submit-extract")
```

To submit an extract definition, use `submit_extract()`.

If no errors are detected in the extract definition, a submitted extract
request will be returned with its assigned number and status. Storing
the returned object can be useful for checking the extract request's
status later.

```{r}
usa_extract_submitted <- submit_extract(usa_extract_definition)
```

The extract number will be stored in the returned object:

```{r}
usa_extract_submitted$number
usa_extract_submitted$status
```

Note that some fields of a submitted extract may be automatically
updated by the API upon submission. For instance, for microdata
extracts, additional preselected variables may be added to the extract
even if they weren't specified explicitly in the extract definition.

```{r}
names(usa_extract_submitted$variables)
```

If you forget to store the updated extract object returned by
`submit_extract()`, you can use the `get_last_extract_info()` helper to
request the information for your most recent extract request for a given
collection:

```{r}
usa_extract_submitted <- get_last_extract_info("usa")

usa_extract_submitted$number
```

```{r, echo=FALSE, results="hide", message=FALSE}
eject_cassette("submit-extract")
```

## Wait for an extract request to complete {#wait}

It may take some time for the IPUMS servers to process your extract
request. You can ensure that an extract has finished processing before
you attempt to download its files by using `wait_for_extract()`. This
polls the API regularly until processing has completed (by default, each
interval increases by 10 seconds). It then returns an `ipums_extract`
object containing the completed extract definition.

```{r, echo=FALSE, results="hide", message=FALSE}
insert_cassette("wait-for-extract")

usa_extract_complete <- wait_for_extract(usa_extract_submitted)

eject_cassette("wait-for-extract")

# Leave an extract request to simulate wait_for_extract() output for USA
modify_ready_extract_cassette_file(
  "wait-for-extract.yml",
  fixture_path = "fixtures",
  n_requests = 2
)
```

```{r, echo=FALSE, results="hide", message=FALSE}
insert_cassette("wait-for-extract")
```

```{r}
usa_extract_complete <- wait_for_extract(usa_extract_submitted)
```

```{r}
usa_extract_complete$status
```

```{r}
# `download_links` should be populated if the extract is ready for download
names(usa_extract_complete$download_links)
```

```{r, echo=FALSE, results="hide", message=FALSE}
eject_cassette("wait-for-extract")
```

Note that `wait_for_extract()` will tie up your R session until your
extract is ready to download. While this is fine in a strictly
programmatic workflow, it may be frustrating when working interactively,
especially for large extracts or when the IPUMS servers are busy.

In these cases, you can manually check whether an extract is ready for
download with `is_extract_ready()`. As long as this returns `TRUE`, you
should be able to download your extract's files.

```{r, echo=FALSE, results="hide", message=FALSE}
insert_cassette("extract-ready")
```

```{r}
is_extract_ready(usa_extract_submitted)
```

```{r, echo=FALSE, results="hide", message=FALSE}
eject_cassette("extract-ready")
```

For a more detailed status check, provide the extract's collection and
number to `get_extract_info()`. This returns an `ipums_extract` object
reflecting the requested extract definition with the most current
status. The `status` of a submitted extract will be one of `"queued"`,
`"started"`, `"produced"`, `"canceled"`, `"failed"`, or `"completed"`.

```{r, echo=FALSE, results="hide", message=FALSE}
insert_cassette("check-extract-info")
```

```{r}
usa_extract_submitted <- get_extract_info(usa_extract_submitted)

usa_extract_submitted$status
```

```{r, echo=FALSE, results="hide", message=FALSE}
eject_cassette("check-extract-info")
```

Note that extracts are removed from the IPUMS servers after a set period
of time (72 hours for microdata collections, 2 weeks for IPUMS NHGIS).
Therefore, an extract that has a `"completed"` status may still be
unavailable for download.

`is_extract_ready()` will alert you if the extract has expired and needs
to be resubmitted. Simply use `submit_extract()` to resubmit an extract
request. Note that this will produce a *new* extract (with a new extract
number), even if the extract definition is identical.

## Download an extract {#download}

Once your extract has finished processing, use `download_extract()` to
download the extract's data files to your local machine. This will
return the path to the downloaded file(s) required to load the data into
R.

For microdata collections, this will be the path to the DDI codebook
(.xml) file, which can be used to read the associated data (contained in
a .dat.gz file).

For NHGIS, this will be a path to the .zip archive containing the
requested data files and/or shapefiles.

```{r, eval=FALSE}
# By default, downloads to your current working directory
filepath <- download_extract(usa_extract_submitted)
```

The files produced by `download_extract()` can be passed directly into
the reader functions provided by ipumsr. For instance, for microdata
projects:

```{r, eval=FALSE}
ddi <- read_ipums_ddi(filepath)
micro_data <- read_ipums_micro(ddi)
```

If instead you're working with an NHGIS extract, use `read_nhgis()` or
`read_ipums_sf()`.

See the [associated vignette](ipums-read.html) for more information
about loading IPUMS data into R.

## Get info on past extracts {#recent}

To retrieve the definition corresponding to a particular extract,
provide its collection and number to `get_extract_info()`. These can be
provided either as a single string of the form `"collection:number"` or
as a length-2 vector: `c("collection", number)`. Several other API
functions support this syntax as well.

```{r, echo=FALSE, results="hide", message=FALSE}
insert_cassette("check-extract-history")
```

```{r}
usa_extract <- get_extract_info("usa:47")

# Alternatively:
usa_extract <- get_extract_info(c("usa", 47))

usa_extract
```

If you know you made a specific extract definition in the past, but you
can't remember the exact number, you can use `get_extract_history()` to
peruse your recent extract requests for a particular collection.

By default, this returns your 10 most recent extract requests as a list
of `ipums_extract` objects. You can adjust how many requests to retrieve
with the `how_many` argument:

```{r}
usa_extracts <- get_extract_history("usa", how_many = 3)

usa_extracts
```

Because this is a list of `ipums_extract` objects, you can operate on
them with the API functions that have been introduced already.

```{r}
is_extract_ready(usa_extracts[[2]])
```

You can also iterate through your extract history to find extracts with
particular characteristics. For instance, we can use `purrr::keep()` to
find all extracts that contain a certain variable or are ready for
download:

```{r}
purrr::keep(usa_extracts, ~ "MARST" %in% names(.x$variables))
purrr::keep(usa_extracts, is_extract_ready)
```

Or we can use the `purrr::map()` family to browse certain values:

```{r}
purrr::map_chr(usa_extracts, ~ .x$description)
```

If you regularly use only a single IPUMS collection, you can save
yourself some typing by setting that collection as your default.
`set_ipums_default_collection()` will save a specified collection to the
value of the `IPUMS_DEFAULT_COLLECTION` environment variable. If you
have a default collection set, API functions will use that collection in
all requests, assuming no other collection is specified.

```{r, eval=FALSE}
set_ipums_default_collection("usa") # Set `save = TRUE` to store across sessions
```

```{r, echo=FALSE, results="hide", message=FALSE}
set_ipums_default_collection("usa")
```

```{r}
# Check the default collection:
Sys.getenv("IPUMS_DEFAULT_COLLECTION")
```

```{r}
# Most recent USA extract:
usa_last <- get_last_extract_info()

# Request info on extract request "usa:10"
usa_extract_10 <- get_extract_info(10)

# You can still request other collections as usual:
cps_extract_10 <- get_extract_info("cps:10")
```


```{r, echo=FALSE, results="hide", message=FALSE}
eject_cassette("check-extract-history")
```

## Share an extract definition {#share}

One exciting feature enabled by the IPUMS API is the ability to share a
standardized extract definition with other IPUMS users so that they can
create an identical extract request themselves. The terms of use for
most IPUMS collections prohibit the public redistribution of IPUMS data, but
don't prohibit the sharing of data extract definitions.

ipumsr facilitates this type of sharing with `save_extract_as_json()`
and `define_extract_from_json()`, which write and read `ipums_extract`
objects to and from a standardized JSON-formatted file.

```{r, eval=FALSE}
usa_extract_10 <- get_extract_info("usa:10")
save_extract_as_json(usa_extract_10, file = "usa_extract_10.json")
```

At this point, you can send `"usa_extract_10.json"` to another user to
allow them to create a duplicate `ipums_extract` object, which they can
load and submit to the API themselves (as long as they have [API
access](#set-key)).

```{r, eval=FALSE}
clone_of_usa_extract_10 <- define_extract_from_json("usa_extract_10.json")
usa_extract_10_resubmitted <- submit_extract(clone_of_usa_extract_10)
```

Note that the code in the previous chunk assumes that the file is saved
in the current working directory. If it's saved somewhere else, replace
`"usa_extract_10.json"` with the full path to the file.

## Revise a previous extract request

Occasionally, you may want to modify an existing extract definition
(e.g. to update an analysis with new data). The easiest way to do so is
to add the new specifications to the `define_extract_*()` code that
produced the original extract definition. This is why we highly
recommend that you save this code somewhere where it can be accessed and
updated in the future.

However, there are cases where the original extract definition code does
not exist (e.g. if the extract was created using the online IPUMS
extract system). In this case, the best approach is to view the extract
definition with `get_extract_info()` and create a new extract definition
(using a `define_extract_*()` function) that reproduces that definition
along with the desired modifications. While this may be a bit tedious
for complex extract definitions, it is a one-time investment that will
make any future updates to the extract definition much easier.

Previously, we encouraged users to use the helpers `add_to_extract()`
and `remove_from_extract()` when modifying extracts. We now encourage
you to re-write extract definitions because they improve
reproducibility: extract definition code will always be more clear and
stable if it is written explicitly, rather than based only on an old
extract number. These two functions may be retired in the future.

## Putting it all together

The core API functions in ipumsr are compatible with one another such
that they can be combined into a single pipeline that requests,
downloads, and reads your extract data into an R data frame:

```{r, eval=FALSE}
usa_data <- define_extract_micro(
  "usa",
  "USA extract for API vignette",
  samples = c("us2018a", "us2019a"),
  variables = c("AGE", "SEX", "RACE", "STATEFIP")
) %>%
  submit_extract() %>%
  wait_for_extract() %>%
  download_extract() %>%
  read_ipums_micro()
```

Note that for NHGIS extracts that contain both data and shapefiles, a
single file will need to be selected before reading, as
`download_extract()` will return the path to each file. For instance,
for a hypothetical `nhgis_extract` that contains both tabular and
spatial data:

```{r, eval=FALSE}
nhgis_data <- download_extract(nhgis_extract) %>%
  purrr::pluck("data") %>% # Select only the tabular data file to read
  read_nhgis()
```

Not only does this API workflow allow you to obtain IPUMS data without
ever leaving your R environment, but it also allows you to retain a
reproducible record of your process. This makes it much easier to
document your workflow, collaborate with other researchers, and update
your analysis in the future.