In September 2018 I created an account with the Integrated Public Use
Microdata Series (IPUMS) data library at the University of
Minnesota..1 My objective then was to understand the
evolution of accounting and auditing as a percent of the US workforce
over time. In redoing this in March 2020, it seems sensible to start by
summarizing the large IPUMS data extract into a smaller table of
occupation codes by year and save this as OCC1950.rda
. Then
can then analyze that much smaller dataset in a separate vignette.
After logging into the IPUMS website, the two primary options are “Select Data” and “My Data”. The latter shows previous selections with the following options:
In 2018 from “Select Data”, I chose “Select harmonized variables” > Person > Work.
From here, I first selected “OCC1950
= Occupation 1950
basis” and got something useful doing that.
I tried other similar extracts as outlined in the Appendix below. I
decided the best extract for my purposes seemed to be the one I started
with using OCC1950
. The rest of this vignette describes
summarizing that into a matrix estimating the percent of the US
population by year in each OCC1950
code in March 2020.
# The code in this snippet is a slight modification
# of the R code from usa_00006.R, 2020-03-18.
if (!require("ipumsr")){
msg <- paste("Reading IPUMS data into R",
"requires the ipumsr package. It can be",
"installed using:\ninstall.packages('ipumsr')")
stop(msg)
}
## Loading required package: ipumsr
# NOTE: base::dir works differently
# within an R Markdown file than
# from an ordinary command prompt,
# at least under macOS 10.15.3 on 2020-03-18.
# With getwd() = the parent directory of
# "~/fda/vignettes",
# When I highlighted "dir()" and executed it
# using <command + enter>, I got the contents
# of "~/fda/vignettes".
# When I executed "dir()" outside the *.Rmd file,
# I got the parent directory.
dir()
## [1] "IPUMS-references.bib" "UpdatingUSGDPpresidents.R"
## [3] "UpdatingUSGDPpresidents.Rmd" "UpdatingUSGDPpresidents.html"
## [5] "nuclearArmageddon.R" "nuclearArmageddon.Rmd"
## [7] "nuclearArmageddon.html" "nuclearProliferation.svg"
## [9] "time2Armgeddon.svg" "updateOCC1950.R"
## [11] "updateOCC1950.Rmd" "update_nuclearWeaponStates.R"
## [13] "update_nuclearWeaponStates.Rmd" "update_nuclearWeaponStates.html"
## [1] "IPUMS-references.bib" "UpdatingUSGDPpresidents.R"
## [3] "UpdatingUSGDPpresidents.Rmd" "UpdatingUSGDPpresidents.html"
## [5] "nuclearArmageddon.R" "nuclearArmageddon.Rmd"
## [7] "nuclearArmageddon.html" "nuclearProliferation.svg"
## [9] "time2Armgeddon.svg" "updateOCC1950.R"
## [11] "updateOCC1950.Rmd" "update_nuclearWeaponStates.R"
## [13] "update_nuclearWeaponStates.Rmd" "update_nuclearWeaponStates.html"
# copy and paste the following
# from "Accountants-IPUMS.Rmd" in R Studio
# into the Console below:
IPUMSdir <- 'IPUMS'
(ddiXml <- dir(IPUMSdir, pattern="usa_00006.xml",
full.names = TRUE))
## character(0)
# OR execute the following inside
# "Accountants-IPUMS.Rmd" in R Studio:
if(length(ddiXml)!=1){
# print(ddiXml <- file.path('..', '..', 'IPUMS'))
print(ddiXml <- dir(pattern="usa_00006.xml",
full.names = TRUE))
# NOTE: This worked under macOS 10.15.3
# with R 3.6.3 and RStudio 1.2.5033.
# It has not been tested on other platforms.
}
## character(0)
Some of the computations below are fairly long, because this dataset
is so large. This vignette is structured so nearly all the R code in
this vignette will be skipped in routine package testing and would only
be executed if a user arranged so a file containing the desired IPUMS
data extract can be found. We do this here by creating a variable
readAndCompute
that is FALSE
by default and is
set to TRUE
only when the the desired data are available
for manual processing.
readAndCompute <- FALSE
if((length(ddiXml)==1) && (!fda::CRAN())){
readAndCompute <- TRUE
ddiDat <- read_ipums_ddi(ddiXml)
(readDatTime <- system.time(
IPUMSdata <- read_ipums_micro(ddiDat)
))
}
On 2020-03-19 this took roughly 40 seconds on a MacBook Pro with a 2.8 GHz quad-core Intel core i7 with 16 GB RAM.
IPUMSdata
is an object with a huge number of rows and 8
columns:
IPUMSdata
is an object of classes tbl_df
,
tbl
and data.frame
with over 114 million rows
for dat00001.xml
.
That’s too few rows to have one row for each person in the most recent census and certainly too few to have one row for each household in all the census since 1850:
if(readAndCompute){
print(etYr <- system.time(
tbl_year <- table(IPUMSdata$YEAR)
))
plot(tbl_year)
tbl_year
}
The key point from the the print of tbl_year
and this
plot is that this dataset includes data from every census except 1890
plus for each year between 2000 and 2016.
Before proceeding, let’s check IPUMSdata
for missing
values:
if(readAndCompute){
print(etNA <- system.time(
nNA <- sapply(IPUMSdata, function(x)sum(is.na(x)))
))
print(nNA)
}
No missing values!
Let’s look at var_desc
for HHWT
:
Let’s look at the distribution of HHWT
:
if(readAndCompute){
print(etQ <- system.time({
rngHHWT <- range(IPUMSdata$HHWT)
qtleHHWT <- quantile(IPUMSdata$HHWT)
}))
print(rngHHWT)
qtleHHWT
}
Let’s also examine the the attributes of OCC1950
:
if(readAndCompute){
print(etCodes <- system.time(
OCC50codes <- attributes(IPUMSdata$OCC1950)
))
str(OCC50codes)
}
We’re especially interested in “labels”:
if(readAndCompute){
# OCCcodes$labels
print(OCC50codes$var_desc)
print(head(OCC50codes$labels))
tail(OCC50codes$labels)
}
The “labels” attribute from OCC1950
provided a translate
table giving English-language names to the numeric codes.
Are all these codes used?
Two codes are not used. What are they?
Let’s sum HHWT
within YEAR
and
OCC1950
:
if(readAndCompute){
print(etOccYr <- system.time(
OccYr <- tapply(IPUMSdata$HHWT,
IPUMSdata[c("OCC1950", "YEAR")], sum)
))
str(OccYr)
}
This is an array of OCC1950
by YEAR
. We
will convert this into a matrix of the proportion of the workforce in
each OCC1950
code with two attributes:
codes
= OCC50codes
workforce
= colSums(YrOcc)
We need the codes
to allow us to make any use of these
data, and workforce
gives us an estimate of the size of the
workforce by year. To confirm the latter, let’s compute it:
All NA
s. One explanation for this is no year has seen
the use of all occupation codes. For example, we should not expect to
see many ““Airplane pilots and navigators” in the nineteenth century!2
To check this, we will table(OCC1950", YEAR)
:
if(readAndCompute){
print(etOY <- system.time(
OY <- with(IPUMSdata, table(OCC1950, YEAR))
))
print(str(OY))
sum(is.na(OccYr) - (OY==0))
}
Wonderful: This says that all NA
s in OccYr
should be 0:
Now let’s repeat the sums by year, comparing with
USGDPpresidents$population.K
:
if(readAndCompute){
(totWts <- colSums(OccYr))
library(Ecdat)
selGDP <- (USGDPpresidents$Year %in% names(totWts))
USpops <- USGDPpresidents[selGDP, ]
ylim <- range(totWts/1e6, USpops$population.K/1e3)
# png('IPUMS HHWT and US Population.png')
plot(names(totWts), totWts/1e6, xlab='',
ylab="millions",
main='sum(HHWT) vs. US Population', las=1)
with(USpops, lines(Year, population.K/1e3))
# dev.off()
}
This plot suggests that HHWT
attempts to weight the
observations so the total matches the US population but has problems for
1970 and every year since 2000. A simple fix is to rescale all the
numbers so they match. Let’s first check by plotting the ratio:
if(readAndCompute){
tot_pop <- (totWts / (USpops$population.K*1000))
plot(USpops$Year, tot_pop, type='b', las=1)
abline(h=1, lty='dotted', col='red')
}
We can make these numbers match by dividing OccYr
by
tot_pop
:
if(readAndCompute){
nOcc <- nrow(OccYr)
Occ1950 <- (OccYr / rep(tot_pop, e=nOcc))
(revTots <- colSums(Occ1950))
plot(USpops$Year, revTots/1e6, type='b', las=1)
}
Great. Now let’s create OCC1950
= proportion of
population:
if(readAndCompute){
OCC1950 <- (Occ1950 / rep(revTots, e=nOcc))
quantile(chkTots <- colSums(OCC1950))
}
Let’s make rownames
= occupation names rather than the
codes:
Let’s save this:
After my initial data extraction, I tried adding OCC
=
“Occupation”. Then “Data cart: Your data extract” said, “2 variable, 32
samples”. Then clicking “View Cart” listed 9 variables, being
YEAR
, DATANUM
, SERIAL
,
HHWT
, GQ
, PERNUM
,
PERWT
, OCC
, and OCC1950
. Then I
clicked, “Create data extract”. This allowed me to further “select data
quality flags” for GQ
, OCC
, and
OCC1950
.
The first time I did this, I ignored the data quality flags. The second time, I requested them.
However, with both the additional OCC
and the data
quality flags, the data set was bigger than I could read into my
computer. So I split that extract into two for seq(1850, 2000, 10) and
for 2001:2016. When I read the 2001:2016 extract, I found that I could
not understand OCC
nor the data quality flags. I decided
that the answer I already had was probably good enough, and it wasn’t
clear if I could learn enough to justify the work of further study of
OCC
and the data quality flags.
In any event, after each data selection, I was given an “estimated size” for the extract (5340 MB for one trial). I entered something to “Describe your extract”. Then I clicked, “Submit extract”. The IPUMS web site responded saying, “Your extract request has been submitted. You will be notified by email at (the email address I had given them) when it has been created.” Under “Data”, it said, “Processing…”.
After a while (47 minutes for one extract on 2018-09-03) I got an email saying my extract was ready for download. I returned to the web site and clicked something under “Data”.
To help with questions about the format, etc., I studied
help(pac=ipumsr)
. I found that the ipumsr
package included seven vignettes. One of those is titled “Introduction
to ipumsr
- IPUMS Data in R”. From that, I learned that
I needed to right-click (ctrl-click
on a Mac) on
DDI
under Codebook
and then select “Save link
as…”. Moreover, I should NOT do this in Safari. Google Chrome worked for
me for this on 2018-09-01 and Firefox worked when I repeated it with a
slightly different extract on 2018-09-03.
When I repeated this 2020-03-18, I downloaded
"usa_00003.dat"
and "use_00005.dat"
consuming
2.3 and 3.8 GB. Codebooks are also available.
Then I followed the instructions in the “Command File” for R. With
one extract, I got, “Error: Error in
read_tokens_(data, tokenizer, col_specs, col_names, locale_
,
: Evaluation error: vector memory exhausted (limit reached?).” After
that, I redid the extract to select less data to file(s) I could
read.