Title: | Datasets and Functions for 'Sampling: Design and Analysis, 3rd Edition' |
---|---|
Description: | Includes all the datasets of 'Sampling: Design and Analysis' (3rd edition by Sharon Lohr) in R format and additional functions for analyzing and graphing probability samples. |
Authors: | Yan Lu [aut, cre], Sharon Lohr [aut] |
Maintainer: | Yan Lu <[email protected]> |
License: | GPL-2 \| GPL-3 |
Version: | 0.1.1 |
Built: | 2024-12-13 06:42:33 UTC |
Source: | CRAN |
Data from the 1992 U.S. Census of Agriculture.
data(agpop)
This data frame contains the following columns:
county name (character variable)
state abbreviation (character variable)
number of acres devoted to farms, 1992
number of acres devoted to farms, 1987
number of acres devoted to farms, 1982
number of farms, 1992
number of farms, 1987
number of farms, 1982
number of farms with 1,000 acres or more, 1992
number of farms with 1,000 acres or more, 1987
number of farms with 1,000 acres or more, 1982
number of farms with 9 acres or fewer, 1992
number of farms with 9 acres or fewer, 1987
number of farms with 9 acres or fewer, 1982
S = south; W = west; NC = north central; NE = northeast
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Data from a without-replacement probability-proportional-to-size sample from agpop data.
data(agpps)
This data frame contains the following columns:
county name (character variable)
state abbreviation (character variable)
number of acres devoted to farms, 1992
number of acres devoted to farms, 1987
number of acres devoted to farms, 1982
number of farms, 1992
number of farms, 1987
number of farms, 1982
number of farms with 1,000 acres or more, 1992
number of farms with 1,000 acres or more, 1987
number of farms with 1,000 acres or more, 1982
number of farms with 9 acres or fewer, 1992
number of farms with 9 acres or fewer, 1987
number of farms with 9 acres or fewer, 1982
S = south; W = west; NC = north central; NE = northeast
size measure used to select the pps sample
inclusion probability for county
sampling weight for county
unit number for indexing joint inclusion probabilities
columns of joint inclusion probabilities (15 columns)
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
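For a without-replacement design such as this one, a sampling weight is normally the reciprocal of the inclusion probability, and a Horvitz-Thompson estimate of a population total is the weighted sum of the sampled values. The sketch below illustrates that relationship in base R. It uses a made-up three-unit sample and hypothetical column names (y, incl_p, wt), not the actual agpps columns or values.

```r
# Hypothetical toy sample; values are illustrative, not taken from agpps.
toy <- data.frame(
  y      = c(120, 300, 80),      # variable of interest (e.g. acres)
  incl_p = c(0.10, 0.25, 0.05)   # inclusion probability pi_i
)

# Sampling weight as the reciprocal of the inclusion probability.
toy$wt <- 1 / toy$incl_p

# Horvitz-Thompson estimate of the population total: sum of w_i * y_i.
ht_total <- sum(toy$wt * toy$y)
ht_total  # 4000
```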
Data from an SRS of size 300 from the 1992 U.S. Census of Agriculture agpop data.
data(agsrs)
Variables are the same as in agpop data.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
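With a simple random sample, the usual estimates are the sample mean, the population total (population size times the mean), and a standard error that includes the finite population correction. A minimal base R sketch follows. The population size N and the data vector are hypothetical and do not come from agsrs.

```r
# Hypothetical SRS of n = 5 values from a population of N = 100 units.
N <- 100
y <- c(4, 6, 8, 10, 12)
n <- length(y)

ybar  <- mean(y)        # estimated population mean
t_hat <- N * ybar       # estimated population total

# Standard error of the mean with finite population correction (1 - n/N).
se_ybar  <- sqrt((1 - n / N) * var(y) / n)
se_total <- N * se_ybar

c(mean = ybar, total = t_hat, se_mean = se_ybar, se_total = se_total)
```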
Data from a stratified random sample of size 300 from the 1992 U.S. Census of Agriculture agpop data.
data(agstrat)
This data frame contains the following columns:
county name (character variable)
state abbreviation (character variable)
number of acres devoted to farms, 1992
number of acres devoted to farms, 1987
number of acres devoted to farms, 1982
number of farms, 1992
number of farms, 1987
number of farms, 1982
number of farms with 1,000 acres or more, 1992
number of farms with 1,000 acres or more, 1987
number of farms with 1,000 acres or more, 1982
number of farms with 9 acres or fewer, 1992
number of farms with 9 acres or fewer, 1987
number of farms with 9 acres or fewer, 1982
S = south; W = west; NC = north central; NE = northeast
random numbers used to select sample in each stratum
sampling weight for each county in sample
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
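In a stratified random sample, the sampling weight of a unit is typically the stratum population size divided by the stratum sample size, so the weights in each stratum sum to that stratum's population size. The sketch below shows this calculation in base R. The strata, stratum sizes, and values are invented for illustration, and the column names are hypothetical rather than those of agstrat.

```r
# Hypothetical stratified sample with two strata.
toy <- data.frame(
  stratum = c("A", "A", "B", "B", "B"),
  y       = c(10, 20, 1, 2, 3),
  stringsAsFactors = FALSE
)

# Hypothetical stratum population sizes N_h.
N_h <- c(A = 50, B = 300)

# Stratum sample size n_h repeated on every row of the sample.
n_row <- ave(toy$y, toy$stratum, FUN = length)

# Weight w = N_h / n_h for each sampled unit.
toy$wt <- unname(N_h[toy$stratum]) / n_row

t_hat <- sum(toy$wt * toy$y)   # estimated population total
sum(toy$wt)                    # 350, the total population size
t_hat                          # 1350
```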
Fictional data for an SRS of 12 algebra classes in a city, from a population of 187 classes.
data(algebra)
This data frame contains the following columns:
class number
number of students in class
score of student on test
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Finger length and height for 3,000 criminals. This data set contains information for the entire population.
data(anthrop)
This data frame contains the following columns:
length of left middle finger (cm)
height (inches)
Macdonell, W. R. (1901). On criminal anthropometry and the identification of criminals. Biometrika 1, 177–227.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Length of left middle finger and height for an SRS of size 200 from anthrop data.
data(anthsrs)
This data frame contains the following columns:
length of left middle finger (cm)
height (inches)
sampling weight
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Finger length and height for a with-replacement unequal-probability sample of size 200 from anthrop data. The probability of selection, $\psi_i$, was proportional to 24 for $y < 65$, 12 for $y = 65$, 2 for $y = 66$ or 67, and 1 for $y > 67$.
data(anthuneq)
This data frame contains the following columns:
length of left middle finger (cm)
height (inches)
sampling weight
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
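The selection rule in the description can be turned into selection probabilities by assigning each population unit its relative size (24, 12, 2, or 1) and dividing by the sum over the population. The sketch below assumes that y is height in whole inches and uses a made-up population of seven heights. The weight formula 1/(n * psi) is the standard one for a with-replacement sample of n draws and is stated here as an assumption about how the weight column was built.

```r
# Hypothetical population of heights (whole inches); not the anthrop data.
height <- c(63, 64, 65, 66, 67, 68, 70)

# Relative selection sizes from the rule in the description.
rel_size <- ifelse(height < 65, 24,
            ifelse(height == 65, 12,
            ifelse(height <= 67, 2, 1)))

# Selection probability psi_i for a single draw; sums to 1.
psi <- rel_size / sum(rel_size)

# Assumed weight for a with-replacement sample of n = 200 draws.
n  <- 200
wt <- 1 / (n * psi)

data.frame(height, rel_size, psi, wt)
```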
Values from all possible SRSs for an artificial population in Chapter 4 of SDA.
data(artifratio)
This data frame contains the following columns:
sample number
first unit in sample
second unit in sample
third unit in sample
fourth unit in sample
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Information from a stratified random sample of Fellows of the American Statistical Association elected between 2000 and 2018. The list of Fellows serving as the population was downloaded from amstat on March 18, 2019. All other information was obtained from public sources.
data(asafellow)
This data frame contains the following columns:
year of award
gender of Fellow (character variable, M = male, F = female)
population size in stratum ($N_h$)
sample size in stratum ($n_h$)
field of employment (character variable)
acad = academia
ind = industry
govt = government
year in which Fellow received terminal degree (year of Ph.D. if applicable, otherwise year of Master's or Bachelor's degree)
= 1 if majored in mathematics as undergraduate
= 0 if did not major in math
= NA if missing
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Audit data used in Chapter 6 of SDA.
data(auditresult)
This data frame contains the following columns:
audit unit
book value of account
probability of selection
audit value of account
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Selection of accounts for audit data used in Chapter 6 of SDA.
data(auditselect)
This data frame contains the following columns:
audit unit
book value of account
cumulative book value
random number 1 selecting account
random number 2 selecting account
random number 3 selecting account
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
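The columns listed above (book value, cumulative book value, and random numbers) match the cumulative-size method for selecting units with probability proportional to size: each account owns a range of integers ending at its cumulative book value, and a random number selects the account whose range contains it. The base R sketch below assumes that this is how the columns are meant to be used. The book values and random numbers are hypothetical.

```r
# Hypothetical book values for four accounts.
bookval <- c(100, 250, 50, 600)
cumbv   <- cumsum(bookval)        # 100 350 400 1000

# Hypothetical random numbers between 1 and the total book value.
draws <- c(75, 380, 990, 350)

# Each draw selects the first account whose cumulative value reaches it.
selected <- sapply(draws, function(r) which(cumbv >= r)[1])
selected                          # 1 3 4 2

# Probability of selection on a single draw, proportional to book value.
psi <- bookval / sum(bookval)
psi[selected]
```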
Population and housing unit estimates for Arizona counties, excluding Maricopa and Pima counties, from the American Community Survey 2018 5-year estimates.
data(azcounties)
This data frame contains the following columns:
county name (character variable, length 15)
county number
population estimate for county
housing unit estimate for county
number of owner-occupied housing units for county
Source: https://data.census.gov/, accessed November 27, 2020.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Statistics on 797 baseball players, compiled by Jenifer Boshes from the rosters of all major league teams in November 2004. Missing values (for variables pball, intwalk, hbp, sacrfly; all other variables have complete data) are coded as NA.
data(baseball)
This data frame contains the following columns:
team played for at the beginning of the season
AL or NL
a unique identifier for each baseball player
player salary in 2004
primary position coded as P, C, 1B, 2B, 3B, SS, RF, LF, or CF
games played
games started
number of innings
number of putouts
number of assists
errors
number of double plays
number of passed balls (only applies to catchers)
number of games that player appeared at bat
number of at bats
number of runs scored
number of hits
number of doubles
number of triples
number of home runs
number of runs batted in
number of stolen bases
number of times caught stealing
number of times walked
number of strikeouts
number of times intentionally walked
number of times hit by pitch
number of sacrifice hits
number of sacrifice flies
grounded into double play
Forman, S. L. (2004). Baseball-reference.com—Major league statistics and information. www.baseball-reference.com (accessed November 2004).
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Data from a homeowner's survey to estimate the total number of books, used in Chapter 5.
data(books)
This data frame contains the following columns:
shelf number
number of books on shelf
number of the book selected
purchase cost of book
replacement cost of book
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Compute a confidence interval for a capture-recapture sample using the method of Cormack (1992).
captureci(xmat, y, alpha)
xmat | Matrix of 0s and 1s with one column per sample, where 1 = in sample and 0 = not in sample. For example, if there are two samples, xmat has two columns; the row (1,0) represents the category of being in sample 1 but not in sample 2. |
y | Number of units corresponding to each row of xmat. |
alpha | Significance level; the interval has confidence level 1 - alpha. The default value is 0.05. |
cell: estimated cell value for the missing count of category (0, 0)
N: the estimated total counts
CI_cell: the estimated confidence interval for the missing category count
CI_N: the estimated confidence interval for total counts
xmat <- cbind(c(1,1,0),c(1,0,1))
y <- c(20,180,80)
captureci(xmat, y, alpha = 0.1)
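For intuition about the point estimates, the two-sample case can be worked by hand. Under an independence loglinear model, the missing (0, 0) cell is estimated by the product of the two single-sample counts divided by the count in both samples, which reproduces the Lincoln-Petersen estimate of the population size. The base R sketch below uses the same xmat and y as the example. It does not call captureci and does not reproduce the confidence interval method of Cormack (1992); the Poisson glm fit is just one standard way to obtain the same point estimate.

```r
xmat <- cbind(c(1, 1, 0), c(1, 0, 1))
y    <- c(20, 180, 80)

# Closed form for two samples.
n1 <- sum(y[xmat[, 1] == 1])                 # caught in sample 1: 200
n2 <- sum(y[xmat[, 2] == 1])                 # caught in sample 2: 100
m  <- y[xmat[, 1] == 1 & xmat[, 2] == 1]     # caught in both: 20
N_lp    <- n1 * n2 / m                       # Lincoln-Petersen estimate: 1000
cell_lp <- N_lp - sum(y)                     # missing (0, 0) cell: 720

# Same estimate from an independence loglinear (Poisson) model:
# the intercept is the log of the expected count in the (0, 0) cell.
dat <- data.frame(y = y, x1 = xmat[, 1], x2 = xmat[, 2])
fit <- glm(y ~ x1 + x2, family = poisson, data = dat)
cell_glm <- exp(coef(fit)[["(Intercept)"]])
N_glm    <- sum(y) + cell_glm

c(cell_lp = cell_lp, cell_glm = cell_glm, N_lp = N_lp, N_glm = N_glm)
```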
Population sizes for each state, from the 1920 U.S. census. The data set contains only the 48 states and excludes Washington D.C., Puerto Rico, and U.S. territories (these areas were not allowed to have voting representatives in Congress).
data(census1920)
This data frame contains the following columns:
state name
state population in 1920 census
Source: U.S. Bureau of the Census (1921).
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Population sizes for each state, from the 2010 U.S. census. The data set contains only the 50 states and excludes the areas that, as of 2020, are not allowed to have voting representatives in Congress: Washington D.C., Puerto Rico, and U.S. territories.
data(census2010)
This data frame contains the following columns:
state name
state population in 2010 census
Source: U.S. Census Bureau (2019).
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Data for a sample of 31 cherry trees.
data(cherry)
This data frame contains the following columns:
diameter of tree (inches)
height of tree (feet)
timber volume of tree (cubic feet)
Hand, D. J., F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski (1994). A Handbook of Small Data Sets. London: Chapman and Hall.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Population sizes for 15 classes, used in Chapter 6 of SDA to illustrate unequal-probability sampling.
data(classes)
This data frame contains the following columns:
class ID number
number of students in class
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Two-stage unequal-probability sample without replacement from the population of classes in classes data.
data(classpps)
This data frame contains the following columns:
class ID number
number of students in class
sampling weight for student
number of hours spent studying statistics
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Joint inclusion probabilities for unequal probability sample without replacement from the population of classes in data classes.
data(classppsjp)
This data frame contains the following columns:
class ID number
number of students in class
probability of being included in sample, $\pi_i$
sampling weight
columns of joint inclusion probabilities, $\pi_{ik}$ (5 columns)
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
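Joint inclusion probabilities are what a without-replacement variance estimator needs. One common choice for a fixed-size design is the Sen-Yates-Grundy form, which sums $(\pi_i \pi_k - \pi_{ik})/\pi_{ik} \cdot (t_i/\pi_i - t_k/\pi_k)^2$ over pairs of sampled units. The base R sketch below computes it for a made-up sample of two units. All numbers and object names are hypothetical, and this is not presented as the method used in the book's examples.

```r
# Hypothetical sample of two units.
pi_i <- c(0.5, 0.4)                  # inclusion probabilities
jp   <- matrix(c(0.50, 0.15,
                 0.15, 0.40), 2, 2)  # joint inclusion probabilities pi_ik
t_i  <- c(10, 12)                    # unit totals

z <- t_i / pi_i                      # expanded values t_i / pi_i
t_hat <- sum(z)                      # Horvitz-Thompson total: 50

# Sen-Yates-Grundy variance estimate, summing over pairs i < k.
n <- length(z)
v_syg <- 0
for (i in seq_len(n - 1)) {
  for (k in (i + 1):n) {
    v_syg <- v_syg +
      (pi_i[i] * pi_i[k] - jp[i, k]) / jp[i, k] * (z[i] - z[k])^2
  }
}
v_syg
```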
Selected variables from the U.S. Department of Education College Scorecard Data (version updated on June 1, 2020). Some of the variables in the book data have been calculated from other variables in the original source; these have been given new variable names that are not found in the data dictionary.
data(college)
This data frame contains the following columns:
unit identification number
institution name (character, length 81)
city (character, length 24)
state abbreviation (character, length 2)
highest degree awarded
3 = Bachelor's degree
4 = Graduate degree
control (ownership) of institution
1 = public
2 = private nonprofit
region where institution is located
1 New England (CT, ME, MA, NH, RI, VT)
2 Mid East (DE, DC, MD, NJ, NY, PA)
3 Great Lakes (IL, IN, MI, OH, WI)
4 Plains (IA, KS, MN, MO, NE, ND, SD)
5 Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV)
6 Southwest (AZ, NM, OK, TX)
7 Rocky Mountains (CO, ID, MT, UT, WY)
8 Far West (AK, CA, HI, NV, OR, WA)
locale of institution
11 City: Large (population of 250,000 or more)
12 City: Midsize (population of at least 100,000 but less than 250,000)
13 City: Small (population less than 100,000)
21 Suburb: Large (outside principal city, in urbanized area with population of 250,000 or more)
22 Suburb: Midsize (outside principal city, in urbanized area with population of at least 100,000 but less than 250,000)
23 Suburb: Small (outside principal city, in urbanized area with population less than 100,000)
31 Town: Fringe (in urban cluster up to 10 miles from an urbanized area)
32 Town: Distant (in urban cluster more than 10 miles and up to 35 miles from an urbanized area)
33 Town: Remote (in urban cluster more than 35 miles from an urbanized area)
41 Rural: Fringe (rural territory up to 5 miles from an urbanized area or up to 2.5 miles from an urban cluster)
42 Rural: Distant (rural territory more than 5 miles but up to 25 miles from an urbanized area or more than 2.5 and up to 10 miles from an urban cluster)
43 Rural: Remote (rural territory more than 25 miles from an urbanized area and more than 10 miles from an urban cluster)
carnegie basic classification
15 Doctoral Universities: Very High Research Activity
16 Doctoral Universities: High Research Activity
17 Doctoral/Professional Universities
18 Master's Colleges & Universities: Larger Programs
19 Master's Colleges & Universities: Medium Programs
20 Master's Colleges & Universities: Small Programs
21 Baccalaureate Colleges: Arts & Sciences Focus
22 Baccalaureate Colleges: Diverse Fields
carnegie classification, size and setting
6 Four-year, very small, primarily nonresidential
7 Four-year, very small, primarily residential
8 Four-year, very small, highly residential
9 Four-year, small, primarily nonresidential
10 Four-year, small, primarily residential
11 Four-year, small, highly residential
12 Four-year, medium, primarily nonresidential
13 Four-year, medium, primarily residential
14 Four-year, medium, highly residential
15 Four-year, large, primarily nonresidential
16 Four-year, large, primarily residential
17 Four-year, large, highly residential
historically black college or university
1 = yes, 0 = no
does the college have an open admissions policy, that is, does it accept any students that apply or have minimal requirements for admission?
1 = yes, 0 = no
fall admissions rate, defined as the number of admitted undergraduates divided by the number of undergraduates who applied
average SAT score (or equivalent) for admitted students
number of degree-seeking undergraduate students enrolled in the fall term
proportion of ugds who are men
proportion of ugds who are women
proportion of ugds who are white (based on self-reports)
proportion of ugds who are black/African American (based on self-reports)
proportion of ugds who are Hispanic (based on self-reports)
proportion of ugds who are Asian (based on self-reports)
proportion of ugds who have other race/ethnicity (created from other categories on original data file; race/ethnicity proportions sum to 1)
average net price of attendance, derived from the full cost of attendance, including tuition and fees, books and supplies, and living expenses, minus federal, state, and institutional grant scholarship aid, for full time, first time undergraduate Title IV receiving students. NPT4 created from scorecard data variables NPT4_PUB if public institution and NPT4_PRIV if private
in-state tuition and fees
out-of-state tuition and fees
average faculty salary per month
proportion of faculty that is full-time
proportion of first-year, full-time students who complete their degree within 150% of the expected time to complete; for most institutions, this is the proportion of students who receive a degree within 6 years
number of graduate students
This data set is made available for pedagogical purposes only. Anyone wishing to draw conclusions from College Scorecard data should obtain the full data set from the Department of Education. The original data set has 1,925 variables and includes institutions (such as those that do not grant undergraduate degrees) that are not in the data college.
The college data includes institutions in the original data set that: (1) are located in the 50 states plus District of Columbia, (2) contain information on average net price (NPT4), (3) are predominantly Bachelor's degree-granting, (4) were currently operating as of June 2020, (5) are not private for-profit institutions or "global" campuses, (6) have Carnegie size classification (variable ccsizset) between 6 and 17 and Carnegie basic classification (variable ccbasic) between 14 and 22 (these offer Bachelor's degrees), (7) enroll first-time students, and (8) are not U.S. Service Academies.
For all variables, missing data are coded as NA.
U.S. Department of Education (2020). College scorecard data. https://collegescorecard.ed.gov/data/ (accessed August 25, 2020).
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
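The average net price variable is described above as taken from NPT4_PUB for public institutions and NPT4_PRIV for private ones. A derived variable of that kind can be built as sketched below in base R. The three-row data frame is invented, and the name npt4 for the result is an assumption for illustration.

```r
# Hypothetical rows in the style of the scorecard source variables.
toy <- data.frame(
  control   = c(1, 2, 1),            # 1 = public, 2 = private nonprofit
  NPT4_PUB  = c(12000, NA, 9500),
  NPT4_PRIV = c(NA, 23000, NA)
)

# Take the public figure for public institutions, the private one otherwise.
toy$npt4 <- ifelse(toy$control == 1, toy$NPT4_PUB, toy$NPT4_PRIV)
toy$npt4   # 12000 23000 9500
```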
Five replicate SRSs from the set of public colleges and universities (having control = 1) in college data. Columns 1-29 are as in college data, with additional columns 30-32 listed below. Note that the selection probabilities and sampling weights are for the separate replicate samples, so that the weights for each replicate sample sum to the population size 500.
data(collegerg)
This data frame contains the following columns:
unit identification number
institution name (character, length 81)
city (character, length 24)
state abbreviation (character, length 2)
highest degree awarded
3 = Bachelor's degree
4 = Graduate degree
control (ownership) of institution
1 = public
2 = private nonprofit
region where institution is located
1 New England (CT, ME, MA, NH, RI, VT)
2 Mid East (DE, DC, MD, NJ, NY, PA)
3 Great Lakes (IL, IN, MI, OH, WI)
4 Plains (IA, KS, MN, MO, NE, ND, SD)
5 Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV)
6 Southwest (AZ, NM, OK, TX)
7 Rocky Mountains (CO, ID, MT, UT, WY)
8 Far West (AK, CA, HI, NV, OR, WA)
locale of institution
11 City: Large (population of 250,000 or more)
12 City: Midsize (population of at least 100,000 but less than 250,000)
13 City: Small (population less than 100,000)
21 Suburb: Large (outside principal city, in urbanized area with population of 250,000 or more)
22 Suburb: Midsize (outside principal city, in urbanized area with population of at least 100,000 but less than 250,000)
23 Suburb: Small (outside principal city, in urbanized area with population less than 100,000)
31 Town: Fringe (in urban cluster up to 10 miles from an urbanized area)
32 Town: Distant (in urban cluster more than 10 miles and up to 35 miles from an urbanized area)
33 Town: Remote (in urban cluster more than 35 miles from an urbanized area)
41 Rural: Fringe (rural territory up to 5 miles from an urbanized area or up to 2.5 miles from an urban cluster)
42 Rural: Distant (rural territory more than 5 miles but up to 25 miles from an urbanized area or more than 2.5 and up to 10 miles from an urban cluster)
43 Rural: Remote (rural territory more than 25 miles from an urbanized area and more than 10 miles from an urban cluster)
carnegie basic classification
15 Doctoral Universities: Very High Research Activity
16 Doctoral Universities: High Research Activity
17 Doctoral/Professional Universities
18 Master's Colleges & Universities: Larger Programs
19 Master's Colleges & Universities: Medium Programs
20 Master's Colleges & Universities: Small Programs
21 Baccalaureate Colleges: Arts & Sciences Focus
22 Baccalaureate Colleges: Diverse Fields
carnegie classification, size and setting
6 Four-year, very small, primarily nonresidential
7 Four-year, very small, primarily residential
8 Four-year, very small, highly residential
9 Four-year, small, primarily nonresidential
10 Four-year, small, primarily residential
11 Four-year, small, highly residential
12 Four-year, medium, primarily nonresidential
13 Four-year, medium, primarily residential
14 Four-year, medium, highly residential
15 Four-year, large, primarily nonresidential
16 Four-year, large, primarily residential
17 Four-year, large, highly residential
historically black college or university
1 = yes, 0 = no
does the college have an open admissions policy, that is, does it accept any students that apply or have minimal requirements for admission?
1 = yes, 0 = no
fall admissions rate, defined as the number of admitted undergraduates divided by the number of undergraduates who applied
average SAT score (or equivalent) for admitted students
number of degree-seeking undergraduate students enrolled in the fall term
proportion of ugds who are men
proportion of ugds who are women
proportion of ugds who are white (based on self-reports)
proportion of ugds who are black/African American (based on self-reports)
proportion of ugds who are Hispanic (based on self-reports)
proportion of ugds who are Asian (based on self-reports)
proportion of ugds who have other race/ethnicity (created from other categories on original data file; race/ethnicity proportions sum to 1)
average net price of attendance, derived from the full cost of attendance, including tuition and fees, books and supplies, and living expenses, minus federal, state, and institutional grant scholarship aid, for full time, first time undergraduate Title IV receiving students. NPT4 created from scorecard data variables NPT4_PUB if public institution and NPT4_PRIV if private
in-state tuition and fees
out-of-state tuition and fees
average faculty salary per month
proportion of faculty that is full-time
proportion of first-year, full-time students who complete their degree within 150% of the expected time to complete; for most institutions, this is the proportion of students who receive a degree within 6 years
number of graduate students
selection probability for each replicate sample
sampling weight for each replicate sample
replicate group number
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
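Replicate samples like these are typically used for random group variance estimation: compute the estimate separately in each replicate, average the replicate estimates, and use their spread to estimate the variance. The base R sketch below illustrates the idea with three made-up replicate groups. The column names repgroup, y, and wt are hypothetical and do not refer to the actual collegerg columns.

```r
# Hypothetical data: 3 replicate groups of 2 units; weights sum to 10 per group.
toy <- data.frame(
  repgroup = rep(1:3, each = 2),
  y        = c(10, 14, 11, 15, 9, 13),
  wt       = rep(5, 6)
)

# Weighted mean within each replicate group.
theta_r <- sapply(split(toy, toy$repgroup),
                  function(d) sum(d$wt * d$y) / sum(d$wt))

R         <- length(theta_r)
theta_bar <- mean(theta_r)            # overall estimate: 12
se_rg     <- sqrt(var(theta_r) / R)   # random group standard error

c(estimate = theta_bar, se = se_rg)
```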
Probability-proportional-to-size sample of size 10 from the stratum of small, highly residential colleges (having ccsizset = 11) in data college. Columns 1-29 are as in college data, with additional columns 30-34 listed below.
data(collshr)
This data frame contains the following columns:
unit identification number
institution name (character, length 81)
city (character, length 24)
state abbreviation (character, length 2)
highest degree awarded
3 = Bachelor's degree
4 = Graduate degree
control (ownership) of institution
1 = public
2 = private nonprofit
region where institution is located
1 New England (CT, ME, MA, NH, RI, VT)
2 Mid East (DE, DC, MD, NJ, NY, PA)
3 Great Lakes (IL, IN, MI, OH, WI)
4 Plains (IA, KS, MN, MO, NE, ND, SD)
5 Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV)
6 Southwest (AZ, NM, OK, TX)
7 Rocky Mountains (CO, ID, MT, UT, WY)
8 Far West (AK, CA, HI, NV, OR, WA)
locale of institution
11 City: Large (population of 250,000 or more)
12 City: Midsize (population of at least 100,000 but less than 250,000)
13 City: Small (population less than 100,000)
21 Suburb: Large (outside principal city, in urbanized area with population of 250,000 or more)
22 Suburb: Midsize (outside principal city, in urbanized area with population of at least 100,000 but less than 250,000)
23 Suburb: Small (outside principal city, in urbanized area with population less than 100,000)
31 Town: Fringe (in urban cluster up to 10 miles from an urbanized area)
32 Town: Distant (in urban cluster more than 10 miles and up to 35 miles from an urbanized area)
33 Town: Remote (in urban cluster more than 35 miles from an urbanized area)
41 Rural: Fringe (rural territory up to 5 miles from an urbanized area or up to 2.5 miles from an urban cluster)
42 Rural: Distant (rural territory more than 5 miles but up to 25 miles from an urbanized area or more than 2.5 and up to 10 miles from an urban cluster)
43 Rural: Remote (rural territory more than 25 miles from an urbanized area and more than 10 miles from an urban cluster)
carnegie basic classification
15 Doctoral Universities: Very High Research Activity
16 Doctoral Universities: High Research Activity
17 Doctoral/Professional Universities
18 Master's Colleges & Universities: Larger Programs
19 Master's Colleges & Universities: Medium Programs
20 Master's Colleges & Universities: Small Programs
21 Baccalaureate Colleges: Arts & Sciences Focus
22 Baccalaureate Colleges: Diverse Fields
carnegie classification, size and setting
6 Four-year, very small, primarily nonresidential
7 Four-year, very small, primarily residential
8 Four-year, very small, highly residential
9 Four-year, small, primarily nonresidential
10 Four-year, small, primarily residential
11 Four-year, small, highly residential
12 Four-year, medium, primarily nonresidential
13 Four-year, medium, primarily residential
14 Four-year, medium, highly residential
15 Four-year, large, primarily nonresidential
16 Four-year, large, primarily residential
17 Four-year, large, highly residential
historically black college or university
1 = yes, 0 = no
does the college have an open admissions policy, that is, does it accept any students that apply or have minimal requirements for admission?
1 = yes, 0 = no
fall admissions rate, defined as the number of admitted undergraduates divided by the number of undergraduates who applied
average SAT score (or equivalent) for admitted students
number of degree-seeking undergraduate students enrolled in the fall term
proportion of ugds who are men
proportion of ugds who are women
proportion of ugds who are white (based on self-reports)
proportion of ugds who are black/African American (based on self-reports)
proportion of ugds who are Hispanic (based on self-reports)
proportion of ugds who are Asian (based on self-reports)
proportion of ugds who have other race/ethnicity (created from other categories on original data file; race/ethnicity proportions sum to 1)
average net price of attendance, derived from the full cost of attendance, including tuition and fees, books and supplies, and living expenses, minus federal, state, and institutional grant scholarship aid, for full time, first time undergraduate Title IV receiving students. NPT4 created from scorecard data variables NPT4_PUB if public institution and NPT4_PRIV if private
in-state tuition and fees
out-of-state tuition and fees
average faculty salary per month
proportion of faculty that is full-time
proportion of first-year, full-time students who complete their degree within 150% of the expected time to complete; for most institutions, this is the proportion of students who receive a degree within 6 years
number of graduate students
number of mathematics faculty
number of psychology faculty
number of biology faculty
selection probability = ugds /(sum of ugds for stratum)
sampling weight $= 1/(10\,\psi_i)$, where $\psi_i$ is the selection probability
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
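The selection probability and weight definitions above can be checked with a small calculation: divide each unit's enrollment by the stratum total, then take the reciprocal of the sample size times that probability. The base R sketch below uses a made-up stratum of four colleges and a sample of n = 2 (the data set itself uses n = 10). All values and object names are hypothetical.

```r
# Hypothetical stratum: undergraduate enrollment for four colleges.
ugds_pop <- c(1000, 2000, 3000, 4000)
psi_pop  <- ugds_pop / sum(ugds_pop)   # selection probabilities, sum to 1

# Hypothetical sample of n = 2 units (units 2 and 4) with a y value each.
n       <- 2
sampled <- c(2, 4)
y       <- c(6, 10)                    # e.g. number of faculty

psi <- psi_pop[sampled]                # 0.2 0.4
wt  <- 1 / (n * psi)                   # 2.50 1.25

t_hat <- sum(wt * y)                   # estimated stratum total: 27.5
t_hat
```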
Selected information on egg size, from a larger study by Arnold (1991). Data provided courtesy of Todd Arnold. Not all observations are used for this data set, so results may not agree with those in Arnold (1991).
data(coots)
This data frame contains the following columns:
clutch number from which eggs were subsampled
number of eggs in clutch
length of egg (mm)
maximum breadth of egg (mm)
calculated as 0.000507*length*(breadth^2)
= 1 if received supplemental feeding
= 0 otherwise
Arnold, T. W. (1991). Intraclutch variation in egg size of American coots. The Condor 93, 19–27.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Data (from 1990) from an SRS of 100 of the 3141 counties in the United States. Missing values are coded as NA.
data(counties)
This data frame contains the following columns:
random number used to select the county
state abbreviation
county name
land area, 1990 (square miles)
total number of persons, 1992
active non-Federal physicians on Jan. 1, 1990
school enrollment in elementary or high school, 1990
percent of school enrollment in public schools
civilian labor force, 1991
number unemployed, 1991
farm population, 1990
number of farms, 1987
acreage in farms, 1987
total expenditures in federal funds and grants, 1992 (millions of dollars)
civilians employed by federal government, 1990
military personnel, 1990
number of veterans, 1990
percent of veterans from Vietnam era, 1990
Source: U.S. Census Bureau (1994).
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Data from selected variables in a simple random sample of 5,000 records from the 7,048,107 records with dates between 2001 and 2019 in the City of Chicago database "Crimes-2001 to Present". This file was downloaded on August 11, 2020 from https://data.cityofchicago.org/. These data are provided for pedagogical purposes only. Anyone wishing to publish analyses of Chicago crime data should obtain the most recent data from the website. For a list and map of Community Areas, see https://www.chicago.gov/city/en/depts/dgs/supp_info/citywide_maps.html.
data(crimes)
This data frame contains the following columns:
year in which crime occurred (between 2001 and 2019)
type of crime, determined from detailed crime description in database
homicide = homicide
sexualasslt = sexual assault
robbery = robbery
aggasslt = aggravated assault
burglary = burglary
mvtheft = motor vehicle theft
idtheft = identity theft
theft = other type of theft
arson = arson
simpleasslt = simple assault (assaults that are not aggravated)
threat = threat or harassment
fraud = fraud
weapon = weapons violation
trespass = trespassing
vandalism = vandalism
narcotics = narcotics or liquor law violation
other = other
= 1 if violent crime
= 0 otherwise
= 1 if an arrest was made
= 0 otherwise
= 1 if crime was domestic-related as defined by the Illinois Domestic Violence Act
= 0 otherwise
number of the Community Area in Chicago where the crime occurred
type of location where crime occurred (e.g. street, apartment)
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Number of dead trees recorded by photograph and field count for a (fictional) SRS of 25 plots taken from a population of 100 plots.
data(deadtrees)
This data frame contains the following columns:
number of dead trees in plot from photograph
number of dead trees in plot from field observation
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Data from a sample of divorce records for states in the Divorce Registration Area.
data(divorce)
This data frame contains the following columns:
state name (character variable)
state abbreviation (character variable)
sampling rate for state
number of records sampled in state
number of records in sample with husband's age < 20
number of records with 20 <= husband's age <= 24
number of records with 25 <= husband's age <= 29
number of records with 30 <= husband's age <= 34
number of records with 35 <= husband's age <= 39
number of records with 40 <= husband's age <= 44
number of records with 45 <= husband's age <= 49
number of records with husband's age >= 50
number of records with wife's age < 20
number of records with 20 <= wife's age <= 24
number of records with 25 <= wife's age <= 29
number of records with 30 <= wife's age <= 34
number of records with 35 <= wife's age <= 39
number of records with 40 <= wife's age <= 44
number of records with 45 <= wife's age <= 49
number of records with wife's age >= 50
Source: National Center for Health Statistics (1987).
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Calculates the empirical probability mass function for a variable with associated weights.
emppmf(y, weight)
y |
Numerical variable |
weight |
Associated weights of the variable of interest, default weight is rep(1,length(y)) |
vals: the distinct values of y
epmf: empirical probability mass function corresponding to each y in vals
emppmf(seq(1:10))
emppmf(htsrs$height, rep(2000/200,200))
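As a rough illustration of what a weighted empirical probability mass function computes, the following base R sketch sums the weights for each distinct value and divides by the total weight. The helper `emppmf_sketch` is hypothetical and is not the package's `emppmf()` implementation; it only mirrors the documented return values `vals` and `epmf`.

```r
# Hypothetical helper (not the package's emppmf()): weighted empirical pmf
emppmf_sketch <- function(y, weight = rep(1, length(y))) {
  stopifnot(length(y) == length(weight))
  # Sum the weights within each distinct value of y (groups are in sorted order)
  wsum <- tapply(weight, y, sum)
  list(vals = sort(unique(y)),
       epmf = as.numeric(wsum) / sum(weight))
}

# Toy data (made up): the value 160 occurs twice among four observations
res <- emppmf_sketch(c(160, 170, 160, 180))
print(res)
```

With equal weights the result is the ordinary relative frequency of each value; unequal weights shift mass toward observations that represent more population units.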
Data from the population of districts for the 1921 Italian general census.
data(gini)
This data frame contains the following columns:
ID number
district name
births per 1,000 population
deaths per 1,000 population
marriages per 1,000 population
percentage of males over 10 years old who work in agriculture
percentage of population in urban areas
average income
average altitude above sea level (meters)
number of inhabitants per square kilometer
rate of average increase of the population
population of area
land area (square kilometers)
= 1 if in the purposive sample selected by Gini and Galvani
= 0 otherwise
Gini, C. and L. Galvani (1929). Di una applicazione del metodo rappresentativo all’ultimo censimento italiano della popolazione. Annali di Statistica 6 (4), 1-105. The data are on pages 73–78.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
A simple random sample of 120 golf courses, taken from the population on the website ww2.golfcourse.com on August 5, 1998. Missing data are denoted by NA.
data(golfsrs)
This data frame contains the following columns:
random number used to select golf course for sample
state name
number of holes
type of course:
priv = private
semi = semi-private
pub = public
mili = military
resort = resort
year course was built
greens fee for 18 holes during week
greens fee for 9 holes during week
greens fee for 18 holes on weekend
greens fee for 9 holes on weekend
back tee yardage
course rating
par for course
golf cart rental fee for 18 holes
golf cart rental fee for 9 holes
are caddies available? (y or n)
is a golf pro available? (y or n)
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
GPA data from Chapter 5 of SDA.
data(gpa)
This data frame contains the following columns:
suite (psu) identifier
grade point average of person in suite
sampling weight, = 20 for every observation
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Randomization and statistical inference practices in a stratified random sample of 196 public health articles. The data, provided courtesy of Dr. Matt Hayat, are discussed in Hayat and Knapp (2017). The variables provided in healthjournals are a subset of the variables collected by the authors.
data(healthjournals)
This data frame contains the following columns:
journal that published the article
AJPH = American Journal of Public Health
AJPM = American Journal of Preventive Medicine
PM = Preventive Medicine
number of authors
"Yes" if data in the article were from a randomly selected (probability) sample
"No" otherwise
"Yes" if study subjects for the article were randomly assigned to treatment groups
"No" otherwise
"Yes" if a confidence interval appeared in the article's main text, tables, or figures
"No" otherwise
"Yes" if a p-value or significance test appeared in the article's main text, tables, or figures
"No" otherwise
"Yes" if asterisks were used to represent p-value ranges
"No" otherwise
Hayat, M. and T. Knapp (2017). Randomness and inference in medical and public health research. Journal of the Georgia Public Health Association 7 (1), 7–11.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Empirical distribution function and empirical probability mass function of data in htpop.
data(htcdf)
This data frame contains the following columns:
height value, cm
number of times height value in column 1 occurs in population
empirical probability mass function
empirical distribution function
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
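The relationship among the frequency, empirical probability mass function, and empirical distribution function columns can be sketched as follows. The vector names `height`, `freq`, `epmf`, and `ecdf_vals` are hypothetical and the counts are made up.

```r
# Toy frequency table (made up): distinct height values and their counts
height <- c(150, 160, 170, 180)
freq   <- c(2, 5, 2, 1)

# Empirical pmf: share of the population at each height value
epmf <- freq / sum(freq)

# Empirical distribution function: cumulative share at or below each value
ecdf_vals <- cumsum(epmf)

print(data.frame(height, freq, epmf, ecdf_vals))
```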
Height and gender of 2,000 persons in an artificial population.
data(htpop)
This data frame contains the following columns:
height of person, cm
M = male
F = female
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Height and gender for an SRS of 200 persons, taken from htpop data.
data(htsrs)
This data frame contains the following columns:
random number used to select unit
height of person, cm
M = male
F = female
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Height and gender for a stratified random sample of 160 women and 40 men, taken from htpop data.
data(htstrat)
The columns and names are as in htsrs data.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Population and sample sizes for the poststrata used for the Sunday hunting survey.
data(hunting)
This data frame contains the following columns:
region of state (East, Central, West)
gender (female, male)
age group (16-24, 25-34, 35-44, 45-54, 55-64, 65+)
population size in poststratum from the 2000 U.S. census
sample size in poststratum
Source: Virginia Polytechnic and State University/Responsive Management (2006).
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
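One common use of poststratum population and sample sizes is to form poststratification weights. The sketch below assumes every respondent in a poststratum receives the weight population size divided by sample size; the column names `popsize`, `sampsize`, and `pswt` and all counts are hypothetical.

```r
# Toy poststrata (made up): population and sample sizes for three cells
toy <- data.frame(
  cell     = c("A", "B", "C"),
  popsize  = c(1000, 3000, 6000),
  sampsize = c(20, 30, 50)
)

# Poststratification weight for each respondent in the cell
toy$pswt <- toy$popsize / toy$sampsize

# The weights of all respondents add up to the total population size
total_weight <- sum(toy$pswt * toy$sampsize)
print(toy)
```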
Small artificial data set used to illustrate imputation methods. Missing values are denoted by NA.
data(impute)
This data frame contains the following columns:
identification number for person
age in years
M = male
F = female
number of years of education
= 1 if victim of any crime
= 0 otherwise
= 1 if victim of violent crime
= 0 otherwise
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Artificial population of 2000 observations.
data(integerwt)
This data frame contains the following columns:
stratum number
value of observation
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Data from the online (Mechanical Turk) survey. The data were downloaded from PLOS ONE on February 8, 2020; the variables extracted from the full data set are provided here for educational purposes only.
data(intellonline)
This data frame contains the following columns:
response to question about agreement with the statement "I am more intelligent than the average person"
1 = Strongly Agree
2 = Mostly Agree
3 = Mostly Disagree
4 = Strongly Disagree
5 = Don't Know or Not Sure
census region of respondent (character variable, length 10):
Northeast
South
Midwest
West
sex (character variable, length 8):
Male
Female
race (character variable, length 18):
White
African American
Asian American
Hispanic American
Another origin
age, years
household income level (character variable, length 8):
< $40k
$40-80k
> $80k
highest education level attained (character variable, length 12):
No College
Some College
College Grad
Grad School
MISSING
relative weight, obtained by poststratifying to demographic proportions in the 2010 U.S. Census. The weights are normed so that they sum to 750.
Heck, P. R., D. J. Simons, and C. F. Chabris (2018). 65% of Americans believe they are above average in intelligence: Results of two nationally representative surveys. PloS One 13 (7), 1–11.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Data from the telephone survey studied by Heck et al. (2018). The data were downloaded from here and are provided for educational purposes only.
data(intelltel)
The variables are the same as in intellonline.
Heck, P. R., D. J. Simons, and C. F. Chabris (2018). 65% of Americans believe they are above average in intelligence: Results of two nationally representative surveys. PloS One 13 (7), 1–11.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Relative weights for demographic groups in intellonline and intelltel (Heck et al., 2018). Each sample was weighted using the 2010 U.S. Census demographics for sex (male, female), age (< 44, >= 44), and race/ethnicity (white, nonwhite). The table entries give the weights for each of these eight demographic groups.
data(intellwts)
This data frame contains the following columns:
Female and Male
Young = (age less than 44)
Old = (age greater than or equal to 44)
White or Nonwhite
number of telephone survey respondents in the sex/agegroup/race class
number of online survey respondents in the sex/agegroup/race class
relative weight for each respondent to the telephone survey in this sex/agegroup/race class
relative weight for each respondent to the online survey in this sex/agegroup/race class
Heck, P. R., D. J. Simons, and C. F. Chabris (2018). 65% of Americans believe they are above average in intelligence: Results of two nationally representative surveys. PloS One 13(7), 1–11.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
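A relative weight of this kind can be sketched as the population share of a demographic group divided by its sample share, which makes the weights sum to the number of respondents. This is a minimal illustration under that assumption, not the authors' weighting code; the vectors `pop_prop` and `n_resp` are hypothetical and their values are made up.

```r
# Made-up census proportions and respondent counts for four groups
pop_prop <- c(0.30, 0.20, 0.35, 0.15)  # population shares, sum to 1
n_resp   <- c(50, 10, 30, 10)          # respondents per group
n <- sum(n_resp)

# Relative weight: population share divided by sample share
rel_wt <- pop_prop / (n_resp / n)

# Summed over all respondents, the relative weights equal the sample size
total_wt <- sum(rel_wt * n_resp)
print(data.frame(pop_prop, n_resp, rel_wt))
```

Groups that are underrepresented among respondents (here the second group) receive relative weights above 1.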
Simulate a population of clusters, then draw a simple random sample of clusters and construct interval estimates using incorrect SRS formulae and formulae appropriate for cluster samples.
intervals_ex40(groupcorr, numintervals, groupsize, sampgroups, popgroups, mu, sigma)
groupcorr |
The intracluster correlation coefficient rho |
numintervals |
Number of samples to be taken from population |
groupsize |
Number of elements in each population cluster |
sampgroups |
Number of clusters to be sampled |
popgroups |
Number of clusters in population |
mu |
Mean for generating population |
sigma |
Standard deviation for generating population |
SRS_cover_prob: proportion of intervals using SRS formulae that include the true population mean mu
cl_cover_prob: proportion of intervals using cluster sampling formulae that include the true population mean mu
SRS_mean_CI_width: the average width of the interval estimates from SRS
Cluster_mean_CI_width: the average width of the interval estimates from cluster sampling
Replicate: Simulation replicate
srs_lci: lower limit of CI from SRS
srs_uci: upper limit of CI from SRS
clus_lci: lower limit of CI from cluster sampling
clus_uci: upper limit of CI from cluster sampling
scatter plot: first graph shows scatter plot of the last simulated sample
CI plots: the second graph shows the interval estimates produced for each sample when analyzed as an SRS (intervals that do not contain the true parameter are drawn in red), and the third graph shows the interval estimates produced for each sample when analyzed as a cluster sample.
# default setting
intervals_ex40(groupcorr = 0, numintervals = 100, groupsize = 5, sampgroups = 10, popgroups = 5000, mu = 0, sigma = 1)
# change groupcorr and leave others as default setting
intervals_ex40(groupcorr = 0.3)
intervals_ex40(groupcorr = 0.7, numintervals = 100, groupsize = 5, sampgroups = 10, popgroups = 5000, mu = 0, sigma = 1)
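The idea behind this simulation can be sketched in base R. The function `sim_cluster_ci` below is hypothetical and is not the package's `intervals_ex40()`; as a simplification it draws each sample of clusters directly from a normal random-effects model (no finite population and no finite population correction) and compares 95% intervals that ignore the clustering with intervals based on cluster means.

```r
# Hypothetical sketch (not intervals_ex40()): coverage of SRS vs. cluster formulas
sim_cluster_ci <- function(rho, nsim = 200, m = 5, n_clus = 10,
                           mu = 0, sigma = 1) {
  srs_cover <- clus_cover <- logical(nsim)
  cluster_id <- rep(seq_len(n_clus), each = m)
  for (r in seq_len(nsim)) {
    # A shared cluster effect induces intracluster correlation rho
    a <- rnorm(n_clus, 0, sigma * sqrt(rho))
    y <- mu + a[cluster_id] + rnorm(n_clus * m, 0, sigma * sqrt(1 - rho))
    ybar <- mean(y)
    # Incorrect analysis: treat all n_clus * m elements as an SRS
    half_srs <- qt(0.975, length(y) - 1) * sd(y) / sqrt(length(y))
    # Cluster analysis: use the variability among the cluster means
    cm <- tapply(y, cluster_id, mean)
    half_cl <- qt(0.975, n_clus - 1) * sd(cm) / sqrt(n_clus)
    srs_cover[r]  <- abs(ybar - mu) <= half_srs
    clus_cover[r] <- abs(ybar - mu) <= half_cl
  }
  c(SRS_cover_prob = mean(srs_cover), cl_cover_prob = mean(clus_cover))
}

set.seed(1)
out <- sim_cluster_ci(rho = 0.7)
print(out)
```

With a large intracluster correlation, the intervals that ignore clustering are too narrow and cover the mean less often than the nominal 95%.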
Data extracted from the 1980 Census Integrated Public Use Microdata Series, using the "Small Sample Density" option in the data extract tool, on September 17, 2008. The stratum and psu variables were constructed for use in the book exercises. Data analyses on this file do NOT give valid results for inference to the 1980 U.S. population.
data(ipums)
This data frame contains the following columns:
stratum number (1-9)
psu number (1-90)
total personal income (dollars), topcoded at $75,000
age, with range 15-90
1 = Male
2 = Female
1 = White
2 = Black
3 = American Indian or Alaska Native
4 = Asian or Pacific Islander
5 = Other Race
0 = Not Hispanic
1 = Hispanic
marital status:
1 = Married
2 = Separated
3 = Divorced
4 = Widowed
5 = Never married/single
ownership of housing unit:
0 = Not Applicable (N/A)
1 = Owned or being bought
2 = Rents
number of years a foreign-born person has lived in the U.S.:
0 = N/A
1 = 0-5 years
2 = 6-10 years
3 = 11-15 years
4 = 16-20 years
5 = 21+ years
is person in school?
1 = No, not in school
2 = Yes, in school
educational attainment:
1 = None or preschool
2 = Grade 1, 2, 3, or 4
3 = Grade 5, 6, 7, or 8
4 = Grade 9
5 = Grade 10
6 = Grade 11
7 = Grade 12
8 = 1 to 3 years of college
9 = 4+ years of college
in labor force?
0 = Not Applicable
1 = No
2 = Yes
class of worker:
0 = Not applicable
13 = Self employed, not incorporated
14 = Self employed, incorporated
22 = Wage/salary, private
25 = Federal government employee
27 = State government employee
28 = Local government employee
29 = Unpaid family worker
veteran status
0 = Not Applicable
1 = No Service
2 = Yes
Ruggles et al. (2004). Integrated Public Use Microdata Series: Version 3.0 [machine-readable database]. https://usa.ipums.org/usa/ (accessed September 17, 2008).
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Types of sampling used for articles in a sample of journals. Note that columns 2 and 3 do not always sum to column 1; for some articles, the investigators could not determine which type of sampling was used. When working with these data, you may wish to create a fourth column, "indeterminate", which equals column1 - (column2 + column3).
data(journal)
This data frame contains the following columns:
number of articles in 1988 that used sampling
number of articles that used probability sampling
number of articles that used non-probability sampling
Source: Jacoby and Handlin (1991). Non-probability sampling designs for litigation surveys. Trademark Reporter 81, 169–179.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
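The suggested fourth column can be created as in the following sketch. The data frame `toy` and the column names `numemp`, `prob`, and `nonprob` are hypothetical stand-ins for columns 1 to 3, and the counts are made up.

```r
# Toy data (made up): articles using sampling, probability, and non-probability sampling
toy <- data.frame(numemp = c(10, 7), prob = c(4, 2), nonprob = c(5, 3))

# Articles whose type of sampling could not be determined
toy$indeterminate <- toy$numemp - (toy$prob + toy$nonprob)

print(toy)
```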
Roberts et al. (1995) reported on the results of a survey of parents whose children had not been immunized against measles during a recent campaign to immunize all children in the first five years of secondary school. The original data were unavailable; univariate and multivariate summary statistics from these artificial data, however, are consistent with those in the paper. All variables are coded as 1 for yes, 0 for no, and NA for no answer. A parent who refused consent (variable 4) was asked why, with responses in variables 5 through 10. If a response in variables 5 through 10 was checked, it was assigned value 1; otherwise, it was assigned value 0. A parent could give more than one reason for not having the child immunized.
data(measles)
This data frame contains the following columns:
parent received consent form
parent returned consent form
parent gave consent for measles immunization
child had already had measles
child had been immunized against measles
parent concerned about side effects
parent wanted general practitioner (GP) to give vaccine
child did not want injection
parent thought measles not a serious illness
GP advised that vaccine was not needed
school attended by child
population size in school
sample size in school
Roberts et al. (1995). Reasons for non-uptake of measles, mumps, and rubella catch up immunisation in a measles epidemic and side effects of the vaccine. British Medical Journal 310, 1629–1632.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Data from a stratified random sample of books nominated for the Edgar awards for Best Novel and Best First Novel. The sample was drawn from the population listing of 655 books at http://theedgars.com/awards/ on August 14, 2020.
data(mysteries)
This data frame contains the following columns:
stratum number, from 1 to 12, computed from the stratification variables in columns 2-4
time period in which award was given:
1 = 1946-1980
2 = 1981-2000
3 = 2001-2020
award category (character variable, length 16): Best Novel, or Best First Novel
= 1 if book won the award that year
= 0 if book was nominated but did not win award
number of population books in stratum ($N_h$)
number of sampled books in stratum ($n_h$)
= 1 if book was obtained (responded) in original sample
= 2 if book was obtained in phase II subsample of nonrespondents
= 0 if not obtained
weight for phase I sample, calculated as the number of population books in the stratum divided by the number of sampled books in the stratum; use for exercises in Chapters 1-11 of SDA
final weight for phase II sample; use for exercises in Chapter 12 of SDA and analyses involving variables victims and firearm
genre of book (character variable, length 11).
"private eye" (protagonist is a private detective)
"procedural" (a detailed, step-by-step analysis of how the crime is solved, using the skills of the detective)
"suspense" (the protagonist is at the center of action or is involved in espionage, but is not a professional detective)
= 1 if the main action in the book takes place at least 20 years before the book's publication date
= 0 if book action is within 20 years of the publication date
= 1 if the main action in the book takes place primarily in urban areas
= 0 otherwise
gender of author (character variable, length 1)
= "F" if author is female
= "M" if author is male
number of female detectives (or protagonists, if book has no detective) in book
number of male detectives (or protagonists, if book has no detective) in book
number of murder victims in book (missing value set to NA if obtained = 0)
number of murders committed with firearms in book (missing value set to NA if obtained = 0)
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Selected variables from the 2015-2016 National Health and Nutrition Examination Survey (NHANES). This data set is provided for educational purposes only. Anyone wishing to publish or use results from analyses of NHANES data should obtain the data files directly from the source: Centers for Disease Control and Prevention (2017).
data(nhanes)
This data frame contains the following columns:
Pseudo stratum. These are groups of secondary sampling units used for variance estimation on the publicly available data. Pseudo strata and pseudo psus are released instead of the actual strata and psus to protect the confidentiality of respondents' information. Use sdmvstra as the variable defining the strata.
Pseudo-psu. Use sdmvpsu as the primary sampling unit (psu). (There are two pseudo-psus per pseudo-stratum, numbered 1 and 2.)
Interview weight (use as weight for variables 5-12)
Mobile Examination Center weight (use as weight for any analysis involving variables 13-25)
Interview/examination status
= 1 if interviewed only
= 2 if interviewed and had medical examination
Age in years at screening, from 0 to 80. Anyone with age > 80 years is recorded (topcoded) as 80. No values are missing for this variable.
Age in months at screening (reported only for persons with age 24 months or younger at the time of exam, otherwise missing)
= 1 if male
= 2 if female (no missing values)
Race/ethnicity code (no missing values)
1 = Mexican American
2 = Other Hispanic
3 = Non-Hispanic White
4 = Non-Hispanic Black
6 = Non-Hispanic Asian
7 = Other Race, Including Multi-Racial
Education level of person interviewed (given for adults age 20+ only)
1 = Less than 9th grade
2 = 9th to 11th grade (including 12th grade with no diploma)
3 = High school graduate (including GED)
4 = Some college or associate's degree
5 = College graduate or above
7 = Refused
9 = Don't know
Total number of people in the family. Values 1-6 indicate the number of people is that number; value 7 indicates 7 or more people in family. No missing values.
Ratio of family income to poverty guideline. A value less than 1 indicates the family is below the poverty threshold. Variable indfmpir is a continuous variable where values between 0 and 4.99 indicate the actual poverty ratio. A value of 5 indicates that the ratio of family income to the poverty guideline for that family is 5 or more.
Weight (kg)
Standing height (cm)
Body mass index (kg/m$^2$), calculated as weight in kilograms divided by the square of height in meters
Waist circumference (cm)
Upper leg length (cm)
Upper arm length (cm)
Upper arm circumference (cm)
Average sagittal abdominal diameter (SAD, the distance from the small of the back to the upper abdomen), in cm. Calculated by averaging the SAD readings on the person (up to four)
Serum total cholesterol (mg/dL)
60-second pulse
Average systolic blood pressure (mm Hg)
Average diastolic blood pressure (mm Hg)
Number of blood pressure readings
The data files merged to create nhanes can be read directly from the SAS transport files DEMO_I.XPT, BMX_I.XPT, TCHOL_I.XPT, and BPX_I.XPT from the NHANES website.
Variables 1-23 have the same names as in the SAS transport files.
The blood pressure variables sbp and dbp were created as follows. In the medical examination, three consecutive blood pressure readings were obtained after participants sat quietly for 5 minutes, and the maximum inflation level was determined. A fourth measurement was conducted for some persons who had an incomplete or interrupted blood pressure reading.
The variables sbp and dbp were calculated by discarding the first blood pressure reading and calculating the average of the remaining valid readings. Note that some of the diastolic blood pressure readings are 0.
In the comma-delimited file nhanes.csv, missing values are denoted by -9. In the R data file, missing values are denoted by NA. Note that some of the codes for variables in the table below also denote missing values; for example, the value 7 for dmdeduc2 indicates "Refused", and these codes for special types of missing values remain in the R data files.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
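The averaging rule described for sbp and dbp (discard the first reading, then average the remaining valid readings) can be sketched as follows. The matrix `bp` and its values are made up, and treating a person with no valid later readings as missing is an assumption of this sketch.

```r
# Made-up readings: one row per person, columns are readings 1 to 4 (NA = not taken)
bp <- rbind(c(122, 118, 120, NA),
            c(130,  NA, 126, 128),
            c(110,  NA,  NA, NA))

# Drop the first reading, then average whatever valid readings remain
later <- bp[, -1, drop = FALSE]
avg_bp <- rowMeans(later, na.rm = TRUE)

# rowMeans() returns NaN when no readings remain; record those as missing
avg_bp[is.nan(avg_bp)] <- NA

print(avg_bp)
```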
Data collected in the New York Bight for June 1974 and June 1975. Two of the original strata were combined because of insufficient sample sizes. For variable "catchwt", weights less than 0.5 were recorded as 0.5 kg.
data(nybight)
This data frame contains the following columns:
year of data collection, 1974 or 1975
stratum membership, based on depth
number of fish caught during trawl
total weight (kg) of fish caught during trawl
number of species of fish caught during trawl
depth of station (m)
surface temperature (degrees C)
Missing values are coded as NA.
Wilk et al. (1977). Fishes and Associated Environmental Data Collected in New York Bight, June 1974–June 1975. NOAA Tech. Rep. No. NMFS SSRF-716. Washington, DC: U.S. Government Printing Office.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Data on number of holts (dens) in Shetland, U.K., used in Kruuk et al. (1989). Data courtesy of Hans Kruuk.
data(otters)
This data frame contains the following columns:
section of coastline
type of habitat (stratum)
number of holts (dens)
Kruuk et al. (1989). An estimate of numbers and habitat preferences of otters Lutra lutra in Shetland, UK. Biological Conservation 49, 241–254.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Hourly ozone readings (parts per billion, ppb) from a site in Monterey County, California, for 2018 and 2019. Source: accessed November 19, 2020. Missing values are denoted by NA.
data(ozone)
This data frame contains the following columns:
year of reading (2018 or 2019)
month of reading (1-12)
day of reading (1-31)
ozone reading (ppb) at 0:00 local time
ozone reading (ppb) at 1:00 local time
ozone reading (ppb) at 2:00 local time
ozone reading (ppb) at 3:00 local time
ozone reading (ppb) at 4:00 local time
ozone reading (ppb) at 5:00 local time
ozone reading (ppb) at 6:00 local time
ozone reading (ppb) at 7:00 local time
ozone reading (ppb) at 8:00 local time
ozone reading (ppb) at 9:00 local time
ozone reading (ppb) at 10:00 local time
ozone reading (ppb) at 11:00 local time
ozone reading (ppb) at 12:00 local time
ozone reading (ppb) at 13:00 local time
ozone reading (ppb) at 14:00 local time
ozone reading (ppb) at 15:00 local time
ozone reading (ppb) at 16:00 local time
ozone reading (ppb) at 17:00 local time
ozone reading (ppb) at 18:00 local time
ozone reading (ppb) at 19:00 local time
ozone reading (ppb) at 20:00 local time
ozone reading (ppb) at 21:00 local time
ozone reading (ppb) at 22:00 local time
ozone reading (ppb) at 23:00 local time
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Fictional data from a fictional point-in-time (PIT) survey taken to estimate the number of persons experiencing homelessness.
data(pitcount)
This data frame contains the following columns:
stratum number (from 1 to 8)
geographic division, used to form strata
expected density of persons experiencing homelessness (character variable, with values High or Low)
= N_h, the number of areas in the population for stratum h
= n_h, the number of areas in the sample for stratum h
= N_h/n_h, the sampling weight for the area
number of persons experiencing unsheltered homelessness found in the area during the PIT count
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
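The following is a minimal sketch of how a stratified weight and an estimated total could be computed for data laid out like pitcount. It uses a small made-up data frame rather than the package data; the column names (stratum, popsize, sampsize, y, strwt) and all values are assumptions for illustration, and the weight is assumed to be the usual stratified weight N_h/n_h.

```r
# Hypothetical stand-in for a stratified sample of areas (not the real pitcount data).
toy <- data.frame(
  stratum  = c(1, 1, 2, 2, 2),
  popsize  = c(10, 10, 30, 30, 30),  # N_h: areas in the population for stratum h
  sampsize = c(2, 2, 3, 3, 3),       # n_h: areas in the sample for stratum h
  y        = c(4, 6, 0, 1, 2)        # persons counted in each sampled area
)

# Assumed stratified sampling weight: N_h / n_h for every sampled area in stratum h.
toy$strwt <- toy$popsize / toy$sampsize

# Weighted estimate of the population total.
est_total <- sum(toy$strwt * toy$y)
est_total
```

Each sampled area stands in for N_h/n_h areas of its stratum, so the weighted sum estimates the population count.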
The data described in Zhang et al. (2020) were downloaded from https://www.openicpsr.org/openicpsr/project/109021/version/V1/view on January 22, 2020, from file survey4.rds.
data(profresp)
This data frame contains the following columns:
Level of professionalism
1 = novice
2 = average
3 = professional
Number of panels respondent has belonged to. A response between 1 and 6 means that the person has belonged to that number of panels; 7 means 7 or more.
How many Internet surveys have you completed before this one?
1 = This is my first one
2 = 1-5
3 = 6-10
4 = 11-15
5 = 16-20
6 = 21-30
7 = More than 30
Are you a member of any online survey panels besides this one?
1 = yes
2 = no
To how many other online panels do you belong?
1 = None
2 = 1 other panel
3 = 2 others
4 = 3 others
5 = 4 others
6 = 5 others
7 = 6 others or more.
This question has a missing value if panelq1 = 2. If you want to estimate how many panels a respondent belongs to, create a new variable numpanel that equals panelq2 if panelq2 is not missing and equals 1 if panelq1 = 2.
Age category
1 = 18 to 34
2 = 35 to 49
3 = 50 to 64
4 = 65 and over
Education category
1 = high school or less
2 = some college or associate's degree
3 = college graduate or higher
1 = male
2 = female
1 = race is non-white
0 = race is white
Which best describes your main reason for joining on-line survey panels?
1 = I want my voice to be heard
2 = Completing surveys is fun
3 = To earn money
4 = Other (Please specify)
During the PAST 12 MONTHS, how many times have you seen a doctor or other health care professional about your own health? Response is number between 0 and 999.
During the PAST MONTH, how many days have you felt you did not get enough rest or sleep?
During the PAST MONTH, how many times have you eaten in restaurants? Please include both full-service and fast food restaurants.
During the PAST MONTH, how many times have you shopped in a grocery store? If you shopped at more than one grocery store on a single trip, please count them separately.
During the PAST 2 YEARS, how many overnight trips have you taken?
The data set profresp contains selected variables from the set of 2,407 respondents who completed the survey and provided information on the demographic variables and the information needed to calculate "professional respondent" status. The full data set survey4.rds contains numerous additional questions about behavior that are not included here, as well as the data from the partially completed surveys. The website also contains data for three other online panel surveys. Because profresp is a subset of the full data, statistics calculated from it may differ from those in Zhang et al. (2020).
Missing values are denoted by NA.
Zhang et al. (2020). Professional respondents in opt-in online panels: What do we really know? Social Science Computer Review 38 (6), 703–719.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
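The derived variable numpanel described for panelq2 can be sketched as follows. The data frame here is a made-up stand-in with only the panelq1 and panelq2 columns; the values are illustrative and not taken from profresp.

```r
# Hypothetical stand-in for profresp with only the two panel questions.
toy <- data.frame(
  panelq1 = c(1, 2, 1, 2),    # 1 = yes, 2 = no (member of other panels)
  panelq2 = c(3, NA, 7, NA)   # missing when panelq1 = 2
)

# numpanel equals panelq2 when it is observed, and 1 when panelq1 = 2;
# anything else stays NA.
toy$numpanel <- ifelse(!is.na(toy$panelq2), toy$panelq2,
                       ifelse(toy$panelq1 == 2, 1, NA))
toy
```

Respondents who answered "no" to panelq1 receive the code 1, which matches the panelq2 category "None".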
Population estimates from the 2011 American Community Survey (ACS) for age/gender/education categories measured in profresp (Zhang et al., 2020). Note that age3cat has 3 categories, while the age variable in profresp has 4 categories.
data(profrespacs)
This data frame contains the following columns:
1 = male
2 = female
age category
1 = 18 to 34
2 = 35 to 64
3 = 65 and over
education category
1 = high school or less
2 = some college or associate's degree
3 = college graduate or higher
population size from ACS for the gender/age/education level combination
Zhang et al. (2020). Professional respondents in opt-in online panels: What do we really know? Social Science Computer Review 38 (6), 703–719.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Radon readings for a stratified sample of 1003 homes in Minnesota. The data were downloaded in April 2008 from an earlier version of the website now located at https://www.stat.berkeley.edu/users/statlabs/labs.html.
data(radon)
This data frame contains the following columns:
county name
county number
sample size in county
population size in county
radon concentration (picocuries per liter)
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Lengths of rectangles.
data(rectlength)
This data frame contains the following columns:
rectangle number
rectangle length
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Page from a random number table.
data(rnt)
This data frame contains the following columns:
column of 5-digit random numbers
column of 5-digit random numbers
column of 5-digit random numbers
column of 5-digit random numbers
column of 5-digit random numbers
column of 5-digit random numbers
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
All possible simple random samples that can be generated from the population in Example 2.2 of SDA.
data(sample70)
This data frame contains the following columns:
sample number
sampled units in
sampled units in
sampled units in
sampled units in
values of in sample
values of in sample
values of in sample
values of in sample
estimated population total
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
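A table of all possible simple random samples can be built with base R, as in the sketch below. It assumes a population of 8 units and samples of size 4 (an inference from the 70 rows implied by the data set name and the four unit columns), and the y values are hypothetical placeholders rather than values confirmed from Example 2.2.

```r
# Hypothetical population values; N and n are assumptions, not taken from the data.
y <- c(1, 2, 4, 4, 7, 7, 7, 8)
N <- length(y)
n <- 4

# Every subset of n units out of N, one sample per row.
samples <- t(combn(N, n))

# Estimated population total for each sample: N times the sample mean.
t_hat <- apply(samples, 1, function(s) N * mean(y[s]))

nrow(samples)   # number of possible samples, choose(N, n)
mean(t_hat)     # average of the estimates over all samples
sum(y)          # true population total
```

Averaging the estimated totals over all equally likely samples reproduces the population total, which illustrates the unbiasedness of the estimator under simple random sampling.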
The number of seedlings in the sampled psus on Santa Cruz Island, California, in 1992 and 1994.
data(santacruz)
This data frame contains the following columns:
tree number
number of seedlings in 1992
number of seedlings in 1994
Peart, D. (1994). Impacts of Feral Pig Activity on Vegetation Patterns Associated with Quercus agrifolia on Santa Cruz Island, California. Ph.D. dissertation. Tempe, AZ: Arizona State University.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Math and reading test results from a two-stage cluster sample of tenth-grade students. An SRS of 10 schools was selected from the 75 schools in the population, and then 20 students were sampled from each school. These data are fictional but the summary statistics are consistent with those seen in educational studies.
data(schools)
This data frame contains the following columns:
school number (use as cluster variable)
gender of student (character variable, F = female, M = male)
score on math test
score on reading test
category level for math test score:
1 if 1 <= math <= 40
2 if 41 <= math
category level for reading test score:
1 if 1 <= read <= 32
2 if 33 <= read <= 50
number of students in school
weight for student in sample
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
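As a rough sketch, a student-level weight for the two-stage design described above (10 of 75 schools, then 20 students per sampled school) could be computed as the product of the two inverse sampling fractions. The column names schoolid, Mi, and finalwt and the school sizes are hypothetical, and the formula is the standard two-stage weight rather than something confirmed from the data set itself.

```r
# Hypothetical school sizes; the real data may differ.
toy <- data.frame(
  schoolid = c(1, 1, 2),
  Mi       = c(200, 200, 500)   # number of students in the school
)

N_schools  <- 75   # schools in the population
n_schools  <- 10   # schools sampled at stage 1
m_students <- 20   # students sampled per school at stage 2

# Weight = (schools in population / schools sampled) * (students in school / students sampled).
toy$finalwt <- (N_schools / n_schools) * (toy$Mi / m_students)
toy
```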
Data on number of breathing holes found in sampled areas of Svalbard fjords, reconstructed from summary statistics given in Lydersen and Ryg (1991).
data(seals)
This data frame contains the following columns:
zone number for sampled area
number of breathing holes Imjak found in area
Lydersen, C. and M. Ryg (1991). Evaluating breeding habitat and populations of ringed seals Phoca hispida in Svalbard fjords. Polar Record 27, 223–228.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Population of black and gray squares and circles.
data(shapespop)
This data frame contains the following columns:
identification number for object
shape of object (square or circle)
color of object (gray or black)
area of object ()
= 1 if object can be reached through convenience sample
= 0 otherwise
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Two-phase sample of shorebird nests. These are artificial data constructed from summary statistics given in Bart and Earnst (2002).
data(shorebirds)
This data frame contains the following columns:
plot number
rapid-method count of number of birds in plot
intensive-method count of number of nests in plot
= NA if the plot is not in the phase II sample
Bart, J. and S. Earnst (2002). Double-sampling to estimate density and population trends in birds. The Auk 119, 36–45.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Companies in the S&P 500 Stock Market Index as of September 15, 2020. Downloaded from https://fknol.com/list/market-cap-sp-500-index-companies.php?go=g0 on September 19, 2020.
data(sp500)
This data frame contains the following columns:
company name (character variable, length 37)
stock symbol (character variable, length 5)
market capitalization, in billions of U.S. dollars
price per share of stock
price-to-earnings ratio
earnings per share
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Fictional cluster sample of introductory Spanish students.
data(spanish)
This data frame contains the following columns:
class number
score on vocabulary test (out of 100)
= 1 if plan a trip to a Spanish-speaking country
= 0 otherwise
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
An SRS of size 30 taken from an artificial population of size 100.
data(srs30)
This data frame contains the following columns:
value of observation
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
SRS of 150 members of the Statistical Society of Canada, downloaded from ssc.ca in August, 2006.
data(ssc)
This data frame contains the following columns:
m = male
f = female
a = academic
i = industry
n = not determined
= 1 if person is member of American Statistical Association
= 0 otherwise
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Data from an unequal-probability sample of 100 counties from the 1994 County and City Data Book (U.S. Census Bureau, 1994). The sample was selected with probability proportional to population.
data(statepop)
This data frame contains the following columns:
county name (character variable, length 14)
state name (character variable)
land area of county, 1990 (square miles)
population of county, 1992
number of physicians, 1990
farm population, 1990
number of farms, 1987
number of acres devoted to farming, 1987
number of veterans, 1990
percent of veterans from Vietnam era, 1990
probability of selection
sampling weight, = 1/(100 × probability of selection)
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
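For a sample of 100 units drawn with probability proportional to size, the sampling weight is the reciprocal of 100 times the selection probability. The sketch below uses invented selection probabilities and an invented study variable; the names psii, wt, and y are illustrative and not confirmed column names.

```r
# Hypothetical selection probabilities and values for three sampled counties.
psii <- c(0.002, 0.0005, 0.01)
y    <- c(10, 3, 40)
n    <- 100   # number of counties in the sample

# Sampling weight: 1 / (n * probability of selection).
wt <- 1 / (n * psii)

# Weighted estimate of the population total from these three units.
est_total <- sum(wt * y)
wt
est_total
```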
Number of counties (or county equivalents; Alaska has boroughs, Louisiana has parishes, and some states have independent cities), population estimates for 2019, land area, and water area for the 50 states plus the District of Columbia. Total area for a state can be calculated by summing land area and water area. Source: Population estimates are from U.S. Census Bureau (2019). Land and water areas are from U.S. Census Bureau (2012).
data(statepps)
This data frame contains the following columns:
state name (character variable, length 20)
number of counties or county equivalents
population of state, 2019
land area of state (square kilometers)
water area of state (square kilometers)
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Data on call attempts from the Swedish Survey of Living Conditions.
data(swedishlcs)
This data frame contains the following columns:
call attempt number
response rate at call attempt (percent)
relative bias for variable benefits
relative bias for variable income
relative bias for variable employed
character variable, length 25: notes about data collection
The variable attempt takes on values 1-25 for the initial fieldwork period. Values 31-40 denote the follow-up period, and value 45 gives the final estimates. The gaps in the attempt variable allow one to see the separation of the periods on the graph.
Lundquist, P. and C.-E. Särndal (2013). Aspects of responsive design with applications to the Swedish Living Conditions Survey. Journal of Official Statistics 29 (4), 557–582.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
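The coding of attempt described above (1-25 initial fieldwork, 31-40 follow-up, 45 final estimates) can be turned into a period label, for example to color points on a graph. This is a small sketch on made-up attempt values; the variable name period is an assumption.

```r
# A few illustrative values of the attempt variable.
attempt <- c(1, 12, 25, 31, 40, 45)

# Intervals are (0, 25], (25, 40], (40, 45], which separates the three periods.
period <- cut(attempt,
              breaks = c(0, 25, 40, 45),
              labels = c("initial fieldwork", "follow-up", "final"))

data.frame(attempt, period)
```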
Selected variables from the Survey of Youth in Custody (Beck et al., 1988).
data(syc)
This data frame contains the following columns:
stratum number
psu number
= facility number for residents in strata 1-5
= person number for residents in strata 6-16
facility number
number of eligible residents in psu
final weight
random group number
age of resident (NA = missing)
race of resident
1 = White
2 = Black
3 = Asian/Pacific Islander
4 = American Indian, Alaska Native
5 = Other
NA = Missing
1 = Hispanic
2 = not Hispanic
NA = missing
highest grade attended before sent to correctional institution
0 = never attended school
1 - 12 = highest grade attended
13 = GED
14 = other
1 = male
2 = female
who did you live with most of the time you were growing up?
1 = mother only
2 = father only
3 = both mother and father
4 = grandparents
5 = other relatives
6 = friends
7 = foster home
8 = agency or institution
9 = someone else
NA = blank
has anyone in your family, such as your mother, father, brother, sister, ever served time in jail or prison?
1 = yes
2 = no
NA = don't know
most serious crime in current offense
1 = violent (e.g., murder, rape, robbery, assault)
2 = property (e.g. burglary, larceny, arson, fraud, motor vehicle theft)
3 = drug (drug possession or trafficking)
4 = public order (weapons violation, perjury, failure to appear in court)
5 = juvenile status offense (truancy, running away, incorrigible behavior)
NA = missing
ever put on probation or sent to correctional institution for violent offense
1 = yes
0 = no
number of times arrested (NA = missing)
number of times on probation (NA = missing)
number of times previously committed to correctional institution (NA = missing)
prior to being sent here did you ever serve time in a correctional institution?
1 = yes
2 = no
NA = missing
= 1 if previously arrested for violent offense, 0 otherwise
= 1 if previously arrested for property offense, 0 otherwise
= 1 if previously arrested for drug offense, 0 otherwise
= 1 if previously arrested for public order offense, 0 otherwise
= 1 if previously arrested for juvenile status offense, 0 otherwise
age first arrested (NA = missing)
did you use a weapon . . . for this incident? (1 = yes, 2 = no, NA = blank)
did you drink alcohol at all during the year before being sent here this time?
1 = yes
2 = no, didn't drink during year before
3 = no, don't drink at all
NA = missing
ever used illegal drugs
0 = no
1 = yes
NA = missing
Source: U.S. Department of Justice (1989). Strata 6-16 each contain one facility; the psus in those strata are residents. In strata 1-5, the psus are facilities. The numbers of facilities in the population for those five strata are: …, …, …, …, …. Eleven facilities are sampled from stratum 1 and seven facilities are sampled from each of strata 2 through 5.
Beck, A. J., S. A. Kline, and L. A. Greenfeld (1988). Survey of Youth in Custody. Technical Report NCJ-113365, Bureau of Justice Statistics, Washington, DC.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Selected variables from a study on elementary school teacher workload in Maricopa County, Arizona. Data courtesy of Rita Gnap (Gnap, 1995). The psu sizes are given in data teachmi. The large stratum had 245 schools; the small/medium stratum had 66 schools. Missing values are coded as NA.
data(teachers)
This data frame contains the following columns:
school district size, character variable:
large
sm/me
number of hours required to work at school per week
class size
minutes spent per week in school on preparation
minutes per week that a teacher's aide works with the teacher in the classroom
school identifier
Gnap, R. (1995). Teacher Load in Arizona Elementary School Districts in Maricopa County. Ph.D. dissertation. Tempe, AZ: Arizona State University.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Cluster sizes for data teachers.
data(teachmi)
This data frame contains the following columns:
school district size: large or sm/me
school identifier
number of teachers in that school
number of surveys returned from that school
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
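Because the cluster sizes live in a separate data set from the teacher responses, an analysis would typically join the two on the school identifier. The sketch below shows one way to do that with base R on made-up data; the column names school, hrwork, popteach, and ssteach are assumptions, as is the within-school weight.

```r
# Hypothetical stand-ins for the teacher-level and school-level data.
teachers_toy <- data.frame(school = c(11, 11, 12),
                           hrwork = c(35, 40, 38))
teachmi_toy  <- data.frame(school   = c(11, 12),
                           popteach = c(30, 22),   # teachers in the school
                           ssteach  = c(2, 1))     # surveys returned

# Attach the cluster sizes to each teacher record.
merged <- merge(teachers_toy, teachmi_toy, by = "school")

# Assumed within-school weight: teachers in school / surveys returned.
merged$within_wt <- merged$popteach / merged$ssteach
merged
```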
Data from a follow-up study of nonrespondents from Gnap (1995).
data(teachnr)
This data frame contains the following columns:
number of hours required to work at school per week
class size
minutes spent per week in school on preparation
minutes per week that a teacher's aide works with the teacher in the classroom
Gnap, R. (1995). Teacher Load in Arizona Elementary School Districts in Maricopa County. Ph.D. dissertation. Tempe, AZ: Arizona State University.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Artificial data used in exercises of Chapter 11.
data(uneqvar)
This data frame contains the following columns:
x: x value
y: y value
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
Vietnam-service data from Stockford and Page (1984).
data(vietnam)
This data frame contains the following columns:
APC stratum (character variable with values "Yes", "No", "NotAvail")
indicator variable for phase II sample
= 1 if in phase II sample
= 0 otherwise
= 1 if service in Vietnam
= 0 if service not in Vietnam
= NA if not in phase II sample
weight for phase I sample
conditional weight for phase II sample
= (phase I sample size in stratum) / (phase II sample size in stratum)
= NA for observations not in phase II sample
final weight for phase II sample
= phase1wt*phase2wt
= NA for observations not in phase II sample
number of observations in the observation's APC stratum that are in the phase I sample
number of observations in the observation's APC stratum that are in the phase II sample
Stockford, D. D. and W. F. Page (1984). Double sampling and the misclassification of Vietnam service. In Proceedings of the Social Statistics Section, pp. 261–264. Alexandria, VA: American Statistical Association.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
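The relationship between the phase I weight, the conditional phase II weight, and the final weight can be sketched on a small made-up data frame. The names phase1wt and phase2wt follow the description above; apc, p2sample, and finalwt, and all values, are assumptions for illustration.

```r
# Hypothetical two-phase sample with two APC strata.
toy <- data.frame(
  apc      = c("Yes", "Yes", "Yes", "No", "No"),
  p2sample = c(1, 0, 1, 1, 0),    # 1 if in the phase II sample
  phase1wt = rep(10, 5)
)

# Stratum sizes in the phase I and phase II samples.
n1 <- ave(rep(1, nrow(toy)), toy$apc, FUN = sum)
n2 <- ave(toy$p2sample, toy$apc, FUN = sum)

# Conditional phase II weight, NA outside the phase II sample.
toy$phase2wt <- ifelse(toy$p2sample == 1, n1 / n2, NA)

# Final weight = phase1wt * phase2wt (NA propagates automatically).
toy$finalwt <- toy$phase1wt * toy$phase2wt
toy
```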
Selected variables from the 2002 U.S. Vehicle Inventory and Use Survey (VIUS).
data(vius)
This data frame contains the following columns:
stratum number (contains all 255 strata)
state number
state name
type of truck, used in stratification
1. pickups
2. minivans, other light vans, and sport utility vehicles
3. light single-unit trucks with gross vehicle weight less than 26,000 pounds
4. heavy single-unit trucks with gross vehicle weight greater than or equal to 26,000 pounds
5. truck-tractors
column of sampling weights
body type of vehicle
01. Pickup
02. Minivan
03. Light van other than minivan
04. Sport utility
05. Armored
06. Beverage
07. Concrete mixer
08. Concrete pumper
09. Crane
10. Curtainside
11. Dump
12. Flatbed, stake, platform, etc.
13. Low boy
14. Pole, logging, pulpwood, or pipe
15. Service, utility
16. Service, other
17. Street sweeper
18. Tank, dry bulk
19. Tank, liquids or gases
20. Tow/Wrecker
21. Trash, garbage, or recycling
22. Vacuum
23. Van, basic enclosed
24. Van, insulated non-refrigerated
25. Van, insulated refrigerated
26. Van, open top
27. Van, step, walk-in, or multistop
28. Van, other
99. Other not elsewhere classified
model year
01. 2003, 2002
02. 2001
03. 2000
04. 1999
05. 1998
06. 1997
07. 1996
08. 1995
09. 1994
10. 1993
11. 1992
12. 1991
13. 1990
14. 1989
15. 1988
16. 1987
17. Pre-1987
Gross vehicle weight based on average reported weight
01. Less than 6,001 lbs
02. 6,001 to 8,500 lbs
03. 8,501 to 10,000 lbs
04. 10,001 to 14,000 lbs
05. 14,001 to 16,000 lbs
06. 16,001 to 19,500 lbs
07. 19,501 to 26,000 lbs
08. 26,001 to 33,000 lbs
09. 33,001 to 40,000 lbs
10. 40,001 to 50,000 lbs
11. 50,001 to 60,000 lbs
12. 60,001 to 80,000 lbs
13. 80,001 to 100,000 lbs
14. 100,001 to 130,000 lbs
15. 130,001 lbs. or more
number of miles driven during 2002
number of miles driven since manufactured
miles per gallon averaged during 2002, range from 0.3 to 35, NA denotes not reported or not applicable
operator classification with highest percent
1. Private
2. Motor carrier
3. Owner operator
4. Rental
5. Personal transportation
6. Not applicable (Vehicle not in use)
percent of miles driven as a motor carrier, NA denotes vehicle not in use
percent of miles driven as an owner operator, NA denotes vehicle not in use
percent of miles driven for personal transportation, NA denotes vehicle not in use
percent of miles driven as private (carry own goods or internal company business only), NA denotes vehicle not in use
percent of miles driven as rental, NA denotes vehicle not in use
type of transmission
1. Automatic
2. Manual
3. Semi-Automated Manual
4. Automated Manual
primary range of operation
1. Off-the-road
2. Less than 50 miles
3. 51 to 100 miles
4. 101 to 200 miles
5. 201 to 500 miles
6. 501 miles or more
7. Not reported
8. Not applicable (Vehicle not in use)
percent of annual miles accounted for with trips 50 miles or less from the home base
percent of annual miles accounted for with trips 51 to 100 miles from the home base
percent of annual miles accounted for with trips 101 to 200 miles from the home base
percent of annual miles accounted for with trips 201 to 500 miles from the home base
percent of annual miles accounted for with trips 501 or more miles from home base
make of vehicle
01. Chevrolet
02. Chrysler
03. Dodge
04. Ford
05. Freightliner
06. GMC
07. Honda
08. International
09. Isuzu
10. Jeep
11. Kenworth
12. Mack
13. Mazda
14. Mitsubishi
15. Nissan
16. Peterbilt
17. Plymouth
18. Toyota
19. Volvo
20. White
21. Western Star
22. White GMC
23. Other (domestic)
24. Other (foreign)
Business in which vehicle was most often used during 2002
01. For-hire transportation or warehousing
02. Vehicle leasing or rental
03. Agriculture, forestry, fishing, or hunting
04. Mining
05. Utilities
06. Construction
07. Manufacturing
08. Wholesale trade
09. Retail trade
10. Information services
11. Waste management, landscaping, or administrative/support services
12. Arts, entertainment, or recreation services
13. Accommodation or food services
14. Other services
NA. Not reported or not applicable
Source: Census:VIUS:2006. The data were downloaded from https://www.census.gov/svsd/www/vius in May 2006. The website from which the data were downloaded no longer exists, and online information about VIUS may now be found at https://www.bts.gov/vius, which provides a link to the archived 2002 data. The missing value of state for records with adm_state = 42 was recoded to "PA", the state that has code 42. This data set has 98,682 records, which may be too large for some software packages to handle; the file viusca is a smaller data set, with the same columns, containing only vehicles from California. The variable descriptions above are taken from the VIUS Data Dictionary.
Missing values are coded as NA. For some variables, the value is missing because the question is not applicable or the vehicle is not in use; see the individual variable descriptions.
Note that a new VIUS is planned for 2022, with data to be released in 2023; see https://www.bts.gov/vius.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
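The recoding of missing state names for records with adm_state = 42 can be illustrated on a toy data frame. This is only a sketch of that kind of fix, not the code used to prepare the package data; the example rows are invented.

```r
# Hypothetical records; adm_state and state follow the names used in the text.
toy <- data.frame(
  adm_state = c(6, 42, 42, 48),
  state     = c("CA", NA, "PA", "TX"),
  stringsAsFactors = FALSE
)

# Fill in "PA" only where state is missing and the state code is 42.
needs_fix <- is.na(toy$state) & toy$adm_state == 42
toy$state[needs_fix] <- "PA"
toy
```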
The data viusca is a smaller data set from vius with the same columns described below, containing only vehicles from California. The variable descriptions below are taken from the VIUS Data Dictionary.
data(viusca)
This data frame contains the following columns:
stratum number (contains all 255 strata)
state number
state name
type of truck, used in stratification
1. pickups
2. minivans, other light vans, and sport utility vehicles
3. light single-unit trucks with gross vehicle weight less than 26,000 pounds
4. heavy single-unit trucks with gross vehicle weight greater than or equal to 26,000 pounds
5. truck-tractors
column of sampling weights
body type of vehicle
01. Pickup
02. Minivan
03. Light van other than minivan
04. Sport utility
05. Armored
06. Beverage
07. Concrete mixer
08. Concrete pumper
09. Crane
10. Curtainside
11. Dump
12. Flatbed, stake, platform, etc.
13. Low boy
14. Pole, logging, pulpwood, or pipe
15. Service, utility
16. Service, other
17. Street sweeper
18. Tank, dry bulk
19. Tank, liquids or gases
20. Tow/Wrecker
21. Trash, garbage, or recycling
22. Vacuum
23. Van, basic enclosed
24. Van, insulated non-refrigerated
25. Van, insulated refrigerated
26. Van, open top
27. Van, step, walk-in, or multistop
28. Van, other
99. Other not elsewhere classified
model year
01. 2003, 2002
02. 2001
03. 2000
04. 1999
05. 1998
06. 1997
07. 1996
08. 1995
09. 1994
10. 1993
11. 1992
12. 1991
13. 1990
14. 1989
15. 1988
16. 1987
17. Pre-1987
Gross vehicle weight based on average reported weight
01. Less than 6,001 lbs
02. 6,001 to 8,500 lbs
03. 8,501 to 10,000 lbs
04. 10,001 to 14,000 lbs
05. 14,001 to 16,000 lbs
06. 16,001 to 19,500 lbs
07. 19,501 to 26,000 lbs
08. 26,001 to 33,000 lbs
09. 33,001 to 40,000 lbs
10. 40,001 to 50,000 lbs
11. 50,001 to 60,000 lbs
12. 60,001 to 80,000 lbs
13. 80,001 to 100,000 lbs
14. 100,001 to 130,000 lbs
15. 130,001 lbs. or more
number of miles driven during 2002
number of miles driven since manufactured
miles per gallon averaged during 2002, range from 0.3 to 35, NA denotes not reported or not applicable
operator classification with highest percent
1. Private
2. Motor carrier
3. Owner operator
4. Rental
5. Personal transportation
6. Not applicable (Vehicle not in use)
percent of miles driven as a motor carrier, NA denotes vehicle not in use
percent of miles driven as an owner operator, NA denotes vehicle not in use
percent of miles driven for personal transportation, NA denotes vehicle not in use
percent of miles driven as private (carry own goods or internal company business only), NA denotes vehicle not in use
percent of miles driven as rental, NA denotes vehicle not in use
type of transmission
1. Automatic
2. Manual
3. Semi-Automated Manual
4. Automated Manual
primary range of operation
1. Off-the-road
2. Less than 50 miles
3. 51 to 100 miles
4. 101 to 200 miles
5. 201 to 500 miles
6. 501 miles or more
7. Not reported
8. Not applicable (Vehicle not in use)
percent of annual miles accounted for with trips 50 miles or less from the home base
percent of annual miles accounted for with trips 51 to 100 miles from the home base
percent of annual miles accounted for with trips 101 to 200 miles from the home base
percent of annual miles accounted for with trips 201 to 500 miles from the home base
percent of annual miles accounted for with trips 501 or more miles from home base
make of vehicle
01. Chevrolet
02. Chrysler
03. Dodge
04. Ford
05. Freightliner
06. GMC
07. Honda
08. International
09. Isuzu
10. Jeep
11. Kenworth
12. Mack
13. Mazda
14. Mitsubishi
15. Nissan
16. Peterbilt
17. Plymouth
18. Toyota
19. Volvo
20. White
21. Western Star
22. White GMC
23. Other (domestic)
24. Other (foreign)
Business in which vehicle was most often used during 2002
01. For-hire transportation or warehousing
02. Vehicle leasing or rental
03. Agriculture, forestry, fishing, or hunting
04. Mining
05. Utilities
06. Construction
07. Manufacturing
08. Wholesale trade
09. Retail trade
10. Information services
11. Waste management, landscaping, or administrative/support services
12. Arts, entertainment, or recreation services
13. Accommodation or food services
14. Other services
NA. Not reported or not applicable
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
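Several of the variables above are stored as numeric codes (for example, type of transmission, coded 1 to 4). The following is a minimal sketch of attaching the documented labels to such a code with `factor()`. It uses a small toy data frame rather than the package data, and the column name `transmission_code` is a hypothetical stand-in, not necessarily the name used in the dataset.

```r
# Toy data standing in for a few rows of the vehicle data;
# `transmission_code` is a hypothetical column name used only for illustration.
toy <- data.frame(transmission_code = c(1, 2, 1, 4, 3, 1))

# Labels copied from the documented coding of type of transmission.
transmission_labels <- c("Automatic", "Manual",
                         "Semi-Automated Manual", "Automated Manual")

# Convert the numeric codes to a labeled factor.
toy$transmission <- factor(toy$transmission_code,
                           levels = 1:4,
                           labels = transmission_labels)

# Unweighted counts per category in the toy data.
transmission_counts <- table(toy$transmission)
print(transmission_counts)
```

The counts here are unweighted tallies of the toy rows; estimates for the population would need the sampling weights and design information.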
Selected variables from the Arizona State University Winter Closure Survey, taken in January 1995 (provided courtesy of the ASU Office of University Evaluation). The survey investigated the attitudes and opinions of university employees toward the closing of the university (for budgetary reasons) between December 25 and January 1. For the yes/no questions, the responses are coded as 1 = No, 2 = Yes. The variables treatsta and treatme are coded as 1 = strongly agree, 2 = agree, 3 = undecided, 4 = disagree, 5 = strongly disagree. The variables process and satbreak are coded as 1 = very satisfied, 2 = satisfied, 3 = undecided, 4 = dissatisfied, 5 = very dissatisfied. Variables ownsupp through offclose are coded 1 if the respondent checked that the statement applied to them, and 2 if the statement was not checked.
data(winter)
This data frame contains the following columns:
Stratum number
1 = faculty
2 = classified staff
3 = administrative staff
4 = academic professional
Number of years worked at ASU
1 = 1-2 years
2 = 3-4 years
3 = 5-9 years
4 = 10-14 years
5 = 15 or more years
In the past, have you usually taken vacation days the entire period between December 25 and January 1?
Did you work on campus during Winter Break Closure?
Did the Winter Break Closure cause you any difficulty/concerns?
Did the Winter Break Closure negatively affect your work productivity?
I was unable to obtain staff support in my department/office
I was unable to obtain staff support in other departments/offices
I was unable to access computers, copy machine, etc. in my department/office
I was unable to endure environmental conditions, e.g., not properly climatized
I was unable to access university services necessary to my work
I was unable to work on my assignments because I work in another department/office
I was unable to work on my assignments because my office was closed
Compared to other departments/offices, I feel staff in my department/office were treated fairly
Compared to other people working in my department/office, I feel I was treated fairly
How satisfied are you with the process used to inform staff about Winter Break Closure?
How satisfied are you with the fact that ASU had a Winter Break Closure this year?
Would you want to have Winter Break Closure again?
Missing values are coded as NA.
Lohr (2021), Sampling: Design and Analysis, 3rd Edition. Boca Raton, FL: CRC Press.
Lu and Lohr (2021), R Companion for Sampling: Design and Analysis, 3rd Edition, 1st Edition. Boca Raton, FL: CRC Press.
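The coding described above can be turned into labeled factors before tabulating. The following is a minimal sketch on a toy data frame, not the package data. The column names treatsta, satbreak, and ownsupp are taken from the description, the values are made up for illustration, and the derived column names (`treatsta_f`, `satbreak_f`, `ownsupp_checked`) are hypothetical.

```r
# Toy rows using the documented codes; the values are invented for illustration.
toy <- data.frame(
  treatsta = c(1, 2, 5, NA),  # 1 = strongly agree ... 5 = strongly disagree
  satbreak = c(2, 4, 1, 3),   # 1 = very satisfied ... 5 = very dissatisfied
  ownsupp  = c(1, 2, 2, 1)    # 1 = statement checked, 2 = not checked
)

agree_labels <- c("strongly agree", "agree", "undecided",
                  "disagree", "strongly disagree")
satisfied_labels <- c("very satisfied", "satisfied", "undecided",
                      "dissatisfied", "very dissatisfied")

# Attach labels to the five-point scales; NA codes stay NA.
toy$treatsta_f <- factor(toy$treatsta, levels = 1:5, labels = agree_labels)
toy$satbreak_f <- factor(toy$satbreak, levels = 1:5, labels = satisfied_labels)

# Turn the 1/2 checkbox coding into a logical indicator.
toy$ownsupp_checked <- toy$ownsupp == 1

# Unweighted share of toy respondents who checked the statement.
prop_checked <- mean(toy$ownsupp_checked, na.rm = TRUE)
print(prop_checked)
```

Because the survey is stratified, a proportion computed this way ignores the design; it is shown only to illustrate the recoding.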