Title: | Data Science Labs |
---|---|
Description: | Datasets and functions that can be used for data analysis practice, homework and projects in data science courses and workshops. 26 datasets are available for case studies in data visualization, statistical inference, modeling, linear regression, data wrangling and machine learning. |
Authors: | Rafael A. Irizarry, Amy Gill |
Maintainer: | Rafael A. Irizarry <[email protected]> |
License: | Artistic-2.0 |
Version: | 0.8.0 |
Built: | 2024-10-30 06:50:29 UTC |
Source: | CRAN |
Admission data for six majors for the fall of 1973, often used as an example of Simpson's paradox.
admissions
An object of class "data.frame".
major. The major or university department.
gender. Men or women.
admitted. Percent of students admitted.
applicants. Total number of applicants.
PJ Bickel, EA Hammel, and JW O'Connell. Science (1975)
admissions
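The paradox mentioned in the description can be sketched without the package, using base R's `UCBAdmissions` table. Treating that table as the same 1973 applications summarized in `admissions` is an assumption, and the variable names below are illustrative only.

```r
# UCBAdmissions (base R 'datasets' package) is a 3-way table: Admit x Gender x Dept.
tab <- UCBAdmissions

# Pooled over departments: percent admitted by gender.
by_gender <- apply(tab, c(1, 2), sum)            # Admit x Gender counts
overall <- 100 * by_gender["Admitted", ] / colSums(by_gender)

# Within each department: percent admitted by gender.
admitted <- tab["Admitted", , ]                  # Gender x Dept counts
applicants <- apply(tab, c(2, 3), sum)           # Gender x Dept totals
by_dept <- 100 * admitted / applicants

round(overall, 1)
round(by_dept, 1)
```

Men have the higher pooled admission rate, yet women have the higher rate in most individual departments, which is the reversal that Simpson's paradox describes.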
Biopsy features for classification of 569 malignant (cancer) and benign (not cancer) breast masses.
brca
An object of class list.
Features were computationally extracted from digital images of fine needle aspirate biopsy slides. Features correspond to properties of cell nuclei, such as size, shape and regularity. The mean, standard error, and worst value of each of 10 nuclear parameters are reported for a total of 30 features.
This is a classic dataset for training and benchmarking machine learning algorithms.
y. The outcomes. A factor with two levels denoting whether a mass is malignant ("M") or benign ("B").
x. The predictors. A matrix with the mean, standard error and worst value of each of 10 nuclear measurements on the slide, for 30 total features per biopsy:
radius. Nucleus radius (mean of distances from center to points on perimeter).
texture. Nucleus texture (standard deviation of grayscale values).
perimeter. Nucleus perimeter.
area. Nucleus area.
smoothness. Nucleus smoothness (local variation in radius lengths).
compactness. Nucleus compactness (perimeter^2/area - 1).
concavity. Nucleus concavity (severity of concave portions of the contour).
concave_pts. Number of concave portions of the nucleus contour.
symmetry. Nucleus symmetry.
fractal_dim. Nucleus fractal dimension ("coastline approximation" -1).
UCI Machine Learning Repository
table(brca$y)
dim(brca$x)
head(brca$x)
Brexit (EU referendum) poll outcomes for 127 polls from January 2016 to the referendum date on June 23, 2016.
brexit_polls
An object of class "data.frame".
startdate. Start date of poll.
enddate. End date of poll.
pollster. Pollster conducting the poll.
poll_type. Online or telephone poll.
samplesize. Sample size of poll.
remain. Proportion voting Remain.
leave. Proportion voting Leave.
undecided. Proportion of undecided voters.
spread. Spread calculated as remain - leave.
head(brexit_polls)
Probability of death within 1 year by age and sex in the United States in 2015.
death_prob
An object of class "data.frame".
age. Age strata, with each year a different stratum.
sex. Male or Female.
prob. Probability of death within 1 year given exact age and sex.
head(death_prob)
Divorce rates in Maine and per capita consumption of margarine in the US.
divorce_margarine
An object of class "data.frame".
divorce_rate_maine. Divorce per 1000 in Maine.
margarine_consumption_per_capita. US per capita consumption of margarine in pounds.
year. Year.
with(divorce_margarine, plot(margarine_consumption_per_capita, divorce_rate_maine))
This function sets a ggplot2 theme used throughout the data science labs. It can be called without arguments.
ds_theme_set( new = "theme_bw", args = NULL, base_size = 11, bold_title = TRUE, ... )
new | a prebuilt ggplot2 theme. Defaults to "theme_bw". |
args | the arguments to be passed along to the ggplot2 theme function. Defaults to NULL. |
base_size | if |
bold_title | if TRUE, sets titles to be bold |
... | additional arguments to be used by theme |
None
library(ggplot2)
ds_theme_set()
qplot(hp, mpg, data=mtcars, color=am, facets=gear~cyl, main="Scatterplots of MPG vs. Horsepower", xlab="Horsepower", ylab="Miles per Gallon")
Health and income outcomes for 184 countries from 1960 to 2016. Also includes two character vectors, oecd and opec, with the names of OECD and OPEC countries from 2016.
gapminder
An object of class "data.frame".
country.
year.
infant_mortality. Infant deaths per 1000.
life_expectancy. Life expectancy in years.
fertility. Average number of children per woman.
population. Country population.
gdp. GDP according to the World Bank.
continent.
region. Geographical region.
head(gapminder)
print(oecd)
print(opec)
Concentrations of the three main greenhouse gases carbon dioxide, methane and nitrous oxide. Measurements are from the Law Dome Ice Core in Antarctica. Selected measurements are provided every 20 years from 1-2000 CE.
greenhouse_gases
An object of class "data.frame".
year. Year (CE).
gas. Gas being measured: carbon dioxide ('CO2'), methane ('CH4') or nitrous oxide ('N2O').
concentration. Gas concentration in ppm by volume ('CO2') or ppb by volume ('CH4', 'N2O').
MacFarling Meure et al. 2006 via NOAA.
head(greenhouse_gases)
Self-reported heights and sex. The heights were converted to inches from the original data included in reported_heights.
heights
An object of class "data.frame".
sex. A factor with the self-reported sex.
height. A numeric vector with self-reported heights in inches.
See reported_heights for the original data source.
head(heights)
Concentration of carbon dioxide in ppm by volume from direct measurements at Mauna Loa (1959-2018 CE) and indirect measurements from a series of Antarctic ice cores (approx. -800,000-2001 CE).
historic_co2
An object of class "data.frame".
year. Year (CE).
co2. Carbon dioxide concentration in ppm by volume.
source. Source of carbon dioxide measurement: direct CO2 annual mean concentrations from Mauna Loa ('Mauna Loa') or indirect CO2 concentrations from air trapped in ice cores ('Ice Cores').
Mauna Loa data from NOAA. Ice core data from Bereiter et al. 2015 via NOAA.
head(historic_co2)
Body weights, bone density, and percent fat for mice under two diets: chow and high fat. Data provided by Karen Svenson from Jackson Laboratories. Funding to generate these data came from NIH grant P50 GM070683 awarded to Gary Churchill.
mice_weights
An object of class "data.frame".
body_weight. Body weight in grams at 19 weeks.
bone_density. Bone density.
percent_fat. Percent fat.
sex. The sex of the mice.
diet. The diet of the mice: chow or high fat.
gen. These are outbred mice. This variable denotes the generation.
litter. Which of two litters mice belong to.
Karen Svenson, Daniel M. Gatti, and Gary Churchill from Jackson Laboratories.
Daniel M. Gatti, Petr Simecek, Lisa Somes, Clifton T. Jeffrey, Matthew J. Vincent, Kwangbom Choi, Xingyao Chen, Gary A. Churchill, and Karen L. Svenson. "The Effects of Sex and Diet on Physiology and Liver Gene Expression in Diversity Outbred Mice". bioRxiv 098657; doi:10.1101/098657
mice_weights |> head()
with(mice_weights, table(sex, diet))
We only include a randomly selected set of 1s, 2s and 7s along with the two predictors based on the proportion of dark pixels in the upper left and lower right quadrants respectively. The dataset is divided into training and test sets.
mnist_127
An object of class list.
train. A data frame containing training data: labels and predictors.
test. A data frame containing test data: labels and predictors.
index_train. The index of the original mnist training data used for the training set.
index_test. The index of the original mnist test data used for the test set.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998.
See also: read_mnist(), mnist_27
with(mnist_127$train, plot(x_1, x_2, col = as.numeric(y)))
We only include a randomly selected set of 2s and 7s along with the two predictors based on the proportion of dark pixels in the upper left and lower right quadrants respectively. The dataset is divided into training and test sets.
mnist_27
An object of class list.
train. A data frame containing training data: labels and predictors.
test. A data frame containing test data: labels and predictors.
index_train. The index of the original mnist training data used for the training set.
index_test. The index of the original mnist test data used for the test set.
true_p. A data.frame containing the two predictors x_1 and x_2 and the conditional probability of being a 7 for x_1, x_2.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998.
See also: read_mnist()
with(mnist_27$train, plot(x_1, x_2, col = as.numeric(y)))
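The two predictors can be sketched as follows for a single 28 x 28 image. This is a minimal illustration under stated assumptions, not the code used to build `mnist_27`: the helper name `quadrant_features`, the darkness threshold of 200, the reading of "proportion" as the share of an image's dark pixels that fall in each quadrant, and the row/column orientation are all assumptions.

```r
# Hypothetical helper: share of an image's dark pixels that fall in the
# upper left and lower right quadrants. Threshold and orientation are assumptions.
quadrant_features <- function(img, threshold = 200) {
  stopifnot(all(dim(img) == c(28, 28)))
  dark <- img > threshold                 # logical matrix of "dark" pixels
  total_dark <- sum(dark)
  c(x_1 = sum(dark[1:14, 1:14]) / total_dark,     # upper left quadrant
    x_2 = sum(dark[15:28, 15:28]) / total_dark)   # lower right quadrant
}

# Synthetic image: a dark diagonal stroke on a blank background.
img <- matrix(0, nrow = 28, ncol = 28)
diag(img) <- 255
quadrant_features(img)
```

For this synthetic stroke, half of the dark pixels lie in each of the two quadrants, so both features equal 0.5.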
MovieLens Latest Dataset (Small)
movielens
An object of class data.frame.
movieId. Unique ID for the movie.
title. Movie title (not unique).
year. Year the movie was released.
genres. Genres associated with the movie.
userId. Unique ID for the user.
rating. A rating between 0 and 5 for the movie.
timestamp. Date and time the rating was given.
https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=https://dx.doi.org/10.1145/2827872
head(movielens)
Gun murder data from FBI reports. Also contains the population of each state.
murders
An object of class "data.frame".
state. US state
abb. Abbreviation of US state
region. Geographical US region
population. State population (2010)
total. Number of gun murders in state (2010)
print(murders)
This dataset was randomly generated.
na_example
An object of class "integer".
print(sum(is.na(na_example)))
Distribution of scores for New York City Regents algebra, global history, biology, English, and U.S. history exams. These data were used to make a New York Times plot.
nyc_regents_scores
An object of class "data.frame".
score. Test score from 0 to 100.
integrated_algebra. Score frequency on Algebra exam.
global_history. Score frequency on global history exam.
living_environment. Score frequency on biology exam.
english. Score frequency on English exam.
us_history. Score frequency on U.S. history exam.
New York City Department of Education via Amanda Cox.
print(nyc_regents_scores)
Composition in percentage of eight fatty acids found in the lipid fraction of 572 Italian olive oils
olive
An object of class "data.frame".
region. General region of Italy.
area. Area of Italy.
palmitic. Percent palmitic acid of sample.
palmitoleic. Percent palmitoleic acid of sample.
stearic. Percent stearic acid of sample.
oleic. Percent oleic acid of sample.
linoleic. Percent linoleic acid of sample.
linolenic. Percent linolenic acid of sample.
arachidic. Percent arachidic acid of sample.
eicosenoic. Percent eicosenoic acid of sample.
J. Zupan, and J. Gasteiger. Neural Networks in Chemistry and Drug Design.
head(olive)
This dataset was randomly generated with a normal distribution (average: 5 feet 9 inches, standard deviation: 3 inches). One value was changed to be mistakenly reported in centimeters rather than feet.
outlier_example
An object of class "numeric".
mean(outlier_example)
median(outlier_example)
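A vector like this one could be generated as sketched below. The sample size, the seed, and which entry gets converted are assumptions chosen for illustration; the package's actual generating code is not shown in this manual.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
n <- 500                           # hypothetical sample size
# Heights in feet: average 5 ft 9 in, standard deviation 3 in.
x <- rnorm(n, mean = 5 + 9 / 12, sd = 3 / 12)

# Re-express one value in centimeters (1 foot = 30.48 cm) to mimic the reporting error.
x[1] <- x[1] * 30.48

# The mean is pulled toward the outlier; the median is barely affected.
c(mean = mean(x), median = median(x))

# Flag values that are implausible as heights in feet.
which(x > 10)
```

Comparing the mean with the median, as in the example above, is a quick way to notice that a single extreme value is present.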
Data from different pollsters for the popular vote between Obama and McCain in the 2008 presidential election.
polls_2008
An object of class data.frame.
day. Days until election day. Negative numbers are reported so that days can increase up to 0, which is election day.
margin. Average difference between Obama and McCain for that day.
https://web.archive.org/web/20161108190914/http://www.pollster.com/08USPresGEMvO-2.html
with(polls_2008, plot(day, margin))
Poll results from US 2016 presidential elections aggregated from HuffPost Pollster, RealClearPolitics, polling firms, and news reports.
The dataset also includes election results (popular vote) and electoral college votes in results_us_election_2016.
polls_us_election_2016
An object of class "data.frame".
state. State in which poll was taken. 'U.S' is for national polls.
startdate. Poll's start date.
enddate. Poll's end date.
pollster. Pollster conducting the poll.
grade. Grade assigned by fivethirtyeight to pollster.
samplesize. Sample size.
population. Type of population being polled.
rawpoll_clinton. Percentage for Hillary Clinton.
rawpoll_trump. Percentage for Donald Trump.
rawpoll_johnson. Percentage for Gary Johnson.
rawpoll_mcmullin. Percentage for Evan McMullin.
adjpoll_clinton. Fivethirtyeight adjusted percentage for Hillary Clinton.
adjpoll_trump. Fivethirtyeight adjusted percentage for Donald Trump.
adjpoll_johnson. Fivethirtyeight adjusted percentage for Gary Johnson.
adjpoll_mcmullin. Fivethirtyeight adjusted percentage for Evan McMullin.
The original csv file used to create polls_us_election_2016 is here: https://projects.fivethirtyeight.com/general-model/president_general_polls_2016.csv
The data for results_us_election_2016 is from Ballotpedia and can be found here: https://docs.google.com/spreadsheets/d/1zxyOQDjNOJS_UkzerorUCf2OAdcMcIQEwRciKuYBIZ4/pubhtml?widget=true&headers=false#gid=658726802/
head(polls_us_election_2016)
A data frame with Puerto Rico daily mortality counts from 2015 to May 2018. This period includes the day Hurricane Maria made landfall, 2017-09-20.
pr_death_counts
An object of class data.frame.
date. Date of the count.
deaths. Number of deaths reported that day.
Puerto Rico Demographic Registry. Data were extracted from the PDF provided in system.file("extdata", "RD-Mortality-Report_2015-18-180531.pdf", package = "dslabs").
with(pr_death_counts, plot(date, deaths))
This function downloads the mnist training and test data available here http://yann.lecun.com/exdb/mnist/
read_mnist( path = NULL, download = FALSE, destdir = tempdir(), url = "https://www2.harvardx.harvard.edu/courses/IDS_08_v2_03/", keep.files = TRUE )
path | A character giving the full path of the directory to look for files. It assumes the filenames are the same as the originals. If path is |
download | If |
destdir | A character giving the full path of the directory in which to save the downloaded files. The default is to use a temporary directory. |
url | A character giving the URL from which to download files. Currently a copy of the data is available at https://www2.harvardx.harvard.edu/courses/IDS_08_v2_03/, the current default URL. |
keep.files | A logical. If |
A list with two components: train and test. Each of these is a list with two components: images and labels. The images component is a matrix with each column representing one of the 28*28 = 784 pixels. The values are integers between 0 and 255 representing grey scale. The labels component is a vector representing the digit shown in the image.
Note that the data is over 10MB, so the download may take several seconds depending on internet speed. If you plan to load the data more than once we recommend you download the data once and read it from disk in the future. See examples.
Samuela Pollack
Rafael A. Irizarry, [email protected]
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998.
# this can take several seconds, depending on internet speed.
## Not run:
mnist <- read_mnist()
i <- 5
image(1:28, 1:28, matrix(mnist$test$images[i,], nrow=28)[ , 28:1], col = gray(seq(0, 1, 0.05)), xlab = "", ylab="")
## the label for this image is:
mnist$test$labels[i]
## End(Not run)
# You can download and save the data to a directory like this:
## Not run:
mnist <- read_mnist(download = TRUE, destdir = "~/Downloads")
# and then, going forward, read from disk
mnist <- read_mnist("~/Downloads")
## End(Not run)
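The reshaping step in the example above can be tried without downloading anything. The sketch below uses a synthetic vector of 784 values as a stand-in for one row of the images matrix; the random pixel values are an assumption and do not depict a digit.

```r
# Synthetic stand-in for one row of the images matrix: 784 integers in 0..255.
set.seed(1)                        # arbitrary seed, for reproducibility only
pixels <- sample(0:255, 28 * 28, replace = TRUE)

# Fill a 28 x 28 matrix column by column, then reverse the column order,
# which is the same transformation applied before image() in the example above.
img <- matrix(pixels, nrow = 28)[, 28:1]

dim(img)
range(img)
```

After the column reversal, the first pixel of the row ends up at `img[1, 28]`.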
Students were asked to report their height (in inches) and sex in an anonymous online form. This table includes the results from combining data from four courses.
reported_heights
An object of class "data.frame".
time_stamp. Time and date of the entry.
sex. Sex as reported by the students.
height. Height as reported by the student in a free-text box.
head(reported_heights)
Table S1 from the paper titled "Gender contributes to personal research funding success in The Netherlands".
research_funding_rates
An object of class "data.frame".
discipline. Research area discipline.
applications_total. Total applications.
applications_men. Total applications by men.
applications_women. Total applications by women.
awards_total. Total awards.
awards_men. Total awards received by men.
awards_women. Total awards received by women.
success_rates_total. Overall success rate.
success_rates_men. Success rate for men.
success_rates_women. Success rate for women.
van der Lee R, Ellemers N. Gender contributes to personal research funding success in The Netherlands. Proc Natl Acad Sci U S A. 2015 Oct 6;112(40):12349-53. doi: 10.1073/pnas.1510159112. Epub 2015 Sep 21. PMID: 26392544; PMCID: PMC4603485.
research_funding_rates
# The raw data for this table is available from
invisible(raw_data_research_funding_rates)
The function simulates a falling object's position. Default parameters are for dropping a weight from the Tower of Pisa.
rfalling_object( n = 14, d_0 = 55.86, v_0 = 0, g = -9.8, scale = 1, time = seq(0, 3.25, length.out = n), error_distribution = c("rnorm", "rt"), df = 3 )
n | Sample size |
d_0 | Height from which object will fall in meters. |
v_0 | Initial velocity with which object will fall in meters per second. |
g | Gravitational acceleration in meters per second per second. Defaults to -9.8. |
scale | The measurement errors will be multiplied by this constant. |
time | Numeric vector of times, in seconds, at which measurements were taken. |
error_distribution | Character. Either "rnorm" or "rt". |
df | If using t-distribution, the degrees of freedom. |
A data.frame with the time, the distance travelled, and the observed distance.
dat <- rfalling_object()
with(dat, plot(time, observed_distance))
with(dat, lines(time, distance, col = "blue"))
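The simulation can be sketched with constant-acceleration kinematics plus measurement error. The function below, `simulate_fall`, is a hypothetical re-implementation for illustration and is not the package's source; in particular, how `scale` and `df` enter the error term is an assumption based on the argument descriptions.

```r
# Hypothetical re-implementation for illustration; not the package's source code.
simulate_fall <- function(n = 14, d_0 = 55.86, v_0 = 0, g = -9.8, scale = 1,
                          time = seq(0, 3.25, length.out = n),
                          error_distribution = c("rnorm", "rt"), df = 3) {
  error_distribution <- match.arg(error_distribution)
  # Noise-free position: d_0 + v_0 * t + g * t^2 / 2.
  distance <- d_0 + v_0 * time + 0.5 * g * time^2
  # Measurement error: standard normal or t-distributed, multiplied by `scale`.
  error <- if (error_distribution == "rnorm") {
    rnorm(length(time))
  } else {
    rt(length(time), df = df)
  }
  data.frame(time = time, distance = distance,
             observed_distance = distance + scale * error)
}

set.seed(1)                        # arbitrary seed, for reproducibility only
dat <- simulate_fall()
head(dat, 3)
```

With the defaults, the noise-free position starts at 55.86 meters and is close to the ground after 3.25 seconds, which is why the default time grid ends there.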
Physical properties of selected stars, including luminosity, temperature, and spectral class.
stars
An object of class "data.frame".
star. Name of star.
magnitude. Absolute magnitude of the star, which is a function of the star's luminosity and distance to the star.
temp. Surface temperature in degrees Kelvin (K).
type. Spectral class of star in the OBAFGKM system.
Compiled from multiple open-access references on VizieR.
head(stars)
The function shows a plot of a random sample drawn from an urn with blue and red beads. The sample is taken with replacement. The proportion of blue beads is not shown so that students can try to estimate it.
take_poll(n, ...)
n | Sample size |
... | additional arguments to be used by the function |
None
take_poll(25)
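The sampling behind such a plot, and the estimate students are asked to form, can be sketched in base R. The true proportion `p_blue` below is a made-up value for illustration; the proportion used inside `take_poll` is deliberately not documented.

```r
set.seed(1)                  # arbitrary seed, for reproducibility only
p_blue <- 0.45               # hypothetical true proportion of blue beads
n <- 25                      # sample size, as in take_poll(25)

# Draw n beads with replacement from an urn with the given color proportions.
draws <- sample(c("blue", "red"), size = n, replace = TRUE,
                prob = c(p_blue, 1 - p_blue))

# Point estimate of the proportion of blue beads and its estimated standard error.
p_hat <- mean(draws == "blue")
se_hat <- sqrt(p_hat * (1 - p_hat) / n)
c(estimate = p_hat, se = se_hat)
```

Rerunning the sketch with a larger `n` shows the standard error shrinking, which is the usual motivation for taking larger polls.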
Annual mean global temperature anomaly on land, sea and combined, 1880-2018. Annual global carbon emissions, 1751-2014.
temp_carbon
An object of class "data.frame".
year. Year (CE).
temp_anomaly. Global annual mean temperature anomaly in degrees Celsius relative to the 20th century mean temperature. 1880-2018.
land_anomaly. Annual mean temperature anomaly on land in degrees Celsius relative to the 20th century mean temperature. 1880-2018.
ocean_anomaly. Annual mean temperature anomaly over ocean in degrees Celsius relative to the 20th century mean temperature. 1880-2018.
carbon_emissions. Annual carbon emissions in millions of metric tons of carbon. 1751-2014.
NOAA and Boden, T.A., G. Marland, and R.J. Andres (2017) via CDIAC
head(temp_carbon)
This is a subset of the data provided by the tissuesGeneExpression package available from the genomicsclass GitHub repository. The predictors are gene expression measurements from 500 genes that are a random subset of the original 22,215.
tissue_gene_expression
An object of class list.
The example dataset is recommended for illustrating clustering and machine learning techniques.
x. The predictors composed of 500 genes. Each row is a gene expression profile and each column is a different gene. The column names are the gene symbols.
y. The outcomes. A character vector representing the tissue. One of seven tissue types.
https://github.com/genomicsclass/tissuesGeneExpression
table(tissue_gene_expression$y)
dim(tissue_gene_expression$x)
This dataset contains all tweets from Donald Trump's Twitter account from 2009 to 2017. Additionally, the results of a sentiment analysis, conducted on tweets from the campaign period (2015-06-17 to 2016-11-08), are included in sentiment_counts.
trump_tweets
An object of class "data.frame".
source. Device or service used to compose tweet.
id_str. Tweet ID.
text. Tweet.
created_at. Date and time the tweet was posted.
retweet_count. How many times tweet had been retweeted at time dataset was created.
in_reply_to_user_id_str. If a reply, the user id of person being replied to.
favorite_count. Number of times tweet had been favored at time dataset was created.
is_retweet. A logical telling us if it is a retweet or not.
The Trump Twitter Archive: https://www.thetrumparchive.com/
head(trump_tweets)
Yearly counts for Hepatitis A, Measles, Mumps, Pertussis, Polio, Rubella, and Smallpox for US states. Original data courtesy of Tycho Project (http://www.tycho.pitt.edu/).
us_contagious_diseases
An object of class "data.frame".
disease. A factor containing disease names.
state. A factor containing state names.
year.
weeks_reporting. Number of weeks counts were reported that year.
count. Total number of reported cases.
population. State population, interpolated for non-census years.
Willem G. van Panhuis, John Grefenstette, Su Yon Jung, Nian Shong Chok, Anne Cross, Heather Eng, Bruce Y Lee, Vladimir Zadorozhny, Shawn Brown, Derek Cummings, Donald S. Burke. Contagious Diseases in the United States from 1888 to the present. NEJM 2013; 369(22): 2152-2158.
head(us_contagious_diseases)
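Because states differ in population and in how many weeks they reported, counts are often converted to a yearly rate before comparison. The sketch below shows one possible adjustment on toy rows; every value in `toy` is made up for illustration, and the scaling to a 52-week year is an assumption rather than something this manual prescribes.

```r
# Toy rows with the documented columns; all values are made up for illustration.
toy <- data.frame(
  state = c("A", "B"),
  year = c(1950, 1950),
  weeks_reporting = c(52, 26),
  count = c(1000, 400),
  population = c(2e6, 1e6)
)

# Cases per 10,000 people, scaled up to a full 52-week year when only part of
# the year was reported. Rows with zero reporting weeks would need separate handling.
toy$rate <- with(toy, count / population * 10000 * 52 / weeks_reporting)
toy
```

In this toy example, state B reports fewer cases than state A but ends up with the higher rate once population and reporting weeks are taken into account.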