Title: Data Sets, Etc. for the Text "Using R for Introductory Statistics", Second Edition
Description: A collection of data sets to accompany the textbook "Using R for Introductory Statistics," second edition.
Authors: John Verzani <[email protected]>
Maintainer: John Verzani <[email protected]>
License: GPL (>= 2)
Version: 2.0-7
Built: 2024-12-06 06:51:50 UTC
Source: CRAN
For years people have tried to estimate the age of the universe. This data set collects a few estimates, starting with lower bounds based on estimates of the earth's age.
data(age.universe)
A data frame with 16 observations on the following 4 variables.
a numeric vector
a numeric vector
a numeric vector
Short description of source
In the last two decades estimates for the age of the universe have been greatly improved. As of 2013, the best guess is 13.7 billion years with a margin of error of 1 percent. This last estimate is found by WMAP using microwave background radiation. Previous estimates were also based on estimates of Hubble's constant, and dating of old stars.
This data was collected from the following web sites: https://arxiv.org/abs/1212.5225, https://case.edu/pubaff/univcomm/2003/1-03/kraussuniverse.html (now off-line), https://www.astro.ucla.edu/~wright/age.html, http://www.lhup.edu/~dsimanek/cutting/ageuniv.htm (now off-line), and https://map.gsfc.nasa.gov/m_uni/uni_101age.html.
data(age.universe)
n <- nrow(age.universe)
x <- 1:n
names(x) = age.universe$year
plot(x, age.universe$upper, ylim=c(0,20))
points(x, age.universe$lower)
with(age.universe, sapply(x, function(i) lines(c(i,i), c(lower[i], upper[i]))))
monthly payment for federal program
data(aid)
The format is: Named num [1:51] 57.2 253.5 114.2 68.2 199.6 ... - attr(*, "names")= chr [1:51] "Alabama" "Alaska" "Arizona" "Arkansas" ...
From Kitchens' Exploring Statistics
data(aid)
hist(aid)
The Alaska pipeline data consists of in-field ultrasonic measurements of the depths of defects in the Alaska pipeline. The depths of the defects were then re-measured in the laboratory. These measurements were performed in six different batches.
data(alaska.pipeline)
A data frame with 107 observations on the following 3 variables.
Depth of defect as measured in field
Depth of defect as measured in lab
One of 6 batches
From an example in Engineering Statistics Handbook from http://www.itl.nist.gov/div898/handbook/
data(alaska.pipeline)
res = lm(lab.defect ~ field.defect, alaska.pipeline)
plot(lab.defect ~ field.defect, alaska.pipeline)
abline(res)
plot(fitted(res), resid(res))
The top 79 all-time movies as of 2003 by domestic (US) gross receipts.
data(alltime.movies)
A data frame with 79 observations on the following 2 variables.
a numeric vector
a numeric vector
The row names are the titles of the movies.
This data was found on http://movieweb.com/movie/alltime.html on June 17, 2003. The source of the data is attributed (partially) to Exhibitor Relations Co.
data(alltime.movies)
hist(alltime.movies$Gross)
Opens pdf file containing answers to selected problems
answers()
Called for its side-effect of opening a pdf
## answers()
A time series of January, February, and March measurements of the annular modes from January 1851 to March 1997.
data(aosat)
The format is: the first column is the date in years, with a fraction indicating the month; the second column is the measurement.
This site http://jisao.washington.edu/ao/ had more details on the importance of this time series.
This data came from the file AO_SATindex_JFM_Jan1851March1997.ascii at http://www.atmos.colostate.edu/ao/Data/ao_index.html
data(aosat)
## Not run:
library(zoo)
z = zoo(aosat[,2], order.by=aosat[,1])
plot(z)
## yearly
plot(aggregate(z, floor(index(z)), mean))
## decade-long means
plot(aggregate(z, 10*floor(index(z)/10), mean))
## End(Not run)
A monthly time series from January 1899 to June 2002 of sea-level pressure measurements relative to some baseline.
data(arctic.oscillations)
The format is: chr "arctic.oscillations"
See https://toptotop.org/ for more details on the importance of climate studies.
The data came from the file AO_TREN_NCEP_Jan1899Current.ascii found many years ago at http://www.atmos.colostate.edu/ao/Data/ao_index.html.
data(arctic.oscillations)
x = ts(arctic.oscillations, start=c(1899,1), frequency=12)
plot(x)
A collection of variables taken for each new mother in a Child and Health Development Study.
data(babies)
A data frame with 1,236 observations on the following 23 variables.
Variables in data file
identification number
5= single fetus
1= live birth that survived at least 28 days
birth date, where 1096 = January 1, 1961
length of gestation in days
infant's sex 1=male 2=female 9=unknown
birth weight in ounces (999 unknown)
total number of previous pregnancies including fetal deaths and still births, 99=unknown
mother's race 0-5=white 6=mex 7=black 8=asian 9=mixed 99=unknown
mother's age in years at termination of pregnancy, 99=unknown
mother's education 0= less than 8th grade, 1= 8th-12th grade - did not graduate, 2= HS graduate–no other schooling, 3= HS+trade, 4= HS+some college, 5= College graduate, 6 & 7 Trade school HS unclear, 9=unknown
mother's height in inches to the last completed inch 99=unknown
mother prepregnancy wt in pounds, 999=unknown
father's race, coding same as mother's race.
father's age, coding same as mother's age.
father's education, coding same as mother's education.
father's height, coding same as for mother's height
father's weight coding same as for mother's weight
1=married, 2= legally separated, 3= divorced, 4=widowed, 5=never married
family yearly income in $2500 increments 0 = under 2500, 1=2500-4999, ..., 8= 12,500-14,999, 9=15000+, 98=unknown, 99=not asked
does mother smoke? 0=never, 1= smokes now, 2=until current pregnancy, 3=once did, not now, 9=unknown
If mother quit, how long ago? 0=never smoked, 1=still smokes, 2=during current preg, 3=within 1 yr, 4= 1 to 2 years ago, 5= 2 to 3 yr ago, 6= 3 to 4 yrs ago, 7=5 to 9yrs ago, 8=10+yrs ago, 9=quit and don't know, 98=unknown, 99=not asked
number of cigs smoked per day for past and current smokers 0=never, 1=1-4,2=5-9, 3=10-14, 4=15-19, 5=20-29, 6=30-39, 7=40-60, 8=60+, 9=smoke but don't know,98=unknown, 99=not asked
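The "unknown" codes above (99, 999, 98, and so on) are easy to mistake for real values. A minimal sketch of recoding a few of them as NA before analysis, assuming the column names wt, age, wt1, and dwt implied by the variable descriptions above and the examples below:
data(babies)
babies$wt[babies$wt == 999] <- NA      # birth weight unknown
babies$age[babies$age == 99] <- NA     # mother's age unknown
babies$wt1[babies$wt1 == 999] <- NA    # mother's prepregnancy weight unknown
babies$dwt[babies$dwt == 999] <- NA    # father's weight unknown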
This dataset is from https://www.stat.berkeley.edu/users/statlabs/labs.html. It accompanies the excellent text Stat Labs: Mathematical Statistics through Applications, Springer-Verlag (2001), by Deborah Nolan and Terry Speed.
data(babies)
plot(wt ~ factor(smoke), data=babies)
plot(wt1 ~ dwt, data=babies, subset=wt1 < 800 & dwt < 800)
The babyboom dataset contains the time of birth, sex, and birth weight for 44 babies born in one 24-hour period at a hospital in Brisbane, Australia.
data(babyboom)
A data frame with 44 observations on the following 4 variables.
Time on clock
a factor with levels girl and boy
weight in grams of child
minutes after midnight of birth
This data set was submitted to the Journal of Statistical Education, https://www.amstat.org/publications/jse/secure/v7n3/datasets.dunn.cfm (now off-line), by Peter K. Dunn.
data(babyboom)
hist(babyboom$wt)
hist(diff(babyboom$running.time))
This dataset contains batting statistics for the 2002 baseball season. The data allows you to compute batting averages, on-base percentages, and other statistics of interest to baseball fans. The data only contains players with more than 100 at-bats for a team in the year. The data is excerpted with permission from the Lahman baseball database at http://www.seanlahman.com/.
data(batting)
A data frame with 438 observations on the following 22 variables.
This is coded, but those familiar with the players should be able to find their favorites.
a numeric vector. Always 2002 in this dataset.
a numeric vector. Player's stint (order of appearances within a season)
a factor with Team
a factor with levels AL and NL
number of games played
number of at bats
number of runs
number of hits
number of doubles. "2B" in original database
number of triples. "3B" in original database
number of home runs
number of runs batted in
number of stolen bases
number of times caught stealing
number of base on balls (walks)
number of strikeouts
number of intentional walks
number of hit by pitches
number of sacrifice hits
number of sacrifice flies
number of grounded into double plays
Baseball fans are “statistics” crazy. They love to talk about things like RBIs, BAs and OBPs. In order to do so, they need the numbers. This data comes from the Lahman baseball database at http://www.seanlahman.com/. The complete dataset includes data for all of baseball not just the year 2002 presented here.
Lahman baseball database, http://www.seanlahman.com/
In addition to the data set above, the book Curve Ball, by Albert, J. and Bennett, J., Copernicus Books, gives an extensive statistical analysis of baseball.
See https://www.baseball-almanac.com/stats.shtml for definitions of common baseball statistics.
data(batting)
attach(batting)
BA = H/AB                                    # batting average
OBP = (H + BB + HBP) / (AB + BB + HBP + SF)  # On base "percentage"
Estimates of the population of a type of Bay Checkerspot butterfly near San Francisco.
data(baycheck)
A data frame with 27 observations on the following 2 variables.
a numeric vector
estimated number
From chapter 4 of Morris and Doak, Quantitative Conservation Biology: Theory and Practice of Population Viability Analysis, Sinauer Associates, 2003.
data(baycheck)
plot(Nt ~ year, baycheck)
## fit Ricker model N_{t+1} = N_t e^{-rt}W_t
n = length(baycheck$year)
yt = with(baycheck, log(Nt[-1]/Nt[-n]))
nt = with(baycheck, Nt[-n])
lm(yt ~ nt, baycheck)
A dataset giving world records in track and field running events for various distances and different age groups.
data(best.times)
A data frame with 113 observations on the following 6 variables.
Distance in meters (42195 is a marathon)
Name of record holder
Date of record
Time in seconds
Time as character
Age at time of record
Age-graded race results allow competitors of different ages to compare their race performances. This data set allows one to see what the relationship is based on peak performances.
The data came from http://www.personal.rdg.ac.uk/~snsgrubb/athletics/agegroups.html which included a calculator to compare results.
data(best.times)
attach(best.times)
by.dist = split(best.times, as.factor(Dist))
lm(scale(Time) ~ age, by.dist[['400']])
dists = names(by.dist)
lapply(dists, function(n) print(lm(scale(Time) ~ age, by.dist[[n]])))
blood pressure of 15 males taken by machine and expert
data(blood)
This data frame contains the following columns:
a numeric vector
a numeric vector
Taken from Kitchens' Exploring Statistics.
data(blood)
attach(blood)
t.test(Machine, Expert)
detach(blood)
The time in minutes for an insulating fluid to break down under varying voltage loads
data(breakdown)
A data frame with 75 observations on the following 2 variables.
Number of kV
time in minutes
An example from industry where a linear model is used with replication and transformation of variables.
Data is from Display 8.3 of Ramsey and Schafer, The Statistical Sleuth, Duxbury Press, 1997.
data(breakdown)
plot(log(time) ~ voltage, data = breakdown)
List of bright stars with Hipparcos catalog number.
data(bright.stars)
A data frame with 96 observations on the following 2 variables.
Common name of star
HIP number for identification
The source of star names goes back to the Greeks and Arabs. Few are modern. This is a list of 96 common stars.
From the Hipparcos website http://astro.estec.esa.nl/Hipparcos/ident6.html.
data(bright.stars)
all.names = paste(bright.stars$name, sep="", collapse="")
x = unlist(strsplit(tolower(all.names), ""))
letter.dist = sapply(letters, function(i) sum(x == i))
data(scrabble)                  # for frequency info
p = scrabble$frequency[1:26]
p = p/sum(p)
chisq.test(letter.dist, p=p)    # compare with English
The Hipparcos Catalogue has information on over 100,000 stars. Listed in this dataset are brightness measurements for 966 stars from a given sector of the sky.
data(brightness)
A univariate dataset of 966 numbers.
This is field H5 in the catalog measuring the magnitude, V , in the Johnson UBV photometric system. The smaller numbers are for brighter stars.
http://astro.estec.esa.nl/hipparcos
data(brightness)
hist(brightness)
bumper repair costs
data(bumpers)
Price in dollars to repair a bumper.
From Exploring Statistics, Duxbury Press, 1998, L. Kitchens.
data(bumpers)
stem(bumpers)
Approval ratings as reported by six different polls.
data(BushApproval)
A data frame with 323 observations on the following 3 variables.
The date poll was begun (some take a few days)
a number between 0 and 100
a factor with levels fox, gallup, newsweek, time.cnn, upenn, and zogby
A data set of approval ratings of George Bush over the time of his presidency, as reported by several agencies. Most polls were of size approximately 1,000 so the margin of error is about 3 percentage points.
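As a quick check on the stated margin of error, the conservative 1/sqrt(n) rule for a poll of about 1,000 respondents gives roughly 3 percentage points:
n <- 1000
1 / sqrt(n)                   # about 0.032, i.e. roughly 3 percentage points
1.96 * sqrt(0.5 * 0.5 / n)    # normal-approximation version, about 0.031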
This data was found at http://www.pollingreport.com/BushJob.htm. The idea came from an article in Salon http://salon.com/opinion/feature/2004/02/09/bush_approval/index.html by James K. Galbraith.
data(BushApproval)
attach(BushApproval)
## Plot data with confidence intervals. Each poll gets a different line type.
## no points at first
plot(strptime(date,"%m/%d/%y"), approval, type="n",
     ylab="Approval Rating", xlab="Date", ylim=c(30,100))
## plot line for CI. Margin of error about 3
## matlines has trouble with dates from strptime()
colors = rainbow(6)
for(i in 1:nrow(BushApproval)) {
  lines(rep(strptime(date[i],"%m/%d/%y"),2),
        c(approval[i]-3, approval[i]+3),
        lty=as.numeric(who[i]),
        col=colors[as.numeric(who[i])])
}
## plot points
points(strptime(date,"%m/%d/%y"), approval, pch=as.numeric(who))
## add legend
legend((2003-1970)*365*24*60*60, 90, legend=as.character(levels(who)), lty=1:6, col=1:6)
detach(BushApproval)
This data set from Hillborn and Mangel contains data on the number of Albatrosses accidentally caught while fishing by commercial fisheries.
data(bycatch)
A data frame with 18 observations on the following 2 variables.
The number of albatross caught
Number of hauls with this many albatross caught
During fishing operations non-target species are often captured. These are called “incidental catch”. In some cases, large-scale observer programs are used to monitor this incidental catch.
When fishing for squid, albatrosses are caught while feeding on the squid at the time of fishing. This feeding is encouraged while the net is being hauled in, as the squid are clustered, making it an opportune time for the albatrosses to eat.
This is from Hilborn and Mangel, The Ecological Detective, Princeton University Press, 1997. Original source of data is Bartle.
data(bycatch)
hauls = with(bycatch, rep(no.albatross, no.hauls))
Estimated savings from a repeal of the tax on capital gains and dividends for Bush's cabinet members.
data(cabinet)
A data frame with 19 observations on the following 4 variables.
Name of individual
Position of individual
Estimated amount of dividend and capital gain income
Estimated tax savings
Quoting from the data source http://www.house.gov/reform/min/pdfs_108/pdf_inves/pdf_admin_tax_law_cabinet_june_3_rep.pdf (From Henry Waxman, congressional watchdog.)
“On May 22, 2003, the House of Representatives and the Senate passed tax legislation that included $320 billion in tax cuts. The final tax cut bill was signed into law by President Bush on May 28, 2003. The largest component of the new tax law is the reduction of tax rates on both capital gains and dividend income. The law also includes the acceleration of future tax cuts, as well as new tax reductions for businesses.
This capital gains and dividend tax cut will have virtually no impact on the average American. The vast majority of Americans (88 percent) report no capital gains on their tax returns. These taxpayers will receive no tax savings at all from the reduction in taxes on capital gains. Similarly, most Americans (75 percent) will receive no benefit from the reduction of taxes on dividends.
While the average American will derive little, if any, benefit from the cuts in dividend and capital gains taxes, the law offers significant benefits to the wealthy. For example, the top 1 percent will receive an average tax cut of almost $21,000 each. In particular, some of the major beneficiaries of this plan will be Vice President Cheney, President Bush, and other members of the cabinet. Based on 2001 and 2002 dividends and capital gains income, Vice President Cheney, President Bush, and the cabinet are estimated to receive an average tax cut of at least $42,000 per year. Their average tax savings equals the median household income in the United States.”
From http://www.house.gov/reform/min/pdfs_108/pdf_inves/pdf_admin_tax_law_cabinet_june_3_rep.pdf
data(cabinet)
attach(cabinet)
median(est.dividend.cg)
mean(est.dividend.cg)
detach(cabinet)
Contains annual tree-ring measurements from Mount Campito from 3426 BC through 1969 AD.
data(camp)
A univariate time series with 5405 observations. The object is of class '"ts"'.
This series is a standard example for the concept of long memory time series.
The data was produced and assembled at the Tree Ring Laboratory at the University of Arizona, Tucson.
Time Series Data Library: https://robjhyndman.com/TSDL/
This data set is in the tseries package. It is repackaged here for convenience only.
data(camp)
acf(camp)
cancer survival times
data(cancer)
The format is: a list of 5 numeric components: stomach, bronchus, colon, ovary and breast
Taken from L. Kitchens, Exploring Statistics, Duxbury Press, 1997.
data(cancer)
boxplot(cancer)
Carbon Monoxide levels at different sites
data(carbon)
This data frame contains the following columns:
a numeric vector
a numeric vector
Borrowed from Kitchens' Exploring Statistics
data(carbon)
boxplot(Monoxide ~ Site, data=carbon)
Safety statistics appearing in a January 12th, 2004 issue of the New Yorker showing fatality rates per million vehicles both for drivers of a car, and drivers of other cars that are hit.
data(carsafety)
A data frame with 33 observations on the following 4 variables.
The make and model of the car
Type of car
Number of drivers deaths per year if 1,000,000 cars were on the road
Number of deaths in other vehicle caused by accidents involving these cars per year if 1,000,000 cars were on the road
The article this data came from wishes to make the case that SUVs are not safer despite a perception among the U.S. public that they are.
From "Big and Bad" by Malcolm Gladwell. New Yorker, Jan. 12 2004 pp28-33. Data attributed to Tom Wenzel and Marc Ross who have written https://www2.lbl.gov/Science-Articles/Archive/assets/images/2002/Aug-26-2002/SUV-report.pdf.
data(carsafety)
plot(Driver.deaths + Other.deaths ~ type, data = carsafety)
A listing of various weather measurements made at Central Park in New York City during the month of May 2003.
data(central.park)
A data frame with 31 observations on the following 19 variables.
the day
maximum temperature (temperatures in Fahrenheit)
minimum temperature
average temperature
departure from normal
heating degree days
cooling degree days
Water fall (precipitation). A factor, as "T" indicates a trace.
Amount of snowfall
Depth of snow
Average wind speed
Max wind speed
2 minimum direction
Sunshine measurement: a factor with two levels, 0 and M
Sunshine measurement: a factor with levels 0 and M
Sunshine measurement. 0-3 = Clear, 4-7 partly cloudy, 8-10 is cloudy
(This is not as documented in the data source. Ignore this variable. It should be: 1 = FOG, 2 = FOG REDUCING VISIBILITY TO 1/4 MILE OR LESS, 3 = THUNDER, 4 = ICE PELLETS, 5 = HAIL, 6 = GLAZE OR RIME, 7 = BLOWING DUST OR SAND: VSBY 1/2 MILE OR LESS, 8 = SMOKE OR HAZE, 9 = BLOWING SNOW, X = TORNADO)
peak wind speed
direction of peak wind
This dataset summarizes the weather in New York City during the merry month of May 2003. This data set comes from the daily climate report issued by the National Weather Service Office.
This data was published on https://www.noaa.gov
data(central.park)
attach(central.park)
barplot(rbind(MIN, MAX-MIN), ylim=c(0,80))
The type of day in May 2003 in Central Park, NY
data(central.park.cloud)
A factor with levels clear, partly.cloudy, and cloudy.
This type of data, and much more, is available from https://www.noaa.gov.
data(central.park.cloud)
table(central.park.cloud)
Data on top 200 CEO compensations in the year 2013
data(ceo2013)
a data frame.
data(ceo2013)
A bootstrap sample from the “Survey of Consumer Finances”.
data(cfb)
A data frame with 1000 observations on the following 14 variables.
Weights to compensate for undersampling. Not applicable
Age of participants
Education level (number of years) of participant
Income in year 2001 of participant
Amount in checking account for participant
Amount in savings accounts
Total directly-held mutual funds
Amount held in stocks
Total financial assets
Value of all vehicles (includes autos, motor homes, RVs, airplanes, boats)
Total home equity
Other financial assets
Total debt
Total net worth
The SCF dataset is a comprehensive survey of consumer finances sponsored by the United States Federal Reserve, https://www.federalreserve.gov/pubs/oss/oss2/2001/scf2001home.html.
The data is oversampled to compensate for low response in the upper brackets. To compensate, weights are assigned. By bootstrapping the data with the weights, we get a “better” version of a random sample from the population.
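A minimal sketch of the weighted-bootstrap construction described above, shown for a hypothetical data frame dat with an income measurement and a survey-weight column w (cfb itself is the result of such a resampling, not its input):
## hypothetical data with an income measurement and survey weights
dat <- data.frame(income = rexp(200, rate = 1/40000), w = runif(200, 1, 10))
n <- nrow(dat)
## resample rows with probability proportional to the weights
idx <- sample(1:n, n, replace = TRUE, prob = dat$w / sum(dat$w))
dat.boot <- dat[idx, ]
mean(dat.boot$income)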
https://www.federalreserve.gov/pubs/oss/oss2/2001/scf2001home.html
data(cfb)
attach(cfb)
mean(INCOME)
weight gain of chickens fed 3 different rations
data(chicken)
This data frame contains the following columns:
a numeric vector
a numeric vector
a numeric vector
From Kitchens' Exploring Statistics.
data(chicken)
boxplot(chicken)
The chips data frame has 30 rows and 8 columns.
data(chips)
This data frame contains the following columns:
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
From Kitchens' Exploring Statistics
data(chips)
boxplot(chips)
Carbon Dioxide Emissions from the U.S.A. from fossil fuel
data(co2emiss)
The format is: Time-Series [1:276] from 1981 to 2004: -30.5 -30.4 -30.3 -29.8 -29.6 ...
Monthly estimates of 13C/12C in fossil-fuel CO2 emissions. Originally at http://cdiac.esd.ornl.gov/trends/emis_mon/emis_mon_co2.html; now off-line.
At one time: "An annual cycle, peaking during the winter months and reflecting natural gas consumption, and a semi-annual cycle of lesser amplitude, peaking in summer and winter and reflecting coal consumption, comprise the dominant features of the annual pattern. The relatively constant emissions until 1987, followed by an increase from 1987-1989, a decrease in 1990-1991 and record highs during the late 1990s, are also evident in the annual data of Marland et al. However, emissions have declined somewhat since 2000."
http://cdiac.esd.ornl.gov/ftp/trends/emis_mon/emis_mon_c13.dat (off-line)
data(co2emiss)
monthplot(co2emiss)
stl(co2emiss, s.window="periodic")
The coins in author's change bin with year and value.
data(coins)
A data frame with 371 observations on the following 2 variables.
Year of coin
Value of coin: quarter, dime, nickel, or penny
data(coins)
years = cut(coins$year, seq(1920, 2010, by=10), include.lowest=TRUE,
            labels = paste(192:200, "*", sep=""))
table(years)
Recordings of daily minimum temperature in Woodstock Vermont from January 1 1980 through 1985.
data(coldvermont)
A ts object with daily frequency
Extracted from http://www.ce.washington.edu/pub/HYDRO/edm/met_thru_97/vttmin.dly.gz. Errors were possibly introduced.
data(coldvermont)
plot(coldvermont)
Simple means to output a confidence interval for an htest object.
## S3 method for class 'htest'
confint(object, parm, level, ...)
object: an object of class htest
parm: ignored
level: ignored
...: a function to transform the interval may be passed in
No return value, outputs interval through cat.
confint(t.test(rnorm(10)))
Comparison of corn for new and standard variety
data(corn)
This data frame contains the following columns:
a numeric vector
a numeric vector
From Kitchens' Exploring Statistics
data(corn)
t.test(corn)
crime rates for 50 states in 1983 and 1993
data(crime)
This data frame contains the following columns:
a numeric vector
a numeric vector
from Kitchens' Exploring Statistics
data(crime)
boxplot(crime)
t.test(crime[,1], crime[,2], paired=TRUE)
The data collected in a calibration experiment consisting of a known load, applied to the load cell, and the corresponding deflection of the cell from its nominal position.
data(deflection)
A data frame with 40 observations on the following 2 variables.
a numeric vector
a numeric vector
From an example in Engineering Statistics Handbook from http://www.itl.nist.gov/div898/handbook/
data(deflection)
res = lm(Deflection ~ Load, data = deflection)
plot(Deflection ~ Load, data = deflection)
abline(res)   # looks good?
plot(res)
Provides a menu to open one of the provided demonstrations which use shiny for animation.
demos()
User must have installed shiny prior to usage. As shiny has some dependencies that don't always work, it is not made a dependency of UsingR.
No return value, when called a web page opens. Use Ctrl-C (or equivalent) in terminal to return to an interactive session.
## demos()
Allows one to compare empirical densities of different distributions in a simple manner. Densities are used, as graphs with multiple histograms are too crowded. The usage is similar to side-by-side boxplots.
DensityPlot(x, ...)
x: may be a sequence of data vectors (e.g. x, y, z), a data frame with numeric column vectors, or a model formula
...: you can pass in a bandwidth argument such as bw="SJ"; see density for details. A legend will be placed for you automatically. To override the positioning set do.legend="manual". To skip the legend, set do.legend=FALSE.
Makes a plot
John Verzani
Basically a modified boxplot function. As well it should be, since it serves the same purpose: comparing distributions.
## taken from boxplot
## using a formula
data(InsectSprays)
DensityPlot(count ~ spray, data = InsectSprays)
## on a matrix (data frame)
mat <- cbind(Uni05 = (1:100)/21, Norm = rnorm(100),
             T5 = rt(100, df = 5), Gam2 = rgamma(100, shape = 2))
DensityPlot(data.frame(mat))
A data set on 48 diamond rings containing price in Singapore dollars and size of diamond in carats.
data(diamond)
A data frame with 48 observations on the following 2 variables.
A measurement of a diamond's size
Price in Singapore dollars
This data comes from a collection of the Journal of Statistics Education. The accompanying documentation says:
“Data presented in a newspaper advertisement suggest the use of simple linear regression to relate the prices of diamond rings to the weights of their diamond stones. The intercept of the resulting regression line is negative and significantly different from zero. This finding raises questions about an assumed pricing mechanism and motivates consideration of remedial actions.”
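A minimal sketch of the regression described in the quote, using the price and carat columns from the example below; the intercept estimate should come out negative:
data(diamond)
res <- lm(price ~ carat, data = diamond)
coef(summary(res))    # check the sign and significance of the intercept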
This comes from http://jse.amstat.org/datasets/diamond.txt. Data set is contributed by Singfat Chu.
data(diamond)
plot(price ~ carat, diamond, pch=5)
The divorce data frame has 25 rows and 6 columns.
data(divorce)
This data frame contains the following columns:
a factor
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
Forgot source
data(divorce)
apply(divorce[,2:6], 2, sum)   # percent divorced by age of marriage
A variant of stripchart using big dots as the default.
DOTplot(x, ...)
x: may be a vector, data frame, matrix (each column a variable), list or model formula. Treats each variable or group as a univariate dataset and makes the corresponding DOTplot.
...: additional arguments passed on
Returns the graphic only.
John Verzani
See also stripchart, dotplot.
x = c(1,1,2,3,5,8)
DOTplot(x, main="Fibonacci", cex=2)
A set of points to make a dot-to-dot puzzle
data(dottodot)
A data frame with 49 observations on the following 4 variables.
x position
y position
where to put label
number for label
Points to make a dot-to-dot puzzle to illustrate text, points, and the argument pos.
Illustration by Noah Verzani.
data(dottodot)
# make a blank graph
plot(y ~ x, data=dottodot, type="n", bty="n", xaxt="n", xlab="", yaxt="n", ylab="")
# add the points
points(y ~ x, data=dottodot)
# add the labels using pos argument
with(dottodot, text(x, y, labels=ind, pos=pos))
# solve the puzzle
lines(y ~ x, data=dottodot)
The dowdata data frame has 443 rows and 5 columns.
data(dowdata)
This data frame contains the following columns:
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
this data comes from the site http://www.forecasts.org/
data(dowdata)
the.close <- dowdata$Close
n <- length(the.close)
plot(log(the.close[2:n]/the.close[1:(n-1)]))
Monthly DVD player sales since introduction of DVD format to May 2004
data(dvdsales)
Matrix with rows recording the year, and columns the month.
Original data retrieved from http://www.thedigitalbits.com/articles/cemadvdsales.html
data(dvdsales)
barplot(t(dvdsales[7:1,]), beside=TRUE)
The emissions data frame has 26 rows and 3 columns.
A data set listing GDP, GDP per capita, and CO2 emissions for 1999.
data(emissions)
This data frame contains the following columns:
a numeric vector
a numeric vector
a numeric vector
http://www.grida.no for CO2 data and http://www.mrdowling.com for GDP data.
Prompted by a plot appearing in a June 2001 issue of the New York Times.
data(emissions)
plot(emissions)
The ewr data frame has 46 rows and 11 columns. Gives taxi in and taxi out times for 8 different airlines and several months at EWR airport.
Airline codes are AA (American Airlines), AQ (Aloha Airlines), AS (Alaska Airlines), CO (Continental Airlines), DL (Delta Airlines), HP (America West Airlines), NW (Northwest Airlines), TW (Trans World Airlines), UA (United Airlines), US (US Airways), and WN (Southwest Airlines).
data(ewr)
This data frame contains the following columns:
a numeric vector
a factor for months
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a factor with levels in or out
Retrieved from http://www.bts.gov/oai/taxitime/html/ewrtaxi.html
data(ewr)
boxplot(ewr[3:10])
Direct compensation for 199 United States CEOs in the year 2000 in units of $10,000.
data(exec.pay)
A numeric vector with 199 entries each measuring compensation in 10,000s of dollars.
New York Times Business section 04/01/2001. See also https://aflcio.org.
data(exec.pay)
hist(exec.pay)
A data set containing many physical measurements of 252 males. Most of the variables can be measured with a scale or tape measure. Can they be used to predict the percentage of body fat? If so, this offers an easy alternative to an underwater weighing technique.
data(fat)
A data frame with 252 observations on the following 19 variables.
Case Number
Percent body fat using Brozek's equation, 457/Density - 414.2
Percent body fat using Siri's equation, 495/Density - 450
Density (gm/cm2)
Age (yrs)
Weight (lbs)
Height (inches)
Adiposity index = Weight/Height^2 (kg/m^2)
Fat Free Weight = (1 - fraction of body fat) * Weight, using Brozek's formula (lbs)
Neck circumference (cm)
Chest circumference (cm)
Abdomen circumference (cm) "at the umbilicus and level with the iliac crest"
Hip circumference (cm)
Thigh circumference (cm)
Knee circumference (cm)
Ankle circumference (cm)
Extended biceps circumference (cm)
Forearm circumference (cm)
Wrist circumference (cm) "distal to the styloid processes"
From the source:
“The data are as received from Dr. Fisher. Note, however, that there are a few errors. The body densities for cases 48, 76, and 96, for instance, each seem to have one digit in error as can be seen from the two body fat percentage values. Also note the presence of a man (case 42) over 200 pounds in weight who is less than 3 feet tall (the height should presumably be 69.5 inches, not 29.5 inches)! The percent body fat estimates are truncated to zero when negative (case 182).”
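A short sketch of checking the issues noted in the quote, assuming the density column is named density (body.fat and height appear in the example below):
data(fat)
## compare the recorded Brozek body fat with the formula 457/Density - 414.2
with(fat, summary(body.fat - (457/density - 414.2)))
## flag the implausible height mentioned above
subset(fat, height < 40)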
This data set comes from the collection of the Journal of Statistics Education at http://jse.amstat.org/datasets/fat.txt. The data set was contributed by Roger W. Johnson.
The source of the data is attributed to Dr. A. Garth Fisher, Human Performance Research Center, Brigham Young University, Provo, Utah 84602.
data(fat)
f = body.fat ~ age + weight + height + BMI + neck + chest + abdomen +
    hip + thigh + knee + ankle + bicep + forearm + wrist
res = lm(f, data=fat)
summary(res)
1078 measurements of a father's height and his son's height.
data(father.son)
A data frame with 1078 observations on the following 2 variables.
Father's height in inches
Son's height in inches
Data set used by Pearson to investigate regression. See the galton data set for the data used by Galton.
Read into R by the command read.table("http://stat-www.berkeley.edu/users/juliab/141C/pearson.dat", sep=" ")[,-1], as mentioned by Chuck Cleland on the r-help mailing list.
data(father.son)
## like cover of Freedman, Pisani, and Purves
plot(sheight ~ fheight, data=father.son, bty="l", pch=20)
abline(a=0, b=1, lty=2, lwd=2)
abline(lm(sheight ~ fheight, data=father.son), lty=1, lwd=2)
A data set containing incomes for 1,000 females along with race information. The data is sampled from data provided by the United States Census Bureau.
data(female.inc)
A data frame with 1,000 observations on the following 2 variables.
Income for 2001 in dollars
a factor with levels black, hispanic, or white
The United States Census Bureau provides a lot of data on income distributions. This data comes from the Current Population Survey (CPS) for the year 2001. The raw data appears in table format. This data is sampled from the data in that table.
The original table was found at http://ferret.bls.census.gov/macro/032002/perinc/new11_002.htm
data(female.inc)
boxplot(income ~ race, female.inc)
boxplot(log(income,10) ~ race, female.inc)
sapply(with(female.inc, split(income, race)), median)
Age of mother at birth of first child
data(firstchi)
The format is: num [1:87] 30 18 35 22 23 22 36 24 23 28 ...
From Exploring Statistics, L. Kitchens, Duxbury Press, 1998.
data(firstchi)
hist(firstchi)
Five years of maximum temperatures in New York City
data(five.yr.temperature)
A data frame with 2,439 observations on the following 3 variables.
Which day of the year
The year
Maximum temperature
Dataset found on the internet, but original source is lost.
data(five.yr.temperature)
attach(five.yr.temperature)
scatter.smooth(temps ~ days, col=gray(.75))
lines(smooth.spline(temps ~ days), lty=2)
lines(supsmu(days, temps), lty=3)
The florida data frame has 67 rows and 13 columns.
Gives a county by county accounting of the US elections in the state of Florida.
data(florida)
This data frame contains the following columns:
Name of county
Votes for Gore
Votes for Bush
Votes for Buchanan
Votes for Nader
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
Found in the excellent notes Using R for Data Analysis and Graphics by John Maindonald. (As of 2003 a book published by Cambridge University Press.)
data(florida)
attach(florida)
result.lm <- lm(BUCHANAN ~ BUSH)
plot(BUSH, BUCHANAN)
abline(result.lm)
## can you find Palm Beach and Miami Dade counties?
Data recorded by Galileo in 1609 during his investigations of the trajectory of a falling body.
data(galileo)
A data frame with 7 observations on the following 2 variables.
Initial height of ball
Horizontal distance traveled
A simple ramp 500 punti above the ground was constructed. A ball was placed on the ramp at an indicated height from the ground and released. The horizontal distance traveled is recorded (in punti). (One punto is 169/180 millimeter, not a car by FIAT.)
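A small helper for the unit conversion mentioned above (one punto is 169/180 millimeter):
punti_to_mm <- function(p) p * 169/180
punti_to_mm(500)    # the 500-punti ramp height in millimeters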
This data and example come from the Statistical Sleuth by Ramsey and Schafer, Duxbury (2001), section 10.1.1. They attribute an article in Scientific American by Drake and MacLachlan.
data(galileo)
polynomial = function(x, coefs) {
  sum = 0
  for(i in 0:(length(coefs)-1)) {
    sum = sum + coefs[i+1]*x^i
  }
  sum
}
res.lm = lm(h.d ~ init.h, data = galileo)
res.lm2 = update(res.lm, . ~ . + I(init.h^2), data=galileo)
res.lm3 = update(res.lm2, . ~ . + I(init.h^3), data=galileo)
plot(h.d ~ init.h, data = galileo)
curve(polynomial(x, coef(res.lm)), add=TRUE)
curve(polynomial(x, coef(res.lm2)), add=TRUE)
curve(polynomial(x, coef(res.lm3)), add=TRUE)
Data set from the tabulated data used by Galton in 1885 to study the relationship between a parent's height and their children's.
data(galton)
A data frame with 928 observations on the following 2 variables.
The child's height
The “midparent” height
The midparent's height is an average of the father's height and 1.08 times the mother's. In the data there are 205 different parents and 928 children. The data here is truncated at the ends for both parents and children so that it can be treated as numeric data. The data were tabulated and consequently made discrete. The father.son data set is similar data used by Galton and is continuous.
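A short sketch fitting the regression this data is famous for; a slope noticeably less than 1 is the classic illustration of regression toward the mean (child and parent are the columns used in the example below):
data(galton)
res <- lm(child ~ parent, data = galton)
coef(res)    # slope less than 1: regression toward the mean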
This data was found at http://www.bun.kyoto-u.ac.jp/~suchii/galton86.html.
See also the data.set father.son which was found from http://stat-www.berkeley.edu/users/juliab/141C/pearson.dat.
data(galton)
plot(galton)
## or with some jitter.
plot(jitter(child,5) ~ jitter(parent,5), galton)
## sunflowerplot shows flowers for multiple plots (Thanks MM)
sunflowerplot(galton)
Sales data for the Gap from Jan
data(gap)
The format is a ts object storing data from June 2002 through June 2005.
http://home.businesswire.com
data(gap)
monthplot(gap)
Average retail gasoline prices per month in the United States from January 2000 through February 2006. Hurricane Katrina caused a loss of refinery capacity, leading to rapidly increasing prices.
data(gasprices)
The format is: Time-Series [1:74] from 2000 to 2006: 129 138 152 146 148 ...
Originally from the Department of Energy web site: https://www.eia.gov/petroleum/gasdiesel/
data(gasprices)
plot(gasprices)
Returns answers for the first edition.
getAnswer(chapter = NULL, problem = NULL)
chapter: which chapter
problem: which problem
Opens a web page with the answer.
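A commented example call, mirroring the signature above (the chapter and problem numbers are placeholders):
## getAnswer(chapter = 1, problem = 1)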
Goals per game in NHL
data(goalspergame)
The format is: mts [1:53, 1:4] 6 6 6 6 6 6 6 6 6 6 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:4] "n.teams" "n.games" "n.goals" "gpg"
 - attr(*, "tsp")= num [1:3] 1946 1998 1
 - attr(*, "class")= chr [1:2] "mts" "ts"
Off internet site. Forgot which.
data(goalspergame)
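A minimal plotting sketch, assuming the column names shown in the format above (in particular gpg, goals per game):
data(goalspergame)
plot(goalspergame[, "gpg"], ylab="goals per game")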
Closing stock price of a share of Google stock during 2005-02-07 to 2005-07-07
data(google)
A data vector of numeric values with names attribute giving the dates.
finance.yahoo.com
data(google)
plot(google, type="l")
A data frame of students' grades in a class along with their grades in the previous class in the subject. Graded on the American A-F scale.
data(grades)
A data frame of 122 rows with 2 columns
The grade in the previous class in the subject matter
The grade in the current class
data(grades)
table(grades)
Simulated data set investigating effects of cross-country ski-pole grip.
data(grip)
A data frame with 36 observations on the following 4 variables.
Measurement of upper-body power
One of four skiers
Either classic, modern, or integrated.
a numeric vector
Based on a study originally described at http://www.montana.edu/wwwhhd/movementscilab/ and mentioned on http://www.xcskiworld.com/. The study investigated the effect of grip type on upper body power. As this influences performance in races, presumably a skier would prefer the grip that provides the best power output.
data(grip)
ftable(xtabs(UBP ~ person + replicate + grip.type, grip))
A data frame containing baseball statistics for several players.
data(hall.fame)
A data frame with 1340 observations on the following 28 variables.
first name
last name
Seasons played
Games played
Official At Bats
Runs scored
hits
doubles
triples
Home runs
Runs batted in
Base on balls
Strike outs
Batting Average
On Base percentage
Slugging Percentage
Adjusted productions
batting runs
adjusted batting runs
Runs created
Stolen Bases
Caught stealing
Runs scored by stealing
Fielding average
Fielding runs
C = Catcher, 1 = First Base, 2 = Second Base, 3 = Third Base, S = Shortstop, O = Outfield, and D = Designated hitter
a numeric vector
Not a member, Elected by the BBWAA, or Chosen by the Old Timers Committee or Veterans Committee
The sport of baseball lends itself to the collection of data. This data set contains many variables used to assess a players career. The Hall of Fame is reserved for outstanding players as judged initially by the Baseball Writers Association and subsequently by the Veterans Committee.
This data set was submitted to the Journal of Statistical Education, https://www.amstat.org/publications/jse/secure/v8n2/datasets.cochran.new.cfm (now off-line), by James J. Cochran.
data(hall.fame)
hist(hall.fame$OBP)
with(hall.fame, last[Hall.Fame.Membership != "not a member"])
helper function to shorten display of a data frame
headtail(x, k = 3)
x: a data frame
k: number of rows at top and bottom to show
No return value. Uses cat to show data.
headtail(mtcars)
Data on whether a patient is healthy with two covariates.
data(healthy)
A data frame with 32 observations on the following 3 variables.
One covariate
Another covariate
0 is healthy, 1 is not
Data on health with information from two unspecified covariates.
data(healthy)
library(MASS)
stepAIC(glm(healthy ~ p + g, healthy, family=binomial))
Simulated data of age vs. max heart rate
data(heartrate)
This data frame contains the following columns:
a numeric vector
a numeric vector
Does this fit the workout room value of 220 - age?
Simulated based on "Age-predicted maximal heart rate revisited" by Hirofumi Tanaka, Kevin D. Monahan, and Douglas R. Seals, Journal of the American College of Cardiology, 37:1:153-156.
data(heartrate)
plot(heartrate)
abline(lm(maxrate ~ age, data=heartrate))
The home data frame has 15 rows and 2 columns.
data(home)
This data frame contains the following columns:
a numeric vector
a numeric vector
See full dataset homedata
data(home)
## compare on the same scale
boxplot(data.frame(scale(home)))
The homedata data frame has 6841 rows and 2 columns.
Data set containing assessed values of homes in Maplewood NJ for the years 1970 and 2000. The properties were not officially assessed during that time and it is interesting to see the change in percentage appreciation.
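A sketch of the percentage appreciation mentioned above, assuming the two columns hold the 1970 and 2000 assessed values and are named y1970 and y2000:
data(homedata)
pct.gain <- with(homedata, 100 * (y2000 - y1970) / y1970)
hist(pct.gain, main="Percent appreciation, 1970 to 2000", xlab="percent")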
data(homedata)
This data frame contains the following columns:
a numeric vector
a numeric vector
Maplewood Reval
data(homedata)
plot(homedata)
The homeprice data frame has 29 rows and 7 columns.
data(homeprice)
This data frame contains the following columns:
list price of home (in thousands)
actual sale price
Number of full bathrooms
number of half bathrooms
number of bedrooms
total number of rooms
Subjective assessment of neighborhood on scale of 1-5
This dataset is a random sampling of the homes sold in Maplewood, NJ during the year 2001. Of course the prices will either seem incredibly high or fantastically cheap depending on where you live, and if you have recently purchased a home.
Source Burgdorff Realty.
data(homeprice)
plot(homeprice$sale, homeprice$list)
abline(lm(homeprice$list ~ homeprice$sale))
Homework averages for Private and Public schools
data(homework)
This data frame contains the following columns:
a numeric vector
a numeric vector
This is from Kitchens' Exploring Statistics
data(homework)
boxplot(homework)
Gives monthly delivery numbers for new HUMMER vehicles from June 2003 through February 2006. During July, August, and September 2005 there was an Employee Pricing Incentive.
data(HUMMER)
The format is: Time-Series [1:33] from 2003 to 2006: 2493 2654 2987 2837 3157 2837 3157 1927 2141 2334 ...
Compiled from delivery data available at http://www.gm.com/company/investor_information/sales_prod/hist_sales.html
data(HUMMER)
plot(HUMMER)
Top percentiles of U.S. income
data(income_percentiles)
A data frame with Year and various percentile columns (90th, 95th, ...)
Not available
data(income_percentiles)
simulated IQ scores
data(iq)
The format is: num [1:100] 72 75 77 77 81 82 83 84 84 86 ...
From Kitchens' Exploring Statistics
data(iq)
qqnorm(iq)
A sample from the data presented in the NHANES III survey (https://www.cdc.gov/nchs/nhanes.htm). This survey is used to form the CDC Growth Charts (https://www.cdc.gov/growthcharts/) for children.
data(kid.weights)
A data frame with 250 observations on the following 4 variables.
Age in months
weight in pounds
height in inches
Male or Female
This data is extracted from the NHANES III survey: https://www.cdc.gov/nchs/nhanes.htm.
data(kid.weights)
attach(kid.weights)
plot(weight, height, pch=as.character(gender))
## find the BMI -- body mass index
m.ht = height*2.54/100    # 2.54 cm per inch
m.wt = weight / 2.2046    # 2.2046 lbs. per kg
bmi = m.wt/m.ht^2
hist(bmi)
Data on car drivers killed, car drivers killed or seriously injured (KSI), and light goods drivers killed during the years 1969 to 1984 in Great Britain. In February 1983 a compulsory seat belt law was introduced.
data(KSI)
The data is stored as a multi-variate zoo object.
Data copied from Appendix 2 of "Forecasting, Structural Time Series Models and the Kalman Filter" by Andrew Harvey. The lg.k data is also found in the vandrivers dataset contained in the sspir package.
Source: HMSO: Road Accidents in Great Britain 1984.
data(KSI)
plot(KSI)
seatbelt = time(KSI) < 1983 + (2-1)/12
Toss a coin 100 times and keep a running count of the number of heads and the number of tails. Record the times when the number is tied and report the last one. The distribution will have an approximate “arc-sine” law or well-shaped distribution.
data(last.tie)
200 numbers between 0 and 100 indicating when the last tie was.
This data comes from simulating the commands x = cumsum(sample(c(-1,1), 100, replace=TRUE)) and then finding the last tie with last.tie[i] <- max(0, max(which(!sign(x) == sign(x[length(x)])))).
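A runnable version of that simulation (a sketch; here ties are located directly with x == 0 rather than through the sign comparison above):
last.tie.sim <- replicate(200, {
  x <- cumsum(sample(c(-1, 1), 100, replace = TRUE))
  ties <- which(x == 0)               # times when heads and tails are tied
  if (length(ties) > 0) max(ties) else 0
})
hist(last.tie.sim)                    # compare with hist(last.tie)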
data(last.tie)
hist(last.tie)
A simulated dataset on the settlement amount of 250 lawsuits based on values reported by Class Action Reports.
data(lawsuits)
The format is: num [1:250] 16763 10489 17693 14268 442 ...
Class Action Reports completed an extensive survey of attorney fee awards from 1,120 common fund class actions (Volume 24, No. 2, March/April 2003). The full data set is available for a fee. This data is simulated from the values published in an excerpt.
Original data from http://www.classactionreports.com/classactionreports/attorneyfee.htm
See also "Study Disputes View of Costly Surge in Class-Action Suits" by Jonathan D. Glater in the January 14, 2004 New York Times which cites a Jan. 2004 paper in the Journal of Empirical Legal Studies by Eisenberg and Miller.
data(lawsuits)
mean(lawsuits)
median(lawsuits)
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
lorem
a character string
table(unlist(strsplit(lorem, "")))
malpractice settlements
data(malpract)
The format is: num [1:17] 760 380 125 250 2800 450 100 150 2000 180 ...
From Kitchens' Exploring Statistics
data(malpract)
boxplot(malpract)
A bag of the candy M and M's has many different colors. Each large production batch is blended to the ratios given in this data set. The batches are thoroughly mixed and then the individual packages are filled by weight using high-speed equipment, not by count.
data(mandms)
A data frame with 5 observations on the following 6 variables.
percentage of blue
percentage of brown
percentage of green
percentage of orange
percentage of red
percentage of yellow
This data is attributed to an email sent by Masterfoods USA, A Mars, Incorporated Company. This email was archived at the Math Forum, http://www.mathforum.org (now off-line).
data(mandms)
bagfull = c(15,34,7,19,29,24)
names(bagfull) = c("blue","brown","green","orange","red","yellow")
prop = function(x) x/sum(x)
chisq.test(bagfull, p = prop(mandms["milk chocolate",]))
chisq.test(bagfull, p = prop(mandms["Peanut",]))
Standardized math scores
data(math)
data(math)
The format is: num [1:30] 44 49 62 45 51 59 57 55 70 64 ...
From Larry Kitchens, Exploring Statistics, Duxbury Press.
data(math) hist(math)
data(math) hist(math)
A data set of both the Dow Jones industrial average and the maximum daily temperature in New York City for May 2003.
data(maydow)
data(maydow)
A data frame with 21 observations on the following 3 variables.
Day of the month
The daily close of the DJIA
Daily maximum temperature in Central Park
Are stock traders influenced by the weather? This dataset looks briefly at this question by comparing the daily close of the Dow Jones industrial average with the maximum daily temperature for the month of May 2003. This month was rainy and unseasonably cool, yet the DJIA did well.
The DJIA data was taken from https://finance.yahoo.com the temperature data from https://www.noaa.gov.
data(maydow) attach(maydow) plot(max.temp,DJA) plot(max.temp[-1],diff(DJA))
data(maydow) attach(maydow) plot(max.temp,DJA) plot(max.temp[-1],diff(DJA))
Sample from "Medicare Provider Charge Data"
data(Medicare)
data(Medicare)
A data frame with 10000 observations containing data on billings for procedures at many different hospitals.
http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/index.html
This data came from http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/index and was referenced in the article https://www.nytimes.com/2013/05/08/business/hospital-billing-varies-wildly-us-data-shows.html, as retrieved on 5/8/2013.
data(Medicare)
data(Medicare)
New and used prices of three popular mid-sized cars.
data(midsize)
data(midsize)
A data frame with 15 observations on the following 4 variables.
The year; 2004 is the new-car price, other years are used-car values
Honda Accord
Toyota Camry
Ford Taurus
The value of a car depreciates over time. This data gives the price of a new car and values of similar models for previous years as reported by https://www.edmunds.com.
data(midsize) plot(Accord ~ I(2004-Year), data = midsize)
data(midsize) plot(Accord ~ I(2004-Year), data = midsize)
Data on home-game attendance in Major League Baseball for the years 1969-2000.
data(MLBattend)
data(MLBattend)
A data frame with 838 observations on the following 10 variables.
Which team
American or National league
Which division
The year (the year 2000 is recorded as 0)
Actual attendance
Runs scored by the team during year
Runs allowed by the team during year
Number of wins for season
Number of losses for season
A measure of how far from division winner the team was. Higher numbers are worse.
This data was submitted to The Journal of Statistical Education by James J. Cochran, http://jse.amstat.org/v10n2/datasets.cochran.html.
data(MLBattend) boxplot(attendance ~ franchise, MLBattend) with(MLBattend, cor(attendance,wins))
data(MLBattend) boxplot(attendance ~ franchise, MLBattend) with(MLBattend, cor(attendance,wins))
Movie data for 2011 by weekend
data(movie_data_2011)
data(movie_data_2011)
A data frame with variables Previous (previous weekend rank), Movie (title), Distributor, Genre, Gross (per current weekend), Change (change from previous week), Theaters (number of theaters), TotalGross (total gross to date), Days (days out), and weekend (weekend of report).
Scraped from pages such as https://www.the-numbers.com/box-office-chart/weekend/2011/04/29
data(movie_data_2011)
data(movie_data_2011)
Data on 25 top movies
data(movies)
data(movies)
A data frame with 26 observations on the following 5 variables.
title
Titles
current
Current week
previous
Previous week
gross
Total
Some movie website, sorry lost the url.
data(movies) boxplot(movies$previous)
data(movies) boxplot(movies$previous)
Age distribution in Maplewood New Jersey, a suburb of New York City. Data is broken down by Male and Female.
data(mw.ages)
data(mw.ages)
A data frame with 103 observations on the following 2 variables.
Counts per age group. Most groups are 1 year, except for 100-104, 105-110, 110+
Same
US Census 2000 data from http://factfinder.census.gov/
data(mw.ages) barplot(mw.ages$Male + mw.ages$Female)
data(mw.ages) barplot(mw.ages$Male + mw.ages$Female)
The NBA draft in 2002 has a lottery
data(nba.draft)
data(nba.draft)
A data frame with 13 observations on the following 2 variables.
Team name
The team won-loss record
The number of balls (of 1000) that this team has in the lottery selection
The NBA draft has a lottery to determine the top 13 placings. The odds in the lottery are determined by the won-loss record of the team, with poorer records having better odds of winning.
Data is taken from https://www.nba.com/news/draft_ties_020424.html.
data(nba.draft) top.pick = sample(row.names(nba.draft),1,prob = nba.draft$Balls)
data(nba.draft) top.pick = sample(row.names(nba.draft),1,prob = nba.draft$Balls)
A data frame measuring daily sea-ice extent from 1978 until 2013.
data(nisdc)
data(nisdc)
A data frame measuring daily sea-ice extent from 1978 until 2013
ftp://sidads.colorado.edu/DATASETS/NOAA/G02135/north/daily/data/NH_seaice_extent_final.csv and ftp://sidads.colorado.edu/DATASETS/NOAA/G02135/north/daily/data/NH_seaice_extent_nrt.csv (now offline).
See the blog post https://www.r-bloggers.com/2012/08/arctic-sea-ice-at-lowest-levels-since-observations-began/ for a description and nice script to play with.
A data set used to investigate the claim that “normal” temperature is 98.6 degrees.
data(normtemp)
data(normtemp)
A data frame with 130 observations on the following 3 variables.
normal body temperature
Gender 1 = male, 2 = female
Resting heart rate
Is normal body temperature 98.6 degrees Fahrenheit? This dataset was constructed to match data presented in an article intending to establish the true value of “normal” body temperature.
This data set was contributed by Allen L. Shoemaker to the Journal of Statistics Education, http://jse.amstat.org/datasets/normtemp.txt.
Data set is simulated from values contained in Mackowiak, P. A., Wasserman, S. S., and Levine, M. M. (1992), "A Critical Appraisal of 98.6 Degrees F, the Upper Limit of the Normal Body Temperature, and Other Legacies of Carl Reinhold August Wunderlich," Journal of the American Medical Association, 268, 1578-1580.
data(normtemp) hist(normtemp$temperature) t.test(normtemp$temperature,mu=98.2) summary(lm(temperature ~ factor(gender), normtemp))
data(normtemp) hist(normtemp$temperature) t.test(normtemp$temperature,mu=98.2) summary(lm(temperature ~ factor(gender), normtemp))
Selected variables from the publicly available data from the National Practitioner Data Bank (NPDB).
data(npdb)
data(npdb)
A data frame with 6797 observations on the following 6 variables.
2 digit abbreviation of state
Field of practice
Age of practitioner (rounded down to 10s digit)
Year of claim
Dollar amount of the award
a practitioner ID, masked for anonymity
The variable names do not match the original. The codings for field come from a document on http://63.240.212.200/publicdata.html.
This dataset excerpts some interesting variables from the NPDB for the years 2000-2003. The question of capping medical malpractice awards to lower insurance costs has been debated nationwide (U.S.). This data is a primary source for informing that debate.
A quotation from https://npdb-hipdb.com/:
“The legislation that led to the creation of the NPDB was enacted [because] the U.S. Congress believed that the increasing occurrence of medical malpractice litigation and the need to improve the quality of medical care had become nationwide problems that warranted greater efforts than any individual State could undertake. The intent is to improve the quality of health care by encouraging State licensing boards, hospitals and other health care entities, and professional societies to identify and discipline those who engage in unprofessional behavior; and to restrict the ability of incompetent physicians, dentists, and other health care practitioners to move from State to State without disclosure or discovery of previous medical malpractice payment and adverse action history. Adverse actions can involve licensure, clinical privileges, professional society membership, and exclusions from Medicare and Medicaid.”
This data came from https://npdb-hipdb.com/
data(npdb)
table(table(npdb$ID))    # big offenders
hist(log(npdb$amount))   # log normal?
A random sample of finishers from the New York City Marathon.
data(nym.2002)
data(nym.2002)
A data frame with 1000 observations on the following 5 variables.
Place in the race
What gender
Age on day of race
Indicator of hometown or nation
Time in minutes to finish
Each year thousands of participants line up to run the New York City Marathon. This list is a random sample from the finishers.
From the New York City Road Runners web site http://www.nyrc.org
data(nym.2002) with(nym.2002, cor(time,age))
data(nym.2002) with(nym.2002, cor(time,age))
A collection of approval ratings for President Obama spanning a duration from early 2010 to the summer of 2013.
data(ObamaApproval)
data(ObamaApproval)
A data frame with 7 variables.
Scraped on 7-5-13 from https://www.realclearpolitics.com/epolls/other/president_obama_job_approval-1044.html.
data(ObamaApproval)
data(ObamaApproval)
The on-base percentage, OBP, is a measure of how often a player gets on base. It differs from the more familiar batting average in that it includes bases on balls (BB) and hit by pitches (HBP). The exact formula is OBP = (H + BB + HBP) / (AB + BB + HBP + SF).
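For instance, the formula can be applied directly to season totals. The counts below are made up for illustration; they are not taken from the Lahman database.
## hypothetical season totals for one player
H <- 150; BB <- 70; HBP <- 5; AB <- 480; SF <- 6
OBP_player <- (H + BB + HBP) / (AB + BB + HBP + SF)
OBP_player     # about 0.401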
data(OBP)
data(OBP)
438 numbers between 0 and 1 corresponding to the on-base “percentage” for the 438 players who had 100 or more at bats in the 2002 baseball season. The "outlier" is Barry Bonds.
This data came from the interesting Lahman baseball data base
http://www.seanlahman.com/. The names attribute uses the playerID
from this database. Unfortunately there were some errors in the
extraction from the original data set. Consult the original for
accurate numbers.
data(OBP) hist(OBP) OBP[OBP>.5] # who is better than 50%? (only Barry Bonds)
data(OBP) hist(OBP) OBP[OBP>.5] # who is better than 50%? (only Barry Bonds)
A data set on oral lesion location for three Indian towns.
data(oral.lesion)
data(oral.lesion)
A data frame with 9 observations on the following 3 variables.
a numeric vector
a numeric vector
a numeric vector
"Exact Inference for Categorical Data", by Cyrus R. Mehta and Nitin R. Patel. Found at http://www.cytel.com/papers/sxpaper.pdf.
data(oral.lesion) chisq.test(oral.lesion)$p.value chisq.test(oral.lesion,simulate.p.value=TRUE)$p.value ## exact is.0269
data(oral.lesion) chisq.test(oral.lesion)$p.value chisq.test(oral.lesion,simulate.p.value=TRUE)$p.value ## exact is.0269
A time series showing ozone values at Halley Bay, Antarctica
data(ozonemonthly)
data(ozonemonthly)
The format is: Time-Series [1:590] from 1957 to 2006: 313 311 370 359 334 296 288 274 NA NA ... - attr(*, "names")= chr [1:590] "V5" "V6" "V7" "V8" ...
Provisional monthly mean ozone values for Halley Bay, Antarctica between 1956 and 2005. Data comes from https://legacy.bas.ac.uk/met/jds/ozone/.
Found at https://legacy.bas.ac.uk/met/jds/ozone/data/ZNOZ.DAT, now off-line.
See https://www.meteohistory.org/2004proceedings1.1/pdfs/11christie.pdf for a discussion of data collection and the Ozone hole.
data(ozonemonthly) ## notice decay in the 80s plot(ozonemonthly) ## October plot shows dramatic swing monthplot(ozonemonthly)
data(ozonemonthly) ## notice decay in the 80s plot(ozonemonthly) ## October plot shows dramatic swing monthplot(ozonemonthly)
Annual snowfall (from July 1 to June 30th) measured at the Paradise ranger station at Mount Rainier, Washington.
data(paradise)
data(paradise)
The data is stored as a zoo class object. The time index refers to the year the snowfall begins.
Due to its rapid elevation gain and proximity to the warm, moist air of the Pacific Northwest, record amounts of snow can fall on Mount Rainier. This data set shows the fluctuations.
Original data from http://www.nps.gov/mora/current/weather.htm
require(zoo) data(paradise) range(paradise, na.rm=TRUE) plot(paradise)
require(zoo) data(paradise) range(paradise, na.rm=TRUE) plot(paradise)
first 2000 digits of pi
data(pi2000)
data(pi2000)
The format is: num [1:2000] 3 1 4 1 5 9 2 6 5 3 ...
Generated by Mathematica, http://www.wolfram.com.
data(pi2000) chisq.test(table(pi2000))
data(pi2000) chisq.test(table(pi2000))
Prime numbers between 1 and 2003.
data(primes)
data(primes)
The format is: num [1:304] 2 3 5 7 11 13 17 19 23 29 ...
Generated using http://www.rsok.com/~jrm/printprimes.html.
data(primes) diff(primes)
data(primes) diff(primes)
Incomes for Puerto Rican immigrants to Miami
data(puerto)
data(puerto)
The format is: num [1:50] 150 280 175 190 305 380 290 300 170 315 ...
From Kitchens' Exploring Statistics
data(puerto) hist(puerto)
data(puerto) hist(puerto)
Creates a qqplot of two variables along with graphs of their densities, shaded so that the corresponding percentiles are clearly matched up.
QQplot(x, y, n = 20, xsf = 4, ysf = 4, main = "qqplot", xlab = deparse(substitute(x)), ylab = deparse(substitute(y)), pch = 16, pcol = "black", shade = "gray", ...)
QQplot(x, y, n = 20, xsf = 4, ysf = 4, main = "qqplot", xlab = deparse(substitute(x)), ylab = deparse(substitute(y)), pch = 16, pcol = "black", shade = "gray", ...)
x |
The x variable |
y |
The y variable |
n |
number of points to plot in qqplot. |
xsf |
scale factor to adjust size of x density graph |
ysf |
scale factor to adjust size of y density graph |
main |
title |
xlab |
label for x axis |
ylab |
label for y axis |
pch |
plot character for points in qqplot |
pcol |
color of plot character |
shade |
shading color |
... |
extra arguments passed to |
Shows density estimates for the two samples in a qqplot. Meant to make this useful plot more transparent to first-time users of quantile-quantile plots.
This function has some limitations: the scale factor may need to be adjusted, and the shading code only shades trapezoids, so it does not completely follow the density.
Produces a graphic
John Verzani
x = rnorm(100) y = rt(100, df=3) QQplot(x,y)
x = rnorm(100) y = rt(100, df=3) QQplot(x,y)
Survival times of 20 rats exposed to radiation
data(rat)
data(rat)
The format is: num [1:20] 152 152 115 109 137 88 94 77 160 165 ...
From Kitchens' Exploring Statistics
data(rat) hist(rat)
data(rat) hist(rat)
A simulated dataset on reaction time to an external event for subjects using cell phones.
data(reaction.time)
data(reaction.time)
A data frame with 60 observations on the following 4 variables.
Age of participant coded as 16-24 or 25+
Male or Female
Code to indicate if subject is using a cell phone "T" or is in the control group "C"
Time in seconds to react to external event
Several studies indicate that cell phone usage while driving can affect reaction times to external events. This dataset uses simulated data based on values from the NHTSA study "The Influence of the Use of Mobile Phones on Driver Situation Awareness".
The NHTSA study was found at http://www-nrd.nhtsa.dot.gov/departments/nrd-13/driver-distraction/PDF/2.PDF
This study and others were linked from the web page http://www.accidentreconstruction.com/research/cellphones/ (now off-line).
data(reaction.time) boxplot(time ~ control, data = reaction.time)
data(reaction.time) boxplot(time ~ control, data = reaction.time)
Simulated length-at-age data for the red drum.
data(reddrum)
data(reddrum)
A data frame with 100 observations on the following 2 variables.
age
a numeric vector
This data is simulated from values reported in a paper by Porch, Wilson and Nieland titled "A new growth model for red drum (Sciaenops ocellatus) that accommodates seasonal and ontogenic changes in growth rates", which appeared in Fishery Bulletin 100(1) (was at http://fishbull.noaa.gov/1001/por.pdf, now off-line). They attribute the data to Beckman et al. and say it comes from measurements in the Northern Gulf of Mexico between September 1985 and October 1998.
data(reddrum) plot(length ~ age, reddrum)
data(reddrum) plot(length ~ age, reddrum)
The Ricker model is used to model the relationship of recruitment of a salmon species versus the number of spawners. The model has two parameters, a rate of growth at small numbers and a decay rate at large numbers. This data set is simulated data for 83 different recordings using parameters found in a paper by Chen and Holtby.
data(salmon.rate)
data(salmon.rate)
The format is: 83 numbers on decay rates.
The Ricker model relates recruitment R to the spawner count S through R = a S exp(-b S). The parameter b is a decay rate for large values of S. In the paper by Chen and Holtby, they studied 83 datasets and found that b is log-normally distributed. The data is simulated from their values to illustrate a log-normal distribution.
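A quick way to see the role of the two parameters is to plot the Ricker curve for some illustrative values. The parameter values below are made up for the picture; they are not Chen and Holtby's estimates.
## Ricker model: recruits as a function of spawners, R = a * S * exp(-b * S)
S <- seq(0, 1000, length = 200)
a <- 2          # growth rate for small S (illustrative value)
b <- 1/250      # decay rate for large S (illustrative value)
plot(S, a * S * exp(-b * S), type = "l",
     xlab = "spawners", ylab = "recruits", main = "Ricker curve")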
These values are from D.G. Chen and L. Blair Holtby, “A regional meta-model for stock recruitment analysis using an empirical Bayesian approach”, found at https://iphc.int/.
data(salmon.rate) hist(log(salmon.rate))
data(salmon.rate) hist(log(salmon.rate))
A data set of unofficial tallies of salmon harvested in Alaska between the years 1980 and 1998. The units are in thousands of fish.
data(salmonharvest)
data(salmonharvest)
A multiple time series object with yearly sampling for the five species Chinook, Sockeye, Coho, Pink, and Chum.
This data was found at http://seamarkets.alaska.edu/ak_harv_fish.htm
data(salmonharvest) acf(salmonharvest)
data(salmonharvest) acf(salmonharvest)
A data frame containing data on health behaviour for school-aged children.
data(samhda)
data(samhda)
A data frame with 600 observations on the following 9 variables.
A numeric weight used in sampling
1=Male, 2=Female, 7=not recorded
1 = 6th, 2 = 8th, 3 = 10th
1 = Y, 2 = N
Number of days you smoked cigarettes in the last 30. 1 = all 30, 2 = 20-29, 3 = 10-19, 4 = 6-9, 5 = 3-5, 6 = 1-2, 7 = 0
Have you ever drank alcohol, 1 = Y, 2 = N
Number of days in last 30 in which you drank alcohol
Ever smoke marijuana. 1 = Y, 2= N
Number of days in the last 30 that marijuana was used. 1 = Never used, 2 = all 30, 3 = 20-29, 4 = 10-19, 5 = 6-9, 6 = 3-5, 7 = 1-2, 8 = Used, but not in last 30 days
A data frame containing data on health behaviour for school-aged children.
This data is sampled from the data set "Health Behavior in School-Aged Children, 1996: [United States]" collected by the World Health Organization, https://www.icpsr.umich.edu/. It is available at the Substance Abuse and Mental Health Data Archive (SAMHDA). Only complete cases are given.
data(samhda) attach(samhda) table(amt.smoke)
data(samhda) attach(samhda) table(amt.smoke)
This dataset contains variables that address the relationship between public school expenditures and academic performance, as measured by the SAT.
data(SAT)
data(SAT)
A data frame with variables state, expend (expenditure per pupil), ratio (pupil/teacher ratio), salary (average teacher salary), the percentage of SAT takers, verbal (verbal score), math (math score), and total (average total).
The data came from http://www.amstat.org/publications/jse/datasets/sat.txt
This data comes from http://www.amstat.org/publications/jse/secure/v7n2/datasets.guber.cfm. It is also included in the mosaic package and commented on at http://sas-and-r.blogspot.com/2012/02/example-920-visualizing-simpsons.html. The variables are described at http://www.amstat.org/publications/jse/datasets/sat.txt.
The author references the original source: The variables in this dataset, all aggregated to the state level, were extracted from the 1997 Digest of Education Statistics, an annual publication of the U.S. Department of Education. Data from a number of different tables were downloaded from the National Center for Education Statistics (NCES) website (Available at: http://nces01.ed.gov/pubs/digest97/index.html) and merged into a single data file.
data(SAT)
data(SAT)
Draws a scatterplot of the data, and histogram in the margins. A trend line can be added, if desired.
scatter.with.hist(x, y, hist.col = gray(0.95), trend.line = "lm", ...)
scatter.with.hist(x, y, hist.col = gray(0.95), trend.line = "lm", ...)
x |
numeric predictor |
y |
numeric response variables |
hist.col |
color for histogram |
trend.line |
Draw a trend line using |
... |
Passed to |
Draws the graphic. No return value.
John Verzani
This example comes from the help page for layout.
data(emissions) attach(emissions) scatter.with.hist(perCapita,CO2)
data(emissions) attach(emissions) scatter.with.hist(perCapita,CO2)
Distribution and point values of letters in Scrabble.
data(scrabble)
data(scrabble)
A data frame with 27 observations on the following 3 variables.
Which piece
point value
Number of pieces
Scrabble is a popular board game based on forming words from the players' pieces, which consist of letters drawn at random from a pile. The frequencies of the letters in the game are given by this data; they match fairly well with the letter distribution of the English language.
data(scrabble)
## perform a chi-squared analysis on a long string. Is it in English?
quote = " R is a language and environment for statistical computing
and graphics. It is a GNU project which is similar to the S language
and environment which was developed at Bell Laboratories (formerly
AT&T, now Lucent Technologies) by John Chambers and colleagues. R
can be considered as a different implementation of S. There are
some important differences, but much code written for S runs
unaltered under R."
quote.lc = tolower(quote)
quote = unlist(strsplit(quote.lc, ""))
ltr.dist = sapply(c(letters, " "), function(x) sum(quote == x))
## compare against the Scrabble letter frequencies, rescaled to proportions
chisq.test(ltr.dist, p = scrabble$freq/sum(scrabble$freq))
This function will simulate a chutes and ladders game. It returns a trajectory for a single player. Optionally it can return the transition matrix, which can be used to speed up the simulation.
simple.chutes(sim=FALSE, return.cl=FALSE, cl=make.cl())
simple.chutes(sim=FALSE, return.cl=FALSE, cl=make.cl())
sim |
Set to TRUE to return a trajectory. |
return.cl |
Set to TRUE to return a transition matrix |
cl |
set to the chutes and ladders transition matrix |
To make a chutes and ladders trajectory: simple.chutes(sim=TRUE)
To return the game board: simple.chutes(return.cl=TRUE)
When doing a lot of simulations, it may be best to pass in the game board:
cl <- simple.chutes(return.cl=TRUE)
simple.chutes(sim=TRUE, cl)
returns a trajectory as a vector, or a matrix if asked to return the transition matrix
John Verzani
board was from http://www.ahs.uwaterloo.ca/~musuem/vexhibit/Whitehill/snakes/snakes.gif
plot(simple.chutes(sim=TRUE))
plot(simple.chutes(sim=TRUE))
Allows one to compare empirical densities of different distributions in a simple manner. The density is used because graphs with multiple histograms are too crowded. The usage is similar to side-by-side boxplots.
simple.densityplot(x, ...)
simple.densityplot(x, ...)
x |
x may be a sequence of data vectors (eg. x,y,z), a data frame with numeric column vectors or a model formula |
... |
You can pass in a bandwidth argument such as bw="SJ". See density for details. A legend will be placed for you automatically. To override the positioning set do.legend="manual". To skip the legend, set do.legend=FALSE. |
Makes a plot
John Verzani
Basically a modified boxplot function. As well it should be as it serves the same utility: comparing distributions.
boxplot
,simple.violinplot
,density
## taken from boxplot
## using a formula
data(InsectSprays)
simple.densityplot(count ~ spray, data = InsectSprays)
## on a matrix (data frame)
mat <- cbind(Uni05 = (1:100)/21, Norm = rnorm(100),
             T5 = rt(100, df = 5), Gam2 = rgamma(100, shape = 2))
simple.densityplot(data.frame(mat))
Simply plots histogram, boxplot and normal plot for experimental data analysis.
simple.eda(x)
simple.eda(x)
x |
a vector of data |
Just does the plots. No return value
John Verzani
Inspired by S-Plus documentation
hist, boxplot, qqnorm
x<- rnorm(100,5,10) simple.eda(x)
x<- rnorm(100,5,10) simple.eda(x)
This makes 3 graphs to check for serial correlation in data. The graphs are a sequential plot (i vs x[i]), a lag plot (plotting x[i] vs x[i-k], where k=1 by default) and an autocorrelation plot from the time series ("ts") package.
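The three panels can also be produced directly with base R. This sketch follows the description above rather than the function's actual code:
## sequential plot, lag plot and autocorrelation plot "by hand"
x <- rnorm(100)
k <- 1                                   # the lag
n <- length(x)
op <- par(mfrow = c(1, 3))
plot(seq_along(x), x, type = "l", xlab = "i", ylab = "x[i]")      # sequential plot
plot(x[(k+1):n], x[1:(n-k)], xlab = "x[i]", ylab = "x[i-k]")      # lag plot
acf(x)                                                            # autocorrelation plot
par(op)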
simple.eda.ts(x, lag=1)
simple.eda.ts(x, lag=1)
x |
a univariate vector of data |
lag |
a lag to give to the lag plot |
Makes the graph with 1 row, 3 columns
John Verzani
Downloaded from http://www.itl.nist.gov/div898/handbook/eda/section3/eda34.htm.
## look for no correlation
x <- rnorm(100); simple.eda.ts(x)
## you will find correlation here
simple.eda.ts(cumsum(x))
Not much, just hides some ugly code
simple.fancy.stripchart(l)
simple.fancy.stripchart(l)
l |
A list with each element to be plotted with a stripchart |
Creates the plot
John Verzani
stripchart
x = rnorm(10);y=rnorm(10,1) simple.fancy.stripchart(list(x=x,y=y))
x = rnorm(10);y=rnorm(10,1) simple.fancy.stripchart(list(x=x,y=y))
Simply plots a histogram and frequency polygon. Students do not need to know how to add lines to a histogram, nor how to extract values.
simple.freqpoly(x, ...)
simple.freqpoly(x, ...)
x |
a vector of data |
... |
arguments passed onto histogram |
returns just the plot
John Verzani
hist,density
x <- rt(100,4) simple.freqpoly(x)
x <- rt(100,4) simple.freqpoly(x)
Simple function to plot both a histogram and a boxplot for comparison
simple.hist.and.boxplot(x, ...)
simple.hist.and.boxplot(x, ...)
x |
vector of univariate data |
... |
Arguments passed to the hist function |
Just prints the two graphs
John Verzani
hist,boxplot,layout
x<-rnorm(100) simple.hist.and.boxplot(x)
x<-rnorm(100) simple.hist.and.boxplot(x)
Used to apply a function to subsets of a data vector. In particular, it is used to find moving averages over a certain "lag" period.
simple.lag(x, lag, FUN = mean)
simple.lag(x, lag, FUN = mean)
x |
a data vector |
lag |
the lag amount to use. |
FUN |
a function to apply to the lagged data. Defaults to mean |
The function FUN is applied to the data x[(i-lag):i] and assigned to the (i-lag)th component of the return vector. Useful for finding moving averages.
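In other words, the result is roughly what the following base-R sketch computes with the default FUN = mean; this illustrates the idea described above and is not the package's exact implementation.
## a moving average by hand: apply the function to windows x[(i-lag):i]
x <- rnorm(100)
lag <- 5
ma <- sapply(1:(length(x) - lag), function(j) mean(x[j:(j + lag)]))
plot(ma, type = "l")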
returns a vector.
Provided to R help list by Martyn Plummer
filter
## find a moving average of the Dow daily High
data(dowdata)
lag = 50; n = length(dowdata$High)
plot(simple.lag(dowdata$High, lag), type = "l")
lines(dowdata$High[lag:n])
Simplifies the usage of lm by avoiding model notation, drawing the scatterplot, adding the regression line, and optionally drawing confidence intervals.
simple.lm(x, y, show.residuals=FALSE, show.ci=FALSE, conf.level=0.95,pred=)
simple.lm(x, y, show.residuals=FALSE, show.ci=FALSE, conf.level=0.95,pred=)
x |
The predictor variable |
y |
The response variable |
show.residuals |
set to TRUE to plot residuals |
show.ci |
set to TRUE to plot confidence intervals |
conf.level |
if show.ci=TRUE will plot these CI's at this level |
pred |
values of the x-variable for prediction |
returns plots and an instance of lm, as though it were called
lm(y ~ x)
John Verzani
lm
## on simulated data
x <- 1:10
y <- 5*x + rnorm(10, 0, 1)
tmp <- simple.lm(x, y)
summary(tmp)
## predict values
simple.lm(x, y, pred = c(5, 6, 7))
Performs a simple sign test, like wilcox.test but without ranking. Only the two-sided p-value is computed; no confidence interval is given.
simple.median.test(x, median=NA)
simple.median.test(x, median=NA)
x |
A data vector |
median |
The value of the median under the null hypothesis |
Unlike wilcox.test, this tests the null hypothesis that the median has the specified value against the two-sided alternative. For illustration purposes only.
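The two-sided p-value of a sign test can be computed in a few lines of base R. This sketch illustrates the idea and is not necessarily the package's exact code:
## sign test "by hand": count values above the hypothesized median
x <- c(12, 2, 17, 25, 52, 8, 1, 12)
m0 <- 20
n.above <- sum(x > m0)
n <- sum(x != m0)                           # values equal to the median are dropped
binom.test(n.above, n, p = 0.5)$p.value     # two-sided p-value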
Returns the p value.
John Verzani
wilcox.test
x<-c(12,2,17,25,52,8,1,12) simple.median.test(x,20)
x<-c(12,2,17,25,52,8,1,12) simple.median.test(x,20)
Shows scatterplot of x vs y with histograms of each on sides of graph. As in the example from layout.
simple.scatterplot(x, y, ...)
simple.scatterplot(x, y, ...)
x |
data vector |
y |
data vector |
... |
passed to plot command |
Returns the plot
John Verzani
layout
x<-sort(rnorm(100)) y<-sort(rt(100,3)) simple.scatterplot(x,y)
x<-sort(rnorm(100)) y<-sort(rt(100,3)) simple.scatterplot(x,y)
'simple.sim' is intended to make it a little easier to do simulations with R. Instead of writing a for loop, or dealing with column or row sums, a student can use this "simpler" interface.
simple.sim(no.samples, f, ...)
simple.sim(no.samples, f, ...)
no.samples |
How many samples do you wish to generate |
f |
A function which generates a single random number from some distributions. simple.sim generates the rest. |
... |
parameters passed to f. It does not like named parameters. |
This is simply a wrapper for a for loop that uses the function f to create random numbers from some distribution.
returns a vector of size no.samples
There must be a thousand better ways to do this. See replicate or sapply for example.
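For comparison, the replicate idiom mentioned above looks like this (a sketch, not part of the package):
f <- function() rnorm(1)
sim <- replicate(100, f())    # same idea as simple.sim(100, f)
hist(sim)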
John Verzani
## First shows trivial (and very unnecessary usage) ## define a function f and then simulate f<-function() rnorm(1) # create a single random real number sim <- simple.sim(100,f) # create 100 random normal numbers hist(sim) ## what does range look like? f<- function (n,mu=0,sigma=1) { tmp <- rnorm(n,mu,sigma) max(tmp) - min(tmp) } sim <- simple.sim(100,f,5) hist(sim)
## First shows trivial (and very unnecessary usage) ## define a function f and then simulate f<-function() rnorm(1) # create a single random real number sim <- simple.sim(100,f) # create 100 random normal numbers hist(sim) ## what does range look like? f<- function (n,mu=0,sigma=1) { tmp <- rnorm(n,mu,sigma) max(tmp) - min(tmp) } sim <- simple.sim(100,f,5) hist(sim)
This function serves the same utility as side-by-side boxplots, only it provides more detail about the different distributions. It plots violinplots instead of boxplots. That is, instead of a box, it uses the density function to plot the density. For skewed distributions, the results look like "violins". Hence the name.
simple.violinplot(x, ...)
simple.violinplot(x, ...)
x |
Either a sequence of variable names, or a data frame, or a model formula |
... |
You can pass arguments to polygon with this. Notably, you can set the color to red with col='red', and a border color with border='blue' |
Returns a plot.
John Verzani
This is really the boxplot function from R/base with some minor adjustments
boxplot, simple.densityplot
## make a "violin" x <- rnorm(100) ;x[101:150] <- rnorm(50,5) simple.violinplot(x,col="brown") f<-factor(rep(1:5,30)) ## make a quintet. Note also choice of bandwidth simple.violinplot(x~f,col="brown",bw="SJ")
## make a "violin" x <- rnorm(100) ;x[101:150] <- rnorm(50,5) simple.violinplot(x,col="brown") f<-factor(rep(1:5,30)) ## make a quintet. Note also choice of bandwidth simple.violinplot(x~f,col="brown",bw="SJ")
Implements a z-test similar to the t.test function
simple.z.test(x, sigma, conf.level=0.95)
simple.z.test(x, sigma, conf.level=0.95)
x |
A data vector |
sigma |
the known population standard deviation |
conf.level |
Confidence level for confidence interval |
Returns a confidence interval for the mean
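The interval is the usual z-interval, xbar plus or minus z* sigma/sqrt(n). The following is a sketch of that computation under the assumption that this is what the function does; consult the source for the details.
## z confidence interval by hand
x <- rnorm(10, 0, 5)
sigma <- 5
conf.level <- 0.95
alpha <- 1 - conf.level
zstar <- qnorm(1 - alpha/2)
mean(x) + c(-1, 1) * zstar * sigma / sqrt(length(x))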
John Verzani
t.test, prop.test
x<-rnorm(10,0,5) simple.z.test(x,5)
x<-rnorm(10,0,5) simple.z.test(x,5)
Judges' scores from the disputed ice skating competition at the 2002 Winter Olympics
data(skateranks)
data(skateranks)
A data frame with 20 observations on the following 11 variables.
a factor giving the skating pair, with levels Berankova/Diabola, Berezhnaya/Sikharulidze, Bestnadigova/Bestandif, Chuvaeva/Palamarchuk, Cobisi/DePra, Ina/Zimmerman, Kautz/Jeschke, Krasitseva/Znachkov, Langlois/Archetto, Lariviere/Faustino, Pang/Tong, Petrova/Tikhonov, Ponomareva/SWviridov, Savchenko/Morozov, Scott/Dulebohn, Sele/Pelletier, Shen/Zhao, Totmianina/Marinin, Zagorska/Siudek, Zhang/Zhang
a factor giving the country, with levels Armenia, Canada, China, Czech, Germany, Italy, Poland, Russia, Slovakia, US, Ukraine, Uzbekistan
nine numeric vectors containing the judges' scores
data(skateranks)
data(skateranks)
Sodium-Lithium countertransport
data(slc)
data(slc)
The format is: num [1:190] 0.467 0.430 0.192 0.192 0.293 ...
From Kitchens' Exploring Statistics
data(slc) hist(slc)
data(slc) hist(slc)
Water pH levels for 75 water samples in the Great Smoky Mountains
data(smokyph)
data(smokyph)
This data frame contains the following columns:
a numeric vector
a numeric vector
a numeric vector
From Kitchens' Exploring Statistics
data(smokyph) plot(smokyph$elev,smokyph$waterph)
data(smokyph) plot(smokyph$elev,smokyph$waterph)
subset of SR26 data on nutrients compiled by the USDA.
data(snacks)
data(snacks)
A data frame with some nutrition variables
This data came from the SR26 data set found at http://www.ars.usda.gov/Services/docs.htm?docid=8964.
data(snacks)
data(snacks)
Murder rates for 30 Southern US cities
data(south)
data(south)
The format is: num [1:30] 12 10 10 13 12 12 14 7 16 18 ...
From Kitchens' Exploring Statistics
data(south) hist(south)
data(south) hist(south)
The southern oscillation is defined as the barometric pressure difference between Tahiti and Darwin, Australia at sea level. The southern oscillation is a predictor of El Niño, which in turn is thought to be a driver of world-wide weather. Specifically, repeated southern oscillation values less than -1 typically define an El Niño.
data(southernosc)
data(southernosc)
The format is: Time-Series [1:456] from 1952 to 1990: -0.7 1.3 0.1 -0.9 0.8 1.6 1.7 1.4 1.4 1.5 ...
Originally downloaded from http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc4412.htm
A description was available at http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc4461.htm
data(southernosc) plot(southernosc)
data(southernosc) plot(southernosc)
Excess returns of the S&P 500. These are defined as the difference between the series and some riskless asset.
data(sp500.excess)
data(sp500.excess)
The format is: Time-Series [1:792] from 1929 to 1995: 0.0225 -0.044 -0.0591 0.0227 0.0077 0.0432 0.0455 0.0171 0.0229 -0.0313 ...
This data set is used in Tsay, Analysis of Financial Time Series. At the time, it was downloaded from www.gsb.uchicago.edu/fac/ruey.tsay/teaching/fts (now off-line). The fSeries package may also contain this data set.
data(sp500.excess) plot(sp500.excess)
data(sp500.excess) plot(sp500.excess)
Splits zoo objects by a grouping variable, a la split(). Each univariate series is turned into a multivariate zoo object. If the original series is multivariate, the output is a list of multivariate zoo objects.
Split.zoo(x, f)
Split.zoo(x, f)
x |
an univariate or multivariate zoo object |
f |
A grouping variable of the same length as x. A warning is given if length(f) is not the same as the index size of x |
Returns a multivariate zoo object, or list of such.
John Verzani
if(require(zoo)) { split.zoo = Split.zoo ## make generic x = zoo(1:30,1:30) f = sample(letters[1:5],30, replace=TRUE) split(x,f) }
if(require(zoo)) { split.zoo = Split.zoo ## make generic x = zoo(1:30,1:30) f = sample(letters[1:5],30, replace=TRUE) split(x,f) }
Create a squareplot as an alternative to a segmented barplot. Useful when the viewer is interested in exact counts in the categories. A squareplot is often used by the New York Times. A grid of squares is presented with each color representing a different category. The colors appear contiguously reading top to bottom, left to right. The colors segment the graph as a segmented bargraph, but the squares allow an interested reader to easily tally the counts.
squareplot(x, col = gray(seq(0.5, 1, length = length(x))), border =NULL, nrows = ceiling(sqrt(sum(x))), ncols = ceiling(sum(x)/nrows), ...)
squareplot(x, col = gray(seq(0.5, 1, length = length(x))), border =NULL, nrows = ceiling(sqrt(sum(x))), ncols = ceiling(sum(x)/nrows), ...)
x |
a vector of counts |
col |
a vector of colors |
border |
border color passed to |
nrows |
number of rows |
ncols |
number of columns |
... |
passed to |
Creates the graph, but has no return value.
John Verzani
The New York Times, https://www.nytimes.com. In particular, Sports page 6, June 15, 2003.
## A Roger Clemens Cy Young year -- roids? squareplot(c(21,7,6),col=c("blue","green","white"))
## A Roger Clemens Cy Young year -- roids? squareplot(c(21,7,6),col=c("blue","green","white"))
A simulation of student records used for placement purposes
data(stud.recs)
data(stud.recs)
A data frame with 160 observations on the following 6 variables.
Score on sequential 1 test
Score on sequential 2 test
Score on sequential 3 test
SAT verbal score
SAT math score
grade on first math class
grade on first math class
Some simulated student records for placement purposes
data(stud.recs) hist(stud.recs$sat.v) with(stud.recs,cor(sat.v,sat.m))
data(stud.recs) hist(stud.recs$sat.v) with(stud.recs,cor(sat.v,sat.m))
Some data for possible student expenses
data(student.expenses)
data(student.expenses)
A data frame of 5 variables for 10 students. All answers are coded "Y
"
for yes, "N
" for no.
Does student have cell phone.
Does student have cable TV.
Does student pay for dial-up internet access.
Does student pay for high-speed or cable modem access to internet.
Does student own a car.
Sample dataset of students expenses.
data(student.expenses) attach(student.expenses) table(dial.up,cable.modem)
data(student.expenses) attach(student.expenses) table(dial.up,cable.modem)
Plots a barplot with nested bars ranging from a minimum to a maximum value. A similar graphic is used on the weather page of the New York Times.
superbarplot(x, names = 1:dim(x)[2], names_height = NULL, col = gray(seq(0.8, 0.5, length = dim(x)[1]/2)), ... )
superbarplot(x, names = 1:dim(x)[2], names_height = NULL, col = gray(seq(0.8, 0.5, length = dim(x)[1]/2)), ... )
x |
A matrix with each pair of rows representing a min and max for the bar. |
names |
Place a name in each bar. |
names_height |
Where the names should go |
col |
What colors to use for the bars. There should be half as many specified as rows of x. |
... |
passed to |
A similar graphic on the weather page of the New York Times shows bars for record highs and lows, normal highs and lows, and actual (or predicted) highs or lows for 10 days of weather. This graphic succinctly and elegantly displays a wealth of information. Intended as an illustration of the polygon function.
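As a hint of how such bars can be built from polygon, here is a minimal sketch that draws a single min-max bar; it is not the function's actual code.
## one "super bar" drawn with polygon(): a rectangle from a low to a high value
plot(c(0, 2), c(40, 100), type = "n", xlab = "", ylab = "temperature")
low <- 49; high <- 95                       # e.g. a record low and high
polygon(c(0.5, 1.5, 1.5, 0.5), c(low, low, high, high), col = gray(0.8))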
Returns a plot, but no other values.
John Verzani
The weather page of the New York Times
record.high = c(95,95,93,96,98,96,97,96,95,97)
record.low  = c(49,47,48,51,49,48,52,51,49,52)
normal.high = c(78,78,78,79,79,79,79,80,80,80)
normal.low  = c(62,62,62,63,63,63,64,64,64,64)
actual.high = c(80,78,80,68,83,83,73,75,77,81)
actual.low  = c(62,65,66,58,69,63,59,58,59,60)
x = rbind(record.low, record.high, normal.low, normal.high, actual.low, actual.high)
the.names = c("S","M","T","W","T","F","S")[c(3:7,1:5)]
superbarplot(x, names = the.names)
Fictitious data on a taste test for a new goo
data(tastesgreat)
data(tastesgreat)
A data frame with 40 observations on the following 3 variables.
a factor with levels Female and Male
a numeric vector
1 if enjoyed, 0 otherwise
Fictitious data on a taste test with gender and age as covariates.
data(tastesgreat) summary(glm(enjoyed ~ gender + age, data=tastesgreat, family=binomial))
data(tastesgreat) summary(glm(enjoyed ~ gender + age, data=tastesgreat, family=binomial))
The yields at constant fixed maturity have been constructed by the Treasury Department, based on the most actively traded marketable treasury securities.
data(tcm1y)
data(tcm1y)
The format is: Time-Series [1:558] from 1953 to 2000: 2.36 2.48 2.45 2.38 2.28 2.2 1.79 1.67 1.66 1.41 ...
From the tcm data set in the tseries package. Given here for convenience only. They reference https://www.federalreserve.gov/Releases/H15/data.htm.
data(tcm1y) ar(diff(log(tcm1y)))
data(tcm1y) ar(diff(log(tcm1y)))
Simulated measurements of temperature and salinity in the center of 'Eddy Juggernaut', a huge anti-cyclone (clockwise rotating) Loop Current Ring in the Gulf of Mexico. The start date is October 18, 1999.
data(tempsalinity)
data(tempsalinity)
The data is stored as multivariate zooreg object with variables longitude, latitude, temperature (Celsius), and salinity (psu - practical salinity units, originally from https://toptotop.org/2014/10/21/climate_solutio/).
The temperature-salinity profile of a body of water can be characteristic. This data shows a change in the profile over time as the eddy accumulates new water.
Data from simulation by Andrew Poje.
data(tempsalinity)
if(require(zoo)) {
  plot(tempsalinity[,3:4])
  ## override the plot.zoo method
  plot.default(tempsalinity[,3:4])
  abline(lm(salinity ~ temperature, tempsalinity, subset = 1:67))
  abline(lm(salinity ~ temperature, tempsalinity, subset = -(1:67)))
}
In U.S. culture, an older man dating a younger woman is not uncommon, but when the age difference becomes too great it may seem to some to be unacceptable. This data set is a survey of 10 people giving their minimum acceptable age for a partner for a range of ages for the male. A surprising rule of thumb (surprising in the sense that someone took the time to figure this out) for the minimum is half the age plus seven. Does this rule hold for this data set?
data(too.young)
data(too.young)
A data frame with 80 observations on the following 2 variables.
a numeric vector
a numeric vector
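One way to check the rule of thumb is to overlay the line minimum = 7 + Male/2 on the data and compare it with a least-squares fit; a sketch, using the Male and Female columns as in the example below:
data(too.young)
plot(Female ~ Male, data = too.young)
abline(7, 1/2, lty = 2)                       # the "half your age plus seven" rule
abline(lm(Female ~ Male, data = too.young))   # least-squares fit for comparison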
data(too.young) lm(Female ~ Male, data=too.young)
data(too.young) lm(Female ~ Male, data=too.young)
IQ data of Burt on identical twins that were separated near birth.
data(twins)
data(twins)
A data frame with 27 observations on the following 3 variables.
IQ for twin raised with foster parents
IQ for twin raised with biological parents
Social status of biological parents
This data comes from the R package that accompanies Julian Faraway's notes Practical Regression and Anova in R (now a book).
data(twins) plot(Foster ~ Biological, twins)
data(twins) plot(Foster ~ Biological, twins)
Song titles and lengths of U2 albums from 1980 to 1997.
data(u2)
data(u2)
The data is stored as a named list. Each list entry corresponds to an album stored as a vector. The values of the vector are the song lengths in seconds and the names are the track titles.
Original data retrieved from http://www.u2station.com/u2ography.html
data(u2) sapply(u2,mean) # average track length max(sapply(u2,max)) # longest track length sort(unlist(u2)) # lengths in sorted order
data(u2) sapply(u2,mean) # average track length max(sapply(u2,max)) # longest track length sort(unlist(u2)) # lengths in sorted order
Data on growth of sea urchins.
data(urchin.growth)
data(urchin.growth)
A data frame with 250 observations on the following 2 variables.
Estimated age of sea urchin
Measurement of size
Data is sampled from a data set that accompanies the thesis of P. Grosjean.
Thesis was found at http://www.sciviews.org/_pgrosjean
data(urchin.growth) plot(jitter(size) ~ jitter(age), data=urchin.growth)
data(urchin.growth) plot(jitter(size) ~ jitter(age), data=urchin.growth)
vacation days
data(vacation)
data(vacation)
The format is: num [1:35] 23 12 10 34 25 16 27 18 28 13 ...
From Kitchens' Exploring Statistics
data(vacation) hist(vacation)
data(vacation) hist(vacation)
This function serves the same utility as side-by-side boxplots, only it provides more detail about the different distribution. It plots violinplots instead of boxplots. That is, instead of a box, it uses the density function to plot the density. For skewed distributions, the results look like "violins". Hence the name.
violinplot(x, ...)
violinplot(x, ...)
x |
Either a sequence of variable names, or a data frame, or a model formula |
... |
You can pass arguments to polygon with this. Notably, you can set the color to red with col='red', and a border color with border='blue' |
Returns a plot.
John Verzani
This is really the boxplot function from R/base with some minor adjustments
boxplot, densityplot
## make a "violin" x <- rnorm(100) ;x[101:150] <- rnorm(50,5) violinplot(x,col="brown") f<-factor(rep(1:5,30)) ## make a quintet. Note also choice of bandwidth violinplot(x~f,col="brown",bw="SJ")
## make a "violin" x <- rnorm(100) ;x[101:150] <- rnorm(50,5) violinplot(x,col="brown") f<-factor(rep(1:5,30)) ## make a quintet. Note also choice of bandwidth violinplot(x~f,col="brown",bw="SJ")
Water temperature measurements at 10 minute intervals at a site off the East coast of the United States in the summer of 1974.
data(watertemp)
data(watertemp)
A zoo class object with index stored as POSIXct elements. The measurements are in Celsius.
NODC Coastal Ocean Time Series Database Search Page which was at http://www.nodc.noaa.gov/dsdt/tsdb/search.html
if(require(zoo)) { data(watertemp) plot(watertemp) acf(watertemp) acf(diff(watertemp)) }
if(require(zoo)) { data(watertemp) plot(watertemp) acf(watertemp) acf(diff(watertemp)) }
This data set comes from a JSE article http://jse.amstat.org/v20n3/woodard.pdf by Roger Woodard. The data is described by: The information for this data set was taken from a Wake County, North Carolina real estate database. Wake County is home to the capital of North Carolina, Raleigh, and to Cary. These cities are the fifteenth and eighth fastest growing cities in the USA respectively, helping Wake County become the ninth fastest growing county in the country. Wake County boasts a 31.18 percent growth rate and a population of approximately 823,345 residents. This data includes 100 randomly selected residential properties in the Wake County registry, denoted by their real estate ID number. For each selected property, 11 variables are recorded. These variables include year built, square feet, adjusted land value, and address, among others.
data(wchomes)
data(wchomes)
a data frame
https://www.amstat.org/publications/jse/v16n3/woodard.xls (now off-line)
http://jse.amstat.org/v20n3/woodard.pdf
data(wchomes)
data(wchomes)
Correlated data on what makes us happy
data(wellbeing)
data(wellbeing)
A data frame with data about what makes people happy (well being) along with several other covariates
Found from https://www.prcweb.co.uk/lab/what-makes-us-happy/.
https://www.prcweb.co.uk/lab/what-makes-us-happy/ and https://www.nationalaccountsofwellbeing.org/
data(wellbeing)
data(wellbeing)
Downloads stock data from Yahoo!
yahoo.get.hist.quote(instrument = "^gspc", destfile = paste(instrument, ".csv", sep = ""), start, end, quote = c("Open", "High", "Low", "Close"), adjusted = TRUE, download = TRUE, origin = "1970-01-01", compression = "d")
yahoo.get.hist.quote(instrument = "^gspc", destfile = paste(instrument, ".csv", sep = ""), start, end, quote = c("Open", "High", "Low", "Close"), adjusted = TRUE, download = TRUE, origin = "1970-01-01", compression = "d")
instrument |
Ticker symbol as character string. |
destfile |
Temporary file for storage |
start |
Date to start. Specified as "2005-12-31" |
end |
Date to end |
quote |
Any/All of "Open", "High", "Low", "Close" |
adjusted |
Adjust for stock splits, dividends. Defaults to TRUE |
download |
Download the data |
origin |
Dates are recorded in the number of days since the origin. A value of "1970-01-01" is the default. This was changed from "1899-12-30". |
compression |
Passed to yahoo |
Goes to chart.yahoo.com and downloads the stock data. By default returns a multiple time series of class mts with missing days padded by NAs.
A multiple time series with time measuring the number of days since the value specified to origin.
Daniel Herlemont <[email protected]>
This function was found on the R-SIG-Finance mailing list
get.hist.quote in the tseries package
Mean catch rate of yellow fin tuna in Tropical Indian Ocean for the given years.
data(yellowfin)
data(yellowfin)
A data frame with 49 observations on the following 2 variables.
The year
Mean number of fish per 100 hooks cast
Estimates for the mean number of fish caught per 100 hooks are given for a number of years. This can be used to give an estimate for the size, or biomass, of the species during these years, assuming that the more abundant the fish, the larger the mean. In practice this assumption is viewed with a wide range of attitudes.
This data is read from a graph that accompanies Myers RA, Worm B (2003) “Rapid worldwide depletion of predatory fish communities”. Nature 423:280-283.
See also http://www.soest.hawaii.edu/PFRP/large_pelagic_predators.html for rebuttals to the Myers and Worm article.
data(yellowfin) plot(yellowfin)
data(yellowfin) plot(yellowfin)