Title: | Practicals for Use with Davison (2003) Statistical Models |
---|---|
Description: | Contains the datasets and a few functions for use with the practicals outlined in Appendix A of the book Statistical Models (Davison, 2003, Cambridge University Press), which can be found at <doi:10.1017/CBO9780511815850>. |
Authors: | Anthony Davison <[email protected]> |
Maintainer: | Alessandra R. Brazzale <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.4-3.1 |
Built: | 2024-11-02 06:31:30 UTC |
Source: | CRAN |
Adds lines to density plot used in Practical 11.3
add.exp.lines(exp.out, i, B = 10)
add.exp.lines(exp.out, i, B = 10)
exp.out |
Gibbs sampler output |
i |
Variable index (=1, 2) |
B |
Upper bound for truncated exponential density |
Anthony Davison
B <-10; I <- 15; S <- 500 exp.out <- exp.gibbs(B=B,I=I,S=S) hist(exp.out[1,,I],prob=TRUE,nclass=15,xlab="u1",ylab="PDF",xlim=c(0,B),ylim=c(0,1)) add.exp.lines(exp.out,1)
B <-10; I <- 15; S <- 500 exp.out <- exp.gibbs(B=B,I=I,S=S) hist(exp.out[1,,I],prob=TRUE,nclass=15,xlab="u1",ylab="PDF",xlim=c(0,B),ylim=c(0,1)) add.exp.lines(exp.out,1)
Three-state data derived from daily rainfall over three years at Alofi in the Niue Island group in the Pacific Ocean. The states are 1 (no rain), 2 (up to 5mm rain), 3 (over 5mm).
data(alofi)
data(alofi)
Avery, P. J. and Henderson, D. A. (1999) Fitting Markov chain models to discrete state series such as DNA sequences. Applied Statistics, 48, 53–61.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 294.
data(alofi) fit <- MClik(alofi) fit$df fit$AIC plot(fit$df,fit$AIC) # best model has minimal AIC?
data(alofi) fit <- MClik(alofi) fit$df fit$AIC plot(fit$df,fit$AIC) # best model has minimal AIC?
A clinical trial to evaluate the efficacy of maintenance chemotherapy for acute myelogenous leukaemia was conducted by Embury et al. (1977) at Stanford University. After reaching a stage of remission through treatment by chemotherapy, patients were randomized into two groups. The first group received maintenance chemotherapy and the second group did not. The aim of the study was to see if maintenance chemotherapy increased the length of the remission. The data here formed a preliminary analysis which was conducted in October 1974.
data(aml)
data(aml)
A data frame with 23 observations on the following 3 variables.
The length of the complete remission (in weeks).
An indicator of right censoring. 1 indicates that the patient had a relapse and so 'time' is the length of the remission. 0 indicates that the patient had left the study or was still in remission in October 1974, that is the length of remission is right-censored.
The group into which the patient was randomized. Group 1 received maintenance chemotherapy, group 2 did not.
Miller, R.G. (1981) Survival Analysis. John Wiley: New York. Page 49.
Embury, S.H, Elias, L., Heller, P.H., Hood, C.E., Greenberg, P.L. and Schrier, S.L. (1977) Remission maintenance therapy in acute myelogenous leukaemia. Western Journal of Medicine, 126, 267–272.
45 school pupils were divided at random into 5 groups of size 9. Groups A and B were taught arithmetic in separate classes by the usual method. Groups C, D, and E were taught together for several days. On each day group C were publically praised, group D were publically reproved, and group E were ignored. The responses are from a standard test taken by all pupils at the end of the period.
data(arithmetic)
data(arithmetic)
A data frame with 45 observations on the following 2 variables.
a factor with levels A
B
C
D
E
a numeric vector
Unpublished lecture notes, Imperial College, London.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 427.
data(arithmetic) attach(arithmetic) plot(y~group) anova(lm(y~group,data=arithmetic)) summary(lm(y~group,data=arithmetic)) # two different parametrisations summary(lm(y~group-1,data=arithmetic)) # for ANOVA
data(arithmetic) attach(arithmetic) plot(y~group) anova(lm(y~group,data=arithmetic)) summary(lm(y~group,data=arithmetic)) # two different parametrisations summary(lm(y~group-1,data=arithmetic)) # for ANOVA
These are the frequencies with which Shakespeare used word types. There are 846 word types which appear more than 100 times in his total works, giving an overall total of 31534 word types.
data(bard)
data(bard)
A data frame with 100 observations on the following 2 variables.
Number of times a word type is used
Number of word types used r times
The canon of Shakespeare's accepted works contains 884,647 words, with 31,534 distinct word types. A word type is a distinguishable arrangement of letters, so ‘king’ is different from ‘kings’ and ‘alehouse’ different from both ‘ale’ and ‘house’.
Efron, B. and Thisted, R. (1976) Estimating the number of unseen species: How many words did Shakespeare know? Biometrika, 63, 435–448.
Thisted, R. and Efron, B. (1987 ) Did Shakespeare write a newly-discovered poem? Biometrika, 74, 445–455.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 629.
The spatial layout and plot yield at harvest in a final assessment trial of 75 varietes of spring barley. The varieties are sown in three blocks, each with 75 plots, and each variety is replicated thrice. The yield for variety 27 is missing in block 3.
data(barley)
data(barley)
A data frame with 225 observations on the following 4 variables.
a factor with three levels
a numeric vector with 75 values giving the plot
a factor with 75 levels giving the variety of barley sown in the plot
yield at harvest, standardised to have unit crude variance
Besag, J. E., Green, P. J., Higdon, D. and Mengersen, K. (1995) Bayesian computation and stochastic systems (with Discussion). Statistical Science, 10, 3–66.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Pages 534–535.
data(barley)
data(barley)
Data comprise 100 consecutive elemetric measurements of the body temperature of a female beaver, at 10-minute intervals. The animal remained in its lodge for the first 38 recordings, and then went outside.
data(beaver)
data(beaver)
A data frame with 100 observations on the following 4 variables.
Day number
Time of day (hhmm)
Body temperature (degrees Celsius)
Indicator of activity outside the lodge
Reynolds, P. S. (1994) Time-series analyses of beaver body temperatures. In Case Studies in Biometry, eds N. Lange, L. Ryan, L. Billard, D. R. Brillinger, L. Conquest and J. Greenhouse, pp. 211–228. New York: Wiley.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 266.
data(beaver) plot(beaver$temp,type="l",xlab="Time",ylab="Temperature")
data(beaver) plot(beaver$temp,type="l",xlab="Time",ylab="Temperature")
This function implements a Gibbs sampler for the normal changepoint model applied to the beaver temperature data used in Example 6.22 and Practical 11.7 of Davison (2003), which should be consulted for details.
beaver.gibbs(init, y, R = 10, a = 1, b = 0.05)
beaver.gibbs(init, y, R = 10, a = 1, b = 0.05)
init |
Initial values for parameters |
y |
A series of normal observations |
R |
Number of iterations of sampler |
a |
Value of a hyperparameter |
b |
Value of a hyperparameter |
This is provided simply so that readers spend less time typing. It is not intended to be robust and general code.
A matrix of size R x 6, whose first four columns contain the values of the parameters for the iterations. Columns 5 and 6 contain the log likelihood and log prior for that iteration.
Anthony Davison ([email protected]
)
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Practical 11.7.
## From Example 11.7: data(beaver) system.time( gibbs.out <- beaver.gibbs(c(36, 40, 3, 38), beaver$temp, R=1000)) par(mfrow=c(2,3)) plot.ts(gibbs.out[,1],main="mu1") # time series plot for mu1 plot.ts(gibbs.out[,2],main="mu2") # time series plot for mu2 plot.ts(gibbs.out[,3],main="lambda") # time series plot for lambda plot.ts(gibbs.out[,4],main="gamma") # time series plot for gamma plot.ts(gibbs.out[,5],main="log likelihood") # and of log likelihood
## From Example 11.7: data(beaver) system.time( gibbs.out <- beaver.gibbs(c(36, 40, 3, 38), beaver$temp, R=1000)) par(mfrow=c(2,3)) plot.ts(gibbs.out[,1],main="mu1") # time series plot for mu1 plot.ts(gibbs.out[,2],main="mu2") # time series plot for mu2 plot.ts(gibbs.out[,3],main="lambda") # time series plot for lambda plot.ts(gibbs.out[,4],main="gamma") # time series plot for gamma plot.ts(gibbs.out[,5],main="log likelihood") # and of log likelihood
Numbers of Japanese beetle larvae per square found in the top foot of soil of an 18 x 8 foot area of a field planted with maize. The columns of the matrix correspond to the direction of cultivation of the field; the maize rows were sown 4 feet apart.
data(beetle)
data(beetle)
The format is: num [1:18, 1:8] 0 2 3 1 5 3 5 3 2 3 ...
Unpublished lecture notes, Imperial College, London
data(beetle)
data(beetle)
The times taken to cycle up a hill, as function of the bicycle seat height, use of dynamo, and tyre pressure. 16 runs were made using a factorial design.
data(bike)
data(bike)
A data frame with 16 observations on the following 11 variables.
Day of run
Order of run
Seat height: -1 indicates 26 inches, 1 indicates 30 inches
Use of dynamo: -1 indicates not used
Tyre pressure: -1 indicates 40 psi, 1 indicates 55 psi
factor corresponding to day
factor corresponding to run
factor corresponding to seat height
factor corresponding to use of dynamo
factor corresponding to tyre pressure
Run time (seconds)
Box, G. E. P., Hunter, W. G. and Hunter, J. S. (1978) Statistics for Experimenters. New York: Wiley.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 357.
data(bike) anova(lm(time~dayf+runf+seat+dynamo+tyre,data=bike))
data(bike) anova(lm(time~dayf+runf+seat+dynamo+tyre,data=bike))
Times spent in delivery suite by 95 women giving birth at the John Radcliffe Hospital, Oxford. The data were kindly provided by Ethel Burns.
data(births)
data(births)
A data frame with 95 observations on the following 2 variables.
Day on which woman arrived
Time (hours) spent on delivery suite
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 18.
data(births)
data(births)
The Blalock–Taussig shunt is an operative procedure for infants with congenital cyanotic heart disease. The data are the survival times in months for shunts in 81 infants, divided into two age groups.
data(blalock)
data(blalock)
A data frame with 81 observations on the following 3 variables.
1 indicates infants aged over 1 month at time of the operation. 2 indicates those aged 30 or fewer days at time of operation.
survival time in months
censoring indicator: 1 indicates observed failure time
Oakes, D. (1991) Life-table analysis. In Statistical Theory and Modelling: In Honour of Sir David Cox, FRS, eds D. V. Hinkley, N. Reid and E. J. Snell, pp. 107–128. London: Chapman and Hall/CRC Press.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 192.
data(blalock) library(survival) plot(survfit(Surv(months,cens)~group,data=blalock),conf.int=TRUE,col=c(2,3))
data(blalock) library(survival) plot(survfit(Surv(months,cens)~group,data=blalock),conf.int=TRUE,col=c(2,3))
These are the number of adult flour beetles which died following a 5-hour exposure to gaseous carbon disulphide.
data(bliss)
data(bliss)
A data frame with 8 observations on the following 3 variables.
concentration of carbon disulphide(mg. per litre)
Numbers of beetles exposed
Numbers of beetles dying
Bliss, C. I. (1935).The calculation of the dosage-mortality curve. Annals of Applied BIology, 22, 134-167.
data(bliss) attach(bliss) plot(log(dose),r/m,ylim=c(0,1),ylab="Proportion dead") fit <- glm(cbind(r,m-r)~log(dose),binomial) summary(fit)
data(bliss) attach(bliss) plot(log(dose),r/m,ylim=c(0,1),ylab="Proportion dead") fit <- glm(cbind(r,m-r)~log(dose),binomial) summary(fit)
Data on the incidence of blood groups O, A, B, and AB in 12 studies on people living in Britain or of British origin living elsewhere.
data(blood)
data(blood)
A data frame with 12 observations on the following 4 variables.
Number of persons with blood group O
Number of persons with blood group A
Number of persons with blood group B
Number of persons with blood group AB
Taylor, G. L. and Prior, A. M. (1938) Blood groups in England. Annals of Eugenics, 8, 343–355.
Initial and follow-up status for 37 breast cancer patients treated for spinal metastases. The status is able to walk unaided (1), unable to walk unaided (2), dead (3). The follow-up times are 0, 3, 6, 12, 24, and 60 months after treatment began.
data(breast)
data(breast)
A data frame with 37 observations on the following 8 variables.
Case number
Initial status
Status immediately after treatment started
Status after 3 months
Status after 6 months
Status after 12 months
Status after 24 months
Status after 60 months
Woman 24 was alive after 6 months but her ability to walk was not recorded (she was in state 1 or 2).
NA indicates that a woman has previously died, or that her status is unknown.
de Stavola, B. L. (1988) Testing departures from time homogeneity in multistate Markov processes. Applied Statistics, 37, 242–250.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 227.
data(breast)
data(breast)
These are said to be measurements of IQ scores for pairs of identical twins, the first raised by foster parents and the second raised by natural parents, published by Sir Cyril Burt. Cases are divided into groups according to parents' social class, A-C, labelled 1-3. The general objective is to assess the impact of social class, and in particular the effect of environment, on IQ.
data(burt)
data(burt)
A data frame with 27 observations on the following 3 variables.
IQ score for twin raised in foster home
IQ score for twin raised by natural parents
Social class of twins
Burt used these and similar data to argue that IQ was largely inherited, a view which strongly influenced British education through the creation of the 11+ exam, which was used to decide which children should be given different forms of education. However after his death it was suggested that the data were fake, a view accepted by some and strongly rebutted by others.
Unpublished lecture notes of David Hinkley.
For information about Burt, see www.indiana.edu/~intell/burt.shtml
data(burt) attach(burt) par(pty="s") plot(x,y,type="n",xlim=c(60,140),ylim=c(60,140)) text(x,y,class,cex=0.8) abline(0,1,lty=2)
data(burt) attach(burt) par(pty="s") plot(x,y,type="n",xlim=c(60,140),ylim=c(60,140)) text(x,y,class,cex=0.8) abline(0,1,lty=2)
Data on breaking angles of chocolate cakes made using different recipes, mixes, and cooking temperatures.
data(cake)
data(cake)
A data frame with 270 observations on the following 4 variables.
Recipe used
mix, a factor with 15 levels
temperature (degrees Fahrenheit) at which cake baked
breaking angle (degrees)
These are data from an experiment in which six different temperatures for cooking three recipes for chocolate cake were compared. Each time a mix was made using one of the recipes, enough batter was prepared for six cakes, which were then randomly allocated to be cooked at the different temperatures. The response is the breaking angle, found by fixing one half of a slab of cake, then pivoting the other half about the middle until breakage occurs.
Cochran, W. G. and Cox, G. M. (1959) Experimental Designs. Second edition. New York: Wiley.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 454.
data(cake)
data(cake)
These are data on the uptake of calcium by cells suspended in a radioactive solution, as a function of time.
data(calcium)
data(calcium)
A data frame with 27 observations on the following 2 variables.
The time (in minutes) that the cells were suspended in the solution
The amount of calcium uptake (nmoles/mg)
Howard Grimes from the Botany Department, North Carolina State University, conducted an experiment for biochemical analysis of intracellular storage and transport of calcium across plasma membrane. Cells were suspended in a solution of radioactive calcium for a certain length of time and then the amount of radioactive calcium that was absorbed by the cells was measured. The experiment was repeated independently with 9 different times of suspension each replicated 3 times.
Rawlings, J.O. (1988) Applied Regression Analysis. Wadsworth and Brooks/Cole Statistics/Probability Series.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 469.
data(calcium) summary(nls(cal~beta0*(1-exp(-time/beta1)),data=calcium,start=list(beta0=5,beta1=5)))
data(calcium) summary(nls(cal~beta0*(1-exp(-time/beta1)),data=calcium,start=list(beta0=5,beta1=5)))
The title should be self-explanatory.
data(cardiac)
data(cardiac)
A data frame with 12 observations on the following 2 variables.
Number of deaths
Number of operations
Spiegelhalter, D. J., Thomas, A., Best, N. G. and W. R. Gilks (1996) BUGS 0.5 Examples Volume 1 (Version ii). Cambridge: MRC Biostatistics Unit. Page 15.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 579.
Data from a Latin square experiment on the potencies of cardiac drugs given to anesthetized cats.
data(cat.heart)
data(cat.heart)
A data frame with 64 observations on the following 6 variables.
on which experiment performed
morning or afternoon
four observers took part
cardiac drug given to cat
100 times log dose in micrograms at which cat died
100 times log cat heart weight in grams
These are results from an experiment to determine the relative potencies of eight similar cardiac drugs, labelled A–H, where A is a standard. The method used was to infuse slowly a suitable dilution of the drug into an anaesthetized cat. The dose at which death occurred and the weight of the cat's heart were recorded. Four observers each made two determinations on each of eight days, with a Latin square design used to eliminate observer and time differences. The heart weight cannot be known at the start of the experiment, but might be expected to affect comparisons among the treatments; it is assumed that heart weight is unaffected by the treatments.
Unpublished lecture notes, Imperial College, London.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 447.
data(cat.heart) anova(lm(y~Observer+Time+Day+Drug+Observer:Time,data=cat.heart))
data(cat.heart) anova(lm(y~Observer+Time+Day+Drug+Observer:Time,data=cat.heart))
Heat evolved in setting of cement, as a function of its chemical composition.
data(cement)
data(cement)
A data frame with 13 observations on the following 5 variables.
percentage weight in clinkers of 3CaO.Al2O3
percentage weight in clinkers of 3CaO.SiO2
percentage weight in clinkers of 4CaO.Al2O3.Fe2O3
percentage weight in clinkers of 2CaO.SiO2
heat evolved (calories/gram)
Woods, H., Steinour, H. H. and Starke, H. R. (1932) Effect of composition of Portland cement on heat evolved during hardening. Industrial Engineering and Chemistry, 24, 1207–1214.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 355.
data(cement) lm(y~x1+x2+x3+x4,data=cement)
data(cement) lm(y~x1+x2+x3+x4,data=cement)
Balanced incomplete block design on the effect of amino acids on growth of chick bones.
data(chicks)
data(chicks)
A data frame with 30 observations on the following 3 variables.
Treatment with levels All
(all amino acids present),
Arg-
(all acids present except Arg), etc.
bones were taken in pairs from 15 chicks
Log10 dry weight of bones at end of experiment
Bones from 7-day-old chick embryos were cultivated over a nutrient chemical medium. Two bones were available from each chick, and the experiment was set out in a balanced incomplete block design with two units per block. The treatments were growth in the complete medium, with about 30 nutrients in carefully controlled quantities, and growth in five other media, each with a single amino acid omitted. Thus His-, Arg-, and so forth denote media without particular amino acids.
Cox, D. R. and Snell, E. J. (1981) Applied Statistics: Principles and Examples. London: Chapman and Hall/CRC Press. Page 95.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 432.
These are the times in minutes taken for four chimpanzees to learn each of four words.
data(chimps)
data(chimps)
A data frame with 40 observations on the following 3 variables.
a factor with levels 1-4
a factor with 1-10
learning time (minutes)
Brown, B. W. and Hollander, M. (1977) Statistics: A Biomedical Introduction. New York: Wiley.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 485.
data(chimps) anova(glm(y~chimp+word,Gamma(log),data=chimps),test="F") anova(glm(y~word+chimp,Gamma(log),data=chimps),test="F")
data(chimps) anova(glm(y~chimp+word,Gamma(log),data=chimps),test="F") anova(glm(y~word+chimp,Gamma(log),data=chimps),test="F")
The data comprise lengths of cloth samples and the numbers of flaws found in them.
data(cloth)
data(cloth)
A data frame with 32 observations on the following 2 variables.
The length of the roll of cloth.
The number of flaws found in the roll.
Bissell, A. F. (1972) A negative binomial model with varying element size. Biometrika, 59, 435–441.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 515.
data(cloth) attach(cloth) plot(x,y) # Comparison of Poisson and quasilikelihood fits summary(glm(y~x-1,family=poisson(identity))) summary(glm(y~x-1,family=quasipoisson(identity)))
data(cloth) attach(cloth) plot(x,y) # Comparison of Poisson and quasilikelihood fits summary(glm(y~x-1,family=poisson(identity))) summary(glm(y~x-1,family=quasipoisson(identity)))
The 'coal' data frame has 191 rows and 1 column.
This data frame gives the dates of 191 explosions in UK coal mines which resulted in 10 or more fatalities. The time span of the data is from March 15, 1851 until March 22 1962.
data(coal)
data(coal)
This data frame contains the following column:
The date of the disaster. The integer part of 'date' gives the year. The day is represented as the fraction of the year that had elapsed on that day.
The data were obtained from
Hand, D.J., Daly, F., Lunn, A.D., McConway, K.J. and Ostrowski, E. (1994) A Handbook of Small Data Sets, Chapman and Hall.
Jarrett, R.G. (1979) A note on the intervals between coal-mining disasters. Biometrika, 66, 191–193.
data(coal) plot(density(coal$date)) rug(coal$date)
data(coal) plot(density(coal$date)) rug(coal$date)
This function computes the posterior distribution of the success probability theta when a coin is spun on its edge (or tossed), when the prior density for that probability is a mixture of beta densities.
coin.spin(para, r = 0, n = 0, n.points = 199)
coin.spin(para, r = 0, n = 0, n.points = 199)
para |
A matrix with 3 columns and k rows, where k is the number of components of the mixture. The first column contains the probabilities, and the next two the shape parameters a and b for the components. |
r |
Number of successes |
n |
Number of trials |
n.points |
The number of values of theta, equally-spaced between 0 and 1. |
This is provided simply so that readers spend less time typing. It is not intended to be robust and general code.
x |
Values of theta |
y |
Values of posterior density for theta |
Anthony Davison ([email protected]
)
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Practical 11.1.
## From Practical 11.1: para <- matrix( c(0.5, 10, 20, 0.5, 20, 10), nrow=2, ncol=3, byrow=TRUE) prior <- coin.spin(para) plot(prior, xlab="theta",ylab="PDF", type="l",ylim=c(0,6)) post <- coin.spin(para, r=4, n=10)
## From Practical 11.1: para <- matrix( c(0.5, 10, 20, 0.5, 20, 10), nrow=2, ncol=3, byrow=TRUE) prior <- coin.spin(para) plot(prior, xlab="theta",ylab="PDF", type="l",ylim=c(0,6)) post <- coin.spin(para, r=4, n=10)
Data on major insurance claims due to fires in Denmark, 1980–1990. The values of the claims have been rescaled for commercial reasons.
data(beaver)
data(beaver)
An irregular time series.
Embrechts, P., Kluppelberg, C. and Mikosch, T. (1997) Modelling Extremal Events for Insurance and Finance. Berlin: Springer.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 278.
data(danish) plot(danish,type="h")
data(danish) plot(danish,type="h")
The heights in eighths of inches of young maize plants put by Charles Darwin in four pots. He planted 15 pairs of plants together, one of each pair being cross-fertilised, and the other being self-fertilised.
data(darwin)
data(darwin)
A data frame with 30 observations on the following 4 variables.
a factor giving the pot
a factor giving the pair
a factor giving the type of fertilisation
height of plant in eighths of inches
Fisher, R. A. (1935) Design of Experiments. Edinburgh: Oliver and Boyd. Page 30.
The original book is reprinted as part of Fisher, R. A. (1990) Statistical Methods, Experimental Design, and Scientific Inference. Oxford University Press.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 2.
data(darwin) attach(darwin) plot(height~type) anova(lm(height~pot+pair+type,data=darwin))
data(darwin) attach(darwin) plot(height~type) anova(lm(height~pot+pair+type,data=darwin))
Performs Gibbs sampling for problem with two truncated exponential variables. See Practical 11.3 of Davison (2003) for details.
exp.gibbs(u1 = NULL, u2 = NULL, B, I = 100, S = 100)
exp.gibbs(u1 = NULL, u2 = NULL, B, I = 100, S = 100)
u1 |
Initial values for variable 1 |
u2 |
Initial values for variable 2 |
B |
Value at which exponential distribution is truncated |
I |
Number of iterations of sampler |
S |
Number of replicates of sampler |
This is provided simply so that readers spend less time typing. It is not intended to be robust and general code.
A 2 x S x I array containing the values of the variables for the successive iterations
Anthony Davison ([email protected]
)
Davison, A. C. (2003) Statistical Models. Cambridge University Press.Practical 11.3.
add.exp.lines <- function( exp.out, i, B=10) { dexp.trunc <- function( u, lambda, B ) dexp(u, rate=lambda)/(1-exp(-lambda*B)) S <- dim(exp.out)[2] I <- dim(exp.out)[3] u <- seq(0.0001,B,length=1000) fu <- rep(0,1000) for (s in 1:S) fu <- fu + dexp.trunc(u,exp.out[3-i,s,I],B)/S lines(u,fu,col="red") invisible() } par(mfrow=c(3,2)) B <-10; I <- 15; S <- 500 exp.out <- exp.gibbs(B=B,I=I,S=S) hist(exp.out[1,,I],prob=TRUE,nclass=15,xlab="u1",ylab="PDF",xlim=c(0,B),ylim=c(0,1)) add.exp.lines(exp.out,1) hist(exp.out[2,,I],prob=TRUE,nclass=15,xlab="u2",ylab="PDF",xlim=c(0,B),ylim=c(0,1)) add.exp.lines(exp.out,2)
add.exp.lines <- function( exp.out, i, B=10) { dexp.trunc <- function( u, lambda, B ) dexp(u, rate=lambda)/(1-exp(-lambda*B)) S <- dim(exp.out)[2] I <- dim(exp.out)[3] u <- seq(0.0001,B,length=1000) fu <- rep(0,1000) for (s in 1:S) fu <- fu + dexp.trunc(u,exp.out[3-i,s,I],B)/S lines(u,fu,col="red") invisible() } par(mfrow=c(3,2)) B <-10; I <- 15; S <- 500 exp.out <- exp.gibbs(B=B,I=I,S=S) hist(exp.out[1,,I],prob=TRUE,nclass=15,xlab="u1",ylab="PDF",xlim=c(0,B),ylim=c(0,1)) add.exp.lines(exp.out,1) hist(exp.out[2,,I],prob=TRUE,nclass=15,xlab="u2",ylab="PDF",xlim=c(0,B),ylim=c(0,1)) add.exp.lines(exp.out,2)
Joint distribution of visual impairment on both eyes, by race and age.
data(eyes)
data(eyes)
A data frame with 32 observations on the following 6 variables.
Impairment (+) or not (-) for left eye.
Impairment (+) or not (-) for right eye.
a factor with levels 40-50
51-60
61-70
70+
White (W) or black (B)
mid-point for age groups, as numeric vector
Number of individuals in each class
K.-Y. Liang, S. L. Zeger and B. Qaqish (1992) Multivariate regression analyses for categorical data (with Discussion). Journal of the Royal Statistical Society, series B, 54, 3–40.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 505.
data(eyes) eyes.glm <- glm(y~age*colour+L*R+(L+R):poly(a,2)+colour:(L+R),poisson,data=eyes) anova(eyes.glm,test="Chi") # analysis of deviance for loglinear model
data(eyes) eyes.glm <- glm(y~age*colour+L*R+(L+R):poly(a,2)+colour:(L+R),poisson,data=eyes) anova(eyes.glm,test="Chi") # analysis of deviance for loglinear model
Data from a 4x4 Latin square experiment on the efficiency of a field concrete mixer.
data(field.concrete)
data(field.concrete)
A data frame with 16 observations on the following 4 variables.
a numeric vector
a factor with levels 4, 8, 12, 16 mph.
order in which runs were performed each day
day on which runs were performed
Unpublished lecture notes, Imperial College, London.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 435.
data(field.concrete) fit <- lm(efficiency~run+day+speed,data=field.concrete) anova(fit) summary(fit) fit <- lm(efficiency~run+day+poly(as.numeric(speed),3),data=field.concrete) summary(fit)
data(field.concrete) fit <- lm(efficiency~run+day+speed,data=field.concrete) anova(fit) summary(fit) fit <- lm(efficiency~run+day+poly(as.numeric(speed),3),data=field.concrete) summary(fit)
The number of balsam-fir seedlings in each quadrant of a grid of 50 five foot square quadrants were counted. The grid consisted of 5 rows of 10 quadrants in each row.
data(fir)
data(fir)
A data frame with 50 observations on the following 3 variables.
The number of seedlings in the quadrant
The row number of the quadrant
The quadrant number within the row
Davison, A.C. and Hinkley, D.V. (1997) Bootstrap Methods and Their Application. Cambridge University Press.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 641.
James Forbes measured the atmospheric pressure and boiling point of water at 17 locations in the Alps.
data(forbes)
data(forbes)
A data frame with 17 observations on the following 2 variables.
Boiling point (Fahrenheit)
Pressure (inches of mercury)
Atkinson, A. C. (1985) Plots, Transformations, and Regression. Oxford University Press.
data(forbes) plot(forbes) fit <- lm(bp~pres,data=forbes) fit plot(forbes$pres,resid(fit)) # model OK? # try refitting with transformation fit <- lm(log(bp)~log(pres),data=forbes)
data(forbes) plot(forbes) fit <- lm(bp~pres,data=forbes) fit plot(forbes$pres,resid(fit)) # model OK? # try refitting with transformation fit <- lm(log(bp)~log(pres),data=forbes)
The 'frets' data frame has 25 rows and 4 columns.
The data consist of measurements of the length and breadth of the heads of pairs of adult brothers in 25 randomly sampled families. All measurements are expressed in millimetres.
data(frets)
data(frets)
This data frame contains the following columns:
The head length of the eldest son.
The head breadth of the eldest son.
The head length of the second son.
The head breadth of the second son.
Frets, G.P. (1921) Heredity of head form in man. Genetica, 3, 193.
Whittaker, J. (1990) Graphical Models in Applied Multivariate Statistics. John Wiley.
data(frets) ## maybe str(frets) ; plot(frets) ...
data(frets) ## maybe str(frets) ; plot(frets) ...
Daily returns ( index, 1991–1998.
data(ftse)
data(ftse)
A time series.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 266.
data(ftse) plot(ftse,type="l",xlab="Time",ylab="Percent return") plot(exp(cumsum(ftse/100)),type="l",xlab="Time",ylab="Relative closing value")
data(ftse) plot(ftse,type="l",xlab="Time",ylab="Percent return") plot(exp(cumsum(ftse/100)),type="l",xlab="Time",ylab="Relative closing value")
Velocities (km/second) of 82 galaxies in a survey of the Corona Borealis region.
data(galaxy)
data(galaxy)
The format is: num [1:82] 9.17 9.35 9.48 9.56 9.78 ...
Roeder, K. (1990) Density estimation with confidence sets exemplified by superclusters and voids in galaxies. Journal of the American Statistical Association, 85, 617–624.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 214.
data(galaxy) plot(density(galaxy)) rug(galaxy)
data(galaxy) plot(density(galaxy)) rug(galaxy)
Estimate prior value of a parameter from the data.
get.alpha(d)
get.alpha(d)
d |
A data frame with vector components |
Prior value of a parameter, estimated from the data.
Calculates jackknife deviance residuals, standardized deviance residuals, standardized Pearson residuals, approximate Cook statistic, leverage and estimated dispersion.
## S3 method for class 'diag' glm(glmfit)
## S3 method for class 'diag' glm(glmfit)
glmfit |
|
A list containing the following items:
res |
The vector of jackknife deviance residuals. |
rd |
The vector of standardized deviance residuals. |
rp |
The vector of standardized Pearson residuals. |
cook |
The vector of approximate Cook statistics. |
h |
The vector of leverages of the observations. |
sd |
The value used to standardize the residuals. This is the the estimate of residual standard deviation in the Gaussian family and is the square root of the estimated shape parameter in the Gamma family. In all other cases it is 1. |
See the helpfile for glm.diag.plots
for an example of the use of glm.diag
.
Anthony Davison <[email protected]>
Davison, A.C. and Snell, E.J. (1991) Residuals and diagnostics. In Statistical Theory and Modelling: In Honour of Sir David Cox. D.V. Hinkley, N. Reid and E.J. Snell (editors), 83-106. Chapman and Hall.
glm
,lm
,plot.glm.diag
,summary.glm
Annual numbers of cases of 'diarrhoea-associated haemolytic uraemic syndrome' treated in clinics in Birmingham and Newcastle from 1970–1989.
data(hus)
data(hus)
A data frame with 20 observations on the following 3 variables.
a numeric vector
Number of cases treated in Birmingham
Number of cases treated in Newcastle
Henderson, R. and Matthews, J. N. S. (1993) An investigation of changepoints in the annual number of cases of haemolytic uraemic syndrome. Applied Statistics, 42, 461–471.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 142.
data(hus) plot(hus$year,hus$birmingham,ylab="Annual number of cases",type="s")
data(hus) plot(hus$year,hus$birmingham,ylab="Annual number of cases",type="s")
This function implements a Gibbs sampler for the Poisson changepoint model applied to the HUS data used in Example 4.40 and Practical 11.6 of Davison (2003), which should be consulted for details.
hus.gibbs(init, y, R = 10, a1 = 1, a2 = 1, c = 0.01, d = 0.01)
hus.gibbs(init, y, R = 10, a1 = 1, a2 = 1, c = 0.01, d = 0.01)
init |
Initial values for parameters |
y |
A series of Poisson counts |
R |
Number of iterations of sampler |
a1 |
Value of a hyperparameter |
a2 |
Value of a hyperparameter |
c |
Value of a hyperparameter |
d |
Value of a hyperparameter |
This is provided simply so that readers spend less time typing. It is not intended to be robust and general code.
A matrix of size R x 7, whose first five columns contain the values of the parameters for the iterations. Columns 6 and 7 contain the log likelihood and log prior for that iteration.
Anthony Davison ([email protected]
)
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Practical 11.6.
## From Example 11.6: hus <- c(1,5,3,2,2,1,0,0,2,1,1,7,11,4,7,10,16,16,9,15) system.time( gibbs.out <- hus.gibbs(c(5, 5, 1, 1, 2), hus, R=1000)) plot.ts(gibbs.out[,1], main="lambda1") # time series plot for lam1 plot.ts(gibbs.out[,2], main="lambda1") # time series plot for lam2 plot.ts(gibbs.out[,6], main="log lik") # and of log likelihood table(gibbs.out[,5]) # tabulate observed values of tau rm(hus)
## From Example 11.6: hus <- c(1,5,3,2,2,1,0,0,2,1,1,7,11,4,7,10,16,16,9,15) system.time( gibbs.out <- hus.gibbs(c(5, 5, 1, 1, 2), hus, R=1000)) plot.ts(gibbs.out[,1], main="lambda1") # time series plot for lam1 plot.ts(gibbs.out[,2], main="lambda1") # time series plot for lam2 plot.ts(gibbs.out[,6], main="log lik") # and of log likelihood table(gibbs.out[,5]) # tabulate observed values of tau rm(hus)
Inverse Hessian matrix, useful for obtaining standard errors
ihess(f, x, ep = 1e-04, ...)
ihess(f, x, ep = 1e-04, ...)
f |
Usually a negative log likelihood |
x |
Usually maximum likelihood estimates for f |
ep |
Step length used to compute numerical second derivatives |
... |
Extra arguments for f, if any |
Matrix of dimension dim(x) times dim(x), containing inverse Hessian matrix of f at x.
This is not needed in R, where hessian matrices are obtained by setting hessian=T in calls to optimisation functions.
Anthony Davison
Based on code written by Stuart Coles of Padova University
# ML fit of t distribution nlogL <- function(x, data) # negative log likelihood { mu <- x[1] sig <- x[2] df <- x[3] -sum(log( dt((data-mu)/sig, df=df)/sig )) } y <- rt(n=100, df=10) # generate t data # this is Splus code.....so remove the #'s for it to work in R # fit <- nlminb(c(1,1,4), nlogL, upper=c(Inf,Inf,Inf), lower=c(-Inf,0,0), # data=y) # fit$parameters # maximum likelihood estimates # J <- ihess(nlogL, fit$parameters, data=y) # sqrt(diag(J)) # standard errors based on observed information # # In this example the standard error can be a bad measure of # uncertainty for the df.
# ML fit of t distribution nlogL <- function(x, data) # negative log likelihood { mu <- x[1] sig <- x[2] df <- x[3] -sum(log( dt((data-mu)/sig, df=df)/sig )) } y <- rt(n=100, df=10) # generate t data # this is Splus code.....so remove the #'s for it to work in R # fit <- nlminb(c(1,1,4), nlogL, upper=c(Inf,Inf,Inf), lower=c(-Inf,0,0), # data=y) # fit$parameters # maximum likelihood estimates # J <- ihess(nlogL, fit$parameters, data=y) # sqrt(diag(J)) # standard errors based on observed information # # In this example the standard error can be a bad measure of # uncertainty for the df.
Sequence of 1572 bases from first human preproglucagon gene
data(intron)
data(intron)
This is a factor with 4 levels, "A","C","G","T"
Avery, P. J. and Henderson, D. A. (1999) Fitting Markov chain models to discrete state series such as DNA sequences. Applied Statistics, 48, 53–61.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 226.
data(intron) ## maybe str(intron) ; plot(intron) ...
data(intron) ## maybe str(intron) ; plot(intron) ...
Response of a rufous-tailed jacamar to butterflies, by not attacking them, by attacking but not eating them, and by attacking and eating them.
data(jacamar)
data(jacamar)
A data frame with 48 observations on the following 5 variables.
Butterfly species: Aphrissa boisduvalli (Ab), Phoebis argante (Pa), Dryas iulia (Di), Pierella luna (Pl), Consul fabius (Cf), Siproeta stelenes (Ss)
colour butterfly wings were painted: Unpainted, Brown, Yellow, Blue, Green, Red, Orange, Black
Number not attacked
Number attacked but rejected
Number eaten
As part of a study of the learning ability of tropical birds, Peng Chai of the University of Texas at Austin collected data on the response of a rufous-tailed jacamar to butterflies. He used marker pens to paint the underside of the wings of eight species of butterflies, and then released each butterfly in the cage where the bird was confined. The bird responded in three ways: by not attacking the butterfly (N); by attacking the butterfly, then sampling but rejecting it (S); or by attacking and eating the butterfly, usually after removing some or all of the wings (E).
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 470.
data(jacamar)
data(jacamar)
These are times in days between sucessive failures of a piece of software developed as part of a large data system. The software was released after the first 31 failures. The last three failures occurred after release.
data(jelinski)
data(jelinski)
The format is: num [1:34] 9 12 11 4 7 2 5 8 5 7 ...
Jelinski, Z. and Moranda, P. B. (1972) Software reliability research. In W. Freiberger (ed), Statistical Computer Performance Evaluation. London: Academic Press. Pages 465–484.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 299.
A data frame of data from 33 leukaemia patients.
data(leuk)
data(leuk)
A data frame with 33 observations on the following 3 variables.
white blood cell count
a test result, '"present"' or '"absent"'
survival time in weeks
Survival times are given for 33 patients who died from acute myelogenous leukaemia. Also measured was the patient's white blood cell count at the time of diagnosis. The patients were also factored into 2 groups according to the presence or absence of a morphologic characteristic of white blood cells. Patients termed AG positive were identified by the presence of Auer rods and/or significant granulation of the leukaemic cells in the bone marrow at the time of diagnosis.
Feigl, P. and Zelen, M. (1965) Estimation of exponential survival probabilities with concomitant information. Biometrics, 21, 826–838.
Cox, D. R. and Oakes, D. (1984) Analysis of Survival Data. Chapman & Hall, p. 9.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 542.
data(leuk) library(survival) plot(survfit(Surv(time) ~ ag, data = leuk), lty = 2:3, col = 2:3) # fit of exponential model summary(glm(time~ag+log10(wbc),data=leuk,family=Gamma(log)),dispersion=1) # now Cox models leuk.cox <- coxph(Surv(time) ~ ag + log(wbc), leuk) summary(leuk.cox)
data(leuk) library(survival) plot(survfit(Surv(time) ~ ag, data = leuk), lty = 2:3, col = 2:3) # fit of exponential model summary(glm(time~ag+log10(wbc),data=leuk,family=Gamma(log)),dispersion=1) # now Cox models leuk.cox <- coxph(Surv(time) ~ ag + log(wbc), leuk) summary(leuk.cox)
A simple function for computing confidence intervals from the values of a likelihood function for a scalar parameter. It prints the maximum likelihood estimate (MLE) and its standard error, and confidence intervals based on normal approximation to the distribution of the MLE and on the chi-squared approximation to the distribution of the likelihood ratio statistic.
lik.ci(psi, logL, conf = c(0.975, 0.025))
lik.ci(psi, logL, conf = c(0.975, 0.025))
psi |
Vector containing parameter values, the range of which contains the MLE |
logL |
Vector containing corresponding log likelihood values |
conf |
Vector containing levels for which confidence interval limits needed |
See above
This uses the spline functions in library(modreg).
Anthony Davison ([email protected])
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Sections 4.4.2, 4.5.1.
# likelihood analysis for mean of truncated Poisson data y <- c(1:6) n <- c(1486,694,195,37,10,1) logL <- function(x, y, n.obs) # x is theta { f <- dpois(y,x)/(1-dpois(0,x)) # dpois is Poisson PDF sum(n*log(f)) } # log likelihood theta <- seq(from=0.8, to=1, length=200) L <- rep(NA, 200) for (i in 1:200) L[i] <- logL(theta[i], y, n) plot(theta, L, type="l", ylab="Log likelihood") lik.ci(theta, L)
# likelihood analysis for mean of truncated Poisson data y <- c(1:6) n <- c(1486,694,195,37,10,1) logL <- function(x, y, n.obs) # x is theta { f <- dpois(y,x)/(1-dpois(0,x)) # dpois is Poisson PDF sum(n*log(f)) } # log likelihood theta <- seq(from=0.8, to=1, length=200) L <- rep(NA, 200) for (i in 1:200) L[i] <- logL(theta[i], y, n) plot(theta, L, type="l", ylab="Log likelihood") lik.ci(theta, L)
The data are numbers of traffic accidents with personal injuries, reported to the police, on Swedish roads on 92 days in 1961 and 92 matching days in 1962. On some of these days a general speed limit of 90 or 100 km/hour was imposed.
data(limits)
data(limits)
A data frame with 92 observations on the following 5 variables.
A factor indicating the day, coded 1
-92
.
1 indicates a limit imposed in 1961, 0 not.
1 indicates a limit imposed in 1962, 0 not.
Number of accidents on this day in 1961.
Number of accidents on this day in 1962.
Svensson, A. (1981) On a goodness-of-fit test for multiplicative Poisson models. Annals of Statistics, 9, 697–704.
data(limits) ## maybe str(limits) ; plot(limits) ...
data(limits) ## maybe str(limits) ; plot(limits) ...
These are data on the structural habitat of two species of lizards in Whitehouse, Jamaica. They comprise observed counts for perch height, perch diameter, insolation, and time of day, for both species. The data can be represented as a 2 x 2 x 2 x 3 x 2 contingency table.
data(lizards)
data(lizards)
A data frame with 48 observations on the following 6 variables.
high
indicates perch at height
5 or more feet, low
indicates perch below 5 feet.
large
indicates perch diameter 2 inches or more,
small
indicates perch diameter less than 2 inches.
Is the perch in a shady
or a sunny
location?
Time of day when lizard observed: early
,
late
or midday
.
Species of lizard: grahami
or opalinus
.
Number of lizards seen.
Bishop, Y. M. M., Fienberg, S. E. and Holland, P. W. (1975) Discrete Multivariate Analysis. Cambridge, Mass.: MIT Press. Page 164.
data(lizards) ## maybe str(lizards) ; plot(lizards) ...
data(lizards) ## maybe str(lizards) ; plot(lizards) ...
The data give the number of deaths due to lung cancer in British male physicians, as a function of years of smoking and cigarette consumption.
data(lung.cancer)
data(lung.cancer)
A data frame with 63 observations on the following 4 variables.
a factor giving the number of years smoking
a factor giving cigarette consumption
man-years at risk
number of deaths
Frome, E. L. (1983) The analysis of rates using Poisson regression models. Biometrics, 39, 665–674.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 8.
data(lung.cancer)
data(lung.cancer)
Data from 11 clinical trials to compare magnesium treatment for heart attacks with control.
data(magnesium)
data(magnesium)
A data frame with 22 observations on the following 4 variables.
a factor with levels 1
–11
Treatment indicator (factor)
Total patients in group
Number of deaths in group
Copas, J. B. (1999) What works?: Selectivity models and meta-analysis. Journal of the Royal Statistical Society series A, 162, 96–109.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 208.
data(magnesium) fit <- glm(cbind(r,m-r)~trial+group,binomial,data=magnesium[1:20,]) anova(fit,test="Chi") summary(fit)
data(magnesium) fit <- glm(cbind(r,m-r)~trial+group,binomial,data=magnesium[1:20,]) anova(fit,test="Chi") summary(fit)
The 'manaus' time series is of class '"ts"' and has 1080 observations on one variable, which is the monthly average of the daily stages (heights) of the Rio Negro at Manaus, from January 1903 until December 1992. The units are metres.
data(manaus)
data(manaus)
A time series
The data values are monthly averages of the daily stages (heights) of the Rio Negro at Manaus. Manaus is 18km upstream from the confluence of the Rio Negro with the Amazon but because of the tiny slope of the water surface and the lower courses of its flatland affluents, they may be regarded as a good approximation of the water level in the Amazon at the confluence. The data here cover 90 years from January 1903 until December 1992.
The Manaus gauge is tied in with an arbitrary bench mark of 100m set in the steps of the Municipal Prefecture; gauge readings are usually referred to sea level, on the basis of a mark on the steps leading to the Parish Church (Matriz), which is assumed to lie at an altitude of 35.874 m according to observations made many years ago under the direction of Samuel Pereira, an engineer in charge of the Manaus Sanitation Committee Whereas such an altitude cannot, by any means, be considered to be a precise datum point, observations have been provisionally referred to it. The measurements are in metres.
The data were kindly made available by Professors H. O'Reilly Sternberg and D. R. Brillinger of the University of California at Berkeley.
Davison, A.C. and Hinkley, D.V. (1997) Bootstrap Methods and Their Application. Cambridge University Press.
Sternberg, H. O'R. (1987) Aggravation of floods in the Amazon river as a consequence of deforestation? Geografiska Annaler, 69A, 201-219.
Sternberg, H. O'R. (1995) Waters and wetlands of Brazilian Amazonia: An uncertain future. In The Fragile Tropics of Latin America: Sustainable Management of Changing Environments, Nishizawa, T. and Uitto, J.I. (editors), United Nations University Press, 113-179.
data(manaus) plot(manaus) acf(manaus) pacf(manaus)
data(manaus) plot(manaus) acf(manaus) pacf(manaus)
The data are from an experiment to compare how different markers assess examination scripts, some of which were original and others of which were photocopies.
data(marking)
data(marking)
A data frame with 32 observations on the following 5 variables.
Two exams were marked
Scripts from 8 persons were marked, coded 1
-8
.
Coded 1
-4
Is the script an original (1) or a photocopy (0)?
The mark out of 80 attributed by the marker.
Normally each marker had a different batch of scripts, but for the experiment one script was taken at random from each batch and replaced after three copies of it had been made. The three copies were sent to the other three markers who assessed them, while the original was replaced and assessed in the usual way. Each of the four copies was therefore assessed by a single marker, but the three markers who had a copy knew that the script was part of the experiment, while the person marking the original did not know it to be part of the experiment. The experiment was repeated at another examination, with the same examiners, but different scripts.
Lindley, D. V. (1961) An experiment in the marking of an examination (with Discussion). Journal of the Royal Statistical Society, series A, 124, 285–313.
data(marking) ## maybe str(marking) ; plot(marking) ...
data(marking) ## maybe str(marking) ; plot(marking) ...
Marks out of 100 for 88 students taking examinations in mechanics (C), vectors (C), algebra (O), analysis (O), statistics (O), where C indicates closed and O indicates open book examination.
data(mathmarks)
data(mathmarks)
A data frame with 88 observations on the following 5 variables.
mark out of 100 for mechanics
mark out of 100 for vectors
mark out of 100 for algebra
mark out of 100 for analysis
mark out of 100 for statistics
Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979) Multivariate Analysis. London: Academic Press. Pages 3–4.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 256.
data(mathmarks) pairs(mathmarks) var(mathmarks)
data(mathmarks) pairs(mathmarks) var(mathmarks)
Computes maximum likelihood estimates of transition probabilities for stationary Markov chain models, of order 0 (independence) to 3.
This is intended for use with Practical 6.1 of Davison (2003), not as production code.
MClik(d)
MClik(d)
d |
A sequence containing successive states of the chain |
order |
order of fitted chain |
df |
degrees of freedom using in fitting |
L |
maximum log likelihood for each order |
AIC |
Akaike information criterion for each order |
one |
one-way marginal table of counts |
two |
two-way margin table of transitions |
three |
three-way marginal table of transitions |
four |
four-way marginal table of transitions |
A. C. Davison ([email protected])
Avery, P. J. and Henderson, D. A. (1999) Fitting Markov chain models to discrete state series such as DNA sequences. Applied Statistics, 48, 53–61.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Section 6.1.
data(intron) fit <- MClik(intron)
data(intron) fit <- MClik(intron)
RFM male mice were exposed to 300 rads of x-radiation at 5–6 weeks of age. The causes of death were thymic lymphoma, reticulum cell sarcoma, and other. Some of the mice were kept in a conventional environment, and the others in a germ-free environment.
data(mice)
data(mice)
A data frame with 177 observations on the following 4 variables.
Environment type (factor)
Cause of death
Censoring indicator, with 1 indicating death
Age at death (weeks)
Hoel, D. G. and Walburg, H. E. (1972) Statistical analysis of survival experiments. Journal of the National Cancer Institute, 49, 361–372.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 200.
data(mice) library(survival) fit <- survfit(Surv(y,status)~cause,data=mice[1:95,]) # first group plot(fit,lty=c(3,2,1))
data(mice) library(survival) fit <- survfit(Surv(y,status)~cause,data=mice[1:95,]) # first group plot(fit,lty=c(3,2,1))
Data from an experiment conducted to determine the optimal planting distance between plants in rows of millet. The rows were 1 foot apart. The design was a 5 x 5 Latin square.
data(millet)
data(millet)
A data frame with 25 observations on the following 4 variables.
Row label, coded 1
-5
.
Column label, coded 1
-5
.
distances between plants:2
, 4
, 6
, 8
, or 10
inches.
Average yield (grams) of three central rows, 15 feet long after allowing for discards, from each plot.
Unpublished lecture notes, Imperial College, London.
data(millet) ## maybe str(millet) ; plot(millet) ...
data(millet) ## maybe str(millet) ; plot(millet) ...
Times to failure of motorettes tested at different temperatures.
data(motorette)
data(motorette)
A data frame with 40 observations on the following 3 variables.
Temperature in degrees Fahrenheit
Censoring indicator
Failure time in hours
Nelson, W. D. and Hahn, G. J. (1972) Linear estimation of a regression relationship from censored data. Part 1 — simple methods and their application (with Discussion). Technometrics, 14, 247–276.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 615.
data(motorette) library(survival) motor.fit <- survreg(Surv(y,cens)~log(x),dist="weibull",data=motorette) summary(motor.fit)
data(motorette) library(survival) motor.fit <- survreg(Surv(y,cens)~log(x),dist="weibull",data=motorette) summary(motor.fit)
Numbers of nematodes invading individual fly larvae for various numbers of initial challengers.
data(nematode)
data(nematode)
A data frame with 29 observations on the following 3 variables.
Number of challengers
Number of invading nematodes
Number of fly larvae
Faddy, M. J. and Fenlon, J. S. (1999) Stochastic modelling of the invasion process of nematodes in fly larvae. Applied Statistics, 48, 31–37.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 295.
data(nematode)
data(nematode)
The ‘nodal’ data frame has 53 rows and 7 columns.
The treatment strategy for a patient diagnosed with cancer of the prostate depend highly on whether the cancer has spread to the surrounding lymph nodes. It is common to operate on the patient to get samples from the nodes which can then be analysed under a microscope but clearly it would be preferable if an accurate assessment of nodal involvement could be made without surgery.
For a sample of 53 prostate cancer patients, a number of possible predictor variables were measured before surgery. The patients then had surgery to determine nodal involvement. It was required to see if nodal involvement could be accurately predicted from the predictor variables and which ones were most important.
data(nodal)
data(nodal)
A data frame with 53 observations on the following 7 variables.
A column of ones.
An indicator of nodal involvement.
The patients age dichotomized into less than 60 (‘0’) and 60 or over ‘1’.
A measurement of the size and position of the tumour observed by palpatation with the fingers via the rectum. A value of ‘1’ indicates a more serious case of the cancer.
Another indicator of the seriousness of the cancer, this one is determined by a pathology reading of a biopsy taken by needle before surgery. A value of ‘1’ indicates a more serious case of the cancer.
A third measure of the seriousness of the cancer taken from an X-ray reading. A value of ‘1’ indicates a more serious case of the cancer.
The level of acid phosphatase in the blood serum.
Brown, B.W. (1980) Prediction analysis for binary data. In Biostatistics Casebook. R.G. Miller, B. Efron, B.W. Brown and L.E. Moses (editors), 3-18. John Wiley.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 491.
data(nodal) nodal.glm <- glm(r~aged+stage+grade+xray+acid,binomial,data=nodal) summary(nodal.glm,correlation=FALSE)
data(nodal) nodal.glm <- glm(r~aged+stage+grade+xray+acid,binomial,data=nodal) summary(nodal.glm,correlation=FALSE)
The data relate to the construction of 32 light water reactor (LWR) plants constructed in the U.S.A in the late 1960's and early 1970's. The data was collected with the aim of predicting the cost of construction of further LWR plants. 6 of the power plants had partial turnkey guarantees and it is possible that, for these plants, some manufacturers' subsidies may be hidden in the quoted capital costs.
data(nuclear)
data(nuclear)
A data frame with 32 observations on the following 11 variables.
The capital cost of construction in millions of dollars adjusted to 1976 base.
The date on which the construction permit was issued. The data are measured in years since January 1 1990 to the nearest month.
The time between application for and issue of the construction permit.
The time between issue of operating license and construction permit.
The net capacity of the power plant (MWe).
A binary variable where ‘1’ indicates the prior existence of a LWR plant at the same site.
A binary variable where ‘1’ indicates that the plant was constructed in the north-east region of the U.S.A.
A binary variable where ‘1’ indicates the use of a cooling tower in the plant.
A binary variable where ‘1’ indicates that the nuclear steam supply system was manufactured by Babcock-Wilcox.
The cumulative number of power plants constructed by each architect-engineer.
A binary variable where ‘1’ indicates those plants with partial turnkey guarantees.
Cox, D.R. and Snell, E.J. (1981) Applied Statistics: Principles and Examples. Chapman and Hall.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 401.
data(nuclear) pairs(nuclear) fit <- lm(log(cost)~date+t1+t2+cap+pr+ne+ct+bw+cum.n+pr,data=nuclear) step(fit) # stepwise model selection
data(nuclear) pairs(nuclear) fit <- lm(log(cost)~date+t1+t2+cap+pr+ne+ct+bw+cum.n+pr,data=nuclear) step(fit) # stepwise model selection
Historical estimates of the force of mortality (hazard function) averaged for 5-year age groups. The data are taken from various historical sources.
data(old.age)
data(old.age)
A data frame with 14 observations on the following 8 variables.
Age group (5-year intervals)
Data estimated from Hungarian graveyards, 900–1100
Data estimated from England, 1640–1689
Data estimated from Breslau, 1687–1691
Data from England and Wales, males, 1841
Data from England and Wales, females, 1841
Data from England and Wales, males, 1980–1982
Data from England and Wales, females, 1980–1982
The estimated numbers of people on which the data in the columns are based are 2300, 3133, 2675, 71,000, 74,000, 834,0000, and 828,000.
Thatcher, A. R. (1999) The long-term pattern of adult mortality and the highest attained age (with Discussion). Journal of the Royal Statistical Society series A, 16, 5–43.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 194.
data(old.age)
data(old.age)
Plots a scatterplot matrix in which panels below the diagonal show ordinary scatterplots of pairs of variables, and those above the diagonal show scatterplots of residuals after regression on all the other variables.
## S3 method for class 'mod' pairs(x, format = "MC", labelnames = names(x), highlight = NULL, level = 0.9, ...)
## S3 method for class 'mod' pairs(x, format = "MC", labelnames = names(x), highlight = NULL, level = 0.9, ...)
x |
A matrix whose rows correspond to units and whose columns correspond to variables measured on those units. |
format |
'MM' for marginal (that is, standard) scatterplots above and below the diagonal, 'MC' for marginal below and conditional (= partial) above, etc. 'MC' by default. |
labelnames |
Names of the variables. |
highlight |
Indexes of observations (rows) to be highlighted. |
level |
Scalar giving the level for the contour, 0.9 by default. |
... |
The plotting symbol and other arguments for the points can be controlled by ‘pch=’, etc. |
The diagonal shows histograms of the original data, and (in black) histograms of the partial residuals after adjustment on all the other variables, shifted to have the same mean as the original data. Also given are the original
The below-diagonal panels contain the numerical value of the correlation, and those above the diagonal contain the partial correlation, that is, the correlation of the residuals after linear regression on the remaining variables. The panels show ellipses which would contain 90 percent of the observations in a large normal sample with the same mean and covariance matrix as the data.
Produces the scatterplot matrix, and prints the marginal and partial standard deviations of the variables.
pairs.mod calls library(ellipse) and will give an error if this is unavailable.
Sylvain Sardy ([email protected])
Davison, A. C. and Sardy, S. (2000) The partial scatterplot matrix. Journal of Computational and Graphical Statistics, 9, 750–758.
library(ellipse) data(mathmarks) pairs.mod(mathmarks)
library(ellipse) data(mathmarks) pairs.mod(mathmarks)
The 'paulsen' data frame has 346 rows and 1 columns.
Sections were prepared from the brain of adult guinea pigs. Spontaneous currents that flowed into individual brain cells were then recorded and the peak amplitude of each current measured. The aim of the experiment was to see if the current flow was quantal in nature (i.e. that it is not a single burst but instead is built up of many smaller bursts of current). If the current was indeed quantal then it would be expected that the distribution of the current amplitude would be multimodal with modes at regular intervals. The modes would be expected to decrease in magnitude for higher current amplitudes.
data(paulsen)
data(paulsen)
This data frame contains the following column:
The current flowing into individual brain cells. The currents are measured in pico-amperes.
The data were kindly made available by Dr. O. Paulsen of the Department of Pharmacology, University of Oxford.
Paulsen, O. and Heggelund, P. (1994) The quantal size at retinogeniculate synapses determined from spontaneous and evoked EPSCs in guinea-pig thalamic slices. Journal of Physiology, 480, 505-511.
data(paulsen) hist(paulsen$y,prob=TRUE)
data(paulsen) hist(paulsen$y,prob=TRUE)
Followup of 312 randomised and 108 unrandomised patients with primary biliary cirrhosis, a rare autoimmune liver disease, at Mayo Clinic.
data(pbc)
data(pbc)
A data frame with 418 observations on the following 20 variables.
in years
serum albumin
alkaline phosphotase
presence of ascites
serum bilirubin
serum cholesterol
presence of edema
0 no edema, 0.5 untreated or successfully treated 1 unsuccessfully treated edema
enlarged liver
survival time
platelet count
standardised blood clotting time
1=male
liver enzyme (now called AST)
blood vessel malformations in the skin
histologic stage of disease (needs biopsy)
censoring status
1/2/-9 for control, treatment, not randomised
triglycerides
urine copper
Fleming, T. R. and Harrington, D. P. (1991) Counting Processes and Survival Analysis. Wiley: New York.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 549.
data(pbc) # to make version of dataset used in book pbcm <- pbc[(pbc$trt!=-9),] pbcm$copper[(pbcm$copper==-9)] <- median(pbcm$copper[(pbcm$copper!=-9)]) pbcm$platelet[(pbcm$platelet==-9)] <- median(pbcm$platelet[(pbcm$platelet!=-9)]) attach(pbcm) library(survival) par(mfrow=c(1,2),pty="s") plot(survfit(Surv(time,status)~trt),ylim=c(0,1),lty=c(1,2), ylab="Survival probability",xlab="Time (days)") plot(survfit(coxph(Surv(time,status)~trt+strata(sex))),ylim=c(0,1),lty=c(1,2), ylab="Survival probability",xlab="Time (days)") lines(survfit(coxph(Surv(time,status)~trt)),lwd=2) # proportional hazards model fit fit <- coxph(formula = Surv(time, status) ~ age + alb + alkphos + ascites + bili + edtrt + hepmeg + platelet + protime + sex + spiders, data=pbcm) summary(fit) step.fit <- step(fit,direction="backward")
data(pbc) # to make version of dataset used in book pbcm <- pbc[(pbc$trt!=-9),] pbcm$copper[(pbcm$copper==-9)] <- median(pbcm$copper[(pbcm$copper!=-9)]) pbcm$platelet[(pbcm$platelet==-9)] <- median(pbcm$platelet[(pbcm$platelet!=-9)]) attach(pbcm) library(survival) par(mfrow=c(1,2),pty="s") plot(survfit(Surv(time,status)~trt),ylim=c(0,1),lty=c(1,2), ylab="Survival probability",xlab="Time (days)") plot(survfit(coxph(Surv(time,status)~trt+strata(sex))),ylim=c(0,1),lty=c(1,2), ylab="Survival probability",xlab="Time (days)") lines(survfit(coxph(Surv(time,status)~trt)),lwd=2) # proportional hazards model fit fit <- coxph(formula = Surv(time, status) ~ age + alb + alkphos + ascites + bili + edtrt + hepmeg + platelet + protime + sex + spiders, data=pbcm) summary(fit) step.fit <- step(fit,direction="backward")
Bearings (degrees) of 29 homing pigeons 30, 60, 90 after their release, and on vanishing from sight.
data(pigeon)
data(pigeon)
A data frame with 29 observations on the following 4 variables.
Bearing after 30 seconds
Bearing after 60 seconds
Bearing after 90 seconds
Bearing on vanishing from sight
Artes, R. (1997) Extensoes da Teoria das Equacoes de Estimacao Generalizadas a Dados Circulares e Modelos de Dispersao. Ph.D. thesis, University of Sao Paulo.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 173.
data(pigeon) plt <- function( ang, r=c(1,2,3,4), lty=1,... ) { si <- sin(2*pi*ang/360) co <- cos(2*pi*ang/360) points( r*si,r*co ) lines( c(0,r*si),c(0,r*co),...) } par(pty="s") plot(c(0,0),c(0,0),xlim=c(-4,4),ylim=c(-4,4), xlab="Easting",ylab="Northing") for (i in 1:nrow(pigeon)) plt( pigeon[i,],col=i )
data(pigeon) plt <- function( ang, r=c(1,2,3,4), lty=1,... ) { si <- sin(2*pi*ang/360) co <- cos(2*pi*ang/360) points( r*si,r*co ) lines( c(0,r*si),c(0,r*co),...) } par(pty="s") plot(c(0,0),c(0,0),xlim=c(-4,4),ylim=c(-4,4), xlab="Easting",ylab="Northing") for (i in 1:nrow(pigeon)) plt( pigeon[i,],col=i )
Data on weight gains in 32 pigs, divided into eight groups of four, and with 4 different diets allocated to the group members.
data(pigs)
data(pigs)
A data frame with 32 observations on the following 3 variables.
a factor with 8 levels
a factor with levels I
–IV
weight gain (units unknown)
Unpublished lecture notes, Imperial College, London
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 431.
data(pigs) anova(lm(gain~group+diet,data=pigs))
data(pigs) anova(lm(gain~group+diet,data=pigs))
Numbers of red blood cells counted by five doctors using ten sets of apparatus.
data(pipette)
data(pipette)
A data frame with 50 observations on the following 3 variables.
Factor with ten levels
Factor with five levels
Number of red blood cells
Unpublished lecture notes, Imperial College, London.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 462.
Makes plot of jackknife deviance residuals against linear predictor, normal scores plots of standardized deviance residuals, plot of approximate Cook statistics against leverage/(1-leverage), and case plot of Cook statistic.
## S3 method for class 'glm.diag' plot(x, glmdiag = glm.diag(x), subset = NULL, iden = FALSE, labels = NULL, ret = FALSE, ...)
## S3 method for class 'glm.diag' plot(x, glmdiag = glm.diag(x), subset = NULL, iden = FALSE, labels = NULL, ret = FALSE, ...)
x |
|
glmdiag |
Diagnostics of |
subset |
Subset of |
iden |
A logical argument. If |
labels |
A vector of labels for use with |
ret |
A logical argument indicating if |
... |
Other arguments, which are ignored. This is included only for compatibility with S3 methods. |
The plot on the top left is a plot of the jackknife deviance residuals against the fitted values.
The plot on the top right is a normal QQ plot of the standardized deviance residuals. The dotted line is the expected line if the standardized residuals are normally distributed, i.e. it is the line with intercept 0 and slope 1.
The bottom two panels are plots of the Cook statistics. On the left is a plot of the Cook statistics against the standardized leverages. In general there will be two dotted lines on this plot. The horizontal line is at 8/(n-2p) where n is the number of observations and p is the number of parameters estimated. Points above this line may be points with high influence on the model. The vertical line is at 2p/(n-2p) and points to the right of this line have high leverage compared to the variance of the raw residual at that point. If all points are below the horizontal line or to the left of the vertical line then the line is not shown.
The final plot again shows the Cook statistic this time plotted against case number enabling us to find which observations are influential.
Use of iden=T
is encouraged for proper exploration of these four plots as a guide to how well the model fits the data and whether certain observations have an unduly large effect on parameter estimates.
If ret is TRUE
then the value of glmdiag
is returned otherwise there is no returned value.
Angelo Canty
Davison, A.C. and Snell, E.J. (1991) Residuals and diagnostics. In Statistical Theory and Modelling: In Honour of Sir David Cox. D.V. Hinkley, N. Reid, and E.J. Snell (editors), 83-106. Chapman and Hall.
# leukaemia data require(MASS) data(leuk, package="MASS") leuk.mod <- glm(time~ag-1+log10(wbc),family=Gamma(log),data=leuk) leuk.diag <- glm.diag(leuk.mod) plot.glm.diag(leuk.mod,leuk.diag)
# leukaemia data require(MASS) data(leuk, package="MASS") leuk.mod <- glm(time~ag-1+log10(wbc),family=Gamma(log),data=leuk) leuk.diag <- glm.diag(leuk.mod) plot.glm.diag(leuk.mod,leuk.diag)
This gives the degree of pneumoconiosis (normal, present, or severe) in a group of coalminers as a function of the number of years worked at the coalface. The degree of the disease was assessed radiologically and is qualitative.
data(pneu)
data(pneu)
A data frame with 8 observations on the following 4 variables.
Period of exposure (years worked at the coalface)
Number of miners with normal lungs
Number of miners with disease present
Number of miners with severe disease
Ashford, J. R. (1959) An approach to the analysis of data for semi-quantal responses in biological assay. Biometrics, 15, 573–581.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 509.
data(pneu) summary(glm(cbind(Present+Severe,Normal)~log(Years),data=pneu,binomial)) summary(glm(cbind(Severe,Normal+Present)~log(Years),data=pneu,binomial))
data(pneu) summary(glm(cbind(Present+Severe,Normal)~log(Years),data=pneu,binomial)) summary(glm(cbind(Severe,Normal+Present)~log(Years),data=pneu,binomial))
This function computes the Laplace approximation to the posterior density of the parameter beta in a Poisson regression model. For more details see Practical 11.2 of Davison (2003).
poi.beta.laplace(data, alpha = get.alpha(data), phi = 1, nu = 0.1, beta = seq(from = 0, to = 7, length = 1000))
poi.beta.laplace(data, alpha = get.alpha(data), phi = 1, nu = 0.1, beta = seq(from = 0, to = 7, length = 1000))
data |
A data frame with vector components |
alpha |
Prior value of a parameter, estimated from the data by default. |
phi |
Prior value of a parameter. |
nu |
Prior value of a parameter. |
beta |
Values for which posterior density of beta should be provided. |
This is provided simply so that readers spend less time typing. It is not intended to be robust and general code.
int |
Estimated integral of posterior density. |
conv |
Did the routine for the Laplace optimization converge? |
x |
Values of beta |
y |
Values of posterior density |
Anthony Davison ([email protected]
)
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Practical 11.2.
## From Practical 11.2: get.alpha <- function(d) { # estimate alpha from data rho <- d$y/d$x n <- length(d$y) mean(rho)^2/( (n-1)*var(rho)/n - mean(rho)*mean(1/d$x) ) } data(cloth) attach(cloth) plot(x,y) beta <- seq(from=0,to=10,length=1000) beta.post <- poi.beta.laplace(cloth,beta=beta,nu=1) plot(beta.post,type="l",xlab="beta",ylab="Posterior density") beta.post <- poi.beta.laplace(cloth,beta=beta,nu=5) lines(beta.post,lty=2)
## From Practical 11.2: get.alpha <- function(d) { # estimate alpha from data rho <- d$y/d$x n <- length(d$y) mean(rho)^2/( (n-1)*var(rho)/n - mean(rho)*mean(1/d$x) ) } data(cloth) attach(cloth) plot(x,y) beta <- seq(from=0,to=10,length=1000) beta.post <- poi.beta.laplace(cloth,beta=beta,nu=1) plot(beta.post,type="l",xlab="beta",ylab="Posterior density") beta.post <- poi.beta.laplace(cloth,beta=beta,nu=5) lines(beta.post,lty=2)
This function implements Gibbs sampling for the hierarchical Poisson model described in Example 11.19 and Practical 11.5 of Davison (2003), which should be consulted for more details.
poi.gibbs(d, alpha, gamma, delta, I, S)
poi.gibbs(d, alpha, gamma, delta, I, S)
d |
A data frame with vector components |
alpha |
A hyperparameter of the prior density |
gamma |
A hyperparameter of the prior density |
delta |
A hyperparameter of the prior density |
I |
Number of iterations for which sampler is run |
S |
Number of independent replicates of sampler |
This is provided simply so that readers spend less time typing. It is not intended to be robust and general code.
An I x S x (n+1) array containing the successive iterations of the samplers,
for the I iterations, S independent replicates, and n
rate parameters plus the parameter
beta
of the prior distribution.
Anthony Davison ([email protected]
)
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Practical 11.5.
## From Practical 11.5: data(pumps) system.time( pumps.sim <- poi.gibbs(pumps, alpha=1.8, delta=0.1, gamma=1, I=1000, S=5) ) par(mfrow=c(2,3)) plot.ts(pumps.sim[,1,1]) acf(pumps.sim[,1,1]) pacf(pumps.sim[,1,1]) plot.ts(pumps.sim[,1,11]) acf(pumps.sim[,1,11]) pacf(pumps.sim[,1,11])
## From Practical 11.5: data(pumps) system.time( pumps.sim <- poi.gibbs(pumps, alpha=1.8, delta=0.1, gamma=1, I=1000, S=5) ) par(mfrow=c(2,3)) plot.ts(pumps.sim[,1,1]) acf(pumps.sim[,1,1]) pacf(pumps.sim[,1,1]) plot.ts(pumps.sim[,1,11]) acf(pumps.sim[,1,11]) pacf(pumps.sim[,1,11])
In an experiment to assess the usefulness of treatments for poisons, 48 animals were split randomly into 12 groups of 4. Each group was administered one of three poisons, and one of four treatments, giving a 3x4 factorial design with 4 replicates.
data(poisons)
data(poisons)
A data frame with 48 observations on the following 3 variables.
Survival time (units of 10 hours)
Factor giving poison
Factor giving treatment
Box, G. E. P. and Cox, D. R. (1964) An analysis of transformations (with Discussion). Journal of the Royal Statistical Society series B, 26, 211–246.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 391.
data(poisons) fit <- lm(time~poison+treat,data=poisons) library(MASS) boxcox(time~poison+treat,data=poisons)
data(poisons) fit <- lm(time~poison+treat,data=poisons) library(MASS) boxcox(time~poison+treat,data=poisons)
Data on the relation between weather, socioeconomic, and air pollution variables and mortality rates in 60 Standard Metropolitan Statistical Areas (SMSAs) of the USA, for the years 1959-1961. Some of the variables are highly collinear.
data(pollution)
data(pollution)
A data frame with 60 observations on the following variables.
Average annual precipitation in inches
Average January temperature in degrees F
Average July temperature in degrees F
Percentage of 1960 SMSA population aged 65 or older
Average household size
Median school years completed by those over 22
percentage of housing units which are sound and with all facilities
Population per square mile in urbanized areas, 1960
Percentage non-white population in urbanized areas, 1960
Percentage employed in white collar occupations
Percentage of families with income < 3000 dollars
Relative hydrocarbon pollution potential
Same for nitric oxides
Same for sulphur dioxide
Annual average percentage relative humidity at 1pm
Total age-adjusted mortality rate per 100,000
McDonald, G. C. and Schwing, R. C. (1973) Instabilities of regression estimates relating air pollution to mortality, Technometrics, 15, 463-482.
data(pollution) ## maybe str(pollution) ; plot(pollution) ...
data(pollution) ## maybe str(pollution) ; plot(pollution) ...
The data give numbers of failures of ten pumps from several systems in the nuclear plant Farley 1. Pumps 1, 3, 4, and 6 operate continuously, while the rest operate only intermittantly or on standby.
data(pumps)
data(pumps)
A data frame with 10 observations on the following 2 variables.
Operating time (in thousands of operatin hours)
Number of failures
Gaver, D. P. and O'Muircheartaigh, I. G. (1987) Robust empirical Bayes analysis of event rates. Technometrics, 29, 1–15.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 600.
Exponential probability plot of data.
qqexp(y, line = FALSE, ...)
qqexp(y, line = FALSE, ...)
y |
Vector for which plot is required |
line |
Add line to plot (no line by default) |
... |
Other options for plot command |
A exponential probablity plot of the data in y; that is, a plot of the ordered values of y against the quantiles of the standard exponential distribution.
qqexp(rexp(50)) qqexp(rgamma(50,shape=2),line=TRUE)
qqexp(rexp(50)) qqexp(rgamma(50,shape=2),line=TRUE)
Times and magnitudes (Richter scale) of 483 shallow earthquakes in an offshore region east of Honshu and south of Hokkaido, for the period 1885–1980.
data(quake)
data(quake)
An irregular time series with earthquake
in days since start of 1885
magnitude (Richter scale)
Ogata, Y. (1988) Statistical models for earthquake occurrences and residual analysis for point processes. Journal of the American Statistical Association, 83, 9–27.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 289.
Data on the weights of 30 rats each week for 5 weeks.
data(rat.growth)
data(rat.growth)
A data frame with 150 observations on the following 3 variables.
a factor with levels 1-30
takes values 0-4
rat weight (units unspecified)
Gelfand, A. E., Hills, S. E., Racine-Poon, A. and Smith, A. F. M. (1990) Illustration of Bayesian inference in normal data models using Gibbs sampling. Journal of the American Statistical Association, 85, 972–985.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 460.
data(rat.growth) library(nlme) rat.fit <- groupedData( y~poly(week,2) | rat, data = rat.growth, labels = list( x = "Week", y = "Weight" ), units = list( x = "", y = "(?)") ) summary(lme(rat.fit))
data(rat.growth) library(nlme) rat.fit <- groupedData( y~poly(week,2) | rat, data = rat.growth, labels = list( x = "Week", y = "Weight" ), units = list( x = "", y = "(?)") ) summary(lme(rat.fit))
The 'salinity' data frame has 28 rows and 4 columns.
Biweekly averages of the water salinity and river discharge in Pamlico Sound, North Carolina were recorded between the years 1972 and 1977. The data in this set consists only of those measurements in March, April and May.
data(salinity)
data(salinity)
This data frame contains the following columns:
The average salinity of the water over two weeks.
The average salinity of the water lagged two weeks. Since only spring is used, the value of 'lag' is not always equal to the previous value of 'sal'.
A factor indicating in which of the 6 biweekly periods between March and May, the observations were taken. The levels of the factor are from 0 to 5 with 0 being the first two weeks in March.
The amount of river discharge during the two weeks for which 'sal' is the average salinity.
The data were obtained from
Ruppert, D. and Carroll, R.J. (1980) Trimmed least squares estimation in the linear model. Journal of the American Statistical Association, 75, 828–838.
data(salinity) ## maybe str(salinity) ; plot(salinity) ...
data(salinity) ## maybe str(salinity) ; plot(salinity) ...
These are the number of seeds germinating when subjected to extracts of certain roots.
data(seeds)
data(seeds)
A data frame with 21 observations on the following 4 variables.
Number of seeds germinating
Total number of seeds
Seed type: O. aegyptiaco 75 or O. aegyptiaco 73
Root extract
Crowder, M. J. (1978) Beta-binomial ANOVA for proportions. Applied Statistics, 27, 34–37.
Cox, D. R. and Snell, E. J. (1989) Analysis of Binary Data, second edition. London: Chapman and Hall. Section 3.2.
data(seeds) ## maybe str(seeds) ; plot(seeds) ...
data(seeds) ## maybe str(seeds) ; plot(seeds) ...
Amount of wear in a paired comparison of two materials used for soling the shoes of 10 boys. The materials were allocated randomly to the left and right feet.
data(shoe)
data(shoe)
A data frame with 20 observations on the following 4 variables.
factor giving the shoe sole material
factor with 10 levels
factor giving left or right foot
amount of shoe wear
Box, G. E. P., Hunter, W. G. and Hunter, J. S. (1978) Statistics for Experimenters. New York: Wiley. Page 100.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 422.
data(shoe) attach(shoe) d <- y[material=="B"]-y[material=="A"] # difference t.test(d) # t test of hypothesis that B wears quicker
data(shoe) attach(shoe) d <- y[material=="B"]-y[material=="A"] # difference t.test(d) # t test of hypothesis that B wears quicker
Data on the number of rubber O-rings showing thermal distress for 23 flights of the space shuttle, with the ambient temperature and pressure at which tests on the putty next to the rings were performed.
data(shuttle)
data(shuttle)
A data frame with 23 observations on the following 4 variables.
Number of rings
Number of rings showing thermal distress
ambient temperature (degrees Fahrenheit)
pressure (pounds per square inch)
Dalal, S. R., Fowlkes, E. B. and Hoadley, B. (1989) Risk analysis of the space shuttle: Pre-Challenger prediction of failure. Journal of the American Statistical Association, 84, 945–957.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 7.
data(shuttle) attach(shuttle) plot(temperature, r/m,ylab="Proportion of failures")
data(shuttle) attach(shuttle) plot(temperature, r/m,ylab="Proportion of failures")
Twenty-year survival and smoking status for 1314 women from Whickham, near Newcastle-upon-Tyne.
data(smoking)
data(smoking)
A data frame with 14 observations on the following 4 variables.
Age group (factor)
Smoking status (1=smoker, 0=non-smoker)
Number alive after 20 years
Number dead after 20 years
Appleton, D. R., French, J. M. and Vanderpump, M. P. J. (1996) Ignoring a covariate: An example of Simpson's paradox. The American Statistician, 50, 340–341.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 258.
data(smoking) summary(glm(cbind(dead,alive)~smoker,data=smoking,binomial)) # note sign change for smoker covariate, due to Simpson's paradox summary(glm(cbind(dead,alive)~age+smoker,data=smoking,binomial))
data(smoking) summary(glm(cbind(dead,alive)~smoker,data=smoking,binomial)) # note sign change for smoker covariate, due to Simpson's paradox summary(glm(cbind(dead,alive)~age+smoker,data=smoking,binomial))
These are scores for the 380 fixtures in the English Premier League, 2000–2001.
data(soccer)
data(soccer)
A data frame with 380 observations on the following 7 variables.
Month of match
Day of match
Year of match
Home team
Away team
Goals scored by home team
Goals scored by away team
http://www.soccerbase.com/footballlive/
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 499.
Failure times of 60 springs divided into 6 groups of 10, with each group subject to a different level of stress. Some of the times are right-censored.
data(springs)
data(springs)
A data frame with 60 observations on the following 3 variables.
failure times (in units of $10^3$ cycles of loading)
censoring indicator, with 0 indicating right-censoring
a factor giving the stress (N/mm$^3$)
Cox, D. R. and Oakes, D. (1984) Analysis of Survival Data. London: Chapman and Hall/CRC Press.
data(springs) attach(springs) plot(cycles~stress) plot(cycles~stress,log="y")
data(springs) attach(springs) plot(cycles~stress) plot(cycles~stress,log="y")
Data on stickiness of blood for six subjects
data(sticky)
data(sticky)
A data frame with 42 observations on the following 2 variables.
factor with levels 1–6
measurement of a property related to stickiness of blood
Unpublished lecture notes, Imperial College London.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 450.
data(sticky) anova(lm(y~subject,data=sticky))
data(sticky) anova(lm(y~subject,data=sticky))
The ‘survival’ data frame has 14 rows and 2 columns.
The data measured the survival percentages of batches of rats who were given varying doses of radiation. At each of 6 doses there were two or three replications of the experiment.
data(survival)
data(survival)
A data frame with 14 observations on the following 2 variables.
The dose of radiation administered (rads).
The survival rate of the batches expressed as a percentage.
Efron, B. (1988) Computer-intensive methods in statistical regression. SIAM Review, 30, 421-449.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 376.
data(survival) plot(survival$dose,survival$surv,log="y") # note the obvious outlier lm(log(surv)~dose,data=survival)
data(survival) plot(survival$dose,survival$surv,log="y") # note the obvious outlier lm(log(surv)~dose,data=survival)
These are data from an experiment on the growth of teak plants after one season, using two planting methods and three root lengths. Plants were laid out in four randomised blocks, each consisting of 6 plots with 50 plants in each plot.
data(teak)
data(teak)
A data frame with 24 observations on the following 4 variables.
Block labels.
A
indicates planting using pits, B
using crowbar.
length, 4
, 6
or 8
inches.
mean height (inches) of the 50 plants grown on each plot.
Unpublished lecture notes, Imperial College, London.
data(teak) anova(lm(y~Block*Plant*Root,data=teak),test="F")
data(teak) anova(lm(y~Block*Plant*Root,data=teak),test="F")
Annual maximum sea levels (m) at seven locations near to or in south-east England, between 1819–1986. There are many missing values.
data(tide)
data(tide)
A data frame with 168 observations on the following 8 variables.
Year
Annual maximum high tide at Yarmouth
Annual maximum high tide at Lowestoft
Annual maximum high tide at Harwich
Annual maximum high tide at Walton
Annual maximum high tide at a site in Holland
Annual maximum high tide at Southend
Annual maximum high tide at Sheerness
The data were kindly provided by Professor Jonathan Tawn of Lancaster University.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 281.
data(tide) plot(tide$year,tide$Yarmouth,type="l")
data(tide) plot(tide$year,tide$Yarmouth,type="l")
Data on the relation between rainfall and the numbers of people testing positive for toxoplasmosis in 34 cities in El Salvador.
data(toxo)
data(toxo)
A data frame with 34 observations on the following 3 variables.
Annual rainfall (mm)
Number of persons tested
Number of persons testing positive for toxoplasmosis
Efron, B. (1986) Double exponential families and their use in generalized linear regression. Journal of the American Statistical Association, 82, 171–200.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 516.
data(toxo) anova(glm(cbind(r,m-r)~poly(rain,3),data=toxo,family=binomial),test="Chi") fit <- glm(cbind(r,m-r)~poly(rain,3),data=toxo,family=quasibinomial) anova(fit,test="F") summary(fit)
data(toxo) anova(glm(cbind(r,m-r)~poly(rain,3),data=toxo,family=binomial),test="Chi") fit <- glm(cbind(r,m-r)~poly(rain,3),data=toxo,family=quasibinomial) anova(fit,test="F") summary(fit)
Data from 40 experiments to compare a new surgery for stomach ulcer with an older surgery.
data(ulcer)
data(ulcer)
A data frame with 80 observations on the following 9 variables.
Author of study from which data taken
Year of publication
Assessment of quality of trial on which data based
Mean age of patients
Number of patients without recurrent bleeding
Total number of patients
a numeric vector
Factor giving control (C) or variants of new treatment
Factor giving 2x2 table corresponding to each trial
Efron, B. (1996) Empirical Bayes methods for combining likelihoods (with Discussion). Journal of the American Statistical Association, 91, 538–565.
Errors in the data given in the paper have been corrected here.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 496.
data(ulcer) glm(cbind(r,m-r)~table+treat,data=ulcer,family=binomial)
data(ulcer) glm(cbind(r,m-r)~table+treat,data=ulcer,family=binomial)
The 'urine' data frame has 79 rows and 7 columns.
79 urine specimens were analyzed in an effort to determine if certain physical characteristics of the urine might be related to the formation of calcium oxalate crystals. Cases 1 and 55 have missing covariates.
data(urine)
data(urine)
This data frame contains the following columns:
Indicator of the presence of calcium oxalate crystals.
The specific gravity of the urine.
The pH reading of the urine.
The osmolarity of the urine. Osmolarity is proportional to the concentration of molecules in solution.
The conductivity of the urine. Conductivity is proportional to the concentration of charged ions in solution.
The urea concentration in millimoles per litre.
The calcium concentration in millimoles per litre.
Andrews, D.F. and Herzberg, A.M. (1985) Data: A Collection of Problems from Many Fields for the Student and Research Worker. Springer-Verlag. Pages 249–251.
data(urine) glm(r~gravity+ph+osmo+cond+urea+calc,binomial,data=urine,subset=-c(1,55))
data(urine) glm(r~gravity+ph+osmo+cond+urea+calc,binomial,data=urine,subset=-c(1,55))
The ten highest annual sea levels (cm) at Venice, from 1887–1981.
data(venice)
data(venice)
A data frame with 95 observations on the following 11 variables.
1887–1981
Annual maximum sea level (cm)
Second largest sea level (cm)
Third largest sea level (cm)
Fourth largest sea level (cm)
Fifth largest sea level (cm)
Sixth largest sea level (cm)
Seventh largest sea level (cm)
Eighth largest sea level (cm)
Ninth largest sea level (cm)
Tenth largest sea level (cm)
There are missing values in 1922 and 1935.
Pirazzoli, P. A. (1982) Maree estreme a Venezia (periodo 1872–1981). Acqua Aria, 10, 1023–1039.
Davison, A. C. (2003) Statistical Models. Cambridge University Press. Page 162.
data(venice) attach(venice) y <- y1[year>1930] # for analysis in Section 5 of Davison (2003) x <- year[year>1930]-1956 plot(x+1956,y,ylab="Sea level (cm)",xlab="Year") lm(y~x)
data(venice) attach(venice) y <- y1[year>1930] # for analysis in Section 5 of Davison (2003) x <- year[year>1930]-1956 plot(x+1956,y,ylab="Sea level (cm)",xlab="Year") lm(y~x)
Daily closing prices (US dollars) of Yahoo.com
shares from 12 April 1996 to
26 April 2000.
data(yahoo)
data(yahoo)
An irregular time series with 1017 values.
data(yahoo) plot(yahoo,type="l",ylab="Yahoo closing prices") plot(diff(100*log(yahoo)),type="l",ylab="Yahoo log returns (percent)")
data(yahoo) plot(yahoo,type="l",ylab="Yahoo closing prices") plot(diff(100*log(yahoo)),type="l",ylab="Yahoo log returns (percent)")