Title: Harrell Miscellaneous
Description: Contains many functions useful for data analysis, high-level graphics, utility operations, functions for computing sample size and power, simulation, importing and annotating datasets, imputing missing values, advanced table making, variable clustering, character string manipulation, conversion of R objects to LaTeX and html code, recoding variables, caching, simplified parallel computing, encrypting and decrypting data using a safe workflow, general moving window statistical estimation, and assistance in interpreting principal component analysis.
Authors: Frank E Harrell Jr [aut, cre], Charles Dupont [ctb] (contributed several functions and maintains latex functions)
Maintainer: Frank E Harrell Jr <[email protected]>
License: GPL (>= 2)
Version: 5.2-1
Built: 2024-12-02 13:40:44 UTC
Source: CRAN
%nin%
is a binary operator that returns a logical vector indicating, for each
element of its left operand, whether there is no match in the right operand.
A TRUE element indicates no match in table; FALSE indicates a match.
x %nin% table
x |
a vector (numeric, character, factor) |
table |
a vector (numeric, character, factor), matching the mode of x |
a vector of logical values with length equal to the length of x.
c('a','b','c') %nin% c('a','b')
Computes the mean and median of various absolute errors related to
ordinary multiple regression models. The mean and median absolute
errors correspond to the mean square due to regression, error, and
total. The absolute errors computed are derived from Yhat - median(Yhat),
Yhat - Y, and Y - median(Y). The function also
computes ratios that correspond to R^2 and 1 - R^2 (but
these ratios do not add to 1.0); the R^2 measure is the ratio of the
mean or median absolute Yhat - median(Yhat) to the mean or median absolute
Y - median(Y). The 1 - R^2 or SSE/SST
measure is the mean or median absolute Yhat - Y
divided by the mean or median absolute Y - median(Y).
abs.error.pred(fit, lp=NULL, y=NULL)

## S3 method for class 'abs.error.pred'
print(x, ...)
fit |
a fit object, typically from lm or ols, that contains a y vector (i.e., y=TRUE was specified to the fitting function) unless the y argument is given |
lp |
a vector of predicted values (Y hat above) if fit is not given |
y |
a vector of response variable values if fit (with y=TRUE in effect) is not given |
x |
an object created by abs.error.pred |
... |
unused |
a list of class abs.error.pred (used by print.abs.error.pred) containing two matrices: differences and ratios.
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
[email protected]
Schemper M (2003): Stat in Med 22:2299-2308.
Tian L, Cai T, Goetghebeur E, Wei LJ (2007): Biometrika 94:297-311.
lm
, ols
, cor
,
validate.ols
set.seed(1)          # so can regenerate results
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- exp(x1+x2+rnorm(100))
f <- lm(log(y) ~ x1 + poly(x2,3), y=TRUE)
abs.error.pred(lp=exp(fitted(f)), y=y)
rm(x1,x2,y,f)
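A further minimal sketch: the fit object can also be passed directly when the fitting function stored the response (the simulated data here are illustrative only).

# Sketch: pass the fit itself; lm must be called with y=TRUE so that
# abs.error.pred can extract the response (illustrative data)
set.seed(2)
x <- rnorm(100)
y <- x + rnorm(100)
f <- lm(y ~ x, y=TRUE)
abs.error.pred(f)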
Add Spike Histograms and Extended Box Plots to ggplot
addggLayers(g, data, type = c("ebp", "spike"),
            ylim = layer_scales(g)$y$get_limits(),
            by = "variable", value = "value",
            frac = 0.065, mult = 1, facet = NULL,
            pos = c("bottom", "top"), showN = TRUE)
g |
a ggplot object |
data |
data frame/table containing raw data |
type |
specifies either extended box plot or spike histogram. Both are horizontal so are showing the distribution of the x-axis variable. |
ylim |
y-axis limits to use for scaling the height of the added plots, if you don't want to use the limits that |
by |
the name of a variable in |
value |
name of x-variable |
frac |
fraction of y-axis range to devote to vertical aspect of the added plot |
mult |
fudge factor for scaling y aspect |
facet |
optional faceting variable |
pos |
position for added plot |
showN |
set to |
For a worked example see the online Hmisc documentation; a minimal sketch is also given below. Note that it was not possible to simply create the layers to be added, because creating these particular layers in isolation resulted in a ggplot
error.
the original ggplot
object with more layers added
Frank Harrell
spikecomp()
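A minimal sketch of typical use, assuming a long-format data frame whose columns match the by and value defaults; the data and column names here are illustrative, not taken from the package examples.

# Minimal sketch (illustrative data; relies on the default by='variable'
# and value='value' column names)
library(ggplot2)
set.seed(1)
d <- data.frame(variable='x', value=rnorm(200))
g <- ggplot(d, aes(x=value)) + geom_density()
addggLayers(g, d, type='spike', pos='bottom')   # add a spike histogram
addggLayers(g, d, type='ebp',   pos='top')      # add an extended box plot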
Given a data frame and the names of one or more variables, doubles the
data frame for each variable, adding a new category "All"
by default, or the value of label.
A new variable .marginal. is added to the resulting data frame,
with value "" if the observation is an original one, and with
value equal to the names of the variable(s) being marginalized over (separated
by commas) otherwise. If there is another stratification variable
besides the one in ..., and that variable is nested inside the
variable in ..., specify nested=variable name to have the value
of that variable set to label whenever marginal observations are
created for .... See the state-city example below.
addMarginal(data, ..., label = "All", margloc=c('last', 'first'), nested)
addMarginal(data, ..., label = "All", margloc=c('last', 'first'), nested)
data |
a data frame |
... |
a list of names of variables to marginalize |
label |
category name for added marginal observations |
margloc |
location for marginal category within factor variable
specifying categories. Set to |
nested |
a single unquoted variable name if used |
d <- expand.grid(sex=c('female', 'male'), country=c('US', 'Romania'), reps=1:2)
addMarginal(d, sex, country)

# Example of nested variables
d <- data.frame(state=c('AL', 'AL', 'GA', 'GA', 'GA'),
                city=c('Mobile', 'Montgomery', 'Valdosto', 'Augusta', 'Atlanta'),
                x=1:5, stringsAsFactors=TRUE)
addMarginal(d, state, nested=city)   # city set to 'All' when state is marginalized
Tests, without issuing warnings, whether all elements of a character
vector are legal numeric values, or optionally converts the vector to a
numeric vector. Leading and trailing blanks in x
are ignored.
all.is.numeric(x, what = c("test", "vector", "nonnum"), extras=c('.','NA'))
all.is.numeric(x, what = c("test", "vector", "nonnum"), extras=c('.','NA'))
x |
a character vector |
what |
specify |
extras |
a vector of character strings to count as numeric
values, other than |
a logical value if what="test"
or a vector otherwise
Frank Harrell
all.is.numeric(c('1','1.2','3'))
all.is.numeric(c('1','1.2','3a'))
all.is.numeric(c('1','1.2','3'),'vector')
all.is.numeric(c('1','1.2','3a'),'vector')
all.is.numeric(c('1','',' .'),'vector')
all.is.numeric(c('1', '1.2', '3a'), 'nonnum')
Works in conjunction with the approx
function to do linear
extrapolation. approx
in R does not support extrapolation at
all, and it is buggy in S-Plus 6.
approxExtrap(x, y, xout, method = "linear", n = 50, rule = 2, f = 0, ties = "ordered", na.rm = FALSE)
approxExtrap(x, y, xout, method = "linear", n = 50, rule = 2, f = 0, ties = "ordered", na.rm = FALSE)
x , y , xout , method , n , rule , f
|
see approx |
ties |
applies only to R. See approx |
na.rm |
set to |
Duplicates in x (and corresponding y elements) are removed before using approx.
a vector the same length as xout
Frank Harrell
approxExtrap(1:3,1:3,xout=c(0,4))
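A slightly fuller sketch showing extrapolation below and above the observed x range; the values are illustrative only.

# Extrapolating outside the observed x range (illustrative values)
x <- c(1, 2, 4, 8)
y <- c(10, 14, 22, 38)
approxExtrap(x, y, xout=c(0, 10))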
Expands continuous variables into restricted cubic spline bases and
categorical variables into dummy variables and fits a multivariate
equation using canonical variates. This finds optimum transformations
that maximize R^2. Optionally, the bootstrap is used to estimate
the covariance matrix of both left- and right-hand-side transformation
parameters, and to estimate the bias in the R^2 due to overfitting
and compute the bootstrap optimism-corrected R^2.
Cross-validation can also be used to get an unbiased estimate of
R^2 but this is not as precise as the bootstrap estimate. The
bootstrap and cross-validation may also be used to get estimates of mean
and median absolute error in predicted values on the original
y
scale. These two estimates are perhaps the best ones for gauging the
accuracy of a flexible model, because it is difficult to compare
R^2 under different y-transformations, and because R^2
allows for an out-of-sample recalibration (i.e., it only measures
relative errors).
Note that uncertainty about the proper transformation of y
causes
an enormous amount of model uncertainty. When the transformation for
y
is estimated from the data a high variance in predicted values
on the original y
scale may result, especially if the true
transformation is linear. Comparing bootstrap or cross-validated mean
absolute errors with and without restricting the y
transform to be
linear (ytype='l'
) may help the analyst choose the proper model
complexity.
areg(x, y, xtype = NULL, ytype = NULL, nk = 4, B = 0, na.rm = TRUE,
     tolerance = NULL, crossval = NULL)

## S3 method for class 'areg'
print(x, digits=4, ...)

## S3 method for class 'areg'
plot(x, whichx = 1:ncol(x$x), ...)

## S3 method for class 'areg'
predict(object, x, type=c('lp','fitted','x'),
        what=c('all','sample'), ...)
x |
A single predictor or a matrix of predictors. Categorical
predictors are required to be coded as integers (as |
y |
a |
xtype |
a vector of one-letter character codes specifying how each predictor
is to be modeled, in order of columns of |
ytype |
same coding as for |
nk |
number of knots, 0 for linear, or 3 or more. Default is 4 which will fit 3 parameters to continuous variables (one linear term and two nonlinear terms) |
B |
number of bootstrap resamples used to estimate covariance matrices of transformation parameters. Default is no bootstrapping. |
na.rm |
set to |
tolerance |
singularity tolerance. List source code for
|
crossval |
set to a positive integer k to compute k-fold cross-validated R-squared (square of first canonical correlation) and mean and median absolute error of predictions on the original scale |
digits |
number of digits to use in formatting for printing |
object |
an object created by areg |
whichx |
integer or character vector specifying which predictors
are to have their transformations plotted (default is all). The
|
type |
tells |
what |
When the |
... |
arguments passed to the plot function. |
areg
is a competitor of ace
in the acepack
package. Transformations from ace
are seldom smooth enough and
are often overfitted. With areg
the complexity can be controlled
with the nk
parameter, and predicted values are easy to obtain
because parametric functions are fitted.
If one side of the equation has a categorical variable with more than
two categories and the other side has a continuous variable not assumed
to act linearly, larger sample sizes are needed to reliably estimate
transformations, as it is difficult to optimally score categorical
variables to maximize R^2 against a simultaneously optimally
transformed continuous variable.
a list of class "areg"
containing many objects
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
Breiman and Friedman, Journal of the American Statistical Association (September, 1985).
set.seed(1)
ns <- c(30,300,3000)
for(n in ns) {
  y <- sample(1:5, n, TRUE)
  x <- abs(y-3) + runif(n)
  par(mfrow=c(3,4))
  for(k in c(0,3:5)) {
    z <- areg(x, y, ytype='c', nk=k)
    plot(x, z$tx)
    title(paste('R2=',format(z$rsquared)))
    tapply(z$ty, y, range)
    a <- tapply(x,y,mean)
    b <- tapply(z$ty,y,mean)
    plot(a,b)
    abline(lsfit(a,b))
    # Should get same result to within linear transformation if reverse x and y
    w <- areg(y, x, xtype='c', nk=k)
    plot(z$ty, w$tx)
    title(paste('R2=',format(w$rsquared)))
    abline(lsfit(z$ty, w$tx))
  }
}
par(mfrow=c(2,2))

# Example where one category in y differs from others but only in variance of x
n <- 50
y <- sample(1:5,n,TRUE)
x <- rnorm(n)
x[y==1] <- rnorm(sum(y==1), 0, 5)
z <- areg(x,y,xtype='l',ytype='c')
z
plot(z)
z <- areg(x,y,ytype='c')
z
plot(z)

## Not run: 
# Examine overfitting when true transformations are linear
par(mfrow=c(4,3))
for(n in c(200,2000)) {
  x <- rnorm(n); y <- rnorm(n) + x
  for(nk in c(0,3,5)) {
    z <- areg(x, y, nk=nk, crossval=10, B=100)
    print(z)
    plot(z)
    title(paste('n=',n))
  }
}
par(mfrow=c(1,1))

# Underfitting when true transformation is quadratic but overfitting
# when y is allowed to be transformed
set.seed(49)
n <- 200
x <- rnorm(n); y <- rnorm(n) + .5*x^2
#areg(x, y, nk=0, crossval=10, B=100)
#areg(x, y, nk=4, ytype='l', crossval=10, B=100)
z <- areg(x, y, nk=4)  #, crossval=10, B=100)
z
# Plot x vs. predicted value on original scale.  Since y-transform is
# not monotonic, there are multiple y-inverses
xx <- seq(-3.5,3.5,length=1000)
yhat <- predict(z, xx, type='fitted')
plot(x, y, xlim=c(-3.5,3.5))
for(j in 1:ncol(yhat)) lines(xx, yhat[,j], col=j)
# Plot a random sample of possible y inverses
yhats <- predict(z, xx, type='fitted', what='sample')
points(xx, yhats, pch=2)

## End(Not run)

# True transformation of x1 is quadratic, y is linear
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n); y <- rnorm(n) + x1^2
z <- areg(cbind(x1,x2),y,xtype=c('s','l'),nk=3)
par(mfrow=c(2,2))
plot(z)

# y transformation is inverse quadratic but areg gets the same answer by
# making x1 quadratic
n <- 5000
x1 <- rnorm(n); x2 <- rnorm(n); y <- (x1 + rnorm(n))^2
z <- areg(cbind(x1,x2),y,nk=5)
par(mfrow=c(2,2))
plot(z)

# Overfit 20 predictors when no true relationships exist
n <- 1000
x <- matrix(runif(n*20),n,20)
y <- rnorm(n)
z <- areg(x, y, nk=5)   # add crossval=4 to expose the problem

# Test predict function
n <- 50
x <- rnorm(n)
y <- rnorm(n) + x
g <- sample(1:3, n, TRUE)
z <- areg(cbind(x,g),y,xtype=c('s','c'))
range(predict(z, cbind(x,g)) - z$linear.predictors)
The transcan
function creates flexible additive imputation models
but provides only an approximation to true multiple imputation as the
imputation models are fixed before all multiple imputations are
drawn. This ignores variability caused by having to fit the
imputation models. aregImpute
takes all aspects of uncertainty in
the imputations into account by using the bootstrap to approximate the
process of drawing predicted values from a full Bayesian predictive
distribution. Different bootstrap resamples are used for each of the
multiple imputations, i.e., for the i
th imputation of a sometimes
missing variable, i=1,2,... n.impute
, a flexible additive
model is fitted on a sample with replacement from the original data and
this model is used to predict all of the original missing and
non-missing values for the target variable.
areg
is used to fit the imputation models. By default, linearity
is assumed for target variables (variables being imputed) and
nk=3
knots are assumed for continuous predictors transformed
using restricted cubic splines. If nk
is three or greater and
tlinear
is set to FALSE
, areg
simultaneously finds transformations of the target variable and of all of
the predictors, to get a good fit assuming additivity, maximizing
R^2, using the same canonical correlation method as
transcan
. Flexible transformations may be overridden for
specific variables by specifying the identity transformation for them.
When a categorical variable is being predicted, the flexible
transformation is Fisher's optimum scoring method. Nonlinear transformations for continuous variables may be nonmonotonic. If
nk
is a vector, areg
's bootstrap and crossval=10
options will be used to help find the optimum validating value of
nk
over values of that vector, at the last imputation iteration.
For the imputations, the minimum value of nk
is used.
Instead of defaulting to taking random draws from fitted imputation
models using random residuals as is done by transcan
,
aregImpute
by default uses predictive mean matching with optional
weighted probability sampling of donors rather than using only the
closest match. Predictive mean matching works for binary, categorical,
and continuous variables without the need for iterative maximum
likelihood fitting for binary and categorical variables, and without the
need for computing residuals or for curtailing imputed values to be in
the range of actual data. Predictive mean matching is especially
attractive when the variable being imputed is also being transformed
automatically. Constraints may be placed on variables being imputed
with predictive mean matching, e.g., a missing hospital discharge date
may be required to be imputed from a donor observation whose discharge
date is before the recipient subject's first post-discharge visit date.
See Details below for more information about the
algorithm. A "regression"
method is also available that is
similar to that used in transcan
. This option should be used
when mechanistic missingness requires the use of extrapolation during
imputation.
A print
method summarizes the results, and a plot
method plots
distributions of imputed values. Typically, fit.mult.impute
will
be called after aregImpute
.
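For orientation, a minimal sketch of that workflow follows; the data frame and variable names here (mydata, y, x1, x2) are illustrative, and full worked examples appear later in this entry.

# Sketch of the usual workflow: impute, then fit the substantive model on
# the completed datasets and combine results (illustrative names)
a <- aregImpute(~ y + x1 + x2, data=mydata, n.impute=5)
f <- fit.mult.impute(y ~ x1 + x2, lm, a, data=mydata)
summary(f)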
If a target variable is transformed nonlinearly (i.e., if nk
is
greater than zero and tlinear
is set to FALSE
) and the
estimated target variable transformation is non-monotonic, imputed
values are not unique. When type='regression'
, a random choice
of possible inverse values is made.
The reformM
function provides two ways of recreating a formula to
give to aregImpute
by reordering the variables in the formula.
This is a modified version of a function written by Yong Hao Pua. One
can specify nperm
to obtain a list of nperm
randomly
permuted variables. The list is converted to a single ordinary formula
if nperm=1
. If nperm
is omitted, variables are sorted in
descending order of the number of NA
s. reformM
also
prints a recommended number of multiple imputations to use, which is a
minimum of 5 and the percent of incomplete observations.
aregImpute(formula, data, subset, n.impute=5, group=NULL, nk=3, tlinear=TRUE,
           type=c('pmm','regression','normpmm'), pmmtype=1,
           match=c('weighted','closest','kclosest'), kclosest=3, fweighted=0.2,
           curtail=TRUE, constraint=NULL,
           boot.method=c('simple', 'approximate bayesian'),
           burnin=3, x=FALSE, pr=TRUE, plotTrans=FALSE, tolerance=NULL, B=75)

## S3 method for class 'aregImpute'
print(x, digits=3, ...)

## S3 method for class 'aregImpute'
plot(x, nclass=NULL, type=c('ecdf','hist'),
     datadensity=c("hist", "none", "rug", "density"),
     diagnostics=FALSE, maxn=10, ...)

reformM(formula, data, nperm)
formula |
an S model formula. You can specify restrictions for transformations
of variables. The function automatically determines which variables
are categorical (i.e., |
x |
an object created by |
data |
input raw data |
subset |
These may also be specified. You may not specify |
n.impute |
number of multiple imputations. |
group |
a character or factor variable the same length as the
number of observations in |
nk |
number of knots to use for continuous variables. When both
the target variable and the predictors are having optimum
transformations estimated, there is more instability than with normal
regression so the complexity of the model should decrease more sharply
as the sample size decreases. Hence set |
tlinear |
set to |
type |
The default is |
pmmtype |
type of matching to be used for predictive mean
matching when |
match |
Defaults to |
kclosest |
see |
fweighted |
Smoothing parameter (multiple of mean absolute difference) used when
|
curtail |
applies if |
constraint |
for predictive mean matching |
boot.method |
By default, simple bootstrapping is used in which the
target variable is predicted using a sample with replacement from the
observations with non-missing target variable. Specify
|
burnin |
|
pr |
set to |
plotTrans |
set to |
tolerance |
singularity criterion; list the source code in the
|
B |
number of bootstrap resamples to use if |
digits |
number of digits for printing |
nclass |
number of bins to use in drawing histogram |
datadensity |
see |
diagnostics |
Specify |
maxn |
Maximum number of observations shown for diagnostics. Default is
|
nperm |
number of random formula permutations for |
... |
other arguments that are ignored |
The sequence of steps used by the aregImpute
algorithm is the
following.
(1) For each variable containing m NA
s where m > 0, initialize the
NA
s to values from a random sample (without replacement if
a sufficient number of non-missing values exist) of size m from the
non-missing values.
(2) For burnin+n.impute
iterations do the following steps. The
first burnin
iterations provide a burn-in, and imputations are
saved only from the last n.impute
iterations.
(3) For each variable containing any NA
s, draw a sample with
replacement from the observations in the entire dataset in which the
current variable being imputed is non-missing. Fit a flexible
additive model to predict this target variable while finding the
optimum transformation of it (unless the identity
transformation is forced). Use this fitted flexible model to
predict the target variable in all of the original observations.
Impute each missing value of the target variable with the observed
value whose predicted transformed value is closest to the predicted
transformed value of the missing value (if match="closest"
and
type="pmm"
),
or use a draw from a multinomial distribution with probabilities derived
from distance weights, if match="weighted"
(the default).
(4) After these imputations are computed, use these random draw
imputations the next time the current target variable is used as a
predictor of other sometimes-missing variables.
When match="closest"
, predictive mean matching does not work well
when fewer than 3 variables are used to predict the target variable,
because many of the multiple imputations for an observation will be
identical. In the extreme case of one right-hand-side variable and
assuming that only monotonic transformations of left and right-side
variables are allowed, every bootstrap resample will give predicted
values of the target variable that are monotonically related to
predicted values from every other bootstrap resample. The same is true
for Bayesian predicted values. This causes predictive mean matching to
always match on the same donor observation.
When the missingness mechanism for a variable is so systematic that the
distribution of observed values is truncated, predictive mean matching
does not work. It will only yield imputed values that are near observed
values, so intervals in which no values are observed will not be
populated by imputed values. For this case, the only hope is to make
regression assumptions and use extrapolation. With
type="regression"
, aregImpute
will use linear
extrapolation to obtain a (hopefully) reasonable distribution of imputed
values. The "regression"
option causes aregImpute
to
impute missing values by adding a random sample of residuals (with
replacement if there are more NA
s than measured values) on the
transformed scale of the target variable. After random residuals are
added, predicted random draws are obtained on the original untransformed
scale using reverse linear interpolation on the table of original and
transformed target values (linear extrapolation when a random residual
is large enough to put the random draw prediction outside the range of
observed values). The bootstrap is used as with type="pmm"
to
factor in the uncertainty of the imputation model.
As model uncertainty is high when the transformation of a target
variable is unknown, tlinear
defaults to TRUE
to limit the
variance in predicted values when nk
is positive.
a list of class "aregImpute"
containing the following elements:
call |
the function call expression |
formula |
the formula specified to |
match |
the |
fweighted |
the |
n |
total number of observations in input dataset |
p |
number of variables |
na |
list of subscripts of observations for which values were originally missing |
nna |
named vector containing the numbers of missing values in the data |
type |
vector of types of transformations used for each variable
( |
tlinear |
value of |
nk |
number of knots used for smooth transformations |
cat.levels |
list containing character vectors specifying the |
df |
degrees of freedom (number of parameters estimated) for each variable |
n.impute |
number of multiple imputations per missing value |
imputed |
a list containing matrices of imputed values in the same format as
those created by |
x |
if |
rsq |
for the last round of imputations, a vector containing the R-squares
with which each sometimes-missing variable could be predicted from the
others by |
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
van Buuren, Stef. Flexible Imputation of Missing Data. Chapman & Hall/CRC, Boca Raton FL, 2012.
Little R, An H. Robust likelihood-based analysis of multivariate data with missing values. Statistica Sinica 14:949-968, 2004.
van Buuren S, Brand JPL, Groothuis-Oudshoorn CGM, Rubin DB. Fully conditional specifications in multivariate imputation. J Stat Comp Sim 72:1049-1064, 2006.
de Groot JAH, Janssen KJM, Zwinderman AH, Moons KGM, Reitsma JB. Multiple imputation to correct for partial verification bias revisited. Stat Med 27:5880-5889, 2008.
Siddique J. Multiple imputation using an iterative hot-deck with distance-based donor selection. Stat Med 27:83-102, 2008.
White IR, Royston P, Wood AM. Multiple imputation using chained equations: Issues and guidance for practice. Stat Med 30:377-399, 2011.
Curnow E, Carpenter JR, Heron JE, et al: Multiple imputation of missing data under missing at random: compatible imputation models are not sufficient to avoid bias if they are mis-specified. J Clin Epi June 9, 2023. DOI:10.1016/j.jclinepi.2023.06.011.
fit.mult.impute
, transcan
, areg
, naclus
, naplot
, mice
,
dotchart3
, Ecdf
, completer
# Check that aregImpute can almost exactly estimate missing values when
# there is a perfect nonlinear relationship between two variables
# Fit restricted cubic splines with 4 knots for x1 and x2, linear for x3
set.seed(3)
x1 <- rnorm(200)
x2 <- x1^2
x3 <- runif(200)
m <- 30
x2[1:m] <- NA
a <- aregImpute(~x1+x2+I(x3), n.impute=5, nk=4, match='closest')
a
matplot(x1[1:m]^2, a$imputed$x2)
abline(a=0, b=1, lty=2)

x1[1:m]^2
a$imputed$x2

# Multiple imputation and estimation of variances and covariances of
# regression coefficient estimates accounting for imputation
# Example 1: large sample size, much missing data, no overlap in
# NAs across variables
x1 <- factor(sample(c('a','b','c'),1000,TRUE))
x2 <- (x1=='b') + 3*(x1=='c') + rnorm(1000,0,2)
x3 <- rnorm(1000)
y  <- x2 + 1*(x1=='c') + .2*x3 + rnorm(1000,0,2)
orig.x1 <- x1[1:250]
orig.x2 <- x2[251:350]
x1[1:250]   <- NA
x2[251:350] <- NA
d <- data.frame(x1,x2,x3,y, stringsAsFactors=TRUE)
# Find value of nk that yields best validating imputation models
# tlinear=FALSE means to not force the target variable to be linear
f <- aregImpute(~y + x1 + x2 + x3, nk=c(0,3:5), tlinear=FALSE,
                data=d, B=10)   # normally B=75
f
# Try forcing target variable (x1, then x2) to be linear while allowing
# predictors to be nonlinear (could also say tlinear=TRUE)
f <- aregImpute(~y + x1 + x2 + x3, nk=c(0,3:5), data=d, B=10)
f

## Not run: 
# Use 100 imputations to better check against individual true values
f <- aregImpute(~y + x1 + x2 + x3, n.impute=100, data=d)
f
par(mfrow=c(2,1))
plot(f)
modecat <- function(u) {
  tab <- table(u)
  as.numeric(names(tab)[tab==max(tab)][1])
}
table(orig.x1, apply(f$imputed$x1, 1, modecat))
par(mfrow=c(1,1))
plot(orig.x2, apply(f$imputed$x2, 1, mean))
fmi <- fit.mult.impute(y ~ x1 + x2 + x3, lm, f, data=d)
sqrt(diag(vcov(fmi)))
fcc <- lm(y ~ x1 + x2 + x3)
summary(fcc)   # SEs are larger than from mult. imputation

## End(Not run)

## Not run: 
# Example 2: Very discriminating imputation models,
# x1 and x2 have some NAs on the same rows, smaller n
set.seed(5)
x1 <- factor(sample(c('a','b','c'),100,TRUE))
x2 <- (x1=='b') + 3*(x1=='c') + rnorm(100,0,.4)
x3 <- rnorm(100)
y  <- x2 + 1*(x1=='c') + .2*x3 + rnorm(100,0,.4)
orig.x1 <- x1[1:20]
orig.x2 <- x2[18:23]
x1[1:20] <- NA
x2[18:23] <- NA
#x2[21:25] <- NA
d <- data.frame(x1,x2,x3,y, stringsAsFactors=TRUE)
n <- naclus(d)
plot(n); naplot(n)   # Show patterns of NAs
# 100 imputations to study them; normally use 5 or 10
f <- aregImpute(~y + x1 + x2 + x3, n.impute=100, nk=0, data=d)
par(mfrow=c(2,3))
plot(f, diagnostics=TRUE, maxn=2)
# Note: diagnostics=TRUE makes graphs similar to those made by:
# r <- range(f$imputed$x2, orig.x2)
# for(i in 1:6) {   # use 1:2 to mimic maxn=2
#   plot(1:100, f$imputed$x2[i,], ylim=r,
#        ylab=paste("Imputations for Obs.",i))
#   abline(h=orig.x2[i],lty=2)
# }

table(orig.x1, apply(f$imputed$x1, 1, modecat))
par(mfrow=c(1,1))
plot(orig.x2, apply(f$imputed$x2, 1, mean))

fmi <- fit.mult.impute(y ~ x1 + x2, lm, f, data=d)
sqrt(diag(vcov(fmi)))
fcc <- lm(y ~ x1 + x2)
summary(fcc)   # SEs are larger than from mult. imputation

## End(Not run)

## Not run: 
# Study relationship between smoothing parameter for weighting function
# (multiplier of mean absolute distance of transformed predicted
# values, used in tricube weighting function) and standard deviation
# of multiple imputations.  SDs are computed from average variances
# across subjects.  match="closest" same as match="weighted" with
# small value of fweighted.
# This example also shows problems with predicted mean
# matching almost always giving the same imputed values when there is
# only one predictor (regression coefficients change over multiple
# imputations but predicted values are virtually 1-1 functions of each
# other)

set.seed(23)
x <- runif(200)
y <- x + runif(200, -.05, .05)
r <- resid(lsfit(x,y))
rmse <- sqrt(sum(r^2)/(200-2))   # sqrt of residual MSE

y[1:20] <- NA
d <- data.frame(x,y)
f <- aregImpute(~ x + y, n.impute=10, match='closest', data=d)
# As an aside here is how to create a completed dataset for imputation
# number 3 as fit.mult.impute would do automatically.  In this degenerate
# case changing 3 to 1-2,4-10 will not alter the results.
imputed <- impute.transcan(f, imputation=3, data=d, list.out=TRUE,
                           pr=FALSE, check=FALSE)
sd <- sqrt(mean(apply(f$imputed$y, 1, var)))

ss <- c(0, .01, .02, seq(.05, 1, length=20))
sds <- ss; sds[1] <- sd

for(i in 2:length(ss)) {
  f <- aregImpute(~ x + y, n.impute=10, fweighted=ss[i])
  sds[i] <- sqrt(mean(apply(f$imputed$y, 1, var)))
}

plot(ss, sds, xlab='Smoothing Parameter', ylab='SD of Imputed Values',
     type='b')
abline(v=.2,   lty=2)   # default value of fweighted
abline(h=rmse, lty=2)   # root MSE of residuals from linear regression

## End(Not run)

## Not run: 
# Do a similar experiment for the Titanic dataset
getHdata(titanic3)
h <- lm(age ~ sex + pclass + survived, data=titanic3)
rmse <- summary(h)$sigma
set.seed(21)
f <- aregImpute(~ age + sex + pclass + survived, n.impute=10,
                data=titanic3, match='closest')
sd <- sqrt(mean(apply(f$imputed$age, 1, var)))

ss <- c(0, .01, .02, seq(.05, 1, length=20))
sds <- ss; sds[1] <- sd

for(i in 2:length(ss)) {
  f <- aregImpute(~ age + sex + pclass + survived, data=titanic3,
                  n.impute=10, fweighted=ss[i])
  sds[i] <- sqrt(mean(apply(f$imputed$age, 1, var)))
}

plot(ss, sds, xlab='Smoothing Parameter', ylab='SD of Imputed Values',
     type='b')
abline(v=.2,   lty=2)   # default value of fweighted
abline(h=rmse, lty=2)   # root MSE of residuals from linear regression

## End(Not run)

set.seed(2)
d <- data.frame(x1=runif(50),
                x2=c(rep(NA, 10), runif(40)),
                x3=c(runif(4), rep(NA, 11), runif(35)))
reformM(~ x1 + x2 + x3, data=d)
reformM(~ x1 + x2 + x3, data=d, nperm=2)
# Give result or one of the results as the first argument to aregImpute

# Constrain imputed values for two variables
# Require imputed values for x2 to be above 0.2
# Assume x1 is never missing and require imputed values for
# x3 to be less than the recipient's value of x1
a <- aregImpute(~ x1 + x2 + x3, data=d,
                constraint=list(x2 = expression(d$x2 > 0.2),
                                x3 = expression(d$x3 < r$x1)))
a
Produces 1-alpha confidence intervals for binomial probabilities.
binconf(x, n, alpha=0.05, method=c("wilson","exact","asymptotic","all"), include.x=FALSE, include.n=FALSE, return.df=FALSE)
x |
vector containing the number of "successes" for binomial variates |
n |
vector containing the numbers of corresponding observations |
alpha |
probability of a type I error, so confidence coefficient = 1-alpha |
method |
character string specifying which method to use. The "all" method only works when x and n are length 1. The "exact" method uses the F distribution to compute exact (based on the binomial cdf) intervals; the "wilson" interval is score-test-based; and the "asymptotic" is the textbook, asymptotic normal interval. Following Agresti and Coull, the Wilson interval is to be preferred and so is the default. |
include.x |
logical flag to indicate whether x should be included in the returned matrix or data frame |
include.n |
logical flag to indicate whether n should be included in the returned matrix or data frame |
return.df |
logical flag to indicate that a data frame rather than a matrix be returned |
a matrix or data.frame containing the computed intervals and, optionally, x and n.
Rollin Brant, Modified by Frank Harrell and
Brad Biggerstaff
Centers for Disease Control and Prevention
National Center for Infectious Diseases
Division of Vector-Borne Infectious Diseases
P.O. Box 2087, Fort Collins, CO, 80522-2087, USA
[email protected]
A. Agresti and B.A. Coull, Approximate is better than "exact" for interval estimation of binomial proportions, American Statistician, 52:119–126, 1998.
R.G. Newcombe, Logit confidence intervals and the inverse sinh transformation, American Statistician, 55:200–202, 2001.
L.D. Brown, T.T. Cai and A. DasGupta, Interval estimation for a binomial proportion (with discussion), Statistical Science, 16:101–133, 2001.
binconf(0:10,10,include.x=TRUE,include.n=TRUE)
binconf(46,50,method="all")
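A further sketch showing the data-frame form of the output, using only documented arguments; the counts are illustrative.

# Return a data frame and keep x and n alongside the Wilson interval
binconf(x=20, n=100, alpha=0.05, method="wilson",
        include.x=TRUE, include.n=TRUE, return.df=TRUE)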
biVar
is a generic function that accepts a formula and usual
data
, subset
, and na.action
parameters plus a
list statinfo
that specifies a function of two variables to
compute along with information about labeling results for printing and
plotting. The function is called separately with each right hand side
variable and the same left hand variable. The result is a matrix of
bivariate statistics and the statinfo
list that drives printing
and plotting. The plot method draws a dot plot with x-axis values by
default sorted in order of one of the statistics computed by the function.
spearman2
computes the square of Spearman's rho rank correlation
and a generalization of it in which x
can relate
non-monotonically to y
. This is done by computing the Spearman
multiple rho-squared between (rank(x), rank(x)^2)
and y
.
When x
is categorical, a different kind of Spearman correlation
used in the Kruskal-Wallis test is computed (and spearman2
can do
the Kruskal-Wallis test). This is done by computing the ordinary
multiple R^2
between k-1
dummy variables and
rank(y)
, where x
has k
categories. x
can
also be a formula, in which case each predictor is correlated separately
with y
, using non-missing observations for that predictor.
biVar
is used to do the looping and bookkeeping. By default the
plot shows the adjusted rho^2
, using the same formula used for
the ordinary adjusted R^2
. The F
test uses the unadjusted
R^2.
spearman
computes Spearman's rho on non-missing values of two
variables. spearman.test
is a simple version of
spearman2.default
.
chiSquare
is set up like spearman2
except it is intended
for a categorical response variable. Separate Pearson chi-square tests
are done for each predictor, with optional collapsing of infrequent
categories. Numeric predictors having more than g
levels are
categorized into g
quantile groups. chiSquare
uses
biVar
.
biVar(formula, statinfo, data=NULL, subset=NULL,
      na.action=na.retain, exclude.imputed=TRUE, ...)

## S3 method for class 'biVar'
print(x, ...)

## S3 method for class 'biVar'
plot(x, what=info$defaultwhat, sort.=TRUE, main, xlab,
     vnames=c('names','labels'), ...)

spearman2(x, ...)

## Default S3 method:
spearman2(x, y, p=1, minlev=0, na.rm=TRUE, exclude.imputed=na.rm, ...)

## S3 method for class 'formula'
spearman2(formula, data=NULL, subset, na.action=na.retain,
          exclude.imputed=TRUE, ...)

spearman(x, y)

spearman.test(x, y, p=1)

chiSquare(formula, data=NULL, subset=NULL, na.action=na.retain,
          exclude.imputed=TRUE, ...)
formula |
a formula with a single left side variable |
statinfo |
see |
data , subset , na.action
|
the usual options for models. Default for |
exclude.imputed |
set to |
... |
other arguments that are passed to the function used to
compute the bivariate statistics or to |
na.rm |
logical; delete NA values? |
x |
a numeric matrix with at least 5 rows and at least 2 columns (if
|
y |
a numeric vector |
p |
for numeric variables, specifies the order of the Spearman |
minlev |
minimum relative frequency that a level of a categorical predictor
should have before it is pooled with other categories (see
|
what |
specifies which statistic to plot. Possibilities include the column names that appear when the print method is used. |
sort. |
set |
main |
main title for plot. Default title shows the name of the response variable. |
xlab |
x-axis label. Default constructed from |
vnames |
set to |
Uses midranks in case of ties, as described by Hollander and Wolfe.
P-values for Spearman, Wilcoxon, or Kruskal-Wallis tests are
approximated by using the t
or F
distributions.
spearman2.default
(the
function that is called for a single x
, i.e., when there is no
formula) returns a vector of statistics for the variable.
biVar
, spearman2.formula
, and chiSquare
return a
matrix with rows corresponding to predictors.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
Hollander M. and Wolfe D.A. (1973). Nonparametric Statistical Methods. New York: Wiley.
Press WH, Flannery BP, Teukolsky SA, Vetterling, WT (1988): Numerical Recipes in C. Cambridge: Cambridge University Press.
combine.levels
,
varclus
, dotchart3
, impute
,
chisq.test
, cut2
.
x <- c(-2, -1, 0, 1, 2)
y <- c(4, 1, 0, 1, 4)
z <- c(1, 2, 3, 4, NA)
v <- c(1, 2, 3, 4, 5)

spearman2(x, y)
plot(spearman2(z ~ x + y + v, p=2))

f <- chiSquare(z ~ x + y + v)
f
Bootstraps Kaplan-Meier estimate of the probability of survival to at
least a fixed time (times
variable) or the estimate of the q
quantile of the survival distribution (e.g., median survival time, the
default).
bootkm(S, q=0.5, B=500, times, pr=TRUE)
S |
a Surv object for possibly right-censored survival time |
q |
quantile of survival time, default is 0.5 for median |
B |
number of bootstrap repetitions (default=500) |
times |
time vector (currently only a scalar is allowed) at which to compute
survival estimates. You may specify only one of |
pr |
set to |
bootkm
uses Therneau's survfitKM
function to efficiently
compute Kaplan-Meier estimates.
a vector containing B
bootstrap estimates
updates .Random.seed
, and, if pr=TRUE
, prints progress
of simulations
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
[email protected]
Akritas MG (1986): Bootstrapping the Kaplan-Meier estimator. JASA 81:1032–1038.
survfit
, Surv
,
Survival.cph
, Quantile.cph
# Compute 0.95 nonparametric confidence interval for the difference in
# median survival time between females and males (two-sample problem)
set.seed(1)
library(survival)
S <- Surv(runif(200))   # no censoring
sex <- c(rep('female',100), rep('male',100))
med.female <- bootkm(S[sex=='female',], B=100)   # normally B=500
med.male   <- bootkm(S[sex=='male',],   B=100)
describe(med.female - med.male)
quantile(med.female - med.male, c(.025,.975), na.rm=TRUE)
# na.rm needed because some bootstrap estimates of median survival
# time may be missing when a bootstrap sample did not include the
# longer survival times
Uses method of Fleiss, Tytun, and Ury (but without the continuity
correction) to estimate the power (or the sample size to achieve a given
power) of a two-sided test for the difference in two proportions. The two
sample sizes are allowed to be unequal, but for bsamsize
you must specify
the fraction of observations in group 1. For power calculations, one
probability (p1
) must be given, and either the other probability (p2
),
an odds.ratio
, or a percent.reduction
must be given. For bpower
or
bsamsize
, any or all of the arguments may be vectors, in which case they
return a vector of powers or sample sizes. All vector arguments must have
the same length.
Given p1, p2
, ballocation
uses the method of Brittain and Schlesselman
to compute the optimal fraction of observations to be placed in group 1
that either (1) minimize the variance of the difference in two proportions,
(2) minimize the variance of the ratio of the two proportions,
(3) minimize the variance of the log odds ratio, or
(4) maximize the power of the 2-tailed test for differences. For (4)
the total sample size must be given, or the fraction optimizing
the power is not returned. The fraction for (3) is one minus the fraction
for (1).
bpower.sim
estimates power by simulations, in minimal time. By using
bpower.sim
you can see that the formulas without any continuity correction
are quite accurate, and that the power of a continuity-corrected test
is significantly lower. That's why no continuity corrections are implemented
here.
bpower(p1, p2, odds.ratio, percent.reduction, n, n1, n2, alpha=0.05)

bsamsize(p1, p2, fraction=.5, alpha=.05, power=.8)

ballocation(p1, p2, n, alpha=.05)

bpower.sim(p1, p2, odds.ratio, percent.reduction, n, n1, n2,
           alpha=0.05, nsim=10000)
p1 |
population probability in group 1 |
p2 |
probability for group 2 |
odds.ratio |
odds ratio to detect |
percent.reduction |
percent reduction in risk to detect |
n |
total sample size over the two groups. If you omit this for
|
n1 |
sample size in group 1 |
n2 |
sample size in group 2. |
alpha |
type I assertion probability |
fraction |
fraction of observations in group 1 |
power |
the desired probability of detecting a difference |
nsim |
number of simulations of binomial responses |
For bpower.sim
, all arguments must be of length one.
for bpower
, the power estimate; for bsamsize
, a vector containing
the sample sizes in the two groups; for ballocation
, a vector with
4 fractions of observations allocated to group 1, optimizing the four
criteria mentioned above. For bpower.sim
, a vector with three
elements is returned, corresponding to the simulated power and its
lower and upper 0.95 confidence limits.
Frank Harrell
Department of Biostatistics
Vanderbilt University
Fleiss JL, Tytun A, Ury HK (1980): A simple approximation for calculating sample sizes for comparing independent proportions. Biometrics 36:343–6.
Brittain E, Schlesselman JJ (1982): Optimal allocation for the comparison of proportions. Biometrics 38:1003–9.
Gordon I, Watson R (1996): The myth of continuity-corrected sample size formulae. Biometrics 52:71–6.
samplesize.bin
, chisq.test
, binconf
bpower(.1, odds.ratio=.9, n=1000, alpha=c(.01,.05))
bpower.sim(.1, odds.ratio=.9, n=1000)
bsamsize(.1, .05, power=.95)
ballocation(.1, .5, n=100)

# Plot power vs. n for various odds ratios (base prob.=.1)
n <- seq(10, 1000, by=10)
OR <- seq(.2,.9,by=.1)
plot(0, 0, xlim=range(n), ylim=c(0,1), xlab="n", ylab="Power", type="n")
for(or in OR) {
  lines(n, bpower(.1, odds.ratio=or, n=n))
  text(350, bpower(.1, odds.ratio=or, n=350)-.02, format(or))
}

# Another way to plot the same curves, but letting labcurve do the
# work, including labeling each curve at points of maximum separation
pow <- lapply(OR, function(or,n)list(x=n,y=bpower(p1=.1,odds.ratio=or,n=n)), n=n)
names(pow) <- format(OR)
labcurve(pow, pl=TRUE, xlab='n', ylab='Power')

# Contour graph for various probabilities of outcome in the control
# group, fixing the odds ratio at .8 ([p2/(1-p2) / p1/(1-p1)] = .8)
# n is varied also
p1 <- seq(.01,.99,by=.01)
n  <- seq(100,5000,by=250)
pow <- outer(p1, n, function(p1,n) bpower(p1, n=n, odds.ratio=.8))
# This forms a length(p1)*length(n) matrix of power estimates
contour(p1, n, pow)
Produces side-by-side box-percentile plots from several vectors or a list of vectors.
bpplot(..., name=TRUE, main="Box-Percentile Plot", xlab="", ylab="", srtx=0, plotopts=NULL)
... |
vectors or lists containing
numeric components (e.g., the output of |
name |
character vector of names for the groups.
Default is |
main |
main title for the plot. |
xlab |
x axis label. |
ylab |
y axis label. |
srtx |
rotation angle for x-axis labels. Default is zero. |
plotopts |
a list of other parameters to send to |
There are no returned values
A plot is created on the current graphics device.
Box-percentile plots are similar to boxplots, except box-percentile plots supply more information about the univariate distributions. At any height the width of the irregular "box" is proportional to the percentile of that height, up to the 50th percentile, and above the 50th percentile the width is proportional to 100 minus the percentile. Thus, the width at any given height is proportional to the percent of observations that are more extreme in that direction. As in boxplots, the median, 25th and 75th percentiles are marked with line segments across the box.
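The width rule can be seen directly in a minimal sketch (made-up data; not part of the bpplot code): the relative half-width of the outline at each sorted data value is the smaller of its percentile and one minus its percentile.
# Minimal sketch of the box-percentile width rule (illustration only)
set.seed(1)
x  <- rnorm(200)
xs <- sort(x)
p  <- (seq_along(xs) - 0.5) / length(xs)  # empirical percentile of each value
halfwidth <- pmin(p, 1 - p)               # grows up to the median, then shrinks
plot(halfwidth, xs, type='l', xlab='relative half-width', ylab='x')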
Jeffrey Banfield
[email protected]
Modified by F. Harrell 30Jun97
Esty WW, Banfield J: The box-percentile plot. J Statistical Software 8 No. 17, 2003.
panel.bpplot
, boxplot
, Ecdf
,
bwplot
set.seed(1) x1 <- rnorm(500) x2 <- runif(500, -2, 2) x3 <- abs(rnorm(500))-2 bpplot(x1, x2, x3) g <- sample(1:2, 500, replace=TRUE) bpplot(split(x2, g), name=c('Group 1','Group 2')) rm(x1,x2,x3,g)
For any number of cross-classification variables, bystats
returns a matrix with the sample size, number missing y
, and
fun(non-missing y)
, with the cross-classifications designated
by rows. Uses Harrell's modification of the interaction
function to produce cross-classifications. The default fun
is
mean
, and if y
is binary, the mean is labeled as
Fraction
. There is a print
method as well as a
latex
method for objects created by bystats
.
bystats2
handles the special case in which there are 2
classification variables, and places the first one in rows and the
second in columns. The print
method for bystats2
uses
the print.char.matrix
function to organize statistics
for cells into boxes.
bystats(y, ..., fun, nmiss, subset) ## S3 method for class 'bystats' print(x, ...) ## S3 method for class 'bystats' latex(object, title, caption, rowlabel, ...) bystats2(y, v, h, fun, nmiss, subset) ## S3 method for class 'bystats2' print(x, abbreviate.dimnames=FALSE, prefix.width=max(nchar(dimnames(x)[[1]])), ...) ## S3 method for class 'bystats2' latex(object, title, caption, rowlabel, ...)
y |
a binary, logical, or continuous variable or a matrix or data frame of
such variables. If |
... |
For |
v |
vertical variable for |
h |
horizontal variable for |
fun |
a function to compute on the non-missing |
nmiss |
A column containing a count of missing values is included if |
subset |
a vector of subscripts or logical values indicating the subset of data to analyze |
abbreviate.dimnames |
set to |
prefix.width |
|
title |
|
caption |
caption to pass to |
rowlabel |
|
x |
an object created by |
object |
an object created by |
For bystats
, a matrix with row names equal to the classification labels and column
names N, Missing, funlab
, where funlab
is determined from fun
.
A row is added to the end with the summary statistics computed
on all observations combined. The class of this matrix is bystats
.
For bystats2
, returns a 3-dimensional array with the last dimension
corresponding to statistics being computed. The class of the array is
bystats2
.
latex
produces a .tex
file.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
interaction
, cut
, cut2
, latex
, print.char.matrix
,
translate
## Not run: bystats(sex==2, county, city) bystats(death, race) bystats(death, cut2(age,g=5), race) bystats(cholesterol, cut2(age,g=4), sex, fun=median) bystats(cholesterol, sex, fun=quantile) bystats(cholesterol, sex, fun=function(x)c(Mean=mean(x),Median=median(x))) latex(bystats(death,race,nmiss=FALSE,subset=sex=="female"), digits=2) f <- function(y) c(Hazard=sum(y[,2])/sum(y[,1])) # f() gets the hazard estimate for right-censored data from exponential dist. bystats(cbind(d.time, death), race, sex, fun=f) bystats(cbind(pressure, cholesterol), age.decile, fun=function(y) c(Median.pressure =median(y[,1]), Median.cholesterol=median(y[,2]))) y <- cbind(pressure, cholesterol) bystats(y, age.decile, fun=function(y) apply(y, 2, median)) # same result as last one bystats(y, age.decile, fun=function(y) apply(y, 2, quantile, c(.25,.75))) # The last one computes separately the 0.25 and 0.75 quantiles of 2 vars. latex(bystats2(death, race, sex, fun=table)) ## End(Not run)
Capitalizes the first letter of each element of the string vector.
capitalize(string)
string |
String to be capitalized |
Returns a vector of characters with the first letter capitalized
Charles Dupont
capitalize(c("Hello", "bob", "daN"))
Uses the method of Peterson and George to compute the power of an
interaction test in a 2 x 2 setup in which all 4 distributions are
exponential. This will be the same as the power of the Cox model
test if assumptions hold. The test is 2-tailed.
The duration of accrual is specified
(constant accrual is assumed), as is the minimum follow-up time.
The maximum follow-up time is then accrual + tmin
. Treatment
allocation is assumed to be 1:1.
ciapower(tref, n1, n2, m1c, m2c, r1, r2, accrual, tmin, alpha=0.05, pr=TRUE)
tref |
time at which mortalities estimated |
n1 |
total sample size, stratum 1 |
n2 |
total sample size, stratum 2 |
m1c |
tref-year mortality, stratum 1 control |
m2c |
tref-year mortality, stratum 2 control |
r1 |
% reduction in |
r2 |
% reduction in |
accrual |
duration of accrual period |
tmin |
minimum follow-up time |
alpha |
type I error probability |
pr |
set to |
power
prints
Frank Harrell
Department of Biostatistics
Vanderbilt University
Peterson B, George SL: Controlled Clinical Trials 14:511–522; 1993.
# Find the power of a race x treatment test. 25% of patients will # be non-white and the total sample size is 14000. # Accrual is for 1.5 years and minimum follow-up is 5y. # Reduction in 5-year mortality is 15% for whites, 0% or -5% for # non-whites. 5-year mortality for control subjects is assumed to # be 0.18 for whites, 0.23 for non-whites. n <- 14000 for(nonwhite.reduction in c(0,-5)) { cat("\n\n\n% Reduction in 5-year mortality for non-whites:", nonwhite.reduction, "\n\n") pow <- ciapower(5, .75*n, .25*n, .18, .23, 15, nonwhite.reduction, 1.5, 5) cat("\n\nPower:",format(pow),"\n") }
Takes a set of coordinates in any of the 5 coordinate systems (usr, plt, fig, dev, or tdev) and returns the same points in all 5 coordinate systems.
cnvrt.coords(x, y = NULL, input = c("usr", "plt", "fig", "dev","tdev"))
x |
Vector, Matrix, or list of x coordinates (or x and y coordinates), NA's allowed. |
y |
y coordinates (if |
input |
Character scalar indicating the coordinate system of the input points. |
Every plot has 5 coordinate systems:
usr (User): the coordinate system of the data; this is shown by the tick marks and axis labels.
plt (Plot): Plot area, coordinates range from 0 to 1 with 0 corresponding to the x and y axes and 1 corresponding to the top and right of the plot area. Margins of the plot correspond to plot coordinates less than 0 or greater than 1.
fig (Figure): Figure area, coordinates range from 0 to 1 with 0 corresponding to the bottom and left edges of the figure (including margins, label areas) and 1 corresponds to the top and right edges. fig and dev coordinates will be identical if there is only 1 figure area on the device (layout, mfrow, or mfcol has not been used).
dev (Device): Device area, coordinates range from 0 to 1 with 0 corresponding to the bottom and left of the device region within the outer margins and 1 corresponding to the top and right of the region within the outer margins. If the outer margins are all set to 0 then tdev and dev should be identical.
tdev (Total Device): Total Device area, coordinates range from 0 to 1 with 0 corresponding to the bottom and left edges of the device (piece of paper, window on screen) and 1 corresponds to the top and right edges.
A list with 5 components, each component is a list with vectors named x and y. The 5 sublists are:
usr |
The coordinates of the input points in usr (User) coordinates. |
plt |
The coordinates of the input points in plt (Plot) coordinates. |
fig |
The coordinates of the input points in fig (Figure) coordinates. |
dev |
The coordinates of the input points in dev (Device) coordinates. |
tdev |
The coordinates of the input points in tdev (Total Device) coordinates. |
You must provide both x and y, but one of them may be NA
.
This function is becoming deprecated with the new functions
grconvertX
and grconvertY
in R version 2.7.0 and beyond.
These new functions use the correct coordinate system names and have
more coordinate systems available; you should start using them instead.
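For reference, a rough equivalent using the newer base functions; the name mapping shown in the comment is an approximation and is not taken from this help page.
# Approximate correspondence: usr -> "user", plt -> "npc", fig -> "nfc",
# dev -> "nic", tdev -> "ndc"
plot(1:10, 1:10)
grconvertX(3, from='user', to='nfc')   # x of the point (3, 5) in figure coordinates
grconvertY(5, from='user', to='nfc')
grconvertX(0.5, from='nfc', to='user') # and back the other way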
Greg Snow [email protected]
par
specifically 'usr','plt', and 'fig'. Also
'xpd' for plotting outside of the plotting region and 'mfrow' and
'mfcol' for multi figure plotting. subplot
,
grconvertX
and grconvertY
in R2.7.0 and later
old.par <- par(no.readonly=TRUE) par(mfrow=c(2,2),xpd=NA) # generate some sample data tmp.x <- rnorm(25, 10, 2) tmp.y <- rnorm(25, 50, 10) tmp.z <- rnorm(25, 0, 1) plot( tmp.x, tmp.y) # draw a diagonal line across the plot area tmp1 <- cnvrt.coords( c(0,1), c(0,1), input='plt' ) lines(tmp1$usr, col='blue') # draw a diagonal line accross figure region tmp2 <- cnvrt.coords( c(0,1), c(1,0), input='fig') lines(tmp2$usr, col='red') # save coordinate of point 1 and y value near top of plot for future plots tmp.point1 <- cnvrt.coords(tmp.x[1], tmp.y[1]) tmp.range1 <- cnvrt.coords(NA, 0.98, input='plt') # make a second plot and draw a line linking point 1 in each plot plot(tmp.y, tmp.z) tmp.point2 <- cnvrt.coords( tmp.point1$dev, input='dev' ) arrows( tmp.y[1], tmp.z[1], tmp.point2$usr$x, tmp.point2$usr$y, col='green') # draw another plot and add rectangle showing same range in 2 plots plot(tmp.x, tmp.z) tmp.range2 <- cnvrt.coords(NA, 0.02, input='plt') tmp.range3 <- cnvrt.coords(NA, tmp.range1$dev$y, input='dev') rect( 9, tmp.range2$usr$y, 11, tmp.range3$usr$y, border='yellow') # put a label just to the right of the plot and # near the top of the figure region. text( cnvrt.coords(1.05, NA, input='plt')$usr$x, cnvrt.coords(NA, 0.75, input='fig')$usr$y, "Label", adj=0) par(mfrow=c(1,1)) ## create a subplot within another plot (see also subplot) plot(1:10, 1:10) tmp <- cnvrt.coords( c( 1, 4, 6, 9), c(6, 9, 1, 4) ) par(plt = c(tmp$dev$x[1:2], tmp$dev$y[1:2]), new=TRUE) hist(rnorm(100)) par(fig = c(tmp$dev$x[3:4], tmp$dev$y[3:4]), new=TRUE) hist(rnorm(100)) par(old.par)
These functions are used on ggplot2
objects or as layers when
building a ggplot2
object, and to facilitate use of
gridExtra
. colorFacet
colors the thin
rectangles used to separate panels created by facet_grid
(and
probably by facet_wrap
).
A better approach may be found at https://stackoverflow.com/questions/28652284/.
arrGrob
is a front-end to gridExtra::arrangeGrob
that
allows for proper printing. See
https://stackoverflow.com/questions/29062766/store-output-from-gridextragrid-arrange-into-an-object/. The arrGrob
print
method calls grid::grid.draw
.
colorFacet(g, col = adjustcolor("blue", alpha.f = 0.3)) arrGrob(...) ## S3 method for class 'arrGrob' print(x, ...)
g |
a |
col |
color for facet separator rectangles |
... |
passed to |
x |
an object created by |
Sandy Muspratt
## Not run: require(ggplot2) s <- summaryP(age + sex ~ region + treatment) colorFacet(ggplot(s)) # prints directly # arrGrob is called by rms::ggplot.Predict and others ## End(Not run)
Combine Infrequent Levels of a Categorical Variable
combine.levels( x, minlev = 0.05, m, ord = is.ordered(x), plevels = FALSE, sep = "," )
x |
a factor, 'ordered' factor, or numeric or character variable that will be turned into a 'factor' |
minlev |
the minimum proportion of observations in a cell before that cell is combined with one or more cells. If more than one cell has fewer than minlev*n observations, all such cells are combined into a new cell labeled '"OTHER"'. Otherwise, the lowest frequency cell is combined with the next lowest frequency cell, and the level name is the combination of the two old levels. When 'ord=TRUE', combinations happen only for consecutive levels. |
m |
alternative to 'minlev', is the minimum number of observations in a cell before it will be combined with others |
ord |
set to 'TRUE' to treat 'x' as if it were an ordered factor, which allows only consecutive levels to be combined |
plevels |
by default 'combine.levels' pools low-frequency levels into a category named 'OTHER' when 'x' is not ordered and 'ord=FALSE'. To instead name this category the concatenation of all the pooled level names, separated by a comma, set 'plevels=TRUE'. |
sep |
the separator for concatenating levels when 'plevels=TRUE' |
After turning 'x' into a 'factor' if it is not one already, combines levels of 'x' whose frequency falls below a specified relative frequency 'minlev' or absolute count 'm'. When 'x' is not treated as ordered, all of the small frequency levels are combined into '"OTHER"', unless 'plevels=TRUE'. When 'ord=TRUE' or 'x' is an ordered factor, only consecutive levels are combined. New levels are constructed by concatenating the levels with 'sep' as a separator. This is useful when comparing ordinal regression with polytomous (multinomial) regression and there are too many categories for polytomous regression. 'combine.levels' is also useful when assumptions of ordinal models are being checked empirically by computing exceedance probabilities for various cutoffs of the dependent variable.
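As a quick check of which levels fall below the threshold before pooling, a small illustration with made-up data (not from the package examples):
# Levels with relative frequency below minlev are the candidates for pooling
x   <- c(rep('A', 1), rep('B', 30), rep('C', 40), rep('D', 2))
tab <- table(x) / length(x)
names(tab)[tab < 0.05]           # here 'A' and 'D' fall below 5%
combine.levels(x, minlev=0.05)   # so they should be pooled into 'OTHER' (ord=FALSE)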
a factor variable, or if 'ord=TRUE' an ordered factor variable
Frank Harrell
x <- c(rep('A', 1), rep('B', 3), rep('C', 4), rep('D',1), rep('E',1)) combine.levels(x, m=3) combine.levels(x, m=3, plevels=TRUE) combine.levels(x, ord=TRUE, m=3) x <- c(rep('A', 1), rep('B', 3), rep('C', 4), rep('D',1), rep('E',1), rep('F',1)) combine.levels(x, ord=TRUE, m=3)
Generates a plotly attribute plot given a series of possibly overlapping binary variables
combplotp( formula, data = NULL, subset, na.action = na.retain, vnames = c("labels", "names"), includenone = FALSE, showno = FALSE, maxcomb = NULL, minfreq = NULL, N = NULL, pos = function(x) 1 * (tolower(x) %in% c("true", "yes", "y", "positive", "+", "present", "1")), obsname = "subjects", ptsize = 35, width = NULL, height = NULL, ... )
formula |
a formula containing all the variables to be cross-tabulated, on the formula's right hand side. There is no left hand side variable. If |
data |
input data frame. If none is specified the data are assumed to come from the parent frame. |
subset |
an optional subsetting expression applied to |
na.action |
see |
vnames |
set to |
includenone |
set to |
showno |
set to |
maxcomb |
maximum number of combinations to display |
minfreq |
if specified, any combination having a frequency less than this will be omitted from the display |
N |
set to an integer to override the global denominator, instead of using the number of rows in the data |
pos |
a function of vector returning a logical vector with |
obsname |
character string noun describing observations, default is |
ptsize |
point size, defaults to 35 |
width |
width of |
height |
height of |
... |
optional arguments to pass to |
Similar to the UpSetR
package, draws a special dot chart sometimes called an attribute plot that depicts all possible combination of the binary variables. By default a positive value, indicating that a certain condition pertains for a subject, is any of logical TRUE
, numeric 1, "yes"
, "y"
, "positive"
, "+"
or "present"
value, and all others are considered negative. The user can override this determination by specifying her own pos
function. Case is ignored in the variable values.
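For data coded differently from these defaults, a custom pos function can be supplied; a hypothetical example (the coding shown is made up):
# Hypothetical coding in which only 'abnormal' (any case) or 2 counts as positive,
# written in the same style as the default pos argument
mypos <- function(x) 1 * (tolower(as.character(x)) %in% c('abnormal', '2'))
mypos(c('Abnormal', 'normal', 2, 0))             # 1 0 1 0
# combplotp(~ x1 + x2 + x3, data=d, pos=mypos)   # then pass it to combplotp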
The plot uses solid dots arranged in a vertical line to indicate which combination of conditions is being considered. Frequencies of all possible combinations are shown above the dot chart. Marginal frequencies of positive values for the input variables are shown to the left of the dot chart. More information for all three of these component symbols is provided in hover text.
Variables are sorted in descending order of marginal frequencies, and likewise for combinations of variables.
plotly
object
Frank Harrell
if (requireNamespace("plotly")) { g <- function() sample(0:1, n, prob=c(1 - p, p), replace=TRUE) set.seed(2); n <- 100; p <- 0.5 x1 <- g(); label(x1) <- 'A long label for x1 that describes it' x2 <- g() x3 <- g(); label(x3) <- 'This is<br>a label for x3' x4 <- g() combplotp(~ x1 + x2 + x3 + x4, showno=TRUE, includenone=TRUE) n <- 1500; p <- 0.05 pain <- g() anxiety <- g() depression <- g() soreness <- g() numbness <- g() tiredness <- g() sleepiness <- g() combplotp(~ pain + anxiety + depression + soreness + numbness + tiredness + sleepiness, showno=TRUE) }
Create imputed dataset(s) using transcan
and aregImpute
objects
completer(a, nimpute, oneimpute = FALSE, mydata)
a |
An object of class |
nimpute |
A numeric vector between 1 and |
oneimpute |
A logical vector. When set to |
mydata |
A data frame in which its missing values will be imputed. |
Similar in function to mice::complete
, this function uses transcan
and aregImpute
objects to impute missing data
and returns the completed dataset(s) as a data frame or a list.
It assumes that transcan
is used for single regression imputation.
A single or a list of completed dataset(s).
Yong-Hao Pua, Singapore General Hospital
## Not run: mtcars$hp[1:5] <- NA mtcars$wt[1:10] <- NA myrform <- ~ wt + hp + I(carb) mytranscan <- transcan( myrform, data = mtcars, imputed = TRUE, pl = FALSE, pr = FALSE, trantab = TRUE, long = TRUE) myareg <- aregImpute(myrform, data = mtcars, x=TRUE, n.impute = 5) completer(mytranscan) # single completed dataset completer(myareg, 3, oneimpute = TRUE) # single completed dataset based on the `n.impute`th set of multiple imputation completer(myareg, 3) # list of completed datasets based on first `nimpute` sets of multiple imputation completer(myareg) # list of completed datasets based on all available sets of multiple imputation # To get a stacked data frame of all completed datasets use # do.call(rbind, completer(myareg, data=mydata)) # or use rbindlist in data.table ## End(Not run)
Merges an object by the names of its elements. Elements in value that do not exist in x are inserted into x, and elements of x that also exist in value are replaced by the corresponding value elements unless protect is TRUE.
consolidate(x, value, protect, ...) ## Default S3 method: consolidate(x, value, protect=FALSE, ...) consolidate(x, protect, ...) <- value
x |
named list or vector |
value |
named list or vector |
protect |
logical; should elements in |
... |
currently does nothing; included if ever want to make generic. |
Charles Dupont
x <- 1:5 names(x) <- LETTERS[x] y <- 6:10 names(y) <- LETTERS[y-2] x # c(A=1,B=2,C=3,D=4,E=5) y # c(D=6,E=7,F=8,G=9,H=10) consolidate(x, y) # c(A=1,B=2,C=3,D=6,E=7,F=8,G=9,H=10) consolidate(x, y, protect=TRUE) # c(A=1,B=2,C=3,D=4,E=5,F=8,G=9,H=10)
contents
is a generic method for which contents.data.frame
is currently the only method. contents.data.frame
creates an
object containing the following attributes of the variables
from a data frame: names, labels (if any), units (if any), number of
factor levels (if any), factor levels,
class, storage mode, and number of NAs. print.contents.data.frame
will print the results, with options for sorting the variables.
html.contents.data.frame
creates HTML code for displaying the
results. This code has hyperlinks so that if the user clicks on the
number of levels the browser jumps to the correct part of a table of
factor levels for all the factor
variables. If long labels are
present ("longlabel"
attributes on variables), these are printed
at the bottom and the html
method links to them through the
regular labels. Variables having the same levels
in the same
order have the levels factored out for brevity.
contents.list
prints a directory of datasets when
sasxport.get
imported more than one SAS dataset.
If options(prType='html')
is in effect, calling print
on
an object that is the contents of a data frame will result in
rendering the HTML version. If run from the console a browser window
will open.
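A minimal sketch of this behavior (dfr is the data frame created in the Examples below):
# With prType='html', printing the contents object renders HTML;
# from the console this opens a browser window
options(prType='html')
contents(dfr)          # rendered as HTML instead of plain text
options(prType=NULL)   # restore the default text printing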
contents(object, ...) ## S3 method for class 'data.frame' contents(object, sortlevels=FALSE, id=NULL, range=NULL, values=NULL, ...) ## S3 method for class 'contents.data.frame' print(x, sort=c('none','names','labels','NAs'), prlevels=TRUE, maxlevels=Inf, number=FALSE, ...) ## S3 method for class 'contents.data.frame' html(object, sort=c('none','names','labels','NAs'), prlevels=TRUE, maxlevels=Inf, levelType=c('list','table'), number=FALSE, nshow=TRUE, ...) ## S3 method for class 'list' contents(object, dslabels, ...) ## S3 method for class 'contents.list' print(x, sort=c('none','names','labels','NAs','vars'), ...)
object |
a data frame. For |
sortlevels |
set to |
id |
an optional subject ID variable name that if present in
|
range |
an optional variable name that if present in |
values |
an optional variable name that if present in
|
x |
an object created by |
sort |
Default is to print the variables in their original order in the
data frame. Specify one of
|
prlevels |
set to |
maxlevels |
maximum number of levels to print for a |
number |
set to |
nshow |
set to |
levelType |
By default, bullet lists of category levels are
constructed in html. Set |
... |
arguments passed from |
dslabels |
named vector of SAS dataset labels, created for
example by |
an object of class "contents.data.frame"
or
"contents.list"
. For the html
method is an html
character vector object.
Frank Harrell
Vanderbilt University
[email protected]
describe
, html
, upData
,
extractlabs
, hlab
set.seed(1) dfr <- data.frame(x=rnorm(400),y=sample(c('male','female'),400,TRUE), stringsAsFactors=TRUE) contents(dfr) dfr <- upData(dfr, labels=c(x='Label for x', y='Label for y')) attr(dfr$x, 'longlabel') <- 'A very long label for x that can continue onto multiple long lines of text' k <- contents(dfr) print(k, sort='names', prlevels=FALSE) ## Not run: html(k) html(contents(dfr)) # same result latex(k$contents) # latex.default just the main information ## End(Not run)
Assumes exponential distributions for both treatment groups. Uses the George-Desu method along with formulas of Schoenfeld that allow estimation of the expected number of events in the two groups. To allow for drop-ins (noncompliance to control therapy, crossover to intervention) and noncompliance of the intervention, the method of Lachin and Foulkes is used.
cpower(tref, n, mc, r, accrual, tmin, noncomp.c=0, noncomp.i=0, alpha=0.05, nc, ni, pr=TRUE)
tref |
time at which mortalities estimated |
n |
total sample size (both groups combined). If allocation is unequal
so that there are not |
mc |
tref-year mortality, control |
r |
% reduction in |
accrual |
duration of accrual period |
tmin |
minimum follow-up time |
noncomp.c |
% non-compliant in control group (drop-ins) |
noncomp.i |
% non-compliant in intervention group (non-adherers) |
alpha |
type I error probability. A 2-tailed test is assumed. |
nc |
number of subjects in control group |
ni |
number of subjects in intervention group. |
pr |
set to |
For handling noncompliance, uses a modification of formula (5.4) of
Lachin and Foulkes. Their method is based on a test for the difference
in two hazard rates, whereas cpower
is based on testing the difference
in two log hazards. It is assumed here that the same correction factor
can be approximately applied to the log hazard ratio as Lachin and Foulkes applied to
the hazard difference.
Note that Schoenfeld approximates the variance
of the log hazard ratio by 4/m
, where m
is the total number of events,
whereas the George-Desu method uses the slightly better 1/m1 + 1/m2
.
Power from this function will thus differ slightly from that obtained with
the SAS samsizc
program.
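To see how close the two approximations are, a small numerical illustration (not code taken from cpower):
# Variance of the log hazard ratio: Schoenfeld's 4/m versus George-Desu's
# 1/m1 + 1/m2, where m = m1 + m2 is the total number of events
m1 <- c(50, 60, 80); m2 <- c(50, 40, 20)
cbind(m1, m2, schoenfeld=4/(m1 + m2), george.desu=1/m1 + 1/m2)
# the two agree when m1 = m2 and diverge as the event split becomes more unequal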
power
prints
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
Peterson B, George SL: Controlled Clinical Trials 14:511–522; 1993.
Lachin JM, Foulkes MA: Biometrics 42:507–519; 1986.
Schoenfeld D: Biometrics 39:499–503; 1983.
#In this example, 4 plots are drawn on one page, one plot for each #combination of noncompliance percentage. Within a plot, the #5-year mortality % in the control group is on the x-axis, and #separate curves are drawn for several % reductions in mortality #with the intervention. The accrual period is 1.5y, with all #patients followed at least 5y and some 6.5y. par(mfrow=c(2,2),oma=c(3,0,3,0)) morts <- seq(10,25,length=50) red <- c(10,15,20,25) for(noncomp in c(0,10,15,-1)) { if(noncomp>=0) nc.i <- nc.c <- noncomp else {nc.i <- 25; nc.c <- 15} z <- paste("Drop-in ",nc.c,"%, Non-adherence ",nc.i,"%",sep="") plot(0,0,xlim=range(morts),ylim=c(0,1), xlab="5-year Mortality in Control Patients (%)", ylab="Power",type="n") title(z) cat(z,"\n") lty <- 0 for(r in red) { lty <- lty+1 power <- morts i <- 0 for(m in morts) { i <- i+1 power[i] <- cpower(5, 14000, m/100, r, 1.5, 5, nc.c, nc.i, pr=FALSE) } lines(morts, power, lty=lty) } if(noncomp==0)legend(18,.55,rev(paste(red,"% reduction",sep="")), lty=4:1,bty="n") } mtitle("Power vs Non-Adherence for Main Comparison", ll="alpha=.05, 2-tailed, Total N=14000",cex.l=.8) # # Point sample size requirement vs. mortality reduction # Root finder (uniroot()) assumes needed sample size is between # 1000 and 40000 # nc.i <- 25; nc.c <- 15; mort <- .18 red <- seq(10,25,by=.25) samsiz <- red i <- 0 for(r in red) { i <- i+1 samsiz[i] <- uniroot(function(x) cpower(5, x, mort, r, 1.5, 5, nc.c, nc.i, pr=FALSE) - .8, c(1000,40000))$root } samsiz <- samsiz/1000 par(mfrow=c(1,1)) plot(red, samsiz, xlab='% Reduction in 5-Year Mortality', ylab='Total Sample Size (Thousands)', type='n') lines(red, samsiz, lwd=2) title('Sample Size for Power=0.80\nDrop-in 15%, Non-adherence 25%') title(sub='alpha=0.05, 2-tailed', adj=0)
Cs
makes a vector of character strings from a list of valid R
names. .q
is similar but also makes use of the names of arguments.
Cs(...) .q(...)
... |
any number of names separated by commas. For |
character string vector. For .q
there will be a names
attribute to the vector if any names appeared in ....
sys.frame, deparse
Cs(a,cat,dog) # subset.data.frame <- dataframe[,Cs(age,sex,race,bloodpressure,height)] .q(a, b, c, 'this and that') .q(dog=a, giraffe=b, cat=c)
Read comma-separated text data files, allowing optional translation
to lower case for variable names after making them valid S names.
There is a facility for reading long variable labels as one of the
rows. If labels are not specified and a final variable name is not
the same as that in the header, the original variable name is saved as
a variable label. Uses read.csv
if the data.table
package is not in effect, otherwise calls fread
.
csv.get(file, lowernames=FALSE, datevars=NULL, datetimevars=NULL, dateformat='%F', fixdates=c('none','year'), comment.char="", autodate=TRUE, allow=NULL, charfactor=FALSE, sep=',', skip=0, vnames=NULL, labels=NULL, text=NULL, ...)
file |
the file name for import. |
lowernames |
set this to |
datevars |
character vector of names (after |
datetimevars |
character vector of names (after |
dateformat |
for |
fixdates |
for any of the variables listed in |
comment.char |
a character vector of length one containing a single character or an empty string. Use '""' to turn off the interpretation of comments altogether. |
autodate |
Set to true to allow function to guess at which variables are dates |
allow |
a vector of characters allowed by R that should not be converted to periods in variable names. By default, underscores in variable names are converted to periods as with R before version 1.9. |
charfactor |
set to |
sep |
field separator, defaults to comma |
skip |
number of records to skip before data start. Required if
|
vnames |
number of row containing variable names, default is one |
labels |
number of row containing variable labels, default is no labels |
text |
a character string containing the |
... |
arguments to pass to |
csv.get
reads comma-separated text data files, allowing optional
translation to lower case for variable names after making them valid S
names. Original possibly non-legal names are taken to be variable
labels if labels
is not specified. Character or factor
variables containing dates can be converted to date variables.
cleanup.import
is invoked to finish the job.
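A small self-contained sketch of the vnames/labels mechanism, using a made-up temporary file (not one of the package examples):
# Row 1 holds variable names, row 2 holds long labels, data begin on row 3 (skip=2)
tf <- tempfile(fileext='.csv')
writeLines(c('age,sbp', 'Age in years,Systolic BP', '41,120', '57,135'), tf)
dat <- csv.get(tf, vnames=1, labels=2, skip=2)
label(dat$age)   # should now be "Age in years"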
a new data frame.
Frank Harrell, Vanderbilt University
sas.get
, data.frame
,
cleanup.import
, read.csv
,
strptime
, POSIXct
, Date
,
fread
## Not run: dat <- csv.get('myfile.csv') # Read a csv file with junk in the first row, variable names in the # second, long variable labels in the third, and junk in the 4th row dat <- csv.get('myfile.csv', vnames=2, labels=3, skip=4) ## End(Not run)
curveRep
finds representative curves from a
relatively large collection of curves. The curves usually represent
time-response profiles as in serial (longitudinal or repeated) data
with possibly unequal time points and greatly varying sample sizes per
subject. After excluding records containing missing x
or
y
, records are first stratified into kn
groups having similar
sample sizes per curve (subject). Within these strata, curves are
next stratified according to the distribution of x
points per
curve (typically measurement times per subject). The
clara
clustering/partitioning function is used
to do this, clustering on one, two, or three x
characteristics
depending on the minimum sample size in the current interval of sample
size. If the interval has a minimum number of unique values
of
one, clustering is done on the single x
values. If the minimum
number of unique x
values is two, clustering is done to create
groups that are similar on both min(x)
and max(x)
. For
groups containing no fewer than three unique x
values,
clustering is done on the trio of values min(x)
, max(x)
,
and the longest gap between any successive x
. Then within
sample size and x
distribution strata, clustering of
time-response profiles is based on p
values of y
all
evaluated at the same p
equally-spaced x
's within the
stratum. An option allows per-curve data to be smoothed with
lowess
before proceeding. Outer x
values are
taken as extremes of x
across all curves within the stratum.
Linear interpolation within curves is used to estimate y
at the
grid of x
's. For curves within the stratum that do not extend
to the most extreme x
values in that stratum, extrapolation
uses flat lines from the observed extremes in the curve unless
extrap=TRUE
. The p
y
values are clustered using
clara
.
print
and plot
methods show results. By specifying an
auxiliary idcol
variable to plot
, other variables such
as treatment may be depicted to allow the analyst to determine for
example whether subjects on different treatments are assigned to
different time-response profiles. To write the frequencies of a
variable such as treatment in the upper left corner of each panel
(instead of the grand total number of clusters in that panel), specify
freq
.
curveSmooth
takes a set of curves and smooths them using
lowess
. If the number of unique x
points in a curve is
less than p
, the smooth is evaluated at the unique x
values. Otherwise it is evaluated at an equally spaced set of
x
points over the observed range. If fewer than 3 unique
x
values are in a curve, those points are used and smoothing is not done.
curveRep(x, y, id, kn = 5, kxdist = 5, k = 5, p = 5, force1 = TRUE, metric = c("euclidean", "manhattan"), smooth=FALSE, extrap=FALSE, pr=FALSE) ## S3 method for class 'curveRep' print(x, ...) ## S3 method for class 'curveRep' plot(x, which=1:length(res), method=c('all','lattice','data'), m=NULL, probs=c(.5, .25, .75), nx=NULL, fill=TRUE, idcol=NULL, freq=NULL, plotfreq=FALSE, xlim=range(x), ylim=range(y), xlab='x', ylab='y', colorfreq=FALSE, ...) curveSmooth(x, y, id, p=NULL, pr=TRUE)
x |
a numeric vector, typically measurement times.
For |
y |
a numeric vector of response values |
id |
a vector of curve (subject) identifiers, the same length as
|
kn |
number of curve sample size groups to construct.
|
kxdist |
maximum number of x-distribution clusters to derive
using |
k |
maximum number of x-y profile clusters to derive using |
p |
number of |
force1 |
By default if any curves have only one point, all curves
consisting of one point will be placed in a separate stratum. To
prevent this separation, set |
metric |
see |
smooth |
By default, linear interpolation is used on raw data to
obtain |
extrap |
set to |
pr |
set to |
which |
an integer vector specifying which sample size intervals
to plot. Must be specified if |
method |
The default makes individual plots of possibly all
x-distribution by sample size by cluster combinations. Fewer may be
plotted by specifying |
m |
the number of curves in a cluster to randomly sample if there
are more than |
nx |
applies if |
probs |
3-vector of probabilities with the central quantile first. Default uses quartiles. |
fill |
for |
idcol |
a named vector to be used as a table lookup for color
assignments (does not apply when |
freq |
a named vector to be used as a table lookup for a grouping
variable such as treatment. The names are curve |
plotfreq |
set to |
colorfreq |
set to |
xlim , ylim , xlab , ylab
|
plotting parameters. Default ranges are
the ranges in the entire set of raw data given to |
... |
arguments passed to other functions. |
In the graph titles for the default graphic output, n
refers to the
minimum sample size, x
refers to the sequential x-distribution
cluster, and c
refers to the sequential x-y profile cluster. Graphs
from method = "lattice"
are produced by
xyplot
and in the panel titles
distribution
refers to the x-distribution stratum and
cluster
refers to the x-y profile cluster.
a list of class "curveRep"
with the following elements
res |
a hierarchical list first split by sample size intervals,
then by x distribution clusters, then containing a vector of cluster
numbers with |
ns |
a table of frequencies of sample sizes per curve after
removing |
nomit |
total number of records excluded due to |
missfreq |
a table of frequencies of number of |
ncuts |
cut points for sample size intervals |
kn |
number of sample size intervals |
kxdist |
number of clusters on x distribution |
k |
number of clusters of curves within sample size and distribution groups |
p |
number of points at which to evaluate each curve for clustering |
x |
|
y |
|
id |
input data after removing |
curveSmooth
returns a list with elements x,y,id
.
The references describe other methods for deriving
representative curves, but those methods were not used here. The last
reference which used a cluster analysis on principal components
motivated curveRep
however. The kml
package does k-means clustering of longitudinal data with imputation.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
Segal M. (1994): Representative curves for longitudinal data via regression trees. J Comp Graph Stat 3:214-233.
Jones MC, Rice JA (1992): Displaying the important features of large collections of similar curves. Am Statistician 46:140-145.
Zheng X, Simpson JA, et al (2005): Data from a study of effectiveness suggested potential prognostic factors related to the patterns of shoulder pain. J Clin Epi 58:823-830.
## Not run: # Simulate 200 curves with per-curve sample sizes ranging from 1 to 10 # Make curves with odd-numbered IDs have an x-distribution that is random # uniform [0,1] and those with even-numbered IDs have an x-dist. that is # half as wide but still centered at 0.5. Shift y values higher with # increasing IDs set.seed(1) N <- 200 nc <- sample(1:10, N, TRUE) id <- rep(1:N, nc) x <- y <- id for(i in 1:N) { x[id==i] <- if(i %% 2) runif(nc[i]) else runif(nc[i], c(.25, .75)) y[id==i] <- i + 10*(x[id==i] - .5) + runif(nc[i], -10, 10) } w <- curveRep(x, y, id, kxdist=2, p=10) w par(ask=TRUE, mfrow=c(4,5)) plot(w) # show everything, profiles going across par(mfrow=c(2,5)) plot(w,1) # show n=1 results # Use a color assignment table, assigning low curves to green and # high to red. Unique curve (subject) IDs are the names of the vector. cols <- c(rep('green', N/2), rep('red', N/2)) names(cols) <- as.character(1:N) plot(w, 3, idcol=cols) par(ask=FALSE, mfrow=c(1,1)) plot(w, 1, 'lattice') # show n=1 results plot(w, 3, 'lattice') # show n=4-5 results plot(w, 3, 'lattice', idcol=cols) # same but different color mapping plot(w, 3, 'lattice', m=1) # show a single "representative" curve # Show median, 10th, and 90th percentiles of supposedly representative curves plot(w, 3, 'lattice', m='quantiles', probs=c(.5,.1,.9)) # Same plot but with much less grouping of x variable plot(w, 3, 'lattice', m='quantiles', probs=c(.5,.1,.9), nx=2) # Use ggplot2 for one sample size interval z <- plot(w, 2, 'data') require(ggplot2) ggplot(z, aes(x, y, color=curve)) + geom_line() + facet_grid(distribution ~ cluster) + theme(legend.position='none') + labs(caption=z$ninterval[1]) # Smooth data before profiling. This allows later plotting to plot # smoothed representative curves rather than raw curves (which # specifying smooth=TRUE to curveRep would do, if curveSmooth was not used) d <- curveSmooth(x, y, id) w <- with(d, curveRep(x, y, id)) # Example to show that curveRep can cluster profiles correctly when # there is no noise. In the data there are four profiles - flat, flat # at a higher mean y, linearly increasing then flat, and flat at the # first height except for a sharp triangular peak set.seed(1) x <- 0:100 m <- length(x) profile <- matrix(NA, nrow=m, ncol=4) profile[,1] <- rep(0, m) profile[,2] <- rep(3, m) profile[,3] <- c(0:3, rep(3, m-4)) profile[,4] <- c(0,1,3,1,rep(0,m-4)) col <- c('black','blue','green','red') matplot(x, profile, type='l', col=col) xeval <- seq(0, 100, length.out=5) s <- x matplot(x[s], profile[s,], type='l', col=col) id <- rep(1:100, each=m) X <- Y <- id cols <- character(100) names(cols) <- as.character(1:100) for(i in 1:100) { s <- id==i X[s] <- x j <- sample(1:4,1) Y[s] <- profile[,j] cols[i] <- col[j] } table(cols) yl <- c(-1,4) w <- curveRep(X, Y, id, kn=1, kxdist=1, k=4) plot(w, 1, 'lattice', idcol=cols, ylim=yl) # Found 4 clusters but two have same profile w <- curveRep(X, Y, id, kn=1, kxdist=1, k=3) plot(w, 1, 'lattice', idcol=cols, freq=cols, plotfreq=TRUE, ylim=yl) # Incorrectly combined black and red because default value p=5 did # not result in different profiles at x=xeval w <- curveRep(X, Y, id, kn=1, kxdist=1, k=4, p=40) plot(w, 1, 'lattice', idcol=cols, ylim=yl) # Found correct clusters because evaluated curves at 40 equally # spaced points and could find the sharp triangular peak in profile 4 ## End(Not run)
Function like cut but left endpoints are inclusive and labels are of
the form [lower, upper)
, except that the last interval is [lower,upper]
.
If cuts are given, will by default make sure that cuts include entire
range of x
.
Also, if cuts are not given, will cut x
into quantile groups
(g
given) or groups
with a given minimum number of observations (m
). Whereas cut creates a
category object, cut2
creates a factor object.
cut2(x, cuts, m=150, g, levels.mean=FALSE, digits, minmax=TRUE, oneval=TRUE, onlycuts=FALSE, formatfun=format, ...)
x |
numeric vector to classify into intervals |
cuts |
cut points |
m |
desired minimum number of observations in a group. The algorithm does
not guarantee that all groups will have at least |
g |
number of quantile groups |
levels.mean |
set to |
digits |
number of significant digits to use in constructing levels. Default is 3
(5 if |
minmax |
if cuts is specified but |
oneval |
if an interval contains only one unique value, the interval will be
labeled with the formatted version of that value instead of the
interval endpoints, unless |
onlycuts |
set to |
formatfun |
formatting function, supports formula notation (if |
... |
additional arguments passed to |
a factor variable with levels of the form [a,b)
or formatted means
(character strings) unless onlycuts
is TRUE
in which case
a numeric vector is returned
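A brief sketch of these two return modes with made-up data (not part of the original examples):
set.seed(2)
x <- runif(1000, 0, 100)
cut2(x, g=4, onlycuts=TRUE)            # numeric cut points, including the range of x
table(cut2(x, g=4, levels.mean=TRUE))  # levels labeled by within-group means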
set.seed(1) x <- runif(1000, 0, 100) z <- cut2(x, c(10,20,30)) table(z) table(cut2(x, g=10)) # quantile groups table(cut2(x, m=50)) # group x into intervals with at least 50 obs.
This help file contains a template for importing data to create an R data frame, correcting some problems resulting from the import and making the data frame be stored more efficiently, modifying the data frame (including better annotating it and changing the names of some of its variables), and checking and inspecting the data frame for reasonableness of the values of its variables and to describe patterns of missing data. Various built-in functions and functions in the Hmisc library are used. At the end some methods for creating data frames “from scratch” within R are presented.
The examples below attempt to clarify the separation of operations
that are done on a data frame as a whole, operations that are done on
a small subset of its variables without attaching the whole data
frame, and operations that are done on many variables after attaching
the data frame in search position one. It also tries to clarify that
for analyzing several separate variables using R commands that do not
support a data
argument, it is helpful to attach the data frame
in a search position later than position one.
It is often useful to create, modify, and process datasets in the following order.
Import external data into a data frame (if the raw data do not contain column names, provide these during the import if possible)
Make global changes to a data frame (e.g., changing variable names)
Change attributes or values of variables within a data frame
Do analyses involving the whole data frame (without attaching it)
(Data frame still in .Data)
Do analyses of individual variables (after attaching the data frame in search position two or later)
The examples below use the FEV
dataset from
Rosner 1995. Almost any dataset would do. The jcetable data
are taken from Galobardes, et al.
Presently, giving a variable the "units"
attribute (using the
Hmisc units
function) only benefits the
Hmisc describe
function and the rms
library's version of the Surv
function. Variable
labels defined with the Hmisc label
function are used by
describe
, summary.formula
, and many of
the plotting functions in Hmisc and rms.
Alzola CF, Harrell FE (2006): An Introduction to S and the Hmisc and Design Libraries. Chapters 3 and 4, https://hbiostat.org/R/doc/sintro.pdf.
Galobardes, et al. (1998), J Clin Epi 51:875-881.
Rosner B (1995): Fundamentals of Biostatistics, 4th Edition. New York: Duxbury Press.
scan
, read.table
,
cleanup.import
, sas.get
,
data.frame
, attach
, detach
,
describe
, datadensity
,
plot.data.frame
, hist.data.frame
,
naclus
, factor
, label
,
units
, names
, expand.grid
,
summary.formula
, summary.data.frame
,
casefold
, edit
, page
,
plot.data.frame
, Cs
,
combine.levels
,upData
## Not run: # First, we do steps that create or manipulate the data # frame in its entirety. For S-Plus, these are done with # .Data in search position one (the default at the # start of the session). # # ----------------------------------------------------------------------- # Step 1: Create initial draft of data frame # # We usually begin by importing a dataset from # # another application. ASCII files may be imported # using the scan and read.table functions. SAS # datasets may be imported using the Hmisc sas.get # function (which will carry more attributes from # SAS than using File \dots Import) from the GUI # menus. But for most applications (especially # Excel), File \dots Import will suffice. If using # the GUI, it is often best to provide variable # names during the import process, using the Options # tab, rather than renaming all fields later Of # course, if the data to be imported already have # field names (e.g., in Excel), let S use those # automatically. If using S-Plus, you can use a # command to execute File \dots Import, e.g.: import.data(FileName = "/windows/temp/fev.asc", FileType = "ASCII", DataFrame = "FEV") # Here we name the new data frame FEV rather than # fev, because we wanted to distinguish a variable # in the data frame named fev from the data frame # name. For S-Plus the command will look # instead like the following: FEV <- importData("/tmp/fev.asc") # ----------------------------------------------------------------------- # Step 2: Clean up data frame / make it be more # efficiently stored # # Unless using sas.get to import your dataset # (sas.get already stores data efficiently), it is # usually a good idea to run the data frame through # the Hmisc cleanup.import function to change # numeric variables that are always whole numbers to # be stored as integers, the remaining numerics to # single precision, strange values from Excel to # NAs, and character variables that always contain # legal numeric values to numeric variables. # cleanup.import typically halves the size of the # data frame. If you do not specify any parameters # to cleanup.import, the function assumes that no # numeric variable needs more than 7 significant # digits of precision, so all non-integer-valued # variables will be converted to single precision. FEV <- cleanup.import(FEV) # ----------------------------------------------------------------------- # Step 3: Make global changes to the data frame # # A data frame has attributes that are "external" to # its variables. There are the vector of its # variable names ("names" attribute), the # observation identifiers ("row.names"), and the # "class" (whose value is "data.frame"). The # "names" attribute is the one most commonly in need # of modification. If we had wanted to change all # the variable names to lower case, we could have # specified lowernames=TRUE to the cleanup.import # invocation above, or type names(FEV) <- casefold(names(FEV)) # The upData function can also be used to change # variable names in two ways (see below). # To change names in a non-systematic way we use # other options. Under Windows/NT the most # straigtforward approach is to change the names # interactively. Click on the data frame in the # left panel of the Object Browser, then in the # right pane click twice (slowly) on a variable. # Use the left arrow and other keys to edit the # name. Click outside that name field to commit the # change. You can also rename columns while in a # Data Sheet. 
To instead use programming commands # to change names, use something like: names(FEV)[6] <- 'smoke' # assumes you know the positions! names(FEV)[names(FEV)=='smoking'] <- 'smoke' names(FEV) <- edit(names(FEV)) # The last example is useful if you are changing # many names. But none of the interactive # approaches such as edit() are handy if you will be # re-importing the dataset after it is updated in # its original application. This problem can be # addressed by saving the new names in a permanent # vector in .Data: new.names <- names(FEV) # Then if the data are re-imported, you can type names(FEV) <- new.names # to rename the variables. # ----------------------------------------------------------------------- # Step 4: Delete unneeded variables # # To delete some of the variables, you can # right-click on variable names in the Object # Browser's right pane, then select Delete. You can # also set variables to have NULL values, which # causes the system to delete them. We don't need # to delete any variables from FEV but suppose we # did need to delete some from mydframe. mydframe$x1 <- NULL mydframe$x2 <- NULL mydframe[c('age','sex')] <- NULL # delete 2 variables mydframe[Cs(age,sex)] <- NULL # same thing # The last example uses the Hmisc short-cut quoting # function Cs. See also the drop parameter to upData. # ----------------------------------------------------------------------- # Step 5: Make changes to individual variables # within the data frame # # After importing data, the resulting variables are # seldom self - documenting, so we commonly need to # change or enhance attributes of individual # variables within the data frame. # # If you are only changing a few variables, it is # efficient to change them directly without # attaching the entire data frame. FEV$sex <- factor(FEV$sex, 0:1, c('female','male')) FEV$smoke <- factor(FEV$smoke, 0:1, c('non-current smoker','current smoker')) units(FEV$age) <- 'years' units(FEV$fev) <- 'L' label(FEV$fev) <- 'Forced Expiratory Volume' units(FEV$height) <- 'inches' # When changing more than one or two variables it is # more convenient change the data frame using the # Hmisc upData function. FEV2 <- upData(FEV, rename=c(smoking='smoke'), # omit if renamed above drop=c('var1','var2'), levels=list(sex =list(female=0,male=1), smoke=list('non-current smoker'=0, 'current smoker'=1)), units=list(age='years', fev='L', height='inches'), labels=list(fev='Forced Expiratory Volume')) # An alternative to levels=list(\dots) is for example # upData(FEV, sex=factor(sex,0:1,c('female','male'))). # # Note that we saved the changed data frame into a # new data frame FEV2. If we were confident of the # correctness of our changes we could have stored # the new data frame on top of the old one, under # the original name FEV. # ----------------------------------------------------------------------- # Step 6: Check the data frame # # The Hmisc describe function is perhaps the first # function that should be used on the new data # frame. It provides documentation of all the # variables and the frequency tabulation, counts of # NAs, and 5 largest and smallest values are # helpful in detecting data errors. Typing # describe(FEV) will write the results to the # current output window. To put the results in a # new window that can persist, even upon exiting # S, we use the page function. The describe # output can be minimized to an icon but kept ready # for guiding later steps of the analysis. 
page(describe(FEV2), multi=TRUE) # multi=TRUE allows that window to persist while # control is returned to other windows # The new data frame is OK. Store it on top of the # old FEV and then use the graphical user interface # to delete FEV2 (click on it and hit the Delete # key) or type rm(FEV2) after the next statement. FEV <- FEV2 # Next, we can use a variety of other functions to # check and describe all of the variables. As we # are analyzing all or almost all of the variables, # this is best done without attaching the data # frame. Note that plot.data.frame plots inverted # CDFs for continuous variables and dot plots # showing frequency distributions of categorical # ones. summary(FEV) # basic summary function (summary.data.frame) plot(FEV) # plot.data.frame datadensity(FEV) # rug plots and freq. bar charts for all var. hist.data.frame(FEV) # for variables having > 2 values by(FEV, FEV$smoke, summary) # use basic summary function with stratification # ----------------------------------------------------------------------- # Step 7: Do detailed analyses involving individual # variables # # Analyses based on the formula language can use # data= so attaching the data frame may not be # required. This saves memory. Here we use the # Hmisc summary.formula function to compute 5 # statistics on height, stratified separately by age # quartile and by sex. options(width=80) summary(height ~ age + sex, data=FEV, fun=function(y)c(smean.sd(y), smedian.hilow(y,conf.int=.5))) # This computes mean height, S.D., median, outer quartiles fit <- lm(height ~ age*sex, data=FEV) summary(fit) # For this analysis we could also have attached the # data frame in search position 2. For other # analyses, it is mandatory to attach the data frame # unless FEV$ prefixes each variable name. # Important: DO NOT USE attach(FEV, 1) or # attach(FEV, pos=1, \dots) if you are only analyzing # and not changing the variables, unless you really # need to avoid conflicts with variables in search # position 1 that have the same names as the # variables in FEV. Attaching into search position # 1 will cause S-Plus to be more of a memory hog. attach(FEV) # Use e.g. attach(FEV[,Cs(age,sex)]) if you only # want to analyze a small subset of the variables # Use e.g. attach(FEV[FEV$sex=='male',]) to # analyze a subset of the observations summary(height ~ age + sex, fun=function(y)c(smean.sd(y), smedian.hilow(y,conf.int=.5))) fit <- lm(height ~ age*sex) # Run generic summary function on height and fev, # stratified by sex by(data.frame(height,fev), sex, summary) # Cross-classify into 4 sex x smoke groups by(FEV, list(sex,smoke), summary) # Plot 5 quantiles s <- summary(fev ~ age + sex + height, fun=function(y)quantile(y,c(.1,.25,.5,.75,.9))) plot(s, which=1:5, pch=c(1,2,15,2,1), #pch=c('=','[','o',']','='), main='A Discovery', xlab='FEV') # Use the nonparametric bootstrap to compute a # 0.95 confidence interval for the population mean fev smean.cl.boot(fev) # in Hmisc # Use the Statistics \dots Compare Samples \dots One Sample # keys to get a normal-theory-based C.I. Then do it # more manually. The following method assumes that # there are no NAs in fev sd <- sqrt(var(fev)) xbar <- mean(fev) xbar sd n <- length(fev) qt(.975,n-1) # prints 0.975 critical value of t dist. with n-1 d.f. 
xbar + c(-1,1)*sd/sqrt(n)*qt(.975,n-1) # prints confidence limits # Fit a linear model # fit <- lm(fev ~ other variables \dots) detach() # The last command is only needed if you want to # start operating on another data frame and you want # to get FEV out of the way. # ----------------------------------------------------------------------- # Creating data frames from scratch # # Data frames can be created from within S. To # create a small data frame containing ordinary # data, you can use something like dframe <- data.frame(age=c(10,20,30), sex=c('male','female','male'), stringsAsFactors=TRUE) # You can also create a data frame using the Data # Sheet. Create an empty data frame with the # correct variable names and types, then edit in the # data. dd <- data.frame(age=numeric(0),sex=character(0), stringsAsFactors=TRUE) # The sex variable will be stored as a factor, and # levels will be automatically added to it as you # define new values for sex in the Data Sheet's sex # column. # # When the data frame you need to create is defined # by systematically varying variables (e.g., all # possible combinations of values of each variable), # the expand.grid function is useful for quickly # creating the data. Then you can add # non-systematically-varying variables to the object # created by expand.grid, using programming # statements or editing the Data Sheet. This # process is useful for creating a data frame # representing all the values in a printed table. # In what follows we create a data frame # representing the combinations of values from an 8 # x 2 x 2 x 2 (event x method x sex x what) table, # and add a non-systematic variable percent to the # data. jcetable <- expand.grid( event=c('Wheezing at any time', 'Wheezing and breathless', 'Wheezing without a cold', 'Waking with tightness in the chest', 'Waking with shortness of breath', 'Waking with an attack of cough', 'Attack of asthma', 'Use of medication'), method=c('Mail','Telephone'), sex=c('Male','Female'), what=c('Sensitivity','Specificity')) jcetable$percent <- c(756,618,706,422,356,578,289,333, 576,421,789,273,273,212,212,212, 613,763,713,403,377,541,290,226, 613,684,632,290,387,613,258,129, 656,597,438,780,732,679,938,919, 714,600,494,877,850,703,963,987, 755,420,480,794,779,647,956,941, 766,423,500,833,833,604,955,986) / 10 # In jcetable, event varies most rapidly, then # method, then sex, and what. ## End(Not run)
These functions are intended to be used to describe how well a given
set of new observations (e.g., new subjects) were represented in a
dataset used to develop a predictive model.
The dataRep
function forms a data frame that contains all the unique
combinations of variable values that existed in a given set of
variable values. Cross-classifications of values are created using
exact values of variables, so for continuous numeric variables it is
often necessary to round them to the nearest v
and to possibly
curtail the values to some lower and upper limit before rounding.
Here v
denotes a numeric constant specifying the matching tolerance
that will be used. dataRep
also stores marginal distribution
summaries for all the variables. For numeric variables, all 101
percentiles are stored, and for all variables, the frequency
distributions are also stored (frequencies are computed after any
rounding and curtailment of numeric variables). For the purposes of
rounding and curtailing, the roundN
function is provided. A print
method will summarize the calculations made by dataRep
, and if
long=TRUE
all unique combinations of values and their frequencies in
the original dataset are printed.
The predict
method for dataRep
takes a new data frame having
variables named the same as the original ones (but whose factor levels
are not necessarily in the same order) and examines the collapsed
cross-classifications created by dataRep
to find how many
observations were similar to each of the new observations after any
rounding or curtailment of limits is done. predict
also does some
calculations to describe how the variable values of the new
observations "stack up" against the marginal distributions of the
original data. For categorical variables, the percent of observations
having a given variable with the value of the new observation (after
rounding for variables that were through roundN
in the formula given
to dataRep
) is computed. For numeric variables, the percentile of
the original distribution in which the current value falls will be
computed. For this purpose, the data are not rounded because the 101
original percentiles were retained; linear interpolation is used to
estimate percentiles for values between two tabulated percentiles.
The lowest marginal frequency of matching values across all variables
is also computed. For example, if an age, sex combination matches 10
subjects in the original dataset but the age value matches 100 ages
(after rounding) and the sex value matches the sex code of 300
observations, the lowest marginal frequency is 100, which is a "best
case" upper limit for multivariable matching. I.e., matching on all
variables has to result in a frequency no greater than this amount.
A print
method for the output of predict.dataRep
prints all
calculations done by predict
by default. Calculations can be
selectively suppressed.
dataRep(formula, data, subset, na.action)

roundN(x, tol=1, clip=NULL)

## S3 method for class 'dataRep'
print(x, long=FALSE, ...)

## S3 method for class 'dataRep'
predict(object, newdata, ...)

## S3 method for class 'predict.dataRep'
print(x, prdata=TRUE, prpct=TRUE, ...)
formula |
a formula with no left-hand-side. Continuous numeric variables in
need of rounding should appear in the formula as e.g. |
x |
a numeric vector or an object created by |
object |
the object created by |
data , subset , na.action
|
standard modeling arguments. Default |
tol |
rounding constant (tolerance is actually |
clip |
a 2-vector specifying a lower and upper limit to curtail values of |
long |
set to |
newdata |
a data frame containing all the variables given to |
prdata |
set to |
prpct |
set to |
... |
unused |
dataRep
returns a list of class "dataRep"
containing the collapsed
data frame and frequency counts along with marginal distribution
information. predict
returns an object of class "predict.dataRep"
containing information determined by matching observations in
newdata
with the original (collapsed) data.
print.dataRep
is used only for its printed output.
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
[email protected]
set.seed(13)
num.symptoms <- sample(1:4, 1000, TRUE)
sex <- factor(sample(c('female','male'), 1000, TRUE))
x <- runif(1000)
x[1] <- NA
table(num.symptoms, sex, .25*round(x/.25))
d <- dataRep(~ num.symptoms + sex + roundN(x, .25))
print(d, long=TRUE)
predict(d, data.frame(num.symptoms=1:3,
                      sex=c('male','male','female'),
                      x=c(.03,.5,1.5)))
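As a small additional illustration of roundN by itself (a hedged sketch; the tolerance and clipping limits are arbitrary):

roundN(c(3, 22.4, 57.1, 94), tol=5, clip=c(20, 80))
# values are first curtailed to [20, 80], then rounded to the nearest 5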
Computes the Kish design effect and corresponding intra-cluster correlation for a single cluster-sampled variable
deff(y, cluster)
y |
variable to analyze |
cluster |
a variable whose unique values indicate cluster membership. Any type of variable is allowed. |
a vector with named elements n
(total number of non-missing
observations), clusters
(number of clusters after deleting
missing data), rho
(intra-cluster correlation), and deff
(design effect).
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
set.seed(1)
blood.pressure <- rnorm(1000, 120, 15)
clinic <- sample(letters, 1000, replace=TRUE)
deff(blood.pressure, clinic)
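The Kish design effect is commonly written as 1 + (m - 1) * rho, where m is the (average) cluster size and rho is the intra-cluster correlation. The following hedged sketch compares the components returned by deff with that approximation; agreement is only approximate when cluster sizes are unequal:

set.seed(1)
blood.pressure <- rnorm(1000, 120, 15)
clinic <- sample(letters, 1000, replace=TRUE)
d <- deff(blood.pressure, clinic)
mbar <- d['n'] / d['clusters']    # average cluster size
1 + (mbar - 1) * d['rho']         # roughly comparable to d['deff']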
describe
is a generic method that invokes describe.data.frame
,
describe.matrix
, describe.vector
, or
describe.formula
. describe.vector
is the basic
function for handling a single variable.
This function determines whether the variable is character, factor,
category, binary, discrete numeric, and continuous numeric, and prints
a concise statistical summary according to each. A numeric variable is
deemed discrete if it has <= 10 distinct values. In this case,
quantiles are not printed. A frequency table is printed
for any non-binary variable if it has no more than 20 distinct
values. For any variable for which the frequency table is not printed,
the 5 lowest and highest values are printed. This behavior can be
overridden for long character variables with many levels using the
listunique
parameter, to get a complete tabulation.
describe
is especially useful for
describing data frames created by *.get
, as labels, formats,
value labels, and (in the case of sas.get
) frequencies of special
missing values are printed.
For a binary variable, the sum (number of 1's) and mean (proportion of
1's) are printed. If the first argument is a formula, a model frame
is created and passed to describe.data.frame. If a variable
is of class "impute"
, a count of the number of imputed values is
printed. If a date variable has an attribute partial.date
(this is set up by sas.get
), counts of how many partial dates are
actually present (missing month, missing day, missing both) are also presented.
If a variable was created by the special-purpose function substi
(which
substitutes values of a second variable if the first variable is NA),
the frequency table of substitutions is also printed.
For numeric variables, describe
adds an item called Info
which is a relative information measure using the relative efficiency of
a proportional odds/Wilcoxon test on the variable relative to the same
test on a variable that has no ties. Info
is related to how
continuous the variable is, and ties are less harmful the more untied
values there are. The formula for Info
is one minus the sum of
the cubes of relative frequencies of values divided by one minus the
square of the reciprocal of the sample size. The lowest information
comes from a variable having only one distinct value, followed by a
highly skewed binary variable. Info
is reported to
two decimal places.
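A hedged sketch of that formula, not the package's internal code; infoMeasure is a hypothetical helper:

infoMeasure <- function(x) {
  x <- x[! is.na(x)]
  n <- length(x)
  p <- table(x) / n                # relative frequencies of distinct values
  (1 - sum(p^3)) / (1 - 1/n^2)
}
infoMeasure(c(rep(1, 99), 2))      # highly skewed binary variable: low information
infoMeasure(1:100)                 # no ties: information close to 1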
A latex method exists for converting the describe
object to a
LaTeX file. For numeric variables having more than 20 distinct values,
describe
saves in its returned object the frequencies of 100
evenly spaced bins running from minimum observed value to the maximum.
When there are 20 or fewer distinct values, the original
values are maintained.
latex
and html
insert a spike histogram displaying these
frequency counts in the tabular material using the LaTeX picture
environment. For example output see
https://hbiostat.org/doc/rms/book/chapter7edition1.pdf.
Note that the latex method assumes you have the following styles
installed in your latex installation: setspace and relsize.
The html
method mimics the LaTeX output. This is useful in the
context of Quarto/Rmarkdown html and html notebook output.
If options(prType='html')
is in effect, calling print
on
an object that is the result of running describe
on a data frame
will result in rendering the HTML version. If run from the console a
browser window will open. When which
is specified to
print
, whether or not prType='html'
is in effect, a
gt
package html table will be produced containing only
the types of variables requested. When which='both'
a list with
element names Continuous
and Categorical
is produced,
making it convenient for the user to print as desired, or to pass the
list directly to the qreport
maketabs
function when using Quarto.
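A hedged sketch of these printing options; mydata is a hypothetical data frame:

# options(prType='html')           # render describe() output as HTML
# d <- describe(mydata)
# print(d, which='categorical')    # gt table of categorical variables only
# w <- print(d, which='both')      # list with Continuous and Categorical parts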
The plot
method is for describe
objects run on data
frames. It produces spike histograms for a graphic of
continuous variables and a dot chart for categorical variables, showing
category proportions. The graphic format is ggplot2
if the user
has not set options(grType='plotly')
or has set the grType
option to something other than 'plotly'
. Otherwise plotly
graphics that are interactive are produced, and these can be placed into
an Rmarkdown html notebook. The user must install the plotly
package for this to work. When the user hovers the mouse over a bin for
a raw data value, the actual value will pop up (formatted using
digits
). When the user hovers over the minimum data value, most
of the information calculated by describe
will pop up. For each
variable, the number of missing values is used to assign the color to
the histogram or dot chart, and a legend is drawn. Color is not used if
there are no missing values in any variable. For categorical variables,
hovering over the leftmost point for a variable displays details, and
for all points proportions, numerators, and denominators are displayed
in the popup. If both continuous and categorical variables are present
and which='both'
is specified, the plot
method returns an
unclassed list
containing two objects, named 'Categorical'
and 'Continuous'
, in that order.
Sample weights may be specified to any of the functions, resulting in weighted means, quantiles, and frequency tables.
Note: As discussed in Cox and Longton (2008), Stata Technical Bulletin 8(4) pp. 557, the term "unique" has been replaced with "distinct" in the output (but not in parameter names).
When weights
are not used, the pseudomedian and Gini's mean difference are computed for
numeric variables. The pseudomedian is labeled pMedian
and is the median of all possible pairwise averages. It is a robust and efficient measure of location that equals the mean and median for symmetric distributions. It is also called the Hodges-Lehmann one-sample estimator. Gini's mean difference is a robust measure of dispersion that is the
mean absolute difference between any pairs of observations. In simple
output Gini's difference is labeled Gmd
.
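These two statistics can also be computed directly; a minimal hedged sketch:

x <- c(1, 2, 3, 10)
pMedian(x)    # median of all possible pairwise averages (Hodges-Lehmann)
GiniMd(x)     # mean absolute difference over all pairs of observations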
formatdescribeSingle
is a service function for latex
,
html
, and print
methods for single variables that is not
intended to be called by the user.
## S3 method for class 'vector' describe(x, descript, exclude.missing=TRUE, digits=4, listunique=0, listnchar=12, weights=NULL, normwt=FALSE, minlength=NULL, shortmChoice=TRUE, rmhtml=FALSE, trans=NULL, lumptails=0.01, ...) ## S3 method for class 'matrix' describe(x, descript, exclude.missing=TRUE, digits=4, ...) ## S3 method for class 'data.frame' describe(x, descript, exclude.missing=TRUE, digits=4, trans=NULL, ...) ## S3 method for class 'formula' describe(x, descript, data, subset, na.action, digits=4, weights, ...) ## S3 method for class 'describe' print(x, which = c('both', 'categorical', 'continuous'), ...) ## S3 method for class 'describe' latex(object, title=NULL, file=paste('describe',first.word(expr=attr(object,'descript')),'tex',sep='.'), append=FALSE, size='small', tabular=TRUE, greek=TRUE, spacing=0.7, lspace=c(0,0), ...) ## S3 method for class 'describe.single' latex(object, title=NULL, vname, file, append=FALSE, size='small', tabular=TRUE, greek=TRUE, lspace=c(0,0), ...) ## S3 method for class 'describe' html(object, size=85, tabular=TRUE, greek=TRUE, scroll=FALSE, rows=25, cols=100, ...) ## S3 method for class 'describe.single' html(object, size=85, tabular=TRUE, greek=TRUE, ...) formatdescribeSingle(x, condense=c('extremes', 'frequencies', 'both', 'none'), lang=c('plain', 'latex', 'html'), verb=0, lspace=c(0, 0), size=85, ...) ## S3 method for class 'describe' plot(x, which=c('both', 'continuous', 'categorical'), what=NULL, sort=c('ascending', 'descending', 'none'), n.unique=10, digits=5, bvspace=2, ...)
x |
a data frame, matrix, vector, or formula. For a data frame, the
|
descript |
optional title to print for x. The default is the name of the argument
or the "label" attributes of individual variables. When the first argument
is a formula, |
exclude.missing |
set to TRUE to print the names of variables that contain only missing values. This list appears at the bottom of the printout, and no space is taken up for such variables in the main listing. |
digits |
number of significant digits to print. For |
listunique |
For a character variable that is not an |
listnchar |
see |
weights |
a numeric vector of frequencies or sample weights. Each observation
will be treated as if it were sampled |
minlength |
value passed to summary.mChoice |
shortmChoice |
set to |
rmhtml |
set to |
trans |
for |
lumptails |
specifies the quantile to use (its complement is also used) for grouping observations in the tails so that outliers have less chance of distorting the variable's range for sparkline spike histograms. The default is 0.01, i.e., observations below the 0.01 quantile are grouped together in the leftmost bin, and observations above the 0.99 quantile are grouped to form the last bin. |
normwt |
The default, |
object |
a result of |
title |
unused |
data |
a data frame, data table, or list |
subset |
a subsetting expression |
na.action |
These are used if a formula is specified. |
... |
arguments passed to |
file |
name of output file (should have a suffix of .tex). Default name is
formed from the first word of the |
append |
set to |
size |
LaTeX text size ( |
tabular |
set to |
greek |
By default, the |
spacing |
By default, the |
lspace |
extra vertical space, in character size units (i.e., "ex"
as appended to the space). When using certain font sizes, there is
too much space left around LaTeX verbatim environments. This
two-vector specifies space to remove (i.e., the values are negated in
forming the |
scroll |
set to |
rows , cols
|
the number of rows or columns to allocate for the scrollable box |
vname |
unused argument in |
which |
specifies whether to plot numeric continuous or
binary/categorical variables, or both. When |
what |
character or numeric vector specifying which variables to plot; default is to plot all |
sort |
specifies how and whether variables are sorted in order of
the proportion of positives when |
n.unique |
the minimum number of distinct values a numeric variable
must have before |
bvspace |
the between-variable spacing for categorical variables. Defaults to 2, meaning twice the amount of vertical space as what is used for between-category spacing within a variable |
condense |
specifies whether to condense the output with regard to
the 5 lowest and highest values ( |
lang |
specifies the markup language |
verb |
set to 1 if a verbatim environment is already in effect for LaTeX |
If options(na.detail.response=TRUE)
has been set and na.action
is "na.delete"
or
"na.keep"
, summary statistics on
the response variable are printed separately for missing and non-missing
values of each predictor. The default summary function returns
the number of non-missing response values and the mean of the last
column of the response values, with a names
attribute of
c("N","Mean")
.
When the response is a Surv
object and the mean is used, this will
result in the crude proportion of events being used to summarize
the response. The actual summary function can be designated through
options(na.fun.response = "function name")
.
If you are modifying LaTex parskip
or certain other parameters,
you may need to shrink the area around tabular
and
verbatim
environments produced by latex.describe
. You can
do this using for example
\usepackage{etoolbox}\makeatletter\preto{\@verbatim}{\topsep=-1.4pt
\partopsep=0pt}\preto{\@tabular}{\parskip=2pt
\parsep=0pt}\makeatother
in the LaTeX preamble.
a list containing elements descript
, counts
,
values
. The list is of class describe
. If the input
object was a matrix or a data
frame, the list is a list of lists, one list for each variable
analyzed. latex
returns a standard latex
object. For numeric
variables having at least 20 distinct values, an additional component
intervalFreq
is included. This component is a list with two elements, range
(containing two values) and count
, a vector of 100 integer frequency
counts. print
with which=
returns a 'gt' table object.
The user can modify the table by piping formatting changes, column
removals, and other operations, before final rendering.
Frank Harrell
Vanderbilt University
[email protected]
spikecomp
, sas.get
, quantile
,
GiniMd
, pMedian
,
table
, summary
,
model.frame.default
,
naprint
, lapply
, tapply
,
Surv
, na.delete
,
na.keep
,
na.detail.response
, latex
set.seed(1) describe(runif(200),dig=2) #single variable, continuous #get quantiles .05,.10,\dots dfr <- data.frame(x=rnorm(400),y=sample(c('male','female'),400,TRUE)) describe(dfr) ## Not run: options(grType='plotly') d <- describe(mydata) p <- plot(d) # create plots for both types of variables p[[1]]; p[[2]] # or p$Categorical; p$Continuous plotly::subplot(p[[1]], p[[2]], nrows=2) # plot both in one plot(d, which='categorical') # categorical ones d <- sas.get(".","mydata",special.miss=TRUE,recode=TRUE) describe(d) #describe entire data frame attach(d, 1) describe(relig) #Has special missing values .D .F .M .R .T #attr(relig,"label") is "Religious preference" #relig : Religious preference Format:relig # n missing D F M R T distinct # 4038 263 45 33 7 2 1 8 # #0:none (251, 6%), 1:Jewish (372, 9%), 2:Catholic (1230, 30%) #3:Jehovah's Witnes (25, 1%), 4:Christ Scientist (7, 0%) #5:Seventh Day Adv (17, 0%), 6:Protestant (2025, 50%), 7:other (111, 3%) # Method for describing part of a data frame: describe(death.time ~ age*sex + rcs(blood.pressure)) describe(~ age+sex) describe(~ age+sex, weights=freqs) # weighted analysis fit <- lrm(y ~ age*sex + log(height)) describe(formula(fit)) describe(y ~ age*sex, na.action=na.delete) # report on number deleted for each variable options(na.detail.response=TRUE) # keep missings separately for each x, report on dist of y by x=NA describe(y ~ age*sex) options(na.fun.response="quantile") describe(y ~ age*sex) # same but use quantiles of y by x=NA d <- describe(my.data.frame) d$age # print description for just age d[c('age','sex')] # print description for two variables d[sort(names(d))] # print in alphabetic order by var. names d2 <- d[20:30] # keep variables 20-30 page(d2) # pop-up window for these variables # Test date/time formats and suppression of times when they don't vary library(chron) d <- data.frame(a=chron((1:20)+.1), b=chron((1:20)+(1:20)/100), d=ISOdatetime(year=rep(2003,20),month=rep(4,20),day=1:20, hour=rep(11,20),min=rep(17,20),sec=rep(11,20)), f=ISOdatetime(year=rep(2003,20),month=rep(4,20),day=1:20, hour=1:20,min=1:20,sec=1:20), g=ISOdate(year=2001:2020,month=rep(3,20),day=1:20)) describe(d) # Make a function to run describe, latex.describe, and use the kdvi # previewer in Linux to view the result and easily make a pdf file ldesc <- function(data) { options(xdvicmd='kdvi') d <- describe(data, desc=deparse(substitute(data))) dvi(latex(d, file='/tmp/z.tex'), nomargins=FALSE, width=8.5, height=11) } ldesc(d) ## End(Not run)
discrete
creates a discrete vector which is distinct from a
continuous vector, or a factor/ordered vector.
The other functions are tools for manipulating discrete vectors.
as.discrete(x, ...)

## Default S3 method:
as.discrete(x, ...)

discrete(x, levels = sort(unique.default(x), na.last = TRUE), exclude = NA)

## S3 replacement method for class 'discrete'
x[...] <- value

## S3 method for class 'discrete'
x[..., drop = FALSE]

## S3 method for class 'discrete'
x[[i]]

is.discrete(x)

## S3 replacement method for class 'discrete'
is.na(x) <- value

## S3 replacement method for class 'discrete'
length(x) <- value
x |
a vector |
drop |
Should unused levels be dropped. |
exclude |
logical: should |
i |
indexing vector |
levels |
character: list of individual level values |
value |
index of elements to set to |
... |
arguments to be passed to other functions |
as.discrete
converts a vector into a discrete vector.
discrete
creates a discrete vector from provided values.
is.discrete
tests to see if the vector is a discrete vector.
as.discrete
, discrete
returns a vector of
discrete
type.
is.discrete
returns logical TRUE
if the vector is of
class discrete; otherwise it returns FALSE
.
Charles Dupont
a <- discrete(1:25)
a
is.discrete(a)
b <- as.discrete(2:4)
b
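A brief hedged sketch of subsetting behavior; by default drop=FALSE, so unused levels are retained:

a <- discrete(1:5)
b <- a[2:3]          # still a discrete vector
is.discrete(b)
levels(b)            # all five original levels kept because drop=FALSE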
dotchart2
is an enhanced version of the dotchart
function
with several new options.
dotchart2(data, labels, groups=NULL, gdata=NA, horizontal=TRUE, pch=16,
          xlab='', ylab='', xlim=NULL, auxdata, auxgdata=NULL, auxtitle,
          lty=1, lines=TRUE, dotsize = .8, cex = par("cex"),
          cex.labels = cex, cex.group.labels = cex.labels*1.25,
          sort.=TRUE, add=FALSE, dotfont=par('font'), groupfont=2,
          reset.par=add, xaxis=TRUE, width.factor=1.1, lcolor='gray',
          leavepar=FALSE, axisat=NULL, axislabels=NULL, ...)
data |
a numeric vector whose values are shown on the x-axis |
labels |
a vector of labels for each point, corresponding to
|
groups |
an optional categorical variable indicating how
|
gdata |
data values for groups, typically summaries such as group medians |
horizontal |
set to |
pch |
default character number or value for plotting dots in dot charts. The default is 16. |
xlab |
x-axis title |
ylab |
y-axis title |
xlim |
x-axis limits. Applies only to |
auxdata |
a vector of auxiliary data given to |
auxgdata |
similar to |
auxtitle |
if |
lty |
line type for horizontal lines. Default is 1 for R, 2 for S-Plus |
lines |
set to |
dotsize |
|
cex |
see |
cex.labels |
|
cex.group.labels |
value of |
sort. |
set to |
add |
set to |
dotfont |
font number of plotting dots. Default is one. Use |
groupfont |
font number to use in drawing |
reset.par |
set to |
xaxis |
set to |
width.factor |
When the calculated left margin turns out to be faulty, specify a
factor by which to multiply the left margin as |
lcolor |
color for horizontal reference lines. Default is |
leavepar |
set to |
axisat |
a vector of tick mark locations to pass to |
axislabels |
a vector of strings specifying axis tick mark labels. Useful if transforming the data axis |
... |
arguments passed to |
dotchart2
will leave par
altered if reset.par=FALSE
.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
set.seed(135)
maj <- factor(c(rep('North',13), rep('South',13)))
g <- paste('Category', rep(letters[1:13], 2))
n <- sample(1:15000, 26, replace=TRUE)
y1 <- runif(26)
y2 <- pmax(0, y1 - runif(26, 0, .1))
dotchart2(y1, g, groups=maj, auxdata=n, auxtitle='n', xlab='Y')
dotchart2(y2, g, groups=maj, pch=17, add=TRUE)
## Compare with dotchart function (no superpositioning or auxdata allowed):
## dotchart(y1, g, groups=maj, xlab='Y')
## To plot using a transformed scale add for example
## axisat=sqrt(pretty(y)), axislabels=pretty(y)
These are adaptations of the R dotchart function that sorts categories
top to bottom, adds auxdata
and auxtitle
arguments to put
extra information in the right margin, and for dotchart3
adds
arguments cex.labels
, cex.group.labels
, and
groupfont
. By default, group headings are in a larger, bold
font. dotchart3
also cuts a bit of white space from the top and
bottom of the chart. The most significant change, however, is in how
x
is interpreted. Columns of x
no longer provide an
alternate way to define groups. Instead, they define superpositioned
values. This is useful for showing three quartiles, for example. Going
along with this change, for dotchart3
pch
can now be a
vector specifying symbols to use going across columns of x
.
x
was changed in this way because putting multiple points on a
line (e.g., quartiles) while keeping track of par() parameters when
parameters when
dotchart2
was called with add=TRUE
was cumbersome.
dotchart3
changes the margins to account for horizontal labels.
dotchartp
is a version of dotchart3
for making the chart
with the plotly
package.
summaryD
creates aggregate data using summarize
and
calls dotchart3
with suitable arguments to summarize data by
major and minor categories. If options(grType='plotly')
is in
effect and the plotly
package is installed, summaryD
uses
dotchartp
instead of dotchart3
.
summaryDp
is a streamlined summaryD
-like function that
uses the dotchartpl
function to render a plotly
graphic.
It is used to compute summary statistics stratified separately by a
series of variables.
dotchart3(x, labels = NULL, groups = NULL, gdata = NULL, cex = par("cex"), pch = 21, gpch = pch, bg = par("bg"), color = par("fg"), gcolor = par("fg"), lcolor = "gray", xlim = range(c(x, gdata), na.rm=TRUE), main = NULL, xlab = NULL, ylab = NULL, auxdata = NULL, auxtitle = NULL, auxgdata=NULL, axisat=NULL, axislabels=NULL, cex.labels = cex, cex.group.labels = cex.labels * 1.25, cex.auxdata=cex, groupfont = 2, auxwhere=NULL, height=NULL, width=NULL, ...) dotchartp(x, labels = NULL, groups = NULL, gdata = NULL, xlim = range(c(x, gdata), na.rm=TRUE), main=NULL, xlab = NULL, ylab = '', auxdata=NULL, auxtitle=NULL, auxgdata=NULL, auxwhere=c('right', 'hover'), symbol='circle', col=colorspace::rainbow_hcl, legendgroup=NULL, axisat=NULL, axislabels=NULL, sort=TRUE, digits=4, dec=NULL, height=NULL, width=700, layoutattr=FALSE, showlegend=TRUE, ...) summaryD(formula, data=NULL, fun=mean, funm=fun, groupsummary=TRUE, auxvar=NULL, auxtitle='', auxwhere=c('hover', 'right'), vals=length(auxvar) > 0, fmtvals=format, symbol=if(use.plotly) 'circle' else 21, col=if(use.plotly) colorspace::rainbow_hcl else 1:10, legendgroup=NULL, cex.auxdata=.7, xlab=v[1], ylab=NULL, gridevery=NULL, gridcol=gray(.95), sort=TRUE, ...) summaryDp(formula, fun=function(x) c(Mean=mean(x, na.rm=TRUE), N=sum(! is.na(x))), overall=TRUE, xlim=NULL, xlab=NULL, data=NULL, subset=NULL, na.action=na.retain, ncharsmax=c(50, 30), digits=4, ...)
x |
a numeric vector or matrix |
labels |
labels for categories corresponding to rows of
|
groups , gdata , cex , pch , gpch , bg , color , gcolor , lcolor , xlim , main , xlab , ylab
|
see |
auxdata |
a vector of information to be put in the right margin,
in the same order as |
auxtitle |
a column heading for |
auxgdata |
similar to |
axisat |
a vector of tick mark locations to pass to |
axislabels |
a vector of strings specifying axis tick mark labels. Useful if transforming the data axis |
digits |
number of significant digits for formatting numeric data
in hover text for |
dec |
for |
cex.labels |
|
cex.group.labels |
|
cex.auxdata |
|
groupfont |
font number for group headings |
auxwhere |
for |
... |
other arguments passed to some of the graphics functions,
or to |
layoutattr |
set to |
showlegend |
set to |
formula |
a formula with one variable on the left hand side (the
variable to compute summary statistics on), and one or two
variables on the right hand side. If there are two variables,
the first is taken as the major grouping variable. If the left
hand side variable is a matrix it has to be a legal R variable
name, not an expression, and |
data |
a data frame or list used to find the variables in
|
fun |
a summarization function creating a single number from a
vector. Default is the mean. For |
funm |
applies if there are two right hand variables and
|
groupsummary |
By default, when there are two right-hand
variables, |
auxvar |
when |
vals |
set to |
fmtvals |
an optional function to format values before putting
them in the right margin. Default is the |
symbol |
a scalar or vector of |
col |
a function or vector of colors to assign to multiple points plotted in one line. If a function it will be evaluated with an argument equal to the number of groups/columns. |
legendgroup |
see |
gridevery |
specify a positive number to draw very faint vertical
grid lines every |
gridcol |
color for grid lines; default is very faint gray scale |
sort |
specify |
height , width
|
height and width in pixels for |
overall |
set to |
subset |
an observation subsetting expression |
na.action |
an |
ncharsmax |
a 2-vector specifying the number of characters after which an html new line character should be placed, respectively for the x-axis label and the stratification variable levels |
the function returns invisibly
Frank Harrell
dotchart
,dotchart2
,summarize
,
rlegend
set.seed(135) maj <- factor(c(rep('North',13),rep('South',13))) g <- paste('Category',rep(letters[1:13],2)) n <- sample(1:15000, 26, replace=TRUE) y1 <- runif(26) y2 <- pmax(0, y1 - runif(26, 0, .1)) dotchart3(cbind(y1,y2), g, groups=maj, auxdata=n, auxtitle='n', xlab='Y', pch=c(1,17)) ## Compare with dotchart function (no superpositioning or auxdata allowed): ## dotchart(y1, g, groups=maj, xlab='Y') ## Not run: dotchartp(cbind(y1, y2), g, groups=maj, auxdata=n, auxtitle='n', xlab='Y', gdata=cbind(c(0,.1), c(.23,.44)), auxgdata=c(-1,-2), symbol=c('circle', 'line-ns-open')) summaryDp(sbp ~ region + sex + race + cut2(age, g=5), data=mydata) ## End(Not run) ## Put options(grType='plotly') to have the following use dotchartp ## (rlegend will not apply) ## Add argument auxwhere='hover' to summaryD or dotchartp to put ## aux info in hover text instead of right margin summaryD(y1 ~ maj + g, xlab='Mean') summaryD(y1 ~ maj + g, groupsummary=FALSE) summaryD(y1 ~ g, fmtvals=function(x) sprintf('%4.2f', x)) Y <- cbind(y1, y2) # summaryD cannot handle cbind(...) ~ ... summaryD(Y ~ maj + g, fun=function(y) y[1,], symbol=c(1,17)) rlegend(.1, 26, c('y1','y2'), pch=c(1,17)) summaryD(y1 ~ maj, fun=function(y) c(Mean=mean(y), n=length(y)), auxvar='n', auxtitle='N')
This function produces a plotly
interactive graphic and accepts
a different format of data input than the other dotchart
functions. It was written to handle a hierarchical data structure
including strata that further subdivide the main classes. Strata,
indicated by the mult
variable, are shown on the same
horizontal line, and if the variable big
is FALSE
they will
appear slightly below the main line, using smaller symbols, and having
some transparency. This is intended to handle output such as that
from the summaryP
function when there is a superpositioning
variable group
and a stratification variable mult
,
especially when the data have been run through the addMarginal
function to create mult
categories labelled "All"
for
which the user will specify big=TRUE
to indicate non-stratified
estimates (stratified only on group
) to emphasize.
When viewing graphics that used mult
and big
, the user
can click on the legend entries for the small points for group
s to
make the finely stratified estimates vanish.
When group
is used but mult
and big
are not, and
when the group
variable has exactly two distinct values, you
can specify refgroup
to get the difference between two
proportions in addition to the individual proportions. The individual
proportions are plotted, confidence intervals for the difference
are shown in hover text, and half-width confidence intervals for the
difference, centered at the midpoint of the two proportions, are drawn on the plot.
These have the property of intersecting the two proportions if and
only if there is no significant difference at the 1 - conf.int
level.
Specify fun=exp
and ifun=log
if estimates and confidence
limits are on the log scale. Make sure that zeros were prevented in
the original calculations. For exponential hazard rates this can be
accomplished by replacing event counts of 0 with 0.5.
dotchartpl(x, major=NULL, minor=NULL, group=NULL, mult=NULL, big=NULL, htext=NULL, num=NULL, denom=NULL, numlabel='', denomlabel='', fun=function(x) x, ifun=function(x) x, op='-', lower=NULL, upper=NULL, refgroup=NULL, sortdiff=TRUE, conf.int=0.95, minkeep=NULL, xlim=NULL, xlab='Proportion', tracename=NULL, limitstracename='Limits', nonbigtracename='Stratified Estimates', dec=3, width=800, height=NULL, col=colorspace::rainbow_hcl)
x |
a numeric vector used for values on the |
major |
major vertical category, e.g., variable labels |
minor |
minor vertical category, e.g. category levels within variables |
group |
superpositioning variable such as treatment |
mult |
strata names for further subdivisions without
|
big |
omit if all levels of |
htext |
additional hover text per point |
num |
if |
denom |
like |
numlabel |
character string to put to the right of the numerator in hover text |
denomlabel |
character string to put to the right of the denominator in hover text |
fun |
a transformation to make when printing estimates. For
example, one may specify |
ifun |
inverse transformation of |
op |
set to for example |
lower |
lower limits for optional error bars |
upper |
upper limits for optional error bars |
refgroup |
if |
sortdiff |
|
conf.int |
confidence level for computing confidence intervals
for the difference in two proportions. Specify |
minkeep |
if |
xlim |
|
xlab |
|
tracename |
|
limitstracename |
|
nonbigtracename |
|
col |
a function or vector of colors to assign to |
dec |
number of places to the right of the decimal place for formatting numeric quantities in hover text |
width |
width of plot in pixels |
height |
height of plot in pixels; computed from number of strata by default |
a plotly
object. An attribute levelsRemoved
is
added if minkeep
is used and any categories were omitted from
the plot as a result. This is a character vector with categories
removed. If major
is present, the strings are of the form
major:minor
Frank Harrell
## Not run: set.seed(1) d <- expand.grid(major=c('Alabama', 'Alaska', 'Arkansas'), minor=c('East', 'West'), group=c('Female', 'Male'), city=0:2) n <- nrow(d) d$num <- round(100*runif(n)) d$denom <- d$num + round(100*runif(n)) d$x <- d$num / d$denom d$lower <- d$x - runif(n) d$upper <- d$x + runif(n) with(d, dotchartpl(x, major, minor, group, city, lower=lower, upper=upper, big=city==0, num=num, denom=denom, xlab='x')) # Show half-width confidence intervals for Female - Male differences # after subsetting the data to have only one record per # state/region/group d <- subset(d, city == 0) with(d, dotchartpl(x, major, minor, group, num=num, denom=denom, lower=lower, upper=upper, refgroup='Male') ) n <- 500 set.seed(1) d <- data.frame( race = sample(c('Asian', 'Black/AA', 'White'), n, TRUE), sex = sample(c('Female', 'Male'), n, TRUE), treat = sample(c('A', 'B'), n, TRUE), smoking = sample(c('Smoker', 'Non-smoker'), n, TRUE), hypertension = sample(c('Hypertensive', 'Non-Hypertensive'), n, TRUE), region = sample(c('North America','Europe','South America', 'Europe', 'Asia', 'Central America'), n, TRUE)) d <- upData(d, labels=c(race='Race', sex='Sex')) dm <- addMarginal(d, region) s <- summaryP(race + sex + smoking + hypertension ~ region + treat, data=dm) s$region <- ifelse(s$region == 'All', 'All Regions', as.character(s$region)) with(s, dotchartpl(freq / denom, major=var, minor=val, group=treat, mult=region, big=region == 'All Regions', num=freq, denom=denom) ) s2 <- s[- attr(s, 'rows.to.exclude1'), ] with(s2, dotchartpl(freq / denom, major=var, minor=val, group=treat, mult=region, big=region == 'All Regions', num=freq, denom=denom) ) # Note these plots can be created by plot.summaryP when options(grType='plotly') # Plot hazard rates and ratios with confidence limits, on log scale d <- data.frame(tx=c('a', 'a', 'b', 'b'), event=c('MI', 'stroke', 'MI', 'stroke'), count=c(10, 5, 5, 2), exposure=c(1000, 1000, 900, 900)) # There were no zero event counts in this dataset. In general we # want to handle that, hence the 0.5 below d <- upData(d, hazard = pmax(0.5, count) / exposure, selog = sqrt(1. / pmax(0.5, count)), lower = log(hazard) - 1.96 * selog, upper = log(hazard) + 1.96 * selog) with(d, dotchartpl(log(hazard), minor=event, group=tx, num=count, denom=exposure, lower=lower, upper=upper, fun=exp, ifun=log, op='/', numlabel='events', denomlabel='years', refgroup='a', xlab='Events Per Person-Year') ) ## End(Not run)
Computation of Coordinates of Extended Box Plots Elements
ebpcomp(x, qref = c(0.5, 0.25, 0.75), probs = c(0.05, 0.125, 0.25, 0.375))
x |
a numeric variable |
qref |
quantiles for major corners |
probs |
quantiles for minor corners |
For an extended box plot, computes all the elements needed for plotting it. This is typically used when adding to a ggplot2
plot.
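A minimal sketch of the intended workflow, assuming only that the returned elements are coordinate sets to be mapped to ggplot2 geoms (e.g., geom_segment); inspect the structure before building plot layers:

w <- ebpcomp(rnorm(200))
str(w)    # coordinate elements: segments, lines, points, points2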
list with elements segments
, lines
, points
, points2
Frank Harrell
ebpcomp(1:1000)
Computes coordinates of the cumulative distribution function of x, and by default
plots it as a step function. A grouping variable may be specified so that
stratified estimates are computed and (by default) plotted. If there is
more than one group, the labcurve
function is used (by default) to label
the multiple step functions or to draw a legend defining line types, colors,
or symbols by linking them with group labels. A weights
vector may
be specified to get weighted estimates. Specify normwt
to make
weights
sum to the length of x
(after removing NAs). Otherwise
the total sample size is taken to be the sum of the weights.
Ecdf
is actually a method, and Ecdf.default
is what's
called for a vector argument. Ecdf.data.frame
is called when the
first argument is a data frame. This function can automatically set up
a matrix of ECDFs and wait for a mouse click if the matrix requires more
than one page. Categorical variables, character variables, and
variables having fewer than a set number of unique values are ignored.
If par(mfrow=..)
is not set up before Ecdf.data.frame
is
called, the function will try to figure out the best layout depending on the
number of variables in the data frame. Upon return the original
mfrow
is left intact.
When the first argument to Ecdf
is a formula, a Trellis/Lattice function
Ecdf.formula
is called. This allows for multi-panel
conditioning, superposition using a groups
variable, and other
Trellis features, along with the ability to easily plot transformed
ECDFs using the fun
argument. For example, if fun=qnorm
,
the inverse normal transformation will be used for the y-axis. If the
transformed curves are linear this indicates normality. Like the
xYplot
function, Ecdf
will create a function Key
if
the groups
variable is used. This function can be invoked by the
user to define the keys for the groups.
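A minimal sketch (not from the package's Examples section) of the fun=qnorm transformation described above; an approximately linear transformed ECDF suggests normality:

set.seed(3)
x <- rnorm(300, 100, 15)                              # simulated, roughly normal data
Ecdf(~ x, fun=qnorm, ylab='Inverse Normal of F(x)')   # straight line => consistent with normality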
Ecdf(x, ...) ## Default S3 method: Ecdf(x, what=c('F','1-F','f','1-f'), weights=rep(1, length(x)), normwt=FALSE, xlab, ylab, q, pl=TRUE, add=FALSE, lty=1, col=1, group=rep(1,length(x)), label.curves=TRUE, xlim, subtitles=TRUE, datadensity=c('none','rug','hist','density'), side=1, frac=switch(datadensity,none=NA,rug=.03,hist=.1,density=.1), dens.opts=NULL, lwd=1, log='', ...) ## S3 method for class 'data.frame' Ecdf(x, group=rep(1,nrows), weights=rep(1, nrows), normwt=FALSE, label.curves=TRUE, n.unique=10, na.big=FALSE, subtitles=TRUE, vnames=c('labels','names'),...) ## S3 method for class 'formula' Ecdf(x, data=sys.frame(sys.parent()), groups=NULL, prepanel=prepanel.Ecdf, panel=panel.Ecdf, ..., xlab, ylab, fun=function(x)x, what=c('F','1-F','f','1-f'), subset=TRUE)
x |
a numeric vector, data frame, or Trellis/Lattice formula |
what |
The default is |
weights |
numeric vector of weights. Omit or specify a zero-length vector or NULL to get unweighted estimates. |
normwt |
see above |
xlab |
x-axis label. Default is label(x) or name of calling argument. For
|
ylab |
y-axis label. Default is |
q |
a vector for quantiles for which to draw reference lines on the plot. Default is not to draw any. |
pl |
set to F to omit the plot, to just return estimates |
add |
set to TRUE to add the cdf to an existing plot. Does not apply if using lattice graphics (i.e., if a formula is given as the first argument). |
lty |
integer line type for plot. If |
lwd |
line width for plot. Can be a vector corresponding to |
log |
see |
col |
color for step function. Can be a vector. |
group |
a numeric, character, or |
label.curves |
applies if more than one |
xlim |
x-axis limits. Default is entire range of |
subtitles |
set to |
datadensity |
If |
side |
If |
frac |
passed to |
dens.opts |
a list of optional arguments for |
... |
other parameters passed to plot if add=F. For data frames, other
parameters to pass to |
n.unique |
minimum number of unique values before an ECDF is drawn for a variable in a data frame. Default is 10. |
na.big |
set to |
vnames |
By default, variable labels are used to label x-axes. Set |
method |
method for computing the empirical cumulative distribution. See
|
fun |
a function to transform the cumulative proportions, for the
Trellis-type usage of |
data , groups , subset , prepanel , panel
|
the usual Trellis/Lattice parameters, with |
for Ecdf.default
an invisible list with elements x and y giving the
coordinates of the cdf. If there is more than one group
, a list of
such lists is returned. An attribute, N
, is in the returned
object. It contains the elements n
and m
, the number of
non-missing and missing observations, respectively.
plots
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
wtd.Ecdf
, label
, table
, cumsum
, labcurve
, xYplot
, histSpike
set.seed(1) ch <- rnorm(1000, 200, 40) Ecdf(ch, xlab="Serum Cholesterol") scat1d(ch) # add rug plot histSpike(ch, add=TRUE, frac=.15) # add spike histogram # Better: add a data density display automatically: Ecdf(ch, datadensity='density') label(ch) <- "Serum Cholesterol" Ecdf(ch) other.ch <- rnorm(500, 220, 20) Ecdf(other.ch,add=TRUE,lty=2) sex <- factor(sample(c('female','male'), 1000, TRUE)) Ecdf(ch, q=c(.25,.5,.75)) # show quartiles Ecdf(ch, group=sex, label.curves=list(method='arrow')) # Example showing how to draw multiple ECDFs from paired data pre.test <- rnorm(100,50,10) post.test <- rnorm(100,55,10) x <- c(pre.test, post.test) g <- c(rep('Pre',length(pre.test)),rep('Post',length(post.test))) Ecdf(x, group=g, xlab='Test Results', label.curves=list(keys=1:2)) # keys=1:2 causes symbols to be drawn periodically on top of curves # Draw a matrix of ECDFs for a data frame m <- data.frame(pre.test, post.test, sex=sample(c('male','female'),100,TRUE)) Ecdf(m, group=m$sex, datadensity='rug') freqs <- sample(1:10, 1000, TRUE) Ecdf(ch, weights=freqs) # weighted estimates # Trellis/Lattice examples: region <- factor(sample(c('Europe','USA','Australia'),100,TRUE)) year <- factor(sample(2001:2002,1000,TRUE)) Ecdf(~ch | region*year, groups=sex) Key() # draw a key for sex at the default location # Key(locator(1)) # user-specified positioning of key age <- rnorm(1000, 50, 10) Ecdf(~ch | lattice::equal.count(age), groups=sex) # use overlapping shingles Ecdf(~ch | sex, datadensity='hist', side=3) # add spike histogram at top
Compute Coordinates of an Empirical Distribution Function
ecdfSteps(x, extend)
x |
numeric vector, possibly with |
extend |
a 2-vector to extend the range of x (low, high). Set |
For a numeric vector uses the R built-in ecdf
function to compute
coordinates of the ECDF, with extension slightly below and above the
range of x
by default. This is useful for ggplot2
where the ECDF may need to be transformed. The returned object is suitable for creating stratified statistics using data.table
and other methods.
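As a minimal sketch of the ggplot2 use mentioned above (assuming ggplot2 is installed), the returned x and y components can be plotted directly with geom_step:

if (requireNamespace("ggplot2", quietly=TRUE)) {
  w <- as.data.frame(ecdfSteps(rnorm(50)))   # list with x, y -> data frame
  print(ggplot2::ggplot(w, ggplot2::aes(x, y)) + ggplot2::geom_step())
}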
a list with components x
and y
Frank Harrell
ecdfSteps(0:10) ## Not run: # Use data.table for obtaining ECDFs by country and region w <- d[, ecdfSteps(z, extend=c(1,11)), by=.(country, region)] # d is a DT # Use ggplot2 to make one graph with multiple regions' ECDFs # and use faceting for countries ggplot(w, aes(x, y, color=region)) + geom_step() + facet_wrap(~ country) ## End(Not run)
Expands the widths of either the supercolumns or the subcolumns so that the sum of the supercolumn widths is the same as the sum of the subcolumn widths.
equalBins(widths, subwidths)
widths |
widths of the supercolumns. |
subwidths |
list of widths of the subcolumns for each supercolumn. |
This determines the correct subwidths of each of the various columns in a table for printing. The correct width of a multicolumn is determined by summing the widths of its subcolumns.
widths of the columns for a table.
Charles Dupont
mcols <- c("Group 1", "Group 2") mwidth <- nchar(mcols, type="width") spancols <- c(3,3) ccols <- c("a", "deer", "ad", "cat", "help", "bob") cwidth <- nchar(ccols, type="width") subwidths <- partition.vector(cwidth, spancols) equalBins(mwidth, subwidths)
mcols <- c("Group 1", "Group 2") mwidth <- nchar(mcols, type="width") spancols <- c(3,3) ccols <- c("a", "deer", "ad", "cat", "help", "bob") cwidth <- nchar(ccols, type="width") subwidths <- partition.vector(cwidth, spancols) equalBins(mwidth, subwidths)
Add vertical error bars to an existing plot or makes a new plot with error bars.
errbar(x, y, yplus, yminus, cap=0.015, main = NULL, sub=NULL, xlab=as.character(substitute(x)), ylab=if(is.factor(x) || is.character(x)) "" else as.character(substitute(y)), add=FALSE, lty=1, type='p', ylim=NULL, lwd=1, pch=16, errbar.col, Type=rep(1, length(y)), ...)
x |
vector of numeric x-axis values (for vertical error bars) or a factor or
character variable (for horizontal error bars, |
y |
vector of y-axis values. |
yplus |
vector of y-axis values: the tops of the error bars. |
yminus |
vector of y-axis values: the bottoms of the error bars. |
cap |
the width of the little lines at the tops and bottoms of the error bars
in units of the width of the plot. Defaults to |
main |
a main title for the plot, passed to |
sub |
a sub title for the plot, passed to |
xlab |
optional x-axis labels if |
ylab |
optional y-axis labels if |
add |
set to |
lty |
type of line for error bars |
type |
type of point. Use |
ylim |
y-axis limits. Default is to use range of |
lwd |
line width for line segments (not main line) |
pch |
character to use as the point. |
errbar.col |
color to use for drawing error bars. |
Type |
used for horizontal bars only. Is an integer vector with values |
... |
other parameters passed to all graphics functions. |
errbar
adds vertical error bars to an existing plot or makes a new
plot with error bars. It can also make a horizontal error bar plot
that shows error bars for group differences as well as bars for
groups. For the latter type of plot, the lower x-axis scale
corresponds to group estimates and the upper scale corresponds to
differences. The spacings of the two scales are identical but the
scale for differences has its origin shifted so that zero may be
included. If at least one of the confidence intervals includes zero,
a vertical dotted reference line at zero is drawn.
Charles Geyer, University of Chicago. Modified by Frank Harrell,
Vanderbilt University, to handle missing data, to add the parameters
add
and lty
, and to implement horizontal charts with differences.
set.seed(1) x <- 1:10 y <- x + rnorm(10) delta <- runif(10) errbar( x, y, y + delta, y - delta ) # Show bootstrap nonparametric CLs for 3 group means and for # pairwise differences on same graph group <- sample(c('a','b','d'), 200, TRUE) y <- runif(200) + .25*(group=='b') + .5*(group=='d') cla <- smean.cl.boot(y[group=='a'],B=100,reps=TRUE) # usually B=1000 a <- attr(cla,'reps') clb <- smean.cl.boot(y[group=='b'],B=100,reps=TRUE) b <- attr(clb,'reps') cld <- smean.cl.boot(y[group=='d'],B=100,reps=TRUE) d <- attr(cld,'reps') a.b <- quantile(a-b,c(.025,.975)) a.d <- quantile(a-d,c(.025,.975)) b.d <- quantile(b-d,c(.025,.975)) errbar(c('a','b','d','a - b','a - d','b - d'), c(cla[1],clb[1],cld[1],cla[1]-clb[1],cla[1]-cld[1],clb[1]-cld[1]), c(cla[3],clb[3],cld[3],a.b[2],a.d[2],b.d[2]), c(cla[2],clb[2],cld[2],a.b[1],a.d[1],b.d[1]), Type=c(1,1,1,2,2,2), xlab='', ylab='')
Escapes any characters that would have special meaning in a regular expression.
escapeRegex(string) escapeBS(string)
string |
string being operated on. |
escapeRegex
will escape any characters that would have
special meaning in a regular expression. For any string
grep(escapeRegex(string), string)
will always be true.
escapeBS
will escape any backslash ‘\’ in a string.
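A minimal sketch (with an arbitrary string) illustrating the property stated above:

s <- "cost ($) [approx.]"
grepl(escapeRegex(s), s)   # TRUE: the escaped pattern matches the string literally
grepl(s, s)                # FALSE: unescaped metacharacters change the meaning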
The value of the string with any characters that would have special meaning in a regular expression escaped.
Charles Dupont
Department of Biostatistics
Vanderbilt University
string <- "this\\(system) {is} [full]." escapeRegex(string) escapeBS(string)
string <- "this\\(system) {is} [full]." escapeRegex(string) escapeBS(string)
Simulate Comparisons For Use in Sequential Markov Longitudinal Clinical Trial Simulations
estSeqMarkovOrd( y, times, initial, absorb = NULL, intercepts, parameter, looks, g, formula, ppo = NULL, yprevfactor = TRUE, groupContrast = NULL, cscov = FALSE, timecriterion = NULL, coxzph = FALSE, sstat = NULL, rdsample = NULL, maxest = NULL, maxvest = NULL, nsim = 1, progress = FALSE, pfile = "" )
y |
vector of possible y values in order (numeric, character, factor) |
times |
vector of measurement times |
initial |
a vector of probabilities summing to 1.0 that specifies the frequency distribution of initial values to be sampled from. The vector must have names that correspond to values of |
absorb |
vector of absorbing states, a subset of |
intercepts |
vector of intercepts in the proportional odds model. There must be one fewer of these than the length of |
parameter |
vector of true parameter (effects; group differences) values. These are group 2:1 log odds ratios in the transition model, conditioning on the previous |
looks |
integer vector of ID numbers at which maximum likelihood estimates and their estimated variances are computed. For a single look specify a scalar value for |
g |
a user-specified function of three or more arguments which in order are |
formula |
a formula object given to the |
ppo |
a formula specifying the part of |
yprevfactor |
see |
groupContrast |
omit this argument if |
cscov |
applies if |
timecriterion |
a function of a time-ordered vector of simulated ordinal responses |
coxzph |
set to |
sstat |
set to a function of the time vector and the corresponding vector of ordinal responses for a single group if you want to compute a Wilcoxon test on a derived quantity such as the number of days in a given state. |
rdsample |
an optional function to do response-dependent sampling. It is a function of these arguments, which are vectors that stop at any absorbing state: |
maxest |
maximum acceptable absolute value of the contrast estimate, ignored if |
maxvest |
like |
nsim |
number of simulations (default is 1) |
progress |
set to |
pfile |
file to which to write progress information. Defaults to |
Simulates sequential clinical trials of longitudinal ordinal outcomes using a first-order Markov model. Looks are done sequentially after subject ID numbers given in the vector looks
with the earliest possible look being after subject 2. At each look, a subject's repeated records are either all used or all ignored depending on the subject ID number. For each true effect parameter value, simulation, and look, a function is run to compute the estimate of the parameter of interest along with its variance. For each simulation, data are first simulated for the last look, and these data are sequentially revealed for earlier looks. The user provides a function g
that has extra arguments specifying the true effect (parameter)
and the treatment group,
with treatments expected to be coded 1 and 2. parameter
is usually on the scale of a regression coefficient, e.g., a log odds ratio. Fitting is done using the rms::lrm()
function, unless non-proportional odds is allowed in which case VGAM::vglm()
is used. If timecriterion
is specified, the function also, for the last data look only, computes the first time at which the criterion is satisfied for the subject or uses the event time and event/censoring indicator computed by timecriterion
. The Cox/logrank chi-square statistic for comparing groups on the derived time variable is saved. If coxzph=TRUE
, the survival
package correlation coefficient rho
from the scaled partial residuals is also saved so that the user can later determine to what extent the Markov model resulted in the proportional hazards assumption being violated when analyzing on the time scale. vglm
is accelerated by saving the first successful fit for the largest sample size and using its coefficients as starting value for further vglm
fits for any sample size for the same setting of parameter
.
a data frame with number of rows equal to the product of nsim
, the length of looks
, and the length of parameter
, with variables sim
, parameter
, look
, est
(log odds ratio for group), and vest
(the variance of the latter). If timecriterion
is specified the data frame also contains loghr
(Cox log hazard ratio for group), lrchisq
(chi-square from Cox test for group), and if coxph=TRUE
, phchisq
, the chi-square for testing proportional hazards. The attribute etimefreq
is also present if timecriterion
is present, and it provides the frequency distribution of derived event times by group and censoring/event indicator. If sstat
is given, the attribute sstat
is also present, and it contains an array with dimensions corresponding to simulations, parameter values within simulations, id
, and a two-column subarray with columns group
and y
, the latter being the summary measure computed by the sstat
function. The returned data frame also has attribute lrmcoef
which are the last-look logistic regression coefficient estimates over the nsim
simulations and the parameter settings, and an attribute failures
which is a data frame containing the variables reason
and frequency
cataloging the reasons for unsuccessful model fits.
Frank Harrell
gbayesSeqSim()
, simMarkovOrd()
, https://hbiostat.org/R/Hmisc/markov/
Simulate Comparisons For Use in Sequential Clinical Trial Simulations
estSeqSim(parameter, looks, gendat, fitter, nsim = 1, progress = FALSE)
parameter |
vector of true parameter (effects; group differences) values |
looks |
integer vector of observation numbers at which posterior probabilities are computed |
gendat |
a function of three arguments: true parameter value (scalar), sample size for first group, sample size for second group |
fitter |
a function of two arguments: 0/1 group indicator vector and the dependent variable vector |
nsim |
number of simulations (default is 1) |
progress |
set to |
Simulates sequential clinical trials. Looks are done sequentially at observation numbers given in the vector looks
with the earliest possible look being at observation 2. For each true effect parameter value, simulation, and look, a function is run to compute the estimate of the parameter of interest along with its variance. For each simulation, data are first simulated for the last look, and these data are sequentially revealed for earlier looks. The user provides a function gendat
that given a true effect of parameter
and the two sample sizes (for treatment groups 1 and 2) returns a list with vectors y1
and y2
containing simulated data. The user also provides a function fitter
with arguments x
(group indicator 0/1) and y
(response variable) that returns a 2-vector containing the effect estimate and its variance. parameter
is usually on the scale of a regression coefficient, e.g., a log odds ratio.
a data frame with number of rows equal to the product of nsim
, the length of looks
, and the length of parameter
.
Frank Harrell
gbayesSeqSim()
, simMarkovOrd()
, estSeqMarkovOrd()
if (requireNamespace("rms", quietly = TRUE)) { # Run 100 simulations, 5 looks, 2 true parameter values # Total simulation time: 2s lfit <- function(x, y) { f <- rms::lrm.fit(x, y) k <- length(coef(f)) c(coef(f)[k], vcov(f)[k, k]) } gdat <- function(beta, n1, n2) { # Cell probabilities for a 7-category ordinal outcome for the control group p <- c(2, 1, 2, 7, 8, 38, 42) / 100 # Compute cell probabilities for the treated group p2 <- pomodm(p=p, odds.ratio=exp(beta)) y1 <- sample(1 : 7, n1, p, replace=TRUE) y2 <- sample(1 : 7, n2, p2, replace=TRUE) list(y1=y1, y2=y2) } set.seed(1) est <- estSeqSim(c(0, log(0.7)), looks=c(50, 75, 95, 100, 200), gendat=gdat, fitter=lfit, nsim=100) head(est) }
if (requireNamespace("rms", quietly = TRUE)) { # Run 100 simulations, 5 looks, 2 true parameter values # Total simulation time: 2s lfit <- function(x, y) { f <- rms::lrm.fit(x, y) k <- length(coef(f)) c(coef(f)[k], vcov(f)[k, k]) } gdat <- function(beta, n1, n2) { # Cell probabilities for a 7-category ordinal outcome for the control group p <- c(2, 1, 2, 7, 8, 38, 42) / 100 # Compute cell probabilities for the treated group p2 <- pomodm(p=p, odds.ratio=exp(beta)) y1 <- sample(1 : 7, n1, p, replace=TRUE) y2 <- sample(1 : 7, n2, p2, replace=TRUE) list(y1=y1, y2=y2) } set.seed(1) est <- estSeqSim(c(0, log(0.7)), looks=c(50, 75, 95, 100, 200), gendat=gdat, fitter=lfit, nsim=100) head(est) }
Creates an event chart on the current graphics device. Also allows the user to plot the legend in the plot area or on a separate page. Contains features useful for plotting data with time-to-event outcomes, which arise in a variety of studies including randomized clinical trials and non-randomized cohort studies. This function can use as input a matrix or a data frame, although greater utility and ease of use will be seen with a data frame.
event.chart(data, subset.r = 1:dim(data)[1], subset.c = 1:dim(data)[2], sort.by = NA, sort.ascending = TRUE, sort.na.last = TRUE, sort.after.subset = TRUE, y.var = NA, y.var.type = "n", y.jitter = FALSE, y.jitter.factor = 1, y.renum = FALSE, NA.rm = FALSE, x.reference = NA, now = max(data[, subset.c], na.rm = TRUE), now.line = FALSE, now.line.lty = 2, now.line.lwd = 1, now.line.col = 1, pty = "m", date.orig = c(1, 1, 1960), titl = "Event Chart", y.idlabels = NA, y.axis = "auto", y.axis.custom.at = NA, y.axis.custom.labels = NA, y.julian = FALSE, y.lim.extend = c(0, 0), y.lab = ifelse(is.na(y.idlabels), "", as.character(y.idlabels)), x.axis.all = TRUE, x.axis = "auto", x.axis.custom.at = NA, x.axis.custom.labels = NA, x.julian = FALSE, x.lim.extend = c(0, 0), x.scale = 1, x.lab = ifelse(x.julian, "Follow-up Time", "Study Date"), line.by = NA, line.lty = 1, line.lwd = 1, line.col = 1, line.add = NA, line.add.lty = NA, line.add.lwd = NA, line.add.col = NA, point.pch = 1:length(subset.c), point.cex = rep(0.6, length(subset.c)), point.col = rep(1, length(subset.c)), point.cex.mult = 1., point.cex.mult.var = NA, extra.points.no.mult = rep(NA, length(subset.c)), legend.plot = FALSE, legend.location = "o", legend.titl = titl, legend.titl.cex = 3, legend.titl.line = 1, legend.point.at = list(x = c(5, 95), y = c(95, 30)), legend.point.pch = point.pch, legend.point.text = ifelse(rep(is.data.frame(data), length(subset.c)), names(data[, subset.c]), subset.c), legend.cex = 2.5, legend.bty = "n", legend.line.at = list(x = c(5, 95), y = c(20, 5)), legend.line.text = names(table(as.character(data[, line.by]), exclude = c("", "NA"))), legend.line.lwd = line.lwd, legend.loc.num = 1, ...)
data |
a matrix or data frame with rows corresponding to subjects and columns corresponding to variables. Note that for a data frame or matrix containing multiple time-to-event data (e.g., time to recurrence, time to death, and time to last follow-up), one column is required for each specific event. |
subset.r |
subset of rows of original matrix or data frame to place in event chart.
Logical arguments may be used here (e.g., |
subset.c |
subset of columns of original matrix or data frame to place in event chart;
if working with a data frame, a vector of data frame variable names may be
used for subsetting purposes (e.g., |
sort.by |
column(s) or data frame variable name(s) with which to sort the chart's output.
The default is |
sort.ascending |
logical flag (which takes effect only if the argument |
sort.na.last |
logical flag (which takes effect only if the argument |
sort.after.subset |
logical flag (which takes effect only if the argument sort.by is utilized).
If |
y.var |
variable name or column number of original matrix or data frame with
which to scale y-axis.
Default is |
y.var.type |
type of variable specified in |
y.jitter |
logical flag (which takes effect only if the argument
The default of |
y.jitter.factor |
an argument used with the |
y.renum |
logical flag. If |
NA.rm |
logical flag. If |
x.reference |
column of original matrix or data frame with which to reference the x-axis.
That is, if specified, all columns specified in |
now |
the “now” date which will be used for top of y-axis
when creating the Goldman eventchart (see reference below).
Default is |
now.line |
logical flag. A feature utilized by the Goldman Eventchart.
When |
now.line.lty |
line type of |
now.line.lwd |
line width of |
now.line.col |
color of |
pty |
graph option, |
date.orig |
date of origin to consider if dates are in julian, SAS , or S-Plus dates
object format; default is January 1, 1960 (which is the default origin
used by both S-Plus and SAS). Utilized when either
|
titl |
title for event chart. Default is 'Event Chart'. |
y.idlabels |
column or data frame variable name used for y-axis labels. For example,
if |
y.axis |
character string specifying whether program will control labelling
of y-axis (with argument |
y.axis.custom.at |
user-specified vector of y-axis label locations.
Must be used when |
y.axis.custom.labels |
user-specified vector of y-axis labels.
Must be used when |
y.julian |
logical flag (which will only be considered if |
y.lim.extend |
two-dimensional vector representing the number of units that the user
wants to increase |
y.lab |
single label to be used for entire y-axis. Default will be the variable name
or column number of |
x.axis.all |
logical flag. If |
x.axis |
character string specifying whether program will control labelling
of x-axis (with argument |
x.axis.custom.at |
user-specified vector of x-axis label locations.
Must be used when |
x.axis.custom.labels |
user-specified vector of x-axis labels.
Must be used when |
x.julian |
logical flag (which will only be considered if |
x.lim.extend |
two-dimensional vector representing the number of time units (usually in days)
that the user wants to increase |
x.scale |
a factor whose reciprocal is multiplied to original units of the
x-axis. For example, if the original data frame is in units of days,
|
x.lab |
single label to be used for entire x-axis. Default will be “On Study Date”
if |
line.by |
column or data frame variable name for plotting unique lines by unique
values of vector (e.g., specify |
line.lty |
vector of line types corresponding to ascending order of |
line.lwd |
vector of line widths corresponding to ascending order of |
line.col |
vector of line colors corresponding to ascending order of |
line.add |
a 2xk matrix with k=number of pairs of additional line segments to add.
For example, if it is of interest to draw additional line segments
connecting events one and two, two and three, and four and five,
(possibly with different colors), an appropriate The convention use of If NOTE: The drawing of the original default line
may be suppressed (with |
line.add.lty |
a kx1 vector corresponding to the columns of |
line.add.lwd |
a kx1 vector corresponding to the columns of |
line.add.col |
a kx1 vector corresponding to the columns of |
point.pch |
vector of |
point.cex |
vector of size of points representing each event.
If |
point.col |
vector of colors of points representing each event.
If |
point.cex.mult |
a single number (may be non-integer), which is the base multiplier for the value of
the |
point.cex.mult.var |
vector of variables to be used in determining what point.cex.mult is multiplied by
for determining size of plotted points from (possibly a subset of)
|
extra.points.no.mult |
vector of variables in the dataset to ignore for purposes of using
|
legend.plot |
logical flag; if |
legend.location |
will be used only if |
legend.titl |
title for the legend; default is title to be used for main plot.
Only used when |
legend.titl.cex |
size of text for legend title. Only used when |
legend.titl.line |
line location of legend title dictated by |
legend.point.at |
location of upper left and lower right corners of legend area to be utilized for describing events via points and text. |
legend.point.pch |
vector of |
legend.point.text |
text to be used for describing events; the default is setup for a data frame,
as it will print the names of the columns specified by |
legend.cex |
size of text for points and event descriptions. Default is 2.5 which is setup
for |
legend.bty |
option to put a box around the legend(s); default is to have no box
( |
legend.line.at |
if |
legend.line.text |
text to be used for describing |
legend.line.lwd |
vector of line widths corresponding to |
legend.loc.num |
number used for locator argument when |
... |
additional par arguments for use in main plot. |
if you want to put, say, two eventcharts side-by-side, in a plot
region, you should not set up par(mfrow=c(1,2))
before running the
first plot. Instead, you should add the argument mfg=c(1,1,1,2)
to the first plot call followed by the argument mfg=c(1,2,1,2)
to the second plot call.
if dates in the original data frame are in a specialized form
(e.g., mm/dd/yy) of mode CHARACTER, the user must convert those columns to
become class dates or julian numeric mode (see Date
for more information).
For example, in a data frame called testdata
, with specialized
dates in columns 4 thru 10, the following code could be used:
as.numeric(dates(testdata[,4:10]))
. This will convert the columns
to numeric julian dates based on the function's default origin
of January 1, 1960. If original dates are in class dates or julian form,
no extra work is necessary.
In survival analysis, the data typically come in two
columns: one column containing survival time and the other
containing censoring indicator or event code. The
event.convert
function converts this type of data into
multiple columns of event times, one column of each event
type, suitable for the event.chart
function.
an event chart is created on the current graphics device. If legend.plot=TRUE and legend.location='o', a one-page legend will precede the event chart. Please note that on completion of the function, par parameters are reset to those in effect before the function was called.
J. Jack Lee and Kenneth R. Hess
Department of Biostatistics
University of Texas
M.D. Anderson Cancer Center
Houston, TX 77030
[email protected], [email protected]
Joel A. Dubin
Department of Statistics
University of Waterloo
[email protected]
Lee J.J., Hess, K.R., Dubin, J.A. (2000). Extensions and applications of event charts. The American Statistician, 54:1, 63–70.
Dubin, J.A., Lee, J.J., Hess, K.R. (1997). The Utility of Event Charts. Proceedings of the Biometrics Section, American Statistical Association.
Dubin, J.A., Muller H-G, Wang J-L (2001). Event history graphs for censored survival data. Statistics in Medicine, 20: 2951–2964.
Goldman, A.I. (1992). EVENTCHARTS: Visualizing Survival and Other Timed-Events Data. The American Statistician, 46:1, 13–18.
# The sample data set is an augmented CDC AIDS dataset (ASCII) # which is used in the examples in the help file. This dataset is # described in Kalbfleisch and Lawless (JASA, 1989). # Here, we have included only children 4 years old and younger. # We have also added a new field, dethdate, which # represents a fictitious death date for each patient. There was # no recording of death date on the original dataset. In addition, we have # added a fictitious viral load reading (copies/ml) for each patient at time of AIDS diagnosis, # noting viral load was also not part of the original dataset. # # All dates are julian with julian=0 being # January 1, 1960, and julian=14000 being 14000 days beyond # January 1, 1960 (i.e., May 1, 1998). cdcaids <- data.frame( age=c(4,2,1,1,2,2,2,4,2,1,1,3,2,1,3,2,1,2,4,2,2,1,4,2,4,1,4,2,1,1,3,3,1,3), infedate=c( 7274,7727,7949,8037,7765,8096,8186,7520,8522,8609,8524,8213,8455,8739, 8034,8646,8886,8549,8068,8682,8612,9007,8461,8888,8096,9192,9107,9001, 9344,9155,8800,8519,9282,8673), diagdate=c( 8100,8158,8251,8343,8463,8489,8554,8644,8713,8733,8854,8855,8863,8983, 9035,9037,9132,9164,9186,9221,9224,9252,9274,9404,9405,9433,9434,9470, 9470,9472,9489,9500,9585,9649), diffdate=c( 826,431,302,306,698,393,368,1124,191,124,330,642,408,244,1001,391,246, 615,1118,539,612,245,813,516,1309,241,327,469,126,317,689,981,303,976), dethdate=c( 8434,8304,NA,8414,8715,NA,8667,9142,8731,8750,8963,9120,9005,9028,9445, 9180,9189,9406,9711,9453,9465,9289,9640,9608,10010,9488,9523,9633,9667, 9547,9755,NA,9686,10084), censdate=c( NA,NA,8321,NA,NA,8519,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA, NA,NA,NA,NA,NA,NA,NA,NA,NA,10095,NA,NA), viralload=c( 13000,36000,70000,90000,21000,110000,75000,12000,125000,110000,13000,39000,79000,135000,14000, 42000,123000,20000,12000,18000,16000,140000,16000,58000,11000,120000,85000,31000,24000,115000, 17000,13100,72000,13500) ) cdcaids <- upData(cdcaids, labels=c(age ='Age, y', infedate='Date of blood transfusion', diagdate='Date of AIDS diagnosis', diffdate='Incubation period (days from HIV to AIDS)', dethdate='Fictitious date of death', censdate='Fictitious censoring date', viralload='Fictitious viral load')) # Note that the style options listed with these # examples are best suited for output to a postscript file (i.e., using # the postscript function with horizontal=TRUE) as opposed to a graphical # window (e.g., motif). 
# To produce simple calendar event chart (with internal legend): # postscript('example1.ps', horizontal=TRUE) event.chart(cdcaids, subset.c=c('infedate','diagdate','dethdate','censdate'), x.lab = 'observation dates', y.lab='patients (sorted by AIDS diagnosis date)', titl='AIDS data calendar event chart 1', point.pch=c(1,2,15,0), point.cex=c(1,1,0.8,0.8), legend.plot=TRUE, legend.location='i', legend.cex=1.0, legend.point.text=c('transfusion','AIDS diagnosis','death','censored'), legend.point.at = list(c(7210, 8100), c(35, 27)), legend.bty='o') # To produce simple interval event chart (with internal legend): # postscript('example2.ps', horizontal=TRUE) event.chart(cdcaids, subset.c=c('infedate','diagdate','dethdate','censdate'), x.lab = 'time since transfusion (in days)', y.lab='patients (sorted by AIDS diagnosis date)', titl='AIDS data interval event chart 1', point.pch=c(1,2,15,0), point.cex=c(1,1,0.8,0.8), legend.plot=TRUE, legend.location='i', legend.cex=1.0, legend.point.text=c('transfusion','AIDS diagnosis','death','censored'), x.reference='infedate', x.julian=TRUE, legend.bty='o', legend.point.at = list(c(1400, 1950), c(7, -1))) # To produce simple interval event chart (with internal legend), # but now with flexible diagdate symbol size based on viral load variable: # postscript('example2a.ps', horizontal=TRUE) event.chart(cdcaids, subset.c=c('infedate','diagdate','dethdate','censdate'), x.lab = 'time since transfusion (in days)', y.lab='patients (sorted by AIDS diagnosis date)', titl='AIDS data interval event chart 1a, with viral load at diagdate represented', point.pch=c(1,2,15,0), point.cex=c(1,1,0.8,0.8), point.cex.mult = 0.00002, point.cex.mult.var = 'viralload', extra.points.no.mult = c(1,NA,1,1), legend.plot=TRUE, legend.location='i', legend.cex=1.0, legend.point.text=c('transfusion','AIDS diagnosis','death','censored'), x.reference='infedate', x.julian=TRUE, legend.bty='o', legend.point.at = list(c(1400, 1950), c(7, -1))) # To produce more complicated interval chart which is # referenced by infection date, and sorted by age and incubation period: # postscript('example3.ps', horizontal=TRUE) event.chart(cdcaids, subset.c=c('infedate','diagdate','dethdate','censdate'), x.lab = 'time since diagnosis of AIDS (in days)', y.lab='patients (sorted by age and incubation length)', titl='AIDS data interval event chart 2 (sorted by age, incubation)', point.pch=c(1,2,15,0), point.cex=c(1,1,0.8,0.8), legend.plot=TRUE, legend.location='i',legend.cex=1.0, legend.point.text=c('transfusion','AIDS diagnosis','death','censored'), x.reference='diagdate', x.julian=TRUE, sort.by=c('age','diffdate'), line.by='age', line.lty=c(1,3,2,4), line.lwd=rep(1,4), line.col=rep(1,4), legend.bty='o', legend.point.at = list(c(-1350, -800), c(7, -1)), legend.line.at = list(c(-1350, -800), c(16, 8)), legend.line.text=c('age = 1', ' = 2', ' = 3', ' = 4')) # To produce the Goldman chart: # postscript('example4.ps', horizontal=TRUE) event.chart(cdcaids, subset.c=c('infedate','diagdate','dethdate','censdate'), x.lab = 'time since transfusion (in days)', y.lab='dates of observation', titl='AIDS data Goldman event chart 1', y.var = c('infedate'), y.var.type='d', now.line=TRUE, y.jitter=FALSE, point.pch=c(1,2,15,0), point.cex=c(1,1,0.8,0.8), mgp = c(3.1,1.6,0), legend.plot=TRUE, legend.location='i',legend.cex=1.0, legend.point.text=c('transfusion','AIDS diagnosis','death','censored'), x.reference='infedate', x.julian=TRUE, legend.bty='o', legend.point.at = list(c(1500, 2800), c(9300, 10000))) # To convert coded 
time-to-event data, then, draw an event chart: surv.time <- c(5,6,3,1,2) cens.ind <- c(1,0,1,1,0) surv.data <- cbind(surv.time,cens.ind) event.data <- event.convert(surv.data) event.chart(cbind(rep(0,5),event.data),x.julian=TRUE,x.reference=1)
Convert a two-column data matrix with event time and event code into multiple column event time with one event in each column
event.convert(data2, event.time = 1, event.code = 2)
event.convert(data2, event.time = 1, event.code = 2)
data2 |
a matrix or data frame with at least 2 columns; by default, the first column contains the event time and the second column contains the k event codes (e.g., 1=dead, 0=censored) |
event.time |
the column number in data that contains the event time |
event.code |
the column number in data that contains the event code |
In survival analysis, the data typically come in two columns: one column containing the survival time and the other containing a censoring indicator or event code. The event.convert function converts this type of data into multiple columns of event times, one column for each event type, suitable for the event.chart function.
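To make the resulting layout concrete, here is a minimal base-R sketch of the reshaping just described (illustration only; the column names are hypothetical, and event.convert itself should be used in practice):

surv.time <- c(5, 6, 3, 1, 2)    # survival/censoring times
cens.ind  <- c(1, 0, 1, 1, 0)    # event codes: 1=dead, 0=censored
# one column of event times per event code, NA where that code does not apply
data.frame(time.dead     = ifelse(cens.ind == 1, surv.time, NA),
           time.censored = ifelse(cens.ind == 0, surv.time, NA))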
J. Jack Lee and Kenneth R. Hess
Department of Biostatistics
University of Texas
M.D. Anderson Cancer Center
Houston, TX 77030
[email protected], [email protected]
Joel A. Dubin
Department of Statistics
University of Waterloo
[email protected]
event.history
, Date
, event.chart
# To convert coded time-to-event data, then, draw an event chart: surv.time <- c(5,6,3,1,2) cens.ind <- c(1,0,1,1,0) surv.data <- cbind(surv.time,cens.ind) event.data <- event.convert(surv.data) event.chart(cbind(rep(0,5),event.data),x.julian=TRUE,x.reference=1)
Produces an event history graph for right-censored survival data, including time-dependent covariate status, as described in Dubin, Muller, and Wang (2001). Effectively, a Kaplan-Meier curve is produced with supplementary information regarding individual survival information, censoring information, and status over time of an individual time-dependent covariate or time-dependent covariate function for both uncensored and censored individuals.
event.history(data, survtime.col, surv.col, surv.ind = c(1, 0), subset.rows = NULL, covtime.cols = NULL, cov.cols = NULL, num.colors = 1, cut.cov = NULL, colors = 1, cens.density = 10, mult.end.cens = 1.05, cens.mark.right =FALSE, cens.mark = "-", cens.mark.ahead = 0.5, cens.mark.cutoff = -1e-08, cens.mark.cex = 1, x.lab = "time under observation", y.lab = "estimated survival probability", title = "event history graph", ...)
event.history(data, survtime.col, surv.col, surv.ind = c(1, 0), subset.rows = NULL, covtime.cols = NULL, cov.cols = NULL, num.colors = 1, cut.cov = NULL, colors = 1, cens.density = 10, mult.end.cens = 1.05, cens.mark.right =FALSE, cens.mark = "-", cens.mark.ahead = 0.5, cens.mark.cutoff = -1e-08, cens.mark.cex = 1, x.lab = "time under observation", y.lab = "estimated survival probability", title = "event history graph", ...)
data |
A matrix or data frame with rows corresponding to units (often individuals) and columns corresponding to survival time, event/censoring indicator. Also, multiple columns may be devoted to time-dependent covariate level and time change. |
survtime.col |
Column (in data) representing minimum of time-to-event or right-censoring time for individual. |
surv.col |
Column (in data) representing event indicator for an individual. Though, traditionally, such an indicator will be 1 for an event and 0 for a censored observation, this indicator can be represented by any two numbers, made explicit by the surv.ind argument. |
surv.ind |
Two-element vector representing, respectively, the
number for an event, as listed in |
subset.rows |
Subset of rows of original matrix or data frame (data) to
place in event history graph.
Logical arguments may be used here (e.g., |
covtime.cols |
Column(s) (in data) representing the time when change of time-dependent
covariate (or time-dependent covariate function) occurs.
There should be a unique non- |
cov.cols |
Column(s) (in data) representing the level of the time-dependent
covariate (or time-dependent covariate function). There should be
a unique non- |
num.colors |
Colors are utilized for the time-dependent covariate level for an
individual. This argument provides the number of unique covariate
levels which will be displayed by mapping the number of colors
(via |
cut.cov |
This argument allows the user to explicitly state how to
define the intervals for the time-dependent covariate, such that
different colors will be allocated to the user-defined covariate levels.
For example, for plotting five colors, six ordered points within the
span of the data's covariate levels should be provided.
Default is |
colors |
This is a vector argument defining the actual colors used
for the time-dependent covariate levels in the plot, with the
index of this vector corresponding to the ordered levels
of the covariate. The number of colors (i.e., the length
of the colors vector) should correspond to the
value provided to the |
cens.density |
This will provide the shading density at the end of the
individual bars for those who are censored. For more information
on shading density, see the density argument in the S-Plus
polygon function. Default is |
mult.end.cens |
This is a multiplier that extends the length of the longest surviving individual bar (or bars, if a tie exists) if right-censored, presuming that no event times eventually follow this final censored time. Default extends the length 5 percent beyond the length of the observed right-censored survival time. |
cens.mark.right |
A logical argument that states whether an explicit mark should be placed to the right of the individual right-censored survival bars. This argument is most useful for large sample sizes, where it may be hard to detect the special shading via cens.density, particularly for the short-term survivors. |
cens.mark |
Character argument which describes the censored mark that should be
used if |
cens.mark.ahead |
A numeric argument, which specifies the absolute distance to be placed between the individual right-censored survival bars and the mark as defined in the above cens.mark argument. Default is 0.5 (that is, half a day, if survival time is measured in days), but may very well need adjusting depending on the maximum survival time observed in the dataset. |
cens.mark.cutoff |
A negative number very close to 0
(by default |
cens.mark.cex |
Numeric argument defining the size of the mark defined in
the |
x.lab |
Single label to be used for entire x-axis.
Default is |
y.lab |
Single label to be used for entire y-axis.
Default is |
title |
Title for the event history graph.
Default is |
... |
This allows arguments to the plot function call within
the |
In order to focus on a particular area of the event history graph,
zooming can be performed. This is best done by
specifying appropriate xlim
and ylim
arguments at the end of the event.history
function call,
taking advantage of the ...
argument link to the plot function.
An example of zooming can be seen
in Plate 4 of the paper referenced below.
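For example, a call of roughly the following form passes xlim and ylim through to plot so as to zoom in on the first 100 days and the upper half of the survival scale (a sketch only; mydata and its columns are hypothetical):

event.history(mydata, survtime.col = mydata$time, surv.col = mydata$status,
              xlim = c(0, 100), ylim = c(0.5, 1))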
Please read the reference below to understand how the individual covariate and survival information is provided in the plot, how ties are handled, how right-censoring is handled, etc.
This function has been tested thoroughly, but only within a restricted set of versions and environments, namely S-Plus 2000 (Version 3) and S-Plus 6.0 (Version 2), both on a Windows 2000 machine. Hence, we cannot currently vouch for the function's effectiveness in other versions of S-Plus (e.g., S-Plus 3.4) or in other operating environments (e.g., Windows 95, Linux, or Unix). The function has also been verified to work on R under Linux.
The authors have found better control of the use of color by producing the graphs via the postscript plotting device in S-Plus. In fact, the provided examples utilize the postscript function. However, your past experiences may be different, and you may prefer to control color directly (to the graphsheet in Windows environment, for example). The event.history function will work with either approach.
Joel Dubin
[email protected]
Dubin, J.A., Muller, H.-G., and Wang, J.-L. (2001). Event history graphs for censored survival data. Statistics in Medicine, 20, 2951-2964.
plot
,polygon
,
event.chart
, par
# Code to produce event history graphs for SIM paper # # before generating plots, some pre-processing needs to be performed, # in order to get dataset in proper form for event.history function; # need to create one line per subject and sort by time under observation, # with those experiencing event coming before those tied with censoring time; require('survival') data(heart) # creation of event.history version of heart dataset (call heart.one): heart.one <- matrix(nrow=length(unique(heart$id)), ncol=8) for(i in 1:length(unique(heart$id))) { if(length(heart$id[heart$id==i]) == 1) heart.one[i,] <- as.numeric(unlist(heart[heart$id==i, ])) else if(length(heart$id[heart$id==i]) == 2) heart.one[i,] <- as.numeric(unlist(heart[heart$id==i,][2,])) } heart.one[,3][heart.one[,3] == 0] <- 2 ## converting censored events to 2, from 0 if(is.factor(heart$transplant)) heart.one[,7] <- heart.one[,7] - 1 ## getting back to correct transplantation coding heart.one <- as.data.frame(heart.one[order(unlist(heart.one[,2]), unlist(heart.one[,3])),]) names(heart.one) <- names(heart) # back to usual censoring indicator: heart.one[,3][heart.one[,3] == 2] <- 0 # note: transplant says 0 (for no transplants) or 1 (for one transplant) # and event = 1 is death, while event = 0 is censored # plot single Kaplan-Meier curve from heart data, first creating survival object heart.surv <- survfit(Surv(stop, event) ~ 1, data=heart.one, conf.int = FALSE) # figure 3: traditional Kaplan-Meier curve # postscript('ehgfig3.ps', horiz=TRUE) # omi <- par(omi=c(0,1.25,0.5,1.25)) plot(heart.surv, ylab='estimated survival probability', xlab='observation time (in days)') title('Figure 3: Kaplan-Meier curve for Stanford data', cex=0.8) # dev.off() ## now, draw event history graph for Stanford heart data; use as Figure 4 # postscript('ehgfig4.ps', horiz=TRUE, colors = seq(0, 1, len=20)) # par(omi=c(0,1.25,0.5,1.25)) event.history(heart.one, survtime.col=heart.one[,2], surv.col=heart.one[,3], covtime.cols = cbind(rep(0, dim(heart.one)[1]), heart.one[,1]), cov.cols = cbind(rep(0, dim(heart.one)[1]), heart.one[,7]), num.colors=2, colors=c(6,10), x.lab = 'time under observation (in days)', title='Figure 4: Event history graph for\nStanford data', cens.mark.right =TRUE, cens.mark = '-', cens.mark.ahead = 30.0, cens.mark.cex = 0.85) # dev.off() # now, draw age-stratified event history graph for Stanford heart data; # use as Figure 5 # two plots, stratified by age status # postscript('c:\temp\ehgfig5.ps', horiz=TRUE, colors = seq(0, 1, len=20)) # par(omi=c(0,1.25,0.5,1.25)) par(mfrow=c(1,2)) event.history(data=heart.one, subset.rows = (heart.one[,4] < 0), survtime.col=heart.one[,2], surv.col=heart.one[,3], covtime.cols = cbind(rep(0, dim(heart.one)[1]), heart.one[,1]), cov.cols = cbind(rep(0, dim(heart.one)[1]), heart.one[,7]), num.colors=2, colors=c(6,10), x.lab = 'time under observation\n(in days)', title = 'Figure 5a:\nStanford data\n(age < 48)', cens.mark.right =TRUE, cens.mark = '-', cens.mark.ahead = 40.0, cens.mark.cex = 0.85, xlim=c(0,1900)) event.history(data=heart.one, subset.rows = (heart.one[,4] >= 0), survtime.col=heart.one[,2], surv.col=heart.one[,3], covtime.cols = cbind(rep(0, dim(heart.one)[1]), heart.one[,1]), cov.cols = cbind(rep(0, dim(heart.one)[1]), heart.one[,7]), num.colors=2, colors=c(6,10), x.lab = 'time under observation\n(in days)', title = 'Figure 5b:\nStanford data\n(age >= 48)', cens.mark.right =TRUE, cens.mark = '-', cens.mark.ahead = 40.0, cens.mark.cex = 0.85, xlim=c(0,1900)) # dev.off() # par(omi=omi) # we will not 
show liver cirrhosis data manipulation, as it was # a bit detailed; however, here is the # event.history code to produce Figure 7 / Plate 1 # Figure 7 / Plate 1 : prothrombin ehg with color ## Not run: second.arg <- 1 ### second.arg is for shading third.arg <- c(rep(1,18),0,1) ### third.arg is for intensity # postscript('c:\temp\ehgfig7.ps', horiz=TRUE, # colors = cbind(seq(0, 1, len = 20), second.arg, third.arg)) # par(omi=c(0,1.25,0.5,1.25), col=19) event.history(cirrhos2.eh, subset.rows = NULL, survtime.col=cirrhos2.eh$time, surv.col=cirrhos2.eh$event, covtime.cols = as.matrix(cirrhos2.eh[, ((2:18)*2)]), cov.cols = as.matrix(cirrhos2.eh[, ((2:18)*2) + 1]), cut.cov = as.numeric(quantile(as.matrix(cirrhos2.eh[, ((2:18)*2) + 1]), c(0,.2,.4,.6,.8,1), na.rm=TRUE) + c(-1,0,0,0,0,1)), colors=c(20,4,8,11,14), x.lab = 'time under observation (in days)', title='Figure 7: Event history graph for liver cirrhosis data (color)', cens.mark.right =TRUE, cens.mark = '-', cens.mark.ahead = 100.0, cens.mark.cex = 0.85) # dev.off() ## End(Not run)
Extract Labels and Units From Multiple Datasets
extractlabs(..., print = TRUE)
extractlabs(..., print = TRUE)
... |
one or more data frames or data tables |
print |
set to |
For one or more data frames/tables, extracts all labels and units and combines them over datasets, dropping any variables not having either labels or units defined. The resulting data table is returned and is used by the hlab function if the user stores the result in an object named LabelsUnits
. The result is NULL
if no variable in any dataset has a non-blank label
or units
. Variables found in more than one dataset with duplicate label
and units
are consolidated. A warning message is issued when duplicate variables have conflicting labels or units, and by default, details are printed. No attempt is made to resolve these conflicts.
a data table
Frank Harrell
label()
, contents()
, units()
, hlab()
d <- data.frame(x=1:10, y=(1:10)/10) d <- upData(d, labels=c(x='X', y='Y'), units=c(x='mmHg'), print=FALSE) d2 <- d units(d2$x) <- 'cm' LabelsUnits <- extractlabs(d, d2) LabelsUnits
General File Import Using rio
fImport( file, format, lowernames = c("not mixed", "no", "yes"), und. = FALSE, ... )
fImport( file, format, lowernames = c("not mixed", "no", "yes"), und. = FALSE, ... )
file |
name of file to import, or full URL. |
format |
format of file to import, usually not needed. See |
lowernames |
defaults to changing variable names to all lower case unless the name has mixed upper and lower case, in which case the original characters in the name are kept. Set |
und. |
set to |
... |
more arguments to pass to |
This is a front-end for the rio
package's import
function. fImport
includes options for setting variable names to lower case and for changing underscores in names to periods. Variables on the imported data frame that have labels are converted to the Hmisc package labelled class so that subsetting the data frame will preserve the labels.
a data frame created by rio
, unless a rio
option is given to use another format
Frank Harrell
upData
, especially the moveUnits
option
## Not run: # Get a Stata dataset d <- fImport('http://www.principlesofeconometrics.com/stata/alcohol.dta') contents(d) ## End(Not run)
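A brief sketch of the naming options described above (the file name is hypothetical):

## Not run:
d <- fImport('mydata.sav', lowernames='no', und.=TRUE)  # keep original case; underscores become periods
contents(d)
## End(Not run)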
Compares each row in x
against all the rows in y
, finding rows in
y
with all columns within a tolerance of the values a given row of
x
. The default tolerance
tol
is zero, i.e., an exact match is required on all columns.
For qualifying matches, a distance measure is computed. This is
the sum of squares of differences between x
and y
after scaling
the columns. The default scaling values are tol
, and for columns
with tol=1
the scale values are set to 1.0 (since they are ignored
anyway). Matches (up to maxmatch
of them) are stored and listed in order of
increasing distance.
The summary
method prints a frequency distribution of the
number of matches per observation in x
, the median of the minimum
distances for all matches per x
, as a function of the number of matches,
and the frequency of selection of duplicate observations as those having
the smallest distance. The print
method prints the entire matches
and distance
components of the result from find.matches
.
matchCases
finds all controls that match cases on a single variable
x
within a tolerance of tol
. This is intended for prospective
cohort studies that use matching for confounder adjustment (even
though regression models usually work better).
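As a small illustration of the distance measure described above (a sketch that mirrors the documented definition rather than the internal code):

xrow  <- c(0.09, 0.21)          # one row of x
yrow  <- c(0.10, 0.20)          # a candidate matching row of y
scale <- c(0.05, 0.05)          # scaling constants (default is tol)
sum(((xrow - yrow) / scale)^2)  # distance: sum of squared scaled differences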
find.matches(x, y, tol=rep(0, ncol(y)), scale=tol, maxmatch=10) ## S3 method for class 'find.matches' summary(object, ...) ## S3 method for class 'find.matches' print(x, digits, ...) matchCases(xcase, ycase, idcase=names(ycase), xcontrol, ycontrol, idcontrol=names(ycontrol), tol=NULL, maxobs=max(length(ycase),length(ycontrol))*10, maxmatch=20, which=c('closest','random'))
find.matches(x, y, tol=rep(0, ncol(y)), scale=tol, maxmatch=10) ## S3 method for class 'find.matches' summary(object, ...) ## S3 method for class 'find.matches' print(x, digits, ...) matchCases(xcase, ycase, idcase=names(ycase), xcontrol, ycontrol, idcontrol=names(ycontrol), tol=NULL, maxobs=max(length(ycase),length(ycontrol))*10, maxmatch=20, which=c('closest','random'))
x |
a numeric matrix or the result of |
y |
a numeric matrix with same number of columns as |
xcase |
numeric vector to match on for cases |
xcontrol |
numeric vector to match on for controls, not necessarily
the same length as |
ycase |
a vector or matrix |
ycontrol |
|
tol |
a vector of tolerances with number of elements the same as the number
of columns of |
scale |
a vector of scaling constants with number of elements the same as the
number of columns of |
maxmatch |
maximum number of matches to allow. For |
object |
an object created by |
digits |
number of digits to use in printing distances |
idcase |
vector the same length as |
idcontrol |
|
maxobs |
maximum number of cases and all matching controls combined (maximum
dimension of data frame resulting from |
which |
set to |
... |
unused |
find.matches
returns a list of class find.matches
with elements
matches
and distance
.
Both elements are matrices with the number of rows equal to the number
of rows in x
, and with k
columns, where k
is the maximum number of
matches (<= maxmatch
) that occurred. The elements of matches
are row identifiers of y
that match, with zeros if fewer than
maxmatch
matches are found (blanks if y
had row names).
matchCases
returns a data frame with variables idcase
(id of case
currently being matched), type
(factor variable with levels "case"
and "control"
), id
(id of case if case row, or id of matching
case), and y
.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
Ming K, Rosenbaum PR (2001): A note on optimal matching with variable controls using the assignment algorithm. J Comp Graph Stat 10:455–463.
Cepeda MS, Boston R, Farrar JT, Strom BL (2003): Optimal matching with a variable number of controls vs. a fixed number of controls for a cohort study: trade-offs. J Clin Epidemiology 56:230-237. Note: These papers were not used for the functions here but probably should have been.
y <- rbind(c(.1, .2),c(.11, .22), c(.3, .4), c(.31, .41), c(.32, 5)) x <- rbind(c(.09,.21), c(.29,.39)) y x w <- find.matches(x, y, maxmatch=5, tol=c(.05,.05)) set.seed(111) # so can replicate results x <- matrix(runif(500), ncol=2) y <- matrix(runif(2000), ncol=2) w <- find.matches(x, y, maxmatch=5, tol=c(.02,.03)) w$matches[1:5,] w$distance[1:5,] # Find first x with 3 or more y-matches num.match <- apply(w$matches, 1, function(x)sum(x > 0)) j <- ((1:length(num.match))[num.match > 2])[1] x[j,] y[w$matches[j,],] summary(w) # For many applications would do something like this: # attach(df1) # x <- cbind(age, sex) # Just do as.matrix(df1) if df1 has no factor objects # attach(df2) # y <- cbind(age, sex) # mat <- find.matches(x, y, tol=c(5,0)) # exact match on sex, 5y on age # Demonstrate matchCases xcase <- c(1,3,5,12) xcontrol <- 1:6 idcase <- c('A','B','C','D') idcontrol <- c('a','b','c','d','e','f') ycase <- c(11,33,55,122) ycontrol <- c(11,22,33,44,55,66) matchCases(xcase, ycase, idcase, xcontrol, ycontrol, idcontrol, tol=1) # If y is a binary response variable, the following code # will produce a Mantel-Haenszel summary odds ratio that # utilizes the matching. # Standard variance formula will not work here because # a control will match more than one case # WARNING: The M-H procedure exemplified here is suspect # because of the small strata and widely varying number # of controls per case. x <- c(1, 2, 3, 3, 3, 6, 7, 12, 1, 1:7) y <- c(0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1) case <- c(rep(TRUE, 8), rep(FALSE, 8)) id <- 1:length(x) m <- matchCases(x[case], y[case], id[case], x[!case], y[!case], id[!case], tol=1) iscase <- m$type=='case' # Note: the first tapply on insures that event indicators are # sorted by case id. The second actually does something. event.case <- tapply(m$y[iscase], m$idcase[iscase], sum) event.control <- tapply(m$y[!iscase], m$idcase[!iscase], sum) n.control <- tapply(!iscase, m$idcase, sum) n <- tapply(m$y, m$idcase, length) or <- sum(event.case * (n.control - event.control) / n) / sum(event.control * (1 - event.case) / n) or # Bootstrap this estimator by sampling with replacement from # subjects. Assumes id is unique when combine cases+controls # (id was constructed this way above). The following algorithms # puts all sampled controls back with the cases to whom they were # originally matched. ids <- unique(m$id) idgroups <- split(1:nrow(m), m$id) B <- 50 # in practice use many more ors <- numeric(B) # Function to order w by ids, leaving unassigned elements zero align <- function(ids, w) { z <- structure(rep(0, length(ids)), names=ids) z[names(w)] <- w z } for(i in 1:B) { j <- sample(ids, replace=TRUE) obs <- unlist(idgroups[j]) u <- m[obs,] iscase <- u$type=='case' n.case <- align(ids, tapply(u$type, u$idcase, function(v)sum(v=='case'))) n.control <- align(ids, tapply(u$type, u$idcase, function(v)sum(v=='control'))) event.case <- align(ids, tapply(u$y[iscase], u$idcase[iscase], sum)) event.control <- align(ids, tapply(u$y[!iscase], u$idcase[!iscase], sum)) n <- n.case + n.control # Remove sets having 0 cases or 0 controls in resample s <- n.case > 0 & n.control > 0 denom <- sum(event.control[s] * (n.case[s] - event.case[s]) / n[s]) or <- if(denom==0) NA else sum(event.case[s] * (n.control[s] - event.control[s]) / n[s]) / denom ors[i] <- or } describe(ors)
first.word
finds the first word in an expression. A word is defined by
unlisting the elements of the expression found by the S parser and then
accepting any elements whose first character is either a letter or period.
The principal intended use is for the automatic generation of temporary
file names where it is important to exclude special characters from
the file name. For Microsoft Windows, periods in names are deleted and
only up to the first 8 characters of the word are returned.
first.word(x, i=1, expr=substitute(x))
first.word(x, i=1, expr=substitute(x))
x |
any scalar character string |
i |
word number, default value = 1. Used when the second or |
expr |
any S object of mode |
a character string
Frank E. Harrell, Jr.,
Department of Biostatistics,
Vanderbilt University,
[email protected]
Richard M. Heiberger,
Department of Statistics,
Temple University, Philadelphia, PA.
[email protected]
first.word(expr=expression(y ~ x + log(w)))
first.word(expr=expression(y ~ x + log(w)))
format.df
does appropriate rounding and decimal alignment, and outputs
a character matrix containing the formatted data. If x
is a
data.frame
, then do each component separately.
If x
is a matrix, but not a data.frame, make it a data.frame
with individual components for the columns.
If a component x$x
is a matrix, then do all columns the same.
format.df(x, digits, dec=NULL, rdec=NULL, cdec=NULL, numeric.dollar=!dcolumn, na.blank=FALSE, na.dot=FALSE, blank.dot=FALSE, col.just=NULL, cdot=FALSE, dcolumn=FALSE, matrix.sep=' ', scientific=c(-4,4), math.row.names=FALSE, already.math.row.names=FALSE, math.col.names=FALSE, already.math.col.names=FALSE, double.slash=FALSE, format.Date="%m/%d/%Y", format.POSIXt="%m/%d/%Y %H:%M:%OS", ...)
format.df(x, digits, dec=NULL, rdec=NULL, cdec=NULL, numeric.dollar=!dcolumn, na.blank=FALSE, na.dot=FALSE, blank.dot=FALSE, col.just=NULL, cdot=FALSE, dcolumn=FALSE, matrix.sep=' ', scientific=c(-4,4), math.row.names=FALSE, already.math.row.names=FALSE, math.col.names=FALSE, already.math.col.names=FALSE, double.slash=FALSE, format.Date="%m/%d/%Y", format.POSIXt="%m/%d/%Y %H:%M:%OS", ...)
x |
a matrix (usually numeric) or data frame |
digits |
causes all values in the table to be formatted to |
dec |
If |
rdec |
a vector specifying the number of decimal places to the right for each row
( |
cdec |
a vector specifying the number of decimal places for each column. The vector must have number of items equal to number of columns or components of input x. |
cdot |
Set to |
na.blank |
Set to |
dcolumn |
Set to |
numeric.dollar |
logical, default |
math.row.names |
logical, set true to place dollar signs around the row names. |
already.math.row.names |
set to |
math.col.names |
logical, set true to place dollar signs around the column names. |
already.math.col.names |
set to |
na.dot |
Set to |
blank.dot |
Set to |
col.just |
Input vector |
matrix.sep |
When |
scientific |
specifies ranges of exponents (or a logical vector) specifying values
not to convert to scientific notation. See |
double.slash |
should escaping backslashes themselves be escaped. |
format.Date |
String used to format objects of the Date class. |
format.POSIXt |
String used to format objects of the POSIXt class. |
... |
other arguments are accepted and passed to |
a character matrix with character images of properly rounded x
.
Matrix components of input x
are now just sets of columns of
character matrix.
Object attribute"col.just"
repeats the value of the argument col.just
when provided,
otherwise, it includes the recommended justification for columns of output.
See the discussion of the argument col.just
.
The default justification is ‘l’ for characters and factors,
‘r’ for numeric.
When dcolumn==TRUE
, numerics will have ‘.’ as the justification character.
Frank E. Harrell, Jr.,
Department of Biostatistics,
Vanderbilt University,
[email protected]
Richard M. Heiberger,
Department of Statistics,
Temple University, Philadelphia, PA.
[email protected]
## Not run: x <- data.frame(a=1:2, b=3:4) x$m <- 10000*matrix(5:8,nrow=2) names(x) dim(x) x format.df(x, big.mark=",") dim(format.df(x)) ## End(Not run)
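A short runnable sketch of per-column rounding with cdec; numeric.dollar=FALSE is used here (an assumption for plain-text output) so that values are not wrapped in dollar signs for LaTeX:

d <- data.frame(a=c(1.2345, 10.2), b=c(3.14159, 2))
format.df(d, cdec=c(2, 3), numeric.dollar=FALSE)   # 2 decimals for column a, 3 for column b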
format.pval
is intended for formatting p-values.
format.pval(x, pv=x, digits = max(1, .Options$digits - 2), eps = .Machine$double.eps, na.form = "NA", ...)
format.pval(x, pv=x, digits = max(1, .Options$digits - 2), eps = .Machine$double.eps, na.form = "NA", ...)
pv |
a numeric vector. |
x |
argument for method compliance. |
digits |
how many significant digits are to be used. |
eps |
a numerical tolerance: see Details. |
na.form |
character representation of |
... |
arguments passed to |
format.pval
is mainly an auxiliary function for
print.summary.lm
etc., and does separate formatting for
fixed, floating point and very small values; those less than
eps
are formatted as “‘< [eps]’” (where
“‘[eps]’” stands for format(eps, digits)
).
A character vector.
This is the base format.pval
function with the
ability to pass the nsmall
argument to format
format.pval(c(runif(5), pi^-100, NA)) format.pval(c(0.1, 0.0001, 1e-27)) format.pval(c(0.1, 1e-27), nsmall=3)
gbayes
derives the (Gaussian) posterior and optionally the predictive
distribution when both the prior and the likelihood are Gaussian, and
when the statistic of interest comes from a 2-sample problem.
This function is especially useful in obtaining the expected power of
a statistical test, averaging over the distribution of the population
effect parameter (e.g., log hazard ratio) that is obtained using
pilot data. gbayes
is also useful for summarizing studies for
which the statistic of interest is approximately Gaussian with
known variance. An example is given for comparing two proportions
using the angular transformation, for which the variance is
independent of unknown parameters except for very extreme probabilities.
A plot
method is also given. This plots the prior, posterior, and
predictive distributions on a single graph using a nice default for
the x-axis limits and using the labcurve
function for automatic
labeling of the curves.
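For orientation, the posterior that gbayes computes follows the usual precision-weighted conjugate-normal update; a minimal sketch of that calculation with illustrative values is shown below (the exact equations used are those cited from Spiegelhalter, Freedman, and Parmar, 1994):

mean.prior <- 0;    var.prior <- 1000   # nearly flat prior
stat       <- -0.3; var.stat  <- 0.02   # observed statistic and its known variance
var.post   <- 1 / (1/var.prior + 1/var.stat)
mean.post  <- var.post * (mean.prior/var.prior + stat/var.stat)
c(mean.post, var.post)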
gbayes2
uses the method of Spiegelhalter and Freedman (1986) to compute the
probability of correctly concluding that a new treatment is superior
to a control. By this we mean that a 1-alpha
normal
theory-based confidence interval for the new minus old treatment
effect lies wholly to the right of delta.w
, where delta.w
is the
minimally worthwhile treatment effect (which can be zero to be
consistent with ordinary null hypothesis testing, a method not always
making sense). This kind of power function is averaged over a prior
distribution for the unknown treatment effect. This procedure is
applicable to the situation where a prior distribution is not to be
used in constructing the test statistic or confidence interval, but is
only used for specifying the distribution of delta
, the parameter of
interest.
Even though gbayes2
assumes that the test statistic has a normal distribution with known
variance (which is strongly a function of the sample size in the two
treatment groups), the prior distribution function can be completely
general. Instead of using a step-function for the prior distribution
as Spiegelhalter and Freedman used in their appendix, gbayes2
uses
the built-in integrate
function for numerical integration.
gbayes2
also allows the variance of the test statistic to be general
as long as it is evaluated by the user. The conditional power given the
parameter of interest delta
is 1 - pnorm((delta.w - delta)/sd + z)
, where z
is the normal critical value corresponding to 1 - alpha
/2.
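The conditional power expression just given can be evaluated directly; for example (illustrative values only):

cond.power <- function(delta, sd, delta.w=0, alpha=0.05) {
  z <- qnorm(1 - alpha/2)
  1 - pnorm((delta.w - delta)/sd + z)   # formula stated above
}
cond.power(delta=log(1.25), sd=0.15)    # power to detect a log odds ratio of log(1.25)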
gbayesMixPredNoData
derives the predictive distribution of a
statistic that is Gaussian given delta
when no data have yet been
observed and when the prior is a mixture of two Gaussians.
gbayesMixPost
derives the posterior density, cdf, or posterior
mean of delta
given
the statistic x
, when the prior for delta
is a mixture of two
Gaussians and when x
is Gaussian given delta
.
gbayesMixPowerNP
computes the power for a test for delta
> delta.w
for the case where (1) a Gaussian prior or mixture of two Gaussian priors
is used as the prior distribution, (2) this prior is used in forming
the statistical test or credible interval, (3) no prior is used for
the distribution of delta
for computing power but instead a fixed
single delta
is given (as in traditional frequentist hypothesis
tests), and (4) the test statistic has a Gaussian likelihood with
known variance (and mean equal to the specified delta
).
gbayesMixPowerNP
is handy where you want to use an earlier study in
testing for treatment effects in a new study, but you want to mix with
this prior a non-informative prior. The mixing probability mix
can
be thought of as the "applicability" of the previous study. As with
gbayes2
, power here means the probability that the new study will
yield a left credible interval that is to the right of delta.w
.
gbayes1PowerNP
is a special case of gbayesMixPowerNP
when the
prior is a single Gaussian.
gbayes(mean.prior, var.prior, m1, m2, stat, var.stat, n1, n2, cut.prior, cut.prob.prior=0.025) ## S3 method for class 'gbayes' plot(x, xlim, ylim, name.stat='z', ...) gbayes2(sd, prior, delta.w=0, alpha=0.05, upper=Inf, prior.aux) gbayesMixPredNoData(mix=NA, d0=NA, v0=NA, d1=NA, v1=NA, what=c('density','cdf')) gbayesMixPost(x=NA, v=NA, mix=1, d0=NA, v0=NA, d1=NA, v1=NA, what=c('density','cdf','postmean')) gbayesMixPowerNP(pcdf, delta, v, delta.w=0, mix, interval, nsim=0, alpha=0.05) gbayes1PowerNP(d0, v0, delta, v, delta.w=0, alpha=0.05)
gbayes(mean.prior, var.prior, m1, m2, stat, var.stat, n1, n2, cut.prior, cut.prob.prior=0.025) ## S3 method for class 'gbayes' plot(x, xlim, ylim, name.stat='z', ...) gbayes2(sd, prior, delta.w=0, alpha=0.05, upper=Inf, prior.aux) gbayesMixPredNoData(mix=NA, d0=NA, v0=NA, d1=NA, v1=NA, what=c('density','cdf')) gbayesMixPost(x=NA, v=NA, mix=1, d0=NA, v0=NA, d1=NA, v1=NA, what=c('density','cdf','postmean')) gbayesMixPowerNP(pcdf, delta, v, delta.w=0, mix, interval, nsim=0, alpha=0.05) gbayes1PowerNP(d0, v0, delta, v, delta.w=0, alpha=0.05)
mean.prior |
mean of the prior distribution |
cut.prior , cut.prob.prior , var.prior
|
variance of the prior. Use a large number such as 10000 to effectively
use a flat (noninformative) prior. Sometimes it is useful to compute
the variance so that the prior probability that |
m1 |
sample size in group 1 |
m2 |
sample size in group 2 |
stat |
statistic comparing groups 1 and 2, e.g., log hazard ratio, difference in means, difference in angular transformations of proportions |
var.stat |
variance of |
x |
an object returned by |
sd |
the standard deviation of the treatment effect |
prior |
a function of possibly a vector of unknown treatment effects, returning the prior density at those values |
pcdf |
a function computing the posterior CDF of the treatment effect
|
delta |
a true unknown single treatment effect to detect |
v |
the variance of the statistic |
n1 |
number of future observations in group 1, for obtaining a predictive distribution |
n2 |
number of future observations in group 2 |
xlim |
vector of 2 x-axis limits. Default is the mean of the posterior plus or minus 6 standard deviations of the posterior. |
ylim |
vector of 2 y-axis limits. Default is the range over combined prior and posterior densities. |
name.stat |
label for x-axis. Default is |
... |
optional arguments passed to |
delta.w |
the minimum worthwhile treatment difference to detect. The default is zero for a plain uninteresting null hypothesis. |
alpha |
type I error, or more accurately one minus the confidence level for a two-sided confidence limit for the treatment effect |
upper |
upper limit of integration over the prior distribution multiplied by the normal likelihood for the treatment effect statistic. Default is infinity. |
prior.aux |
argument to pass to |
mix |
mixing probability or weight for the Gaussian prior having mean |
d0 |
mean of the first Gaussian distribution (only Gaussian for
|
v0 |
variance of the first Gaussian (only Gaussian for
|
d1 |
mean of the second Gaussian (if |
v1 |
variance of the second Gaussian (if |
what |
specifies whether the predictive density or the CDF is to be
computed. Default is |
interval |
a 2-vector containing the lower and upper limit for possible values of
the test statistic |
nsim |
defaults to zero, causing |
gbayes
returns a list of class "gbayes"
containing the following
named elements: mean.prior
,var.prior
,mean.post
, var.post
, and
if n1
is specified, mean.pred
and var.pred
. Note that
mean.pred
is identical to mean.post
. gbayes2
returns a single
number which is the probability of correctly rejecting the null
hypothesis in favor of the new treatment. gbayesMixPredNoData
returns a function that can be used to evaluate the predictive density
or cumulative distribution. gbayesMixPost
returns a function that
can be used to evaluate the posterior density or cdf. gbayesMixPowerNP
returns a vector containing two values if nsim
= 0. The first value is the
critical value for the test statistic that will make the left credible
interval > delta.w
, and the second value is the power. If nsim > 0, it returns the power estimate and its confidence limits. The examples show how to use these functions.
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
[email protected]
Spiegelhalter DJ, Freedman LS, Parmar MKB (1994): Bayesian approaches to
randomized trials. JRSS A 157:357–416. Results for gbayes
are derived from
Equations 1, 2, 3, and 6.
Spiegelhalter DJ, Freedman LS (1986): A predictive approach to selecting the size of a clinical trial, based on subjective clinical opinion. Stat in Med 5:1–13.
Joseph, Lawrence and Belisle, Patrick (1997): Bayesian sample size determination for normal means and differences between normal means. The Statistician 46:209–226.
Grouin, JM, Coste M, Bunouf P, Lecoutre B (2007): Bayesian sample size determination in non-sequential clinical trials: Statistical aspects and some regulatory considerations. Stat in Med 26:4914–4924.
# Compare 2 proportions using the var stabilizing transformation # arcsin(sqrt((x+3/8)/(n+3/4))) (Anscombe), which has variance # 1/[4(n+.5)] m1 <- 100; m2 <- 150 deaths1 <- 10; deaths2 <- 30 f <- function(events,n) asin(sqrt((events+3/8)/(n+3/4))) stat <- f(deaths1,m1) - f(deaths2,m2) var.stat <- function(m1, m2) 1/4/(m1+.5) + 1/4/(m2+.5) cat("Test statistic:",format(stat)," s.d.:", format(sqrt(var.stat(m1,m2))), "\n") #Use unbiased prior with variance 1000 (almost flat) b <- gbayes(0, 1000, m1, m2, stat, var.stat, 2*m1, 2*m2) print(b) plot(b) #To get posterior Prob[parameter > w] use # 1-pnorm(w, b$mean.post, sqrt(b$var.post)) #If g(effect, n1, n2) is the power function to #detect an effect of 'effect' with samples size for groups 1 and 2 #of n1,n2, estimate the expected power by getting 1000 random #draws from the posterior distribution, computing power for #each value of the population effect, and averaging the 1000 powers #This code assumes that g will accept vector-valued 'effect' #For the 2-sample proportion problem just addressed, 'effect' #could be taken approximately as the change in the arcsin of #the square root of the probability of the event g <- function(effect, n1, n2, alpha=.05) { sd <- sqrt(var.stat(n1,n2)) z <- qnorm(1 - alpha/2) effect <- abs(effect) 1 - pnorm(z - effect/sd) + pnorm(-z - effect/sd) } effects <- rnorm(1000, b$mean.post, sqrt(b$var.post)) powers <- g(effects, 500, 500) hist(powers, nclass=35, xlab='Power') describe(powers) # gbayes2 examples # First consider a study with a binary response where the # sample size is n1=500 in the new treatment arm and n2=300 # in the control arm. The parameter of interest is the # treated:control log odds ratio, which has variance # 1/[n1 p1 (1-p1)] + 1/[n2 p2 (1-p2)]. This is not # really constant so we average the variance over plausible # values of the probabilities of response p1 and p2. We # think that these are between .4 and .6 and we take a # further short cut v <- function(n1, n2, p1, p2) 1/(n1*p1*(1-p1)) + 1/(n2*p2*(1-p2)) n1 <- 500; n2 <- 300 ps <- seq(.4, .6, length=100) vguess <- quantile(v(n1, n2, ps, ps), .75) vguess # 75% # 0.02183459 # The minimally interesting treatment effect is an odds ratio # of 1.1. The prior distribution on the log odds ratio is # a 50:50 mixture of a vague Gaussian (mean 0, sd 100) and # an informative prior from a previous study (mean 1, sd 1) prior <- function(delta) 0.5*dnorm(delta, 0, 100)+0.5*dnorm(delta, 1, 1) deltas <- seq(-5, 5, length=150) plot(deltas, prior(deltas), type='l') # Now compute the power, averaged over this prior gbayes2(sqrt(vguess), prior, log(1.1)) # [1] 0.6133338 # See how much power is lost by ignoring the previous # study completely gbayes2(sqrt(vguess), function(delta)dnorm(delta, 0, 100), log(1.1)) # [1] 0.4984588 # What happens to the power if we really don't believe the treatment # is very effective? Let's use a prior distribution for the log # odds ratio that is uniform between log(1.2) and log(1.3). # Also check the power against a true null hypothesis prior2 <- function(delta) dunif(delta, log(1.2), log(1.3)) gbayes2(sqrt(vguess), prior2, log(1.1)) # [1] 0.1385113 gbayes2(sqrt(vguess), prior2, 0) # [1] 0.3264065 # Compare this with the power of a two-sample binomial test to # detect an odds ratio of 1.25 bpower(.5, odds.ratio=1.25, n1=500, n2=300) # Power # 0.3307486 # For the original prior, consider a new study with equal # sample sizes n in the two arms. Solve for n to get a # power of 0.9. 
For the variance of the log odds ratio # assume a common p in the center of a range of suspected # probabilities of response, 0.3. For this example we # use a zero null value and the uniform prior above v <- function(n) 2/(n*.3*.7) pow <- function(n) gbayes2(sqrt(v(n)), prior2) uniroot(function(n) pow(n)-0.9, c(50,10000))$root # [1] 2119.675 # Check this value pow(2119.675) # [1] 0.9 # Get the posterior density when there is a mixture of two priors, # with mixing probability 0.5. The first prior is almost # non-informative (normal with mean 0 and variance 10000) and the # second has mean 2 and variance 0.3. The test statistic has a value # of 3 with variance 0.4. f <- gbayesMixPost(3, 4, mix=0.5, d0=0, v0=10000, d1=2, v1=0.3) args(f) # Plot this density delta <- seq(-2, 6, length=150) plot(delta, f(delta), type='l') # Add to the plot the posterior density that used only # the almost non-informative prior lines(delta, f(delta, mix=1), lty=2) # The same but for an observed statistic of zero lines(delta, f(delta, mix=1, x=0), lty=3) # Derive the CDF instead of the density g <- gbayesMixPost(3, 4, mix=0.5, d0=0, v0=10000, d1=2, v1=0.3, what='cdf') # Had mix=0 or 1, gbayes1PowerNP could have been used instead # of gbayesMixPowerNP below # Compute the power to detect an effect of delta=1 if the variance # of the test statistic is 0.2 gbayesMixPowerNP(g, 1, 0.2, interval=c(-10,12)) # Do the same thing by simulation gbayesMixPowerNP(g, 1, 0.2, interval=c(-10,12), nsim=20000) # Compute by what factor the sample size needs to be larger # (the variance needs to be smaller) so that the power is 0.9 ratios <- seq(1, 4, length=50) pow <- single(50) for(i in 1:50) pow[i] <- gbayesMixPowerNP(g, 1, 0.2/ratios[i], interval=c(-10,12))[2] # Solve for ratio using reverse linear interpolation approx(pow, ratios, xout=0.9)$y # Check this by computing power gbayesMixPowerNP(g, 1, 0.2/2.1, interval=c(-10,12)) # So the study will have to be 2.1 times as large as earlier thought
# Compare 2 proportions using the var stabilizing transformation # arcsin(sqrt((x+3/8)/(n+3/4))) (Anscombe), which has variance # 1/[4(n+.5)] m1 <- 100; m2 <- 150 deaths1 <- 10; deaths2 <- 30 f <- function(events,n) asin(sqrt((events+3/8)/(n+3/4))) stat <- f(deaths1,m1) - f(deaths2,m2) var.stat <- function(m1, m2) 1/4/(m1+.5) + 1/4/(m2+.5) cat("Test statistic:",format(stat)," s.d.:", format(sqrt(var.stat(m1,m2))), "\n") #Use unbiased prior with variance 1000 (almost flat) b <- gbayes(0, 1000, m1, m2, stat, var.stat, 2*m1, 2*m2) print(b) plot(b) #To get posterior Prob[parameter > w] use # 1-pnorm(w, b$mean.post, sqrt(b$var.post)) #If g(effect, n1, n2) is the power function to #detect an effect of 'effect' with samples size for groups 1 and 2 #of n1,n2, estimate the expected power by getting 1000 random #draws from the posterior distribution, computing power for #each value of the population effect, and averaging the 1000 powers #This code assumes that g will accept vector-valued 'effect' #For the 2-sample proportion problem just addressed, 'effect' #could be taken approximately as the change in the arcsin of #the square root of the probability of the event g <- function(effect, n1, n2, alpha=.05) { sd <- sqrt(var.stat(n1,n2)) z <- qnorm(1 - alpha/2) effect <- abs(effect) 1 - pnorm(z - effect/sd) + pnorm(-z - effect/sd) } effects <- rnorm(1000, b$mean.post, sqrt(b$var.post)) powers <- g(effects, 500, 500) hist(powers, nclass=35, xlab='Power') describe(powers) # gbayes2 examples # First consider a study with a binary response where the # sample size is n1=500 in the new treatment arm and n2=300 # in the control arm. The parameter of interest is the # treated:control log odds ratio, which has variance # 1/[n1 p1 (1-p1)] + 1/[n2 p2 (1-p2)]. This is not # really constant so we average the variance over plausible # values of the probabilities of response p1 and p2. We # think that these are between .4 and .6 and we take a # further short cut v <- function(n1, n2, p1, p2) 1/(n1*p1*(1-p1)) + 1/(n2*p2*(1-p2)) n1 <- 500; n2 <- 300 ps <- seq(.4, .6, length=100) vguess <- quantile(v(n1, n2, ps, ps), .75) vguess # 75% # 0.02183459 # The minimally interesting treatment effect is an odds ratio # of 1.1. The prior distribution on the log odds ratio is # a 50:50 mixture of a vague Gaussian (mean 0, sd 100) and # an informative prior from a previous study (mean 1, sd 1) prior <- function(delta) 0.5*dnorm(delta, 0, 100)+0.5*dnorm(delta, 1, 1) deltas <- seq(-5, 5, length=150) plot(deltas, prior(deltas), type='l') # Now compute the power, averaged over this prior gbayes2(sqrt(vguess), prior, log(1.1)) # [1] 0.6133338 # See how much power is lost by ignoring the previous # study completely gbayes2(sqrt(vguess), function(delta)dnorm(delta, 0, 100), log(1.1)) # [1] 0.4984588 # What happens to the power if we really don't believe the treatment # is very effective? Let's use a prior distribution for the log # odds ratio that is uniform between log(1.2) and log(1.3). # Also check the power against a true null hypothesis prior2 <- function(delta) dunif(delta, log(1.2), log(1.3)) gbayes2(sqrt(vguess), prior2, log(1.1)) # [1] 0.1385113 gbayes2(sqrt(vguess), prior2, 0) # [1] 0.3264065 # Compare this with the power of a two-sample binomial test to # detect an odds ratio of 1.25 bpower(.5, odds.ratio=1.25, n1=500, n2=300) # Power # 0.3307486 # For the original prior, consider a new study with equal # sample sizes n in the two arms. Solve for n to get a # power of 0.9. 
For the variance of the log odds ratio # assume a common p in the center of a range of suspected # probabilities of response, 0.3. For this example we # use a zero null value and the uniform prior above v <- function(n) 2/(n*.3*.7) pow <- function(n) gbayes2(sqrt(v(n)), prior2) uniroot(function(n) pow(n)-0.9, c(50,10000))$root # [1] 2119.675 # Check this value pow(2119.675) # [1] 0.9 # Get the posterior density when there is a mixture of two priors, # with mixing probability 0.5. The first prior is almost # non-informative (normal with mean 0 and variance 10000) and the # second has mean 2 and variance 0.3. The test statistic has a value # of 3 with variance 0.4. f <- gbayesMixPost(3, 4, mix=0.5, d0=0, v0=10000, d1=2, v1=0.3) args(f) # Plot this density delta <- seq(-2, 6, length=150) plot(delta, f(delta), type='l') # Add to the plot the posterior density that used only # the almost non-informative prior lines(delta, f(delta, mix=1), lty=2) # The same but for an observed statistic of zero lines(delta, f(delta, mix=1, x=0), lty=3) # Derive the CDF instead of the density g <- gbayesMixPost(3, 4, mix=0.5, d0=0, v0=10000, d1=2, v1=0.3, what='cdf') # Had mix=0 or 1, gbayes1PowerNP could have been used instead # of gbayesMixPowerNP below # Compute the power to detect an effect of delta=1 if the variance # of the test statistic is 0.2 gbayesMixPowerNP(g, 1, 0.2, interval=c(-10,12)) # Do the same thing by simulation gbayesMixPowerNP(g, 1, 0.2, interval=c(-10,12), nsim=20000) # Compute by what factor the sample size needs to be larger # (the variance needs to be smaller) so that the power is 0.9 ratios <- seq(1, 4, length=50) pow <- single(50) for(i in 1:50) pow[i] <- gbayesMixPowerNP(g, 1, 0.2/ratios[i], interval=c(-10,12))[2] # Solve for ratio using reverse linear interpolation approx(pow, ratios, xout=0.9)$y # Check this by computing power gbayesMixPowerNP(g, 1, 0.2/2.1, interval=c(-10,12)) # So the study will have to be 2.1 times as large as earlier thought
Simulate Bayesian Sequential Treatment Comparisons Using a Gaussian Model
gbayesSeqSim(est, asserts)
gbayesSeqSim(est, asserts)
est |
data frame created by |
asserts |
list of lists. The first element of each list is the user-specified name for each assertion/prior combination, e.g., |
Simulate a sequential trial under a Gaussian model for parameter estimates, and Gaussian priors, using simulated estimates and variances returned by estSeqSim. For each row of the data frame est and for each prior/assertion combination, computes the posterior probability of the assertion.
a data frame with number of rows equal to that of est, with a number of new columns equal to the number of assertions added. The new columns are named p1, p2, p3, ... (posterior probabilities), mean1, mean2, ... (posterior means), and sd1, sd2, ... (posterior standard deviations). The returned data frame also has an attribute asserts added, which is the original asserts augmented with any derived mu and sigma and converted to a data frame, and another attribute alabels, which is a named vector used to map p1, p2, ... to the user-provided labels in asserts.
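To see how each posterior probability arises, here is a minimal sketch of the standard normal-normal conjugate update implied by a Gaussian model with a Gaussian prior; it is illustrative only and is not the internal code of gbayesSeqSim. The numeric values below are hypothetical, with the prior taken from the 'Efficacy' assertion in the examples.

# Normal-normal conjugate update: estimate est with variance var, prior N(mu, sigma^2)
postNorm <- function(est, var, mu, sigma) {
  v <- 1 / (1 / sigma^2 + 1 / var)     # posterior variance
  m <- v * (mu / sigma^2 + est / var)  # posterior mean
  c(mean = m, sd = sqrt(v))
}
# Hypothetical interim log odds ratio estimate -0.3 with variance 0.05;
# prior mean 0.1, prior SD 1
p <- postNorm(est = -0.3, var = 0.05, mu = 0.1, sigma = 1)
pnorm(0, p['mean'], p['sd'])  # posterior probability that the log odds ratio is < 0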
Frank Harrell
gbayes()
, estSeqSim()
, simMarkovOrd()
, estSeqMarkovOrd()
## Not run: # Simulate Bayesian operating characteristics for an unadjusted # proportional odds comparison (Wilcoxon test) # For 100 simulations, 5 looks, 2 true parameter values, and # 2 assertion/prior combinations, compute the posterior probability # Use a low-level logistic regression call to speed up simuluations # Use data.table to compute various summary measures # Total simulation time: 2s lfit <- function(x, y) { f <- rms::lrm.fit(x, y) k <- length(coef(f)) c(coef(f)[k], vcov(f)[k, k]) } gdat <- function(beta, n1, n2) { # Cell probabilities for a 7-category ordinal outcome for the control group p <- c(2, 1, 2, 7, 8, 38, 42) / 100 # Compute cell probabilities for the treated group p2 <- pomodm(p=p, odds.ratio=exp(beta)) y1 <- sample(1 : 7, n1, p, replace=TRUE) y2 <- sample(1 : 7, n2, p2, replace=TRUE) list(y1=y1, y2=y2) } # Assertion 1: log(OR) < 0 under prior with prior mean 0.1 and sigma 1 on log OR scale # Assertion 2: OR between 0.9 and 1/0.9 with prior mean 0 and sigma computed so that # P(OR > 2) = 0.05 asserts <- list(list('Efficacy', '<', 0, mu=0.1, sigma=1), list('Similarity', 'in', log(c(0.9, 1/0.9)), cutprior=log(2), tailprob=0.05)) set.seed(1) est <- estSeqSim(c(0, log(0.7)), looks=c(50, 75, 95, 100, 200), gendat=gdat, fitter=lfit, nsim=100) z <- gbayesSeqSim(est, asserts) head(z) attr(z, 'asserts') # Compute the proportion of simulations that hit targets (different target posterior # probabilities for efficacy vs. similarity) # For the efficacy assessment compute the first look at which the target # was hit (set to infinity if never hit) require(data.table) z <- data.table(z) u <- z[, .(first=min(p1 > 0.95)), by=.(parameter, sim)] # Compute the proportion of simulations that ever hit the target and # that hit it by the 100th subject u[, .(ever=mean(first < Inf)), by=.(parameter)] u[, .(by75=mean(first <= 100)), by=.(parameter)] ## End(Not run)
## Not run: # Simulate Bayesian operating characteristics for an unadjusted # proportional odds comparison (Wilcoxon test) # For 100 simulations, 5 looks, 2 true parameter values, and # 2 assertion/prior combinations, compute the posterior probability # Use a low-level logistic regression call to speed up simuluations # Use data.table to compute various summary measures # Total simulation time: 2s lfit <- function(x, y) { f <- rms::lrm.fit(x, y) k <- length(coef(f)) c(coef(f)[k], vcov(f)[k, k]) } gdat <- function(beta, n1, n2) { # Cell probabilities for a 7-category ordinal outcome for the control group p <- c(2, 1, 2, 7, 8, 38, 42) / 100 # Compute cell probabilities for the treated group p2 <- pomodm(p=p, odds.ratio=exp(beta)) y1 <- sample(1 : 7, n1, p, replace=TRUE) y2 <- sample(1 : 7, n2, p2, replace=TRUE) list(y1=y1, y2=y2) } # Assertion 1: log(OR) < 0 under prior with prior mean 0.1 and sigma 1 on log OR scale # Assertion 2: OR between 0.9 and 1/0.9 with prior mean 0 and sigma computed so that # P(OR > 2) = 0.05 asserts <- list(list('Efficacy', '<', 0, mu=0.1, sigma=1), list('Similarity', 'in', log(c(0.9, 1/0.9)), cutprior=log(2), tailprob=0.05)) set.seed(1) est <- estSeqSim(c(0, log(0.7)), looks=c(50, 75, 95, 100, 200), gendat=gdat, fitter=lfit, nsim=100) z <- gbayesSeqSim(est, asserts) head(z) attr(z, 'asserts') # Compute the proportion of simulations that hit targets (different target posterior # probabilities for efficacy vs. similarity) # For the efficacy assessment compute the first look at which the target # was hit (set to infinity if never hit) require(data.table) z <- data.table(z) u <- z[, .(first=min(p1 > 0.95)), by=.(parameter, sim)] # Compute the proportion of simulations that ever hit the target and # that hit it by the 100th subject u[, .(ever=mean(first < Inf)), by=.(parameter)] u[, .(by75=mean(first <= 100)), by=.(parameter)] ## End(Not run)
Data from The Analysis of Biological Data by Whitlock and Schluter
getabd(name = "", lowernames = FALSE, allow = "_")
getabd(name = "", lowernames = FALSE, allow = "_")
name |
name of dataset to fetch. Omit to get a data table listing all available datasets. |
lowernames |
set to |
allow |
set to |
Fetches csv files for exercises in the book
data frame with attributes label and url
Frank Harrell
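This page has no examples section, so the following is a minimal usage sketch; the dataset name in the second call is hypothetical and should be replaced by one of the names listed by getabd().

## Not run: 
getabd()                                     # data table listing all available datasets
d <- getabd('somedataset', lowernames=TRUE)  # fetch one dataset (hypothetical name), lower-casing variable names
attr(d, 'label')                             # dataset label
attr(d, 'url')                               # source URL attribute
## End(Not run)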
This function downloads and makes ready to use datasets from the main web site for the Hmisc and rms libraries. For R, the datasets were stored in compressed save format and getHdata makes them available by running load after download. For S-Plus, the datasets were stored in data.dump format and are made available by running data.restore after import. The dataset is run through the cleanup.import function. Calling getHdata with no file argument provides a character vector of names of available datasets that are currently on the web site. For R, R's default browser can optionally be launched to view html files that were already prepared using the Hmisc command html(contents()) or to view ‘.txt’ or ‘.html’ data description files when available.

If options(localHfiles=TRUE) the files are read from the local directory ~/web/data/repo instead of from the web server.
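A minimal sketch of the local-repository option just described, assuming a copy of the data repository is kept under ~/web/data/repo:

## Not run: 
options(localHfiles=TRUE)  # read datasets from ~/web/data/repo instead of the web server
getHdata(prostate)         # loads the local copy and runs cleanup.import on it
## End(Not run)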
getHdata(file, what = c("data", "contents", "description", "all"), where="https://hbiostat.org/data/repo")
getHdata(file, what = c("data", "contents", "description", "all"), where="https://hbiostat.org/data/repo")
file |
an unquoted name of a dataset on the web site, e.g. ‘prostate’.
Omit |
what |
specify |
where |
URL containing the data and metadata files |
getHdata() without a file argument returns a character vector of dataset base names. When a dataset is downloaded, the data frame is placed in search position one and is not returned as the value of getHdata.
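Because the data frame is made available in the search path rather than returned, refer to it by name after the call; a minimal sketch:

## Not run: 
getHdata(support)  # downloads 'support' and places it in search position one
dim(support)       # then use the data frame by name; getHdata itself returns nothing useful
## End(Not run)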
Frank Harrell
download.file, cleanup.import, data.restore, load
## Not run: getHdata() # download list of available datasets getHdata(prostate) # downloads, load( ) or data.restore( ) # runs cleanup.import for S-Plus 6 getHdata(valung, "contents") # open browser (options(browser="whatever")) # after downloading valung.html # (result of html(contents())) getHdata(support, "all") # download and open one browser window datadensity(support) attach(support) # make individual variables available getHdata(plasma, "all") # download and open two browser windows # (description file is available for plasma) ## End(Not run)
## Not run: getHdata() # download list of available datasets getHdata(prostate) # downloads, load( ) or data.restore( ) # runs cleanup.import for S-Plus 6 getHdata(valung, "contents") # open browser (options(browser="whatever")) # after downloading valung.html # (result of html(contents())) getHdata(support, "all") # download and open one browser window datadensity(support) attach(support) # make individual variables available getHdata(plasma, "all") # download and open two browser windows # (description file is available for plasma) ## End(Not run)
The GitHub rscripts project at https://github.com/harrelfe/rscripts contains R scripts that are primarily analysis templates for teaching with RStudio. This function allows the user to print an organized list of available scripts, to download a script and source() it into the current session (the default), to download a script and load it into an RStudio script editor window, to list scripts whose major category contains a given string (ignoring case), or to list all major and minor categories. If options(localHfiles=TRUE) the scripts are read from the local directory ~/R/rscripts instead of from GitHub.
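A minimal sketch of the offline option mentioned above, assuming the rscripts repository has been cloned to ~/R/rscripts:

## Not run: 
options(localHfiles=TRUE)  # read scripts from ~/R/rscripts instead of GitHub
getRs()                    # list the locally available scripts
getRs('introda.r')         # retrieve and source() a script from the local copy
## End(Not run)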
getRs(file=NULL, guser='harrelfe', grepo='rscripts', gdir='raw/master', dir=NULL, browse=c('local', 'browser'), cats=FALSE, put=c('source', 'rstudio'))
getRs(file=NULL, guser='harrelfe', grepo='rscripts', gdir='raw/master', dir=NULL, browse=c('local', 'browser'), cats=FALSE, put=c('source', 'rstudio'))
file |
a character string containing a script file name.
Omit |
guser |
GitHub user name, default is |
grepo |
Github repository name, default is |
gdir |
Github directory under which to find retrievable files |
dir |
directory under |
browse |
When showing the rscripts contents directory, the default is to list in tabular form in the console. Specify |
cats |
Leave at the default ( |
put |
Leave at the default ( |
a data frame or list, depending on arguments
Frank Harrell and Cole Beck
## Not run: getRs() # list available scripts scripts <- getRs() # likewise, but store in an object that can easily # be viewed on demand in RStudio getRs('introda.r') # download introda.r and put in script editor getRs(cats=TRUE) # list available major and minor categories categories <- getRs(cats=TRUE) # likewise but store results in a list for later viewing getRs(cats='reg') # list all scripts in a major category containing 'reg' getRs('importREDCap.r') # source() to define a function # source() a new version of the Hmisc package's cut2 function: getRs('cut2.s', grepo='Hmisc', dir='R') ## End(Not run)
## Not run: getRs() # list available scripts scripts <- getRs() # likewise, but store in an object that can easily # be viewed on demand in RStudio getRs('introda.r') # download introda.r and put in script editor getRs(cats=TRUE) # list available major and minor categories categories <- getRs(cats=TRUE) # likewise but store results in a list for later viewing getRs(cats='reg') # list all scripts in a major category containing 'reg' getRs('importREDCap.r') # source() to define a function # source() a new version of the Hmisc package's cut2 function: getRs