%nin%
is a binary operator that returns a logical vector indicating
whether there is a match for its left operand. A TRUE vector element
indicates no match in the left operand; FALSE indicates a match.
x %nin% table
x 
a vector (numeric, character, factor) 
table 
a vector (numeric, character, factor), matching the mode of x
a vector of logical values with length equal to the length of x.
c('a','b','c') %nin% c('a','b')
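The behavior described above can be sketched in base R with match(); this is an illustrative definition, not necessarily the package's exact source.

```r
# Sketch of a %nin% operator using base R's match();
# TRUE marks elements of x with no match in table
"%nin%" <- function(x, table) match(x, table, nomatch = 0L) == 0L

c('a', 'b', 'c') %nin% c('a', 'b')   # FALSE FALSE TRUE
```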
Computes the mean and median of various absolute errors related to
ordinary multiple regression models. The mean and median absolute
errors correspond to the mean square due to regression, error, and
total. The absolute errors computed are derived from
$\hat{Y} - \mbox{median}(\hat{Y})$, $\hat{Y} - Y$, and
$Y - \mbox{median}(Y)$. The function also computes ratios that
correspond to $R^2$ and $1 - R^2$ (but these ratios do not add to
1.0); the $R^2$ measure is the ratio of mean or median absolute
$\hat{Y} - \mbox{median}(\hat{Y})$ to the mean or median absolute
$Y - \mbox{median}(Y)$. The $1 - R^2$ or SSE/SST measure is the
mean or median absolute $\hat{Y} - Y$ divided by the mean or median
absolute $Y - \mbox{median}(Y)$.
abs.error.pred(fit, lp=NULL, y=NULL)
## S3 method for class 'abs.error.pred'
print(x, ...)
fit 
a fit object typically from 
lp 
a vector of predicted values (Y hat above) if 
y 
a vector of response variable values if 
x 
an object created by 
... 
unused 
a list of class abs.error.pred (used by print.abs.error.pred)
containing two matrices: differences and ratios.
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com
Schemper M (2003): Stat in Med 22:2299-2308.
Tian L, Cai T, Goetghebeur E, Wei LJ (2007): Biometrika 94:297-311.
lm
, ols
, cor
,
validate.ols
set.seed(1) # so can regenerate results
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- exp(x1 + x2 + rnorm(100))
f  <- lm(log(y) ~ x1 + poly(x2, 3), y=TRUE)
abs.error.pred(lp=exp(fitted(f)), y=y)
rm(x1, x2, y, f)
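The quantities defined above can be computed directly in base R. The helper below is only a sketch of those definitions (a hypothetical function, not the package's implementation):

```r
# Sketch: mean absolute errors and R^2-like ratios from the definitions
# above (yhat = predicted values, y = observed values)
abs_err_sketch <- function(yhat, y) {
  reg <- abs(yhat - median(yhat))  # "regression" absolute errors
  err <- abs(yhat - y)             # "error" absolute errors
  tot <- abs(y - median(y))        # "total" absolute errors
  list(mean   = c(regression = mean(reg), error = mean(err),
                  total = mean(tot)),
       ratios = c(R2 = mean(reg) / mean(tot),
                  SSE.SST = mean(err) / mean(tot)))
}

set.seed(1)
x <- rnorm(100); y <- x + rnorm(100)
abs_err_sketch(fitted(lm(y ~ x)), y)
```

As the text notes, the two ratios need not add to 1.0, unlike their sum-of-squares counterparts.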
Add Spike Histograms and Extended Box Plots to ggplot
addggLayers(
g,
data,
type = c("ebp", "spike"),
ylim = layer_scales(g)$y$get_limits(),
by = "variable",
value = "value",
frac = 0.065,
mult = 1,
facet = NULL,
pos = c("bottom", "top"),
showN = TRUE
)
g 
a 
data 
data frame/table containing raw data 
type 
specifies either extended box plot or spike histogram. Both are horizontal, so they show the distribution of the x-axis variable. 
ylim 
y-axis limits to use for scaling the height of the added plots, if you don't want to use the limits that 
by 
the name of a variable in 
value 
name of x-variable 
frac 
fraction of y-axis range to devote to the vertical aspect of the added plot 
mult 
fudge factor for scaling y aspect 
facet 
optional faceting variable 
pos 
position for added plot 
showN 
set to 
For an example see this. Note that it was not possible to just create the layers to be added, as creating these particular layers in isolation resulted in a ggplot error.
the original ggplot
object with more layers added
Frank Harrell
spikecomp()
Given a data frame and the names of variables, doubles the
data frame for each variable with a new category
"All"
by default, or by the value of label
.
A new variable .marginal.
is added to the resulting data frame,
with value ""
if the observation is an original one, and with
value equal to the names of the variables being marginalized (separated
by commas) otherwise. If there is another stratification variable
besides the one in ..., and that variable is nested inside the
variable in ..., specify nested=variable name
to have the value
of that variable set to label
whenever marginal observations are
created for .... See the state-city example below.
addMarginal(data, ..., label = "All", margloc=c('last', 'first'), nested)
data 
a data frame 
... 
a list of names of variables to marginalize 
label 
category name for added marginal observations 
margloc 
location for marginal category within factor variable
specifying categories. Set to 
nested 
a single unquoted variable name if used 
d <- expand.grid(sex=c('female', 'male'), country=c('US', 'Romania'),
                 reps=1:2)
addMarginal(d, sex, country)
# Example of nested variables
d <- data.frame(state=c('AL', 'AL', 'GA', 'GA', 'GA'),
                city=c('Mobile', 'Montgomery', 'Valdosto',
                       'Augusta', 'Atlanta'),
                x=1:5, stringsAsFactors=TRUE)
addMarginal(d, state, nested=city) # city set to 'All' when state is marginalized
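The doubling described above can be sketched in base R for a single variable; the hypothetical helper below illustrates the idea and is not the package's implementation (it ignores factor-level ordering and the margloc and nested options):

```r
# Sketch: stack the original data under a copy in which one variable is
# replaced by the marginal category, tagging rows via .marginal.
add_marginal_sketch <- function(data, var, label = "All") {
  marg <- data
  marg[[var]] <- label       # collapse this variable to the margin
  data$.marginal. <- ""      # original observations
  marg$.marginal. <- var     # marginal observations
  rbind(data, marg)
}

d <- data.frame(sex = c('female', 'male'), x = 1:2,
                stringsAsFactors = FALSE)
add_marginal_sketch(d, 'sex')   # 4 rows; last two have sex == "All"
```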
Tests, without issuing warnings, whether all elements of a character
vector are legal numeric values, or optionally converts the vector to a
numeric vector. Leading and trailing blanks in x
are ignored.
all.is.numeric(x, what = c("test", "vector", "nonnum"), extras=c('.','NA'))
x 
a character vector 
what 
specify 
extras 
a vector of character strings to count as numeric
values, other than 
a logical value if what="test"
or a vector otherwise
Frank Harrell
all.is.numeric(c('1','1.2','3'))
all.is.numeric(c('1','1.2','3a'))
all.is.numeric(c('1','1.2','3'),'vector')
all.is.numeric(c('1','1.2','3a'),'vector')
all.is.numeric(c('1','',' .'),'vector')
all.is.numeric(c('1', '1.2', '3a'), 'nonnum')
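A base-R sketch of the test described above follows; it is a hypothetical helper, not the package source, and the treatment of empty strings as ignorable is an assumption based on the last example above.

```r
# Sketch: test whether all trimmed strings are legal numerics, treating
# entries in `extras` (and, by assumption, empty strings) as allowed
all_numeric_sketch <- function(x, extras = c('.', 'NA')) {
  x <- trimws(x)                       # ignore leading/trailing blanks
  x <- x[!(x %in% c('', extras))]      # drop the permitted non-numbers
  suppressWarnings(!any(is.na(as.numeric(x))))
}

all_numeric_sketch(c('1', '1.2', '3'))    # TRUE
all_numeric_sketch(c('1', '1.2', '3a'))   # FALSE
```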
Works in conjunction with the approx
function to do linear
extrapolation. approx
in R does not support extrapolation at
all, and it is buggy in S-Plus 6.
approxExtrap(x, y, xout, method = "linear", n = 50, rule = 2, f = 0,
ties = "ordered", na.rm = FALSE)
x , y , xout , method , n , rule , f

see 
ties 
applies only to R. See 
na.rm 
set to 
Duplicates in x
(and corresponding y
elements) are removed
before using approx
.
a vector the same length as xout
Frank Harrell
approxExtrap(1:3,1:3,xout=c(0,4))
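The idea can be sketched in base R: interpolate with approx() inside the range of x, then extend the end segments linearly outside it. This hypothetical function ignores the ties, f, and na.rm handling of the real one.

```r
# Sketch of linear extrapolation around approx(): interpolate inside
# the range of x, and extend the first/last segments linearly outside
approx_extrap_sketch <- function(x, y, xout) {
  o <- order(x); x <- x[o]; y <- y[o]
  n   <- length(x)
  out <- approx(x, y, xout = xout, rule = 2)$y  # flat beyond the range
  lo  <- xout < x[1]
  hi  <- xout > x[n]
  out[lo] <- y[1] + (xout[lo] - x[1]) * (y[2] - y[1]) / (x[2] - x[1])
  out[hi] <- y[n] + (xout[hi] - x[n]) * (y[n] - y[n - 1]) / (x[n] - x[n - 1])
  out
}

approx_extrap_sketch(1:3, 1:3, xout = c(0, 4))   # 0 4
```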
Expands continuous variables into restricted cubic spline bases and
categorical variables into dummy variables and fits a multivariate
equation using canonical variates. This finds optimum transformations
that maximize $R^2$
. Optionally, the bootstrap is used to estimate
the covariance matrix of both left- and right-hand-side transformation
parameters, to estimate the bias in the $R^2$
due to overfitting,
and to compute the bootstrap optimism-corrected $R^2$
.
Cross-validation can also be used to get an unbiased estimate of
$R^2$
, but this is not as precise as the bootstrap estimate. The
bootstrap and cross-validation may also be used to get estimates of mean
and median absolute error in predicted values on the original y
scale. These two estimates are perhaps the best ones for gauging the
accuracy of a flexible model, because it is difficult to compare
$R^2$
under different y-transformations, and because $R^2$
allows for an out-of-sample recalibration (i.e., it only measures
relative errors).
Note that uncertainty about the proper transformation of y
causes
an enormous amount of model uncertainty. When the transformation for
y
is estimated from the data, a high variance in predicted values
on the original y
scale may result, especially if the true
transformation is linear. Comparing bootstrap or cross-validated mean
absolute errors with and without restricting the y
transform to be
linear (ytype='l'
) may help the analyst choose the proper model
complexity.
areg(x, y, xtype = NULL, ytype = NULL, nk = 4,
B = 0, na.rm = TRUE, tolerance = NULL, crossval = NULL)
## S3 method for class 'areg'
print(x, digits=4, ...)
## S3 method for class 'areg'
plot(x, whichx = 1:ncol(x$x), ...)
## S3 method for class 'areg'
predict(object, x, type=c('lp','fitted','x'),
what=c('all','sample'), ...)
x 
A single predictor or a matrix of predictors. Categorical
predictors are required to be coded as integers (as 
y 
a 
xtype 
a vector of oneletter character codes specifying how each predictor
is to be modeled, in order of columns of 
ytype 
same coding as for 
nk 
number of knots: 0 for linear, or 3 or more. The default is 4, which fits 3 parameters to continuous variables (one linear term and two nonlinear terms) 
B 
number of bootstrap resamples used to estimate covariance matrices of transformation parameters. Default is no bootstrapping. 
na.rm 
set to 
tolerance 
singularity tolerance. List source code for

crossval 
set to a positive integer k to compute k-fold cross-validated R-squared (the square of the first canonical correlation) and the mean and median absolute error of predictions on the original scale 
digits 
number of digits to use in formatting for printing 
object 
an object created by 
whichx 
integer or character vector specifying which predictors
are to have their transformations plotted (default is all). The

type 
tells 
what 
When the 
... 
arguments passed to the plot function. 
areg
is a competitor of ace
in the acepack
package. Transformations from ace
are seldom smooth enough and
are often overfitted. With areg
the complexity can be controlled
with the nk
parameter, and predicted values are easy to obtain
because parametric functions are fitted.
If one side of the equation has a categorical variable with more than
two categories and the other side has a continuous variable not assumed
to act linearly, larger sample sizes are needed to reliably estimate
transformations, as it is difficult to optimally score categorical
variables to maximize $R^2$
against a simultaneously optimally
transformed continuous variable.
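The canonical-variate idea can be illustrated with base R's cancor(): expand both sides into flexible bases and take the first canonical correlation, whose square plays the role of areg's R^2. A rough sketch under stated assumptions, with poly() standing in for the restricted cubic spline basis:

```r
# Sketch: squared first canonical correlation between basis expansions
# of x and y, analogous to the R^2 that areg maximizes
set.seed(1)
x <- rnorm(200)
y <- x^2 + rnorm(200, 0, 0.3)   # y is a nonlinear function of x

X <- poly(x, 3)   # basis expansion for x (splines in areg; poly here)
Y <- poly(y, 3)   # basis expansion for y
cc <- cancor(X, Y)
cc$cor[1]^2       # high: the bases recover the x^2 relationship
```

An ordinary linear correlation of x with y would be near zero here; the canonical correlation of the expanded bases is not.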
a list of class "areg"
containing many objects
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
Breiman and Friedman, Journal of the American Statistical Association (September, 1985).
set.seed(1)
ns <- c(30,300,3000)
for(n in ns) {
  y <- sample(1:5, n, TRUE)
  x <- abs(y - 3) + runif(n)
  par(mfrow=c(3,4))
  for(k in c(0,3:5)) {
    z <- areg(x, y, ytype='c', nk=k)
    plot(x, z$tx)
    title(paste('R2=', format(z$rsquared)))
    tapply(z$ty, y, range)
    a <- tapply(x, y, mean)
    b <- tapply(z$ty, y, mean)
    plot(a, b)
    abline(lsfit(a, b))
    # Should get same result to within linear transformation if reverse x and y
    w <- areg(y, x, xtype='c', nk=k)
    plot(z$ty, w$tx)
    title(paste('R2=', format(w$rsquared)))
    abline(lsfit(z$ty, w$tx))
  }
}
par(mfrow=c(2,2))
# Example where one category in y differs from others but only in variance of x
n <- 50
y <- sample(1:5, n, TRUE)
x <- rnorm(n)
x[y==1] <- rnorm(sum(y==1), 0, 5)
z <- areg(x, y, xtype='l', ytype='c')
z
plot(z)
z <- areg(x, y, ytype='c')
z
plot(z)
## Not run:
# Examine overfitting when true transformations are linear
par(mfrow=c(4,3))
for(n in c(200,2000)) {
  x <- rnorm(n); y <- rnorm(n) + x
  for(nk in c(0,3,5)) {
    z <- areg(x, y, nk=nk, crossval=10, B=100)
    print(z)
    plot(z)
    title(paste('n=', n))
  }
}
par(mfrow=c(1,1))
# Underfitting when true transformation is quadratic but overfitting
# when y is allowed to be transformed
set.seed(49)
n <- 200
x <- rnorm(n); y <- rnorm(n) + .5*x^2
#areg(x, y, nk=0, crossval=10, B=100)
#areg(x, y, nk=4, ytype='l', crossval=10, B=100)
z <- areg(x, y, nk=4) #, crossval=10, B=100)
z
# Plot x vs. predicted value on original scale. Since y-transform is
# not monotonic, there are multiple y-inverses
xx <- seq(-3.5, 3.5, length=1000)
yhat <- predict(z, xx, type='fitted')
plot(x, y, xlim=c(-3.5,3.5))
for(j in 1:ncol(yhat)) lines(xx, yhat[,j], col=j)
# Plot a random sample of possible y inverses
yhats <- predict(z, xx, type='fitted', what='sample')
points(xx, yhats, pch=2)
## End(Not run)
# True transformation of x1 is quadratic, y is linear
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n); y <- rnorm(n) + x1^2
z <- areg(cbind(x1,x2), y, xtype=c('s','l'), nk=3)
par(mfrow=c(2,2))
plot(z)
# y transformation is inverse quadratic but areg gets the same answer by
# making x1 quadratic
n <- 5000
x1 <- rnorm(n); x2 <- rnorm(n); y <- (x1 + rnorm(n))^2
z <- areg(cbind(x1,x2), y, nk=5)
par(mfrow=c(2,2))
plot(z)
# Overfit 20 predictors when no true relationships exist
n <- 1000
x <- matrix(runif(n*20), n, 20)
y <- rnorm(n)
z <- areg(x, y, nk=5) # add crossval=4 to expose the problem
# Test predict function
n <- 50
x <- rnorm(n)
y <- rnorm(n) + x
g <- sample(1:3, n, TRUE)
z <- areg(cbind(x,g), y, xtype=c('s','c'))
range(predict(z, cbind(x,g)) - z$linear.predictors)
The transcan
function creates flexible additive imputation models
but provides only an approximation to true multiple imputation as the
imputation models are fixed before all multiple imputations are
drawn. This ignores variability caused by having to fit the
imputation models. aregImpute
takes all aspects of uncertainty in
the imputations into account by using the bootstrap to approximate the
process of drawing predicted values from a full Bayesian predictive
distribution. Different bootstrap resamples are used for each of the
multiple imputations, i.e., for the i-th imputation of a
sometimes-missing variable, i=1,2,...,n.impute
, a flexible additive
model is fitted on a sample with replacement from the original data,
and this model is used to predict all of the original missing and
non-missing values for the target variable.
areg
is used to fit the imputation models. By default, linearity
is assumed for target variables (variables being imputed) and
nk=3
knots are assumed for continuous predictors transformed
using restricted cubic splines. If nk
is three or greater and
tlinear
is set to FALSE
, areg
simultaneously finds transformations of the target variable and of all of
the predictors, to get a good fit assuming additivity, maximizing
$R^2$
, using the same canonical correlation method as
transcan
. Flexible transformations may be overridden for
specific variables by specifying the identity transformation for them.
When a categorical variable is being predicted, the flexible
transformation is Fisher's optimum scoring method. Nonlinear transformations for continuous variables may be nonmonotonic. If
nk
is a vector, areg
's bootstrap and crossval=10
options will be used to help find the optimum validating value of
nk
over values of that vector, at the last imputation iteration.
For the imputations, the minimum value of nk
is used.
Instead of defaulting to taking random draws from fitted imputation
models using random residuals as is done by transcan
,
aregImpute
by default uses predictive mean matching with optional
weighted probability sampling of donors rather than using only the
closest match. Predictive mean matching works for binary, categorical,
and continuous variables without the need for iterative maximum
likelihood fitting for binary and categorical variables, and without the
need for computing residuals or for curtailing imputed values to be in
the range of actual data. Predictive mean matching is especially
attractive when the variable being imputed is also being transformed
automatically. Constraints may be placed on variables being imputed
with predictive mean matching, e.g., a missing hospital discharge date
may be required to be imputed from a donor observation whose discharge
date is before the recipient subject's first post-discharge visit date.
See Details below for more information about the
algorithm. A "regression"
method is also available that is
similar to that used in transcan
. This option should be used
when mechanistic missingness requires the use of extrapolation during
imputation.
A print
method summarizes the results, and a plot
method plots
distributions of imputed values. Typically, fit.mult.impute
will
be called after aregImpute
.
If a target variable is transformed nonlinearly (i.e., if nk
is
greater than zero and tlinear
is set to FALSE
) and the
estimated target variable transformation is nonmonotonic, imputed
values are not unique. When type='regression'
, a random choice
of possible inverse values is made.
The reformM
function provides two ways of recreating a formula to
give to aregImpute
by reordering the variables in the formula.
This is a modified version of a function written by Yong Hao Pua. One
can specify nperm
to obtain a list of nperm
randomly
permuted variables. The list is converted to a single ordinary formula
if nperm=1
. If nperm
is omitted, variables are sorted in
descending order of the number of NA
s. reformM
also
prints a recommended number of multiple imputations to use, which is a
minimum of 5 and the percent of incomplete observations.
aregImpute(formula, data, subset, n.impute=5, group=NULL,
nk=3, tlinear=TRUE, type=c('pmm','regression','normpmm'),
pmmtype=1, match=c('weighted','closest','kclosest'),
kclosest=3, fweighted=0.2,
curtail=TRUE, constraint=NULL,
boot.method=c('simple', 'approximate bayesian'),
burnin=3, x=FALSE, pr=TRUE, plotTrans=FALSE, tolerance=NULL, B=75)
## S3 method for class 'aregImpute'
print(x, digits=3, ...)
## S3 method for class 'aregImpute'
plot(x, nclass=NULL, type=c('ecdf','hist'),
datadensity=c("hist", "none", "rug", "density"),
diagnostics=FALSE, maxn=10, ...)
reformM(formula, data, nperm)
formula 
an S model formula. You can specify restrictions for transformations
of variables. The function automatically determines which variables
are categorical (i.e., 
x 
an object created by 
data 
input raw data 
subset 
These may be also be specified. You may not specify 
n.impute 
number of multiple imputations. 
group 
a character or factor variable the same length as the
number of observations in 
nk 
number of knots to use for continuous variables. When both
the target variable and the predictors are having optimum
transformations estimated, there is more instability than with normal
regression so the complexity of the model should decrease more sharply
as the sample size decreases. Hence set 
tlinear 
set to 
type 
The default is 
pmmtype 
type of matching to be used for predictive mean
matching when 
match 
Defaults to 
kclosest 
see 
fweighted 
Smoothing parameter (multiple of mean absolute difference) used when

curtail 
applies if 
constraint 
for predictive mean matching 
boot.method 
By default, simple bootstrapping is used in which the
target variable is predicted using a sample with replacement from the
observations with nonmissing target variable. Specify

burnin 

pr 
set to 
plotTrans 
set to 
tolerance 
singularity criterion; list the source code in the

B 
number of bootstrap resamples to use if 
digits 
number of digits for printing 
nclass 
number of bins to use in drawing histogram 
datadensity 
see 
diagnostics 
Specify 
maxn 
Maximum number of observations shown for diagnostics. Default is

nperm 
number of random formula permutations for 
... 
other arguments that are ignored 
The sequence of steps used by the aregImpute
algorithm is the
following.
(1) For each variable containing m NA
s where m > 0, initialize the
NA
s to values from a random sample (without replacement if
a sufficient number of non-missing values exist) of size m from the
non-missing values.
(2) For burnin+n.impute
iterations do the following steps. The
first burnin
iterations provide a burn-in, and imputations are
saved only from the last n.impute
iterations.
(3) For each variable containing any NA
s, draw a sample with
replacement from the observations in the entire dataset in which the
current variable being imputed is nonmissing. Fit a flexible
additive model to predict this target variable while finding the
optimum transformation of it (unless the identity
transformation is forced). Use this fitted flexible model to
predict the target variable in all of the original observations.
Impute each missing value of the target variable with the observed
value whose predicted transformed value is closest to the predicted
transformed value of the missing value (if match="closest"
and
type="pmm"
),
or use a draw from a multinomial distribution with probabilities derived
from distance weights, if match="weighted"
(the default).
(4) After these imputations are computed, use these random draw
imputations the next time the current target variable is used as a
predictor of other sometimes-missing variables.
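Steps (3) and (4) can be sketched for one sometimes-missing variable. The toy function below is a hypothetical helper (a linear model stands in for the flexible additive fit) showing the bootstrap-then-match idea with match='closest':

```r
# Toy predictive mean matching: fit on a bootstrap sample of complete
# cases, predict every row, then donate the observed y whose prediction
# is closest to each missing observation's prediction
pmm_sketch <- function(x, y) {
  donors <- which(!is.na(y))
  boot   <- sample(donors, replace = TRUE)   # bootstrap resample of donors
  fit    <- lm(y[boot] ~ x[boot])
  pred   <- cbind(1, x) %*% coef(fit)        # predict all observations
  vapply(which(is.na(y)), function(i)
    y[donors[which.min(abs(pred[donors] - pred[i]))]], numeric(1))
}

set.seed(7)
x <- runif(100); y <- 2 * x + rnorm(100, 0, .1)
y[1:5] <- NA
pmm_sketch(x, y)   # five values, each drawn from the observed y's
```

Note that every imputed value is an actually observed value, which is why predictive mean matching never extrapolates outside the range of the data.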
When match="closest"
, predictive mean matching does not work well
when fewer than 3 variables are used to predict the target variable,
because many of the multiple imputations for an observation will be
identical. In the extreme case of one right-hand-side variable and
assuming that only monotonic transformations of left- and right-side
variables are allowed, every bootstrap resample will give predicted
values of the target variable that are monotonically related to
predicted values from every other bootstrap resample. The same is true
for Bayesian predicted values. This causes predictive mean matching to
always match on the same donor observation.
When the missingness mechanism for a variable is so systematic that the
distribution of observed values is truncated, predictive mean matching
does not work. It will only yield imputed values that are near observed
values, so intervals in which no values are observed will not be
populated by imputed values. For this case, the only hope is to make
regression assumptions and use extrapolation. With
type="regression"
, aregImpute
will use linear
extrapolation to obtain a (hopefully) reasonable distribution of imputed
values. The "regression"
option causes aregImpute
to
impute missing values by adding a random sample of residuals (with
replacement if there are more NA
s than measured values) on the
transformed scale of the target variable. After random residuals are
added, predicted random draws are obtained on the original untransformed
scale using reverse linear interpolation on the table of original and
transformed target values (linear extrapolation when a random residual
is large enough to put the random draw prediction outside the range of
observed values). The bootstrap is used as with type="pmm"
to
factor in the uncertainty of the imputation model.
As model uncertainty is high when the transformation of a target
variable is unknown, tlinear
defaults to TRUE
to limit the
variance in predicted values when nk
is positive.
a list of class "aregImpute"
containing the following elements:
call 
the function call expression 
formula 
the formula specified to 
match 
the 
fweighted 
the 
n 
total number of observations in input dataset 
p 
number of variables 
na 
list of subscripts of observations for which values were originally missing 
nna 
named vector containing the numbers of missing values in the data 
type 
vector of types of transformations used for each variable
( 
tlinear 
value of 
nk 
number of knots used for smooth transformations 
cat.levels 
list containing character vectors specifying the 
df 
degrees of freedom (number of parameters estimated) for each variable 
n.impute 
number of multiple imputations per missing value 
imputed 
a list containing matrices of imputed values in the same format as
those created by 
x 
if 
rsq 
for the last round of imputations, a vector containing the Rsquares
with which each sometimesmissing variable could be predicted from the
others by 
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
van Buuren, Stef. Flexible Imputation of Missing Data. Chapman & Hall/CRC, Boca Raton FL, 2012.
Little R, An H. Robust likelihood-based analysis of multivariate data with missing values. Statistica Sinica 14:949-968, 2004.
van Buuren S, Brand JPL, Groothuis-Oudshoorn CGM, Rubin DB. Fully conditional specifications in multivariate imputation. J Stat Comp Sim 72:1049-1064, 2006.
de Groot JAH, Janssen KJM, Zwinderman AH, Moons KGM, Reitsma JB. Multiple imputation to correct for partial verification bias revisited. Stat Med 27:5880-5889, 2008.
Siddique J. Multiple imputation using an iterative hot-deck with distance-based donor selection. Stat Med 27:83-102, 2008.
White IR, Royston P, Wood AM. Multiple imputation using chained equations: Issues and guidance for practice. Stat Med 30:377-399, 2011.
Curnow E, Carpenter JR, Heron JE, et al. Multiple imputation of missing data under missing at random: compatible imputation models are not sufficient to avoid bias if they are misspecified. J Clin Epi, June 9, 2023. DOI:10.1016/j.jclinepi.2023.06.011.
fit.mult.impute
, transcan
, areg
, naclus
, naplot
, mice
,
dotchart3
, Ecdf
, completer
# Check that aregImpute can almost exactly estimate missing values when
# there is a perfect nonlinear relationship between two variables
# Fit restricted cubic splines with 4 knots for x1 and x2, linear for x3
set.seed(3)
x1 <- rnorm(200)
x2 <- x1^2
x3 <- runif(200)
m <- 30
x2[1:m] <- NA
a <- aregImpute(~x1+x2+I(x3), n.impute=5, nk=4, match='closest')
a
matplot(x1[1:m]^2, a$imputed$x2)
abline(a=0, b=1, lty=2)
x1[1:m]^2
a$imputed$x2
# Multiple imputation and estimation of variances and covariances of
# regression coefficient estimates accounting for imputation
# Example 1: large sample size, much missing data, no overlap in
# NAs across variables
x1 <- factor(sample(c('a','b','c'), 1000, TRUE))
x2 <- (x1=='b') + 3*(x1=='c') + rnorm(1000, 0, 2)
x3 <- rnorm(1000)
y  <- x2 + 1*(x1=='c') + .2*x3 + rnorm(1000, 0, 2)
orig.x1 <- x1[1:250]
orig.x2 <- x2[251:350]
x1[1:250]   <- NA
x2[251:350] <- NA
d <- data.frame(x1, x2, x3, y, stringsAsFactors=TRUE)
# Find value of nk that yields best validating imputation models
# tlinear=FALSE means to not force the target variable to be linear
f <- aregImpute(~y + x1 + x2 + x3, nk=c(0,3:5), tlinear=FALSE,
                data=d, B=10) # normally B=75
f
# Try forcing target variable (x1, then x2) to be linear while allowing
# predictors to be nonlinear (could also say tlinear=TRUE)
f <- aregImpute(~y + x1 + x2 + x3, nk=c(0,3:5), data=d, B=10)
f
## Not run:
# Use 100 imputations to better check against individual true values
f <- aregImpute(~y + x1 + x2 + x3, n.impute=100, data=d)
f
par(mfrow=c(2,1))
plot(f)
modecat <- function(u) {
  tab <- table(u)
  as.numeric(names(tab)[tab==max(tab)][1])
}
table(orig.x1, apply(f$imputed$x1, 1, modecat))
par(mfrow=c(1,1))
plot(orig.x2, apply(f$imputed$x2, 1, mean))
fmi <- fit.mult.impute(y ~ x1 + x2 + x3, lm, f,
                       data=d)
sqrt(diag(vcov(fmi)))
fcc <- lm(y ~ x1 + x2 + x3)
summary(fcc) # SEs are larger than from mult. imputation
## End(Not run)
## Not run:
# Example 2: Very discriminating imputation models,
# x1 and x2 have some NAs on the same rows, smaller n
set.seed(5)
x1 <- factor(sample(c('a','b','c'), 100, TRUE))
x2 <- (x1=='b') + 3*(x1=='c') + rnorm(100, 0, .4)
x3 <- rnorm(100)
y  <- x2 + 1*(x1=='c') + .2*x3 + rnorm(100, 0, .4)
orig.x1 <- x1[1:20]
orig.x2 <- x2[18:23]
x1[1:20]  <- NA
x2[18:23] <- NA
#x2[21:25] <- NA
d <- data.frame(x1, x2, x3, y, stringsAsFactors=TRUE)
n <- naclus(d)
plot(n); naplot(n) # Show patterns of NAs
# 100 imputations to study them; normally use 5 or 10
f <- aregImpute(~y + x1 + x2 + x3, n.impute=100, nk=0, data=d)
par(mfrow=c(2,3))
plot(f, diagnostics=TRUE, maxn=2)
# Note: diagnostics=TRUE makes graphs similar to those made by:
# r <- range(f$imputed$x2, orig.x2)
# for(i in 1:6) { # use 1:2 to mimic maxn=2
#   plot(1:100, f$imputed$x2[i,], ylim=r,
#        ylab=paste("Imputations for Obs.", i))
#   abline(h=orig.x2[i], lty=2)
# }
table(orig.x1, apply(f$imputed$x1, 1, modecat))
par(mfrow=c(1,1))
plot(orig.x2, apply(f$imputed$x2, 1, mean))
fmi <- fit.mult.impute(y ~ x1 + x2, lm, f,
                       data=d)
sqrt(diag(vcov(fmi)))
fcc <- lm(y ~ x1 + x2)
summary(fcc) # SEs are larger than from mult. imputation
## End(Not run)
## Not run:
# Study relationship between smoothing parameter for weighting function
# (multiplier of mean absolute distance of transformed predicted
# values, used in tricube weighting function) and standard deviation
# of multiple imputations. SDs are computed from average variances
# across subjects. match="closest" same as match="weighted" with
# small value of fweighted.
# This example also shows problems with predicted mean
# matching almost always giving the same imputed values when there is
# only one predictor (regression coefficients change over multiple
# imputations but predicted values are virtually 1-1 functions of each
# other)
set.seed(23)
x <- runif(200)
y <- x + runif(200, -.05, .05)
r <- resid(lsfit(x, y))
rmse <- sqrt(sum(r^2)/(200 - 2)) # sqrt of residual MSE
y[1:20] <- NA
d <- data.frame(x, y)
f <- aregImpute(~ x + y, n.impute=10, match='closest', data=d)
# As an aside here is how to create a completed dataset for imputation
# number 3 as fit.mult.impute would do automatically. In this degenerate
# case changing 3 to 12,410 will not alter the results.
imputed <- impute.transcan(f, imputation=3, data=d, list.out=TRUE,
                           pr=FALSE, check=FALSE)
sd <- sqrt(mean(apply(f$imputed$y, 1, var)))
ss <- c(0, .01, .02, seq(.05, 1, length=20))
sds <- ss; sds[1] <- sd
for(i in 2:length(ss)) {
  f <- aregImpute(~ x + y, n.impute=10, fweighted=ss[i])
  sds[i] <- sqrt(mean(apply(f$imputed$y, 1, var)))
}
plot(ss, sds, xlab='Smoothing Parameter', ylab='SD of Imputed Values',
     type='b')
abline(v=.2, lty=2)   # default value of fweighted
abline(h=rmse, lty=2) # root MSE of residuals from linear regression
## End(Not run)
## Not run:
# Do a similar experiment for the Titanic dataset
getHdata(titanic3)
h <- lm(age ~ sex + pclass + survived, data=titanic3)
rmse <- summary(h)$sigma
set.seed(21)
f <- aregImpute(~ age + sex + pclass + survived, n.impute=10,
                data=titanic3, match='closest')
sd <- sqrt(mean(apply(f$imputed$age, 1, var)))
ss <- c(0, .01, .02, seq(.05, 1, length=20))
sds <- ss; sds[1] <- sd
for(i in 2:length(ss)) {
  f <- aregImpute(~ age + sex + pclass + survived, data=titanic3,
                  n.impute=10, fweighted=ss[i])
  sds[i] <- sqrt(mean(apply(f$imputed$age, 1, var)))
}
plot(ss, sds, xlab='Smoothing Parameter', ylab='SD of Imputed Values',
     type='b')
abline(v=.2, lty=2)   # default value of fweighted
abline(h=rmse, lty=2) # root MSE of residuals from linear regression
## End(Not run)
set.seed(2)
d <- data.frame(x1=runif(50), x2=c(rep(NA, 10), runif(40)),
                x3=c(runif(4), rep(NA, 11), runif(35)))
reformM(~ x1 + x2 + x3, data=d)
reformM(~ x1 + x2 + x3, data=d, nperm=2)
# Give result or one of the results as the first argument to aregImpute
# Constrain imputed values for two variables
# Require imputed values for x2 to be above 0.2
# Assume x1 is never missing and require imputed values for
# x3 to be less than the recipient's value of x1
a <- aregImpute(~ x1 + x2 + x3, data=d,
                constraint=list(x2 = expression(d$x2 > 0.2),
                                x3 = expression(d$x3 < r$x1)))
a
a
Produces 1 - alpha confidence intervals for binomial probabilities.
binconf(x, n, alpha=0.05,
method=c("wilson","exact","asymptotic","all"),
include.x=FALSE, include.n=FALSE, return.df=FALSE)
x 
vector containing the number of "successes" for binomial variates 
n 
vector containing the numbers of corresponding observations 
alpha 
probability of a type I error, so the confidence coefficient = 1 - alpha 
method 
character string specifying which method to use. The "all" method only works when x and n are length 1. The "exact" method uses the F distribution to compute exact (based on the binomial cdf) intervals; the "wilson" interval is score-test-based; and the "asymptotic" is the textbook, asymptotic normal interval. Following Agresti and Coull, the Wilson interval is to be preferred and so is the default. 
include.x 
logical flag to indicate whether 
include.n 
logical flag to indicate whether 
return.df 
logical flag to indicate that a data frame rather than a matrix be returned 
a matrix or data.frame containing the computed intervals and,
optionally, x
and n
.
Rollin Brant, Modified by Frank Harrell and
Brad Biggerstaff
Centers for Disease Control and Prevention
National Center for Infectious Diseases
Division of Vector-Borne Infectious Diseases
P.O. Box 2087, Fort Collins, CO 80522-2087, USA
bkb5@cdc.gov
A. Agresti and B.A. Coull, Approximate is better than "exact" for interval estimation of binomial proportions, American Statistician, 52:119–126, 1998.
R.G. Newcombe, Logit confidence intervals and the inverse sinh transformation, American Statistician, 55:200–202, 2001.
L.D. Brown, T.T. Cai and A. DasGupta, Interval estimation for a binomial proportion (with discussion), Statistical Science, 16:101–133, 2001.
binconf(0:10,10,include.x=TRUE,include.n=TRUE)
binconf(46,50,method="all")
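The default Wilson (score) interval can be written down directly; the function below is only a sketch of the standard formula, not Hmisc's binconf:

```r
# Wilson score interval for a binomial proportion x/n
wilson_ci <- function(x, n, alpha = 0.05) {
  z    <- qnorm(1 - alpha/2)
  p    <- x / n
  mid  <- (p + z^2 / (2 * n)) / (1 + z^2 / n)          # shrunken center
  half <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  c(PointEst = p, Lower = mid - half, Upper = mid + half)
}

wilson_ci(46, 50)
```

Unlike the asymptotic interval, this one stays inside [0, 1] and behaves sensibly when x is 0 or n.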
biVar
is a generic function that accepts a formula and usual
data
, subset
, and na.action
parameters plus a
list statinfo
that specifies a function of two variables to
compute along with information about labeling results for printing and
plotting. The function is called separately with each right hand side
variable and the same left hand variable. The result is a matrix of
bivariate statistics and the statinfo
list that drives printing
and plotting. The plot method draws a dot plot with x-axis values by
default sorted in order of one of the statistics computed by the function.
spearman2
computes the square of Spearman's rho rank correlation
and a generalization of it in which x
can relate
nonmonotonically to y
. This is done by computing the Spearman
multiple rho-squared between (rank(x), rank(x)^2)
and y
.
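The generalization just described can be sketched in base R alone: regress the ranks of y on rank(x) and rank(x)^2 and take the multiple R^2. The helper below is hypothetical and ignores the tie handling and adjustments that spearman2 performs:

```r
# Spearman rho^2 generalized to nonmonotonic relationships:
# multiple R^2 of rank(y) on (rank(x), rank(x)^2).  A sketch only;
# spearman2 also handles ties, adjusted rho^2, and F tests.
rho2.gen <- function(x, y) {
  rx <- rank(x)
  summary(lm(rank(y) ~ rx + I(rx^2)))$r.squared
}
x <- seq(-3, 3, length=50)
y <- x^2                         # nonmonotonic in x
rho2.gen(x, y)                   # high, close to 1
cor(x, y, method='spearman')^2   # ordinary rho^2 is near 0 here
```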
When x
is categorical, a different kind of Spearman correlation
used in the Kruskal-Wallis test is computed (and spearman2
can do
the Kruskal-Wallis test). This is done by computing the ordinary
multiple R^2
between k-1
dummy variables and
rank(y)
, where x
has k
categories. x
can
also be a formula, in which case each predictor is correlated separately
with y
, using nonmissing observations for that predictor.
biVar
is used to do the looping and bookkeeping. By default the
plot shows the adjusted rho^2
, using the same formula used for
the ordinary adjusted R^2
. The F
test uses the unadjusted
R^2.
spearman
computes Spearman's rho on nonmissing values of two
variables. spearman.test
is a simple version of
spearman2.default
.
chiSquare
is set up like spearman2
except it is intended
for a categorical response variable. Separate Pearson chisquare tests
are done for each predictor, with optional collapsing of infrequent
categories. Numeric predictors having more than g
levels are
categorized into g
quantile groups. chiSquare
uses
biVar
.
biVar(formula, statinfo, data=NULL, subset=NULL,
na.action=na.retain, exclude.imputed=TRUE, ...)
## S3 method for class 'biVar'
print(x, ...)
## S3 method for class 'biVar'
plot(x, what=info$defaultwhat,
sort.=TRUE, main, xlab,
vnames=c('names','labels'), ...)
spearman2(x, ...)
## Default S3 method:
spearman2(x, y, p=1, minlev=0, na.rm=TRUE, exclude.imputed=na.rm, ...)
## S3 method for class 'formula'
spearman2(formula, data=NULL,
subset, na.action=na.retain, exclude.imputed=TRUE, ...)
spearman(x, y)
spearman.test(x, y, p=1)
chiSquare(formula, data=NULL, subset=NULL, na.action=na.retain,
exclude.imputed=TRUE, ...)
formula 
a formula with a single left side variable 
statinfo 
see 
data , subset , na.action

the usual options for models. Default for 
exclude.imputed 
set to 
... 
other arguments that are passed to the function used to
compute the bivariate statistics or to 
na.rm 
logical; delete NA values? 
x 
a numeric matrix with at least 5 rows and at least 2 columns (if

y 
a numeric vector 
p 
for numeric variables, specifies the order of the Spearman 
minlev 
minimum relative frequency that a level of a categorical predictor
should have before it is pooled with other categories (see

what 
specifies which statistic to plot. Possibilities include the column names that appear when the print method is used. 
sort. 
set 
main 
main title for plot. Default title shows the name of the response variable. 
xlab 
xaxis label. Default constructed from 
vnames 
set to 
Uses midranks in case of ties, as described by Hollander and Wolfe.
P-values for Spearman, Wilcoxon, or Kruskal-Wallis tests are
approximated by using the t
or F
distributions.
spearman2.default
(the
function that is called for a single x
, i.e., when there is no
formula) returns a vector of statistics for the variable.
biVar
, spearman2.formula
, and chiSquare
return a
matrix with rows corresponding to predictors.
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
Hollander M. and Wolfe D.A. (1973). Nonparametric Statistical Methods. New York: Wiley.
Press WH, Flannery BP, Teukolsky SA, Vetterling WT (1988): Numerical Recipes in C. Cambridge: Cambridge University Press.
combine.levels
,
varclus
, dotchart3
, impute
,
chisq.test
, cut2
.
x <- c(-2, -1, 0, 1, 2)
y <- c(4, 1, 0, 1, 4)
z <- c(1, 2, 3, 4, NA)
v <- c(1, 2, 3, 4, 5)
spearman2(x, y)
plot(spearman2(z ~ x + y + v, p=2))
f <- chiSquare(z ~ x + y + v)
f
Bootstraps Kaplan-Meier estimate of the probability of survival to at
least a fixed time (times
variable) or the estimate of the q
quantile of the survival distribution (e.g., median survival time, the
default).
bootkm(S, q=0.5, B=500, times, pr=TRUE)
S 
a 
q 
quantile of survival time, default is 0.5 for median 
B 
number of bootstrap repetitions (default=500) 
times 
time vector (currently only a scalar is allowed) at which to compute
survival estimates. You may specify only one of 
pr 
set to 
bootkm
uses Therneau's survfitKM
function to efficiently
compute Kaplan-Meier estimates.
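With no censoring the idea reduces to an ordinary bootstrap of the sample median; the minimal base-R sketch below shows that core idea (bootkm generalizes it to censored data via Kaplan-Meier estimation):

```r
# Bootstrap the median of uncensored survival times (base-R sketch;
# bootkm handles censoring by recomputing the Kaplan-Meier estimate
# within each bootstrap sample)
set.seed(1)
times <- rexp(200)
B     <- 100                        # normally 500
boots <- replicate(B, median(sample(times, replace=TRUE)))
quantile(boots, c(.025, .975))      # nonparametric 0.95 CI for the median
```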
a vector containing B
bootstrap estimates
updates .Random.seed
, and, if pr=TRUE
, prints progress
of simulations
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com
Akritas MG (1986): Bootstrapping the Kaplan-Meier estimator. JASA 81:1032–1038.
survfit
, Surv
,
Survival.cph
, Quantile.cph
# Compute 0.95 nonparametric confidence interval for the difference in
# median survival time between females and males (two-sample problem)
set.seed(1)
library(survival)
S <- Surv(runif(200)) # no censoring
sex <- c(rep('female',100),rep('male',100))
med.female <- bootkm(S[sex=='female',], B=100) # normally B=500
med.male <- bootkm(S[sex=='male',], B=100)
describe(med.female - med.male)
quantile(med.female - med.male, c(.025,.975), na.rm=TRUE)
# na.rm needed because some bootstrap estimates of median survival
# time may be missing when a bootstrap sample did not include the
# longer survival times
Uses method of Fleiss, Tytun, and Ury (but without the continuity
correction) to estimate the power (or the sample size to achieve a given
power) of a two-sided test for the difference in two proportions. The two
sample sizes are allowed to be unequal, but for bsamsize
you must specify
the fraction of observations in group 1. For power calculations, one
probability (p1
) must be given, and either the other probability (p2
),
an odds.ratio
, or a percent.reduction
must be given. For bpower
or
bsamsize
, any or all of the arguments may be vectors, in which case they
return a vector of powers or sample sizes. All vector arguments must have
the same length.
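As a cross-check on calculations like these, a textbook normal-approximation power formula (without continuity correction) takes only a few lines of base R. This is a generic approximation for illustration, not the Fleiss-Tytun-Ury formula that bpower implements:

```r
# Simple normal-approximation power for a two-sided test of two
# proportions, no continuity correction (a sketch, not bpower's formula)
approx.power <- function(p1, p2, n1, n2, alpha=0.05) {
  pbar <- (n1*p1 + n2*p2) / (n1 + n2)
  se0  <- sqrt(pbar*(1-pbar)*(1/n1 + 1/n2))   # SE of difference under H0
  se1  <- sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)   # SE of difference under H1
  z    <- qnorm(1 - alpha/2)
  pnorm((abs(p1 - p2) - z*se0) / se1)
}
approx.power(.1, .05, 500, 500)
```

Results track bpower closely for moderate-to-large samples; the two diverge mainly for small n or extreme probabilities.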
Given p1, p2
, ballocation
uses the method of Brittain and Schlesselman
to compute the optimal fraction of observations to be placed in group 1
that either (1) minimize the variance of the difference in two proportions,
(2) minimize the variance of the ratio of the two proportions,
(3) minimize the variance of the log odds ratio, or
(4) maximize the power of the 2-tailed test for differences. For (4)
the total sample size must be given, or the fraction optimizing
the power is not returned. The fraction for (3) is one minus the fraction
for (1).
bpower.sim
estimates power by simulations, in minimal time. By using
bpower.sim
you can see that the formulas without any continuity correction
are quite accurate, and that the power of a continuity-corrected test
is significantly lower. That's why no continuity corrections are implemented
here.
bpower(p1, p2, odds.ratio, percent.reduction,
n, n1, n2, alpha=0.05)
bsamsize(p1, p2, fraction=.5, alpha=.05, power=.8)
ballocation(p1, p2, n, alpha=.05)
bpower.sim(p1, p2, odds.ratio, percent.reduction,
n, n1, n2,
alpha=0.05, nsim=10000)
p1 
population probability in group 1 
p2 
probability for group 2 
odds.ratio 
odds ratio to detect 
percent.reduction 
percent reduction in risk to detect 
n 
total sample size over the two groups. If you omit this for

n1 
sample size in group 1 
n2 
sample size in group 2. 
alpha 
type I assertion probability 
fraction 
fraction of observations in group 1 
power 
the desired probability of detecting a difference 
nsim 
number of simulations of binomial responses 
For bpower.sim
, all arguments must be of length one.
for bpower
, the power estimate; for bsamsize
, a vector containing
the sample sizes in the two groups; for ballocation
, a vector with
4 fractions of observations allocated to group 1, optimizing the four
criteria mentioned above. For bpower.sim
, a vector with three
elements is returned, corresponding to the simulated power and its
lower and upper 0.95 confidence limits.
Frank Harrell
Department of Biostatistics
Vanderbilt University
Fleiss JL, Tytun A, Ury HK (1980): A simple approximation for calculating sample sizes for comparing independent proportions. Biometrics 36:343–6.
Brittain E, Schlesselman JJ (1982): Optimal allocation for the comparison of proportions. Biometrics 38:1003–9.
Gordon I, Watson R (1996): The myth of continuity-corrected sample size formulae. Biometrics 52:71–6.
samplesize.bin
, chisq.test
, binconf
bpower(.1, odds.ratio=.9, n=1000, alpha=c(.01,.05))
bpower.sim(.1, odds.ratio=.9, n=1000)
bsamsize(.1, .05, power=.95)
ballocation(.1, .5, n=100)
# Plot power vs. n for various odds ratios (base prob.=.1)
n <- seq(10, 1000, by=10)
OR <- seq(.2,.9,by=.1)
plot(0, 0, xlim=range(n), ylim=c(0,1), xlab="n", ylab="Power", type="n")
for(or in OR) {
lines(n, bpower(.1, odds.ratio=or, n=n))
text(350, bpower(.1, odds.ratio=or, n=350)-.02, format(or))
}
# Another way to plot the same curves, but letting labcurve do the
# work, including labeling each curve at points of maximum separation
pow <- lapply(OR, function(or,n)list(x=n,y=bpower(p1=.1,odds.ratio=or,n=n)),
n=n)
names(pow) <- format(OR)
labcurve(pow, pl=TRUE, xlab='n', ylab='Power')
# Contour graph for various probabilities of outcome in the control
# group, fixing the odds ratio at .8 ([p2/(1-p2)] / [p1/(1-p1)] = .8)
# n is varied also
p1 <- seq(.01,.99,by=.01)
n <- seq(100,5000,by=250)
pow <- outer(p1, n, function(p1,n) bpower(p1, n=n, odds.ratio=.8))
# This forms a length(p1)*length(n) matrix of power estimates
contour(p1, n, pow)
Produces side-by-side box-percentile plots from several vectors or a list of vectors.
bpplot(..., name=TRUE, main="Box-Percentile Plot",
xlab="", ylab="", srtx=0, plotopts=NULL)
... 
vectors or lists containing
numeric components (e.g., the output of 
name 
character vector of names for the groups.
Default is 
main 
main title for the plot. 
xlab 
x axis label. 
ylab 
y axis label. 
srtx 
rotation angle for x-axis labels. Default is zero. 
plotopts 
a list of other parameters to send to 
There are no returned values
A plot is created on the current graphics device.
Box-percentile plots are similar to boxplots, except box-percentile plots supply more information about the univariate distributions. At any height the width of the irregular "box" is proportional to the percentile of that height, up to the 50th percentile, and above the 50th percentile the width is proportional to 100 minus the percentile. Thus, the width at any given height is proportional to the percent of observations that are more extreme in that direction. As in boxplots, the median, 25th and 75th percentiles are marked with line segments across the box.
Jeffrey Banfield
umsfjban@bill.oscs.montana.edu
Modified by F. Harrell 30Jun97
Esty WW, Banfield J: The boxpercentile plot. J Statistical Software 8 No. 17, 2003.
panel.bpplot
, boxplot
, Ecdf
,
bwplot
set.seed(1)
x1 <- rnorm(500)
x2 <- runif(500, -2, 2)
x3 <- abs(rnorm(500)) - 2
bpplot(x1, x2, x3)
g <- sample(1:2, 500, replace=TRUE)
bpplot(split(x2, g), name=c('Group 1','Group 2'))
rm(x1,x2,x3,g)
For any number of cross-classification variables, bystats
returns a matrix with the sample size, number missing y
, and
fun(nonmissing y)
, with the cross-classifications designated
by rows. Uses Harrell's modification of the interaction
function to produce cross-classifications. The default fun
is
mean
, and if y
is binary, the mean is labeled as
Fraction
. There is a print
method as well as a
latex
method for objects created by bystats
.
bystats2
handles the special case in which there are 2
classification variables, and places the first one in rows and the
second in columns. The print
method for bystats2
uses
the print.char.matrix
function to organize statistics
for cells into boxes.
bystats(y, ..., fun, nmiss, subset)
## S3 method for class 'bystats'
print(x, ...)
## S3 method for class 'bystats'
latex(object, title, caption, rowlabel, ...)
bystats2(y, v, h, fun, nmiss, subset)
## S3 method for class 'bystats2'
print(x, abbreviate.dimnames=FALSE,
prefix.width=max(nchar(dimnames(x)[[1]])), ...)
## S3 method for class 'bystats2'
latex(object, title, caption, rowlabel, ...)
y 
a binary, logical, or continuous variable or a matrix or data frame of
such variables. If 
... 
For 
v 
vertical variable for 
h 
horizontal variable for 
fun 
a function to compute on the nonmissing 
nmiss 
A column containing a count of missing values is included if 
subset 
a vector of subscripts or logical values indicating the subset of data to analyze 
abbreviate.dimnames 
set to 
prefix.width 

title 

caption 
caption to pass to 
rowlabel 

x 
an object created by 
object 
an object created by 
for bystats
, a matrix with row names equal to the classification labels and column
names N, Missing, funlab
, where funlab
is determined from fun
.
A row is added to the end with the summary statistics computed
on all observations combined. The class of this matrix is bystats
.
For bystats
, returns a 3dimensional array with the last dimension
corresponding to statistics being computed. The class of the array is
bystats2
.
latex
produces a .tex
file.
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
interaction
, cut
, cut2
, latex
, print.char.matrix
,
translate
## Not run:
bystats(sex==2, county, city)
bystats(death, race)
bystats(death, cut2(age,g=5), race)
bystats(cholesterol, cut2(age,g=4), sex, fun=median)
bystats(cholesterol, sex, fun=quantile)
bystats(cholesterol, sex, fun=function(x)c(Mean=mean(x),Median=median(x)))
latex(bystats(death,race,nmiss=FALSE,subset=sex=="female"), digits=2)
f <- function(y) c(Hazard=sum(y[,2])/sum(y[,1]))
# f() gets the hazard estimate for rightcensored data from exponential dist.
bystats(cbind(d.time, death), race, sex, fun=f)
bystats(cbind(pressure, cholesterol), age.decile,
fun=function(y) c(Median.pressure =median(y[,1]),
Median.cholesterol=median(y[,2])))
y <- cbind(pressure, cholesterol)
bystats(y, age.decile,
fun=function(y) apply(y, 2, median)) # same result as last one
bystats(y, age.decile, fun=function(y) apply(y, 2, quantile, c(.25,.75)))
# The last one computes separately the 0.25 and 0.75 quantiles of 2 vars.
latex(bystats2(death, race, sex, fun=table))
## End(Not run)
Capitalizes the first letter of each element of the string vector.
capitalize(string)
string 
String to be capitalized 
Returns a vector of characters with the first letter capitalized
Charles Dupont
capitalize(c("Hello", "bob", "daN"))
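The behavior can be reproduced in one line of base R; the helper below is a sketch, not the Hmisc source:

```r
# Capitalize the first letter of each element (base-R sketch of what
# capitalize does; only the first character is changed)
cap1 <- function(string)
  paste0(toupper(substring(string, 1, 1)), substring(string, 2))
cap1(c("Hello", "bob", "daN"))   # "Hello" "Bob" "DaN"
```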
Uses the method of Peterson and George to compute the power of an
interaction test in a 2 x 2 setup in which all 4 distributions are
exponential. This will be the same as the power of the Cox model
test if assumptions hold. The test is 2-tailed.
The duration of accrual is specified
(constant accrual is assumed), as is the minimum follow-up time.
The maximum follow-up time is then accrual + tmin
. Treatment
allocation is assumed to be 1:1.
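Under the exponential and uniform-accrual assumptions above, the probability that a subject's event is observed by study end has a standard closed form. The sketch below is that textbook result for hazard rate lambda, accrual duration A, and minimum follow-up tmin; it is not code extracted from ciapower:

```r
# P(event observed) for exponential failure times with uniform accrual
# over [0, A] and follow-up ranging from tmin to tmin + A.
# Standard closed form (integrating the exponential cdf over entry times).
p.event <- function(lambda, A, tmin) {
  1 - (exp(-lambda*tmin) - exp(-lambda*(tmin + A))) / (lambda * A)
}
p.event(0.04, 1.5, 5)   # e.g. 4%/year hazard, 1.5y accrual, 5y min follow-up
```

Expected event counts obtained this way are what drive the power approximations in ciapower and cpower.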
ciapower(tref, n1, n2, m1c, m2c, r1, r2, accrual, tmin,
alpha=0.05, pr=TRUE)
tref 
time at which mortalities estimated 
n1 
total sample size, stratum 1 
n2 
total sample size, stratum 2 
m1c 
tref-year mortality, stratum 1 control 
m2c 
tref-year mortality, stratum 2 control 
r1 
% reduction in m1c by intervention, stratum 1 
r2 
% reduction in m2c by intervention, stratum 2 
accrual 
duration of accrual period 
tmin 
minimum follow-up time 
alpha 
type I error probability 
pr 
set to 
power
prints
Frank Harrell
Department of Biostatistics
Vanderbilt University
Peterson B, George SL: Controlled Clinical Trials 14:511–522; 1993.
# Find the power of a race x treatment test. 25% of patients will
# be nonwhite and the total sample size is 14000.
# Accrual is for 1.5 years and minimum follow-up is 5y.
# Reduction in 5-year mortality is 15% for whites, 0% or 5% for
# nonwhites. 5-year mortality for control subjects is assumed to
# be 0.18 for whites, 0.23 for nonwhites.
n <- 14000
for(nonwhite.reduction in c(0,5)) {
cat("\n\n\n% Reduction in 5-year mortality for nonwhites:",
nonwhite.reduction, "\n\n")
pow <- ciapower(5, .75*n, .25*n, .18, .23, 15, nonwhite.reduction,
1.5, 5)
cat("\n\nPower:",format(pow),"\n")
}
Takes a set of coordinates in any of the 5 coordinate systems (usr, plt, fig, dev, or tdev) and returns the same points in all 5 coordinate systems.
cnvrt.coords(x, y = NULL, input = c("usr", "plt", "fig", "dev","tdev"))
x 
Vector, Matrix, or list of x coordinates (or x and y coordinates), NA's allowed. 
y 
y coordinates (if 
input 
Character scalar indicating the coordinate system of the input points. 
Every plot has 5 coordinate systems:
usr (User): the coordinate system of the data, this is shown by the tick marks and axis labels.
plt (Plot): Plot area, coordinates range from 0 to 1 with 0 corresponding to the x and y axes and 1 corresponding to the top and right of the plot area. Margins of the plot correspond to plot coordinates less than 0 or greater than 1.
fig (Figure): Figure area, coordinates range from 0 to 1 with 0 corresponding to the bottom and left edges of the figure (including margins, label areas) and 1 corresponds to the top and right edges. fig and dev coordinates will be identical if there is only 1 figure area on the device (layout, mfrow, or mfcol has not been used).
dev (Device): Device area, coordinates range from 0 to 1 with 0 corresponding to the bottom and left of the device region within the outer margins and 1 corresponding to the top and right of the region within the outer margins. If the outer margins are all set to 0 then tdev and dev should be identical.
tdev (Total Device): Total Device area, coordinates range from 0 to 1 with 0 corresponding to the bottom and left edges of the device (piece of paper, window on screen) and 1 corresponds to the top and right edges.
A list with 5 components, each component is a list with vectors named x and y. The 5 sublists are:
usr 
The coordinates of the input points in usr (User) coordinates. 
plt 
The coordinates of the input points in plt (Plot) coordinates. 
fig 
The coordinates of the input points in fig (Figure) coordinates. 
dev 
The coordinates of the input points in dev (Device) coordinates. 
tdev 
The coordinates of the input points in tdev (Total Device) coordinates. 
You must provide both x and y, but one of them may be NA
.
This function is becoming deprecated with the new functions
grconvertX
and grconvertY
in R version 2.7.0 and beyond.
These new functions use the correct coordinate system names and have
more coordinate systems available; you should start using them instead.
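A quick illustration of the replacement functions mentioned in the note above, using only the base graphics API (a device is opened to a temporary file just so the example runs non-interactively):

```r
# grconvertX/grconvertY: the modern replacements for cnvrt.coords.
# "npc" (normalized plot coordinates) corresponds roughly to
# cnvrt.coords' "plt" system, "user" to its "usr" system.
pdf(tempfile(fileext = ".pdf"))   # any open graphics device will do
plot(1:10, 1:10)
xu <- grconvertX(0.5, from = "npc", to = "user")
yu <- grconvertY(0.5, from = "npc", to = "user")
dev.off()
c(xu, yu)   # roughly the center of the data range
```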
Greg Snow greg.snow@imail.org
par
specifically 'usr','plt', and 'fig'. Also
'xpd' for plotting outside of the plotting region and 'mfrow' and
'mfcol' for multi figure plotting. subplot
,
grconvertX
and grconvertY
in R2.7.0 and later
old.par <- par(no.readonly=TRUE)
par(mfrow=c(2,2),xpd=NA)
# generate some sample data
tmp.x <- rnorm(25, 10, 2)
tmp.y <- rnorm(25, 50, 10)
tmp.z <- rnorm(25, 0, 1)
plot( tmp.x, tmp.y)
# draw a diagonal line across the plot area
tmp1 <- cnvrt.coords( c(0,1), c(0,1), input='plt' )
lines(tmp1$usr, col='blue')
# draw a diagonal line across figure region
tmp2 <- cnvrt.coords( c(0,1), c(1,0), input='fig')
lines(tmp2$usr, col='red')
# save coordinate of point 1 and y value near top of plot for future plots
tmp.point1 <- cnvrt.coords(tmp.x[1], tmp.y[1])
tmp.range1 <- cnvrt.coords(NA, 0.98, input='plt')
# make a second plot and draw a line linking point 1 in each plot
plot(tmp.y, tmp.z)
tmp.point2 <- cnvrt.coords( tmp.point1$dev, input='dev' )
arrows( tmp.y[1], tmp.z[1], tmp.point2$usr$x, tmp.point2$usr$y,
col='green')
# draw another plot and add rectangle showing same range in 2 plots
plot(tmp.x, tmp.z)
tmp.range2 <- cnvrt.coords(NA, 0.02, input='plt')
tmp.range3 <- cnvrt.coords(NA, tmp.range1$dev$y, input='dev')
rect( 9, tmp.range2$usr$y, 11, tmp.range3$usr$y, border='yellow')
# put a label just to the right of the plot and
# near the top of the figure region.
text( cnvrt.coords(1.05, NA, input='plt')$usr$x,
cnvrt.coords(NA, 0.75, input='fig')$usr$y,
"Label", adj=0)
par(mfrow=c(1,1))
## create a subplot within another plot (see also subplot)
plot(1:10, 1:10)
tmp <- cnvrt.coords( c( 1, 4, 6, 9), c(6, 9, 1, 4) )
par(plt = c(tmp$dev$x[1:2], tmp$dev$y[1:2]), new=TRUE)
hist(rnorm(100))
par(fig = c(tmp$dev$x[3:4], tmp$dev$y[3:4]), new=TRUE)
hist(rnorm(100))
par(old.par)
These functions are used on ggplot2
objects or as layers when
building a ggplot2
object, and to facilitate use of
gridExtra
. colorFacet
colors the thin
rectangles used to separate panels created by facet_grid
(and
probably by facet_wrap
).
A better approach may be found at https://stackoverflow.com/questions/28652284/.
arrGrob
is a frontend to gridExtra::arrangeGrob
that
allows for proper printing. See
https://stackoverflow.com/questions/29062766/store-output-from-gridextra-gridarrange-into-an-object/. The arrGrob
print
method calls grid::grid.draw
.
colorFacet(g, col = adjustcolor("blue", alpha.f = 0.3))
arrGrob(...)
## S3 method for class 'arrGrob'
print(x, ...)
g 
a 
col 
color for facet separator rectangles 
... 
passed to 
x 
an object created by 
Sandy Muspratt
## Not run:
require(ggplot2)
s <- summaryP(age + sex ~ region + treatment)
colorFacet(ggplot(s)) # prints directly
# arrGrob is called by rms::ggplot.Predict and others
## End(Not run)
Combine Infrequent Levels of a Categorical Variable
combine.levels(
x,
minlev = 0.05,
m,
ord = is.ordered(x),
plevels = FALSE,
sep = ","
)
x 
a factor, 'ordered' factor, or numeric or character variable that will be turned into a 'factor' 
minlev 
the minimum proportion of observations in a cell before that cell is combined with one or more cells. If more than one cell has fewer than minlev*n observations, all such cells are combined into a new cell labeled '"OTHER"'. Otherwise, the lowest frequency cell is combined with the next lowest frequency cell, and the level name is the combination of the two old level names. When 'ord=TRUE' combinations happen only for consecutive levels. 
m 
alternative to 'minlev', is the minimum number of observations in a cell before it will be combined with others 
ord 
set to 'TRUE' to treat 'x' as if it were an ordered factor, which allows only consecutive levels to be combined 
plevels 
by default 'combine.levels' pools lowfrequency levels into a category named 'OTHER' when 'x' is not ordered and 'ord=FALSE'. To instead name this category the concatenation of all the pooled level names, separated by a comma, set 'plevels=TRUE'. 
sep 
the separator for concatenating levels when 'plevels=TRUE' 
After turning 'x' into a 'factor' if it is not one already, combines levels of 'x' whose frequency falls below a specified relative frequency 'minlev' or absolute count 'm'. When 'x' is not treated as ordered, all of the small frequency levels are combined into '"OTHER"', unless 'plevels=TRUE'. When 'ord=TRUE' or 'x' is an ordered factor, only consecutive levels are combined. New levels are constructed by concatenating the levels with 'sep' as a separator. This is useful when comparing ordinal regression with polytomous (multinomial) regression and there are too many categories for polytomous regression. 'combine.levels' is also useful when assumptions of ordinal models are being checked empirically by computing exceedance probabilities for various cutoffs of the dependent variable.
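The pooling rule for the unordered case can be sketched in a few lines of base R; `pool` below is a hypothetical helper, far simpler than what combine.levels actually does:

```r
# Pool levels with fewer than m observations into "OTHER"
# (sketch of the unordered, absolute-count case only)
pool <- function(x, m) {
  tab   <- table(x)
  small <- names(tab)[tab < m]
  x     <- as.character(x)
  x[x %in% small] <- 'OTHER'
  factor(x)
}
x <- c(rep('A',1), rep('B',3), rep('C',4), rep('D',1), rep('E',1))
levels(pool(x, m=3))   # "B" "C" "OTHER"
```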
a factor variable, or if 'ord=TRUE' an ordered factor variable
Frank Harrell
x <- c(rep('A', 1), rep('B', 3), rep('C', 4), rep('D',1), rep('E',1))
combine.levels(x, m=3)
combine.levels(x, m=3, plevels=TRUE)
combine.levels(x, ord=TRUE, m=3)
x <- c(rep('A', 1), rep('B', 3), rep('C', 4), rep('D',1), rep('E',1),
rep('F',1))
combine.levels(x, ord=TRUE, m=3)
Generates a plotly attribute plot given a series of possibly overlapping binary variables
combplotp(
formula,
data = NULL,
subset,
na.action = na.retain,
vnames = c("labels", "names"),
includenone = FALSE,
showno = FALSE,
maxcomb = NULL,
minfreq = NULL,
N = NULL,
pos = function(x) 1 * (tolower(x) %in% c("true", "yes", "y", "positive", "+",
"present", "1")),
obsname = "subjects",
ptsize = 35,
width = NULL,
height = NULL,
...
)
formula 
a formula containing all the variables to be crosstabulated, on the formula's right hand side. There is no left hand side variable. If 
data 
input data frame. If none is specified the data are assumed to come from the parent frame. 
subset 
an optional subsetting expression applied to 
na.action 
see 
vnames 
set to 
includenone 
set to 
showno 
set to 
maxcomb 
maximum number of combinations to display 
minfreq 
if specified, any combination having a frequency less than this will be omitted from the display 
N 
set to an integer to override the global denominator, instead of using the number of rows in the data 
pos 
a function of vector returning a logical vector with 
obsname 
character string noun describing observations, default is 
ptsize 
point size, defaults to 35 
width 
width of 
height 
height of 
... 
optional arguments to pass to 
Similar to the UpSetR
package, draws a special dot chart sometimes called an attribute plot that depicts all possible combinations of the binary variables. By default a positive value, indicating that a certain condition pertains for a subject, is any of logical TRUE
, numeric 1, "yes"
, "y"
, "positive"
, "+"
or "present"
value, and all others are considered negative. The user can override this determination by specifying her own pos
function. Case is ignored in the variable values.
The plot uses solid dots arranged in a vertical line to indicate which combination of conditions is being considered. Frequencies of all possible combinations are shown above the dot chart. Marginal frequencies of positive values for the input variables are shown to the left of the dot chart. More information for all three of these component symbols is provided in hover text.
Variables are sorted in descending order of marginal frequencies, and likewise for combinations of variables.
plotly
object
Frank Harrell
if (requireNamespace("plotly")) {
g <- function() sample(0:1, n, prob=c(1 - p, p), replace=TRUE)
set.seed(2); n <- 100; p <- 0.5
x1 <- g(); label(x1) <- 'A long label for x1 that describes it'
x2 <- g()
x3 <- g(); label(x3) <- 'This is<br>a label for x3'
x4 <- g()
combplotp(~ x1 + x2 + x3 + x4, showno=TRUE, includenone=TRUE)
n <- 1500; p <- 0.05
pain <- g()
anxiety <- g()
depression <- g()
soreness <- g()
numbness <- g()
tiredness <- g()
sleepiness <- g()
combplotp(~ pain + anxiety + depression + soreness + numbness +
tiredness + sleepiness, showno=TRUE)
}
Create imputed dataset(s) using transcan
and aregImpute
objects
completer(a, nimpute, oneimpute = FALSE, mydata)
a 
An object of class 
nimpute 
A numeric vector between 1 and 
oneimpute 
A logical vector. When set to 
mydata 
A data frame in which its missing values will be imputed. 
Similar in function to mice::complete
, this function uses transcan
and aregImpute
objects to impute missing data
and returns the completed dataset(s) as a dataframe or a list.
It assumes that transcan
is used for single regression imputation.
A single or a list of completed dataset(s).
YongHao Pua, Singapore General Hospital
## Not run:
mtcars$hp[1:5] <- NA
mtcars$wt[1:10] <- NA
myrform <- ~ wt + hp + I(carb)
mytranscan <- transcan( myrform, data = mtcars, imputed = TRUE,
pl = FALSE, pr = FALSE, trantab = TRUE, long = TRUE)
myareg <- aregImpute(myrform, data = mtcars, x=TRUE, n.impute = 5)
completer(mytranscan) # single completed dataset
completer(myareg, 3, oneimpute = TRUE)
# single completed dataset based on the `n.impute`th set of multiple imputation
completer(myareg, 3)
# list of completed datasets based on first `nimpute` sets of multiple imputation
completer(myareg)
# list of completed datasets based on all available sets of multiple imputation
# To get a stacked data frame of all completed datasets use
# do.call(rbind, completer(myareg, data=mydata))
# or use rbindlist in data.table
## End(Not run)
Merges an object by the names of its elements: elements of value
whose names do not exist in x
are inserted into x
, and elements of x
whose names do exist in value
are replaced by the corresponding
value
elements unless protect
is true.
consolidate(x, value, protect, ...)
## Default S3 method:
consolidate(x, value, protect=FALSE, ...)
consolidate(x, protect, ...) < value
x 
named list or vector 
value 
named list or vector 
protect 
logical; should elements in 
... 
currently does nothing; included if ever want to make generic. 
Charles Dupont
x <- 1:5
names(x) <- LETTERS[x]
y <- 6:10
names(y) <- LETTERS[y-2]
x # c(A=1,B=2,C=3,D=4,E=5)
y # c(D=6,E=7,F=8,G=9,H=10)
consolidate(x, y) # c(A=1,B=2,C=3,D=6,E=7,F=8,G=9,H=10)
consolidate(x, y, protect=TRUE) # c(A=1,B=2,C=3,D=4,E=5,F=8,G=9,H=10)
contents
is a generic method for which contents.data.frame
is currently the only method. contents.data.frame
creates an
object containing the following attributes of the variables
from a data frame: names, labels (if any), units (if any), number of
factor levels (if any), factor levels,
class, storage mode, and number of NAs. print.contents.data.frame
will print the results, with options for sorting the variables.
html.contents.data.frame
creates HTML code for displaying the
results. This code has hyperlinks so that if the user clicks on the
number of levels the browser jumps to the correct part of a table of
factor levels for all the factor
variables. If long labels are
present ("longlabel"
attributes on variables), these are printed
at the bottom and the html
method links to them through the
regular labels. Variables having the same levels
in the same
order have the levels factored out for brevity.
contents.list
prints a directory of datasets when
sasxport.get
imported more than one SAS dataset.
If options(prType='html')
is in effect, calling print
on
an object that is the contents of a data frame will result in
rendering the HTML version. If run from the console a browser window
will open.
contents(object, ...)
## S3 method for class 'data.frame'
contents(object, sortlevels=FALSE, id=NULL,
range=NULL, values=NULL, ...)
## S3 method for class 'contents.data.frame'
print(x,
sort=c('none','names','labels','NAs'), prlevels=TRUE, maxlevels=Inf,
number=FALSE, ...)
## S3 method for class 'contents.data.frame'
html(object,
sort=c('none','names','labels','NAs'), prlevels=TRUE, maxlevels=Inf,
levelType=c('list','table'),
number=FALSE, nshow=TRUE, ...)
## S3 method for class 'list'
contents(object, dslabels, ...)
## S3 method for class 'contents.list'
print(x,
sort=c('none','names','labels','NAs','vars'), ...)
object 
a data frame. For 
sortlevels 
set to 
id 
an optional subject ID variable name that if present in

range 
an optional variable name that if present in 
values 
an optional variable name that if present in

x 
an object created by 
sort 
Default is to print the variables in their original order in the
data frame. Specify one of

prlevels 
set to 
maxlevels 
maximum number of levels to print for a 
number 
set to 
nshow 
set to 
levelType 
By default, bullet lists of category levels are
constructed in html. Set 
... 
arguments passed from 
dslabels 
named vector of SAS dataset labels, created for
example by 
an object of class "contents.data.frame"
or
"contents.list"
. For the html
method is an html
character vector object.
Frank Harrell
Vanderbilt University
fh@fharrell.com
describe
, html
, upData
,
extractlabs
, hlab
set.seed(1)
dfr <- data.frame(x=rnorm(400),y=sample(c('male','female'),400,TRUE),
stringsAsFactors=TRUE)
contents(dfr)
dfr <- upData(dfr, labels=c(x='Label for x', y='Label for y'))
attr(dfr$x, 'longlabel') <-
'A very long label for x that can continue onto multiple long lines of text'
k <- contents(dfr)
print(k, sort='names', prlevels=FALSE)
## Not run:
html(k)
html(contents(dfr)) # same result
latex(k$contents) # latex.default just the main information
## End(Not run)
Assumes exponential distributions for both treatment groups. Uses the George-Desu method along with formulas of Schoenfeld that allow estimation of the expected number of events in the two groups. To allow for drop-ins (noncompliance to control therapy, crossover to intervention) and noncompliance of the intervention, the method of Lachin and Foulkes is used.
cpower(tref, n, mc, r, accrual, tmin, noncomp.c=0, noncomp.i=0,
alpha=0.05, nc, ni, pr=TRUE)
tref 
time at which mortalities estimated 
n 
total sample size (both groups combined). If allocation is unequal
so that there are not 
mc 
tref-year mortality, control 
r 
% reduction in 
accrual 
duration of accrual period 
tmin 
minimum follow-up time 
noncomp.c 
% noncompliant in control group (drop-ins) 
noncomp.i 
% noncompliant in intervention group (non-adherers) 
alpha 
type I error probability. A 2-tailed test is assumed. 
nc 
number of subjects in control group 
ni 
number of subjects in intervention group. 
pr 
set to 
For handling noncompliance, uses a modification of formula (5.4) of
Lachin and Foulkes. Their method is based on a test for the difference
in two hazard rates, whereas cpower
is based on testing the difference
in two log hazards. It is assumed here that the same correction factor
can be approximately applied to the log hazard ratio as Lachin and Foulkes applied to
the hazard difference.
Note that Schoenfeld approximates the variance
of the log hazard ratio by 4/m
, where m
is the total number of events,
whereas the George-Desu method uses the slightly better 1/m1 + 1/m2
.
Power from this function will thus differ slightly from that obtained with
the SAS samsizc
program.
power
prints
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
Peterson B, George SL: Controlled Clinical Trials 14:511–522; 1993.
Lachin JM, Foulkes MA: Biometrics 42:507–519; 1986.
Schoenfeld D: Biometrics 39:499–503; 1983.
#In this example, 4 plots are drawn on one page, one plot for each
#combination of noncompliance percentage. Within a plot, the
#5-year mortality % in the control group is on the x-axis, and
#separate curves are drawn for several % reductions in mortality
#with the intervention. The accrual period is 1.5y, with all
#patients followed at least 5y and some 6.5y.
par(mfrow=c(2,2),oma=c(3,0,3,0))
morts <- seq(10,25,length=50)
red <- c(10,15,20,25)
for(noncomp in c(0,10,15,-1)) {
if(noncomp>=0) nc.i <- nc.c <- noncomp else {nc.i <- 25; nc.c <- 15}
z <- paste("Drop-in ",nc.c,"%, Non-adherence ",nc.i,"%",sep="")
plot(0,0,xlim=range(morts),ylim=c(0,1),
xlab="5-year Mortality in Control Patients (%)",
ylab="Power",type="n")
title(z)
cat(z,"\n")
lty <- 0
for(r in red) {
lty <- lty+1
power <- morts
i <- 0
for(m in morts) {
i <- i+1
power[i] <- cpower(5, 14000, m/100, r, 1.5, 5, nc.c, nc.i, pr=FALSE)
}
lines(morts, power, lty=lty)
}
if(noncomp==0)legend(18,.55,rev(paste(red,"% reduction",sep="")),
lty=4:1,bty="n")
}
mtitle("Power vs Non-Adherence for Main Comparison",
ll="alpha=.05, 2-tailed, Total N=14000",cex.l=.8)
#
# Point sample size requirement vs. mortality reduction
# Root finder (uniroot()) assumes needed sample size is between
# 1000 and 40000
#
nc.i <- 25; nc.c <- 15; mort <- .18
red <- seq(10,25,by=.25)
samsiz <- red
i <- 0
for(r in red) {
i <- i+1
samsiz[i] <- uniroot(function(x) cpower(5, x, mort, r, 1.5, 5,
nc.c, nc.i, pr=FALSE) - .8,
c(1000,40000))$root
}
samsiz <- samsiz/1000
par(mfrow=c(1,1))
plot(red, samsiz, xlab='% Reduction in 5-Year Mortality',
ylab='Total Sample Size (Thousands)', type='n')
lines(red, samsiz, lwd=2)
title('Sample Size for Power=0.80\nDrop-in 15%, Non-adherence 25%')
title(sub='alpha=0.05, 2-tailed', adj=0)
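As a quick sanity check before running the loops above, a single call computes power for one scenario (a minimal sketch, assuming the Hmisc package is attached; the argument values mirror those used in the plots):

```r
# Power for a 5-year endpoint: total N=14000, 18% 5-year control
# mortality, 25% relative reduction, 1.5y accrual, minimum 5y
# follow-up, 15% drop-in, 25% non-adherence
require(Hmisc)
cpower(tref=5, n=14000, mc=.18, r=25, accrual=1.5, tmin=5,
       noncomp.c=15, noncomp.i=25)
```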
Cs
makes a vector of character strings from a list of valid R
names. .q
is similar but also makes use of the names of arguments.
Cs(...)
.q(...)
... 
any number of names separated by commas. For 
character string vector. For .q
there will be a names
attribute to the vector if any names appeared in ....
sys.frame, deparse
Cs(a,cat,dog)
# subset.data.frame <- dataframe[,Cs(age,sex,race,bloodpressure,height)]
.q(a, b, c, 'this and that')
.q(dog=a, giraffe=b, cat=c)
Read comma-separated text data files, allowing optional translation
to lower case for variable names after making them valid S names.
There is a facility for reading long variable labels as one of the
rows. If labels are not specified and a final variable name is not
the same as that in the header, the original variable name is saved as
a variable label. Uses read.csv
if the data.table
package is not in effect, otherwise calls fread
.
csv.get(file, lowernames=FALSE, datevars=NULL, datetimevars=NULL,
dateformat='%F',
fixdates=c('none','year'), comment.char="", autodate=TRUE,
allow=NULL, charfactor=FALSE,
sep=',', skip=0, vnames=NULL, labels=NULL, text=NULL, ...)
file 
the file name for import. 
lowernames 
set this to 
datevars 
character vector of names (after 
datetimevars 
character vector of names (after 
dateformat 
for 
fixdates 
for any of the variables listed in 
comment.char 
a character vector of length one containing a single character or an empty string. Use '""' to turn off the interpretation of comments altogether. 
autodate 
set to TRUE to allow the function to guess at which variables are dates 
allow 
a vector of characters allowed by R that should not be converted to periods in variable names. By default, underscores in variable names are converted to periods as with R before version 1.9. 
charfactor 
set to 
sep 
field separator, defaults to comma 
skip 
number of records to skip before data start. Required if

vnames 
number of row containing variable names, default is one 
labels 
number of row containing variable labels, default is no labels 
text 
a character string containing the 
... 
arguments to pass to 
csv.get
reads comma-separated text data files, allowing optional
translation to lower case for variable names after making them valid S
names. Original possibly nonlegal names are taken to be variable
labels if labels
is not specified. Character or factor
variables containing dates can be converted to date variables.
cleanup.import
is invoked to finish the job.
a new data frame.
Frank Harrell, Vanderbilt University
sas.get
, data.frame
,
cleanup.import
, read.csv
,
strptime
, POSIXct
, Date
,
fread
## Not run:
dat <- csv.get('myfile.csv')
# Read a csv file with junk in the first row, variable names in the
# second, long variable labels in the third, and junk in the 4th row
dat <- csv.get('myfile.csv', vnames=2, labels=3, skip=4)
## End(Not run)
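The text argument shown in the usage can also be used to read from an in-memory character string rather than a file; a minimal sketch, assuming the Hmisc package is attached:

```r
# Read two variables from a string instead of a file
require(Hmisc)
dat <- csv.get(text='x,y\n1,2\n3,4')
dat
```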
curveRep
finds representative curves from a
relatively large collection of curves. The curves usually represent
time-response profiles as in serial (longitudinal or repeated) data
with possibly unequal time points and greatly varying sample sizes per
subject. After excluding records containing missing x
or
y
, records are first stratified into kn
groups having similar
sample sizes per curve (subject). Within these strata, curves are
next stratified according to the distribution of x
points per
curve (typically measurement times per subject). The
clara
clustering/partitioning function is used
to do this, clustering on one, two, or three x
characteristics
depending on the minimum sample size in the current interval of sample
size. If the interval has a minimum number of unique values
of
one, clustering is done on the single x
values. If the minimum
number of unique x
values is two, clustering is done to create
groups that are similar on both min(x)
and max(x)
. For
groups containing no fewer than three unique x
values,
clustering is done on the trio of values min(x)
, max(x)
,
and the longest gap between any successive x
. Then within
sample size and x
distribution strata, clustering of
time-response profiles is based on p
values of y
all
evaluated at the same p
equally-spaced x
's within the
stratum. An option allows per-curve data to be smoothed with
lowess
before proceeding. Outer x
values are
taken as extremes of x
across all curves within the stratum.
Linear interpolation within curves is used to estimate y
at the
grid of x
's. For curves within the stratum that do not extend
to the most extreme x
values in that stratum, extrapolation
uses flat lines from the observed extremes in the curve unless
extrap=TRUE
. The p
y
values are clustered using
clara
.
print
and plot
methods show results. By specifying an
auxiliary idcol
variable to plot
, other variables such
as treatment may be depicted to allow the analyst to determine for
example whether subjects on different treatments are assigned to
different time-response profiles. To write the frequencies of a
variable such as treatment in the upper left corner of each panel
(instead of the grand total number of clusters in that panel), specify
freq
.
curveSmooth
takes a set of curves and smooths them using
lowess
. If the number of unique x
points in a curve is
less than p
, the smooth is evaluated at the unique x
values. Otherwise it is evaluated at an equally spaced set of
x
points over the observed range. If fewer than 3 unique
x
values are in a curve, those points are used and smoothing is not done.
curveRep(x, y, id, kn = 5, kxdist = 5, k = 5, p = 5,
force1 = TRUE, metric = c("euclidean", "manhattan"),
smooth=FALSE, extrap=FALSE, pr=FALSE)
## S3 method for class 'curveRep'
print(x, ...)
## S3 method for class 'curveRep'
plot(x, which=1:length(res),
method=c('all','lattice','data'),
m=NULL, probs=c(.5, .25, .75), nx=NULL, fill=TRUE,
idcol=NULL, freq=NULL, plotfreq=FALSE,
xlim=range(x), ylim=range(y),
xlab='x', ylab='y', colorfreq=FALSE, ...)
curveSmooth(x, y, id, p=NULL, pr=TRUE)
x 
a numeric vector, typically measurement times.
For 
y 
a numeric vector of response values 
id 
a vector of curve (subject) identifiers, the same length as

kn 
number of curve sample size groups to construct.

kxdist 
maximum number of xdistribution clusters to derive
using 
k 
maximum number of xy profile clusters to derive using 
p 
number of 
force1 
By default if any curves have only one point, all curves
consisting of one point will be placed in a separate stratum. To
prevent this separation, set 
metric 
see 
smooth 
By default, linear interpolation is used on raw data to
obtain 
extrap 
set to 
pr 
set to 
which 
an integer vector specifying which sample size intervals
to plot. Must be specified if 
method 
The default makes individual plots of possibly all
x-distribution by sample size by cluster combinations. Fewer may be
plotted by specifying 
m 
the number of curves in a cluster to randomly sample if there
are more than 
nx 
applies if 
probs 
3-vector of probabilities with the central quantile first. Default uses quartiles. 
fill 
for 
idcol 
a named vector to be used as a table lookup for color
assignments (does not apply when 
freq 
a named vector to be used as a table lookup for a grouping
variable such as treatment. The names are curve 
plotfreq 
set to 
colorfreq 
set to 
xlim, ylim, xlab, ylab

plotting parameters. Default ranges are
the ranges in the entire set of raw data given to 
... 
arguments passed to other functions. 
In the graph titles for the default graphic output, n
refers to the
minimum sample size, x
refers to the sequential x-distribution
cluster, and c
refers to the sequential x-y profile cluster. Graphs
from method = "lattice"
are produced by
xyplot
and in the panel titles
distribution
refers to the x-distribution stratum and
cluster
refers to the x-y profile cluster.
a list of class "curveRep"
with the following elements
res 
a hierarchical list first split by sample size intervals,
then by x distribution clusters, then containing a vector of cluster
numbers with 
ns 
a table of frequencies of sample sizes per curve after
removing 
nomit 
total number of records excluded due to 
missfreq 
a table of frequencies of number of 
ncuts 
cut points for sample size intervals 
kn 
number of sample size intervals 
kxdist 
number of clusters on x distribution 
k 
number of clusters of curves within sample size and distribution groups 
p 
number of points at which to evaluate each curve for clustering 
x 

y 

id 
input data after removing 
curveSmooth
returns a list with elements x,y,id
.
The references describe other methods for deriving
representative curves, but those methods were not used here. The last
reference which used a cluster analysis on principal components
motivated curveRep
however. The kml
package does k-means clustering of longitudinal data with imputation.
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
Segal M. (1994): Representative curves for longitudinal data via regression trees. J Comp Graph Stat 3:214-233.
Jones MC, Rice JA (1992): Displaying the important features of large collections of similar curves. Am Statistician 46:140-145.
Zheng X, Simpson JA, et al (2005): Data from a study of effectiveness suggested potential prognostic factors related to the patterns of shoulder pain. J Clin Epi 58:823-830.
## Not run:
# Simulate 200 curves with per-curve sample sizes ranging from 1 to 10
# Make curves with odd-numbered IDs have an x-distribution that is random
# uniform [0,1] and those with even-numbered IDs have an x-dist. that is
# half as wide but still centered at 0.5. Shift y values higher with
# increasing IDs
set.seed(1)
N <- 200
nc <- sample(1:10, N, TRUE)
id <- rep(1:N, nc)
x <- y <- id
for(i in 1:N) {
x[id==i] <- if(i %% 2) runif(nc[i]) else runif(nc[i], c(.25, .75))
y[id==i] <- i + 10*(x[id==i] - .5) + runif(nc[i], -10, 10)
}
w <- curveRep(x, y, id, kxdist=2, p=10)
w
par(ask=TRUE, mfrow=c(4,5))
plot(w) # show everything, profiles going across
par(mfrow=c(2,5))
plot(w,1) # show n=1 results
# Use a color assignment table, assigning low curves to green and
# high to red. Unique curve (subject) IDs are the names of the vector.
cols <- c(rep('green', N/2), rep('red', N/2))
names(cols) <- as.character(1:N)
plot(w, 3, idcol=cols)
par(ask=FALSE, mfrow=c(1,1))
plot(w, 1, 'lattice') # show n=1 results
plot(w, 3, 'lattice') # show n=4-5 results
plot(w, 3, 'lattice', idcol=cols) # same but different color mapping
plot(w, 3, 'lattice', m=1) # show a single "representative" curve
# Show median, 10th, and 90th percentiles of supposedly representative curves
plot(w, 3, 'lattice', m='quantiles', probs=c(.5,.1,.9))
# Same plot but with much less grouping of x variable
plot(w, 3, 'lattice', m='quantiles', probs=c(.5,.1,.9), nx=2)
# Use ggplot2 for one sample size interval
z <- plot(w, 2, 'data')
require(ggplot2)
ggplot(z, aes(x, y, color=curve)) + geom_line() +
facet_grid(distribution ~ cluster) +
theme(legend.position='none') +
labs(caption=z$ninterval[1])
# Smooth data before profiling. This allows later plotting to plot
# smoothed representative curves rather than raw curves (which
# specifying smooth=TRUE to curveRep would do, if curveSmooth was not used)
d <- curveSmooth(x, y, id)
w <- with(d, curveRep(x, y, id))
# Example to show that curveRep can cluster profiles correctly when
# there is no noise. In the data there are four profiles: flat, flat
# at a higher mean y, linearly increasing then flat, and flat at the
# first height except for a sharp triangular peak
set.seed(1)
x <- 0:100
m <- length(x)
profile <- matrix(NA, nrow=m, ncol=4)
profile[,1] <- rep(0, m)
profile[,2] <- rep(3, m)
profile[,3] <- c(0:3, rep(3, m-4))
profile[,4] <- c(0,1,3,1,rep(0,m-4))
col <- c('black','blue','green','red')
matplot(x, profile, type='l', col=col)
xeval <- seq(0, 100, length.out=5)
s <- x %in% xeval
matplot(x[s], profile[s,], type='l', col=col)
id <- rep(1:100, each=m)
X <- Y <- id
cols <- character(100)
names(cols) <- as.character(1:100)
for(i in 1:100) {
s <- id==i
X[s] <- x
j <- sample(1:4,1)
Y[s] <- profile[,j]
cols[i] <- col[j]
}
table(cols)
yl <- c(-1,4)
w <- curveRep(X, Y, id, kn=1, kxdist=1, k=4)
plot(w, 1, 'lattice', idcol=cols, ylim=yl)
# Found 4 clusters but two have same profile
w <- curveRep(X, Y, id, kn=1, kxdist=1, k=3)
plot(w, 1, 'lattice', idcol=cols, freq=cols, plotfreq=TRUE, ylim=yl)
# Incorrectly combined black and red because default value p=5 did
# not result in different profiles at x=xeval
w <- curveRep(X, Y, id, kn=1, kxdist=1, k=4, p=40)
plot(w, 1, 'lattice', idcol=cols, ylim=yl)
# Found correct clusters because evaluated curves at 40 equally
# spaced points and could find the sharp triangular peak in profile 4
## End(Not run)
Function like cut but left endpoints are inclusive and labels are of
the form [lower, upper)
, except that last interval is [lower,upper]
.
If cuts are given, will by default make sure that cuts include entire
range of x
.
Also, if cuts are not given, will cut x
into quantile groups
(g
given) or groups
with a given minimum number of observations (m
). Whereas cut creates a
category object, cut2
creates a factor object.
cut2(x, cuts, m=150, g, levels.mean=FALSE, digits, minmax=TRUE,
oneval=TRUE, onlycuts=FALSE, formatfun=format, ...)
x 
numeric vector to classify into intervals 
cuts 
cut points 
m 
desired minimum number of observations in a group. The algorithm does
not guarantee that all groups will have at least 
g 
number of quantile groups 
levels.mean 
set to 
digits 
number of significant digits to use in constructing levels. Default is 3
(5 if 
minmax 
if cuts is specified but 
oneval 
if an interval contains only one unique value, the interval will be
labeled with the formatted version of that value instead of the
interval endpoints, unless 
onlycuts 
set to 
formatfun 
formatting function, supports formula notation (if 
... 
additional arguments passed to 
a factor variable with levels of the form [a,b)
or formatted means
(character strings) unless onlycuts
is TRUE
in which case
a numeric vector is returned
set.seed(1)
x <- runif(1000, 0, 100)
z <- cut2(x, c(10,20,30))
table(z)
table(cut2(x, g=10)) # quantile groups
table(cut2(x, m=50)) # group x into intervals with at least 50 obs.
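Continuing the example above, the onlycuts and levels.mean arguments listed in the usage can be illustrated on the same simulated x (a brief sketch):

```r
# Return just the numeric cut points rather than a factor
cut2(x, g=4, onlycuts=TRUE)
# Label each quartile group by its mean x instead of its interval
table(cut2(x, g=4, levels.mean=TRUE))
```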
This help file contains a template for importing data to create an R data frame, correcting some problems resulting from the import and making the data frame be stored more efficiently, modifying the data frame (including better annotating it and changing the names of some of its variables), and checking and inspecting the data frame for reasonableness of the values of its variables and to describe patterns of missing data. Various built-in functions and functions in the Hmisc library are used. At the end some methods for creating data frames “from scratch” within R are presented.
The examples below attempt to clarify the separation of operations
that are done on a data frame as a whole, operations that are done on
a small subset of its variables without attaching the whole data
frame, and operations that are done on many variables after attaching
the data frame in search position one. It also tries to clarify that
for analyzing several separate variables using R commands that do not
support a data
argument, it is helpful to attach the data frame
in a search position later than position one.
It is often useful to create, modify, and process datasets in the following order.
Import external data into a data frame (if the raw data do not contain column names, provide these during the import if possible)
Make global changes to a data frame (e.g., changing variable names)
Change attributes or values of variables within a data frame
Do analyses involving the whole data frame (without attaching it)
(Data frame still in .Data)
Do analyses of individual variables (after attaching the data frame in search position two or later)
The examples below use the FEV
dataset from
Rosner 1995. Almost any dataset would do. The jcetable data
are taken from Galobardes, et al.
Presently, giving a variable the "units"
attribute (using the
Hmisc units
function) only benefits the
Hmisc describe
function and the rms
library's version of the Surv
function. Variable
labels defined with the Hmisc label
function are used by
describe
, summary.formula
, and many of
the plotting functions in Hmisc and rms.
Alzola CF, Harrell FE (2006): An Introduction to S and the Hmisc and Design Libraries. Chapters 3 and 4, https://hbiostat.org/R/doc/sintro.pdf.
Galobardes, et al. (1998), J Clin Epi 51:875-881.
Rosner B (1995): Fundamentals of Biostatistics, 4th Edition. New York: Duxbury Press.
scan
, read.table
,
cleanup.import
, sas.get
,
data.frame
, attach
, detach
,
describe
, datadensity
,
plot.data.frame
, hist.data.frame
,
naclus
, factor
, label
,
units
, names
, expand.grid
,
summary.formula
, summary.data.frame
,
casefold
, edit
, page
,
plot.data.frame
, Cs
,
combine.levels
,upData
## Not run:
# First, we do steps that create or manipulate the data
# frame in its entirety. For S-Plus, these are done with
# .Data in search position one (the default at the
# start of the session).
#
# 
# Step 1: Create initial draft of data frame
#
# We usually begin by importing a dataset from
# another application. ASCII files may be imported
# using the scan and read.table functions. SAS
# datasets may be imported using the Hmisc sas.get
# function (which will carry more attributes from
# SAS than using File ... Import) from the GUI
# menus. But for most applications (especially
# Excel), File ... Import will suffice. If using
# the GUI, it is often best to provide variable
# names during the import process, using the Options
# tab, rather than renaming all fields later. Of
# course, if the data to be imported already have
# field names (e.g., in Excel), let S use those
# automatically. If using S-Plus, you can use a
# command to execute File ... Import, e.g.:
import.data(FileName = "/windows/temp/fev.asc",
FileType = "ASCII", DataFrame = "FEV")
# Here we name the new data frame FEV rather than
# fev, because we wanted to distinguish a variable
# in the data frame named fev from the data frame
# name. For S-Plus the command will look
# instead like the following:
FEV <- importData("/tmp/fev.asc")
# 
# Step 2: Clean up data frame / make it be more
# efficiently stored
#
# Unless using sas.get to import your dataset
# (sas.get already stores data efficiently), it is
# usually a good idea to run the data frame through
# the Hmisc cleanup.import function to change
# numeric variables that are always whole numbers to
# be stored as integers, the remaining numerics to
# single precision, strange values from Excel to
# NAs, and character variables that always contain
# legal numeric values to numeric variables.
# cleanup.import typically halves the size of the
# data frame. If you do not specify any parameters
# to cleanup.import, the function assumes that no
# numeric variable needs more than 7 significant
# digits of precision, so all non-integer-valued
# variables will be converted to single precision.
FEV <- cleanup.import(FEV)
# 
# Step 3: Make global changes to the data frame
#
# A data frame has attributes that are "external" to
# its variables. There are the vector of its
# variable names ("names" attribute), the
# observation identifiers ("row.names"), and the
# "class" (whose value is "data.frame"). The
# "names" attribute is the one most commonly in need
# of modification. If we had wanted to change all
# the variable names to lower case, we could have
# specified lowernames=TRUE to the cleanup.import
# invocation above, or type
names(FEV) <- casefold(names(FEV))
# The upData function can also be used to change
# variable names in two ways (see below).
# To change names in a non-systematic way we use
# other options. Under Windows/NT the most
# straightforward approach is to change the names
# interactively. Click on the data frame in the
# left panel of the Object Browser, then in the
# right pane click twice (slowly) on a variable.
# Use the left arrow and other keys to edit the
# name. Click outside that name field to commit the
# change. You can also rename columns while in a
# Data Sheet. To instead use programming commands
# to change names, use something like:
names(FEV)[6] <- 'smoke' # assumes you know the positions!
names(FEV)[names(FEV)=='smoking'] <- 'smoke'
names(FEV) <- edit(names(FEV))
# The last example is useful if you are changing
# many names. But none of the interactive
# approaches such as edit() are handy if you will be
# reimporting the dataset after it is updated in
# its original application. This problem can be
# addressed by saving the new names in a permanent
# vector in .Data:
new.names <- names(FEV)
# Then if the data are reimported, you can type
names(FEV) <- new.names
# to rename the variables.
# 
# Step 4: Delete unneeded variables
#
# To delete some of the variables, you can
# right-click on variable names in the Object
# Browser's right pane, then select Delete. You can
# also set variables to have NULL values, which
# causes the system to delete them. We don't need
# to delete any variables from FEV but suppose we
# did need to delete some from mydframe.
mydframe$x1 <- NULL
mydframe$x2 <- NULL
mydframe[c('age','sex')] <- NULL # delete 2 variables
mydframe[Cs(age,sex)] <- NULL # same thing
# The last example uses the Hmisc shortcut quoting
# function Cs. See also the drop parameter to upData.
# 
# Step 5: Make changes to individual variables
# within the data frame
#
# After importing data, the resulting variables are
# seldom self-documenting, so we commonly need to
# change or enhance attributes of individual
# variables within the data frame.
#
# If you are only changing a few variables, it is
# efficient to change them directly without
# attaching the entire data frame.
FEV$sex <- factor(FEV$sex, 0:1, c('female','male'))
FEV$smoke <- factor(FEV$smoke, 0:1,
c('noncurrent smoker','current smoker'))
units(FEV$age) < 'years'
units(FEV$fev) < 'L'
label(FEV$fev) < 'Forced Expiratory Volume'
units(FEV$height) < 'inches'
# When changing more than one or two variables it is
# more convenient to change the data frame using the
# Hmisc upData function.
FEV2 <- upData(FEV,
rename=c(smoking='smoke'),
# omit if renamed above
drop=c('var1','var2'),
levels=list(sex =list(female=0,male=1),
smoke=list('noncurrent smoker'=0,
'current smoker'=1)),
units=list(age='years', fev='L', height='inches'),
labels=list(fev='Forced Expiratory Volume'))
# An alternative to levels=list(...) is for example
# upData(FEV, sex=factor(sex,0:1,c('female','male'))).
#
# Note that we saved the changed data frame into a
# new data frame FEV2. If we were confident of the
# correctness of our changes we could have stored
# the new data frame on top of the old one, under
# the original name FEV.
# 
# Step 6: Check the data frame
#
# The Hmisc describe function is perhaps the first
# function that should be used on the new data
# frame. It provides documentation of all the
# variables and the frequency tabulation, counts of
# NAs, and 5 largest and smallest values are
# helpful in detecting data errors. Typing
# describe(FEV) will write the results to the
# current output window. To put the results in a
# new window that can persist, even upon exiting
# S, we use the page function. The describe
# output can be minimized to an icon but kept ready
# for guiding later steps of the analysis.
page(describe(FEV2), multi=TRUE)
# multi=TRUE allows that window to persist while
# control is returned to other windows
# The new data frame is OK. Store it on top of the
# old FEV and then use the graphical user interface
# to delete FEV2 (click on it and hit the Delete
# key) or type rm(FEV2) after the next statement.
FEV <- FEV2
# Next, we can use a variety of other functions to
# check and describe all of the variables. As we
# are analyzing all or almost all of the variables,
# this is best done without attaching the data
# frame. Note that plot.data.frame plots inverted
# CDFs for continuous variables and dot plots
# showing frequency distributions of categorical
# ones.
summary(FEV)
# basic summary function (summary.data.frame)
plot(FEV) # plot.data.frame
datadensity(FEV)
# rug plots and freq. bar charts for all var.
hist.data.frame(FEV)
# for variables having > 2 values
by(FEV, FEV$smoke, summary)
# use basic summary function with stratification
# 
# Step 7: Do detailed analyses involving individual
# variables
#
# Analyses based on the formula language can use
# data= so attaching the data frame may not be
# required. This saves memory. Here we use the
# Hmisc summary.formula function to compute 5
# statistics on height, stratified separately by age
# quartile and by sex.
options(width=80)
summary(height ~ age + sex, data=FEV,
fun=function(y)c(smean.sd(y),
smedian.hilow(y,conf.int=.5)))
# This computes mean height, S.D., median, outer quartiles
fit <- lm(height ~ age*sex, data=FEV)
summary(fit)
# For this analysis we could also have attached the
# data frame in search position 2. For other
# analyses, it is mandatory to attach the data frame
# unless FEV$ prefixes each variable name.
# Important: DO NOT USE attach(FEV, 1) or
# attach(FEV, pos=1, ...) if you are only analyzing
# and not changing the variables, unless you really
# need to avoid conflicts with variables in search
# position 1 that have the same names as the
# variables in FEV. Attaching into search position
# 1 will cause S-Plus to be more of a memory hog.
attach(FEV)
# Use e.g. attach(FEV[,Cs(age,sex)]) if you only
# want to analyze a small subset of the variables
# Use e.g. attach(FEV[FEV$sex=='male',]) to
# analyze a subset of the observations
summary(height ~ age + sex,
fun=function(y)c(smean.sd(y),
smedian.hilow(y,conf.int=.5)))
fit <- lm(height ~ age*sex)
# Run generic summary function on height and fev,
# stratified by sex
by(data.frame(height,fev), sex, summary)
# Cross-classify into 4 sex x smoke groups
by(FEV, list(sex,smoke), summary)
# Plot 5 quantiles
s <- summary(fev ~ age + sex + height,
fun=function(y)quantile(y,c(.1,.25,.5,.75,.9)))
plot(s, which=1:5, pch=c(1,2,15,2,1), #pch=c('=','[','o',']','='),
main='A Discovery', xlab='FEV')
# Use the nonparametric bootstrap to compute a
# 0.95 confidence interval for the population mean fev
smean.cl.boot(fev) # in Hmisc
# Use the Statistics ... Compare Samples ... One Sample
# keys to get a normal-theory-based C.I. Then do it
# more manually. The following method assumes that
# there are no NAs in fev
sd <- sqrt(var(fev))
xbar <- mean(fev)
xbar
sd
n <- length(fev)
qt(.975,n-1)
# prints 0.975 critical value of t dist. with n-1 d.f.
xbar + c(-1,1)*sd/sqrt(n)*qt(.975,n-1)
# prints confidence limits
# Fit a linear model
# fit <- lm(fev ~ other variables ...)
detach()
# The last command is only needed if you want to
# start operating on another data frame and you want
# to get FEV out of the way.
# 
# Creating data frames from scratch
#
# Data frames can be created from within S. To
# create a small data frame containing ordinary
# data, you can use something like
dframe <- data.frame(age=c(10,20,30),
sex=c('male','female','male'),
stringsAsFactors=TRUE)
# You can also create a data frame using the Data
# Sheet. Create an empty data frame with the
# correct variable names and types, then edit in the
# data.
dd <- data.frame(age=numeric(0),sex=character(0),
stringsAsFactors=TRUE)
# The sex variable will be stored as a factor, and
# levels will be automatically added to it as you
# define new values for sex in the Data Sheet's sex
# column.
#
# When the data frame you need to create is defined
# by systematically varying variables (e.g., all
# possible combinations of values of each variable),
# the expand.grid function is useful for quickly
# creating the data. Then you can add
# non-systematically-varying variables to the object
# created by expand.grid, using programming
# statements or editing the Data Sheet. This
# process is useful for creating a data frame
# representing all the values in a printed table.
# In what follows we create a data frame
# representing the combinations of values from an 8
# x 2 x 2 x 2 (event x method x sex x what) table,
# and add a nonsystematic variable percent to the
# data.
jcetable <- expand.grid(
event=c('Wheezing at any time',
'Wheezing and breathless',
'Wheezing without a cold',
'Waking with tightness in the chest',
'Waking with shortness of breath',
'Waking with an attack of cough',
'Attack of asthma',
'Use of medication'),
method=c('Mail','Telephone'),
sex=c('Male','Female'),
what=c('Sensitivity','Specificity'))
jcetable$percent <-
c(756,618,706,422,356,578,289,333,
576,421,789,273,273,212,212,212,
613,763,713,403,377,541,290,226,
613,684,632,290,387,613,258,129,
656,597,438,780,732,679,938,919,
714,600,494,877,850,703,963,987,
755,420,480,794,779,647,956,941,
766,423,500,833,833,604,955,986) / 10
# In jcetable, event varies most rapidly, then
# method, then sex, and what.
## End(Not run)
These functions are intended to be used to describe how well a given
set of new observations (e.g., new subjects) were represented in a
dataset used to develop a predictive model.
The dataRep
function forms a data frame that contains all the unique
combinations of variable values that existed in a given set of
variable values. Cross-classifications of values are created using
exact values of variables, so for continuous numeric variables it is
often necessary to round them to the nearest v
and to possibly
curtail the values to some lower and upper limit before rounding.
Here v
denotes a numeric constant specifying the matching tolerance
that will be used. dataRep
also stores marginal distribution
summaries for all the variables. For numeric variables, all 101
percentiles are stored, and for all variables, the frequency
distributions are also stored (frequencies are computed after any
rounding and curtailment of numeric variables). For the purposes of
rounding and curtailing, the roundN
function is provided. A print
method will summarize the calculations made by dataRep
, and if
long=TRUE
all unique combinations of values and their frequencies in
the original dataset are printed.
The predict
method for dataRep
takes a new data frame having
variables named the same as the original ones (but whose factor levels
are not necessarily in the same order) and examines the collapsed
cross-classifications created by dataRep
to find how many
observations were similar to each of the new observations after any
rounding or curtailment of limits is done. predict
also does some
calculations to describe how the variable values of the new
observations "stack up" against the marginal distributions of the
original data. For categorical variables, the percent of observations
having a given variable with the value of the new observation (after
rounding for variables that were passed through roundN
in the formula given
to dataRep
) is computed. For numeric variables, the percentile of
the original distribution in which the current value falls will be
computed. For this purpose, the data are not rounded because the 101
original percentiles were retained; linear interpolation is used to
estimate percentiles for values between two tabulated percentiles.
The lowest marginal frequency of matching values across all variables
is also computed. For example, if an age, sex combination matches 10
subjects in the original dataset but the age value matches 100 ages
(after rounding) and the sex value matches the sex code of 300
observations, the lowest marginal frequency is 100, which is a "best
case" upper limit for multivariable matching. I.e., matching on all
variables has to result in a frequency no greater than this amount.
A print
method for the output of predict.dataRep
prints all
calculations done by predict
by default. Calculations can be
selectively suppressed.
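The rounding and curtailment described above can be sketched as follows (an illustration based only on this description, assuming tol is the rounding increment and clip the optional lower/upper limits; the actual roundN implementation may differ in details):

```r
# Sketch of roundN-style rounding: clip to [clip[1], clip[2]] if given,
# then round to the nearest multiple of tol
roundTol <- function(x, tol = 1, clip = NULL) {
  if (length(clip) == 2) x <- pmin(pmax(x, clip[1]), clip[2])
  tol * round(x / tol)
}
roundTol(c(0.12, 0.37, 2.9), tol = 0.25, clip = c(0, 1))
# 0.00 0.25 1.00
```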
dataRep(formula, data, subset, na.action)
roundN(x, tol=1, clip=NULL)
## S3 method for class 'dataRep'
print(x, long=FALSE, ...)
## S3 method for class 'dataRep'
predict(object, newdata, ...)
## S3 method for class 'predict.dataRep'
print(x, prdata=TRUE, prpct=TRUE, ...)
formula 
a formula with no left-hand side. Continuous numeric variables in
need of rounding should appear in the formula as e.g. 
x 
a numeric vector or an object created by 
object 
the object created by 
data , subset , na.action

standard modeling arguments. Default 
tol 
rounding constant (tolerance is actually 
clip 
a 2-vector specifying a lower and upper limit to curtail values of 
long 
set to 
newdata 
a data frame containing all the variables given to 
prdata 
set to 
prpct 
set to 
... 
unused 
dataRep
returns a list of class "dataRep"
containing the collapsed
data frame and frequency counts along with marginal distribution
information. predict
returns an object of class "predict.dataRep"
containing information determined by matching observations in
newdata
with the original (collapsed) data.
print.dataRep
prints.
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com
set.seed(13)
num.symptoms <- sample(1:4, 1000, TRUE)
sex <- factor(sample(c('female','male'), 1000, TRUE))
x <- runif(1000)
x[1] < NA
table(num.symptoms, sex, .25*round(x/.25))
d <- dataRep(~ num.symptoms + sex + roundN(x,.25))
print(d, long=TRUE)
predict(d, data.frame(num.symptoms=1:3, sex=c('male','male','female'),
x=c(.03,.5,1.5)))
Computes the Kish design effect and corresponding intracluster correlation for a single cluster-sampled variable
deff(y, cluster)
y 
variable to analyze 
cluster 
a variable whose unique values indicate cluster membership. Any type of variable is allowed. 
a vector with named elements n
(total number of nonmissing
observations), clusters
(number of clusters after deleting
missing data), rho
(intracluster correlation), and deff
(design effect).
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
set.seed(1)
blood.pressure <- rnorm(1000, 120, 15)
clinic <- sample(letters, 1000, replace=TRUE)
deff(blood.pressure, clinic)
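Kish's approximation relates the design effect to the intracluster correlation and the average cluster size: deff is roughly 1 + (m - 1) * rho. A minimal sketch of that relationship (the computation inside deff accounts for unequal cluster sizes and may differ; the rho value here is purely hypothetical):

```r
# Kish approximation: deff ~ 1 + (m - 1) * rho, m = average cluster size
kish.deff <- function(m, rho) 1 + (m - 1) * rho
# e.g., 1000 observations spread over 26 clusters, hypothetical rho = 0.01
kish.deff(m = 1000/26, rho = 0.01)
```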
describe
is a generic method that invokes describe.data.frame
,
describe.matrix
, describe.vector
, or
describe.formula
. describe.vector
is the basic
function for handling a single variable.
This function determines whether the variable is character, factor,
category, binary, discrete numeric, and continuous numeric, and prints
a concise statistical summary according to each. A numeric variable is
deemed discrete if it has <= 10 distinct values. In this case,
quantiles are not printed. A frequency table is printed
for any nonbinary variable if it has no more than 20 distinct
values. For any variable for which the frequency table is not printed,
the 5 lowest and highest values are printed. This behavior can be
overridden for long character variables with many levels using the
listunique
parameter, to get a complete tabulation.
describe
is especially useful for
describing data frames created by *.get
, as labels, formats,
value labels, and (in the case of sas.get
) frequencies of special
missing values are printed.
For a binary variable, the sum (number of 1's) and mean (proportion of
1's) are printed. If the first argument is a formula, a model frame
is created and passed to describe.data.frame. If a variable
is of class "impute"
, a count of the number of imputed values is
printed. If a date variable has an attribute partial.date
(this is set up by sas.get
), counts of how many partial dates are
actually present (missing month, missing day, missing both) are also presented.
If a variable was created by the specialpurpose function substi
(which
substitutes values of a second variable if the first variable is NA),
the frequency table of substitutions is also printed.
For numeric variables, describe
adds an item called Info
which is a relative information measure using the relative efficiency of
a proportional odds/Wilcoxon test on the variable relative to the same
test on a variable that has no ties. Info
is related to how
continuous the variable is, and ties are less harmful the more untied
values there are. The formula for Info
is one minus the sum of
the cubes of relative frequencies of values divided by one minus the
square of the reciprocal of the sample size. The lowest information
comes from a variable having only one distinct value followed by a
highly skewed binary variable. Info
is reported to
two decimal places.
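The Info formula just described can be sketched directly (an illustrative implementation only; the version used by describe may handle weights and edge cases differently):

```r
# Info = (1 - sum of cubed relative frequencies) / (1 - 1/n^2)
info <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  p <- as.numeric(table(x)) / n   # relative frequencies of distinct values
  (1 - sum(p^3)) / (1 - 1/n^2)
}
info(1:100)        # all values distinct: Info = 1
info(rep(1, 100))  # one distinct value:  Info = 0
```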
A latex method exists for converting the describe
object to a
LaTeX file. For numeric variables having more than 20 distinct values,
describe
saves in its returned object the frequencies of 100
evenly spaced bins running from minimum observed value to the maximum.
When there are less than or equal to 20 distinct values, the original
values are maintained.
latex
and html
insert a spike histogram displaying these
frequency counts in the tabular material using the LaTeX picture
environment. For example output see
https://hbiostat.org/doc/rms/book/chapter7edition1.pdf.
Note that the latex method assumes you have the following styles
installed in your latex installation: setspace and relsize.
The html
method mimics the LaTeX output. This is useful in the
context of Quarto/Rmarkdown html and html notebook output.
If options(prType='html')
is in effect, calling print
on
an object that is the result of running describe
on a data frame
will result in rendering the HTML version. If run from the console a
browser window will open. When which
is specified to
print
, whether or not prType='html'
is in effect, a
gt
package html table will be produced containing only
the types of variables requested. When which='both'
a list with
element names Continuous
and Categorical
is produced,
making it convenient for the user to print as desired, or to pass the
list directed to the qreport
maketabs
function when using Quarto.
The plot
method is for describe
objects run on data
frames. It produces spike histograms for a graphic of
continuous variables and a dot chart for categorical variables, showing
category proportions. The graphic format is ggplot2
if the user
has not set options(grType='plotly')
or has set the grType
option to something other than 'plotly'
. Otherwise plotly
graphics that are interactive are produced, and these can be placed into
an Rmarkdown html notebook. The user must install the plotly
package for this to work. When the user hovers the mouse over a bin for
a raw data value, the actual value will popup (formatted using
digits
). When the user hovers over the minimum data value, most
of the information calculated by describe
will pop up. For each
variable, the number of missing values is used to assign the color to
the histogram or dot chart, and a legend is drawn. Color is not used if
there are no missing values in any variable. For categorical variables,
hovering over the leftmost point for a variable displays details, and
for all points proportions, numerators, and denominators are displayed
in the popup. If both continuous and categorical variables are present
and which='both'
is specified, the plot
method returns an
unclassed list
containing two objects, named 'Categorical'
and 'Continuous'
, in that order.
Sample weights may be specified to any of the functions, resulting in weighted means, quantiles, and frequency tables.
Note: As discussed in Cox and Longton (2008), Stata Technical Bulletin 8(4) pp. 557, the term "unique" has been replaced with "distinct" in the output (but not in parameter names).
When weights
are not used, Gini's mean difference is computed for
numeric variables. This is a robust measure of dispersion that is the
mean absolute difference between any pairs of observations. In simple
output Gini's difference is labeled Gmd
.
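Gini's mean difference is the average absolute difference over all pairs of observations; a compact sketch equivalent in spirit to GiniMd for a simple unweighted numeric vector:

```r
# Gini's mean difference: sum of |x_i - x_j| over all ordered pairs i != j,
# divided by the number of such pairs, n(n-1)
gmd <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (n * (n - 1))
}
gmd(c(1, 2, 3))  # (2*1 + 2*2 + 2*1) / 6 = 4/3
```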
formatdescribeSingle
is a service function for latex
,
html
, and print
methods for single variables that is not
intended to be called by the user.
## S3 method for class 'vector'
describe(x, descript, exclude.missing=TRUE, digits=4,
listunique=0, listnchar=12,
weights=NULL, normwt=FALSE, minlength=NULL, shortmChoice=TRUE,
rmhtml=FALSE, trans=NULL, lumptails=0.01, ...)
## S3 method for class 'matrix'
describe(x, descript, exclude.missing=TRUE, digits=4, ...)
## S3 method for class 'data.frame'
describe(x, descript, exclude.missing=TRUE,
digits=4, trans=NULL, ...)
## S3 method for class 'formula'
describe(x, descript, data, subset, na.action,
digits=4, weights, ...)
## S3 method for class 'describe'
print(x, which = c('both', 'categorical', 'continuous'), ...)
## S3 method for class 'describe'
latex(object, title=NULL,
file=paste('describe',first.word(expr=attr(object,'descript')),'tex',sep='.'),
append=FALSE, size='small', tabular=TRUE, greek=TRUE,
spacing=0.7, lspace=c(0,0), ...)
## S3 method for class 'describe.single'
latex(object, title=NULL, vname,
file, append=FALSE, size='small', tabular=TRUE, greek=TRUE,
lspace=c(0,0), ...)
## S3 method for class 'describe'
html(object, size=85, tabular=TRUE,
greek=TRUE, scroll=FALSE, rows=25, cols=100, ...)
## S3 method for class 'describe.single'
html(object, size=85,
tabular=TRUE, greek=TRUE, ...)
formatdescribeSingle(x, condense=c('extremes', 'frequencies', 'both', 'none'),
lang=c('plain', 'latex', 'html'), verb=0, lspace=c(0, 0),
size=85, ...)
## S3 method for class 'describe'
plot(x, which=c('both', 'continuous', 'categorical'),
what=NULL,
sort=c('ascending', 'descending', 'none'),
n.unique=10, digits=5, bvspace=2, ...)
x 
a data frame, matrix, vector, or formula. For a data frame, the

descript 
optional title to print for x. The default is the name of the argument
or the "label" attributes of individual variables. When the first argument
is a formula, 
exclude.missing 
set to TRUE to print the names of variables that contain only missing values. This list appears at the bottom of the printout, and no space is taken up for such variables in the main listing. 
digits 
number of significant digits to print. For 
listunique 
For a character variable that is not an 
listnchar 
see 
weights 
a numeric vector of frequencies or sample weights. Each observation
will be treated as if it were sampled 
minlength 
value passed to summary.mChoice 
shortmChoice 
set to 
rmhtml 
set to 
trans 
for 
lumptails 
specifies the quantile to use (its complement is also used) for grouping observations in the tails so that outliers have less chance of distorting the variable's range for sparkline spike histograms. The default is 0.01, i.e., observations below the 0.01 quantile are grouped together in the leftmost bin, and observations above the 0.99 quantile are grouped to form the last bin. 
normwt 
The default, 
object 
a result of 
title 
unused 
data 
a data frame, data table, or list 
subset 
a subsetting expression 
na.action 
These are used if a formula is specified. 
... 
arguments passed to 
file 
name of output file (should have a suffix of .tex). Default name is
formed from the first word of the 
append 
set to 
size 
LaTeX text size ( 
tabular 
set to 
greek 
By default, the 
spacing 
By default, the 
lspace 
extra vertical space, in character size units (i.e., "ex"
as appended to the space). When using certain font sizes, there is
too much space left around LaTeX verbatim environments. This
two-vector specifies space to remove (i.e., the values are negated in
forming the 
scroll 
set to 
rows , cols

the number of rows or columns to allocate for the scrollable box 
vname 
unused argument in 
which 
specifies whether to plot numeric continuous or
binary/categorical variables, or both. When 
what 
character or numeric vector specifying which variables to plot; default is to plot all 
sort 
specifies how and whether variables are sorted in order of
the proportion of positives when 
n.unique 
the minimum number of distinct values a numeric variable
must have before 
bvspace 
the between-variable spacing for categorical variables. Defaults to 2, meaning twice the amount of vertical space as what is used for between-category spacing within a variable 
condense 
specifies whether to condense the output with regard to
the 5 lowest and highest values ( 
lang 
specifies the markup language 
verb 
set to 1 if a verbatim environment is already in effect for LaTeX 
If options(na.detail.response=TRUE)
has been set and na.action
is "na.delete"
or
"na.keep"
, summary statistics on
the response variable are printed separately for missing and nonmissing
values of each predictor. The default summary function returns
the number of nonmissing response values and the mean of the last
column of the response values, with a names
attribute of
c("N","Mean")
.
When the response is a Surv
object and the mean is used, this will
result in the crude proportion of events being used to summarize
the response. The actual summary function can be designated through
options(na.fun.response = "function name")
.
If you are modifying LaTeX parskip
or certain other parameters,
you may need to shrink the area around tabular
and
verbatim
environments produced by latex.describe
. You can
do this using for example
\usepackage{etoolbox}\makeatletter\preto{\@verbatim}{\topsep=1.4pt
\partopsep=0pt}\preto{\@tabular}{\parskip=2pt
\parsep=0pt}\makeatother
in the LaTeX preamble.
a list containing elements descript
, counts
,
values
. The list is of class describe
. If the input
object was a matrix or a data
frame, the list is a list of lists, one list for each variable
analyzed. latex
returns a standard latex
object. For numeric
variables having at least 20 distinct values, an additional component
intervalFreq
. This component is a list with two elements, range
(containing two values) and count
, a vector of 100 integer frequency
counts. print
with which=
returns a 'gt' table object.
The user can modify the table by piping formatting changes, column
removals, and other operations, before final rendering.
Frank Harrell
Vanderbilt University
fh@fharrell.com
spikecomp
, sas.get
, quantile
,
GiniMd
,
table
, summary
,
model.frame.default
,
naprint
, lapply
, tapply
,
Surv
, na.delete
,
na.keep
,
na.detail.response
, latex
set.seed(1)
describe(runif(200),dig=2) #single variable, continuous
# get quantiles .05, .10, ...
dfr <- data.frame(x=rnorm(400), y=sample(c('male','female'), 400, TRUE))
describe(dfr)
## Not run:
options(grType='plotly')
d <- describe(mydata)
p < plot(d) # create plots for both types of variables
p[[1]]; p[[2]] # or p$Categorical; p$Continuous
plotly::subplot(p[[1]], p[[2]], nrows=2) # plot both in one
plot(d, which='categorical') # categorical ones
d <- sas.get(".", "mydata", special.miss=TRUE, recode=TRUE)
describe(d) #describe entire data frame
attach(d, 1)
describe(relig) #Has special missing values .D .F .M .R .T
#attr(relig,"label") is "Religious preference"
#relig : Religious preference Format:relig
# n missing D F M R T distinct
# 4038 263 45 33 7 2 1 8
#
#0:none (251, 6%), 1:Jewish (372, 9%), 2:Catholic (1230, 30%)
#3:Jehovah's Witnes (25, 1%), 4:Christ Scientist (7, 0%)
#5:Seventh Day Adv (17, 0%), 6:Protestant (2025, 50%), 7:other (111, 3%)
# Method for describing part of a data frame:
describe(death.time ~ age*sex + rcs(blood.pressure))
describe(~ age+sex)
describe(~ age+sex, weights=freqs) # weighted analysis
fit < lrm(y ~ age*sex + log(height))
describe(formula(fit))
describe(y ~ age*sex, na.action=na.delete)
# report on number deleted for each variable
options(na.detail.response=TRUE)
# keep missings separately for each x, report on dist of y by x=NA
describe(y ~ age*sex)
options(na.fun.response="quantile")
describe(y ~ age*sex) # same but use quantiles of y by x=NA
d <- describe(my.data.frame)
d$age # print description for just age
d[c('age','sex')] # print description for two variables
d[sort(names(d))] # print in alphabetic order by var. names
d2 <- d[20:30] # keep variables 20-30
page(d2) # popup window for these variables
# Test date/time formats and suppression of times when they don't vary
library(chron)
d <- data.frame(a=chron((1:20)+.1),
b=chron((1:20)+(1:20)/100),
d=ISOdatetime(year=rep(2003,20),month=rep(4,20),day=1:20,
hour=rep(11,20),min=rep(17,20),sec=rep(11,20)),
f=ISOdatetime(year=rep(2003,20),month=rep(4,20),day=1:20,
hour=1:20,min=1:20,sec=1:20),
g=ISOdate(year=2001:2020,month=rep(3,20),day=1:20))
describe(d)
# Make a function to run describe, latex.describe, and use the kdvi
# previewer in Linux to view the result and easily make a pdf file
ldesc <- function(data) {
options(xdvicmd='kdvi')
d <- describe(data, desc=deparse(substitute(data)))
dvi(latex(d, file='/tmp/z.tex'), nomargins=FALSE, width=8.5, height=11)
}
ldesc(d)
## End(Not run)
discrete
creates a discrete vector which is distinct from a
continuous vector, or a factor/ordered vector.
The other functions are tools for manipulating discrete vectors.
as.discrete(x, ...)
## Default S3 method:
as.discrete(x, ...)
discrete(x, levels = sort(unique.default(x), na.last = TRUE), exclude = NA)
## S3 replacement method for class 'discrete'
x[...] <- value
## S3 method for class 'discrete'
x[..., drop = FALSE]
## S3 method for class 'discrete'
x[[i]]
is.discrete(x)
## S3 replacement method for class 'discrete'
is.na(x) <- value
## S3 replacement method for class 'discrete'
length(x) <- value
x 
a vector 
drop 
Should unused levels be dropped. 
exclude 
logical: should 
i 
indexing vector 
levels 
character: list of individual level values 
value 
index of elements to set to 
... 
arguments to be passed to other functions 
as.discrete
converts a vector into a discrete vector.
discrete
creates a discrete vector from provided values.
is.discrete
tests to see if the vector is a discrete vector.
as.discrete
, discrete
returns a vector of
discrete
type.
is.discrete
returns logical TRUE
if the vector is of
class discrete, otherwise it returns FALSE
.
Charles Dupont
a <- discrete(1:25)
a
is.discrete(a)
b <- as.discrete(2:4)
b
dotchart2
is an enhanced version of the dotchart
function
with several new options.
dotchart2(data, labels, groups=NULL, gdata=NA, horizontal=TRUE, pch=16,
xlab='', ylab='', xlim=NULL, auxdata, auxgdata=NULL, auxtitle,
lty=1, lines=TRUE, dotsize = .8,
cex = par("cex"), cex.labels = cex,
cex.group.labels = cex.labels*1.25, sort.=TRUE,
add=FALSE, dotfont=par('font'), groupfont=2,
reset.par=add, xaxis=TRUE, width.factor=1.1,
lcolor='gray', leavepar=FALSE,
axisat=NULL, axislabels=NULL, ...)
data 
a numeric vector whose values are shown on the xaxis 
labels 
a vector of labels for each point, corresponding to

groups 
an optional categorical variable indicating how

gdata 
data values for groups, typically summaries such as group medians 
horizontal 
set to 
pch 
default character number or value for plotting dots in dot charts. The default is 16. 
xlab 
xaxis title 
ylab 
yaxis title 
xlim 
xaxis limits. Applies only to 
auxdata 
a vector of auxiliary data given to 
auxgdata 
similar to 
auxtitle 
if 
lty 
line type for horizontal lines. Default is 1 for R, 2 for S-Plus 
lines 
set to 
dotsize 

cex 
see 
cex.labels 

cex.group.labels 
value of 
sort. 
set to 
add 
set to 
dotfont 
font number of plotting dots. Default is one. Use 
groupfont 
font number to use in drawing 
reset.par 
set to 
xaxis 
set to 
width.factor 
When the calculated left margin turns out to be faulty, specify a
factor by which to multiply the left margin as 
lcolor 
color for horizontal reference lines. Default is 
leavepar 
set to 
axisat 
a vector of tick mark locations to pass to 
axislabels 
a vector of strings specifying axis tick mark labels. Useful if transforming the data axis 
... 
arguments passed to 
dotchart
will leave par
altered if reset.par=FALSE
.
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
set.seed(135)
maj <- factor(c(rep('North',13), rep('South',13)))
g <- paste('Category', rep(letters[1:13], 2))
n <- sample(1:15000, 26, replace=TRUE)
y1 <- runif(26)
y2 <- pmax(0, y1 - runif(26, 0, .1))
dotchart2(y1, g, groups=maj, auxdata=n, auxtitle='n', xlab='Y')
dotchart2(y2, g, groups=maj, pch=17, add=TRUE)
## Compare with dotchart function (no superpositioning or auxdata allowed):
## dotchart(y1, g, groups=maj, xlab='Y')
## To plot using a transformed scale add for example
## axisat=sqrt(pretty(y)), axislabels=pretty(y)
These are adaptations of the R dotchart function that sorts categories
top to bottom, adds auxdata
and auxtitle
arguments to put
extra information in the right margin, and for dotchart3
adds
arguments cex.labels
, cex.group.labels
, and
groupfont
. By default, group headings are in a larger, bold
font. dotchart3
also cuts a bit of white space from the top and
bottom of the chart. The most significant change, however, is in how
x
is interpreted. Columns of x
no longer provide an
alternate way to define groups. Instead, they define superpositioned
values. This is useful for showing three quartiles, for example. Going
along with this change, for dotchart3
pch
can now be a
vector specifying symbols to use going across columns of x
.
x
was changed in this way because to put multiple points on a
line (e.g., quartiles) and keeping track of par()
parameters when
dotchart2
was called with add=TRUE
was cumbersome.
dotchart3
changes the margins to account for horizontal labels.
dotchartp
is a version of dotchart3
for making the chart
with the plotly
package.
summaryD
creates aggregate data using summarize
and
calls dotchart3
with suitable arguments to summarize data by
major and minor categories. If options(grType='plotly')
is in
effect and the plotly
package is installed, summaryD
uses
dotchartp
instead of dotchart3
.
summaryDp
is a streamlined summaryD
-like function that
uses the dotchartpl
function to render a plotly
graphic.
It is used to compute summary statistics stratified separately by a
series of variables.
dotchart3(x, labels = NULL, groups = NULL, gdata = NULL,
cex = par("cex"), pch = 21, gpch = pch, bg = par("bg"),
color = par("fg"), gcolor = par("fg"), lcolor = "gray",
xlim = range(c(x, gdata), na.rm=TRUE), main = NULL, xlab = NULL,
ylab = NULL, auxdata = NULL, auxtitle = NULL, auxgdata=NULL,
axisat=NULL, axislabels=NULL,
cex.labels = cex, cex.group.labels = cex.labels * 1.25,
cex.auxdata=cex, groupfont = 2,
auxwhere=NULL, height=NULL, width=NULL, ...)
dotchartp(x, labels = NULL, groups = NULL, gdata = NULL,
xlim = range(c(x, gdata), na.rm=TRUE), main=NULL,
xlab = NULL, ylab = '', auxdata=NULL, auxtitle=NULL,
auxgdata=NULL, auxwhere=c('right', 'hover'),
symbol='circle', col=colorspace::rainbow_hcl,
legendgroup=NULL,
axisat=NULL, axislabels=NULL, sort=TRUE, digits=4, dec=NULL,
height=NULL, width=700, layoutattr=FALSE, showlegend=TRUE, ...)
summaryD(formula, data=NULL, fun=mean, funm=fun,
groupsummary=TRUE, auxvar=NULL, auxtitle='',
auxwhere=c('hover', 'right'),
vals=length(auxvar) > 0, fmtvals=format,
symbol=if(use.plotly) 'circle' else 21,
col=if(use.plotly) colorspace::rainbow_hcl else 1:10,
legendgroup=NULL,
cex.auxdata=.7, xlab=v[1], ylab=NULL,
gridevery=NULL, gridcol=gray(.95), sort=TRUE, ...)
summaryDp(formula,
fun=function(x) c(Mean=mean(x, na.rm=TRUE),
N=sum(! is.na(x))),
overall=TRUE, xlim=NULL, xlab=NULL,
data=NULL, subset=NULL, na.action=na.retain,
ncharsmax=c(50, 30),
digits=4, ...)
x 
a numeric vector or matrix 
labels 
labels for categories corresponding to rows of

groups , gdata , cex , pch , gpch , bg , color , gcolor , lcolor , xlim , main , xlab , ylab

see 
auxdata 
a vector of information to be put in the right margin,
in the same order as 
auxtitle 
a column heading for 
auxgdata 
similar to 
axisat 
a vector of tick mark locations to pass to 
axislabels 
a vector of strings specifying axis tick mark labels. Useful if transforming the data axis 
digits 
number of significant digits for formatting numeric data
in hover text for 
dec 
for 
cex.labels 

cex.group.labels 

cex.auxdata 

groupfont 
font number for group headings 
auxwhere 
for 
... 
other arguments passed to some of the graphics functions,
or to 
layoutattr 
set to 
showlegend 
set to 
formula 
a formula with one variable on the left hand side (the
variable to compute summary statistics on), and one or two
variables on the right hand side. If there are two variables,
the first is taken as the major grouping variable. If the left
hand side variable is a matrix it has to be a legal R variable
name, not an expression, and 
data 
a data frame or list used to find the variables in

fun 
a summarization function creating a single number from a
vector. Default is the mean. For 
funm 
applies if there are two right hand variables and

groupsummary 
By default, when there are two right-hand
variables, 
auxvar 
when 
vals 
set to 
fmtvals 
an optional function to format values before putting
them in the right margin. Default is the 
symbol 
a scalar or vector of 
col 
a function or vector of colors to assign to multiple points plotted in one line. If a function it will be evaluated with an argument equal to the number of groups/columns. 
legendgroup 
see 
gridevery 
specify a positive number to draw very faint vertical
grid lines every 
gridcol 
color for grid lines; default is very faint gray scale 
sort 
specify 
height , width

height and width in pixels for 
overall 
set to 
subset 
an observation subsetting expression 
na.action 
an 
ncharsmax 
a 2-vector specifying the number of characters after which an HTML newline character should be placed, respectively for the x-axis label and the stratification variable levels 
the function returns invisibly
Frank Harrell
dotchart
,dotchart2
,summarize
,
rlegend
set.seed(135)
maj <- factor(c(rep('North',13), rep('South',13)))
g <- paste('Category', rep(letters[1:13], 2))
n <- sample(1:15000, 26, replace=TRUE)
y1 <- runif(26)
y2 <- pmax(0, y1 - runif(26, 0, .1))
dotchart3(cbind(y1,y2), g, groups=maj, auxdata=n, auxtitle='n',
xlab='Y', pch=c(1,17))
## Compare with dotchart function (no superpositioning or auxdata allowed):
## dotchart(y1, g, groups=maj, xlab='Y')
## Not run:
dotchartp(cbind(y1, y2), g, groups=maj, auxdata=n, auxtitle='n',
xlab='Y', gdata=cbind(c(0,.1), c(.23,.44)), auxgdata=c(1,2),
symbol=c('circle', 'linensopen'))
summaryDp(sbp ~ region + sex + race + cut2(age, g=5), data=mydata)
## End(Not run)
## Put options(grType='plotly') to have the following use dotchartp
## (rlegend will not apply)
## Add argument auxwhere='hover' to summaryD or dotchartp to put
## aux info in hover text instead of right margin
summaryD(y1 ~ maj + g, xlab='Mean')
summaryD(y1 ~ maj + g, groupsummary=FALSE)
summaryD(y1 ~ g, fmtvals=function(x) sprintf('%4.2f', x))
Y <- cbind(y1, y2) # summaryD cannot handle cbind(...) ~ ...
summaryD(Y ~ maj + g, fun=function(y) y[1,], symbol=c(1,17))
rlegend(.1, 26, c('y1','y2'), pch=c(1,17))
summaryD(y1 ~ maj, fun=function(y) c(Mean=mean(y), n=length(y)),
auxvar='n', auxtitle='N')
This function produces a plotly
interactive graphic and accepts
a different format of data input than the other dotchart
x
functions. It was written to handle a hierarchical data structure
including strata that further subdivide the main classes. Strata,
indicated by the mult
variable, are shown on the same
horizontal line, and if the variable big
is FALSE
will
appear slightly below the main line, using smaller symbols, and having
some transparency. This is intended to handle output such as that
from the summaryP
function when there is a superpositioning
variable group
and a stratification variable mult
,
especially when the data have been run through the addMarginal
function to create mult
categories labelled "All"
for
which the user will specify big=TRUE
to indicate nonstratified
estimates (stratified only on group
) to emphasize.
When viewing graphics that used mult
and big
, the user
can click on the legends for the small points for group
s to
vanish the finely stratified estimates.
When group
is used but mult
and big
are not, and
when the group
variable has exactly two distinct values, you
can specify refgroup
to get the difference between two
proportions in addition to the individual proportions. The individual
proportions are plotted, but confidence intervals for the difference
are shown in hover text and halfwidth confidence intervals for the
difference, centered at the midpoint of the proportions, are shown.
These have the property of intersecting the two proportions if and
only if there is no significant difference at the 1  conf.int
level.
Specify fun=exp
and ifun=log
if estimates and confidence
limits are on the log scale. Make sure that zeros were prevented in
the original calculations. For exponential hazard rates this can be
accomplished by replacing event counts of 0 with 0.5.
dotchartpl(x, major=NULL, minor=NULL, group=NULL, mult=NULL,
big=NULL, htext=NULL, num=NULL, denom=NULL,
numlabel='', denomlabel='',
fun=function(x) x, ifun=function(x) x, op='',
lower=NULL, upper=NULL,
refgroup=NULL, sortdiff=TRUE, conf.int=0.95,
minkeep=NULL, xlim=NULL, xlab='Proportion',
tracename=NULL, limitstracename='Limits',
nonbigtracename='Stratified Estimates',
dec=3, width=800, height=NULL,
col=colorspace::rainbow_hcl)
x 
a numeric vector used for values on the 
major 
major vertical category, e.g., variable labels 
minor 
minor vertical category, e.g. category levels within variables 
group 
superpositioning variable such as treatment 
mult 
strata names for further subdivisions without

big 
omit if all levels of 
htext 
additional hover text per point 
num 
if 
denom 
like 
numlabel 
character string to put to the right of the numerator in hover text 
denomlabel 
character string to put to the right of the denominator in hover text 
fun 
a transformation to make when printing estimates. For
example, one may specify 
ifun 
inverse transformation of 
op 
set to for example 
lower 
lower limits for optional error bars 
upper 
upper limits for optional error bars 
refgroup 
if 
sortdiff 

conf.int 
confidence level for computing confidence intervals
for the difference in two proportions. Specify 
minkeep 
if 
xlim 

xlab 

tracename 

limitstracename 

nonbigtracename 

col 
a function or vector of colors to assign to 
dec 
number of places to the right of the decimal place for formatting numeric quantities in hover text 
width 
width of plot in pixels 
height 
height of plot in pixels; computed from number of strata by default 
a plotly
object. An attribute levelsRemoved
is
added if minkeep
is used and any categories were omitted from
the plot as a result. This is a character vector with categories
removed. If major
is present, the strings are of the form
major:minor
Frank Harrell
## Not run:
set.seed(1)
d <- expand.grid(major=c('Alabama', 'Alaska', 'Arkansas'),
                 minor=c('East', 'West'),
                 group=c('Female', 'Male'),
                 city=0:2)
n <- nrow(d)
d$num   <- round(100*runif(n))
d$denom <- d$num + round(100*runif(n))
d$x     <- d$num / d$denom
d$lower <- d$x - runif(n)
d$upper <- d$x + runif(n)
with(d,
 dotchartpl(x, major, minor, group, city, lower=lower, upper=upper,
            big=city==0, num=num, denom=denom, xlab='x'))
# Show half-width confidence intervals for Female - Male differences
# after subsetting the data to have only one record per
# state/region/group
d <- subset(d, city == 0)
with(d,
 dotchartpl(x, major, minor, group, num=num, denom=denom,
            lower=lower, upper=upper, refgroup='Male')
)
n <- 500
set.seed(1)
d <- data.frame(
  race         = sample(c('Asian', 'Black/AA', 'White'), n, TRUE),
  sex          = sample(c('Female', 'Male'), n, TRUE),
  treat        = sample(c('A', 'B'), n, TRUE),
  smoking      = sample(c('Smoker', 'Nonsmoker'), n, TRUE),
  hypertension = sample(c('Hypertensive', 'Non-Hypertensive'), n, TRUE),
  region       = sample(c('North America', 'Europe', 'South America',
                          'Europe', 'Asia', 'Central America'), n, TRUE))
d <- upData(d, labels=c(race='Race', sex='Sex'))
dm <- addMarginal(d, region)
s <- summaryP(race + sex + smoking + hypertension ~
              region + treat, data=dm)
s$region <- ifelse(s$region == 'All', 'All Regions', as.character(s$region))
with(s,
 dotchartpl(freq / denom, major=var, minor=val, group=treat, mult=region,
            big=region == 'All Regions', num=freq, denom=denom)
)
s2 <- s[- attr(s, 'rows.to.exclude1'), ]
with(s2,
 dotchartpl(freq / denom, major=var, minor=val, group=treat, mult=region,
            big=region == 'All Regions', num=freq, denom=denom)
)
# Note these plots can be created by plot.summaryP when options(grType='plotly')
# Plot hazard rates and ratios with confidence limits, on log scale
d <- data.frame(tx=c('a', 'a', 'b', 'b'),
                event=c('MI', 'stroke', 'MI', 'stroke'),
                count=c(10, 5, 5, 2),
                exposure=c(1000, 1000, 900, 900))
# There were no zero event counts in this dataset. In general we
# want to handle that, hence the 0.5 below
d <- upData(d, hazard = pmax(0.5, count) / exposure,
               selog  = sqrt(1. / pmax(0.5, count)),
               lower  = log(hazard) - 1.96 * selog,
               upper  = log(hazard) + 1.96 * selog)
with(d,
 dotchartpl(log(hazard), minor=event, group=tx, num=count, denom=exposure,
            lower=lower, upper=upper,
            fun=exp, ifun=log, op='/',
            numlabel='events', denomlabel='years',
            refgroup='a', xlab='Events Per Person-Year')
)
## End(Not run)
Computation of Coordinates of Extended Box Plots Elements
ebpcomp(x, qref = c(0.5, 0.25, 0.75), probs = c(0.05, 0.125, 0.25, 0.375))
x 
a numeric variable 
qref 
quantiles for major corners 
probs 
quantiles for minor corners 
For an extended box plot, computes all the elements needed to plot it. This is typically used when adding extended box plots to a ggplot2
plot.
list with elements segments
, lines
, points
, points2
Frank Harrell
ebpcomp(1:1000)
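The components of the returned list can be inspected directly; a minimal sketch using only the element names documented in the Value section above (their internal column layout may vary by Hmisc version):

```r
library(Hmisc)   # provides ebpcomp

# Compute extended box plot coordinates for a sample and confirm
# the documented components are present
e <- ebpcomp(rnorm(1000))
names(e)  # includes segments, lines, points, points2
```

Each component holds coordinates intended to be mapped to the corresponding ggplot2 geoms when building the plot manually.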
Computes coordinates of the cumulative distribution function of x, and by default
plots it as a step function. A grouping variable may be specified so that
stratified estimates are computed and (by default) plotted. If there is
more than one group, the labcurve
function is used (by default) to label
the multiple step functions or to draw a legend defining line types, colors,
or symbols by linking them with group labels. A weights
vector may
be specified to get weighted estimates. Specify normwt
to make
weights
sum to the length of x
(after removing NAs); otherwise
the total sample size is taken to be the sum of the weights.
Ecdf
is actually a method, and Ecdf.default
is what's
called for a vector argument. Ecdf.data.frame
is called when the
first argument is a data frame. This function can automatically set up
a matrix of ECDFs and wait for a mouse click if the matrix requires more
than one page. Categorical variables, character variables, and
variables having fewer than a set number of unique values are ignored.
If par(mfrow=..)
is not set up before Ecdf.data.frame
is
called, the function will try to figure the best layout depending on the
number of variables in the data frame. Upon return the original
mfrow
is left intact.
When the first argument to Ecdf
is a formula, a Trellis/Lattice function
Ecdf.formula
is called. This allows for multi-panel
conditioning, superposition using a groups
variable, and other
Trellis features, along with the ability to easily plot transformed
ECDFs using the fun
argument. For example, if fun=qnorm
,
the inverse normal transformation will be used for the y-axis. If the
transformed curves are linear this indicates normality. Like the
xYplot
function, Ecdf
will create a function Key
if
the groups
variable is used. This function can be invoked by the
user to define the keys for the groups.
Ecdf(x, ...)
## Default S3 method:
Ecdf(x, what=c('F','1-F','f','1-f'),
weights=rep(1, length(x)), normwt=FALSE,
xlab, ylab, q, pl=TRUE, add=FALSE, lty=1,
col=1, group=rep(1,length(x)), label.curves=TRUE, xlim,
subtitles=TRUE, datadensity=c('none','rug','hist','density'),
side=1,
frac=switch(datadensity,none=NA,rug=.03,hist=.1,density=.1),
dens.opts=NULL, lwd=1, log='', ...)
## S3 method for class 'data.frame'
Ecdf(x, group=rep(1,nrows),
weights=rep(1, nrows), normwt=FALSE,
label.curves=TRUE, n.unique=10, na.big=FALSE, subtitles=TRUE,
vnames=c('labels','names'),...)
## S3 method for class 'formula'
Ecdf(x, data=sys.frame(sys.parent()), groups=NULL,
prepanel=prepanel.Ecdf, panel=panel.Ecdf, ..., xlab,
ylab, fun=function(x)x, what=c('F','1-F','f','1-f'), subset=TRUE)
x 
a numeric vector, data frame, or Trellis/Lattice formula 
what 
The default is 
weights 
numeric vector of weights. Omit or specify a zerolength vector or NULL to get unweighted estimates. 
normwt 
see above 
xlab 
x-axis label. Default is label(x) or name of calling argument. For

ylab 
y-axis label. Default is
q 
a vector for quantiles for which to draw reference lines on the plot. Default is not to draw any. 
pl 
set to F to omit the plot, to just return estimates 
add 
set to TRUE to add the cdf to an existing plot. Does not apply if using lattice graphics (i.e., if a formula is given as the first argument). 
lty 
integer line type for plot. If 
lwd 
line width for plot. Can be a vector corresponding to 
log 
see 
col 
color for step function. Can be a vector. 
group 
a numeric, character, or 
label.curves 
applies if more than one 
xlim 
x-axis limits. Default is entire range of
subtitles 
set to 
datadensity 
If 
side 
If 
frac 
passed to 
dens.opts 
a list of optional arguments for 
... 
other parameters passed to plot if add=F. For data frames, other
parameters to pass to 
n.unique 
minimum number of unique values before an ECDF is drawn for a variable in a data frame. Default is 10. 
na.big 
set to 
vnames 
By default, variable labels are used to label x-axes. Set
method 
method for computing the empirical cumulative distribution. See

fun 
a function to transform the cumulative proportions, for the
Trellistype usage of 
data , groups , subset , prepanel , panel

the usual Trellis/Lattice parameters, with 
for Ecdf.default
an invisible list with elements x and y giving the
coordinates of the cdf. If there is more than one group
, a list of
such lists is returned. An attribute, N
, is in the returned
object. It contains the elements n
and m
, the number of
nonmissing and missing observations, respectively.
plots
Frank Harrell
Department of Biostatistics, Vanderbilt University
fh@fharrell.com
wtd.Ecdf
, label
, table
, cumsum
, labcurve
, xYplot
, histSpike
set.seed(1)
ch <- rnorm(1000, 200, 40)
Ecdf(ch, xlab="Serum Cholesterol")
scat1d(ch) # add rug plot
histSpike(ch, add=TRUE, frac=.15) # add spike histogram
# Better: add a data density display automatically:
Ecdf(ch, datadensity='density')
label(ch) <- "Serum Cholesterol"
Ecdf(ch)
other.ch <- rnorm(500, 220, 20)
Ecdf(other.ch, add=TRUE, lty=2)
sex <- factor(sample(c('female','male'), 1000, TRUE))
Ecdf(ch, q=c(.25,.5,.75)) # show quartiles
Ecdf(ch, group=sex,
label.curves=list(method='arrow'))
# Example showing how to draw multiple ECDFs from paired data
pre.test  <- rnorm(100, 50, 10)
post.test <- rnorm(100, 55, 10)
x <- c(pre.test, post.test)
g <- c(rep('Pre', length(pre.test)), rep('Post', length(post.test)))
Ecdf(x, group=g, xlab='Test Results', label.curves=list(keys=1:2))
# keys=1:2 causes symbols to be drawn periodically on top of curves
# Draw a matrix of ECDFs for a data frame
m <- data.frame(pre.test, post.test,
sex=sample(c('male','female'),100,TRUE))
Ecdf(m, group=m$sex, datadensity='rug')
freqs <- sample(1:10, 1000, TRUE)
Ecdf(ch, weights=freqs) # weighted estimates
# Trellis/Lattice examples:
region <- factor(sample(c('Europe','USA','Australia'), 1000, TRUE))
year <- factor(sample(2001:2002, 1000, TRUE))
Ecdf(~ch | region*year, groups=sex)
Key() # draw a key for sex at the default location
# Key(locator(1)) # user-specified positioning of key
age <- rnorm(1000, 50, 10)
Ecdf(~ch | lattice::equal.count(age), groups=sex) # use overlapping shingles
Ecdf(~ch | sex, datadensity='hist', side=3) # add spike histogram at top
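The fun argument described above can also be exercised directly; a small sketch (not from the original examples) using the inverse-normal transform:

```r
library(Hmisc)
set.seed(3)
chol <- rnorm(300, 200, 40)
# fun=qnorm transforms cumulative probabilities to the inverse-normal
# scale; an approximately straight step function suggests normality
Ecdf(~ chol, fun=qnorm, ylab='Inverse Normal ECDF')
```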
Compute Coordinates of an Empirical Distribution Function
ecdfSteps(x, extend)
x 
numeric vector, possibly with 
extend 
a 2-vector to extend the range of x (low, high). Set
For a numeric vector, uses the R built-in ecdf
function to compute
coordinates of the ECDF, with extension slightly below and above the
range of x
by default. This is useful for ggplot2
where the ECDF may need to be transformed. The returned object is suitable for creating stratified statistics using data.table
and other methods.
a list with components x
and y
Frank Harrell
ecdfSteps(0:10)
## Not run:
# Use data.table for obtaining ECDFs by country and region
w <- d[, ecdfSteps(z, extend=c(-1, 11)), by=.(country, region)] # d is a data.table
# Use ggplot2 to make one graph with multiple regions' ECDFs
# and use faceting for countries
ggplot(w, aes(x, y, color=region)) + geom_step() +
facet_wrap(~ country)
## End(Not run)
Expands the widths of either the supercolumns or the subcolumns so that the sum of the supercolumn widths is the same as the sum of the subcolumn widths.
equalBins(widths, subwidths)
widths 
widths of the supercolumns. 
subwidths 
list of widths of the subcolumns for each supercolumn. 
This determines the correct subwidths of each of the various columns in a table for printing. The correct width of a multicolumn is determined by summing the widths of its subcolumns.
widths of the columns for a table.
Charles Dupont
mcols <- c("Group 1", "Group 2")
mwidth <- nchar(mcols, type="width")
spancols <- c(3,3)
ccols <- c("a", "deer", "ad", "cat", "help", "bob")
cwidth <- nchar(ccols, type="width")
subwidths <- partition.vector(cwidth, spancols)
equalBins(mwidth, subwidths)
Add vertical error bars to an existing plot or makes a new plot with error bars.
errbar(x, y, yplus, yminus, cap=0.015, main = NULL,
sub=NULL, xlab=as.character(substitute(x)),
ylab=if(is.factor(x) || is.character(x)) ""
else as.character(substitute(y)),
add=FALSE, lty=1, type='p', ylim=NULL,
lwd=1, pch=16, errbar.col, Type=rep(1, length(y)),
...)
x 
vector of numeric x-axis values (for vertical error bars) or a factor or
character variable (for horizontal error bars, 
y 
vector of y-axis values. 
yplus 
vector of y-axis values: the tops of the error bars. 
yminus 
vector of y-axis values: the bottoms of the error bars. 
cap 
the width of the little lines at the tops and bottoms of the error bars
in units of the width of the plot. Defaults to 
main 
a main title for the plot, passed to 
sub 
a sub title for the plot, passed to 
xlab 
optional xaxis labels if 
ylab 
optional yaxis labels if 
add 
set to 
lty 
type of line for error bars 
type 
type of point. Use 
ylim 
y-axis limits. Default is to use range of 
lwd 
line width for line segments (not main line) 
pch 
character to use as the point. 
errbar.col 
color to use for drawing error bars. 
Type 
used for horizontal bars only. Is an integer vector with values 
... 
other parameters passed to all graphics functions. 
errbar
adds vertical error bars to an existing plot or makes a new
plot with error bars. It can also make a horizontal error bar plot
that shows error bars for group differences as well as bars for
groups. For the latter type of plot, the lower x-axis scale
corresponds to group estimates and the upper scale corresponds to
differences. The spacings of the two scales are identical but the
scale for differences has its origin shifted so that zero may be
included. If at least one of the confidence intervals includes zero,
a vertical dotted reference line at zero is drawn.
Charles Geyer, University of Chicago. Modified by Frank Harrell,
Vanderbilt University, to handle missing data, to add the parameters
add
and lty
, and to implement horizontal charts with differences.
set.seed(1)
x <- 1:10
y <- x + rnorm(10)
delta <- runif(10)
errbar( x, y, y + delta, y - delta )
# Show bootstrap nonparametric CLs for 3 group means and for
# pairwise differences on same graph
group <- sample(c('a','b','d'), 200, TRUE)
y <- runif(200) + .25*(group=='b') + .5*(group=='d')
cla <- smean.cl.boot(y[group=='a'], B=100, reps=TRUE) # usually B=1000
a <- attr(cla, 'reps')
clb <- smean.cl.boot(y[group=='b'], B=100, reps=TRUE)
b <- attr(clb, 'reps')
cld <- smean.cl.boot(y[group=='d'], B=100, reps=TRUE)
d <- attr(cld, 'reps')
a.b <- quantile(a-b, c(.025,.975))
a.d <- quantile(a-d, c(.025,.975))
b.d <- quantile(b-d, c(.025,.975))
errbar(c('a','b','d','a - b','a - d','b - d'),
       c(cla[1],clb[1],cld[1],cla[1]-clb[1],cla[1]-cld[1],clb[1]-cld[1]),
       c(cla[3],clb[3],cld[3],a.b[2],a.d[2],b.d[2]),
       c(cla[2],clb[2],cld[2],a.b[1],a.d[1],b.d[1]),
       Type=c(1,1,1,2,2,2), xlab='', ylab='')
Escapes any characters that would have special meaning in a regular expression.
escapeRegex(string)
escapeBS(string)
string 
string being operated on. 
escapeRegex
will escape any characters that would have
special meaning in a regular expression. For any string,
grep(escapeRegex(string), string)
will always be true.
escapeBS
will escape any backslash ‘\’ in a string.
The value of the string with any characters that would have special meaning in a regular expression escaped.
Charles Dupont
Department of Biostatistics
Vanderbilt University
string <- "this\\(system) {is} [full]."
escapeRegex(string)
escapeBS(string)
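The self-match guarantee stated in the Details section can be verified directly:

```r
library(Hmisc)
s <- "cost: $5 (approx.)"
# Unescaped, '$', '(', ')' and '.' are regex metacharacters; after
# escaping, the string always matches itself literally
grepl(escapeRegex(s), s)  # TRUE
```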
Simulate Comparisons For Use in Sequential Markov Longitudinal Clinical Trial Simulations
estSeqMarkovOrd(
y,
times,
initial,
absorb = NULL,
intercepts,
parameter,
looks,
g,
formula,
ppo = NULL,
yprevfactor = TRUE,
groupContrast = NULL,
cscov = FALSE,
timecriterion = NULL,
coxzph = FALSE,
sstat = NULL,
rdsample = NULL,
maxest = NULL,
maxvest = NULL,
nsim = 1,
progress = FALSE,
pfile = ""
)
y 
vector of possible y values in order (numeric, character, factor) 
times 
vector of measurement times 
initial 
a vector of probabilities summing to 1.0 that specifies the frequency distribution of initial values to be sampled from. The vector must have names that correspond to values of 
absorb 
vector of absorbing states, a subset of 
intercepts 
vector of intercepts in the proportional odds model. There must be one fewer of these than the length of 
parameter 
vector of true parameter (effects; group differences) values. These are group 2:1 log odds ratios in the transition model, conditioning on the previous 
looks 
integer vector of ID numbers at which maximum likelihood estimates and their estimated variances are computed. For a single look specify a scalar value for 
g 
a userspecified function of three or more arguments which in order are 
formula 
a formula object given to the 
ppo 
a formula specifying the part of 
yprevfactor 
see 
groupContrast 
omit this argument if 
cscov 
applies if 
timecriterion 
a function of a timeordered vector of simulated ordinal responses 
coxzph 
set to 
sstat 
set to a function of the time vector and the corresponding vector of ordinal responses for a single group if you want to compute a Wilcoxon test on a derived quantity such as the number of days in a given state. 
rdsample 
an optional function to do responsedependent sampling. It is a function of these arguments, which are vectors that stop at any absorbing state: 
maxest 
maximum acceptable absolute value of the contrast estimate, ignored if 
maxvest 
like 
nsim 
number of simulations (default is 1) 
progress 
set to 
pfile 
file to which to write progress information. Defaults to 
Simulates sequential clinical trials of longitudinal ordinal outcomes using a first-order Markov model. Looks are done sequentially after subject ID numbers given in the vector looks
with the earliest possible look being after subject 2. At each look, a subject's repeated records are either all used or all ignored depending on the subject's ID number. For each true effect parameter value, simulation, and at each look, runs a function to compute the estimate of the parameter of interest along with its variance. For each simulation, data are first simulated for the last look, and these data are sequentially revealed for earlier looks. The user provides a function g
that has extra arguments specifying the true effect of parameter
and the treatment group,
expecting treatments to be coded 1 and 2. parameter
is usually on the scale of a regression coefficient, e.g., a log odds ratio. Fitting is done using the rms::lrm()
function, unless non-proportional odds is allowed in which case VGAM::vglm()
is used. If timecriterion
is specified, the function also, for the last data look only, computes the first time at which the criterion is satisfied for the subject, or uses the event time and event/censoring indicator computed by timecriterion
. The Cox/logrank chi-square statistic for comparing groups on the derived time variable is saved. If coxzph=TRUE
, the survival
package correlation coefficient rho
from the scaled partial residuals is also saved so that the user can later determine to what extent the Markov model resulted in the proportional hazards assumption being violated when analyzing on the time scale. vglm
is accelerated by saving the first successful fit for the largest sample size and using its coefficients as starting values for further vglm
fits for any sample size for the same setting of parameter
.
a data frame with number of rows equal to the product of nsim
, the length of looks
, and the length of parameter
, with variables sim
, parameter
, look
, est
(log odds ratio for group), and vest
(the variance of the latter). If timecriterion
is specified the data frame also contains loghr
(Cox log hazard ratio for group), lrchisq
(chi-square from Cox test for group), and if coxzph=TRUE
, phchisq
, the chi-square for testing proportional hazards. The attribute etimefreq
is also present if timecriterion
is present, and it provides the frequency distribution of derived event times by group and censoring/event indicator. If sstat
is given, the attribute sstat
is also present, and it contains an array with dimensions corresponding to simulations, parameter values within simulations, id
, and a two-column subarray with columns group
and y
, the latter being the summary measure computed by the sstat
function. The returned data frame also has attribute lrmcoef
which are the lastlook logistic regression coefficient estimates over the nsim
simulations and the parameter settings, and an attribute failures
which is a data frame containing the variables reason
and frequency
cataloging the reasons for unsuccessful model fits.
Frank Harrell
gbayesSeqSim()
, simMarkovOrd()
, https://hbiostat.org/R/Hmisc/markov/
Simulate Comparisons For Use in Sequential Clinical Trial Simulations
estSeqSim(parameter, looks, gendat, fitter, nsim = 1, progress = FALSE)
parameter 
vector of true parameter (effects; group differences) values 
looks 
integer vector of observation numbers at which posterior probabilities are computed 
gendat 
a function of three arguments: true parameter value (scalar), sample size for first group, sample size for second group 
fitter 
a function of two arguments: 0/1 group indicator vector and the dependent variable vector 
nsim 
number of simulations (default is 1) 
progress 
set to 
Simulates sequential clinical trials. Looks are done sequentially at observation numbers given in the vector looks
with the earliest possible look being at observation 2. For each true effect parameter value, simulation, and at each look, runs a function to compute the estimate of the parameter of interest along with its variance. For each simulation, data are first simulated for the last look, and these data are sequentially revealed for earlier looks. The user provides a function gendat
that given a true effect of parameter
and the two sample sizes (for treatment groups 1 and 2) returns a list with vectors y1
and y2
containing simulated data. The user also provides a function fitter
with arguments x
(group indicator 0/1) and y
(response variable) that returns a 2vector containing the effect estimate and its variance. parameter
is usually on the scale of a regression coefficient, e.g., a log odds ratio.
a data frame with number of rows equal to the product of nsim
, the length of looks
, and the length of parameter
.
Frank Harrell
gbayesSeqSim()
, simMarkovOrd()
, estSeqMarkovOrd()
if (requireNamespace("rms", quietly = TRUE)) {
# Run 100 simulations, 5 looks, 2 true parameter values
# Total simulation time: 2s
lfit <- function(x, y) {
  f <- rms::lrm.fit(x, y)
  k <- length(coef(f))
  c(coef(f)[k], vcov(f)[k, k])
}
gdat <- function(beta, n1, n2) {
  # Cell probabilities for a 7-category ordinal outcome for the control group
  p <- c(2, 1, 2, 7, 8, 38, 42) / 100
  # Compute cell probabilities for the treated group
  p2 <- pomodm(p=p, odds.ratio=exp(beta))
  y1 <- sample(1 : 7, n1, p, replace=TRUE)
  y2 <- sample(1 : 7, n2, p2, replace=TRUE)
  list(y1=y1, y2=y2)
}
set.seed(1)
est <- estSeqSim(c(0, log(0.7)), looks=c(50, 75, 95, 100, 200),
                 gendat=gdat,
                 fitter=lfit, nsim=100)
head(est)
head(est)
}
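Output from estSeqSim can then be summarized into operating characteristics. A minimal sketch using only the documented columns (est, vest, look), with a toy data frame standing in for real simulation output:

```r
# Toy stand-in for estSeqSim output (documented columns: sim,
# parameter, look, est, vest)
est <- data.frame(sim       = rep(1:2, each=2),
                  parameter = 0,
                  look      = rep(c(50, 100), 2),
                  est       = c(0.10, 0.20, 0.50, 0.60),
                  vest      = c(0.040, 0.020, 0.040, 0.020))
# Z statistics, then the proportion of simulations with |Z| > 1.96
# at each look (a crude frequentist rejection rate)
est$z <- est$est / sqrt(est$vest)
with(est, tapply(abs(z) > 1.96, look, mean))
```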
Creates an event chart on the current graphics device. Also allows the user to plot the legend in the plot area or on a separate page. Contains features useful for plotting data with time-to-event outcomes, which arise in a variety of studies including randomized clinical trials and non-randomized cohort studies. This function can use as input a matrix or a data frame, although greater utility and ease of use will be seen with a data frame.
event.chart(data, subset.r = 1:dim(data)[1], subset.c = 1:dim(data)[2],
sort.by = NA, sort.ascending = TRUE,
sort.na.last = TRUE, sort.after.subset = TRUE,
y.var = NA, y.var.type = "n",
y.jitter = FALSE, y.jitter.factor = 1,
y.renum = FALSE, NA.rm = FALSE, x.reference = NA,
now = max(data[, subset.c], na.rm = TRUE),
now.line = FALSE, now.line.lty = 2,
now.line.lwd = 1, now.line.col = 1, pty = "m",
date.orig = c(1, 1, 1960), titl = "Event Chart",
y.idlabels = NA, y.axis = "auto",
y.axis.custom.at = NA, y.axis.custom.labels = NA,
y.julian = FALSE, y.lim.extend = c(0, 0),
y.lab = ifelse(is.na(y.idlabels), "", as.character(y.idlabels)),
x.axis.all = TRUE, x.axis = "auto",
x.axis.custom.at = NA, x.axis.custom.labels = NA,
x.julian = FALSE, x.lim.extend = c(0, 0), x.scale = 1,
x.lab = ifelse(x.julian, "Follow-up Time", "Study Date"),
line.by = NA, line.lty = 1, line.lwd = 1, line.col = 1,
line.add = NA, line.add.lty = NA,
line.add.lwd = NA, line.add.col = NA,
point.pch = 1:length(subset.c),
point.cex = rep(0.6, length(subset.c)),
point.col = rep(1, length(subset.c)),
point.cex.mult = 1., point.cex.mult.var = NA,
extra.points.no.mult = rep(NA, length(subset.c)),
legend.plot = FALSE, legend.location = "o", legend.titl = titl,
legend.titl.cex = 3, legend.titl.line = 1,
legend.point.at = list(x = c(5, 95), y = c(95, 30)),
legend.point.pch = point.pch,
legend.point.text = ifelse(rep(is.data.frame(data), length(subset.c)),
names(data[, subset.c]),
subset.c),
legend.cex = 2.5, legend.bty = "n",
legend.line.at = list(x = c(5, 95), y = c(20, 5)),
legend.line.text = names(table(as.character(data[, line.by]),
exclude = c("", "NA"))),
legend.line.lwd = line.lwd, legend.loc.num = 1,
...)
data 
a matrix or data frame with rows corresponding to subjects and columns corresponding to variables. Note that for a data frame or matrix containing multiple time-to-event data (e.g., time to recurrence, time to death, and time to last follow-up), one column is required for each specific event. 
subset.r 
subset of rows of original matrix or data frame to place in event chart.
Logical arguments may be used here (e.g., 
subset.c 
subset of columns of original matrix or data frame to place in event chart;
if working with a data frame, a vector of data frame variable names may be
used for subsetting purposes (e.g., 
sort.by 
column(s) or data frame variable name(s) with which to sort the chart's output.
The default is 
sort.ascending 
logical flag (which takes effect only if the argument 
sort.na.last 
logical flag (which takes effect only if the argument 
sort.after.subset 
logical flag (which takes effect only if the argument sort.by is utilized).
If 
y.var 
variable name or column number of original matrix or data frame with
which to scale the y-axis.
Default is 
y.var.type 
type of variable specified in 
y.jitter 
logical flag (which takes effect only if the argument
The default of 
y.jitter.factor 
an argument used with the 
y.renum 
logical flag. If 
NA.rm 
logical flag. If 
x.reference 
column of original matrix or data frame with which to reference the x-axis.
That is, if specified, all columns specified in 
now 
the “now” date which will be used for the top of the y-axis
when creating the Goldman eventchart (see reference below).
Default is 
now.line 
logical flag. A feature utilized by the Goldman Eventchart.
When 
now.line.lty 
line type of 
now.line.lwd 
line width of 
now.line.col 
color of 
pty 
graph option, 
date.orig 
date of origin to consider if dates are in Julian, SAS, or S-Plus dates
object format; default is January 1, 1960 (which is the default origin
used by both S-Plus and SAS). Utilized when either

titl 
title for event chart. Default is 'Event Chart'. 
y.idlabels 
column or data frame variable name used for y-axis labels. For example,
if 
y.axis 
character string specifying whether program will control labelling
of y-axis (with argument 
y.axis.custom.at 
user-specified vector of y-axis label locations.
Must be used when 
y.axis.custom.labels 
user-specified vector of y-axis labels.
Must be used when 
y.julian 
logical flag (which will only be considered if 
y.lim.extend 
two-dimensional vector representing the number of units that the user
wants to increase 
y.lab 
single label to be used for entire y-axis. Default will be the variable name
or column number of 
x.axis.all 
logical flag. If 
x.axis 
character string specifying whether program will control labelling
of x-axis (with argument 
x.axis.custom.at 
user-specified vector of x-axis label locations.
Must be used when 
x.axis.custom.labels 
user-specified vector of x-axis labels.
Must be used when 
x.julian 
logical flag (which will only be considered if 
x.lim.extend 
two-dimensional vector representing the number of time units (usually in days)
that the user wants to increase 
x.scale 
a factor whose reciprocal is multiplied to original units of the
x-axis. For example, if the original data frame is in units of days,

x.lab 
single label to be used for entire x-axis. Default will be “On Study Date”
if 
line.by 
column or data frame variable name for plotting unique lines by unique
values of vector (e.g., specify 
line.lty 
vector of line types corresponding to ascending order of 
line.lwd 
vector of line widths corresponding to ascending order of 
line.col 
vector of line colors corresponding to ascending order of 
line.add 
a 2xk matrix with k=number of pairs of additional line segments to add.
For example, if it is of interest to draw additional line segments
connecting events one and two, two and three, and four and five,
(possibly with different colors), an appropriate The convention use of If NOTE: The drawing of the original default line
may be suppressed (with 
line.add.lty 
a kx1 vector corresponding to the columns of 
line.add.lwd 
a kx1 vector corresponding to the columns of 
line.add.col 
a kx1 vector corresponding to the columns of 
point.pch 
vector of 
point.cex 
vector of size of points representing each event.
If 
point.col 
vector of colors of points representing each event.
If 
point.cex.mult 
a single number (may be non-integer), which is the base multiplier for the value of
the 
point.cex.mult.var 
vector of variables to be used in determining what point.cex.mult is multiplied by
for determining size of plotted points from (possibly a subset of)

extra.points.no.mult 
vector of variables in the dataset to ignore for purposes of using

legend.plot 
logical flag; if 
legend.location 
will be used only if 
legend.titl 
title for the legend; default is title to be used for main plot.
Only used when 
legend.titl.cex 
size of text for legend title. Only used when 
legend.titl.line 
line location of legend title dictated by 
legend.point.at 
location of upper left and lower right corners of legend area to be utilized for describing events via points and text. 
legend.point.pch 
vector of 
legend.point.text 
text to be used for describing events; the default is setup for a data frame,
as it will print the names of the columns specified by 
legend.cex 
size of text for points and event descriptions. Default is 2.5 which is setup
for 
legend.bty 
option to put a box around the legend(s); default is to have no box
( 
legend.line.at 
if 
legend.line.text 
text to be used for describing 
legend.line.lwd 
vector of line widths corresponding to 
legend.loc.num 
number used for locator argument when 
... 
additional par arguments for use in main plot. 
if you want to put, say, two event charts side-by-side in a plot
region, you should not set up par(mfrow=c(1,2))
before running the
first plot. Instead, you should add the argument mfg=c(1,1,1,2)
to the first plot call followed by the argument mfg=c(1,2,1,2)
to the second plot call.
if dates in original data frame are in a specialized form
(e.g., mm/dd/yy) of mode CHARACTER, the user must convert those columns to
become class dates or julian numeric mode (see Date
for more information).
For example, in a data frame called testdata
, with specialized
dates in columns 4 thru 10, the following code could be used:
as.numeric(dates(testdata[,4:10]))
. This will convert the columns
to numeric julian dates based on the function's default origin
of January 1, 1960. If original dates are in class dates or julian form,
no extra work is necessary.
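As a hedged base-R illustration of the same conversion (the call above uses the chron-style dates function; the character dates below are hypothetical), days since the January 1, 1960 origin can also be computed with as.Date:

```r
# Hypothetical mm/dd/yy character dates, converted to numeric julian
# days using the January 1, 1960 origin described above
chardates <- c('05/01/98', '01/01/60')
julians <- as.numeric(as.Date(chardates, format = '%m/%d/%y') -
                      as.Date('1960-01-01'))
julians   # 14000 0, matching julian=14000 for May 1, 1998 noted in the examples
```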
In survival analysis, the data typically come in two
columns: one column containing survival time and the other
containing a censoring indicator or event code. The
event.convert
function converts this type of data into
multiple columns of event times, one column for each event
type, suitable for the event.chart
function.
An event chart is created on the current graphics device. If legend.plot=TRUE and legend.location='o', a one-page legend will precede the event chart. Note that on completion of the function, par parameters are reset to the par parameters in effect before the function was called.
J. Jack Lee and Kenneth R. Hess
Department of Biostatistics
University of Texas
M.D. Anderson Cancer Center
Houston, TX 77030
jjlee@mdanderson.org, khess@mdanderson.org
Joel A. Dubin
Department of Statistics
University of Waterloo
jdubin@uwaterloo.ca
Lee J.J., Hess, K.R., Dubin, J.A. (2000). Extensions and applications of event charts. The American Statistician, 54:1, 63–70.
Dubin, J.A., Lee, J.J., Hess, K.R. (1997). The Utility of Event Charts. Proceedings of the Biometrics Section, American Statistical Association.
Dubin, J.A., Muller, H.G., Wang, J.L. (2001). Event history graphs for censored survival data. Statistics in Medicine, 20, 2951–2964.
Goldman, A.I. (1992). EVENTCHARTS: Visualizing Survival and Other Timed-Events Data. The American Statistician, 46:1, 13–18.
# The sample data set is an augmented CDC AIDS dataset (ASCII)
# which is used in the examples in the help file. This dataset is
# described in Kalbfleisch and Lawless (JASA, 1989).
# Here, we have included only children 4 years old and younger.
# We have also added a new field, dethdate, which
# represents a fictitious death date for each patient. There was
# no recording of death date on the original dataset. In addition, we have
# added a fictitious viral load reading (copies/ml) for each patient at time of AIDS diagnosis,
# noting viral load was also not part of the original dataset.
#
# All dates are julian with julian=0 being
# January 1, 1960, and julian=14000 being 14000 days beyond
# January 1, 1960 (i.e., May 1, 1998).
cdcaids <- data.frame(
age=c(4,2,1,1,2,2,2,4,2,1,1,3,2,1,3,2,1,2,4,2,2,1,4,2,4,1,4,2,1,1,3,3,1,3),
infedate=c(
7274,7727,7949,8037,7765,8096,8186,7520,8522,8609,8524,8213,8455,8739,
8034,8646,8886,8549,8068,8682,8612,9007,8461,8888,8096,9192,9107,9001,
9344,9155,8800,8519,9282,8673),
diagdate=c(
8100,8158,8251,8343,8463,8489,8554,8644,8713,8733,8854,8855,8863,8983,
9035,9037,9132,9164,9186,9221,9224,9252,9274,9404,9405,9433,9434,9470,
9470,9472,9489,9500,9585,9649),
diffdate=c(
826,431,302,306,698,393,368,1124,191,124,330,642,408,244,1001,391,246,
615,1118,539,612,245,813,516,1309,241,327,469,126,317,689,981,303,976),
dethdate=c(
8434,8304,NA,8414,8715,NA,8667,9142,8731,8750,8963,9120,9005,9028,9445,
9180,9189,9406,9711,9453,9465,9289,9640,9608,10010,9488,9523,9633,9667,
9547,9755,NA,9686,10084),
censdate=c(
NA,NA,8321,NA,NA,8519,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,
NA,NA,NA,NA,NA,NA,NA,NA,NA,10095,NA,NA),
viralload=c(
13000,36000,70000,90000,21000,110000,75000,12000,125000,110000,13000,39000,79000,135000,14000,
42000,123000,20000,12000,18000,16000,140000,16000,58000,11000,120000,85000,31000,24000,115000,
17000,13100,72000,13500)
)
cdcaids <- upData(cdcaids,
labels=c(age ='Age, y', infedate='Date of blood transfusion',
diagdate='Date of AIDS diagnosis',
diffdate='Incubation period (days from HIV to AIDS)',
dethdate='Fictitious date of death',
censdate='Fictitious censoring date',
viralload='Fictitious viral load'))
# Note that the style options listed with these
# examples are best suited for output to a postscript file (i.e., using
# the postscript function with horizontal=TRUE) as opposed to a graphical
# window (e.g., motif).
# To produce simple calendar event chart (with internal legend):
# postscript('example1.ps', horizontal=TRUE)
event.chart(cdcaids,
subset.c=c('infedate','diagdate','dethdate','censdate'),
x.lab = 'observation dates',
y.lab='patients (sorted by AIDS diagnosis date)',
titl='AIDS data calendar event chart 1',
point.pch=c(1,2,15,0), point.cex=c(1,1,0.8,0.8),
legend.plot=TRUE, legend.location='i', legend.cex=1.0,
legend.point.text=c('transfusion','AIDS diagnosis','death','censored'),
legend.point.at = list(c(7210, 8100), c(35, 27)), legend.bty='o')
# To produce simple interval event chart (with internal legend):
# postscript('example2.ps', horizontal=TRUE)
event.chart(cdcaids,
subset.c=c('infedate','diagdate','dethdate','censdate'),
x.lab = 'time since transfusion (in days)',
y.lab='patients (sorted by AIDS diagnosis date)',
titl='AIDS data interval event chart 1',
point.pch=c(1,2,15,0), point.cex=c(1,1,0.8,0.8),
legend.plot=TRUE, legend.location='i', legend.cex=1.0,
legend.point.text=c('transfusion','AIDS diagnosis','death','censored'),
x.reference='infedate', x.julian=TRUE,
legend.bty='o', legend.point.at = list(c(1400, 1950), c(7, 1)))
# To produce simple interval event chart (with internal legend),
# but now with flexible diagdate symbol size based on viral load variable:
# postscript('example2a.ps', horizontal=TRUE)
event.chart(cdcaids,
subset.c=c('infedate','diagdate','dethdate','censdate'),
x.lab = 'time since transfusion (in days)',
y.lab='patients (sorted by AIDS diagnosis date)',
titl='AIDS data interval event chart 1a, with viral load at diagdate represented',
point.pch=c(1,2,15,0), point.cex=c(1,1,0.8,0.8),
point.cex.mult = 0.00002, point.cex.mult.var = 'viralload', extra.points.no.mult = c(1,NA,1,1),
legend.plot=TRUE, legend.location='i', legend.cex=1.0,
legend.point.text=c('transfusion','AIDS diagnosis','death','censored'),
x.reference='infedate', x.julian=TRUE,
legend.bty='o', legend.point.at = list(c(1400, 1950), c(7, 1)))
# To produce more complicated interval chart which is
# referenced by infection date, and sorted by age and incubation period:
# postscript('example3.ps', horizontal=TRUE)
event.chart(cdcaids,
subset.c=c('infedate','diagdate','dethdate','censdate'),
x.lab = 'time since diagnosis of AIDS (in days)',
y.lab='patients (sorted by age and incubation length)',
titl='AIDS data interval event chart 2 (sorted by age, incubation)',
point.pch=c(1,2,15,0), point.cex=c(1,1,0.8,0.8),
legend.plot=TRUE, legend.location='i',legend.cex=1.0,
legend.point.text=c('transfusion','AIDS diagnosis','death','censored'),
x.reference='diagdate', x.julian=TRUE, sort.by=c('age','diffdate'),
line.by='age', line.lty=c(1,3,2,4), line.lwd=rep(1,4), line.col=rep(1,4),
legend.bty='o', legend.point.at = list(c(1350, 800), c(7, 1)),
legend.line.at = list(c(1350, 800), c(16, 8)),
legend.line.text=c('age = 1', ' = 2', ' = 3', ' = 4'))
# To produce the Goldman chart:
# postscript('example4.ps', horizontal=TRUE)
event.chart(cdcaids,
subset.c=c('infedate','diagdate','dethdate','censdate'),
x.lab = 'time since transfusion (in days)', y.lab='dates of observation',
titl='AIDS data Goldman event chart 1',
y.var = c('infedate'), y.var.type='d', now.line=TRUE, y.jitter=FALSE,
point.pch=c(1,2,15,0), point.cex=c(1,1,0.8,0.8), mgp = c(3.1,1.6,0),
legend.plot=TRUE, legend.location='i',legend.cex=1.0,
legend.point.text=c('transfusion','AIDS diagnosis','death','censored'),
x.reference='infedate', x.julian=TRUE,
legend.bty='o', legend.point.at = list(c(1500, 2800), c(9300, 10000)))
# To convert coded time-to-event data, then draw an event chart:
surv.time <- c(5,6,3,1,2)
cens.ind <- c(1,0,1,1,0)
surv.data <- cbind(surv.time,cens.ind)
event.data <- event.convert(surv.data)
event.chart(cbind(rep(0,5),event.data),x.julian=TRUE,x.reference=1)
Convert a two-column data matrix with event time and event code into multiple event-time columns, one event type per column
event.convert(data2, event.time = 1, event.code = 2)
data2 
a matrix or data frame with at least 2 columns; by default, the first column contains the event time and the second column contains the k event codes (e.g., 1=dead, 0=censored) 
event.time 
the number of the column in data2 that contains the event time 
event.code 
the number of the column in data2 that contains the event code 
In survival analysis, the data typically come in two
columns: one column containing survival time and the other
containing a censoring indicator or event code. The
event.convert
function converts this type of data into
multiple columns of event times, one column for each event
type, suitable for the event.chart
function.
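The reshaping this describes can be sketched by hand in base R for a hypothetical two-code example (this is an illustration of the layout event.convert produces, not its implementation):

```r
# Time-to-event data with one event-time column and one code column
surv.time <- c(5, 6, 3, 1, 2)
cens.ind  <- c(1, 0, 1, 1, 0)   # 1 = event, 0 = censored
# One event-time column per code, NA where that code did not occur
t.event    <- ifelse(cens.ind == 1, surv.time, NA)
t.censored <- ifelse(cens.ind == 0, surv.time, NA)
cbind(t.event, t.censored)
```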
J. Jack Lee and Kenneth R. Hess
Department of Biostatistics
University of Texas
M.D. Anderson Cancer Center
Houston, TX 77030
jjlee@mdanderson.org, khess@mdanderson.org
Joel A. Dubin
Department of Statistics
University of Waterloo
jdubin@uwaterloo.ca
event.history
, Date
, event.chart
# To convert coded time-to-event data, then draw an event chart:
surv.time <- c(5,6,3,1,2)
cens.ind <- c(1,0,1,1,0)
surv.data <- cbind(surv.time,cens.ind)
event.data <- event.convert(surv.data)
event.chart(cbind(rep(0,5),event.data),x.julian=TRUE,x.reference=1)
Produces an event history graph for right-censored survival data, including time-dependent covariate status, as described in Dubin, Muller, and Wang (2001). Effectively, a Kaplan-Meier curve is produced with supplementary information regarding individual survival information, censoring information, and status over time of an individual time-dependent covariate or time-dependent covariate function for both uncensored and censored individuals.
event.history(data, survtime.col, surv.col,
surv.ind = c(1, 0), subset.rows = NULL,
covtime.cols = NULL, cov.cols = NULL,
num.colors = 1, cut.cov = NULL, colors = 1,
cens.density = 10, mult.end.cens = 1.05,
cens.mark.right =FALSE, cens.mark = "",
cens.mark.ahead = 0.5, cens.mark.cutoff = -1e-08,
cens.mark.cex = 1,
x.lab = "time under observation",
y.lab = "estimated survival probability",
title = "event history graph", ...)
data 
A matrix or data frame with rows corresponding to units (often individuals) and columns corresponding to survival time and event/censoring indicator. Also, multiple columns may be devoted to time-dependent covariate level and time change. 
survtime.col 
Column (in data) representing minimum of time-to-event or right-censoring time for the individual. 
surv.col 
Column (in data) representing the event indicator for an individual. Although, traditionally, such an indicator will be 1 for an event and 0 for a censored observation, this indicator can be represented by any two numbers, made explicit by the surv.ind argument. 
surv.ind 
Twoelement vector representing, respectively, the
number for an event, as listed in 
subset.rows 
Subset of rows of original matrix or data frame (data) to
place in event history graph.
Logical arguments may be used here (e.g., 
covtime.cols 
Column(s) (in data) representing the time when a change of the time-dependent
covariate (or time-dependent covariate function) occurs.
There should be a unique non 
cov.cols 
Column(s) (in data) representing the level of the time-dependent
covariate (or time-dependent covariate function). There should be
a unique non 
num.colors 
Colors are utilized for the time-dependent covariate level for an
individual. This argument provides the number of unique covariate
levels which will be displayed by mapping the number of colors
(via 
cut.cov 
This argument allows the user to explicitly state how to
define the intervals for the time-dependent covariate, such that
different colors will be allocated to the user-defined covariate levels.
For example, for plotting five colors, six ordered points within the
span of the data's covariate levels should be provided.
Default is 
colors 
This is a vector argument defining the actual colors used
for the time-dependent covariate levels in the plot, with the
index of this vector corresponding to the ordered levels
of the covariate. The number of colors (i.e., the length
of the colors vector) should correspond to the
value provided to the 
cens.density 
This will provide the shading density at the end of the
individual bars for those who are censored. For more information
on shading density, see the density argument in the S-Plus
polygon function. Default is 
mult.end.cens 
This is a multiplier that extends the length of the longest surviving individual bar (or bars, if a tie exists) if right-censored, presuming that no event times eventually follow this final censored time. Default extends the length 5 percent beyond the length of the observed right-censored survival time. 
cens.mark.right 
A logical argument that states whether an explicit mark should be placed to the right of the individual right-censored survival bars. This argument is most useful for large sample sizes, where it may be hard to detect the special shading via cens.density, particularly for the short-term survivors. 
cens.mark 
Character argument which describes the censored mark that should be
used if 
cens.mark.ahead 
A numeric argument which specifies the absolute distance to be placed between the individual right-censored survival bars and the mark defined in the cens.mark argument. Default is 0.5 (that is, half a day, if survival time is measured in days), but may very well need adjusting depending on the maximum survival time observed in the dataset. 
cens.mark.cutoff 
A negative number very close to 0
(by default 
cens.mark.cex 
Numeric argument defining the size of the mark defined in
the 
x.lab 
Single label to be used for the entire x-axis.
Default is 
y.lab 
Single label to be used for the entire y-axis.
Default is 
title 
Title for the event history graph.
Default is 
... 
This allows arguments to the plot function call within
the 
In order to focus on a particular area of the event history graph,
zooming can be performed. This is best done by
specifying appropriate xlim
and ylim
arguments at the end of the event.history
function call,
taking advantage of the ...
argument link to the plot function.
An example of zooming can be seen
in Plate 4 of the paper referenced below.
Please read the reference below to understand how the individual covariate and survival information is provided in the plot, how ties are handled, how rightcensoring is handled, etc.
This function has been tested thoroughly, but only within a restricted version and environment, i.e., only within S-Plus 2000, Version 3, and within S-Plus 6.0, Version 2, both on a Windows 2000 machine. Hence, we cannot currently vouch for the function's effectiveness in other versions of S-Plus (e.g., S-Plus 3.4) nor in other operating environments (e.g., Windows 95, Linux or Unix). The function has also been verified to work on R under Linux.
The authors have found better control of the use of color by producing the graphs via the postscript plotting device in S-Plus. In fact, the provided examples utilize the postscript function. However, your past experiences may be different, and you may prefer to control color directly (to the graphsheet in the Windows environment, for example). The event.history function will work with either approach.
Joel Dubin
jdubin@uwaterloo.ca
Dubin, J.A., Muller, H.G., and Wang, J.L. (2001). Event history graphs for censored survival data. Statistics in Medicine, 20, 2951–2964.
plot
,polygon
,
event.chart
, par
# Code to produce event history graphs for SIM paper
#
# before generating plots, some preprocessing needs to be performed,
# in order to get dataset in proper form for event.history function;
# need to create one line per subject and sort by time under observation,
# with those experiencing event coming before those tied with censoring time;
require('survival')
data(heart)
# creation of event.history version of heart dataset (call heart.one):
heart.one <- matrix(nrow=length(unique(heart$id)), ncol=8)
for(i in 1:length(unique(heart$id)))
{
if(length(heart$id[heart$id==i]) == 1)
heart.one[i,] <- as.numeric(unlist(heart[heart$id==i, ]))
else if(length(heart$id[heart$id==i]) == 2)
heart.one[i,] <- as.numeric(unlist(heart[heart$id==i,][2,]))
}
heart.one[,3][heart.one[,3] == 0] <- 2 ## converting censored events to 2, from 0
if(is.factor(heart$transplant))
heart.one[,7] <- heart.one[,7] - 1
## getting back to correct transplantation coding
heart.one <- as.data.frame(heart.one[order(unlist(heart.one[,2]), unlist(heart.one[,3])),])
names(heart.one) <- names(heart)
# back to usual censoring indicator:
heart.one[,3][heart.one[,3] == 2] <- 0
# note: transplant says 0 (for no transplants) or 1 (for one transplant)
# and event = 1 is death, while event = 0 is censored
# plot single KaplanMeier curve from heart data, first creating survival object
heart.surv <- survfit(Surv(stop, event) ~ 1, data=heart.one, conf.int = FALSE)
# figure 3: traditional Kaplan-Meier curve
# postscript('ehgfig3.ps', horiz=TRUE)
# omi < par(omi=c(0,1.25,0.5,1.25))
plot(heart.surv, ylab='estimated survival probability',
xlab='observation time (in days)')
title('Figure 3: Kaplan-Meier curve for Stanford data', cex=0.8)
# dev.off()
## now, draw event history graph for Stanford heart data; use as Figure 4
# postscript('ehgfig4.ps', horiz=TRUE, colors = seq(0, 1, len=20))
# par(omi=c(0,1.25,0.5,1.25))
event.history(heart.one,
survtime.col=heart.one[,2], surv.col=heart.one[,3],
covtime.cols = cbind(rep(0, dim(heart.one)[1]), heart.one[,1]),
cov.cols = cbind(rep(0, dim(heart.one)[1]), heart.one[,7]),
num.colors=2, colors=c(6,10),
x.lab = 'time under observation (in days)',
title='Figure 4: Event history graph for\nStanford data',
cens.mark.right =TRUE, cens.mark = '',
cens.mark.ahead = 30.0, cens.mark.cex = 0.85)
# dev.off()
# now, draw agestratified event history graph for Stanford heart data;
# use as Figure 5
# two plots, stratified by age status
# postscript('c:\\temp\\ehgfig5.ps', horiz=TRUE, colors = seq(0, 1, len=20))
# par(omi=c(0,1.25,0.5,1.25))
par(mfrow=c(1,2))
event.history(data=heart.one, subset.rows = (heart.one[,4] < 0),
survtime.col=heart.one[,2], surv.col=heart.one[,3],
covtime.cols = cbind(rep(0, dim(heart.one)[1]), heart.one[,1]),
cov.cols = cbind(rep(0, dim(heart.one)[1]), heart.one[,7]),
num.colors=2, colors=c(6,10),
x.lab = 'time under observation\n(in days)',
title = 'Figure 5a:\nStanford data\n(age < 48)',
cens.mark.right =TRUE, cens.mark = '',
cens.mark.ahead = 40.0, cens.mark.cex = 0.85,
xlim=c(0,1900))
event.history(data=heart.one, subset.rows = (heart.one[,4] >= 0),
survtime.col=heart.one[,2], surv.col=heart.one[,3],
covtime.cols = cbind(rep(0, dim(heart.one)[1]), heart.one[,1]),
cov.cols = cbind(rep(0, dim(heart.one)[1]), heart.one[,7]),
num.colors=2, colors=c(6,10),
x.lab = 'time under observation\n(in days)',
title = 'Figure 5b:\nStanford data\n(age >= 48)',
cens.mark.right =TRUE, cens.mark = '',
cens.mark.ahead = 40.0, cens.mark.cex = 0.85,
xlim=c(0,1900))
# dev.off()
# par(omi=omi)
# we will not show liver cirrhosis data manipulation, as it was
# a bit detailed; however, here is the
# event.history code to produce Figure 7 / Plate 1
# Figure 7 / Plate 1 : prothrombin ehg with color
## Not run:
second.arg <- 1 ### second.arg is for shading
third.arg <- c(rep(1,18),0,1) ### third.arg is for intensity
# postscript('c:\\temp\\ehgfig7.ps', horiz=TRUE,
# colors = cbind(seq(0, 1, len = 20), second.arg, third.arg))
# par(omi=c(0,1.25,0.5,1.25), col=19)
event.history(cirrhos2.eh, subset.rows = NULL,
survtime.col=cirrhos2.eh$time, surv.col=cirrhos2.eh$event,
covtime.cols = as.matrix(cirrhos2.eh[, ((2:18)*2)]),
cov.cols = as.matrix(cirrhos2.eh[, ((2:18)*2) + 1]),
cut.cov = as.numeric(quantile(as.matrix(cirrhos2.eh[, ((2:18)*2) + 1]),
c(0,.2,.4,.6,.8,1), na.rm=TRUE) + c(-1,0,0,0,0,1)),
colors=c(20,4,8,11,14),
x.lab = 'time under observation (in days)',
title='Figure 7: Event history graph for liver cirrhosis data (color)',
cens.mark.right =TRUE, cens.mark = '',
cens.mark.ahead = 100.0, cens.mark.cex = 0.85)
# dev.off()
## End(Not run)
Extract Labels and Units From Multiple Datasets
extractlabs(..., print = TRUE)
... 
one or more data frames or data tables 
print 
set to 
For one or more data frames/tables, extracts all labels and units and combines them across datasets, dropping any variables not having either labels or units defined. The resulting data table is returned and is used by the hlab
function if the user stores the result in an object named LabelsUnits
. The result is NULL
if no variable in any dataset has a nonblank label
or units
. Variables found in more than one dataset with duplicate label
and units
are consolidated. A warning message is issued when duplicate variables have conflicting labels or units, and by default, details are printed. No attempt is made to resolve these conflicts.
a data table
Frank Harrell
label()
, contents()
, units()
, hlab()
d <- data.frame(x=1:10, y=(1:10)/10)
d <- upData(d, labels=c(x='X', y='Y'), units=c(x='mmHg'), print=FALSE)
d2 <- d
units(d2$x) <- 'cm'
LabelsUnits <- extractlabs(d, d2)
LabelsUnits
General File Import Using rio
fImport(
file,
format,
lowernames = c("not mixed", "no", "yes"),
und. = FALSE,
...
)
file 
name of file to import, or full URL. 
format 
format of file to import, usually not needed. See 
lowernames 
defaults to changing variable names to all lower case unless the name has mixed upper and lower case, which results in keeping the original characters in the name. Set 
und. 
set to 
... 
more arguments to pass to 
This is a front end for the rio
package's import
function. fImport
includes options for setting variable names to lower case and for changing underscores in names to periods. Variables in the imported data frame that have labels are converted to the Hmisc package labelled
class so that subsetting the data frame will preserve the labels.
a data frame created by rio
, unless a rio
option is given to use another format
Frank Harrell
upData
, especially the moveUnits
option
## Not run:
# Get a Stata dataset
d <- fImport('http://www.principlesofeconometrics.com/stata/alcohol.dta')
contents(d)
## End(Not run)
Compares each row in x
against all the rows in y
, finding rows in
y
with all columns within a tolerance of the values in a given row of
x
. The default tolerance
tol
is zero, i.e., an exact match is required on all columns.
For qualifying matches, a distance measure is computed. This is
the sum of squares of differences between x
and y
after scaling
the columns. The default scaling values are tol
, and for columns
with tol=1
the scale values are set to 1.0 (since they are ignored
anyway). Matches (up to maxmatch
of them) are stored and listed in order of
increasing distance.
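The distance measure described above can be sketched as simple arithmetic (an illustration with hypothetical rows and scaling constants, not the find.matches internals):

```r
# Sum of squared differences between a row of x and a candidate row
# of y after scaling each column
xrow <- c(0.09, 0.21)    # a row of x
yrow <- c(0.10, 0.20)    # a qualifying matching row of y
sc   <- c(0.05, 0.05)    # per-column scaling constants (default: tol)
d <- sum(((xrow - yrow) / sc)^2)
d   # 0.08
```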
The summary
method prints a frequency distribution of the
number of matches per observation in x
, the median of the minimum
distances for all matches per x
, as a function of the number of matches,
and the frequency of selection of duplicate observations as those having
the smallest distance. The print
method prints the entire matches
and distance
components of the result from find.matches
.
matchCases
finds all controls that match cases on a single variable
x
within a tolerance of tol
. This is intended for prospective
cohort studies that use matching for confounder adjustment (even
though regression models usually work better).
find.matches(x, y, tol=rep(0, ncol(y)), scale=tol, maxmatch=10)
## S3 method for class 'find.matches'
summary(object, ...)
## S3 method for class 'find.matches'
print(x, digits, ...)
matchCases(xcase, ycase, idcase=names(ycase),
xcontrol, ycontrol, idcontrol=names(ycontrol),
tol=NULL,
maxobs=max(length(ycase),length(ycontrol))*10,
maxmatch=20, which=c('closest','random'))
x 
a numeric matrix or the result of 
y 
a numeric matrix with same number of columns as 
xcase 
numeric vector to match on for cases 
xcontrol 
numeric vector to match on for controls, not necessarily
the same length as 
ycase 
a vector or matrix 
ycontrol 

tol 
a vector of tolerances with number of elements the same as the number
of columns of 
scale 
a vector of scaling constants with number of elements the same as the
number of columns of 
maxmatch 
maximum number of matches to allow. For 
object 
an object created by 
digits 
number of digits to use in printing distances 
idcase 
vector the same length as 
idcontrol 

maxobs 
maximum number of cases and all matching controls combined (maximum
dimension of data frame resulting from 
which 
set to 
... 
unused 
find.matches
returns a list of class find.matches
with elements
matches
and distance
.
Both elements are matrices with the number of rows equal to the number
of rows in x
, and with k
columns, where k
is the maximum number of
matches (<= maxmatch
) that occurred. The elements of matches
are row identifiers of y
that match, with zeros if fewer than
maxmatch
matches are found (blanks if y
had row names).
matchCases
returns a data frame with variables idcase
(id of case
currently being matched), type
(factor variable with levels "case"
and "control"
), id
(id of case if case row, or id of matching
case), and y
.
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
Ming K, Rosenbaum PR (2001): A note on optimal matching with variable controls using the assignment algorithm. J Comp Graph Stat 10:455–463.
Cepeda MS, Boston R, Farrar JT, Strom BL (2003): Optimal matching with a variable number of controls vs. a fixed number of controls for a cohort study: tradeoffs. J Clin Epidemiology 56:230–237. Note: These papers were not used for the functions here but probably should have been.
y <- rbind(c(.1, .2), c(.11, .22), c(.3, .4), c(.31, .41), c(.32, 5))
x <- rbind(c(.09,.21), c(.29,.39))
y
x
w <- find.matches(x, y, maxmatch=5, tol=c(.05,.05))
set.seed(111) # so can replicate results
x <- matrix(runif(500), ncol=2)
y <- matrix(runif(2000), ncol=2)
w <- find.matches(x, y, maxmatch=5, tol=c(.02,.03))
w$matches[1:5,]
w$distance[1:5,]
# Find first x with 3 or more y-matches
num.match <- apply(w$matches, 1, function(x)sum(x > 0))
j <- ((1:length(num.match))[num.match > 2])[1]
x[j,]
y[w$matches[j,],]
summary(w)
# For many applications would do something like this:
# attach(df1)
# x <- cbind(age, sex) # Just do as.matrix(df1) if df1 has no factor objects
# attach(df2)
# y <- cbind(age, sex)
# mat <- find.matches(x, y, tol=c(5,0)) # exact match on sex, 5y on age
# Demonstrate matchCases
xcase <- c(1,3,5,12)
xcontrol <- 1:6
idcase <- c('A','B','C','D')
idcontrol <- c('a','b','c','d','e','f')
ycase <- c(11,33,55,122)
ycontrol <- c(11,22,33,44,55,66)
matchCases(xcase, ycase, idcase,
xcontrol, ycontrol, idcontrol, tol=1)
# If y is a binary response variable, the following code
# will produce a Mantel-Haenszel summary odds ratio that
# utilizes the matching.
# Standard variance formula will not work here because
# a control will match more than one case
# WARNING: The MH procedure exemplified here is suspect
# because of the small strata and widely varying number
# of controls per case.
x <- c(1, 2, 3, 3, 3, 6, 7, 12, 1, 1:7)
y <- c(0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1)
case <- c(rep(TRUE, 8), rep(FALSE, 8))
id <- 1:length(x)
m <- matchCases(x[case], y[case], id[case],
x[!case], y[!case], id[!case], tol=1)
iscase <- m$type=='case'
# Note: the first tapply ensures that event indicators are
# sorted by case id. The second actually does something.
event.case <- tapply(m$y[iscase], m$idcase[iscase], sum)
event.control <- tapply(m$y[!iscase], m$idcase[!iscase], sum)
n.control <- tapply(!iscase, m$idcase, sum)
n <- tapply(m$y, m$idcase, length)
or <- sum(event.case * (n.control - event.control) / n) /
sum(event.control * (1 - event.case) / n)
or
# Bootstrap this estimator by sampling with replacement from
# subjects. Assumes id is unique when combining cases+controls
# (id was constructed this way above). The following algorithm
# puts all sampled controls back with the cases to whom they were
# originally matched.
ids <- unique(m$id)
idgroups <- split(1:nrow(m), m$id)
B <- 50 # in practice use many more
ors <- numeric(B)
# Function to order w by ids, leaving unassigned elements zero
align <- function(ids, w) {
z <- structure(rep(0, length(ids)), names=ids)
z[names(w)] <- w
z
}
for(i in 1:B) {
j <- sample(ids, replace=TRUE)
obs <- unlist(idgroups[j])
u <- m[obs,]
iscase <- u$type=='case'
n.case <- align(ids, tapply(u$type, u$idcase,
function(v)sum(v=='case')))
n.control <- align(ids, tapply(u$type, u$idcase,
function(v)sum(v=='control')))
event.case <- align(ids, tapply(u$y[iscase], u$idcase[iscase], sum))
event.control <- align(ids, tapply(u$y[!iscase], u$idcase[!iscase], sum))
n <- n.case + n.control
# Remove sets having 0 cases or 0 controls in resample
s <- n.case > 0 & n.control > 0
denom <- sum(event.control[s] * (n.case[s] - event.case[s]) / n[s])
or <- if(denom==0) NA else
sum(event.case[s] * (n.control[s] - event.control[s]) / n[s]) / denom
ors[i] <- or
}
describe(ors)
first.word
finds the first word in an expression. A word is defined by
unlisting the elements of the expression found by the S parser and then
accepting any elements whose first character is either a letter or period.
The principal intended use is for the automatic generation of temporary
file names where it is important to exclude special characters from
the file name. For Microsoft Windows, periods in names are deleted and
only up to the first 8 characters of the word are returned.
first.word(x, i=1, expr=substitute(x))
x 
any scalar character string 
i 
word number, default value = 1. Used when the second or 
expr 
any S object of mode 
a character string
Frank E. Harrell, Jr.,
Department of Biostatistics,
Vanderbilt University,
fh@fharrell.com
Richard M. Heiberger,
Department of Statistics,
Temple University, Philadelphia, PA.
rmh@temple.edu
first.word(expr=expression(y ~ x + log(w)))
format.df
does appropriate rounding and decimal alignment, and outputs
a character matrix containing the formatted data. If x
is a
data.frame
, each component is formatted separately.
If x
is a matrix, but not a data.frame, it is made into a data.frame
with individual components for the columns.
If a component x$x
is a matrix, all of its columns are formatted the same way.
format.df(x, digits, dec=NULL, rdec=NULL, cdec=NULL,
numeric.dollar=!dcolumn, na.blank=FALSE, na.dot=FALSE,
blank.dot=FALSE, col.just=NULL, cdot=FALSE,
dcolumn=FALSE, matrix.sep=' ', scientific=c(4,4),
math.row.names=FALSE, already.math.row.names=FALSE,
math.col.names=FALSE, already.math.col.names=FALSE,
double.slash=FALSE, format.Date="%m/%d/%Y",
format.POSIXt="%m/%d/%Y %H:%M:%OS", ...)
x 
a matrix (usually numeric) or data frame 
digits 
causes all values in the table to be formatted to 
dec 
If 
rdec 
a vector specifying the number of decimal places to the right for each row
( 
cdec 
a vector specifying the number of decimal places for each column. The vector must have number of items equal to number of columns or components of input x. 
cdot 
Set to 
na.blank 
Set to 
dcolumn 
Set to 
numeric.dollar 
logical, default 
math.row.names 
logical, set true to place dollar signs around the row names. 
already.math.row.names 
set to 
math.col.names 
logical, set true to place dollar signs around the column names. 
already.math.col.names 
set to 
na.dot 
Set to 
blank.dot 
Set to 
col.just 
Input vector 
matrix.sep 
When 
scientific 
specifies ranges of exponents (or a logical vector) specifying values
not to convert to scientific notation. See 
double.slash 
should escaping backslashes be themselves escaped. 
format.Date 
String used to format objects of the Date class. 
format.POSIXt 
String used to format objects of the POSIXt class. 
... 
other arguments are accepted and passed to 
a character matrix with character images of properly rounded x
.
Matrix components of input x
are now just sets of columns of
character matrix.
The object attribute "col.just"
repeats the value of the argument col.just
when provided;
otherwise, it contains the recommended justification for columns of output.
See the discussion of the argument col.just
.
The default justification is ‘l’ for characters and factors,
‘r’ for numeric.
When dcolumn==TRUE
, numerics will have ‘.’ as the justification character.
Frank E. Harrell, Jr.,
Department of Biostatistics,
Vanderbilt University,
fh@fharrell.com
Richard M. Heiberger,
Department of Statistics,
Temple University, Philadelphia, PA.
rmh@temple.edu
## Not run:
x <- data.frame(a=1:2, b=3:4)
x$m <- 10000*matrix(5:8,nrow=2)
names(x)
dim(x)
x
format.df(x, big.mark=",")
dim(format.df(x))
## End(Not run)
format.pval
is intended for formatting p-values.
format.pval(x, pv=x, digits = max(1, .Options$digits - 2),
eps = .Machine$double.eps, na.form = "NA", ...)
pv 
a numeric vector. 
x 
argument for method compliance. 
digits 
how many significant digits are to be used. 
eps 
a numerical tolerance: see Details. 
na.form 
character representation of 
... 
arguments passed to 
format.pval
is mainly an auxiliary function for
print.summary.lm
etc., and does separate formatting for
fixed, floating point and very small values; those less than
eps
are formatted as “‘< [eps]’” (where
“‘[eps]’” stands for format(eps, digits)
).
A character vector.
This is the base format.pval
function with the
ability to pass the nsmall
argument to format
format.pval(c(runif(5), pi^100, NA))
format.pval(c(0.1, 0.0001, 1e-27))
format.pval(c(0.1, 1e-27), nsmall=3)
gbayes
derives the (Gaussian) posterior and optionally the predictive
distribution when both the prior and the likelihood are Gaussian, and
when the statistic of interest comes from a 2-sample problem.
This function is especially useful in obtaining the expected power of
a statistical test, averaging over the distribution of the population
effect parameter (e.g., log hazard ratio) that is obtained using
pilot data. gbayes
is also useful for summarizing studies for
which the statistic of interest is approximately Gaussian with
known variance. An example is given for comparing two proportions
using the angular transformation, for which the variance is
independent of unknown parameters except for very extreme probabilities.
A plot
method is also given. This plots the prior, posterior, and
predictive distributions on a single graph using a nice default for
the x-axis limits and using the labcurve
function for automatic
labeling of the curves.
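The Gaussian-Gaussian conjugate update underlying gbayes can be sketched in a few lines (a minimal illustration with invented names, not the function's actual source):

```r
# Precision-weighted conjugate update: the posterior precision is the sum
# of the prior and data precisions (a sketch; gbayes also computes
# predictive distributions when future sample sizes are given)
post.gauss <- function(mean.prior, var.prior, stat, var.stat) {
  var.post  <- 1 / (1 / var.prior + 1 / var.stat)
  mean.post <- var.post * (mean.prior / var.prior + stat / var.stat)
  c(mean = mean.post, var = var.post)
}
post.gauss(0, 1000, 0.5, 0.04)  # nearly flat prior: posterior tracks the data
```

With a nearly flat prior the posterior mean is essentially the observed statistic, as the call above illustrates.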
gbayes2
uses the method of Spiegelhalter and Freedman (1986) to compute the
probability of correctly concluding that a new treatment is superior
to a control. By this we mean that a 1 - alpha
normal
theory-based confidence interval for the new minus old treatment
effect lies wholly to the right of delta.w
, where delta.w
is the
minimally worthwhile treatment effect (which can be zero to be
consistent with ordinary null hypothesis testing, a method not always
making sense). This kind of power function is averaged over a prior
distribution for the unknown treatment effect. This procedure is
applicable to the situation where a prior distribution is not to be
used in constructing the test statistic or confidence interval, but is
only used for specifying the distribution of delta
, the parameter of
interest.
Even though gbayes2
assumes that the test statistic has a normal distribution with known
variance (which is strongly a function of the sample size in the two
treatment groups), the prior distribution function can be completely
general. Instead of using a step function for the prior distribution
as Spiegelhalter and Freedman used in their appendix, gbayes2
uses
the built-in integrate
function for numerical integration.
gbayes2
also allows the variance of the test statistic to be general
as long as it is evaluated by the user. The conditional power given the
parameter of interest delta
is 1 - pnorm((delta.w - delta)/sd + z)
, where z
is the normal critical value corresponding to 1 - alpha/2.
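The expected power computation just described amounts to a single numerical integration; a minimal sketch (not gbayes2's actual source — the upper integration limit and prior.aux handling are omitted):

```r
# Average the conditional power 1 - pnorm((delta.w - delta)/sd + z)
# over the prior density for the unknown treatment effect delta
avg.power <- function(sd, prior, delta.w = 0, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  integrate(function(delta)
    (1 - pnorm((delta.w - delta) / sd + z)) * prior(delta),
    lower = -Inf, upper = Inf)$value
}
avg.power(0.15, function(d) dnorm(d, 0.5, 0.2))
```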
gbayesMixPredNoData
derives the predictive distribution of a
statistic that is Gaussian given delta
when no data have yet been
observed and when the prior is a mixture of two Gaussians.
gbayesMixPost
derives the posterior density, cdf, or posterior
mean of delta
given
the statistic x
, when the prior for delta
is a mixture of two
Gaussians and when x
is Gaussian given delta
.
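The mixture posterior that gbayesMixPost evaluates follows from standard conjugate algebra: the data update each Gaussian component, and the mixing weight is updated by the components' marginal likelihoods. A sketch of the math (an illustration, not the function's source, which returns a closure):

```r
# Posterior density of delta given x ~ N(delta, v) and the two-component
# Gaussian mixture prior mix*N(d0, v0) + (1 - mix)*N(d1, v1)
mixpost <- function(delta, x, v, mix, d0, v0, d1, v1) {
  m0 <- dnorm(x, d0, sqrt(v0 + v))   # marginal likelihood, component 0
  m1 <- dnorm(x, d1, sqrt(v1 + v))   # marginal likelihood, component 1
  w  <- mix * m0 / (mix * m0 + (1 - mix) * m1)  # updated mixing weight
  # component-wise conjugate Gaussian posteriors
  vp0 <- 1 / (1 / v0 + 1 / v); mp0 <- vp0 * (d0 / v0 + x / v)
  vp1 <- 1 / (1 / v1 + 1 / v); mp1 <- vp1 * (d1 / v1 + x / v)
  w * dnorm(delta, mp0, sqrt(vp0)) + (1 - w) * dnorm(delta, mp1, sqrt(vp1))
}
```

The result is itself a two-component Gaussian mixture, which is why a mixture prior stays tractable here.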
gbayesMixPowerNP
computes the power for a test for delta
> delta.w
for the case where (1) a Gaussian prior or mixture of two Gaussian priors
is used as the prior distribution, (2) this prior is used in forming
the statistical test or credible interval, (3) no prior is used for
the distribution of delta
for computing power but instead a fixed
single delta
is given (as in traditional frequentist hypothesis
tests), and (4) the test statistic has a Gaussian likelihood with
known variance (and mean equal to the specified delta
).
gbayesMixPowerNP
is handy where you want to use an earlier study in
testing for treatment effects in a new study, but you want to mix with
this prior a noninformative prior. The mixing probability mix
can
be thought of as the "applicability" of the previous study. As with
gbayes2
, power here means the probability that the new study will
yield a left credible interval that is to the right of delta.w
.
gbayes1PowerNP
is a special case of gbayesMixPowerNP
when the
prior is a single Gaussian.
gbayes(mean.prior, var.prior, m1, m2, stat, var.stat,
n1, n2, cut.prior, cut.prob.prior=0.025)
## S3 method for class 'gbayes'
plot(x, xlim, ylim, name.stat='z', ...)
gbayes2(sd, prior, delta.w=0, alpha=0.05, upper=Inf, prior.aux)
gbayesMixPredNoData(mix=NA, d0=NA, v0=NA, d1=NA, v1=NA,
what=c('density','cdf'))
gbayesMixPost(x=NA, v=NA, mix=1, d0=NA, v0=NA, d1=NA, v1=NA,
what=c('density','cdf','postmean'))
gbayesMixPowerNP(pcdf, delta, v, delta.w=0, mix, interval,
nsim=0, alpha=0.05)
gbayes1PowerNP(d0, v0, delta, v, delta.w=0, alpha=0.05)
mean.prior 
mean of the prior distribution 
cut.prior , cut.prob.prior , var.prior

variance of the prior. Use a large number such as 10000 to effectively
use a flat (noninformative) prior. Sometimes it is useful to compute
the variance so that the prior probability that 
m1 
sample size in group 1 
m2 
sample size in group 2 
stat 
statistic comparing groups 1 and 2, e.g., log hazard ratio, difference in means, difference in angular transformations of proportions 
var.stat 
variance of 
x 
an object returned by 
sd 
the standard deviation of the treatment effect 
prior 
a function of possibly a vector of unknown treatment effects, returning the prior density at those values 
pcdf 
a function computing the posterior CDF of the treatment effect

delta 
a true unknown single treatment effect to detect 
v 
the variance of the statistic 
n1 
number of future observations in group 1, for obtaining a predictive distribution 
n2 
number of future observations in group 2 
xlim 
vector of 2 x-axis limits. Default is the mean of the posterior plus or minus 6 standard deviations of the posterior. 
ylim 
vector of 2 y-axis limits. Default is the range over combined prior and posterior densities. 
name.stat 
label for x-axis. Default is 
... 
optional arguments passed to 
delta.w 
the minimum worthwhile treatment difference to detect. The default is zero for a plain uninteresting null hypothesis. 
alpha 
type I error, or more accurately one minus the confidence level for a two-sided confidence limit for the treatment effect 
upper 
upper limit of integration over the prior distribution multiplied by the normal likelihood for the treatment effect statistic. Default is infinity. 
prior.aux 
argument to pass to 
mix 
mixing probability or weight for the Gaussian prior having mean 
d0 
mean of the first Gaussian distribution (only Gaussian for

v0 
variance of the first Gaussian (only Gaussian for

d1 
mean of the second Gaussian (if 
v1 
variance of the second Gaussian (if 
what 
specifies whether the predictive density or the CDF is to be
computed. Default is 
interval 
a 2vector containing the lower and upper limit for possible values of
the test statistic 
nsim 
defaults to zero, causing 
gbayes
returns a list of class "gbayes"
containing the following
named elements: mean.prior
,var.prior
,mean.post
, var.post
, and
if n1
is specified, mean.pred
and var.pred
. Note that
mean.pred
is identical to mean.post
. gbayes2
returns a single
number which is the probability of correctly rejecting the null
hypothesis in favor of the new treatment. gbayesMixPredNoData
returns a function that can be used to evaluate the predictive density
or cumulative distribution. gbayesMixPost
returns a function that
can be used to evaluate the posterior density or cdf. gbayesMixPowerNP
returns a vector containing two values if nsim
= 0. The first value is the
critical value for the test statistic that will make the left credible
interval > delta.w
, and the second value is the power. If nsim
> 0,
it returns the power estimate and confidence limits for it if nsim
>
0. The examples show how to use these functions.
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com
Spiegelhalter DJ, Freedman LS, Parmar MKB (1994): Bayesian approaches to
randomized trials. JRSS A 157:357–416. Results for gbayes
are derived from
Equations 1, 2, 3, and 6.
Spiegelhalter DJ, Freedman LS (1986): A predictive approach to selecting the size of a clinical trial, based on subjective clinical opinion. Stat in Med 5:1–13.
Joseph, Lawrence and Belisle, Patrick (1997): Bayesian sample size determination for normal means and differences between normal means. The Statistician 46:209–226.
Grouin, JM, Coste M, Bunouf P, Lecoutre B (2007): Bayesian sample size determination in nonsequential clinical trials: Statistical aspects and some regulatory considerations. Stat in Med 26:4914–4924.
# Compare 2 proportions using the var stabilizing transformation
# arcsin(sqrt((x+3/8)/(n+3/4))) (Anscombe), which has variance
# 1/[4(n+.5)]
m1 <- 100; m2 <- 150
deaths1 <- 10; deaths2 <- 30
f <- function(events,n) asin(sqrt((events+3/8)/(n+3/4)))
stat <- f(deaths1,m1) - f(deaths2,m2)
var.stat <- function(m1, m2) 1/4/(m1+.5) + 1/4/(m2+.5)
cat("Test statistic:",format(stat)," s.d.:",
format(sqrt(var.stat(m1,m2))), "\n")
#Use unbiased prior with variance 1000 (almost flat)
b <- gbayes(0, 1000, m1, m2, stat, var.stat, 2*m1, 2*m2)
print(b)
plot(b)
#To get posterior Prob[parameter > w] use
# 1 - pnorm(w, b$mean.post, sqrt(b$var.post))
#If g(effect, n1, n2) is the power function to
#detect an effect of 'effect' with sample sizes for groups 1 and 2
#of n1,n2, estimate the expected power by getting 1000 random
#draws from the posterior distribution, computing power for
#each value of the population effect, and averaging the 1000 powers
#This code assumes that g will accept vector-valued 'effect'
#For the 2-sample proportion problem just addressed, 'effect'
#could be taken approximately as the change in the arcsin of
#the square root of the probability of the event
g <- function(effect, n1, n2, alpha=.05) {
sd <- sqrt(var.stat(n1,n2))
z <- qnorm(1 - alpha/2)
effect <- abs(effect)
1 - pnorm(z - effect/sd) + pnorm(-z - effect/sd)
}
effects <- rnorm(1000, b$mean.post, sqrt(b$var.post))
powers <- g(effects, 500, 500)
hist(powers, nclass=35, xlab='Power')
describe(powers)
# gbayes2 examples
# First consider a study with a binary response where the
# sample size is n1=500 in the new treatment arm and n2=300
# in the control arm. The parameter of interest is the
# treated:control log odds ratio, which has variance
# 1/[n1 p1 (1-p1)] + 1/[n2 p2 (1-p2)]. This is not
# really constant so we average the variance over plausible
# values of the probabilities of response p1 and p2. We
# think that these are between .4 and .6 and we take a
# further short cut
v <- function(n1, n2, p1, p2) 1/(n1*p1*(1-p1)) + 1/(n2*p2*(1-p2))
n1 <- 500; n2 <- 300
ps <- seq(.4, .6, length=100)
vguess <- quantile(v(n1, n2, ps, ps), .75)
vguess
# 75%
# 0.02183459
# The minimally interesting treatment effect is an odds ratio
# of 1.1. The prior distribution on the log odds ratio is
# a 50:50 mixture of a vague Gaussian (mean 0, sd 100) and
# an informative prior from a previous study (mean 1, sd 1)
prior <- function(delta)
0.5*dnorm(delta, 0, 100)+0.5*dnorm(delta, 1, 1)
deltas <- seq(-5, 5, length=150)
plot(deltas, prior(deltas), type='l')
# Now compute the power, averaged over this prior
gbayes2(sqrt(vguess), prior, log(1.1))
# [1] 0.6133338
# See how much power is lost by ignoring the previous
# study completely
gbayes2(sqrt(vguess), function(delta)dnorm(delta, 0, 100), log(1.1))
# [1] 0.4984588
# What happens to the power if we really don't believe the treatment
# is very effective? Let's use a prior distribution for the log
# odds ratio that is uniform between log(1.2) and log(1.3).
# Also check the power against a true null hypothesis
prior2 <- function(delta) dunif(delta, log(1.2), log(1.3))
gbayes2(sqrt(vguess), prior2, log(1.1))
# [1] 0.1385113
gbayes2(sqrt(vguess), prior2, 0)
# [1] 0.3264065
# Compare this with the power of a twosample binomial test to
# detect an odds ratio of 1.25
bpower(.5, odds.ratio=1.25, n1=500, n2=300)
# Power
# 0.3307486
# For the original prior, consider a new study with equal
# sample sizes n in the two arms. Solve for n to get a
# power of 0.9. For the variance of the log odds ratio
# assume a common p in the center of a range of suspected
# probabilities of response, 0.3. For this example we
# use a zero null value and the uniform prior above
v <- function(n) 2/(n*.3*.7)
pow <- function(n) gbayes2(sqrt(v(n)), prior2)
uniroot(function(n) pow(n) - 0.9, c(50,10000))$root
# [1] 2119.675
# Check this value
pow(2119.675)
# [1] 0.9
# Get the posterior density when there is a mixture of two priors,
# with mixing probability 0.5. The first prior is almost
# noninformative (normal with mean 0 and variance 10000) and the
# second has mean 2 and variance 0.3. The test statistic has a value
# of 3 with variance 0.4.
f <- gbayesMixPost(3, 4, mix=0.5, d0=0, v0=10000, d1=2, v1=0.3)
args(f)
# Plot this density
delta <- seq(-2, 6, length=150)
plot(delta, f(delta), type='l')
# Add to the plot the posterior density that used only
# the almost noninformative prior
lines(delta, f(delta, mix=1), lty=2)
# The same but for an observed statistic of zero
lines(delta, f(delta, mix=1, x=0), lty=3)
# Derive the CDF instead of the density
g <- gbayesMixPost(3, 4, mix=0.5, d0=0, v0=10000, d1=2, v1=0.3,
what='cdf')
# Had mix=0 or 1, gbayes1PowerNP could have been used instead
# of gbayesMixPowerNP below
# Compute the power to detect an effect of delta=1 if the variance
# of the test statistic is 0.2
gbayesMixPowerNP(g, 1, 0.2, interval=c(-10,12))
# Do the same thing by simulation
gbayesMixPowerNP(g, 1, 0.2, interval=c(-10,12), nsim=20000)
# Compute by what factor the sample size needs to be larger
# (the variance needs to be smaller) so that the power is 0.9
ratios <- seq(1, 4, length=50)
pow <- single(50)
for(i in 1:50)
pow[i] <- gbayesMixPowerNP(g, 1, 0.2/ratios[i], interval=c(-10,12))[2]
# Solve for ratio using reverse linear interpolation
approx(pow, ratios, xout=0.9)$y
# Check this by computing power
gbayesMixPowerNP(g, 1, 0.2/2.1, interval=c(-10,12))
# So the study will have to be 2.1 times as large as earlier thought
Simulate Bayesian Sequential Treatment Comparisons Using a Gaussian Model
gbayesSeqSim(est, asserts)
est 
data frame created by 
asserts 
list of lists. The first element of each list is the userspecified name for each assertion/prior combination, e.g., 
Simulates a sequential trial under a Gaussian model for parameter estimates with Gaussian priors, using simulated estimates and variances returned by estSeqSim
. For each row of the data frame est
and for each prior/assertion combination, computes the posterior probability of the assertion.
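Under the Gaussian model each posterior probability reduces to a single pnorm call after a conjugate update; the idea can be sketched as follows (hypothetical argument names, not gbayesSeqSim's actual source):

```r
# Posterior probability that the parameter is below `cutoff`, given an
# estimate `est` with variance `vest` and a N(mu, sigma^2) prior
# (hypothetical helper illustrating the per-look calculation)
post.prob.less <- function(est, vest, mu, sigma, cutoff = 0) {
  var.post  <- 1 / (1 / sigma^2 + 1 / vest)
  mean.post <- var.post * (mu / sigma^2 + est / vest)
  pnorm(cutoff, mean.post, sqrt(var.post))
}
post.prob.less(est = -0.4, vest = 0.02, mu = 0.1, sigma = 1)
```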
a data frame with number of rows equal to that of est
with a number of new columns equal to the number of assertions added. The new columns are named p1
, p2
, p3
, ... (posterior probabilities), mean1
, mean2
, ... (posterior means), and sd1
, sd2
, ... (posterior standard deviations). The returned data frame also has an attribute asserts
added which is the original asserts
augmented with any derived mu
and sigma
and converted to a data frame, and another attribute alabels
which is a named vector used to map p1
, p2
, ... to the userprovided labels in asserts
.
Frank Harrell
gbayes()
, estSeqSim()
, simMarkovOrd()
, estSeqMarkovOrd()
## Not run:
# Simulate Bayesian operating characteristics for an unadjusted
# proportional odds comparison (Wilcoxon test)
# For 100 simulations, 5 looks, 2 true parameter values, and
# 2 assertion/prior combinations, compute the posterior probability
# Use a low-level logistic regression call to speed up simulations
# Use data.table to compute various summary measures
# Total simulation time: 2s
lfit <- function(x, y) {
f <- rms::lrm.fit(x, y)
k <- length(coef(f))
c(coef(f)[k], vcov(f)[k, k])
}
gdat <- function(beta, n1, n2) {
# Cell probabilities for a 7-category ordinal outcome for the control group
p <- c(2, 1, 2, 7, 8, 38, 42) / 100
# Compute cell probabilities for the treated group
p2 <- pomodm(p=p, odds.ratio=exp(beta))
y1 <- sample(1 : 7, n1, p, replace=TRUE)
y2 <- sample(1 : 7, n2, p2, replace=TRUE)
list(y1=y1, y2=y2)
}
# Assertion 1: log(OR) < 0 under prior with prior mean 0.1 and sigma 1 on log OR scale
# Assertion 2: OR between 0.9 and 1/0.9 with prior mean 0 and sigma computed so that
# P(OR > 2) = 0.05
asserts <- list(list('Efficacy', '<', 0, mu=0.1, sigma=1),
list('Similarity', 'in', log(c(0.9, 1/0.9)),
cutprior=log(2), tailprob=0.05))
set.seed(1)
est <- estSeqSim(c(0, log(0.7)), looks=c(50, 75, 95, 100, 200),
gendat=gdat,
fitter=lfit, nsim=100)
z <- gbayesSeqSim(est, asserts)
head(z)
attr(z, 'asserts')
# Compute the proportion of simulations that hit targets (different target posterior
# probabilities for efficacy vs. similarity)
# For the efficacy assessment compute the first look at which the target
# was hit (set to infinity if never hit)
require(data.table)
z <- data.table(z)
u <- z[, .(first=min(look[p1 > 0.95])), by=.(parameter, sim)]
# Compute the proportion of simulations that ever hit the target and
# that hit it by the 100th subject
u[, .(ever=mean(first < Inf)), by=.(parameter)]
u[, .(by75=mean(first <= 100)), by=.(parameter)]
## End(Not run)
Data from The Analysis of Biological Data by Whitlock and Schluter
getabd(name = "", lowernames = FALSE, allow = "_")
name 
name of dataset to fetch. Omit to get a data table listing all available datasets. 
lowernames 
set to 
allow 
set to 
Fetches csv files for exercises in the book
data frame with attributes label
and url
Frank Harrell
This function downloads and makes ready to use datasets from the main
web site for the Hmisc and rms libraries. For R, the
datasets were stored in compressed save
format and
getHdata
makes them available by running load
after download. For S-Plus, the datasets were stored in
data.dump
format and are made available by running
data.restore
after import. The dataset is run through the
cleanup.import
function. Calling getHdata
with no
file
argument provides a character vector of names of available
datasets that are currently on the web site. For R, R's default
browser can optionally be launched to view html
files that were
already prepared using the Hmisc command
html(contents())
or to view ‘.txt’ or ‘.html’ data
description files when available.
If options(localHfiles=TRUE)
the scripts are read from local directory
~/web/data/repo
instead of from the web server.
getHdata(file, what = c("data", "contents", "description", "all"),
where="https://hbiostat.org/data/repo")
file 
an unquoted name of a dataset on the web site, e.g. ‘prostate’.
Omit 
what 
specify 
where 
URL containing the data and metadata files 
getHdata()
without a file
argument returns a character
vector of dataset base names. When a dataset is downloaded, the data
frame is placed in search position one and is not returned as value of
getHdata
.
Frank Harrell
download.file
, cleanup.import
,
data.restore
, load
## Not run:
getHdata() # download list of available datasets
getHdata(prostate) # downloads, load( ) or data.restore( )
# runs cleanup.import for SPlus 6
getHdata(valung, "contents") # open browser (options(browser="whatever"))
# after downloading valung.html
# (result of html(contents()))
getHdata(support, "all") # download and open one browser window
datadensity(support)
attach(support) # make individual variables available
getHdata(plasma, "all") # download and open two browser windows
# (description file is available for plasma)
## End(Not run)
The github rscripts project at
https://github.com/harrelfe/rscripts contains R scripts that are
primarily analysis templates for teaching with RStudio. This function
allows the user to print an organized list of available scripts, to
download a script and source()
it into the current session (the
default), to
download a script and load it into an RStudio script editor window, to
list scripts whose major category contains a given string (ignoring
case), or to list all major and minor categories. If
options(localHfiles=TRUE)
the scripts are read from local directory
~/R/rscripts
instead of from github.
getRs(file=NULL, guser='harrelfe', grepo='rscripts', gdir='raw/master',
dir=NULL, browse=c('local', 'browser'), cats=FALSE,
put=c('source', 'rstudio'))
file 
a character string containing a script file name.
Omit 
guser 
GitHub user name, default is 
grepo 
Github repository name, default is 
gdir 
Github directory under which to find retrievable files 
dir 
directory under 
browse 
When showing the rscripts contents directory, the
default is to list in tabular form in the console. Specify

cats 
Leave at the default ( 
put 
Leave at the default ( 
a data frame or list, depending on arguments
Frank Harrell and Cole Beck
## Not run:
getRs() # list available scripts
scripts <- getRs() # likewise, but store in an object that can easily
# be viewed on demand in RStudio
getRs('introda.r') # download introda.r and put in script editor
getRs(cats=TRUE) # list available major and minor categories
categories <- getRs(cats=TRUE)
# likewise but store results in a list for later viewing
getRs(cats='reg') # list all scripts in a major category containing 'reg'
getRs('importREDCap.r') # source() to define a function
# source() a new version of the Hmisc package's cut2 function:
getRs('cut2.s', grepo='Hmisc', dir='R')
## End(Not run)
Allows downloading and reading of a zip file containing one file
getZip(url, password=NULL)
url 
either a path to a local file or a valid URL. 
password 
required to decode passwordprotected zip files 
Allows downloading and reading of zip file containing one file. The file may be password protected. If a password is needed then one will be requested unless given.
Note: to make password-protected zip file z.zip, do zip -e z myfile
Returns a file I/O pipe.
Frank E. Harrell
## Not run:
read.csv(getZip('http://test.com/z.zip'))
## End(Not run)
Uses ggplot2
to plot a scatterplot or dot-like chart for the case
where there is a very large number of overlapping values. This works
for continuous and categorical x
and y
. For continuous
variables it serves the same purpose as hexagonal binning. Counts for
overlapping points are grouped into quantile groups and level of
transparency and rainbow colors are used to provide count information.
Instead, you can specify stick=TRUE
to use no color but instead to encode cell frequencies
with the height of a black line y-centered at the middle of the bins.
Relative frequencies are not transformed, and the maximum cell
frequency is shown in a caption. Every point with a frequency of at
least one is depicted with a full-height light gray vertical
line, scaled to the above overall maximum frequency. In this way the
relative frequency is the proportion of each light gray line that is
black, and one can see points whose frequencies are too low to show
black lines.
The result can also be passed to ggplotly
. Actual cell
frequencies are added to the hover text in that case using the
label
ggplot2
aesthetic.
ggfreqScatter(x, y, by=NULL, bins=50, g=10, cuts=NULL,
xtrans = function(x) x,
ytrans = function(y) y,
xbreaks = pretty(x, 10),
ybreaks = pretty(y, 10),
xminor = NULL, yminor = NULL,
xlab = as.character(substitute(x)),
ylab = as.character(substitute(y)),
fcolors = viridis::viridis(10), nsize=FALSE,
stick=FALSE, html=FALSE, prfreq=FALSE, ...)
x 
xvariable 
y 
yvariable 
by 
an optional vector used to make separate plots for each
distinct value using 
bins 
for continuous 
g 
number of quantile groups to make for frequency counts. Use

cuts 
instead of using 
xtrans , ytrans

functions specifying transformations to be made before binning and plotting 
xbreaks , ybreaks

vectors of values to label on axis, on original scale 
xminor , yminor

values at which to put minor tick marks, on original scale 
xlab , ylab

axis labels. If not specified and variable has a

fcolors 

nsize 
set to 
stick 
set to 
html 
set to 
prfreq 
set to 
... 
arguments to pass to 
a ggplot
object
Frank Harrell
require(ggplot2)
set.seed(1)
x <- rnorm(1000)
y <- rnorm(1000)
count <- sample(1:100, 1000, TRUE)
x <- rep(x, count)
y <- rep(y, count)
# color=alpha=NULL below makes loess smooth over all points
g <- ggfreqScatter(x, y) + # might add g=0 if using plotly
geom_smooth(aes(color=NULL, alpha=NULL), se=FALSE) +
ggtitle("Using Deciles of Frequency Counts, 2500 Bins")
g
# plotly::ggplotly(g, tooltip='label') # use plotly, hover text = freq. only
# Plotly makes it somewhat interactive, with hover text tooltips
# Instead use varyingheight sticks to depict frequencies
ggfreqScatter(x, y, stick=TRUE) +
labs(subtitle='Relative height of black lines to gray lines
is proportional to cell frequency.
Note that points with even tiny frequency are visible
(gray line with no visible black line).')
# Try with x categorical
x1 <- sample(c('cat', 'dog', 'giraffe'), length(x), TRUE)
ggfreqScatter(x1, y)
# Try with y categorical
y1 <- sample(LETTERS[1:10], length(x), TRUE)
ggfreqScatter(x, y1)
# Both categorical, larger point symbols, box instead of circle
ggfreqScatter(x1, y1, shape=15, size=7)
# Vary box size instead
ggfreqScatter(x1, y1, nsize=TRUE, shape=15)
Render plotly
Graphic from a ggplot2
Object
ggplotlyr(ggobject, tooltip = "label", remove = "txt: ", ...)
ggobject 
an object produced by 
tooltip 
attribute specified to 
remove 
extraneous text to remove from hover text. Default is set to assume 
... 
other arguments passed to 
Uses plotly::ggplotly()
to render a plotly
graphic with a specified tooltip attribute, removing extraneous text that ggplotly
puts in hover text when tooltip='label'
a plotly
object
Frank Harrell
GiniMD
computes Gini's mean difference on a
numeric vector. This index is defined as the mean absolute difference
between any two distinct elements of a vector. For a Bernoulli
(binary) variable with proportion of ones equal to $p$
and sample
size $n$
, Gini's mean difference is
$2\frac{n}{n-1}p(1-p)$
. For a
trinomial variable (e.g., predicted values for a 3-level categorical
predictor using two dummy variables) having (predicted)
values $A, B, C$
with corresponding proportions $a, b, c$
,
Gini's mean difference is
$2\frac{n}{n-1}[ab|A-B|+ac|A-C|+bc|B-C|]$
GiniMd(x, na.rm=FALSE)
x 
a numeric vector (for 
na.rm 
set to 
a scalar numeric
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
David HA (1968): Gini's mean difference rediscovered. Biometrika 55:573–575.
set.seed(1)
x <- rnorm(40)
# Test GiniMd against a brute-force solution
gmd <- function(x) {
n <- length(x)
sum(outer(x, x, function(a, b) abs(a - b))) / n / (n - 1)
}
GiniMd(x)
gmd(x)
z <- c(rep(0,17), rep(1,6))
n <- length(z)
GiniMd(z)
2*mean(z)*(1-mean(z))*n/(n-1)
a <- 12; b <- 13; c <- 7; n <- a + b + c
A <- .123; B <- .707; C <- 0.523
xx <- c(rep(A, a), rep(B, b), rep(C, c))
GiniMd(xx)
2*(a*b*abs(A-B) + a*c*abs(A-C) + b*c*abs(B-C))/n/(n-1)
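For larger vectors the O(n^2) pairwise sum is wasteful; Gini's mean difference also has an O(n log n) form based on sorted values (a sketch using a standard order-statistic identity, not necessarily GiniMd's own implementation):

```r
# Gini mean difference from order statistics:
# 2/(n(n-1)) * sum over i of (2i - n - 1) * x_(i)
gmd.fast <- function(x) {
  x <- sort(x)
  n <- length(x)
  sum((2 * seq_len(n) - n - 1) * x) * 2 / (n * (n - 1))
}
set.seed(1)
x <- rnorm(40)
gmd.fast(x)   # agrees with the brute-force double sum
```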
Check for Changes in List of Objects
hashCheck(..., file, .print. = TRUE, .names. = NULL)
... 
a list of objects including data frames, vectors, functions, and all other types of R objects that represent dependencies of a certain calculation 
file 
name of file in which results are stored 
.print. 
set to 
.names. 
vector of names of original arguments if not calling 
Given an RDS file name and a list of objects, does the following:
makes a vector of hashes, one for each object. Function objects are run through deparse
so that the environment of the function will not be considered.
see if the file exists; if not, return a list with result=NULL, hash
= new vector of hashes, changed='All'
if the file exists, read the file and its hash attribute as prevhash
if prevhash
is not identical to hash:
if .print.=TRUE
(default), print to console a summary of what's changed
return a list with result=NULL, hash
= new hash vector, changed
if prevhash = hash
, return a list with result=file object, hash
=new hash, changed=”
Set options(debughash=TRUE)
to trace results in /tmp/debughash.txt
a list
with elements result
(the computations), hash
(the new hash), and changed
which details what changed to make computations need to be run
Frank Harrell
Computes the Harrell-Davis (1982) quantile estimator and jackknife standard errors of quantiles. The quantile estimator is a weighted linear combination of order statistics in which the order statistics used in traditional nonparametric quantile estimators are given the greatest weight. In small samples the HD estimator is more efficient than traditional ones, and the two methods are asymptotically equivalent. The HD estimator is the limit of a bootstrap average as the number of bootstrap resamples becomes infinitely large.
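The weighting scheme can be sketched directly: the weight on the i-th order statistic is an increment of the Beta((n+1)p, (n+1)(1-p)) distribution function (a minimal single-quantile sketch; hdquantile itself uses compiled code and adds jackknife standard errors):

```r
# Harrell-Davis estimate of the p-th quantile (sketch)
hd1 <- function(x, p) {
  n <- length(x)
  i <- seq_len(n)
  a <- (n + 1) * p; b <- (n + 1) * (1 - p)
  # Beta cdf increments give the weights; they sum to 1
  w <- pbeta(i / n, a, b) - pbeta((i - 1) / n, a, b)
  sum(w * sort(x))
}
set.seed(1)
x <- rnorm(500)
hd1(x, 0.5)   # close to median(x) in large samples
```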
hdquantile(x, probs = seq(0, 1, 0.25),
se = FALSE, na.rm = FALSE, names = TRUE, weights=FALSE)
x 
a numeric vector 
probs 
vector of quantiles to compute 
se 
set to 
na.rm 
set to 
names 
set to 
weights 
set to 
A Fortran routine is used to compute the jackknife leave-out-one
quantile estimates. Standard errors are not computed for quantiles 0 or
1 (NA
s are returned).
A vector of quantiles. If se=TRUE
this vector will have an
attribute se
added to it, containing the standard errors. If
weights=TRUE
, also has a "weights"
attribute which is a matrix.
Frank Harrell
Harrell FE, Davis CE (1982): A new distribution-free quantile estimator. Biometrika 69:635–640.
Hutson AD, Ernst MD (2000): The exact bootstrap mean and variance of an L-estimator. J Roy Statist Soc B 62:89–94.
set.seed(1)
x <- runif(100)
hdquantile(x, (1:3)/4, se=TRUE)
## Not run:
# Compare jackknife standard errors with those from the bootstrap
library(boot)
boot(x, function(x,i) hdquantile(x[i], probs=(1:3)/4), R=400)
## End(Not run)
Moving and hiding table of contents for Rmd HTML documents
hidingTOC(
buttonLabel = "Contents",
levels = 3,
tocSide = c("right", "left"),
buttonSide = c("right", "left"),
posCollapse = c("margin", "top", "bottom"),
hidden = FALSE
)
buttonLabel 
the text on the button that hides and unhides the
table of contents. Defaults to 
levels 
the maximum depth of the table of contents over which display control is desired (defaults to 3) 
tocSide 
which side of the page should the table of contents be placed
on. Can be either 
buttonSide 
which side of the page should the button that hides the TOC
be placed on. Can be either 
posCollapse 
if 
Logical: should the table of contents be hidden at page load?
Defaults to 
hidingTOC
creates a table of contents in an Rmd document that
can be hidden at the press of a button. It also generates buttons that allow
the hiding or unhiding of the different level depths of the table of contents.
an HTML formatted text string to be inserted into a markdown document
Thomas Dupont
## Not run:
hidingTOC()
## End(Not run)
This function tries to compute the maximum number of histograms that will fit on one page, then it draws a matrix of histograms. If there are more qualifying variables than will fit on a page, the function waits for a mouse click before drawing the next page.
## S3 method for class 'data.frame'
hist(x, n.unique = 3, nclass = "compute",
na.big = FALSE, rugs = FALSE, freq=TRUE, mtitl = FALSE, ...)
x 
a data frame 
n.unique 
minimum number of unique values a variable must have before a histogram is drawn 
nclass 
number of bins. Default is max(2,trunc(min(n/10,25*log(n,10))/2)), where n is the number of nonmissing values for a variable. 
na.big 
set to 
rugs 
set to 
freq 
see 
mtitl 
set to a character string to set aside extra outside top margin and to use the string for an overall title 
... 
arguments passed to 
the number of pages drawn
Frank E Harrell Jr
d <- data.frame(a=runif(200), b=rnorm(200),
w=factor(sample(c('green','red','blue'), 200, TRUE)))
hist.data.frame(d) # in R, just say hist(d)
Takes two vectors or a list with x
and y
components, and produces
back-to-back histograms of the two datasets.
histbackback(x, y, brks=NULL, xlab=NULL, axes=TRUE, probability=FALSE,
xlim=NULL, ylab='', ...)
x , y

either two vectors or a list given as 
brks 
vector of the desired breakpoints for the histograms. 
xlab 
a vector of two character strings naming the two datasets. 
axes 
logical flag stating whether or not to label the axes. 
probability 
logical flag: if 
xlim 
x-axis limits. First value must be negative, as the left histogram is
placed at negative x-values. Second value must be positive, for the
right histogram. To make the limits symmetric, use e.g. 
ylab 
label for y-axis. Default is no label. 
... 
additional graphics parameters may be given. 
a list is returned invisibly with the following components:
left 
the counts for the dataset plotted on the left. 
right 
the counts for the dataset plotted on the right. 
breaks 
the breakpoints used. 
a plot is produced on the current graphics device.
Pat Burns
Salomon Smith Barney
London
pburns@dorado.sbi.com
options(digits=3)
set.seed(1)
histbackback(rnorm(20), rnorm(30))
fool <- list(x=rnorm(40), y=rnorm(40))
histbackback(fool)
age <- rnorm(1000,50,10)
sex <- sample(c('female','male'),1000,TRUE)
histbackback(split(age, sex))
agef <- age[sex=='female']; agem <- age[sex=='male']
histbackback(list(Female=agef,Male=agem), probability=TRUE, xlim=c(-.06,.06))
Uses plotly
to draw horizontal spike histograms stratified by
group
, plus the mean (solid dot) and vertical bars for these
quantiles: 0.05 (red, short), 0.25 (blue, medium), 0.50 (black, long),
0.75 (blue, medium), and 0.95 (red, short). The robust dispersion measure
Gini's mean difference and the SD may optionally be added. These are
shown as horizontal lines starting at the minimum value of x
having a length equal to the mean difference or SD. Even when Gini's
and SD are computed, they are not drawn unless the user clicks on their
legend entry.
Spike histograms have the advantage of effectively showing the raw data for both small and huge datasets, and, unlike box plots, they allow multimodality to be seen easily.
histboxpM
plots multiple histograms stacked vertically, for
variables in a data frame having a common group
variable (if any)
and combined using plotly::subplot
.
dhistboxp
is like histboxp
but no plotly
graphics
are actually drawn. Instead, a data frame suitable for use with
plotlyM
is returned. For dhistboxp
an additional level of
stratification strata
is implemented. group
causes a
different result here to produce back-to-back histograms (in the case of
two groups) for each level of strata
.
histboxp(p = plotly::plot_ly(height=height), x, group = NULL,
xlab=NULL, gmd=TRUE, sd=FALSE, bins = 100, wmax=190, mult=7,
connect=TRUE, showlegend=TRUE)
dhistboxp(x, group = NULL, strata=NULL, xlab=NULL,
gmd=FALSE, sd=FALSE, bins = 100, nmin=5, ff1=1, ff2=1)
histboxpM(p=plotly::plot_ly(height=height, width=width), x, group=NULL,
gmd=TRUE, sd=FALSE, width=NULL, nrows=NULL, ncols=NULL, ...)
p 

x 
a numeric vector, or for 
group 
a discrete grouping variable. If omitted, defaults to a vector of ones 
strata 
a discrete numeric stratification variable. Values are also used to space out different spike histograms. Defaults to a vector of ones. 
xlab 
x-axis label; defaults to the labelled version of the variable, including units of measurement if any 
gmd 
set to 
sd 
set to 
width 
width in pixels 
nrows 
number of rows for layout of multiple plots 
ncols 
number of columns for layout of multiple plots. At most
one of 
bins 
number of equal-width bins to use for spike histogram. If
the number of distinct values of 
nmin 
minimum number of non-missing observations for a group-stratum combination before the spike histogram and quantiles are drawn 
ff1 , ff2

fudge factors for position and bar length for spike histograms 
wmax , mult

tweaks for margin to allocate 
connect 
set to 
showlegend 
used if producing multiple plots to be combined with

... 
other arguments for 
a plotly
object. For dhistboxp
a data frame as
expected by plotlyM
Frank Harrell
histSpike
, plot.describe
,
scat1d
## Not run:
dist <- c(rep(1, 500), rep(2, 250), rep(3, 600))
Distribution <- factor(dist, 1 : 3, c('Unimodal', 'Bimodal', 'Trimodal'))
x <- c(rnorm(500, 6, 1),
rnorm(200, 3, .7), rnorm(50, 7, .4),
rnorm(200, 2, .7), rnorm(300, 5.5, .4), rnorm(100, 8, .4))
histboxp(x=x, group=Distribution, sd=TRUE)
X <- data.frame(x, x2=runif(length(x)))
histboxpM(x=X, group=Distribution, ncols=2) # separate plots
## End(Not run)
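The binning behind a spike histogram can be sketched in a few lines of base R. This is an illustrative reduction (`spike_bins` is a made-up name), not the Hmisc code, which additionally handles grouping, quantile bars, and plotly traces:

```r
# Cut x into equal-width bins and tabulate the count per bin; spike
# heights are proportional to these counts, so raw-data detail is
# retained for both small and large n.
spike_bins <- function(x, bins = 100) {
  breaks <- seq(min(x), max(x), length.out = bins + 1)
  mids   <- (breaks[-1] + breaks[-length(breaks)]) / 2
  counts <- as.vector(table(cut(x, breaks, include.lowest = TRUE)))
  data.frame(mid = mids, count = counts)
}

b <- spike_bins(c(rep(1, 5), rep(2, 3)), bins = 2)
b$count  # 5 3
```

Each row of the returned data frame corresponds to one vertical spike at `mid` with height proportional to `count`.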
Easy Extraction of Labels/Units Expressions for Plotting
hlab(x, name = NULL, html = FALSE, plotmath = TRUE)
x 
a single variable name, unquoted 
name 
a single character string providing an alternate way to name 
html 
set to 
plotmath 
set to 
Given a single unquoted variable, first looks to see if a non-NULL
LabelsUnits
object exists (produced by extractlabs()
). When LabelsUnits
does not exist or is NULL
, looks up the attributes in the current dataset, which defaults to d
or may be specified by options(current_ds='name of the data frame/table')
. Finally the existence of a variable of the given name in the global environment is checked. When a variable is not found in any of these three sources or has a blank label
and units
, an expression()
with the variable name alone is returned. If html=TRUE
, HTML strings are constructed instead, suitable for plotly
graphics.
The result is useful for xlab
and ylab
in base plotting functions or in ggplot2
, along with being useful for labs
in ggplot2
. See example.
an expression created by labelPlotmath
with plotmath=TRUE
Frank Harrell
label()
, units()
, contents()
, hlabs()
, extractlabs()
, plotmath
d <- data.frame(x=1:10, y=(1:10)/10)
d <- upData(d, labels=c(x='X', y='Y'), units=c(x='mmHg'), print=FALSE)
hlab(x)
hlab(x, html=TRUE)
hlab(z)
require(ggplot2)
ggplot(d, aes(x, y)) + geom_point() + labs(x=hlab(x), y=hlab(y))
# Can use xlab(hlab(x)) + ylab(hlab(y)) also
# Store names, labels, units for all variables in d in object
LabelsUnits <- extractlabs(d)
# Remove d; labels/units still found
rm(d)
hlab(x)
# Remove LabelsUnits and use a current dataset named
# d2 instead of the default d
rm(LabelsUnits)
options(current_ds='d2')
Frontend to ggplot2 labs Function
hlabs(x, y, html = FALSE)
x 
a single variable name, unquoted 
y 
a single variable name, unquoted 
html 
set to 
Runs x
, y
, or both through hlab()
and passes the constructed labels to the ggplot2::labs function to specify x- and y-axis labels specially formatted for units of measurement
result of ggplot2::labs()
Frank Harrell
# Name the current dataset d, or specify a name with
# options(current_ds='...') or run `extractlabs`, then
# ggplot(d, aes(x,y)) + geom_point() + hlabs(x,y)
# to specify only the x-axis label use hlabs(x), or to
# specify only the y-axis label use hlabs(y=...)
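The lookup hlab and hlabs perform can be approximated in base R: Hmisc stores the label and units as attributes on the variable. A sketch of the fallback logic (`make_axis_label` is a hypothetical helper; it returns a plain string rather than the plotmath expression or HTML the real functions build):

```r
# Read the 'label' and 'units' attributes from a variable and build a
# simple axis label, falling back to the bare name when no label exists.
make_axis_label <- function(v, name) {
  lab <- attr(v, "label")
  un  <- attr(v, "units")
  if (is.null(lab) || lab == "") return(name)
  if (is.null(un)  || un == "") lab else paste0(lab, ", ", un)
}

x <- structure(1:10, label = "X", units = "mmHg")
make_axis_label(x, "x")    # "X, mmHg"
make_axis_label(1:5, "z")  # "z" (no label attribute present)
```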
The Hmisc library contains many functions useful for data analysis, high-level graphics, utility operations, functions for computing sample size and power, translating SAS datasets into R, imputing missing values, advanced table making, variable clustering, character string manipulation, conversion of R objects to LaTeX code, recoding variables, and bootstrap repeated measures analysis. Most of these functions were written by F Harrell, but a few were collected from statlib and from s-news; other authors are indicated below. This collection of functions includes all of Harrell's submissions to statlib other than the functions in the rms and display libraries. A few of the functions do not have “Help” documentation.
To make Hmisc load silently, issue
options(Hverbose=FALSE)
before library(Hmisc)
.
Function Name  Purpose 
abs.error.pred  Computes various indexes of predictive accuracy based 
on absolute errors, for linear models  
addMarginal  Add marginal observations over selected variables 
all.is.numeric  Check if character strings are legal numerics 
approxExtrap  Linear extrapolation 
aregImpute  Multiple imputation based on additive regression, 
bootstrapping, and predictive mean matching  
areg.boot  Nonparametrically estimate transformations for both 
sides of a multiple additive regression, and  
bootstrap these estimates and $R^2$


ballocation  Optimum sample allocations in 2-sample proportion test 
binconf  Exact confidence limits for a proportion and more accurate 
(narrower!) score stat.-based Wilson interval  
(Rollin Brant, mod. FEH)  
bootkm  Bootstrap Kaplan-Meier survival or quantile estimates 
bpower  Approximate power of 2-sided test for 2 proportions 
Includes bpower.sim for exact power by simulation  
bpplot  Box-Percentile plot 
(Jeffrey Banfield, umsfjban@bill.oscs.montana.edu)  
bpplotM  Chart extended box plots for multiple variables 
bsamsize  Sample size requirements for test of 2 proportions 
bystats  Statistics on a single variable by levels of >=1 factors 
bystats2  2-way statistics 
character.table  Shows numeric equivalents of all latin characters 
Useful for putting many special chars. in graph titles  
(Pierre Joyet, pierre.joyet@bluewin.ch)  
ciapower  Power of Cox interaction test 
cleanup.import  More compactly store variables in a data frame, and clean up 
problem data when e.g. Excel spreadsheet had a non-numeric  
value in a numeric column  
combine.levels  Combine infrequent levels of a categorical variable 
confbar  Draws confidence bars on an existing plot using multiple 
confidence levels distinguished using color or gray scale  
contents  Print the contents (variables, labels, etc.) of a data frame 
cpower  Power of Cox 2-sample test allowing for non-compliance 
Cs  Vector of character strings from list of unquoted names 
csv.get  Enhanced importing of comma separated files labels 
cut2  Like cut with better endpoint label construction and allows 
construction of quantile groups or groups with given n  
datadensity  Snapshot graph of distributions of all variables in 
a data frame. For continuous variables uses scat1d.  
dataRep  Quantify representation of new observations in a database 
ddmmmyy  SAS “date7” output format for a chron object 
deff  Kish design effect and intracluster correlation 
describe  Function to describe different classes of objects. 
Invoke by saying describe(object). It calls one of the  
following:  
describe.data.frame  Describe all variables in a data frame (generalization 
of SAS UNIVARIATE)  
describe.default  Describe a variable (generalization of SAS UNIVARIATE) 
dotplot3  A more flexible version of dotplot 
Dotplot  Enhancement of Trellis dotplot allowing for matrix 
x-var., auto generation of Key function, superposition  
drawPlot  Simple mouse-driven drawing program, including a function 
for fitting Bezier curves  
Ecdf  Empirical cumulative distribution function plot 
errbar  Plot with error bars (Charles Geyer, U. Chi., mod FEH) 
event.chart  Plot general event charts (Jack Lee, jjlee@mdanderson.org, 
Ken Hess, Joel Dubin; Am Statistician 54:63-70, 2000)  
event.history  Event history chart with time-dependent cov. status 
(Joel Dubin, jdubin@uwaterloo.ca)  
find.matches  Find matches (with tolerances) between columns of 2 matrices 
first.word  Find the first word in an R expression (R Heiberger) 
fit.mult.impute  Fit most regression models over multiple transcan imputations, 
compute imputation-adjusted variances and avg. betas  
format.df  Format a matrix or data frame with much user control 
(R Heiberger and FE Harrell)  
ftupwr  Power of 2-sample binomial test using Fleiss, Tytun, Ury 
ftuss  Sample size for 2-sample binomial test using " " " " 
(Both by Dan Heitjan, dheitjan@biostats.hmc.psu.edu)  
gbayes  Bayesian posterior and predictive distributions when both 
the prior and the likelihood are Gaussian  
getHdata  Fetch and list datasets on our web site 
hdquantile  Harrell-Davis nonparametric quantile estimator with s.e. 
histbackback  Back-to-back histograms (Pat Burns, Salomon Smith 
Barney, London, pburns@dorado.sbi.com)  
hist.data.frame  Matrix of histograms for all numeric vars. in data frame 
Use hist.data.frame(data.frame.name)  
histSpike  Add high-resolution spike histograms or density estimates 
to an existing plot  
hoeffd  Hoeffding's D test (omnibus test of independence of X and Y) 
impute  Impute missing data (generic method) 
interaction  More flexible version of built-in function 
is.present  Tests for non-blank character values or non-NA numeric values 
james.stein  James-Stein shrinkage estimates of cell means from raw data 
labcurve 