Title:  Regression Modeling Strategies 

Description:  Regression modeling, testing, estimation, validation, graphics, prediction, and typesetting by storing enhanced model design attributes in the fit. 'rms' is a collection of functions that assist with and streamline modeling. It also contains functions for binary and ordinal logistic regression models, ordinal models for continuous Y with a variety of distribution families, and the BuckleyJames multiple regression model for rightcensored responses, and implements penalized maximum likelihood estimation for logistic and ordinary linear models. 'rms' works with almost any regression model, but it was especially written to work with binary or ordinal regression models, Cox regression, accelerated failure time models, ordinary linear models, the BuckleyJames model, generalized least squares for serially or spatially correlated observations, generalized linear models, and quantile regression. 
Authors:  Frank E Harrell Jr <[email protected]> 
Maintainer:  Frank E Harrell Jr <[email protected]> 
License:  GPL (>= 2) 
Version:  6.81 
Built:  20240528 03:36:25 UTC 
Source:  CRAN 
The anova
function automatically tests most meaningful hypotheses
in a design. For example, suppose that age and cholesterol are
predictors, and that a general interaction is modeled using a restricted
spline surface. anova
prints Wald statistics ($F$
statistics
for an ols
fit) for testing linearity of age, linearity of
cholesterol, age effect (age + age by cholesterol interaction),
cholesterol effect (cholesterol + age by cholesterol interaction),
linearity of the age by cholesterol interaction (i.e., adequacy of the
simple age * cholesterol 1 d.f. product), linearity of the interaction
in age alone, and linearity of the interaction in cholesterol
alone. Joint tests of all interaction terms in the model and all
nonlinear terms in the model are also performed. For any multiple
d.f. effects for continuous variables that were not modeled through
rcs
, pol
, lsp
, etc., tests of linearity will be
omitted. This applies to matrix predictors produced by e.g.
poly
or ns
.
For lrm, orm, cph, psm
and Glm
fits, the better likelihood
ratio chisquare tests may be obtained by specifying test='LR'
.
Fits must use x=TRUE, y=TRUE
to run LR tests. The tests are run
fairly efficiently by subsetting the design matrix rather than
recreating it.
print.anova.rms
is the printing
method. plot.anova.rms
draws dot charts depicting the importance
of variables in the model, as measured by Wald or LR $\chi^2$
,
$\chi^2$
minus d.f., AIC, $P$
values, partial
$R^2$
, $R^2$
for the whole model after deleting the effects in
question, or proportion of overall model $R^2$
that is due to each
predictor. latex.anova.rms
is the latex
method. It
substitutes Greek/math symbols in column headings, uses boldface for
TOTAL
lines, and constructs a caption. Then it passes the result
to latex.default
for conversion to LaTeX.
When the anova table was converted to account for missing data
imputation by processMI
, a separate function prmiInfo
can
be used to print information related to imputation adjustments.
For Bayesian models such as blrm
, anova
computes relative
explained variation indexes (REV) based on approximate Wald statistics.
This uses the variancecovariance matrix of all of the posterior draws,
and the individual draws of betas, plus an overall summary from the
posterior mode/mean/median beta. Wald chisquares assuming multivariate
normality of betas are computed just as with frequentist models, and for
each draw (or for the summary) the ratio of the partial Wald chisquare
to the total Wald statistic for the model is computed as REV.
The print
method calls latex
or html
methods
depending on options(prType=)
. For
latex
a table
environment is not used and an ordinary
tabular
is produced. When using html with Quarto or RMarkdown,
results='asis'
need not be written in the chunk header.
html.anova.rms
just calls latex.anova.rms
.
## S3 method for class 'rms'
anova(object, ..., main.effect=FALSE, tol=1e9,
test=c('F','Chisq','LR'), india=TRUE, indnl=TRUE, ss=TRUE,
vnames=c('names','labels'),
posterior.summary=c('mean', 'median', 'mode'), ns=500, cint=0.95)
## S3 method for class 'anova.rms'
print(x,
which=c('none','subscripts','names','dots'),
table.env=FALSE, ...)
## S3 method for class 'anova.rms'
plot(x,
what=c("chisqminusdf","chisq","aic","P","partial R2","remaining R2",
"proportion R2", "proportion chisq"),
xlab=NULL, pch=16,
rm.totals=TRUE, rm.ia=FALSE, rm.other=NULL, newnames,
sort=c("descending","ascending","none"), margin=c('chisq','P'),
pl=TRUE, trans=NULL, ntrans=40, height=NULL, width=NULL, ...)
## S3 method for class 'anova.rms'
latex(object, title, dec.chisq=2,
dec.F=2, dec.ss=NA, dec.ms=NA, dec.P=4, dec.REV=3,
table.env=TRUE,
caption=NULL, fontsize=1, params, ...)
## S3 method for class 'anova.rms'
html(object, ...)
object 
a 
... 
If omitted, all variables are tested, yielding tests for individual factors
and for pooled effects. Specify a subset of the variables to obtain tests
for only those factors, with a pooled tests for the combined effects
of all factors listed. Names may be abbreviated. For example, specify
Can be optional graphical parameters to send to
For 
main.effect 
Set to 
tol 
singularity criterion for use in matrix inversion 
test 
For an 
india 
set to 
indnl 
set to 
ss 
For an 
vnames 
set to 
posterior.summary 
specifies whether the posterior mode/mean/median beta are to be used as a measure of central tendence of the posterior distribution, for use in relative explained variation from Bayesian models 
ns 
number of random samples from the posterior draws to use for REV highest posterior density intervals 
cint 
HPD interval probability 
x 
for 
which 
If 
what 
what type of statistic to plot. The default is the 
xlab 
xaxis label, default is constructed according to 
pch 
character for plotting dots in dot charts. Default is 16 (solid dot). 
rm.totals 
set to 
rm.ia 
set to 
rm.other 
a list of other predictor names to omit from the chart 
newnames 
a list of substitute predictor names to use, after omitting any. 
sort 
default is to sort bars in descending order of the summary statistic. Available options: 'ascending', 'descending', 'none'. 
margin 
set to a vector of character strings to write text for
selected statistics in the right margin of the dot chart. The
character strings can be any combination of 
pl 
set to 
trans 
set to a function to apply that transformation to the statistics
being plotted, and to truncate negative values at zero. A good choice
is 
ntrans 

height , width

height and width of 
title 
title to pass to 
dec.chisq 
number of places to the right of the decimal place for typesetting

dec.F 
digits to the right for 
dec.ss 
digits to the right for sums of squares (default is 
dec.ms 
digits to the right for mean squares (default is 
dec.P 
digits to the right for 
dec.REV 
digits to the right for REV 
table.env 
see 
caption 
caption for table if 
fontsize 
font size for html output; default is 1 for 
params 
used internally when called through print. 
If the statistics being plotted with plot.anova.rms
are few in
number and one of them is negative or zero, plot.anova.rms
will quit because of an error in dotchart2
.
The latex
method requires LaTeX packages relsize
and
needspace
.
anova.rms
returns a matrix of class anova.rms
containing factors
as rows and $\chi^2$
, d.f., and $P$
values as
columns (or d.f., partial $SS, MS, F, P$
). An attribute
vinfo
provides list of variables involved in each row and the
type of test done.
plot.anova.rms
invisibly returns the vector of quantities
plotted. This vector has a names attribute describing the terms for
which the statistics in the vector are calculated.
print
prints, latex
creates a
file with a name of the form "title.tex"
(see the title
argument above).
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
prmiInfo
,
rms
, rmsMisc
, lrtest
,
rms.trans
, summary.rms
, plot.Predict
,
ggplot.Predict
, solvet
,
locator
,
dotchart2
, latex
,
xYplot
, anova.lm
,
contrast.rms
, pantext
require(ggplot2)
n < 1000 # define sample size
set.seed(17) # so can reproduce the results
treat < factor(sample(c('a','b','c'), n,TRUE))
num.diseases < sample(0:4, n,TRUE)
age < rnorm(n, 50, 10)
cholesterol < rnorm(n, 200, 25)
weight < rnorm(n, 150, 20)
sex < factor(sample(c('female','male'), n,TRUE))
label(age) < 'Age' # label is in Hmisc
label(num.diseases) < 'Number of Comorbid Diseases'
label(cholesterol) < 'Total Cholesterol'
label(weight) < 'Weight, lbs.'
label(sex) < 'Sex'
units(cholesterol) < 'mg/dl' # uses units.default in Hmisc
# Specify population model for log odds that Y=1
L < .1*(num.diseases2) + .045*(age50) +
(log(cholesterol  10)5.2)*(2*(treat=='a') +
3.5*(treat=='b')+2*(treat=='c'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(L)]
y < ifelse(runif(n) < plogis(L), 1, 0)
fit < lrm(y ~ treat + scored(num.diseases) + rcs(age) +
log(cholesterol+10) + treat:log(cholesterol+10),
x=TRUE, y=TRUE) # x, y needed for test='LR'
a < anova(fit) # Test all factors
b < anova(fit, treat, cholesterol) # Test these 2 by themselves
# to get their pooled effects
a
b
a2 < anova(fit, test='LR')
b2 < anova(fit, treat, cholesterol, test='LR')
a2
b2
# Add a new line to the plot with combined effects
s < rbind(a2, 'treat+cholesterol'=b2['TOTAL',])
class(s) < 'anova.rms'
plot(s, margin=c('chisq', 'proportion chisq'))
g < lrm(y ~ treat*rcs(age))
dd < datadist(treat, num.diseases, age, cholesterol)
options(datadist='dd')
p < Predict(g, age, treat="b")
s < anova(g)
tx < paste(capture.output(s), collapse='\n')
ggplot(p) + annotate('text', x=27, y=3.2, family='mono', label=tx,
hjust=0, vjust=1, size=1.5)
plot(s, margin=c('chisq', 'proportion chisq'))
# new plot  dot chart of chisqd.f. with 2 other stats in right margin
# latex(s) # nice printout  creates anova.g.tex
options(datadist=NULL)
# Simulate data with from a given model, and display exactly which
# hypotheses are being tested
set.seed(123)
age < rnorm(500, 50, 15)
treat < factor(sample(c('a','b','c'), 500, TRUE))
bp < rnorm(500, 120, 10)
y < ifelse(treat=='a', (age50)*.05, abs(age50)*.08) + 3*(treat=='c') +
pmax(bp, 100)*.09 + rnorm(500)
f < ols(y ~ treat*lsp(age,50) + rcs(bp,4))
print(names(coef(f)), quote=FALSE)
specs(f)
anova(f)
an < anova(f)
options(digits=3)
print(an, 'subscripts')
print(an, 'dots')
an < anova(f, test='Chisq', ss=FALSE)
# plot(0:1) # make some plot
# tab < pantext(an, 1.2, .6, lattice=FALSE, fontfamily='Helvetica')
# create function to write table; usually omit fontfamily
# tab() # execute it; could do tab(cex=.65)
plot(an) # new plot  dot chart of chisqd.f.
# Specify plot(an, trans=sqrt) to use a square root scale for this plot
# latex(an) # nice printout  creates anova.f.tex
## Example to save partial R^2 for all predictors, along with overall
## R^2, from two separate fits, and to combine them with ggplot2
require(ggplot2)
set.seed(1)
n < 100
x1 < runif(n)
x2 < runif(n)
y < (x1.5)^2 + x2 + runif(n)
group < c(rep('a', n/2), rep('b', n/2))
A < NULL
for(g in c('a','b')) {
f < ols(y ~ pol(x1,2) + pol(x2,2) + pol(x1,2) %ia% pol(x2,2),
subset=group==g)
a < plot(anova(f),
what='partial R2', pl=FALSE, rm.totals=FALSE, sort='none')
a < a[grep('NONLINEAR', names(a))]
d < data.frame(group=g, Variable=factor(names(a), names(a)),
partialR2=unname(a))
A < rbind(A, d)
}
ggplot(A, aes(x=partialR2, y=Variable)) + geom_point() +
facet_wrap(~ group) + xlab(ex < expression(partial~R^2)) +
scale_y_discrete(limits=rev)
ggplot(A, aes(x=partialR2, y=Variable, color=group)) + geom_point() +
xlab(ex < expression(partial~R^2)) +
scale_y_discrete(limits=rev)
# Suppose that a researcher wants to make a big deal about a variable
# because it has the highest adjusted chisquare. We use the
# bootstrap to derive 0.95 confidence intervals for the ranks of all
# the effects in the model. We use the plot method for anova, with
# pl=FALSE to suppress actual plotting of chisquare  d.f. for each
# bootstrap repetition.
# It is important to tell plot.anova.rms not to sort the results, or
# every bootstrap replication would have ranks of 1,2,3,... for the stats.
n < 300
set.seed(1)
d < data.frame(x1=runif(n), x2=runif(n), x3=runif(n),
x4=runif(n), x5=runif(n), x6=runif(n), x7=runif(n),
x8=runif(n), x9=runif(n), x10=runif(n), x11=runif(n),
x12=runif(n))
d$y < with(d, 1*x1 + 2*x2 + 3*x3 + 4*x4 + 5*x5 + 6*x6 +
7*x7 + 8*x8 + 9*x9 + 10*x10 + 11*x11 +
12*x12 + 9*rnorm(n))
f < ols(y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12, data=d)
B < 20 # actually use B=1000
ranks < matrix(NA, nrow=B, ncol=12)
rankvars < function(fit)
rank(plot(anova(fit), sort='none', pl=FALSE))
Rank < rankvars(f)
for(i in 1:B) {
j < sample(1:n, n, TRUE)
bootfit < update(f, data=d, subset=j)
ranks[i,] < rankvars(bootfit)
}
lim < t(apply(ranks, 2, quantile, probs=c(.025,.975)))
predictor < factor(names(Rank), names(Rank))
w < data.frame(predictor, Rank, lower=lim[,1], upper=lim[,2])
ggplot(w, aes(x=predictor, y=Rank)) + geom_point() + coord_flip() +
scale_y_continuous(breaks=1:12) +
geom_errorbar(aes(ymin=lim[,1], ymax=lim[,2]), width=0)
bj
fits the BuckleyJames distributionfree least squares multiple
regression model to a possibly rightcensored response variable.
This model reduces to ordinary least squares if
there is no censoring. By default, model fitting is done after
taking logs of the response variable.
bj
uses the rms
class
for automatic anova
, fastbw
, validate
, Function
, nomogram
,
summary
, plot
, bootcov
, and other functions. The bootcov
function may be worth using with bj
fits, as the properties of the
BuckleyJames covariance matrix estimator are not fully known for
strange censoring patterns.
For the print
method, format of output is controlled by the
user previously running options(prType="lang")
where
lang
is "plain"
(the default), "latex"
, or
"html"
. When using html with Quarto or RMarkdown,
results='asis'
need not be written in the chunk header.
The residuals.bj
function exists mainly to compute
residuals and to censor them (i.e., return them as
Surv
objects) just as the original
failure time variable was censored. These residuals are useful for
checking to see if the model also satisfies certain distributional assumptions.
To get these residuals, the fit must have specified y=TRUE
.
The bjplot
function is a special plotting function for objects
created by bj
with x=TRUE, y=TRUE
in effect. It produces three
scatterplots for every covariate in the model: the first plots the
original situation, where censored data are distingushed from
noncensored data by a different plotting symbol. In the second plot,
called a renovated plot, vertical lines show how censored data were
changed by the procedure, and the third is equal to the second, but
without vertical lines. Imputed data are again distinguished from the
noncensored by a different symbol.
The validate
method for bj
validates the Somers' Dxy
rank
correlation between predicted and observed responses, accounting for censoring.
The primary fitting function for bj
is bj.fit
, which does not
allow missing data and expects a full design matrix as input.
bj(formula, data=environment(formula), subset, na.action=na.delete,
link="log", control, method='fit', x=FALSE, y=FALSE,
time.inc)
## S3 method for class 'bj'
print(x, digits=4, long=FALSE, coefs=TRUE,
title="BuckleyJames Censored Data Regression", ...)
## S3 method for class 'bj'
residuals(object, type=c("censored","censored.normalized"),...)
bjplot(fit, which=1:dim(X)[[2]])
## S3 method for class 'bj'
validate(fit, method="boot", B=40,
bw=FALSE,rule="aic",type="residual",sls=.05,aics=0,
force=NULL, estimates=TRUE, pr=FALSE,
tol=1e7, rel.tolerance=1e3, maxiter=15, ...)
bj.fit(x, y, control)
formula 
an S statistical model formula. Interactions up to third order are
supported. The left hand side must be a 
data , subset , na.action

the usual statistical model fitting arguments 
fit 
a fit created by 
x 
a design matrix with or without a first column of ones, to pass
to 
y 
a 
link 
set to, for example, 
control 
a list containing any or all of the following components: 
method 
set to 
time.inc 
setting for default time spacing.
Default is 30 if time variable has 
digits 
number of significant digits to print if not 4. 
long 
set to 
coefs 
specify 
title 
a character string title to be passed to 
object 
the result of 
type 
type of residual desired. Default is censored unnormalized residuals,
defined as link(Y)  linear.predictors, where the
link function was usually the log function. You can specify

which 
vector of integers or character strings naming elements of the design
matrix (the names of the original predictors if they entered the model
linearly) for which to have 
B , bw , rule , sls , aics , force , estimates , pr , tol , rel.tolerance , maxiter

see

... 
ignored for 
The program implements the algorithm as described in the original article by Buckley & James. Also, we have used the original Buckley & James prescription for computing variance/covariance estimator. This is based on noncensored observations only and does not have any theoretical justification, but has been shown in simulation studies to behave well. Our experience confirms this view. Convergence is rather slow with this method, so you may want to increase the number of iterations. Our experience shows that often, in particular with high censoring, 100 iterations is not too many. Sometimes the method will not converge, but will instead enter a loop of repeating values (this is due to the discrete nature of Kaplan and Meier estimator and usually happens with small sample sizes). The program will look for such a loop and return the average betas. It will also issue a warning message and give the size of the cycle (usually less than 6).
bj
returns a fit object with similar information to what survreg
,
psm
, cph
would store as
well as what rms
stores and units
and time.inc
.
residuals.bj
returns a Surv
object. One of the components of the
fit
object produced by bj
(and bj.fit
) is a vector called
stats
which contains the following names elements:
"Obs", "Events", "d.f.","error d.f.","sigma","g"
. Here
sigma
is the estimate of the residual standard deviation.
g
is the $g$
index. If the link function is "log"
,
the $g$
index on the antilog scale is also returned as gr
.
Janez Stare
Department of Biomedical Informatics
Ljubljana University
Ljubljana, Slovenia
[email protected]
Harald Heinzl
Department of Medical Computer Sciences
Vienna University
Vienna, Austria
[email protected]
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
Buckley JJ, James IR. Linear regression with censored data. Biometrika 1979; 66:429–36.
Miller RG, Halpern J. Regression with censored data. Biometrika 1982; 69: 521–31.
James IR, Smith PJ. Consistency results for linear regression with censored data. Ann Statist 1984; 12: 590–600.
Lai TL, Ying Z. Large sample theory of a modified BuckleyJames estimator for regression analysis with censored data. Ann Statist 1991; 19: 1370–402.
Hillis SL. Residual plots for the censored data linear regression model. Stat in Med 1995; 14: 2023–2036.
Jin Z, Lin DY, Ying Z. On leastsquares regression with censored data. Biometrika 2006; 93:147–161.
rms
, psm
, survreg
,
cph
, Surv
,
na.delete
,
na.detail.response
, datadist
,
rcorr.cens
, GiniMd
,
prModFit
, dxy.cens
require(survival)
suppressWarnings(RNGversion("3.5.0"))
set.seed(1)
ftime < 10*rexp(200)
stroke < ifelse(ftime > 10, 0, 1)
ftime < pmin(ftime, 10)
units(ftime) < "Month"
age < rnorm(200, 70, 10)
hospital < factor(sample(c('a','b'),200,TRUE))
dd < datadist(age, hospital)
options(datadist="dd")
# Prior to rms 6.0 and R 4.0 the following worked with 5 knots
f < bj(Surv(ftime, stroke) ~ rcs(age,3) + hospital, x=TRUE, y=TRUE)
# add link="identity" to use a censored normal regression model instead
# of a lognormal one
anova(f)
fastbw(f)
validate(f, B=15)
plot(Predict(f, age, hospital))
# needs datadist since no explicit age,hosp.
coef(f) # look at regression coefficients
coef(psm(Surv(ftime, stroke) ~ rcs(age,3) + hospital, dist='lognormal'))
# compare with coefficients from likelihoodbased
# lognormal regression model
# use dist='gau' not under R
r < resid(f, 'censored.normalized')
survplot(npsurv(r ~ 1), conf='none')
# plot KaplanMeier estimate of
# survival function of standardized residuals
survplot(npsurv(r ~ cut2(age, g=2)), conf='none')
# may desire both strata to be n(0,1)
options(datadist=NULL)
This functions constructs an object resembling one produced by the
boot
package's boot
function, and runs that package's
boot.ci
function to compute BCa and percentile confidence limits.
bootBCa
can provide separate confidence limits for a vector of
statistics when estimate
has length greater than 1. In that
case, estimates
must have the same number of columns as
estimate
has values.
bootBCa(estimate, estimates, type=c('percentile','bca','basic'),
n, seed, conf.int = 0.95)
estimate 
original wholesample estimate 
estimates 
vector of bootstrap estimates 
type 
type of confidence interval, defaulting to nonparametric percentile 
n 
original number of observations 
seed 

conf.int 
confidence level 
a 2vector if estimate
is of length 1, otherwise a matrix
with 2 rows and number of columns equal to the length of
estimate
You can use if(!exists('.Random.seed')) runif(1)
before running
your bootstrap to make sure that .Random.seed
will be available
to bootBCa
.
Frank Harrell
## Not run:
x1 < runif(100); x2 < runif(100); y < sample(0:1, 100, TRUE)
f < lrm(y ~ x1 + x2, x=TRUE, y=TRUE)
seed < .Random.seed
b < bootcov(f)
# Get estimated log odds at x1=.4, x2=.6
X < cbind(c(1,1), x1=c(.4,2), x2=c(.6,3))
est < X
ests < t(X
bootBCa(est, ests, n=100, seed=seed)
bootBCa(est, ests, type='bca', n=100, seed=seed)
bootBCa(est, ests, type='basic', n=100, seed=seed)
## End(Not run)
bootcov
computes a bootstrap estimate of the covariance matrix for a set
of regression coefficients from ols
, lrm
, cph
,
psm
, Rq
, and any
other fit where x=TRUE, y=TRUE
was used to store the data used in making
the original regression fit and where an appropriate fitter
function
is provided here. The estimates obtained are not conditional on
the design matrix, but are instead unconditional estimates. For
small sample sizes, this will make a difference as the unconditional
variance estimates are larger. This function will also obtain
bootstrap estimates corrected for cluster sampling (intracluster
correlations) when a "working independence" model was used to fit
data which were correlated within clusters. This is done by substituting
cluster sampling with replacement for the usual simple sampling with
replacement. bootcov
has an option (coef.reps
) that causes all
of the regression coefficient estimates from all of the bootstrap
resamples to be saved, facilitating computation of nonparametric
bootstrap confidence limits and plotting of the distributions of the
coefficient estimates (using histograms and kernel smoothing estimates).
The loglik
option facilitates the calculation of simultaneous
confidence regions from quantities of interest that are functions of
the regression coefficients, using the method of Tibshirani(1996).
With Tibshirani's method, one computes the objective criterion (2 log
likelihood evaluated at the bootstrap estimate of $\beta$
but with
respect to the original design matrix and response vector) for the
original fit as well as for all of the bootstrap fits. The confidence
set of the regression coefficients is the set of all coefficients that
are associated with objective function values that are less than or
equal to say the 0.95 quantile of the vector of B + 1
objective
function values. For the coefficients satisfying this condition,
predicted values are computed at a userspecified design matrix X
,
and minima and maxima of these predicted values (over the qualifying
bootstrap repetitions) are computed to derive the final simultaneous
confidence band.
The bootplot
function takes the output of bootcov
and
either plots a histogram and kernel density
estimate of specified regression coefficients (or linear combinations
of them through the use of a specified design matrix X
), or a
qqnorm
plot of the quantities of interest to check for normality of
the maximum likelihood estimates. bootplot
draws vertical lines at
specified quantiles of the bootstrap distribution, and returns these
quantiles for possible printing by the user. Bootstrap estimates may
optionally be transformed by a userspecified function fun
before
plotting.
The confplot
function also uses the output of bootcov
but to
compute and optionally plot nonparametric bootstrap pointwise confidence
limits or (by default) Tibshirani (1996) simultaneous confidence sets.
A design matrix must be specified to allow confplot
to compute
quantities of interest such as predicted values across a range
of values or differences in predicted values (plots of effects of
changing one or more predictor variable values).
bootplot
and confplot
are actually generic functions, with
the particular functions bootplot.bootcov
and confplot.bootcov
automatically invoked for bootcov
objects.
A service function called histdensity
is also provided (for use with
bootplot
). It runs hist
and density
on the same plot, using
twice the number of classes than the default for hist
, and 1.5 times the
width
than the default used by density
.
A comprehensive example demonstrates the use of all of the functions.
bootcov(fit, cluster, B=200, fitter,
coef.reps=TRUE, loglik=FALSE,
pr=FALSE, maxit=15, eps=0.0001, group=NULL, stat=NULL,
seed=sample(10000, 1))
bootplot(obj, which=1 : ncol(Coef), X,
conf.int=c(.9,.95,.99),
what=c('density', 'qqnorm', 'box'),
fun=function(x) x, labels., ...)
confplot(obj, X, against,
method=c('simultaneous','pointwise'),
conf.int=0.95, fun=function(x)x,
add=FALSE, lty.conf=2, ...)
histdensity(y, xlab, nclass, width, mult.width=1, ...)
fit 
a fit object containing components 
obj 
an object created by 
X 
a design matrix specified to 
y 
a vector to pass to 
cluster 
a variable indicating groupings. 
B 
number of bootstrap repetitions. Default is 200. 
fitter 
the name of a function with arguments 
coef.reps 
set to 
loglik 
set to 
pr 
set to 
maxit 
maximum number of iterations, to pass to 
eps 
argument to pass to various fitters 
group 
a grouping variable used to stratify the sample upon bootstrapping.
This allows one to handle ksample problems, i.e., each bootstrap
sample will be forced to select the same number of observations from
each level of group as the number appearing in the original dataset.
You may specify both 
stat 
a single character string specifying the name of a 
seed 
random number seed for 
which 
one or more integers specifying which regression coefficients to
plot for 
conf.int 
a vector (for 
what 
for 
fun 
for 
labels. 
a vector of labels for labeling the axes in plots produced by 
... 
For 
against 
For 
method 
specifies whether 
add 
set to 
lty.conf 
line type for plotting confidence bands in 
xlab 
label for xaxis for 
nclass 
passed to 
width 
passed to 
mult.width 
multiplier by which to adjust the default 
If the fit has a scale parameter (e.g., a fit from psm
), the log
of the individual bootstrap scale estimates are added to the vector
of parameter estimates and and column and row for the log scale are
added to the new covariance matrix (the old covariance matrix also
has this row and column).
For Rq
fits, the tau
, method
, and hs
arguments are taken from the original fit.
a new fit object with class of the original object and with the element
orig.var
added. orig.var
is
the covariance matrix of the original fit. Also, the original var
component is replaced with the new bootstrap estimates. The component
boot.coef
is also added. This contains the mean bootstrap estimates
of regression coefficients (with a log scale element added if
applicable). boot.Coef
is added if coef.reps=TRUE
.
boot.loglik
is added if loglik=TRUE
. If stat
is
specified an additional vector boot.stats
will be contained in
the returned object. B
contains the number of successfully fitted
bootstrap resamples. A component
clusterInfo
is added to contain elements name
and n
holding the name of the cluster
variable and the number of clusters.
bootplot
returns a (possible matrix) of quantities of interest and
the requested quantiles of them. confplot
returns three vectors:
fitted
, lower
, and upper
.
bootcov
prints if pr=TRUE
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
Bill Pikounis
Biometrics Research Department
Merck Research Laboratories
https://billpikounis.com/wpb/
Feng Z, McLerran D, Grizzle J (1996): A comparison of statistical methods for clustered data analysis with Gaussian error. Stat in Med 15:1793–1806.
Tibshirani R, Knight K (1996): Model search and inference by bootstrap
"bumping". Department of Statistics, University of Toronto. Technical
report available from
http://wwwstat.stanford.edu/~tibs/.
Presented at the Joint Statistical Meetings,
Chicago, August 1996.
robcov
, sample
, rms
,
lm.fit
, lrm.fit
,
survivalinternal
,
predab.resample
, rmsMisc
,
Predict
, gendata
,
contrast.rms
, Predict
, setPb
,
multiwayvcov::cluster.boot
set.seed(191)
x < exp(rnorm(200))
logit < 1 + x/2
y < ifelse(runif(200) <= plogis(logit), 1, 0)
f < lrm(y ~ pol(x,2), x=TRUE, y=TRUE)
g < bootcov(f, B=50, pr=TRUE, seed=3)
anova(g) # using bootstrap covariance estimates
fastbw(g) # using bootstrap covariance estimates
beta < g$boot.Coef[,1]
hist(beta, nclass=15) #look at normality of parameter estimates
qqnorm(beta)
# bootplot would be better than these last two commands
# A dataset contains a variable number of observations per subject,
# and all observations are laid out in separate rows. The responses
# represent whether or not a given segment of the coronary arteries
# is occluded. Segments of arteries may not operate independently
# in the same patient. We assume a "working independence model" to
# get estimates of the coefficients, i.e., that estimates assuming
# independence are reasonably efficient. The job is then to get
# unbiased estimates of variances and covariances of these estimates.
set.seed(2)
n.subjects < 30
ages < rnorm(n.subjects, 50, 15)
sexes < factor(sample(c('female','male'), n.subjects, TRUE))
logit < (ages50)/5
prob < plogis(logit) # true prob not related to sex
id < sample(1:n.subjects, 300, TRUE) # subjects sampled multiple times
table(table(id)) # frequencies of number of obs/subject
age < ages[id]
sex < sexes[id]
# In truth, observations within subject are independent:
y < ifelse(runif(300) <= prob[id], 1, 0)
f < lrm(y ~ lsp(age,50)*sex, x=TRUE, y=TRUE)
g < bootcov(f, id, B=50, seed=3) # usually do B=200 or more
diag(g$var)/diag(f$var)
# add ,group=w to resample from within each level of w
anova(g) # clusteradjusted Wald statistics
# fastbw(g) # clusteradjusted backward elimination
plot(Predict(g, age=30:70, sex='female')) # clusteradjusted confidence bands
# Get design effects based on inflation of the variances when compared
# with bootstrap estimates which ignore clustering
g2 < bootcov(f, B=50, seed=3)
diag(g$var)/diag(g2$var)
# Get design effects based on pooled tests of factors in model
anova(g2)[,1] / anova(g)[,1]
# Simulate binary data where there is a strong
# age x sex interaction with linear age effects
# for both sexes, but where not knowing that
# we fit a quadratic model. Use the bootstrap
# to get bootstrap distributions of various
# effects, and to get pointwise and simultaneous
# confidence limits
set.seed(71)
n < 500
age < rnorm(n, 50, 10)
sex < factor(sample(c('female','male'), n, rep=TRUE))
L < ifelse(sex=='male', 0, .1*(age50))
y < ifelse(runif(n)<=plogis(L), 1, 0)
f < lrm(y ~ sex*pol(age,2), x=TRUE, y=TRUE)
b < bootcov(f, B=50, loglik=TRUE, pr=TRUE, seed=3) # better: B=500
par(mfrow=c(2,3))
# Assess normality of regression estimates
bootplot(b, which=1:6, what='qq')
# They appear somewhat nonnormal
# Plot histograms and estimated densities
# for 6 coefficients
w < bootplot(b, which=1:6)
# Print bootstrap quantiles
w$quantiles
# Show box plots for bootstrap reps for all coefficients
bootplot(b, what='box')
# Estimate regression function for females
# for a sequence of ages
ages < seq(25, 75, length=100)
label(ages) < 'Age'
# Plot fitted function and pointwise normal
# theory confidence bands
par(mfrow=c(1,1))
p < Predict(f, age=ages, sex='female')
plot(p)
# Save curve coordinates for later automatic
# labeling using labcurve in the Hmisc library
curves < vector('list',8)
curves[[1]] < with(p, list(x=age, y=lower))
curves[[2]] < with(p, list(x=age, y=upper))
# Add pointwise normaldistribution confidence
# bands using unconditional variancecovariance
# matrix from the 500 bootstrap reps
p < Predict(b, age=ages, sex='female')
curves[[3]] < with(p, list(x=age, y=lower))
curves[[4]] < with(p, list(x=age, y=upper))
dframe < expand.grid(sex='female', age=ages)
X < predict(f, dframe, type='x') # Full design matrix
# Add pointwise bootstrap nonparametric
# confidence limits
p < confplot(b, X=X, against=ages, method='pointwise',
add=TRUE, lty.conf=4)
curves[[5]] < list(x=ages, y=p$lower)
curves[[6]] < list(x=ages, y=p$upper)
# Add simultaneous bootstrap confidence band
p < confplot(b, X=X, against=ages, add=TRUE, lty.conf=5)
curves[[7]] < list(x=ages, y=p$lower)
curves[[8]] < list(x=ages, y=p$upper)
lab < c('a','a','b','b','c','c','d','d')
labcurve(curves, lab, pl=TRUE)
# Now get bootstrap simultaneous confidence set for
# female:male odds ratios for a variety of ages
dframe < expand.grid(age=ages, sex=c('female','male'))
X < predict(f, dframe, type='x') # design matrix
f.minus.m < X[1:100,]  X[101:200,]
# First 100 rows are for females. By subtracting
# design matrices are able to get Xf*Beta  Xm*Beta
# = (Xf  Xm)*Beta
confplot(b, X=f.minus.m, against=ages,
method='pointwise', ylab='F:M Log Odds Ratio')
confplot(b, X=f.minus.m, against=ages,
lty.conf=3, add=TRUE)
# contrast.rms makes it easier to compute the design matrix for use
# in bootstrapping contrasts:
f.minus.m < contrast(f, list(sex='female',age=ages),
list(sex='male', age=ages))$X
confplot(b, X=f.minus.m)
# For a quadratic binary logistic regression model use bootstrap
# bumping to estimate coefficients under a monotonicity constraint
set.seed(177)
n < 400
x < runif(n)
logit < 3*(x^21)
y < rbinom(n, size=1, prob=plogis(logit))
f < lrm(y ~ pol(x,2), x=TRUE, y=TRUE)
k < coef(f)
k
vertex < k[2]/(2*k[3])
vertex
# Outside [0,1] so fit satisfies monotonicity constraint within
# x in [0,1], i.e., original fit is the constrained MLE
g < bootcov(f, B=50, coef.reps=TRUE, loglik=TRUE, seed=3)
bootcoef < g$boot.Coef # 100x3 matrix
vertex < bootcoef[,2]/(2*bootcoef[,3])
table(cut2(vertex, c(0,1)))
mono < !(vertex >= 0 & vertex <= 1)
mean(mono) # estimate of Prob{monotonicity in [0,1]}
var(bootcoef) # varcov matrix for unconstrained estimates
var(bootcoef[mono,]) # for constrained estimates
# Find secondbest vector of coefficient estimates, i.e., best
# from among bootstrap estimates
g$boot.Coef[order(g$boot.loglik[length(g$boot.loglik)])[1],]
# Note closeness to MLE
## Not run:
# Get the bootstrap distribution of the difference in two ROC areas for
# two binary logistic models fitted on the same dataset. This analysis
# does not adjust for the bias ROC area (Cindex) due to overfitting.
# The same random number seed is used in two runs to enforce pairing.
set.seed(17)
x1 < rnorm(100)
x2 < rnorm(100)
y < sample(0:1, 100, TRUE)
f < lrm(y ~ x1, x=TRUE, y=TRUE)
g < lrm(y ~ x1 + x2, x=TRUE, y=TRUE)
f < bootcov(f, stat='C', seed=4)
g < bootcov(g, stat='C', seed=4)
dif < g$boot.stats  f$boot.stats
hist(dif)
quantile(dif, c(.025,.25,.5,.75,.975))
# Compute a ztest statistic. Note that comparing ROC areas is far less
# powerful than likelihood or Brier scorebased methods
z < (g$stats['C']  f$stats['C'])/sd(dif)
names(z) < NULL
c(z=z, P=2*pnorm(abs(z)))
## End(Not run)
Uses lattice graphics and the output from Predict
to plot image,
contour, or perspective plots showing the simultaneous effects of two
continuous predictor variables. Unless formula
is provided, the
$x$
axis is constructed from the first variable listed in the call
to Predict
and the $y$
axis variable comes from the second.
The perimeter
function is used to generate the boundary of data
to plot when a 3d plot is made. It finds the area where there are
sufficient data to generate believable interaction fits.
bplot(x, formula, lfun=lattice::levelplot, xlab, ylab, zlab,
adj.subtitle=!info$ref.zero, cex.adj=.75, cex.lab=1,
perim, showperim=FALSE,
zlim=range(yhat, na.rm=TRUE), scales=list(arrows=FALSE),
xlabrot, ylabrot, zlabrot=90, ...)
perimeter(x, y, xinc=diff(range(x))/10, n=10, lowess.=TRUE)
x 
for 
formula 
a formula of the form 
lfun 
a highlevel lattice plotting function that takes formulas of the
form 
xlab 
Character string label for 
ylab 
Character string abel for 
zlab 
Character string 
adj.subtitle 
Set to 
cex.adj 

cex.lab 

perim 
names a matrix created by 
showperim 
set to 
zlim 
Controls the range for plotting in the 
scales 
see 
xlabrot 
rotation angle for the xaxis. Default is 30 for

ylabrot 
rotation angle for the yaxis. Default is 40 for

zlabrot 
rotation angle for zaxis rotation for

... 
other arguments to pass to the lattice function 
y 
second variable of the pair for 
xinc 
increment in 
n 
within intervals of 
lowess. 
set to 
perimeter
is a kind of generalization of datadist
for 2
continuous variables. First, the n
smallest and largest x
values are determined. These form the lowest and highest possible
x
s to display. Then x
is grouped into intervals bounded
by these two numbers, with the interval widths defined by xinc
.
Within each interval, y
is sorted and the $n$
th smallest and
largest y
are taken as the interval containing sufficient data
density to plot interaction surfaces. The interval is ignored when
there are insufficient y
values. When the data are being
readied for persp
, bplot
uses the approx
function to do
linear interpolation of the y
boundaries as a function of the
x
values actually used in forming the grid (the values of the
first variable specified to Predict
). To make the perimeter smooth,
specify lowess.=TRUE
to perimeter
.
perimeter
returns a matrix of class perimeter
. This
outline can be conveniently plotted by lines.perimeter
.
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
datadist
, Predict
,
rms
, rmsMisc
, levelplot
,
contourplot
, wireframe
n < 1000 # define sample size
set.seed(17) # so can reproduce the results
age < rnorm(n, 50, 10)
blood.pressure < rnorm(n, 120, 15)
cholesterol < rnorm(n, 200, 25)
sex < factor(sample(c('female','male'), n,TRUE))
label(age) < 'Age' # label is in Hmisc
label(cholesterol) < 'Total Cholesterol'
label(blood.pressure) < 'Systolic Blood Pressure'
label(sex) < 'Sex'
units(cholesterol) < 'mg/dl' # uses units.default in Hmisc
units(blood.pressure) < 'mmHg'
# Specify population model for log odds that Y=1
L < .4*(sex=='male') + .045*(age50) +
(log(cholesterol  10)5.2)*(2*(sex=='female') + 2*(sex=='male'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(L)]
y < ifelse(runif(n) < plogis(L), 1, 0)
ddist < datadist(age, blood.pressure, cholesterol, sex)
options(datadist='ddist')
fit < lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)),
x=TRUE, y=TRUE)
p < Predict(fit, age, cholesterol, sex, np=50) # vary sex last
require(lattice)
bplot(p) # image plot for age, cholesterol with color
# coming from yhat; use default ranges for
# both continuous predictors; two panels (for sex)
bplot(p, lfun=wireframe) # same as bplot(p,,wireframe)
# View from different angle, change y label orientation accordingly
# Default is z=40, x=60
bplot(p,, wireframe, screen=list(z=40, x=75), ylabrot=25)
bplot(p,, contourplot) # contour plot
bounds < perimeter(age, cholesterol, lowess=TRUE)
plot(age, cholesterol) # show bivariate data density and perimeter
lines(bounds[,c('x','ymin')]); lines(bounds[,c('x','ymax')])
p < Predict(fit, age, cholesterol) # use only one sex
bplot(p, perim=bounds) # draws image() plot
# don't show estimates where data are sparse
# doesn't make sense here since vars don't interact
bplot(p, plogis(yhat) ~ age*cholesterol) # Probability scale
options(datadist=NULL)
Uses bootstrapping or crossvalidation to get biascorrected (overfitting
corrected) estimates of predicted vs. observed values based on
subsetting predictions into intervals (for survival models) or on
nonparametric smoothers (for other models). There are calibration
functions for Cox (cph
), parametric survival models (psm
),
binary and ordinal logistic models (lrm
) and ordinary least
squares (ols
). For survival models,
"predicted" means predicted survival probability at a single
time point, and "observed" refers to the corresponding KaplanMeier
survival estimate, stratifying on intervals of predicted survival, or,
if the polspline
package is installed, the predicted survival
probability as a function of transformed predicted survival probability
using the flexible hazard regression approach (see the val.surv
function for details). For logistic and linear models, a nonparametric
calibration curve is estimated over a sequence of predicted values. The
fit must have specified x=TRUE, y=TRUE
. The print
and
plot
methods for lrm
and ols
models (which use
calibrate.default
) print the mean absolute error in predictions,
the mean squared error, and the 0.9 quantile of the absolute error.
Here, error refers to the difference between the predicted values and
the corresponding biascorrected calibrated values.
Below, the second, third, and fourth invocations of calibrate
are, respectively, for ols
and lrm
, cph
, and
psm
. The first and second plot
invocation are
respectively for lrm
and ols
fits or all other fits.
calibrate(fit, ...)
## Default S3 method:
calibrate(fit, predy,
method=c("boot","crossvalidation",".632","randomization"),
B=40, bw=FALSE, rule=c("aic","p"),
type=c("residual","individual"),
sls=.05, aics=0, force=NULL, estimates=TRUE, pr=FALSE, kint,
smoother="lowess", digits=NULL, ...)
## S3 method for class 'cph'
calibrate(fit, cmethod=c('hare', 'KM'),
method="boot", u, m=150, pred, cuts, B=40,
bw=FALSE, rule="aic", type="residual", sls=0.05, aics=0, force=NULL,
estimates=TRUE,
pr=FALSE, what="observedpredicted", tol=1e12, maxdim=5, ...)
## S3 method for class 'psm'
calibrate(fit, cmethod=c('hare', 'KM'),
method="boot", u, m=150, pred, cuts, B=40,
bw=FALSE,rule="aic",
type="residual", sls=.05, aics=0, force=NULL, estimates=TRUE,
pr=FALSE, what="observedpredicted", tol=1e12, maxiter=15,
rel.tolerance=1e5, maxdim=5, ...)
## S3 method for class 'calibrate'
print(x, B=Inf, ...)
## S3 method for class 'calibrate.default'
print(x, B=Inf, ...)
## S3 method for class 'calibrate'
plot(x, xlab, ylab, subtitles=TRUE, conf.int=TRUE,
cex.subtitles=.75, riskdist=TRUE, add=FALSE,
scat1d.opts=list(nhistSpike=200), par.corrected=NULL, ...)
## S3 method for class 'calibrate.default'
plot(x, xlab, ylab, xlim, ylim,
legend=TRUE, subtitles=TRUE, cex.subtitles=.75, riskdist=TRUE,
scat1d.opts=list(nhistSpike=200), ...)
fit 
a fit from 
x 
an object created by 
method , B , bw , rule , type , sls , aics , force , estimates

see 
cmethod 
method for validating survival predictions using
rightcensored data. The default is 
u 
the time point for which to validate predictions for survival
models. For 
m 
group predicted 
pred 
vector of predicted survival probabilities at which to evaluate the
calibration curve. By default, the low and high prediction values
from 
cuts 
actual cut points for predicted survival probabilities. You may
specify only one of 
pr 
set to 
what 
The default is 
tol 
criterion for matrix singularity (default is 
maxdim 
see 
maxiter 
for 
rel.tolerance 
parameter passed to

predy 
a scalar or vector of predicted values to calibrate (for 
kint 
For an ordinal logistic model the default predicted
probability that 
smoother 
a function in two variables which produces 
digits 
If specified, predicted values are rounded to

... 
other arguments to pass to 
xlab 
defaults to "Predicted xunits Survival" or to a suitable label for other models 
ylab 
defaults to "Fraction Surviving xunits" or to a suitable label for other models 
xlim , ylim

2vectors specifying x and yaxis limits, if not using defaults 
subtitles 
set to 
conf.int 
set to 
cex.subtitles 
character size for plotting subtitles 
riskdist 
set to 
add 
set to 
scat1d.opts 
a list specifying options to send to 
par.corrected 
a list specifying graphics parameters 
legend 
set to 
If the fit was created using penalized maximum likelihood estimation,
the same penalty
and penalty.scale
parameters are used during
validation.
matrix specifying mean predicted survival in each interval, the
corresponding estimated biascorrected KaplanMeier estimates,
number of subjects, and other statistics. For linear and logistic models,
the matrix instead has rows corresponding to the prediction points, and
the vector of predicted values being validated is returned as an attribute.
The returned object has class "calibrate"
or
"calibrate.default"
.
plot.calibrate.default
invisibly returns the vector of estimated
prediction errors corresponding to the dataset used to fit the model.
prints, and stores an object pred.obs
or .orig.cal
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
validate
, predab.resample
,
groupkm
, errbar
,
scat1d
, cph
, psm
,
lowess
,fit.mult.impute
,
processMI
require(survival)
set.seed(1)
n < 200
d.time < rexp(n)
x1 < runif(n)
x2 < factor(sample(c('a', 'b', 'c'), n, TRUE))
f < cph(Surv(d.time) ~ pol(x1,2) * x2, x=TRUE, y=TRUE, surv=TRUE, time.inc=1.5)
#or f < psm(S ~ \dots)
pa < requireNamespace('polspline')
if(pa) {
cal < calibrate(f, u=1.5, B=20) # cmethod='hare'
plot(cal)
}
cal < calibrate(f, u=1.5, cmethod='KM', m=50, B=20) # usually B=200 or 300
plot(cal, add=pa)
set.seed(1)
y < sample(0:2, n, TRUE)
x1 < runif(n)
x2 < runif(n)
x3 < runif(n)
x4 < runif(n)
f < lrm(y ~ x1 + x2 + x3 * x4, x=TRUE, y=TRUE)
cal < calibrate(f, kint=2, predy=seq(.2, .8, length=60),
group=y)
# group= does ksample validation: make resamples have same
# numbers of subjects in each level of y as original sample
plot(cal)
#See the example for the validate function for a method of validating
#continuation ratio ordinal logistic models. You can do the same
#thing for calibrate
This function computes one or more contrasts of the estimated
regression coefficients in a fit from one of the functions in rms,
along with standard errors, confidence limits, t or Z statistics, Pvalues.
General contrasts are handled by obtaining the design matrix for two
sets of predictor settings (a
, b
) and subtracting the
corresponding rows of the two design matrics to obtain a new contrast
design matrix for testing the a
 b
differences. This allows for
quite general contrasts (e.g., estimated differences in means between
a 30 year old female and a 40 year old male).
This can also be used
to obtain a series of contrasts in the presence of interactions (e.g.,
female:male log odds ratios for several ages when the model contains
age by sex interaction). Another use of contrast
is to obtain
centerweighted (Type III test) and subjectweighted (Type II test)
estimates in a model containing treatment by center interactions. For
the latter case, you can specify type="average"
and an optional
weights
vector to average the withincenter treatment contrasts.
The design contrast matrix computed by contrast.rms
can be used
by other functions.
When the model was fitted by a Bayesian function such as blrm
,
highest posterior density intervals for contrasts are computed instead, along with the
posterior probability that the contrast is positive.
posterior.summary
specifies whether posterior mean/median/mode is
to be used for contrast point estimates.
contrast.rms
also allows one to specify four settings to
contrast, yielding contrasts that are double differences  the
difference between the first two settings (a
 b
) and the
last two (a2
 b2
). This allows assessment of interactions.
If usebootcoef=TRUE
, the fit was run through bootcov
, and
conf.type="individual"
, the confidence intervals are bootstrap
nonparametric percentile confidence intervals, basic bootstrap, or BCa
intervals, obtained on contrasts evaluated on all bootstrap samples.
By omitting the b
argument, contrast
can be used to obtain
an average or weighted average of a series of predicted values, along
with a confidence interval for this average. This can be useful for
"unconditioning" on one of the predictors (see the next to last
example).
Specifying type="joint"
, and specifying at least as many contrasts
as needed to span the space of a complex test, one can make
multiple degree of freedom tests flexibly and simply. Redundant
contrasts will be ignored in the joint test. See the examples below.
These include an example of an "incomplete interaction test" involving
only two of three levels of a categorical variable (the test also tests
the main effect).
When more than one contrast is computed, the list created by
contrast.rms
is suitable for plotting (with error bars or bands)
with xYplot
or Dotplot
(see the last example before the
type="joint"
examples).
When fit
is the result of a Bayesian model fit and fun
is
specified, contrast.rms
operates altogether differently. a
and b
must both be specified and a2, b2
not specified.
fun
is evaluated on the estimates
separately on a
and b
and the subtraction is deferred. So
even in the absence of interactions, when fun
is nonlinear, the
settings of factors (predictors) will not cancel out and estimates of
differences will be covariatespecific (unless there are no covariates
in the model besides the one being varied to get from a
to b
).
contrast(fit, ...)
## S3 method for class 'rms'
contrast(fit, a, b, a2, b2, ycut=NULL, cnames=NULL,
fun=NULL, funint=TRUE,
type=c("individual", "average", "joint"),
conf.type=c("individual","simultaneous"), usebootcoef=TRUE,
boot.type=c("percentile","bca","basic"),
posterior.summary=c('mean', 'median', 'mode'),
weights="equal", conf.int=0.95, tol=1e7, expand=TRUE, ...)
## S3 method for class 'contrast.rms'
print(x, X=FALSE,
fun=function(u)u, jointonly=FALSE, prob=0.95, ...)
fit 
a fit of class 
a 
a list containing settings for all predictors that you do not wish to
set to default (adjustto) values. Usually you will specify two
variables in this list, one set to a constant and one to a sequence of
values, to obtain contrasts for the sequence of values of an
interacting factor. The 
b 
another list that generates the same number of observations as 
a2 
an optional third list of settings of predictors 
b2 
an optional fourth list of settings of predictors. Mandatory
if 
ycut 
used of the fit is a constrained partial proportional odds
model fit, to specify the single value or vector of values
(corresponding to the multiple contrasts) of the response
variable to use in forming contrasts. When there is
nonproportional odds, odds ratios will vary over levels of the
response variable. When there are multiple contrasts and only
one value is given for 
cnames 
vector of character strings naming the contrasts when

fun 
a function to evaluate on the linear predictor for each of

type 
set 
conf.type 
The default type of confidence interval computed for a given
individual (1 d.f.) contrast is a pointwise confidence interval. Set

usebootcoef 
If 
boot.type 
set to 
posterior.summary 
By default the posterior mean is used.
Specify 
weights 
a numeric vector, used when 
conf.int 
confidence level for confidence intervals for the contrasts (HPD interval probability for Bayesian analyses) 
tol 
tolerance for 
expand 
set to 
... 
passed to 
x 
result of 
X 
set 
funint 
set to 
jointonly 
set to 
prob 
highest posterior density interval probability when the fit
was Bayesian and 
a list of class "contrast.rms"
containing the elements
Contrast
, SE
, Z
, var
, df.residual
Lower
, Upper
, Pvalue
, X
, cnames
, redundant
, which denote the contrast
estimates, standard errors, Z or tstatistics, variance matrix,
residual degrees of freedom (this is NULL
if the model was not
ols
), lower and upper confidence limits, 2sided Pvalue, design
matrix, contrast names (or NULL
), and a logical vector denoting
which contrasts are redundant with the other contrasts. If there are
any redundant contrasts, when the results of contrast
are
printed, and asterisk is printed at the start of the corresponding
lines. The object also contains ctype
indicating what method was
used for compute confidence intervals.
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
[email protected]
Predict
, gendata
, bootcov
,
summary.rms
, anova.rms
,
require(ggplot2)
set.seed(1)
age < rnorm(200,40,12)
sex < factor(sample(c('female','male'),200,TRUE))
logit < (sex=='male') + (age40)/5
y < ifelse(runif(200) <= plogis(logit), 1, 0)
f < lrm(y ~ pol(age,2)*sex)
anova(f)
# Compare a 30 year old female to a 40 year old male
# (with or without age x sex interaction in the model)
contrast(f, list(sex='female', age=30), list(sex='male', age=40))
# Test for interaction between age and sex, duplicating anova
contrast(f, list(sex='female', age=30),
list(sex='male', age=30),
list(sex='female', age=c(40,50)),
list(sex='male', age=c(40,50)), type='joint')
# Duplicate overall sex effect in anova with 3 d.f.
contrast(f, list(sex='female', age=c(30,40,50)),
list(sex='male', age=c(30,40,50)), type='joint')
# For females get an array of odds ratios against age=40
k < contrast(f, list(sex='female', age=30:50),
list(sex='female', age=40))
print(k, fun=exp)
# Plot odds ratios with pointwise 0.95 confidence bands using log scale
k < as.data.frame(k[c('Contrast','Lower','Upper')])
ggplot(k, aes(x=30:50, y=exp(Contrast))) + geom_line() +
geom_ribbon(aes(ymin=exp(Lower), ymax=exp(Upper)),
alpha=0.15, linetype=0) +
scale_y_continuous(trans='log10', n.breaks=10,
minor_breaks=c(seq(0.1, 1, by=.1), seq(1, 10, by=.5))) +
xlab('Age') + ylab('OR against age 40')
# For a model containing two treatments, centers, and treatment
# x center interaction, get 0.95 confidence intervals separately
# by center
center < factor(sample(letters[1 : 8], 500, TRUE))
treat < factor(sample(c('a','b'), 500, TRUE))
y < 8*(treat == 'b') + rnorm(500, 100, 20)
f < ols(y ~ treat*center)
lc < levels(center)
contrast(f, list(treat='b', center=lc),
list(treat='a', center=lc))
# Get 'Type III' contrast: average b  a treatment effect over
# centers, weighting centers equally (which is almost always
# an unreasonable thing to do)
contrast(f, list(treat='b', center=lc),
list(treat='a', center=lc),
type='average')
# Get 'Type II' contrast, weighting centers by the number of
# subjects per center. Print the design contrast matrix used.
k < contrast(f, list(treat='b', center=lc),
list(treat='a', center=lc),
type='average', weights=table(center))
print(k, X=TRUE)
# Note: If other variables had interacted with either treat
# or center, we may want to list settings for these variables
# inside the list()'s, so as to not use default settings
# For a 4treatment study, get all comparisons with treatment 'a'
treat < factor(sample(c('a','b','c','d'), 500, TRUE))
y < 8*(treat == 'b') + rnorm(500, 100, 20)
dd < datadist(treat, center); options(datadist='dd')
f < ols(y ~ treat*center)
lt < levels(treat)
contrast(f, list(treat=lt[1]),
list(treat=lt[ 1]),
cnames=paste(lt[1], lt[1], sep=':'), conf.int=1  .05 / 3)
# Compare each treatment with average of all others
for(i in 1 : length(lt)) {
cat('Comparing with', lt[i], '\n\n')
print(contrast(f, list(treat=lt[i]),
list(treat=lt[ i]), type='average'))
}
options(datadist=NULL)
# Six ways to get the same thing, for a variable that
# appears linearly in a model and does not interact with
# any other variables. We estimate the change in y per
# unit change in a predictor x1. Methods 4, 5 also
# provide confidence limits. Method 6 computes nonparametric
# bootstrap confidence limits. Methods 26 can work
# for models that are nonlinear or nonadditive in x1.
# For that case more care is needed in choice of settings
# for x1 and the variables that interact with x1.
## Not run:
coef(fit)['x1'] # method 1
diff(predict(fit, gendata(x1=c(0,1)))) # method 2
g < Function(fit) # method 3
g(x1=1)  g(x1=0)
summary(fit, x1=c(0,1)) # method 4
k < contrast(fit, list(x1=1), list(x1=0)) # method 5
print(k, X=TRUE)
fit < update(fit, x=TRUE, y=TRUE) # method 6
b < bootcov(fit, B=500)
contrast(fit, list(x1=1), list(x1=0))
# In a model containing age, race, and sex,
# compute an estimate of the mean response for a
# 50 year old male, averaged over the races using
# observed frequencies for the races as weights
f < ols(y ~ age + race + sex)
contrast(f, list(age=50, sex='male', race=levels(race)),
type='average', weights=table(race))
# For a Bayesian model get the highest posterior interval for the
# difference in two nonlinear functions of predicted values
# Start with the mean from a proportional odds model
g < blrm(y ~ x)
M < Mean(g)
contrast(g, list(x=1), list(x=0), fun=M)
# For the median we have to make sure that contrast can pass the
# perposteriordraw vector of intercepts through
qu < Quantile(g)
med < function(lp, intercepts) qu(0.5, lp, intercepts=intercepts)
contrast(g, list(x=1), list(x=0), fun=med)
## End(Not run)
# Plot the treatment effect (drug  placebo) as a function of age
# and sex in a model in which age nonlinearly interacts with treatment
# for females only
set.seed(1)
n < 800
treat < factor(sample(c('drug','placebo'), n,TRUE))
sex < factor(sample(c('female','male'), n,TRUE))
age < rnorm(n, 50, 10)
y < .05*age + (sex=='female')*(treat=='drug')*.05*abs(age50) + rnorm(n)
f < ols(y ~ rcs(age,4)*treat*sex)
d < datadist(age, treat, sex); options(datadist='d')
# show separate estimates by treatment and sex
require(ggplot2)
ggplot(Predict(f, age, treat, sex='female'))
ggplot(Predict(f, age, treat, sex='male'))
ages < seq(35,65,by=5); sexes < c('female','male')
w < contrast(f, list(treat='drug', age=ages, sex=sexes),
list(treat='placebo', age=ages, sex=sexes))
# add conf.type="simultaneous" to adjust for having done 14 contrasts
xYplot(Cbind(Contrast, Lower, Upper) ~ age  sex, data=w,
ylab='Drug  Placebo')
w < as.data.frame(w[c('age','sex','Contrast','Lower','Upper')])
ggplot(w, aes(x=age, y=Contrast)) + geom_point() + facet_grid(sex ~ .) +
geom_errorbar(aes(ymin=Lower, ymax=Upper), width=0)
ggplot(w, aes(x=age, y=Contrast)) + geom_line() + facet_grid(sex ~ .) +
geom_ribbon(aes(ymin=Lower, ymax=Upper), width=0, alpha=0.15, linetype=0)
xYplot(Cbind(Contrast, Lower, Upper) ~ age, groups=sex, data=w,
ylab='Drug  Placebo', method='alt bars')
options(datadist=NULL)
# Examples of type='joint' contrast tests
set.seed(1)
x1 < rnorm(100)
x2 < factor(sample(c('a','b','c'), 100, TRUE))
dd < datadist(x1, x2); options(datadist='dd')
y < x1 + (x2=='b') + rnorm(100)
# First replicate a test statistic from anova()
f < ols(y ~ x2)
anova(f)
contrast(f, list(x2=c('b','c')), list(x2='a'), type='joint')
# Repeat with a redundancy; compare a vs b, a vs c, b vs c
contrast(f, list(x2=c('a','a','b')), list(x2=c('b','c','c')), type='joint')
# Get a test of association of a continuous predictor with y
# First assume linearity, then cubic
f < lrm(y>0 ~ x1 + x2)
anova(f)
contrast(f, list(x1=1), list(x1=0), type='joint') # a minimum set of contrasts
xs < seq(2, 2, length=20)
contrast(f, list(x1=0), list(x1=xs), type='joint')
# All contrasts were redundant except for the first, because of
# linearity assumption
f < lrm(y>0 ~ pol(x1,3) + x2)
anova(f)
contrast(f, list(x1=0), list(x1=xs), type='joint')
print(contrast(f, list(x1=0), list(x1=xs), type='joint'), jointonly=TRUE)
# All contrasts were redundant except for the first 3, because of
# cubic regression assumption
# Now do something that is difficult to do without cryptic contrast
# matrix operations: Allow each of the three x2 groups to have a different
# shape for the x1 effect where x1 is quadratic. Test whether there is
# a difference in mean levels of y for x2='b' vs. 'c' or whether
# the shape or slope of x1 is different between x2='b' and x2='c' regardless
# of how they differ when x2='a'. In other words, test whether the mean
# response differs between group b and c at any value of x1.
# This is a 3 d.f. test (intercept, linear, quadratic effects) and is
# a better approach than subsetting the data to remove x2='a' then
# fitting a simpler model, as it uses a better estimate of sigma from
# all the data.
f < ols(y ~ pol(x1,2) * x2)
anova(f)
contrast(f, list(x1=xs, x2='b'),
list(x1=xs, x2='c'), type='joint')
# Note: If using a spline fit, there should be at least one value of
# x1 between any two knots and beyond the outer knots.
options(datadist=NULL)
Modification of Therneau's coxph
function to fit the Cox model and
its extension, the AndersenGill model. The latter allows for interval
timedependent covariables, timedependent strata, and repeated events.
The Survival
method for an object created by cph
returns an S
function for computing estimates of the survival function.
The Quantile
method for cph
returns an S function for computing
quantiles of survival time (median, by default).
The Mean
method returns a function for computing the mean survival
time. This function issues a warning if the last followup time is uncensored,
unless a restricted mean is explicitly requested.
cph(formula = formula(data), data=environment(formula),
weights, subset, na.action=na.delete,
method=c("efron","breslow","exact","model.frame","model.matrix"),
singular.ok=FALSE, robust=FALSE,
model=FALSE, x=FALSE, y=FALSE, se.fit=FALSE,
linear.predictors=TRUE, residuals=TRUE, nonames=FALSE,
eps=1e4, init, iter.max=10, tol=1e9, surv=FALSE, time.inc,
type=NULL, vartype=NULL, debug=FALSE, ...)
## S3 method for class 'cph'
Survival(object, ...)
# Evaluate result as g(times, lp, stratum=1, type=c("step","polygon"))
## S3 method for class 'cph'
Quantile(object, ...)
# Evaluate like h(q, lp, stratum=1, type=c("step","polygon"))
## S3 method for class 'cph'
Mean(object, method=c("exact","approximate"), type=c("step","polygon"),
n=75, tmax, ...)
# E.g. m(lp, stratum=1, type=c("step","polygon"), tmax, \dots)
formula 
an S formula object with a 
object 
an object created by 
data 
name of an S data frame containing all needed variables. Omit this to use a data frame already in the S “search list”. 
weights 
case weights 
subset 
an expression defining a subset of the observations to use in the fit. The default
is to use all observations. Specify for example 
na.action 
specifies an S function to handle missing data. The default is the function 
method 
for For 
singular.ok 
If 
robust 
if 
model 
default is 
x 
default is 
y 
default is 
se.fit 
default is 
linear.predictors 
set to 
residuals 
set to 
nonames 
set to 
eps 
convergence criterion  change in log likelihood. 
init 
vector of initial parameter estimates. Defaults to all zeros.
Special residuals can be obtained by setting some elements of 
iter.max 
maximum number of iterations to allow. Set to 
tol 
tolerance for declaring singularity for matrix inversion (available only when survival5 or later package is in effect) 
surv 
set to 
time.inc 
time increment used in deriving 
type 
(for For 
vartype 
see 
debug 
set to 
... 
other arguments passed to 
times 
a scalar or vector of times at which to evaluate the survival estimates 
lp 
a scalar or vector of linear predictors (including the centering constant) at which to evaluate the survival estimates 
stratum 
a scalar stratum number or name (e.g., 
q 
a scalar quantile or a vector of quantiles to compute 
n 
the number of points at which to evaluate the mean survival time, for

tmax 
For 
If there is any strata by covariable interaction in the model such that
the mean X beta varies greatly over strata, method="approximate"
may
not yield very accurate estimates of the mean in Mean.cph
.
For method="approximate"
if you ask for an estimate of the mean for
a linear predictor value that was outside the range of linear predictors
stored with the fit, the mean for that observation will be NA
.
For Survival
, Quantile
, or Mean
, an S function is returned. Otherwise,
in addition to what is listed below, formula/design information and
the components
maxtime, time.inc, units, model, x, y, se.fit
are stored, the last 5
depending on the settings of options by the same names.
The vectors or matrix stored if y=TRUE
or x=TRUE
have rows deleted according to subset
and
to missing data, and have names or row names that come from the
data frame used as input data.
n 
table with one row per stratum containing number of censored and uncensored observations 
coef 
vector of regression coefficients 
stats 
vector containing the named elements 
var 
variance/covariance matrix of coefficients 
linear.predictors 
values of predicted X beta for observations used in fit, normalized to have overall mean zero, then having any offsets added 
resid 
martingale residuals 
loglik 
log likelihood at initial and final parameter values 
score 
value of score statistic at initial values of parameters 
times 
lists of times (if 
surv 
lists of underlying survival probability estimates 
std.err 
lists of standard errors of estimate loglog survival 
surv.summary 
a 3 dimensional array if 
center 
centering constant, equal to overall mean of X beta. 
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
coxph
, survivalinternal
,
Surv
, residuals.cph
,
cox.zph
, survfit.cph
,
survest.cph
, survfit.coxph
,
survplot
, datadist
,
rms
, rms.trans
, anova.rms
,
summary.rms
, Predict
,
fastbw
, validate
, calibrate
,
plot.Predict
, ggplot.Predict
,
specs.rms
, lrm
, which.influence
,
na.delete
,
na.detail.response
, print.cph
,
latex.cph
, vif
, ie.setup
,
GiniMd
, dxy.cens
,
concordance
# Simulate data from a population model in which the log hazard
# function is linear in age and there is no age x sex interaction
require(survival)
require(ggplot2)
n < 1000
set.seed(731)
age < 50 + 12*rnorm(n)
label(age) < "Age"
sex < factor(sample(c('Male','Female'), n,
rep=TRUE, prob=c(.6, .4)))
cens < 15*runif(n)
h < .02*exp(.04*(age50)+.8*(sex=='Female'))
dt < log(runif(n))/h
label(dt) < 'Followup Time'
e < ifelse(dt <= cens,1,0)
dt < pmin(dt, cens)
units(dt) < "Year"
dd < datadist(age, sex)
options(datadist='dd')
S < Surv(dt,e)
f < cph(S ~ rcs(age,4) + sex, x=TRUE, y=TRUE)
cox.zph(f, "rank") # tests of PH
anova(f)
ggplot(Predict(f, age, sex)) # plot age effect, 2 curves for 2 sexes
survplot(f, sex) # time on xaxis, curves for x2
res < resid(f, "scaledsch")
time < as.numeric(dimnames(res)[[1]])
z < loess(res[,4] ~ time, span=0.50) # residuals for sex
plot(time, fitted(z))
lines(supsmu(time, res[,4]),lty=2)
plot(cox.zph(f,"identity")) #Easier approach for last few lines
# latex(f)
f < cph(S ~ age + strat(sex), surv=TRUE)
g < Survival(f) # g is a function
g(seq(.1,1,by=.1), stratum="sex=Male", type="poly") #could use stratum=2
med < Quantile(f)
plot(Predict(f, age, fun=function(x) med(lp=x))) #plot median survival
# Fit a model that is quadratic in age, interacting with sex as strata
# Compare standard errors of linear predictor values with those from
# coxph
# Use more stringent convergence criteria to match with coxph
f < cph(S ~ pol(age,2)*strat(sex), x=TRUE, eps=1e9, iter.max=20)
coef(f)
se < predict(f, se.fit=TRUE)$se.fit
require(lattice)
xyplot(se ~ age  sex, main='From cph')
a < c(30,50,70)
comb < data.frame(age=rep(a, each=2),
sex=rep(levels(sex), 3))
p < predict(f, comb, se.fit=TRUE)
comb$yhat < p$linear.predictors
comb$se < p$se.fit
z < qnorm(.975)
comb$lower < p$linear.predictors  z*p$se.fit
comb$upper < p$linear.predictors + z*p$se.fit
comb
age2 < age^2
f2 < coxph(S ~ (age + age2)*strata(sex))
coef(f2)
se < predict(f2, se.fit=TRUE)$se.fit
xyplot(se ~ age  sex, main='From coxph')
comb < data.frame(age=rep(a, each=2), age2=rep(a, each=2)^2,
sex=rep(levels(sex), 3))
p < predict(f2, newdata=comb, se.fit=TRUE)
comb$yhat < p$fit
comb$se < p$se.fit
comb$lower < p$fit  z*p$se.fit
comb$upper < p$fit + z*p$se.fit
comb
# g < cph(Surv(hospital.charges) ~ age, surv=TRUE)
# Cox model very useful for analyzing highly skewed data, censored or not
# m < Mean(g)
# m(0) # Predicted mean charge for reference age
#Fit a timedependent covariable representing the instantaneous effect
#of an intervening nonfatal event
rm(age)
set.seed(121)
dframe < data.frame(failure.time=1:10, event=rep(0:1,5),
ie.time=c(NA,1.5,2.5,NA,3,4,NA,5,5,5),
age=sample(40:80,10,rep=TRUE))
z < ie.setup(dframe$failure.time, dframe$event, dframe$ie.time)
S < z$S
ie.status < z$ie.status
attach(dframe[z$subs,]) # replicates all variables
f < cph(S ~ age + ie.status, x=TRUE, y=TRUE)
#Must use x=TRUE,y=TRUE to get survival curves with timedep. covariables
#Get estimated survival curve for a 50year old who has an intervening
#nonfatal event at 5 days
new < data.frame(S=Surv(c(0,5), c(5,999), c(FALSE,FALSE)), age=rep(50,2),
ie.status=c(0,1))
g < survfit(f, new)
plot(c(0,g$time), c(1,g$surv[,2]), type='s',
xlab='Days', ylab='Survival Prob.')
# Not certain about what columns represent in g$surv for survival5
# but appears to be for different ie.status
#or:
#g < survest(f, new)
#plot(g$time, g$surv, type='s', xlab='Days', ylab='Survival Prob.')
#Compare with estimates when there is no intervening event
new2 < data.frame(S=Surv(c(0,5), c(5, 999), c(FALSE,FALSE)), age=rep(50,2),
ie.status=c(0,0))
g2 < survfit(f, new2)
lines(c(0,g2$time), c(1,g2$surv[,2]), type='s', lty=2)
#or:
#g2 < survest(f, new2)
#lines(g2$time, g2$surv, type='s', lty=2)
detach("dframe[z$subs, ]")
options(datadist=NULL)
Creates several new variables which help set up a dataset with an
ordinal response variable $y$
for use in fitting a forward continuation
ratio (CR) model. The CR model can be fitted with binary logistic
regression if each input observation is replicated the proper
number of times according to the $y$
value, a new binary $y$
is computed that has at most one $y=1$
per subject,
and if a cohort
variable is used to define the current
qualifying condition for a cohort of subjects, e.g., $y\geq 2$
.
cr.setup
creates the needed auxilliary variables. See
predab.resample
and validate.lrm
for information about
validating CR models (e.g., using the bootstrap to sample with
replacement from the original subjects instead of the records used in
the fit, validating the model separately for userspecified values of
cohort
).
cr.setup(y)
y 
a character, numeric, 
a list with components y, cohort, subs, reps
. y
is a new binary
variable that is to be used in the binary logistic fit. cohort
is
a factor
vector specifying which cohort condition currently applies.
subs
is a vector of subscripts that can be used to replicate other
variables the same way y
was replicated. reps
specifies how many
times each original observation was replicated. y, cohort, subs
are
all the same length and are longer than the original y
vector.
reps
is the same length as the original y
vector.
The subs
vector is suitable for passing to validate.lrm
or calibrate
,
which pass this vector under the name cluster
on to predab.resample
so that bootstrapping can be
done by sampling with replacement from the original subjects rather than
from the individual records created by cr.setup
.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
Berridge DM, Whitehead J: Analysis of failure time data with ordinal categories of response. Stat in Med 10:1703–1710, 1991.
y < c(NA, 10, 21, 32, 32)
cr.setup(y)
set.seed(171)
y < sample(0:2, 100, rep=TRUE)
sex < sample(c("f","m"),100,rep=TRUE)
sex < factor(sex)
table(sex, y)
options(digits=5)
tapply(y==0, sex, mean)
tapply(y==1, sex, mean)
tapply(y==2, sex, mean)
cohort < y>=1
tapply(y[cohort]==1, sex[cohort], mean)
u < cr.setup(y)
Y < u$y
cohort < u$cohort
sex < sex[u$subs]
lrm(Y ~ cohort + sex)
f < lrm(Y ~ cohort*sex) # saturated model  has to fit all data cells
f
#Prob(y=0female):
# plogis(.50078)
#Prob(y=0male):
# plogis(.50078+.11301)
#Prob(y=1y>=1, female):
plogis(.50078+.31845)
#Prob(y=1y>=1, male):
plogis(.50078+.31845+.11301.07379)
combinations < expand.grid(cohort=levels(cohort), sex=levels(sex))
combinations
p < predict(f, combinations, type="fitted")
p
p0 < p[c(1,3)]
p1 < p[c(2,4)]
p1.unconditional < (1  p0) *p1
p1.unconditional
p2.unconditional < 1  p0  p1.unconditional
p2.unconditional
## Not run:
dd < datadist(inputdata) # do this on nonreplicated data
options(datadist='dd')
pain.severity < inputdata$pain.severity
u < cr.setup(pain.severity)
# inputdata frame has age, sex with pain.severity
attach(inputdata[u$subs,]) # replicate age, sex
# If age, sex already available, could do age < age[u$subs] etc., or
# age < rep(age, u$reps), etc.
y < u$y
cohort < u$cohort
dd < datadist(dd, cohort) # add to dd
f < lrm(y ~ cohort + age*sex) # ordinary cont. ratio model
g < lrm(y ~ cohort*sex + age, x=TRUE,y=TRUE) # allow unequal slopes for
# sex across cutoffs
cal < calibrate(g, cluster=u$subs, subset=cohort=='all')
# subs makes bootstrap sample the correct units, subset causes
# Predicted Prob(pain.severity=0) to be checked for calibration
## End(Not run)
For a given set of variables or a data frame, determines summaries
of variables for effect and plotting ranges, values to adjust to,
and overall ranges
for Predict
, plot.Predict
, ggplot.Predict
,
summary.rms
, survplot
, and nomogram.rms
.
If datadist
is called before
a model fit and the resulting object pointed to with options(datadist="name")
,
the data characteristics will be stored with the fit by Design()
, so
that later predictions and summaries of the fit will not need to access
the original data used in the fit. Alternatively, you can specify the
values for each variable in the model when using these 3 functions, or
specify the values of some of them and let the functions look up the
remainder (of say adjustmemt levels) from an object created by datadist
.
The best method is probably to run datadist
once before any models are
fitted, storing the distribution summaries for all potential variables.
Adjustment values are 0
for binary variables, the most frequent
category (or optionally the first category level)
for categorical (factor
) variables, the middle level for
ordered factor
variables, and medians for continuous variables.
See descriptions of q.display
and q.effect
for how display and
effect ranges are chosen for continuous variables.
datadist(..., data, q.display, q.effect=c(0.25, 0.75),
adjto.cat=c('mode','first'), n.unique=10)
## S3 method for class 'datadist'
print(x, ...)
# options(datadist="dd")
# used by summary, plot, survplot, sometimes predict
# For dd substitute the name of the result of datadist
... 
a list of variable names, separated by commas, a single data frame, or
a fit with 
data 
a data frame or a search position. If 
q.display 
set of two quantiles for computing the range of continuous variables
to use in displaying regression relationships. Defaults are

q.effect 
set of two quantiles for computing the range of continuous variables to use in estimating regression effects. Defaults are c(.25,.75), which yields interquartilerange odds ratios, etc. 
adjto.cat 
default is 
n.unique 
variables having 
x 
result of 
For categorical variables, the 7 limits are set to character strings
(factors) which correspond to
c(NA,adjto.level,NA,1,k,1,k)
, where k
is the number of levels.
For ordered variables with numeric levels, the limits are set to
c(L,M,H,L,H,L,H)
, where L
is the lowest level, M
is the middle
level, and H
is the highest level.
a list of class "datadist"
with the following components
limits 
a 
values 
a named list, with one vector of unique values for each numeric
variable having no more than 
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
rms
, rms.trans
, describe
, Predict
, summary.rms
## Not run:
d < datadist(data=1) # use all variables in search pos. 1
d < datadist(x1, x2, x3)
page(d) # if your options(pager) leaves up a popup
# window, this is a useful guide in analyses
d < datadist(data=2) # all variables in search pos. 2
d < datadist(data=my.data.frame)
d < datadist(my.data.frame) # same as previous. Run for all potential vars.
d < datadist(x2, x3, data=my.data.frame) # combine variables
d < datadist(x2, x3, q.effect=c(.1,.9), q.display=c(0,1))
# uses interdecile range odds ratios,
# total range of variables for regression function plots
d < datadist(d, z) # add a new variable to an existing datadist
options(datadist="d") #often a good idea, to store info with fit
f < ols(y ~ x1*x2*x3)
options(datadist=NULL) #default at start of session
f < ols(y ~ x1*x2)
d < datadist(f) #info not stored in `f'
d$limits["Adjust to","x1"] < .5 #reset adjustment level to .5
options(datadist="d")
f < lrm(y ~ x1*x2, data=mydata)
d < datadist(f, data=mydata)
options(datadist="d")
f < lrm(y ~ x1*x2) #datadist not used  specify all values for
summary(f, x1=c(200,500,800), x2=c(1,3,5)) # obtaining predictions
plot(Predict(f, x1=200:800, x2=3)) # or ggplot()
# Change reference value to get a relative odds plot for a logistic model
d$limits$age[2] < 30 # make 30 the reference value for age
# Could also do: d$limits["Adjust to","age"] < 30
fit < update(fit) # make new reference value take effect
plot(Predict(fit, age, ref.zero=TRUE, fun=exp),
ylab='Age=x:Age=30 Odds Ratio') # or ggplot()
## End(Not run)
For an orm
object generates a function for computing the
estimates of the function Prob(Y>=y) given one or more values of the
linear predictor using the reference (median) intercept. This
function can optionally be evaluated at only a set of userspecified
y
values, otherwise a rightstep function is returned. There
is a plot method for plotting the step functions, and if more than one
linear predictor was evaluated multiple step functions are drawn.
ExProb
is especially useful for nomogram
.
Optionally a normal approximation for a confidence
interval for exceedance probabilities will be computed using the delta
method, if
conf.int > 0
is specified to the function generated from calling
ExProb
. In that case, a "lims"
attribute is included
in the result computed by the derived cumulative probability function.
ExProb(object, ...)
## S3 method for class 'orm'
ExProb(object, codes = FALSE, ...)
## S3 method for class 'ExProb'
plot(x, ..., data=NULL,
xlim=NULL, xlab=x$yname, ylab=expression(Prob(Y>=y)),
col=par('col'), col.vert='gray85', pch=20,
pch.data=21, lwd=par('lwd'), lwd.data=lwd,
lty.data=2, key=TRUE)
object 
a fit object from 
codes 
if 
... 
ignored for 
data 
Specify 
x 
an object created by running the function created by 
xlim 
limits for xaxis; default is range of observed 
xlab 
xaxis label 
ylab 
yaxis label 
col 
color for horizontal lines and points 
col.vert 
color for vertical discontinuities 
pch 
plotting symbol for predicted curves 
lwd 
line width for predicted curves 
pch.data , lwd.data , lty.data

plotting parameters for data 
key 
set to 
ExProb
returns an R function. Running the function returns an
object of class "ExProb"
.
Frank Harrell and Shengxin Tu
set.seed(1)
x1 < runif(200)
yvar < x1 + runif(200)
f < orm(yvar ~ x1)
d < ExProb(f)
lp < predict(f, newdata=data.frame(x1=c(.2,.8)))
w < d(lp)
s1 < abs(x1  .2) < .1
s2 < abs(x1  .8) < .1
plot(w, data=data.frame(x1=c(rep(.2, sum(s1)), rep(.8, sum(s2))),
yvar=c(yvar[s1], yvar[s2])))
qu < Quantile(f)
abline(h=c(.1,.5), col='gray80')
abline(v=qu(.5, lp), col='gray80')
abline(v=qu(.9, lp), col='green')
Performs a slightly inefficient but numerically stable version of fast
backward elimination on factors, using a method based on Lawless and Singhal
(1978).
This method uses the fitted complete model and computes approximate Wald
statistics by computing conditional (restricted) maximum likelihood estimates
assuming multivariate normality of estimates.
fastbw
deletes factors, not columns of the design matrix. Factors requiring multiple d.f. will be retained or dropped as a group.
The function prints the deletion statistics for each variable in
turn, and prints approximate parameter estimates for the model after
deleting variables. The approximation is better when the number of
factors deleted is not large. For ols
, the approximation is exact for
regression coefficients, and standard errors are only off by a factor
equal to the ratio of the mean squared error estimate for the reduced
model to the original mean squared error estimate for the full model.
If the fit was from ols
, fastbw
will compute the usual $R^2$
statistic for each model.
fastbw(fit, rule=c("aic", "p"),
type=c("residual", "individual", "total"), sls=.05, aics=0, eps=1e9,
k.aic=2, force=NULL)
## S3 method for class 'fastbw'
print(x, digits=4, estimates=TRUE, ...)
fit 
fit object with 
rule 
Stopping rule. Defaults to 
type 
Type of statistic on which to base the stopping rule. Default is

sls 
Significance level for staying in a model if 
aics 
For 
eps 
Singularity criterion, default is 
k.aic 
multiplier to compute AIC, default is 2. To use BIC, set 
force 
a vector of integers specifying parameters forced to be in the model, not counting intercept(s) 
x 
result of 
digits 
number of significant digits to print 
estimates 
set to 
... 
ignored 
a list with an attribute kept
if bw=TRUE
, and the
following components:
result 
matrix of statistics with rows in order of deletion. 
names.kept 
names of factors kept in final model. 
factors.kept 
the subscripts of factors kept in the final model 
factors.deleted 
opposite of 
parms.kept 
column numbers in design matrix corresponding to parameters kept in the final model. 
parms.deleted 
opposite of 
coefficients 
vector of approximate coefficients of reduced model. 
var 
approximate covariance matrix for reduced model. 
Coefficients 
matrix of coefficients of all models. Rows correspond to the successive models examined and columns correspond to the coefficients in the full model. For variables not in a particular submodel (row), the coefficients are zero. 
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
Lawless, J. F. and Singhal, K. (1978): Efficient screening of nonnormal regression models. Biometrics 34:318–327.
rms
, ols
, lrm
,
cph
, psm
, validate
,
solvet
, rmsMisc
## Not run:
fastbw(fit, optional.arguments) # print results
z < fastbw(fit, optional.args) # typically used in simulations
lm.fit(X[,z$parms.kept], Y) # least squares fit of reduced model
## End(Not run)
Function
is a class of functions for creating other S functions.
Function.rms
is the method for creating S functions to compute
X beta, based on a model fitted with rms
in effect.
Like latexrms
, Function.rms
simplifies restricted cubic
spline functions and factors out terms in secondorder interactions.
Function.rms
will not work for models that have thirdorder
interactions involving restricted cubic splines.
Function.cph
is a particular method for handling fits from
cph
, for which an intercept (the negative of the centering
constant) is added to
the model. sascode
is a function that takes an S function such
as one created by Function
and does most of the editing
to turn the function definition into
a fragment of SAS code for computing X beta from the fitted model, along
with assignment statements that initialize predictors to reference
values.
perlcode
similarly creates Perl code to evaluate a fitted
regression model.
## S3 method for class 'rms'
Function(object, intercept=NULL,
digits=max(8, .Options$digits), posterior.summary=c('mean', 'median', 'mode'), ...)
## S3 method for class 'cph'
Function(object, intercept=object$center, ...)
# Use result as fun(predictor1=value1, predictor2=value2, \dots)
sascode(object, file='', append=FALSE)
perlcode(object)
object 
a fit created with 
intercept 
an intercept value to use (not allowed to be specified to 
digits 
number of significant digits to use for coefficients and knot locations 
posterior.summary 
if using a Bayesian model fit such as from

file 
name of a file in which to write the SAS code. Default is to write to standard output. 
append 
set to 
... 
arguments to pass to 
Function
returns an SPlus function that can be invoked in any
usual context. The function has one argument per predictor variable,
and the default values of the predictors are set to adjustto
values
(see datadist
). Multiple predicted X beta values may be calculated
by specifying vectors as arguments to the created function.
All nonscalar argument values must have the same length.
perlcode
returns a character string with embedded newline characters.
Frank Harrell, Jeremy Stephens, and Thomas Dupont
Department of Biostatistics
Vanderbilt University
[email protected]
latexrms
, transcan
,
predict.rms
, rms
, rms.trans
suppressWarnings(RNGversion("3.5.0"))
set.seed(1331)
x1 < exp(rnorm(100))
x2 < factor(sample(c('a','b'),100,rep=TRUE))
dd < datadist(x1, x2)
options(datadist='dd')
y < log(x1)^2+log(x1)*(x2=='b')+rnorm(100)/4
f < ols(y ~ pol(log(x1),2)*x2)
f$coef
g < Function(f, digits=5)
g
sascode(g)
cat(perlcode(g), '\n')
g()
g(x1=c(2,3), x2='b') #could omit x2 since b is default category
predict(f, expand.grid(x1=c(2,3),x2='b'))
g8 < Function(f) # default is 8 sig. digits
g8(x1=c(2,3), x2='b')
options(datadist=NULL)
## Not run:
require(survival)
# Make selfcontained functions for computing survival probabilities
# using a lognormal regression
f < psm(Surv(d.time, death) ~ rcs(age,4)*sex, dist='gaussian')
g < Function(f)
surv < Survival(f)
# Compute 2 and 5year survival estimates for 50 year old male
surv(c(2,5), g(age=50, sex='male'))
## End(Not run)
If nobs
is not specified, allows user to specify predictor settings
by e.g. age=50, sex="male"
, and any omitted predictors are set to
reference values (default=median for continuous variables, first level
for categorical ones  see datadist
). If any predictor has more than one
value given, expand.grid
is called to generate all possible combinations
of values, unless expand=FALSE
. If nobs
is given, a data
frame is first generated which has
nobs
of adjustto values duplicated. Then an editor window is opened
which allows the user to subset the variable names down to ones which she
intends to vary (this streamlines the data.ed
step). Then, if any
predictors kept are discrete and viewvals=TRUE
, a window (using page
)
is opened defining the possible values of this subset, to facilitate
data editing. Then the data.ed
function is invoked to allow interactive
overriding of predictor settings in the nobs
rows. The subset of
variables are combined with the other predictors which were not
displayed with data.ed
, and a final full data frame is returned.
gendata
is most useful for creating a newdata
data frame to pass
to predict
.
gendata(fit, ..., nobs, viewvals=FALSE, expand=TRUE, factors)
fit 
a fit object created with 
... 
predictor settings, if 
nobs 
number of observations to create if doing it interactively using Xwindows 
viewvals 
if 
expand 
set to 
factors 
a list containing predictor settings with their names. This is an
alternative to specifying the variables separately in .... Unlike the
usage of ..., variables getting default ranges in 
if you have a variable in ...
that is named n, no, nob,
nob
, add nobs=FALSE
to the invocation to prevent that variable
from being misrecognized as nobs
a data frame with all predictors, and an attribute names.subset
if
nobs
is specified. This attribute contains the vector of variable
names for predictors which were passed to de
and hence were
allowed to vary. If neither nobs
nor any predictor settings were
given, returns a data frame with adjustto values.
optionally writes to the terminal, opens Xwindows, and generates a
temporary file using sink
.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
predict.rms
, survest.cph
,
survest.psm
, rmsMisc
,
expand.grid
, de
, page
,
print.datadist
, Predict
set.seed(1)
age < rnorm(200, 50, 10)
sex < factor(sample(c('female','male'),200,TRUE))
race < factor(sample(c('a','b','c','d'),200,TRUE))
y < sample(0:1, 200, TRUE)
dd < datadist(age,sex,race)
options(datadist="dd")
f < lrm(y ~ age*sex + race)
gendata(f)
gendata(f, age=50)
d < gendata(f, age=50, sex="female") # leave race=reference category
d < gendata(f, age=c(50,60), race=c("b","a")) # 4 obs.
d$Predicted < predict(f, d, type="fitted")
d # Predicted column prints at the far right
options(datadist=NULL)
## Not run:
d < gendata(f, nobs=5, view=TRUE) # 5 interactively defined obs.
d[,attr(d,"names.subset")] # print variables which varied
predict(f, d)
## End(Not run)
Uses ggplot2
graphics to plot the effect of one or two predictors
on the linear predictor or X beta scale, or on some transformation of
that scale. The first argument specifies the result of the
Predict
function. The predictor is always plotted in its
original coding.
If rdata
is given, a spike histogram is drawn showing
the location/density of data values for the $x$
axis variable. If
there is a groups
(superposition) variable that generated separate
curves, the data density specific to each class of points is shown.
This assumes that the second variable was a factor variable. The histograms
are drawn by histSpikeg
.
To plot effects instead of estimates (e.g., treatment differences as a
function of interacting factors) see contrast.rms
and
summary.rms
.
## S3 method for class 'Predict'
ggplot(data, mapping, formula=NULL, groups=NULL,
aestype=c('color', 'linetype'),
conf=c('fill', 'lines'),
conflinetype=1,
varypred=FALSE, sepdiscrete=c('no', 'list', 'vertical', 'horizontal'),
subset, xlim., ylim., xlab, ylab,
colorscale=function(...) scale_color_manual(...,
values=c("#000000", "#E69F00", "#56B4E9",
"#009E73","#F0E442", "#0072B2", "#D55E00", "#CC79A7")),
colfill='black',
rdata=NULL, anova=NULL, pval=FALSE, size.anova=4,
adj.subtitle, size.adj=2.5, perim=NULL, nlevels=3,
flipxdiscrete=TRUE,
legend.position='right', legend.label=NULL,
vnames=c('labels','names'), abbrev=FALSE, minlength=6,
layout=NULL, addlayer,
histSpike.opts=list(frac=function(f) 0.01 +
0.02 * sqrt(f  1)/sqrt(max(f, 2)  1), side=1, nint=100),
type=NULL, ggexpr=FALSE, height=NULL, width=NULL, ..., environment)
data 
a data frame created by 
mapping 
kept because of 
formula 
a 
groups 
an optional character string containing the
name of one of the variables in 
aestype 
a string vector of aesthetic names corresponding to
variables in the 
conf 
specify 
conflinetype 
specify an alternative 
varypred 
set to 
sepdiscrete 
set to something other than 
subset 
a subsetting expression for restricting the rows of

xlim. 
This parameter is seldom used, as limits are usually controlled with

ylim. 
Range for plotting on response variable axis. Computed by default.
Usually specified using its legal definition 
xlab 
Label for 
ylab 
Label for 
colorscale 
a 
colfill 
a single character string or number specifying the fill color
to use for 
rdata 
a data frame containing the original raw data on which the
regression model were based, or at least containing the 
anova 
an object returned by 
pval 
specify 
size.anova 
character size for the test statistic printed on the panel, mm 
adj.subtitle 
Set to 
size.adj 
Size of adjustment settings in subtitles in mm. Default is 2.5. 
perim 

nlevels 
when 
flipxdiscrete 
see 
legend.position 

legend.label 
if omitted, group variable labels will be used for
label the legend. Specify 
vnames 
applies to the case where multiple plots are produced
separately by predictor. Set to 
abbrev 
set to true to abbreviate levels of predictors that are
categorical to a minimum length of 
minlength 
see 
layout 
for multipanel plots a 2vector specifying the number of rows and number of columns. If omitted will be computed from the number of panels to make as square as possible. 
addlayer 
a 
histSpike.opts 
a list containing named elements that specifies
parameters to 
type 
a value ( 
ggexpr 
set to 
height , width

used if 
... 
ignored 
environment 
ignored; used to satisfy rules because of the generic ggplot 
an object of class "ggplot2"
ready for printing. For the
case where predictors were not specified to Predict
,
sepdiscrete=TRUE
, and there were both continuous and discrete
predictors in the model, a list of two graphics objects is returned.
If plotting the effects of all predictors you can reorder the
panels using for example p < Predict(fit); p$.predictor. <
factor(p$.predictor., v)
where v
is a vector of predictor
names specified in the desired order.
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
Fox J, Hong J (2009): Effect displays in R for multinomial and proportionalodds logit models: Extensions to the effects package. J Stat Software 32 No. 1.
Predict
, rbind.Predict
,
datadist
, predictrms
, anova.rms
,
contrast.rms
, summary.rms
,
rms
, rmsMisc
, plot.Predict
,
labcurve
, histSpikeg
,
ggplot
, Overview
require(ggplot2)
n < 350 # define sample size
set.seed(17) # so can reproduce the results
age < rnorm(n, 50, 10)
blood.pressure < rnorm(n, 120, 15)
cholesterol < rnorm(n, 200, 25)
sex < factor(sample(c('female','male'), n,TRUE))
label(age) < 'Age' # label is in Hmisc
label(cholesterol) < 'Total Cholesterol'
label(blood.pressure) < 'Systolic Blood Pressure'
label(sex) < 'Sex'
units(cholesterol) < 'mg/dl' # uses units.default in Hmisc
units(blood.pressure) < 'mmHg'
# Specify population model for log odds that Y=1
L < .4*(sex=='male') + .045*(age50) +
(log(cholesterol  10)5.2)*(2*(sex=='female') + 2*(sex=='male')) +
.01 * (blood.pressure  120)
# Simulate binary y to have Prob(y=1) = 1/[1+exp(L)]
y < ifelse(runif(n) < plogis(L), 1, 0)
ddist < datadist(age, blood.pressure, cholesterol, sex)
options(datadist='ddist')
fit < lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)),
x=TRUE, y=TRUE)
an < anova(fit)
# Plot effects in two vertical subpanels with continuous predictors on top
# ggplot(Predict(fit), sepdiscrete='vertical')
# Plot effects of all 4 predictors with test statistics from anova, and P
ggplot(Predict(fit), anova=an, pval=TRUE)
# ggplot(Predict(fit), rdata=llist(blood.pressure, age))
# spike histogram plot for two of the predictors
# p < Predict(fit, name=c('age','cholesterol')) # Make 2 plots
# ggplot(p)
# p < Predict(fit, age=seq(20,80,length=100), sex, conf.int=FALSE)
# # Plot relationship between age and log
# odds, separate curve for each sex,
# ggplot(p, subset=sex=='female'  age > 30)
# No confidence interval, suppress estimates for males <= 30
# p < Predict(fit, age, sex)
# ggplot(p, rdata=llist(age,sex))
# rdata= allows rug plots (1dimensional scatterplots)
# on each sex's curve, with sex
# specific density of age
# If data were in data frame could have used that
# p < Predict(fit, age=seq(20,80,length=100), sex='male', fun=plogis)
# works if datadist not used
# ggplot(p, ylab=expression(hat(P)))
# plot predicted probability in place of log odds
# per < function(x, y) x >= 30
# ggplot(p, perim=per) # suppress output for age < 30 but leave scale alone
# Do ggplot2 faceting a few different ways
p < Predict(fit, age, sex, blood.pressure=c(120,140,160),
cholesterol=c(180,200,215))
# ggplot(p)
ggplot(p, cholesterol ~ blood.pressure)
# ggplot(p, ~ cholesterol + blood.pressure)
# color for sex, line type for blood.pressure:
ggplot(p, groups=c('sex', 'blood.pressure'))
# Add legend.position='top' to allow wider plot
# Map blood.pressure to line thickness instead of line type:
# ggplot(p, groups=c('sex', 'blood.pressure'), aestype=c('color', 'size'))
# Plot the age effect as an odds ratio
# comparing the age shown on the xaxis to age=30 years
# ddist$limits$age[2] < 30 # make 30 the reference value for age
# Could also do: ddist$limits["Adjust to","age"] < 30
# fit < update(fit) # make new reference value take effect
# p < Predict(fit, age, ref.zero=TRUE, fun=exp)
# ggplot(p, ylab='Age=x:Age=30 Odds Ratio',
# addlayer=geom_hline(yintercept=1, col=gray(.8)) +
# geom_vline(xintercept=30, col=gray(.8)) +
# scale_y_continuous(trans='log',
# breaks=c(.5, 1, 2, 4, 8))))
# Compute predictions for three predictors, with superpositioning or
# conditioning on sex, combined into one graph
p1 < Predict(fit, age, sex)
p2 < Predict(fit, cholesterol, sex)
p3 < Predict(fit, blood.pressure, sex)
p < rbind(age=p1, cholesterol=p2, blood.pressure=p3)
ggplot(p, groups='sex', varypred=TRUE, adj.subtitle=FALSE)
# ggplot(p, groups='sex', varypred=TRUE, adj.subtitle=FALSE, sepdiscrete='vert')
## Not run:
# For males at the median blood pressure and cholesterol, plot 3 types
# of confidence intervals for the probability on one plot, for varying age
ages < seq(20, 80, length=100)
p1 < Predict(fit, age=ages, sex='male', fun=plogis) # standard pointwise
p2 < Predict(fit, age=ages, sex='male', fun=plogis,
conf.type='simultaneous') # simultaneous
p3 < Predict(fit, age=c(60,65,70), sex='male', fun=plogis,
conf.type='simultaneous') # simultaneous 3 pts
# The previous only adjusts for a multiplicity of 3 points instead of 100
f < update(fit, x=TRUE, y=TRUE)
g < bootcov(f, B=500, coef.reps=TRUE)
p4 < Predict(g, age=ages, sex='male', fun=plogis) # bootstrap percentile
p < rbind(Pointwise=p1, 'Simultaneous 100 ages'=p2,
'Simultaneous 3 ages'=p3, 'Bootstrap nonparametric'=p4)
# as.data.frame so will call builtin ggplot
ggplot(as.data.frame(p), aes(x=age, y=yhat)) + geom_line() +
geom_ribbon(data=p, aes(ymin=lower, ymax=upper), alpha=0.2, linetype=0)+
facet_wrap(~ .set., ncol=2)
# Plots for a parametric survival model
n < 1000
set.seed(731)
age < 50 + 12*rnorm(n)
label(age) < "Age"
sex < factor(sample(c('Male','Female'), n,
rep=TRUE, prob=c(.6, .4)))
cens < 15*runif(n)
h < .02*exp(.04*(age50)+.8*(sex=='Female'))
t < log(runif(n))/h
label(t) < 'Followup Time'
e < ifelse(t<=cens,1,0)
t < pmin(t, cens)
units(t) < "Year"
ddist < datadist(age, sex)
require(survival)
Srv < Surv(t,e)
# Fit lognormal survival model and plot median survival time vs. age
f < psm(Srv ~ rcs(age), dist='lognormal')
med < Quantile(f) # Creates function to compute quantiles
# (median by default)
p < Predict(f, age, fun=function(x) med(lp=x))
ggplot(p, ylab="Median Survival Time")
# Note: confidence intervals from this method are approximate since
# they don't take into account estimation of scale parameter
# Fit an ols model to log(y) and plot the relationship between x1
# and the predicted mean(y) on the original scale without assuming
# normality of residuals; use the smearing estimator
# See help file for rbind.Predict for a method of showing two
# types of confidence intervals simultaneously.
# Add raw data scatterplot to graph
set.seed(1)
x1 < runif(300)
x2 < runif(300)
ddist < datadist(x1, x2); options(datadist='ddist')
y < exp(x1 + x2  1 + rnorm(300))
f < ols(log(y) ~ pol(x1,2) + x2)
r < resid(f)
smean < function(yhat)smearingEst(yhat, exp, res, statistic='mean')
formals(smean) < list(yhat=numeric(0), res=r[! is.na(r)])
#smean$res < r[! is.na(r)] # define default res argument to function
ggplot(Predict(f, x1, fun=smean), ylab='Predicted Mean on yscale',
addlayer=geom_point(aes(x=x1, y=y), data.frame(x1, y)))
# Had ggplot not added a subtitle (i.e., if x2 were not present), you
# could have done ggplot(Predict(), ylab=...) + geom_point(...)
## End(Not run)
# Make an 'interaction plot', forcing the xaxis variable to be
# plotted at integer values but labeled with category levels
n < 100
set.seed(1)
gender < c(rep('male', n), rep('female',n))
m < sample(c('a','b'), 2*n, TRUE)
d < datadist(gender, m); options(datadist='d')
anxiety < runif(2*n) + .2*(gender=='female') + .4*(gender=='female' & m=='b')
tapply(anxiety, llist(gender,m), mean)
f < ols(anxiety ~ gender*m)
p < Predict(f, gender, m)
# ggplot(p) # horizontal dot chart; usually preferred for categorical predictors
# ggplot(p, flipxdiscrete=FALSE) # back to vertical
ggplot(p, groups='gender')
ggplot(p, ~ m, groups=FALSE, flipxdiscrete=FALSE)
options(datadist=NULL)
## Not run:
# Example in which separate curves are shown for 4 income values
# For each curve the estimated percentage of voters voting for
# the democratic party is plotted against the percent of voters
# who graduated from college. Data are countylevel percents.
incomes < seq(22900, 32800, length=4)
# equally spaced to outer quintiles
p < Predict(f, college, income=incomes, conf.int=FALSE)
ggplot(p, xlim=c(0,35), ylim=c(30,55))
# Erase end portions of each curve where there are fewer than 10 counties having
# percent of college graduates to the left of the xcoordinate being plotted,
# for the subset of counties having median family income with 1650
# of the target income for the curve
show.pts < function(college.pts, income.pt) {
s < abs(income  income.pt) < 1650 #assumes income known to top frame
x < college[s]
x < sort(x[!is.na(x)])
n < length(x)
low < x[10]; high < x[n9]
college.pts >= low & college.pts <= high
}
ggplot(p, xlim=c(0,35), ylim=c(30,55), perim=show.pts)
# Rename variables for better plotting of a long list of predictors
f < ...
p < Predict(f)
re < c(trt='treatment', diabet='diabetes', sbp='systolic blood pressure')
for(n in names(re)) {
names(p)[names(p)==n] < re[n]
p$.predictor.[p$.predictor.==n] < re[n]
}
ggplot(p)
## End(Not run)
gIndex
computes the total $g$
index for a model based on
the vector of linear predictors, and the partial $g$
index for
each predictor in a model. The latter is computed by summing all the
terms involving each variable, weighted by their regression
coefficients, then computing Gini's mean difference on this sum. For
example, a regression model having age and sex and age*sex on the
right hand side, with corresponding regression coefficients $b_{1},
b_{2}, b_{3}$
will have the $g$
index for age
computed from Gini's mean
difference on the product of age $\times (b_{1} + b_{3}w)$
where
$w$
is an indicator set to one for observations with sex not equal
to the reference value. When there are nonlinear terms associated
with a predictor, these terms will also be combined.
A print
method is defined, and there is a plot
method for displaying
$g$
indexes using a dot chart.
These functions use Hmisc::GiniMd
.
gIndex(object, partials=TRUE, type=c('ccterms', 'cterms', 'terms'),
lplabel=if(length(object$scale) && is.character(object$scale))
object$scale[1] else 'X*Beta',
fun, funlabel=if(missing(fun)) character(0) else
deparse(substitute(fun)),
postfun=if(length(object$scale)==2) exp else NULL,
postlabel=if(length(postfun))
ifelse(missing(postfun),
if((length(object$scale) > 1) &&
is.character(object$scale)) object$scale[2] else
'Antilog',
deparse(substitute(postfun))) else character(0),
...)
## S3 method for class 'gIndex'
print(x, digits=4, abbrev=FALSE,
vnames=c("names","labels"), ...)
## S3 method for class 'gIndex'
plot(x, what=c('pre', 'post'),
xlab=NULL, pch=16, rm.totals=FALSE,
sort=c('descending', 'ascending', 'none'), ...)
object 
result of an 
partials 
set to 
type 
defaults to 
lplabel 
a replacement for default values such as

fun 
an optional function to transform the linear predictors
before computing the total (only) 
funlabel 
a character string label for 
postfun 
a function to transform 
postlabel 
a label for 
... 
For 
x 
an object created by 
digits 
causes rounding to the 
abbrev 
set to 
vnames 
set to 
what 
set to 
xlab 

pch 
plotting character for point 
rm.totals 
set to 
sort 
specifies how to sort predictors by 
For stratification factors in a Cox proportional hazards model, there is
no contribution of variation towards computing a partial $g$
except from terms that interact with the stratification variable.
gIndex
returns a matrix of class "gIndex"
with auxiliary
information stored as attributes, such as variable labels.
GiniMd
returns a scalar.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
David HA (1968): Gini's mean difference rediscovered. Biometrika 55:573–575.
set.seed(1)
n < 40
x < 1:n
w < factor(sample(c('a','b'), n, TRUE))
u < factor(sample(c('A','B'), n, TRUE))
y < .01*x + .2*(w=='b') + .3*(u=='B') + .2*(w=='b' & u=='B') + rnorm(n)/5
dd < datadist(x,w,u); options(datadist='dd')
f < ols(y ~ x*w*u, x=TRUE, y=TRUE)
f
anova(f)
z < list()
for(type in c('terms','cterms','ccterms'))
{
zc < predict(f, type=type)
cat('type:', type, '\n')
print(zc)
z[[type]] < zc
}
zc < z$cterms
GiniMd(zc[, 1])
GiniMd(zc[, 2])
GiniMd(zc[, 3])
GiniMd(f$linear.predictors)
g < gIndex(f)
g
g['Total',]
gIndex(f, partials=FALSE)
gIndex(f, type='cterms')
gIndex(f, type='terms')
y < y > .8
f < lrm(y ~ x * w * u, x=TRUE, y=TRUE)
gIndex(f, fun=plogis, funlabel='Prob[y=1]')
# Manual calculation of combined main effect + interaction effort of
# sex in a 2x2 design with treatments A B, sexes F M,
# model .1 + .3*(treat=='B') + .5*(sex=='M') + .4*(treat=='B' & sex=='M')
set.seed(1)
X < expand.grid(treat=c('A','B'), sex=c('F', 'M'))
a < 3; b < 7; c < 13; d < 5
X < rbind(X[rep(1, a),], X[rep(2, b),], X[rep(3, c),], X[rep(4, d),])
y < with(X, .1 + .3*(treat=='B') + .5*(sex=='M') + .4*(treat=='B' & sex=='M'))
f < ols(y ~ treat*sex, data=X, x=TRUE)
gIndex(f, type='cterms')
k < coef(f)
b1 < k[2]; b2 < k[3]; b3 < k[4]
n < nrow(X)
( (a+b)*c*abs(b2) + (a+b)*d*abs(b2+b3) + c*d*abs(b3))/(n*(n1)/2 )
# Manual calculation for combined age effect in a model with sex,
# age, and age*sex interaction
a < 13; b < 7
sex < c(rep('female',a), rep('male',b))
agef < round(runif(a, 20, 30))
agem < round(runif(b, 20, 40))
age < c(agef, agem)
y < (sex=='male') + age/10  (sex=='male')*age/20
f < ols(y ~ sex*age, x=TRUE)
f
gIndex(f, type='cterms')
k < coef(f)
b1 < k[2]; b2 < k[3]; b3 < k[4]
n < a + b
sp < function(w, z=w) sum(outer(w, z, function(u, v) abs(uv)))
( abs(b2)*sp(agef) + abs(b2+b3)*sp(agem) + 2*sp(b2*agef, (b2+b3)*agem) ) / (n*(n1))
( abs(b2)*GiniMd(agef)*a*(a1) + abs(b2+b3)*GiniMd(agem)*b*(b1) +
2*sp(b2*agef, (b2+b3)*agem) ) / (n*(n1))
## Not run:
# Compare partial and total gindexes over many random fits
plot(NA, NA, xlim=c(0,3), ylim=c(0,3), xlab='Global',
ylab='x1 (black) x2 (red) x3 (green) x4 (blue)')
abline(a=0, b=1, col=gray(.9))
big < integer(3)
n < 50 # try with n=7  see lots of exceptions esp. for interacting var
for(i in 1:100) {
x1 < runif(n)
x2 < runif(n)
x3 < runif(n)
x4 < runif(n)
y < x1 + x2 + x3 + x4 + 2*runif(n)
f < ols(y ~ x1*x2+x3+x4, x=TRUE)
# f < ols(y ~ x1+x2+x3+x4, x=TRUE) # also try this
w < gIndex(f)[,1]
gt < w['Total']
points(gt, w['x1, x2'])
points(gt, w['x3'], col='green')
points(gt, w['x4'], col='blue')
big[1] < big[1] + (w['x1, x2'] > gt)
big[2] < big[2] + (w['x3'] > gt)
big[3] < big[3] + (w['x4'] > gt)
}
print(big)
## End(Not run)
options(datadist=NULL)
This function saves rms
attributes with the fit object so that
anova.rms
, Predict
, etc. can be used just as with ols
and other fits. No validate
or calibrate
methods exist for
Glm
though.
Glm(
formula,
family = gaussian,
data = environment(formula),
weights,
subset,
na.action = na.delete,
start = NULL,
offset = NULL,
control = glm.control(...),
model = TRUE,
method = "glm.fit",
x = FALSE,
y = TRUE,
contrasts = NULL,
...
)
formula , family , data , weights , subset , na.action , start , offset , control , model , method , x , y , contrasts

see 
... 
ignored 
For the print
method, format of output is controlled by the user
previously running options(prType="lang")
where lang
is
"plain"
(the default), "latex"
, or "html"
.
a fit object like that produced by stats::glm()
but with
rms
attributes and a class
of "rms"
, "Glm"
,
"glm"
, and "lm"
. The g
element of the fit object is
the $g$
index.
stats::glm()
,Hmisc::GiniMd()
, prModFit()
, stats::residuals.glm
## Dobson (1990) Page 93: Randomized Controlled Trial :
counts < c(18,17,15,20,10,20,25,13,12)
outcome < gl(3,1,9)
treatment < gl(3,3)
f < glm(counts ~ outcome + treatment, family=poisson())
f
anova(f)
summary(f)
f < Glm(counts ~ outcome + treatment, family=poisson())
# could have had rcs( ) etc. if there were continuous predictors
f
anova(f)
summary(f, outcome=c('1','2','3'), treatment=c('1','2','3'))
This function fits a linear model using generalized least
squares. The errors are allowed to be correlated and/or have unequal
variances. Gls
is a slightly enhanced version of the
Pinheiro and Bates gls
function in the nlme
package to
make it easy to use with the rms package and to implement cluster
bootstrapping (primarily for nonparametric estimates of the
variancecovariance matrix of the parameter estimates and for
nonparametric confidence limits of correlation parameters).
For the print
method, format of output is controlled by the
user previously running options(prType="lang")
where
lang
is "plain"
(the default), "latex"
, or
"html"
. When using html with Quarto or RMarkdown,
results='asis'
need not be written in the chunk header.
Gls(model, data, correlation, weights, subset, method, na.action=na.omit,
control, verbose, B=0, dupCluster=FALSE, pr=FALSE, x=FALSE)
## S3 method for class 'Gls'
print(x, digits=4, coefs=TRUE, title, ...)
model 
a twosided linear formula object describing the
model, with the response on the left of a 
data 
an optional data frame containing the variables named in

correlation 
an optional 
weights 
an optional 
subset 
an optional expression indicating which subset of the rows of

method 
a character string. If 
na.action 
a function that indicates what should happen when the
data contain 
control 
a list of control values for the estimation algorithm to
replace the default values returned by the function 
verbose 
an optional logical value. If 
B 
number of bootstrap resamples to fit and store, default is none 
dupCluster 
set to 
pr 
set to 
x 
for 
digits 
number of significant digits to print 
coefs 
specify 
title 
a character string title to be passed to 
... 
ignored 
The na.delete
function will not work with
Gls
due to some nuance in the model.frame.default
function. This probably relates to na.delete
storing extra
information in the "na.action"
attribute of the returned data
frame.
an object of classes Gls
, rms
, and gls
representing the linear model
fit. Generic functions such as print
, plot
,
ggplot
, and summary
have methods to show the results of
the fit. See
glsObject
for the components of the fit. The functions
resid
, coef
, and fitted
can be used to extract
some of its components. Gls
returns the following components
not returned by gls
: Design
, assign
,
formula
(see arguments), B
(see
arguments), bootCoef
(matrix of B
bootstrapped
coefficients), boot.Corr
(vector of bootstrapped correlation
parameters), Nboot
(vector of total sample size used in each
bootstrap (may vary if have unbalanced clusters), and var
(sample variancecovariance matrix of bootstrapped coefficients). The
$g$
index is also stored in the returned object under the name
"g"
.
Jose Pinheiro, Douglas Bates [email protected], Saikat DebRoy, Deepayan Sarkar, Rcore [email protected], Frank Harrell [email protected], Patrick Aboyoun
Pinheiro J, Bates D (2000): Mixed effects models in S and SPlus. New York: SpringerVerlag.
gls
glsControl
, glsObject
,
varFunc
, corClasses
,
varClasses
, GiniMd
,
prModFit
, logLik.Gls
## Not run:
require(ggplot2)
ns < 20 # no. subjects
nt < 10 # no. time points/subject
B < 10 # no. bootstrap resamples
# usually do 100 for variances, 1000 for nonparametric CLs
rho < .5 # AR(1) correlation parameter
V < matrix(0, nrow=nt, ncol=nt)
V < rho^abs(row(V)col(V)) # persubject correlation/covariance matrix
d < expand.grid(tim=1:nt, id=1:ns)
d$trt < factor(ifelse(d$id <= ns/2, 'a', 'b'))
true.beta < c(Intercept=0,tim=.1,'tim^2'=0,'trt=b'=1)
d$ey < true.beta['Intercept'] + true.beta['tim']*d$tim +
true.beta['tim^2']*(d$tim^2) + true.beta['trt=b']*(d$trt=='b')
set.seed(13)
library(MASS) # needed for mvrnorm
d$y < d$ey + as.vector(t(mvrnorm(n=ns, mu=rep(0,nt), Sigma=V)))
dd < datadist(d); options(datadist='dd')
f < Gls(y ~ pol(tim,2) + trt, correlation=corCAR1(form= ~tim  id),
data=d, B=B)
f
AIC(f)
f$var # bootstrap variances
f$varBeta # original variances
summary(f)
anova(f)
ggplot(Predict(f, tim, trt))
# v < Variogram(f, form=~timid, data=d)
nlme:::summary.gls(f)$tTable # print matrix of estimates etc.
options(datadist=NULL)
## End(Not run)
Function to divide x
(e.g. age, or predicted survival at time
u
created by survest
) into g
quantile groups, get
KaplanMeier estimates at time u
(a scaler), and to return a
matrix with columns x
=mean x
in quantile, n
=number
of subjects, events
=no. events, and KM
=KM survival at
time u
, std.err
= s.e. of log KM. Confidence intervals
are based on log S(t). Instead of supplying g
, the user can
supply the minimum number of subjects to have in the quantile group
(m
, default=50). If cuts
is given
(e.g. cuts=c(0,.1,.2,...,.9,.1)
), it overrides m
and
g
. Calls Therneau's survfitKM
in the survival
package to get KaplanMeiers estimates and standard errors.
groupkm(x, Srv, m=50, g, cuts, u,
pl=FALSE, loglog=FALSE, conf.int=.95, xlab, ylab,
lty=1, add=FALSE, cex.subtitle=.7, ...)
x 
variable to stratify 
Srv 
a 
m 
desired minimum number of observations in a group 
g 
number of quantile groups 
cuts 
actual cuts in 
u 
time for which to estimate survival 
pl 
TRUE to plot results 
loglog 
set to 
conf.int 
defaults to 
xlab 
if 
ylab 
if 
lty 
line time for primary line connecting estimates 
add 
set to 
cex.subtitle 
character size for subtitle. Default is 
... 
plotting parameters to pass to the plot and errbar functions 
matrix with columns named x
(mean predictor value in interval), n
(sample size
in interval), events
(number of events in interval), KM
(KaplanMeier
estimate), std.err
(standard error of log KM
)
survfit
, errbar
,
cut2
, Surv
,
units
require(survival)
n < 1000
set.seed(731)
age < 50 + 12*rnorm(n)
cens < 15*runif(n)
h < .02*exp(.04*(age50))
d.time < log(runif(n))/h
label(d.time) < 'Followup Time'
e < ifelse(d.time <= cens,1,0)
d.time < pmin(d.time, cens)
units(d.time) < "Year"
groupkm(age, Surv(d.time, e), g=10, u=5, pl=TRUE)
#Plot 5year KM survival estimates and 0.95 confidence bars by
#decile of age. If omit g=10, will have >= 50 obs./group.
The hazard.ratio.plot
function repeatedly estimates Cox
regression coefficients and confidence limits within time intervals.
The log hazard ratios are plotted against the mean failure/censoring
time within the interval. Unless times
is specified, the number of
time intervals will be $\max(round(d/e),2)$
, where $d$
is the
total number
of events in the sample. Efron's likelihood is used for estimating
Cox regression coefficients (using coxph.fit
). In the case of
tied failure times, some intervals may have a point in common.
hazard.ratio.plot(x, Srv, which, times=, e=30, subset,
conf.int=.95, legendloc=NULL, smooth=TRUE, pr=FALSE, pl=TRUE,
add=FALSE, ylim, cex=.5, xlab="t", ylab, antilog=FALSE, ...)
x 
a vector or matrix of predictors 
Srv 
a 
which 
a vector of column numbers of 
times 
optional vector of time interval endpoints.
Example: 
e 
number of events per time interval if times not given 
subset 
vector used for subsetting the entire analysis,
e.g. 
conf.int 
confidence interval coverage 
legendloc 
location for legend. Omit to use mouse, 
smooth 
also plot the super–smoothed version of the log hazard ratios 
pr 
defaults to 
pl 
defaults to 
add 
add this plot to an already existing plot 
ylim 
vector of 
cex 
character size for legend information, default is 0.5 
xlab 
label for 
ylab 
label for 
antilog 
default is 
... 
optional graphical parameters 
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
cox.zph
, residuals.cph
,
survivalinternal
, cph
,
coxph
, Surv
require(survival)
n < 500
set.seed(1)
age < 50 + 12*rnorm(n)
cens < 15*runif(n)
h < .02*exp(.04*(age50))
d.time < log(runif(n))/h
label(d.time) < 'Followup Time'
e < ifelse(d.time <= cens,1,0)
d.time < pmin(d.time, cens)
units(d.time) < "Year"
hazard.ratio.plot(age, Surv(d.time,e), e=20, legendloc='ll')
Creates several new variables which help set up a dataset for modeling
with cph
or coxph
when there is a single binary timedependent
covariable which turns on at a given time, and stays on. This is
typical when analyzing the impact of an intervening event.
ie.setup
creates a Surv
object using the start time, stop time
format. It also creates a binary indicator for the intervening event,
and a variable called subs
that is useful when attach
ing
a dataframe.
subs
has observation numbers duplicated for subjects having an
intervening event, so those subject's baseline covariables (that are
not timedependent) can be duplicated correctly.
ie.setup(failure.time, event, ie.time, break.ties=FALSE)
failure.time 
a numeric variable containing the event or censoring times for the terminating event 
event 
a binary (0/1) variable specifying whether observations had the terminating event (event=1) or were censored (event=0) 
ie.time 
intervening event times. For subjects having no intervening events, the corresponding values of ie.time must be NA. 
break.ties 
Occasionally intervening events are recorded as happening at exactly
the same time as the termination of followup for some subjects.
The 
a list with components S, ie.status, subs, reps
. S
is a
Surv
object containing start and stop times for intervals of observation,
along with event indicators. ie.status
is one if the intervening
event has occurred at the start of the interval, zero otherwise.
subs
is a vector of subscripts that can be used to replicate other
variables the same way S
was replicated. reps
specifies how many
times each original observation was replicated. S, ie.status, subs
are
all the same length (at least the number of rows for S
is) and are longer than the original failure.time
vector.
reps
is the same length as the original failure.time
vector.
The subs
vector is suitable for passing to validate.lrm
or calibrate
,
which pass this vector under the name cluster
on to predab.resample
so that bootstrapping can be
done by sampling with replacement from the original subjects rather than
from the individual records created by ie.setup
.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
cph
, coxph
,
Surv
, cr.setup
,
predab.resample
failure.time < c(1 , 2, 3)
event < c(1 , 1, 0)
ie.time < c(NA, 1.5, 2.5)
z < ie.setup(failure.time, event, ie.time)
S < z$S
S
ie.status < z$ie.status
ie.status
z$subs
z$reps
## Not run:
attach(input.data.frame[z$subs,]) #replicates all variables
f < cph(S ~ age + sex + ie.status)
# Instead of duplicating rows of data frame, could do this:
attach(input.data.frame)
z < ie.setup(failure.time, event, ie.time)
s < z$subs
age < age[s]
sex < sex[s]
f < cph(S ~ age + sex + ie.status)
## End(Not run)
Checks the impact of the proportional odds assumption by comparing predicted cell probabilities from a PO model with those from a multinomial or partial proportional odds logistic model that relax assumptions. For a given model formula, fits the model with both lrm
and either nnet::multinom
or VGAM::vglm
or both, and obtains predicted cell probabilities for the PO and relaxed models on the newdata
data frame. A print
method formats the output.
impactPO(
formula,
relax = if (missing(nonpo)) "multinomial" else "both",
nonpo,
newdata,
data = environment(formula),
minfreq = 15,
B = 0,
...
)
formula 
a model formula. To work properly with 
relax 
defaults to 
nonpo 
a formula with no left hand side variable, specifying the variable or variables for which PO is not assumed. Specifying 
newdata 
a data frame or data table with one row per covariate setting for which predictions are to be made 
data 
data frame containing variables to fit; default is the frame in which 
minfreq 
minimum sample size to allow for the least frequent category of the dependent variable. If the observed minimum frequency is less than this, the 
B 
number of bootstrap resamples to do to get confidence intervals for differences in predicted probabilities for relaxed methods vs. PO model fits. Default is not to run the bootstrap. When running the bootstrap make sure that all model variables are explicitly in 
... 
other parameters to pass to 
Since partial proportional odds models and especially multinomial logistic models can have many parameters, it is not feasible to use this model comparison approach when the number of levels of the dependent variable Y is large. By default, the function will use combine.levels()
to combine consecutive levels if the lowest frequency category of Y has fewer than minfreq
observations.
an impactPO
object which is a list with elements estimates
, stats
, mad
, newdata
, nboot
, and boot
. estimates
is a data frame containing the variables and values in newdata
in a tall and thin format with additional variable method
("PO", "Multinomial", "PPO"), y
(current level of the dependent variable), and Probability
(predicted cell probability for covariate values and value of y
in the current row). stats
is a data frame containing Deviance
the model deviance, d.f.
the total number of parameters counting intercepts, AIC
, p
the number of regression coefficients, LR chi^2
the likelihood ratio chisquare statistic for testing the predictors, LR  p
a chancecorrected LR chisquare, LR chi^2 test for PO
the likelihood ratio chisquare test statistic for testing the PO assumption (by comparing 2 log likelihood for a relaxed model to that of a fully PO model), d.f.
the degrees of freedom for this test, Pr(>chi^2)
the Pvalue for this test, MCS R2
the MaddalaCoxSnell R2 using the actual sample size, MCS R2 adj
(MCS R2
adjusted for estimating p
regression coefficients by subtracting p
from LR
), McFadden R2
, McFadden R2 adj
(an AIClike adjustment proposed by McFadden without full justification), Mean difference} from PO
the overall mean absolute difference between predicted probabilities over all categories of Y and over all covariate settings. mad
contains newdata
and separately by rows in newdata
the mean absolute difference (over Y categories) between estimated probabilities by the indicated relaxed model and those from the PO model. nboot
is the number of successful bootstrap repetitions, and boot
is a 4way array with dimensions represented by the nboot
resamples, the number of rows in newdata
, the number of outcome levels, and elements for PPO
and multinomial
. For the modifications of the MaddalaCoxSnell indexes see Hmisc::R2Measures
.
Frank Harrell [email protected]
nnet::multinom()
, VGAM::vglm()
, lrm()
, Hmisc::propsPO()
, Hmisc::R2Measures()
, Hmisc::combine.levels()
## Not run:
set.seed(1)
age < rnorm(500, 50, 10)
sex < sample(c('female', 'male'), 500, TRUE)
y < sample(0:4, 500, TRUE)
d < expand.grid(age=50, sex=c('female', 'male'))
w < impactPO(y ~ age + sex, nonpo = ~ sex, newdata=d)
w
# Note that PO model is a better model than multinomial (lower AIC)
# since multinomial model's improvement in fit is low in comparison
# with number of additional parameters estimated. Same for PO model
# in comparison with partial PO model.
# Reverse levels of y so stacked bars have higher y located higher
revo < function(z) {
z < as.factor(z)
factor(z, levels=rev(levels(as.factor(z))))
}
require(ggplot2)
ggplot(w$estimates, aes(x=method, y=Probability, fill=revo(y))) +
facet_wrap(~ sex) + geom_col() +
xlab('') + guides(fill=guide_legend(title=''))
# Now vary 2 predictors
d < expand.grid(sex=c('female', 'male'), age=c(40, 60))
w < impactPO(y ~ age + sex, nonpo = ~ sex, newdata=d)
w
ggplot(w$estimates, aes(x=method, y=Probability, fill=revo(y))) +
facet_grid(age ~ sex) + geom_col() +
xlab('') + guides(fill=guide_legend(title=''))
## End(Not run)
Surv
and ggplot
are imported from, respectively, the
survival
and ggplot2
packages and are exported from
rms
so that the user does not have to attach these packages to do
simple things.
Surv(time, time2, event,
type = c("right", "left", "interval", "counting", "interval2", "mstate"),
origin = 0)
ggplot(data = NULL, mapping = aes(), ..., environment = parent.frame())
time , time2 , event , type , origin

see 
data , mapping , ... , environment

see 
see documentation in the original packages
## Not run:
f < psm(Surv(dtime, death) ~ x1 + x2 + sex + race, dist='gau')
ggplot(Predict(f))
## End(Not run)
Creates a file containing a LaTeX representation of the fitted model.
## S3 method for class 'cph'
latex(object, title,
file='',
append=FALSE, surv=TRUE, maxt=FALSE, which=NULL, varnames, columns=65,
inline=FALSE, before=if(inline)"" else "& &", after="", dec=3,
pretrans=TRUE, caption, digits=.Options$digits, size="",
...) # for cph fit
## S3 method for class 'lrm'
latex(object, title, file, append, which, varnames,
columns, inline, before, after, pretrans, caption,
digits=.Options$digits, size="", ...) # for lrm fit
## S3 method for class 'ols'
latex(object, title, file, append, which, varnames,
columns, inline, before, after, pretrans, caption,
digits=.Options$digits, size="", ...) # ols fit
## S3 method for class 'orm'
latex(object, title, file, append, which, varnames,
columns, inline, before, after, pretrans, caption,
digits=.Options$digits, size="", intercepts=nrp < 10, ...) # for orm fit
## S3 method for class 'pphsm'
latex(object, title, file, append, which=NULL, varnames,
columns, inline, before, after, pretrans, caption,
digits=.Options$digits, size="", ...) # pphsm fit
## S3 method for class 'psm'
latex(object, title, file, append, which=NULL, varnames,
columns, inline, before, after, pretrans, caption,
digits=.Options$digits, size="", ...) # psm fit
object 
a fit object created by a 
title 
ignored 
file , append

see 
surv 
if 
maxt 
if the maximum followup time in the data ( 
which , varnames , columns , inline , before , dec , pretrans

see

after 
if not an empty string, added to end of markup if

caption 
a character string specifying a title for the equation to be centered and typeset in bold face. Default is no title. 
digits 
see latexrms 
size 
a LaTeX size to use, without the slash. Default is the prevailing size 
intercepts 
for 
... 
ignored 
the name of the created file, with class c("latex","file")
. This
object works with latex viewing and printing commands in Hmisc. If
file=''
and options(prType=x
is in effect, where x
is "html", "markdown"
or "md"
, the result is run through
knitr::asis_output
so that it will be rendered correctly no
matter which options are in effect in the chunk header.
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
latexrms
, rcspline.restate
,
latex
## Not run:
require(survival)
units(ftime) < "Day"
f < cph(Surv(ftime, death) ~ rcs(age)+sex, surv=TRUE, time.inc=60)
w < latex(f, file='f.tex') #Interprets fitted model and makes table of S0(t)
#for t=0,60,120,180,...
w #displays image, if viewer installed and file given above
latex(f) # send LaTeX code to the console for knitr
options(prType='html')
latex(f) # for use with knitr and R Markdown/Quarto using MathJax
## End(Not run)
Creates a file containing a LaTeX representation of the fitted model. For
modelspecific typesetting there is latex.lrm
, latex.cph
,
latex.psm
and latex.ols
. latex.cph
has some
arguments that are specific to cph
models.
latexrms
is the core function which is
called internally by latexrms
(which is called by
latex.cph
, latex.ols
, etc.). html
and R
Markdowncompatible markup (using MathJax) are written if
options(prType='html')
.
latexrms(object,
file='',
append=FALSE, which=1:p, varnames, columns=65, prefix=NULL, inline=FALSE,
before=if(inline)"" else "& &", after="", intercept, pretrans=TRUE,
digits=.Options$digits, size="")
object 
a fit object created by a fitting function in the 
file 
name of 
append 
whether or not to append to an existing file 
which 
a vector of subcripts (corresponding to 
varnames 
variable names to substitute for noninteractions. Order must correspond
to 
columns 
maximum number of columns of printing characters to allow before outputting a LaTeX newline command 
prefix 
if given, a LaTeX \lefteqn command of the form 
inline 
Set to 
before 
a character string to place before each line of output. Use the default
for a LaTeX 
after 
a character string to place after the output if 
intercept 
a special intercept value to include that is not part of the standard
model parameters (e.g., centering constant in Cox model). Only allowed
in the 
pretrans 
if any spline or polynomialexpanded variables are themselves
transformed, a table of pretransformations will be formed unless

digits 
number of digits of precision to use in formatting coefficients and other numbers 
size 
a LaTeX font size to use for the output, without the slash. Default is current size. 
latexrms
returns a character vector if file=''
,
otherwise writes the output to file
. For particular model
fits, the latex
method returns the result of running
knitr::asis_output
on the LaTeX or HTML code if file=''
,
options(prType)
was set but not to 'plain'
, and if
knitr
is currently running. This causes correct output to be
rendered whether or not results='asis'
appeared in the R
Markdown or Quarto chunk header.
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
## Not run:
f < lrm(death ~ rcs(age)+sex)
w < latex(f, file='f.tex')
w # displays, using e.g. xdvi
latex(f) # send LaTeX code to console, as for knitr
options(prType='html')
latex(f) # emit html and latex for knitr html and html notebooks
## End(Not run)
Fit binary and proportional odds ordinal
logistic regression models using maximum likelihood estimation or
penalized maximum likelihood estimation. See cr.setup
for how to
fit forward continuation ratio models with lrm
.
For the print
method, format of output is controlled by the
user previously running options(prType="lang")
where
lang
is "plain"
(the default), "latex"
, or
"html"
. When using html with Quarto or RMarkdown,
results='asis'
need not be written in the chunk header.
lrm(formula, data=environment(formula),
subset, na.action=na.delete, method="lrm.fit",
model=FALSE, x=FALSE, y=FALSE, linear.predictors=TRUE, se.fit=FALSE,
penalty=0, penalty.matrix, tol=1e7,
strata.penalty=0, var.penalty=c('simple','sandwich'),
weights, normwt, scale=FALSE, ...)
## S3 method for class 'lrm'
print(x, digits=4, r2=c(0,2,4), strata.coefs=FALSE,
coefs=TRUE, pg=FALSE, title='Logistic Regression Model', ...)
formula 
a formula object. An 
data 
data frame to use. Default is the current frame. 
subset 
logical expression or vector of subscripts defining a subset of observations to analyze 
na.action 
function to handle 
method 
name of fitting function. Only allowable choice at present is 
model 
causes the model frame to be returned in the fit object 
x 
causes the expanded design matrix (with missings excluded)
to be returned under the name 
y 
causes the response variable (with missings excluded) to be returned
under the name 
linear.predictors 
causes the predicted X beta (with missings excluded) to be returned
under the name 
se.fit 
causes the standard errors of the fitted values to be returned under
the name 
penalty 
The penalty factor subtracted from the log likelihood is

penalty.matrix 
specifies the symmetric penalty matrix for nonintercept terms.
The default matrix for continuous predictors has
the variance of the columns of the design matrix in its diagonal elements
so that the penalty to the log likelhood is unitless. For main effects
for categorical predictors with 
tol 
singularity criterion (see 
strata.penalty 
scalar penalty factor for the stratification
factor, for the experimental 
var.penalty 
the type of variancecovariance matrix to be stored in the 
weights 
a vector (same length as 
normwt 
set to 
scale 
set to 
... 
arguments that are passed to 
digits 
number of significant digits to use 
r2 
vector of integers specifying which R^2 measures to print,
with 0 for Nagelkerke R^2 and 1:4 corresponding to the 4 measures
computed by 
strata.coefs 
set to 
coefs 
specify 
pg 
set to 
title 
a character string title to be passed to 
The returned fit object of lrm
contains the following components
in addition to the ones mentioned under the optional arguments.
call 
calling expression 
freq 
table of frequencies for 
stats 
vector with the following elements: number of observations used in the
fit, maximum absolute value of first
derivative of log likelihood, model likelihood ratio

fail 
set to 
coefficients 
estimated parameters 
var 
estimated variancecovariance matrix (inverse of information matrix).
If 
effective.df.diagonal 
is returned if 
u 
vector of first derivatives of loglikelihood 
deviance 
2 log likelihoods (counting penalty components) When an offset variable is present, three deviances are computed: for intercept(s) only, for intercepts+offset, and for intercepts+offset+predictors. When there is no offset variable, the vector contains deviances for the intercept(s)only model and the model with intercept(s) and predictors. 
est 
vector of column numbers of 
non.slopes 
number of intercepts in model 
penalty 
see above 
penalty.matrix 
the penalty matrix actually used in the estimation 
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
Le Cessie S, Van Houwelingen JC: Ridge estimators in logistic regression. Applied Statistics 41:191–201, 1992.
Verweij PJM, Van Houwelingen JC: Penalized likelihood in Cox regression. Stat in Med 13:2427–2436, 1994.
Gray RJ: Flexible methods for analyzing survival data using splines, with applications to breast cancer prognosis. JASA 87:942–951, 1992.
Shao J: Linear model selection by crossvalidation. JASA 88:486–494, 1993.
Verweij PJM, Van Houwelingen JC: Crossvalidation in survival analysis. Stat in Med 12:2305–2314, 1993.
Harrell FE: Model uncertainty, penalization, and parsimony. ISCB Presentation on UVa Web page, 1998.
lrm.fit
, predict.lrm
,
rms.trans
, rms
, glm
,
latex.lrm
,
residuals.lrm
, na.delete
,
na.detail.response
,
pentrace
, rmsMisc
, vif
,
cr.setup
, predab.resample
,
validate.lrm
, calibrate
,
Mean.lrm
, gIndex
, prModFit
#Fit a logistic model containing predictors age, blood.pressure, sex
#and cholesterol, with age fitted with a smooth 5knot restricted cubic
#spline function and a different shape of the age relationship for males
#and females. As an intermediate step, predict mean cholesterol from
#age using a proportional odds ordinal logistic model
#
require(ggplot2)
n < 1000 # define sample size
set.seed(17) # so can reproduce the results
age < rnorm(n, 50, 10)
blood.pressure < rnorm(n, 120, 15)
cholesterol < rnorm(n, 200, 25)
sex < factor(sample(c('female','male'), n,TRUE))
label(age) < 'Age' # label is in Hmisc
label(cholesterol) < 'Total Cholesterol'
label(blood.pressure) < 'Systolic Blood Pressure'
label(sex) < 'Sex'
units(cholesterol) < 'mg/dl' # uses units.default in Hmisc
units(blood.pressure) < 'mmHg'
#To use prop. odds model, avoid using a huge number of intercepts by
#grouping cholesterol into 40tiles
ch < cut2(cholesterol, g=40, levels.mean=TRUE) # use mean values in intervals
table(ch)
f < lrm(ch ~ age)
options(prType='latex')
print(f, coefs=4) # write latex code to console
m < Mean(f) # see help file for Mean.lrm
d < data.frame(age=seq(0,90,by=10))
m(predict(f, d))
# Repeat using ols
f < ols(cholesterol ~ age)
predict(f, d)
# Specify population model for log odds that Y=1
L < .4*(sex=='male') + .045*(age50) +
(log(cholesterol  10)5.2)*(2*(sex=='female') + 2*(sex=='male'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(L)]
y < ifelse(runif(n) < plogis(L), 1, 0)
cholesterol[1:3] < NA # 3 missings, at random
ddist < datadist(age, blood.pressure, cholesterol, sex)
options(datadist='ddist')
fit < lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)),
x=TRUE, y=TRUE)
# x=TRUE, y=TRUE allows use of resid(), which.influence below
# could define d < datadist(fit) after lrm(), but data distribution
# summary would not be stored with fit, so later uses of Predict
# or summary.rms would require access to the original dataset or
# d or specifying all variable values to summary, Predict, nomogram
anova(fit)
p < Predict(fit, age, sex)
ggplot(p) # or plot()
ggplot(Predict(fit, age=20:70, sex="male")) # need if datadist not used
print(cbind(resid(fit,"dfbetas"), resid(fit,"dffits"))[1:20,])
which.influence(fit, .3)
# latex(fit) #print nice statement of fitted model
#
#Repeat this fit using penalized MLE, penalizing complex terms
#(for nonlinear or interaction effects)
#
fitp < update(fit, penalty=list(simple=0,nonlinear=10), x=TRUE, y=TRUE)
effective.df(fitp)
# or lrm(y ~ \dots, penalty=\dots)
#Get fits for a variety of penalties and assess predictive accuracy
#in a new data set. Program efficiently so that complex design
#matrices are only created once.
set.seed(201)
x1 < rnorm(500)
x2 < rnorm(500)
x3 < sample(0:1,500,rep=TRUE)
L < x1+abs(x2)+x3
y < ifelse(runif(500)<=plogis(L), 1, 0)
new.data < data.frame(x1,x2,x3,y)[301:500,]
#
for(penlty in seq(0,.15,by=.005)) {
if(penlty==0) {
f < lrm(y ~ rcs(x1,4)+rcs(x2,6)*x3, subset=1:300, x=TRUE, y=TRUE)
# True model is linear in x1 and has no interaction
X < f$x # saves time for future runs  don't have to use rcs etc.
Y < f$y # this also deletes rows with NAs (if there were any)
penalty.matrix < diag(diag(var(X)))
Xnew < predict(f, new.data, type="x")
# expand design matrix for new data
Ynew < new.data$y
} else f < lrm.fit(X,Y, penalty.matrix=penlty*penalty.matrix)
#
cat("\nPenalty :",penlty,"\n")
pred.logit < f$coef[1] + (Xnew %*% f$coef[1])
pred < plogis(pred.logit)
C.index < somers2(pred, Ynew)["C"]
Brier < mean((predYnew)^2)
Deviance< 2*sum( Ynew*log(pred) + (1Ynew)*log(1pred) )
cat("ROC area:",format(C.index)," Brier score:",format(Brier),
" 2 Log L:",format(Deviance),"\n")
}
#penalty=0.045 gave lowest 2 Log L, Brier, ROC in test sample for S+
#
#Use bootstrap validation to estimate predictive accuracy of
#logistic models with various penalties
#To see how noisy crossvalidation estimates can be, change the
#validate(f, \dots) to validate(f, method="cross", B=10) for example.
#You will see tremendous variation in accuracy with minute changes in
#the penalty. This comes from the error inherent in using 10fold
#cross validation but also because we are not fixing the splits.
#20fold cross validation was even worse for some
#indexes because of the small test sample size. Stability would be
#obtained by using the same sample splits for all penalty values
#(see above), but then we wouldn't be sure that the choice of the
#best penalty is not specific to how the sample was split. This
#problem is addressed in the last example.
#
penalties < seq(0,.7,length=3) # really use by=.02
index < matrix(NA, nrow=length(penalties), ncol=11,
dimnames=list(format(penalties),
c("Dxy","R2","Intercept","Slope","Emax","D","U","Q","B","g","gp")))
i < 0
for(penlty in penalties)
{
cat(penlty, "")
i < i+1
if(penlty==0)
{
f < lrm(y ~ rcs(x1,4)+rcs(x2,6)*x3, x=TRUE, y=TRUE) # fit whole sample
X < f$x
Y < f$y
penalty.matrix < diag(diag(var(X))) # save time  only do once
}
else
f < lrm(Y ~ X, penalty=penlty,
penalty.matrix=penalty.matrix, x=TRUE,y=TRUE)
val < validate(f, method="boot", B=20) # use larger B in practice
index[i,] < val[,"index.corrected"]
}
par(mfrow=c(3,3))
for(i in 1:9)
{
plot(penalties, index[,i],
xlab="Penalty", ylab=dimnames(index)[[2]][i])
lines(lowess(penalties, index[,i]))
}
options(datadist=NULL)
# Example of weighted analysis
x < 1:5
y < c(0,1,0,1,0)
reps < c(1,2,3,2,1)
lrm(y ~ x, weights=reps)
x < rep(x, reps)
y < rep(y, reps)
lrm(y ~ x) # same as above
#
#Study performance of a modified AIC which uses the effective d.f.
#See Verweij and Van Houwelingen (1994) Eq. (6). Here AIC=chisq2*df.
#Also try as effective d.f. equation (4) of the previous reference.
#Also study performance of Shao's crossvalidation technique (which was
#designed to pick the "right" set of variables, and uses a much smaller
#training sample than most methods). Compare crossvalidated deviance
#vs. penalty to the gold standard accuracy on a 7500 observation dataset.
#Note that if you only want to get AIC or Schwarz Bayesian information
#criterion, all you need is to invoke the pentrace function.
#NOTE: the effective.df( ) function is used in practice
#
## Not run:
for(seed in c(339,777,22,111,3)){
# study performance for several datasets
set.seed(seed)
n < 175; p < 8
X < matrix(rnorm(n*p), ncol=p) # p normal(0,1) predictors
Coef < c(.1,.2,.3,.4,.5,.6,.65,.7) # true population coefficients
L < X %*% Coef # intercept is zero
Y < ifelse(runif(n)<=plogis(L), 1, 0)
pm < diag(diag(var(X)))
#Generate a large validation sample to use as a gold standard
n.val < 7500
X.val < matrix(rnorm(n.val*p), ncol=p)
L.val < X.val %*% Coef
Y.val < ifelse(runif(n.val)<=plogis(L.val), 1, 0)
#
Penalty < seq(0,30,by=1)
reps < length(Penalty)
effective.df < effective.df2 < aic < aic2 < deviance.val <
Lpenalty < single(reps)
n.t < round(n^.75)
ncv < c(10,20,30,40) # try various no. of reps in crossval.
deviance < matrix(NA,nrow=reps,ncol=length(ncv))
#If model were complex, could have started things off by getting X, Y
#penalty.matrix from an initial lrm fit to save time
#
for(i in 1:reps) {
pen < Penalty[i]
cat(format(pen),"")
f.full < lrm.fit(X, Y, penalty.matrix=pen*pm)
Lpenalty[i] < pen* t(f.full$coef[1]) %*% pm %*% f.full$coef[1]
f.full.nopenalty < lrm.fit(X, Y, initial=f.full$coef, maxit=1)
info.matrix.unpenalized < solve(f.full.nopenalty$var)
effective.df[i] < sum(diag(info.matrix.unpenalized %*% f.full$var))  1
lrchisq < f.full.nopenalty$stats["Model L.R."]
# lrm does all this penalty adjustment automatically (for var, d.f.,
# chisquare)
aic[i] < lrchisq  2*effective.df[i]
#
pred < plogis(f.full$linear.predictors)
score.matrix < cbind(1,X) * (Y  pred)
sum.u.uprime < t(score.matrix) %*% score.matrix
effective.df2[i] < sum(diag(f.full$var %*% sum.u.uprime))
aic2[i] < lrchisq  2*effective.df2[i]
#
#Shao suggested averaging 2*n crossvalidations, but let's do only 40
#and stop along the way to see if fewer is OK
dev < 0
for(j in 1:max(ncv)) {
s < sample(1:n, n.t)
cof < lrm.fit(X[s,],Y[s],
penalty.matrix=pen*pm)$coef
pred < cof[1] + (X[s,] %*% cof[1])
dev < dev 2*sum(Y[s]*pred + log(1plogis(pred)))
for(k in 1:length(ncv)) if(j==ncv[k]) deviance[i,k] < dev/j
}
#
pred.val < f.full$coef[1] + (X.val %*% f.full$coef[1])
prob.val < plogis(pred.val)
deviance.val[i] < 2*sum(Y.val*pred.val + log(1prob.val))
}
postscript(hor=TRUE) # along with graphics.off() below, allow plots
par(mfrow=c(2,4)) # to be printed as they are finished
plot(Penalty, effective.df, type="l")
lines(Penalty, effective.df2, lty=2)
plot(Penalty, Lpenalty, type="l")
title("Penalty on 2 log L")
plot(Penalty, aic, type="l")
lines(Penalty, aic2, lty=2)
for(k in 1:length(ncv)) {
plot(Penalty, deviance[,k], ylab="deviance")
title(paste(ncv[k],"reps"))
lines(supsmu(Penalty, deviance[,k]))
}
plot(Penalty, deviance.val, type="l")
title("Gold Standard (n=7500)")
title(sub=format(seed),adj=1,cex=.5)
graphics.off()
}
## End(Not run)
#The results showed that to obtain a clear picture of the penalty
#accuracy relationship one needs 30 or 40 reps in the crossvalidation.
#For 4 of 5 samples, though, the super smoother was able to detect
#an accurate penalty giving the best (lowest) deviance using 10fold
#crossvalidation. Crossvalidation would have worked better had
#the same splits been used for all penalties.
#The AIC methods worked just as well and are much quicker to compute.
#The first AIC based on the effective d.f. in Gray's Eq. 2.9
#(Verweij and Van Houwelingen (1994) Eq. 5 (note typo)) worked best.
Fits a binary or ordinal logistic model for a given design matrix and response vector with no missing values in either. Ordinary or penalized maximum likelihood estimation is used.
lrm.fit(x, y, offset=0, initial, est, maxit=12, eps=.025,
tol=1e7, trace=FALSE, penalty.matrix=NULL, weights=NULL,
normwt=FALSE, scale=FALSE)
x 
design matrix with no column for an intercept 
y 
response vector, numeric, categorical, or character 
offset 
optional numeric vector containing an offset on the logit scale 
initial 
vector of initial parameter estimates, beginning with the intercept 
est 
indexes of 
maxit 
maximum no. iterations (default= 
eps 
difference in 
tol 
Singularity criterion. Default is 1e7 
trace 
set to 
penalty.matrix 
a selfcontained readytouse penalty matrix  see 
weights 
a vector (same length as 
normwt 
set to 
scale 
set to 
a list with the following components:
call 
calling expression 
freq 
table of frequencies for 
stats 
vector with the following elements: number of observations used in the
fit, maximum absolute value of first
derivative of log likelihood, model likelihood ratio chisquare, d.f.,
Pvalue,

fail 
set to 
coefficients 
estimated parameters 
var 
estimated variancecovariance matrix (inverse of information matrix).
Note that in the case of penalized estimation, 
u 
vector of first derivatives of loglikelihood 
deviance 
2 log likelihoods. When an offset variable is present, three deviances are computed: for intercept(s) only, for intercepts+offset, and for intercepts+offset+predictors. When there is no offset variable, the vector contains deviances for the intercept(s)only model and the model with intercept(s) and predictors. 
est 
vector of column numbers of 
non.slopes 
number of intercepts in model 
penalty.matrix 
see above 
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
lrm
, glm
, matinv
,
solvet
, cr.setup
, gIndex
#Fit an additive logistic model containing numeric predictors age,
#blood.pressure, and sex, assumed to be already properly coded and
#transformed
#
# fit < lrm.fit(cbind(age,blood.pressure,sex), death)
Bare Bones Logistic Regression Fit
lrm.fit.bare(x, y, maxit = 12, eps = 0.025, tol = 1e07)
x 
a vector of matrix of covariate values 
y 
a numeric or factor vector representing the dependent variable 
maxit 
maximum number of iteractions 
eps 
stopping criterion (change in 2 log likelihood) 
tol 
matrix inversion tolerance for singularities 
This is a stripped down version of the lrm.fit()
function that computes only the regression coefficients, variancecovariancematrix, and log likelihood (for null and fitted model) and does not compute any model fit indexes etc. This is for speed in simulations or with bootstrapping. Missing data are not allowed. The function handles binary and ordinal logistic regression (proportional odds model).
a list with elements coefficients
, var
, fail
, freq
, deviance
Frank Harrell
Update Model LR Statistics After Multiple Imputation
LRupdate(fit, anova)
fit 
an 
anova 
the result of 
For fits from orm, lrm, orm, cph, psm
that were created using fit.mult.impute
with lrt=TRUE
or equivalent options and for which anova
was obtained using processMI(fit, 'anova')
to compute imputationadjusted LR statistics. LRupdate
uses the last line of the anova
result (containing the overall model LR chisquare) to update Model L.R.
in the fit stats
component, and to adjust any of the new Rsquare measures in stats
.
For models using Nagelkerke's Rsquared, these are set to NA
as they would need to be recomputed with a new interceptonly loglikelihood, which is not computed by anova
. For ols
models, Rsquared is left alone as it is samplesizeindependent and print.ols
prints the correct adjusted Rsquared due to fit.mult.impute
correcting the residual d.f. in stacked fits.
new fit object like fit
but with the substitutions made
Frank Harrell
processMI.fit.mult.impute()
, Hmisc::R2Measures()
## Not run:
a < aregImpute(~ y + x1 + x2, n.impute=30, data=d)
f < fit.mult.impute(y ~ x1 + x2, lrm, a, data=d, lrt=TRUE)
a < processMI(f, 'anova')
f < LRupdate(f, a)
print(f, r2=1:4) # print all imputationcorrected R2 measures
## End(Not run)
This function inverts or partially inverts a matrix using pivoting (the sweep operator). It is useful for sequential modelbuilding.
matinv(a, which, negate=TRUE, eps=1e12)
a 
square matrix to invert or partially invert. May have been inverted or partially inverted previously by matinv, in which case its "swept" attribute is updated. Will uninvert if already inverted. 
which 
vector of column/row numbers in a to invert. Default is all, for total inverse. 
negate 
So that the algorithm can keep track of which pivots have been swept as well as roundoff errors, it actually returns the negative of the inverse or partial inverse. By default, these elements are negated to give the usual expected result. Set negate=FALSE if you will be passing the result right back into matinv, otherwise, negate the submatrix before sending back to matinv. 
eps 
singularity criterion 
a square matrix, with attributes "rank" and "swept".
Clarke MRB (1982). Algorithm AS 178: The GaussJordan sweep operator with detection of collinearity. Appl Statist 31:166–9.
Ridout MS, Cobb JM (1986). Algorithm AS R78 : A remark on algorithm AS 178: The GaussJordan sweep operator with detection of collinearity. Appl Statist 38:420–2.
a < diag(1:3)
a.inv1 < matinv(a, 1, negate=FALSE) #Invert with respect to a[1,1]
a.inv1
a.inv < matinv(a.inv1, 2:3, negate=FALSE) #Finish the job
a.inv
solve(a)
Draws a partial nomogram that can be used to manually obtain predicted
values from a regression model that was fitted with rms
.
The nomogram does not have lines representing sums, but it has a reference
line for reading scoring points (default range 0–100). Once the reader
manually totals the points, the predicted values can be read at the bottom.
Nonmonotonic transformations of continuous variables are handled (scales
wrap around), as
are transformations which have flat sections (tick marks are labeled
with ranges). If interactions are in the model, one variable
is picked as the “axis variable”, and separate axes are constructed for
each level of the interacting factors (preference is given automatically
to using any discrete factors to construct separate axes) and
levels of factors which are indirectly related to interacting
factors (see DETAILS). Thus the nomogram is designed so that only
one axis is actually read for each variable, since the variable
combinations are disjoint. For
categorical interacting factors, the default is to construct axes for
all levels.
The user may specify
coordinates of each predictor to label on its axis, or use default values.
If a factor interacts with other factors, settings for one or more of
the interacting factors may be specified separately (this is mandatory
for continuous variables). Optional confidence intervals will be
drawn for individual scores as well as for the linear predictor.
If more than one confidence level is chosen, multiple levels may be
displayed using different colors or gray scales. Functions of the
linear predictors may be added to the nomogram.
The datadist
object that was in effect when the model
was fit is used to specify the limits of the axis for continuous
predictors when the user does not specify tick mark locations in the
nomogram
call.
print.nomogram
prints axis information stored in an object returned
by nomogram
. This is useful in producing tables of point assignments
by levels of predictors. It also prints how many linear predictor
units there are per point and the number of points per unit change in
the linear predictor.
legend.nomabbrev
draws legends describing abbreviations used for
labeling tick marks for levels of categorical predictors.
nomogram(fit, ..., adj.to, lp=TRUE, lp.at=NULL,
fun=NULL, fun.at=NULL, fun.lp.at=NULL, funlabel="Predicted Value",
interact=NULL, kint=NULL, conf.int=FALSE,
conf.lp=c("representative", "all", "none"),
est.all=TRUE, posterior.summary=c('mean', 'median', 'mode'),
abbrev=FALSE, minlength=4, maxscale=100, nint=10,
vnames=c("labels","names"),
varname.label=TRUE, varname.label.sep="=",
omit=NULL, verbose=FALSE)
## S3 method for class 'nomogram'
print(x, dec=0, ...)
## S3 method for class 'nomogram'
plot(x, lplabel="Linear Predictor", fun.side,
col.conf=c(1, 0.3),
conf.space=c(.08,.2), label.every=1, force.label=FALSE,
xfrac=.35, cex.axis=.85, cex.var=1, col.grid=NULL,
varname.label=TRUE, varname.label.sep="=", ia.space=.7,
tck=NA, tcl=0.25, lmgp=.4, naxes,
points.label='Points', total.points.label='Total Points',
total.sep.page=FALSE, total.fun, cap.labels=FALSE, ...)
legend.nomabbrev(object, which, x, y, ncol=3, ...)
fit 
a regression model fit that was created with 
... 
settings of variables to use in constructing axes. If 
adj.to 
If you didn't define 
lp 
Set to 
lp.at 
If 
fun 
an optional function to transform the linear predictors, and to plot
on another axis. If more than one transformation is plotted, put
them in a list, e.g. 
fun.at 
function values to label on axis. Default 
fun.lp.at 
If you want to
evaluate one of the functions at a different set of linear predictor
values than may have been used in constructing the linear predictor axis,
specify a vector or list of vectors
of linear predictor values at which to evaluate the function. This is
especially useful for discrete functions. The presence of this attribute
also does away with the need for 
funlabel 
label for 
interact 
When a continuous variable interacts with a discrete one, axes are
constructed so that the continuous variable moves within the axis, and
separate axes represent levels of interacting factors. For interactions
between two continuous variables, all but the axis variable must have
discrete levels defined in 
kint 
for models such as the ordinal models with multiple intercepts,
specifies which one to use in evaluating the linear predictor.
Default is to use 
conf.int 
confidence levels to display for each scoring. Default is 
conf.lp 
default is 
est.all 
To plot axes for only the subset of variables named in 
posterior.summary 
when operating on a Bayesian model such as a
result of 
abbrev 
Set to 
minlength 
applies if 
maxscale 
default maximum point score is 100 
nint 
number of intervals to label for axes representing continuous variables.
See 
vnames 
By default, variable labels are used to label axes. Set

omit 
vector of character strings containing names of variables for which to suppress drawing axes. Default is to show all variables. 
verbose 
set to 
x 
an object created by 
dec 
number of digits to the right of the decimal point, for rounding
point scores in 
lplabel 
label for linear predictor axis. Default is 
fun.side 
a vector or list of vectors of 
col.conf 
colors corresponding to 
conf.space 
a 2element vector with the vertical range within which to draw confidence bars, in units of 1=spacing between main bars. Four heights are used within this range (8 for the linear predictor if more than 16 unique values were evaluated), cycling them among separate confidence intervals to reduce overlapping. 
label.every 
Specify 
force.label 
set to 
xfrac 
fraction of horizontal plot to set aside for axis titles 
cex.axis 
character size for tick mark labels 
cex.var 
character size for axis titles (variable names) 
col.grid 
If left unspecified, no vertical reference lines are drawn. Specify a
vector of length one (to use the same color for both minor and major
reference lines) or two (corresponding to the color for the major and
minor divisions, respectively) containing colors, to cause vertical reference
lines to the top points scale to be drawn. For R, a good choice is

varname.label 
In constructing axis titles for interactions, the default is to add

varname.label.sep 
If 
ia.space 
When multiple axes are draw for levels of interacting factors, the default is to group combinations related to a main effect. This is done by spacing the axes for the second to last of these within a group only 0.7 (by default) of the way down as compared with normal space of 1 unit. 
tck 
see 
tcl 
length of tick marks in nomogram 
lmgp 
spacing between numeric axis labels and axis (see 
naxes 
maximum number of axes to allow on one plot. If the nomogram requires more than one “page”, the “Points” axis will be repeated at the top of each page when necessary. 
points.label 
a character string giving the axis label for the points scale 
total.points.label 
a character string giving the axis label for the total points scale 
total.sep.page 
set to 
total.fun 
a userprovided function that will be executed before the total points
axis is drawn. Default is not to execute a function. This is useful e.g.
when 
cap.labels 
logical: should the factor labels have their first letter capitalized? 
object 
the result returned from 
which 
a character string giving the name of a variable for which to draw a legend with abbreviations of factor levels 
y 
ycoordinate to pass to the 
ncol 
the number of columns to form in drawing the legend. 
A variable is considered to be discrete if it is categorical or ordered
or if datadist
stored values
for it (meaning it
had <11
unique values).
A variable is said to be indirectly related to another variable if
the two are related by some interaction. For example, if a model
has variables a, b, c, d, and the interactions are a:c and c:d,
variable d is indirectly related to variable a. The complete list
of variables related to a is c, d. If an axis is made for variable a,
several axes will actually be drawn, one for each combination of c
and d specified in interact
.
Note that with a caliper, it is easy to continually add point scores for individual predictors, and then to place the caliper on the upper “Points” axis (with extrapolation if needed). Then transfer these points to the “Total Points” axis. In this way, points can be added without writing them down.
Confidence limits for an individual predictor score are really confidence
limits for the entire linear predictor, with other predictors set to
adjustment values. If lp = TRUE
, all confidence bars for all linear
predictor values evaluated are drawn. The extent to which multiple
confidence bars of differing widths appear at the same linear predictor
value means that precision depended on how the linear predictor was
arrived at (e.g., a certain value may be realized from a setting of
a certain predictor that was associated with a large standard error
on the regression coefficients for that predictor).
On occasion, you may want to reverse the regression coefficients of a model
to make the “points” scales reverse direction. For parametric survival
models, which are stated in terms of increasing regression effects meaning
longer survival (the opposite of a Cox model), just do something like
fit$coefficients < fit$coefficients
before invoking nomogram
,
and if you add function axes, negate the function
arguments. For the Cox model, you also need to negate fit$center
.
If you omit lp.at
, also negate fit$linear.predictors
.
a list of class "nomogram"
that contains information used in plotting
the axes. If you specified abbrev = TRUE
, a list called abbrev
is also
returned that gives the abbreviations used for tick mark labels, if any.
This list is useful for
making legends and is used by legend.nomabbrev
(see the last example).
The returned list also has components called total.points
, lp
,
and the function axis names. These components have components
x
(at
argument vector given to axis
), y
(pos
for axis
),
and x.real
, the xcoordinates appearing on tick mark labels.
An often useful result is stored in the list of data for each axis variable,
namely the exact number of points that correspond to each tick mark on
that variable's axis.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
Banks J: Nomograms. Encylopedia of Statistical Sciences, Vol 6. Editors: S Kotz and NL Johnson. New York: Wiley; 1985.
Lubsen J, Pool J, van der Does, E: A practical device for the application of a diagnostic or prognostic function. Meth. Inform. Med. 17:127–129; 1978.
Wikipedia: Nomogram, https://en.wikipedia.org/wiki/Nomogram.
rms
, plot.Predict
,
ggplot.Predict
, plot.summary.rms
,
axis
, pretty
, approx
,
latexrms
, rmsMisc
n < 1000 # define sample size
set.seed(17) # so can reproduce the results
d < data.frame(age = rnorm(n, 50, 10),
blood.pressure = rnorm(n, 120, 15),
cholesterol = rnorm(n, 200, 25),
sex = factor(sample(c('female','male'), n,TRUE)))
# Specify population model for log odds that Y=1
# Simulate binary y to have Prob(y=1) = 1/[1+exp(L)]
d < upData(d,
L = .4*(sex=='male') + .045*(age50) +
(log(cholesterol  10)5.2)*(2*(sex=='female') + 2*(sex=='male')),
y = ifelse(runif(n) < plogis(L), 1, 0))
ddist < datadist(d); options(datadist='ddist')
f < lrm(y ~ lsp(age,50) + sex * rcs(cholesterol, 4) + blood.pressure,
data=d)
nom < nomogram(f, fun=function(x)1/(1+exp(x)), # or fun=plogis
fun.at=c(.001,.01,.05,seq(.1,.9,by=.1),.95,.99,.999),
funlabel="Risk of Death")
#Instead of fun.at, could have specified fun.lp.at=logit of
#sequence above  faster and slightly more accurate
plot(nom, xfrac=.45)
print(nom)
nom < nomogram(f, age=seq(10,90,by=10))
plot(nom, xfrac=.45)
g < lrm(y ~ sex + rcs(age, 3) * rcs(cholesterol, 3), data=d)
nom < nomogram(g, interact=list(age=c(20,40,60)),
conf.int=c(.7,.9,.95))
plot(nom, col.conf=c(1,.5,.2), naxes=7)
require(survival)
w < upData(d,
cens = 15 * runif(n),
h = .02 * exp(.04 * (age  50) + .8 * (sex == 'Female')),
d.time = log(runif(n)) / h,
death = ifelse(d.time <= cens, 1, 0),
d.time = pmin(d.time, cens))
f < psm(Surv(d.time,death) ~ sex * age, data=w, dist='lognormal')
med < Quantile(f)
surv < Survival(f) # This would also work if f was from cph
plot(nomogram(f, fun=function(x) med(lp=x), funlabel="Median Survival Time"))
nom < nomogram(f, fun=list(function(x) surv(3, x),
function(x) surv(6, x)),
funlabel=c("3Month Survival Probability",
"6month Survival Probability"))
plot(nom, xfrac=.7)
## Not run:
nom < nomogram(fit.with.categorical.predictors, abbrev=TRUE, minlength=1)
nom$x1$points # print points assigned to each level of x1 for its axis
#Add legend for abbreviations for category levels
abb < attr(nom, 'info')$abbrev$treatment
legend(locator(1), abb$full, pch=paste(abb$abbrev,collapse=''),
ncol=2, bty='n') # this only works for 1letter abbreviations
#Or use the legend.nomabbrev function:
legend.nomabbrev(nom, 'treatment', locator(1), ncol=2, bty='n')
## End(Not run)
#Make a nomogram with axes predicting probabilities Y>=j for all j=13
#in an ordinal logistic model, where Y=0,1,2,3
w < upData(w, Y = ifelse(y==0, 0, sample(1:3, length(y), TRUE)))
g < lrm(Y ~ age+rcs(cholesterol,4) * sex, data=w)
fun2 < function(x) plogis(xg$coef[1]+g$coef[2])
fun3 < function(x) plogis(xg$coef[1]+g$coef[3])
f < Newlabels(g, c(age='Age in Years'))
#see Design.Misc, which also has Newlevels to change
#labels for levels of categorical variables
g < nomogram(f, fun=list('Prob Y>=1'=plogis, 'Prob Y>=2'=fun2,
'Prob Y=3'=fun3),
fun.at=c(.01,.05,seq(.1,.9,by=.1),.95,.99))
plot(g, lmgp=.2, cex.axis=.6)
options(datadist=NULL)
Computes an estimate of a survival curve for censored data
using either the KaplanMeier or the FlemingHarrington method
or computes the predicted survivor function.
For competing risks data it computes the cumulative incidence curve.
This calls the survival
package's survfit.formula
function. Attributes of the event time variable are saved (label and
units of measurement).
For competing risks the second argument for Surv
should be the
event state variable, and it should be a factor variable with the first
factor level denoting rightcensored observations.
npsurv(formula, data=environment(formula),
subset, weights, na.action=na.delete, ...)
formula 
a formula object, which must have a 
data , subset , weights , na.action

see 
... 
see 
see survfit.formula
for details
an object of class "npsurv"
and "survfit"
.
See survfit.object
for details. Methods defined for survfit
objects are print
, summary
, plot
,lines
, and
points
.
Thomas Lumley [email protected] and Terry Therneau
survfit.cph
for survival curves from Cox models.
print
,
plot
,
lines
,
coxph
,
strata
,
survplot
require(survival)
# fit a KaplanMeier and plot it
fit < npsurv(Surv(time, status) ~ x, data = aml)
plot(fit, lty = 2:3)
legend(100, .8, c("Maintained", "Nonmaintained"), lty = 2:3)
# Here is the data set from Turnbull
# There are no interval censored subjects, only leftcensored (status=3),
# rightcensored (status 0) and observed events (status 1)
#
# Time
# 1 2 3 4
# Type of observation
# death 12 6 2 3
# losses 3 2 0 3
# late entry 2 4 2 5
#
tdata < data.frame(time = c(1,1,1,2,2,2,3,3,3,4,4,4),
status = rep(c(1,0,2),4),
n = c(12,3,2,6,2,4,2,0,2,3,3,5))
fit < npsurv(Surv(time, time, status, type='interval') ~ 1,
data=tdata, weights=n)
#
# Time to progression/death for patients with monoclonal gammopathy
# Competing risk curves (cumulative incidence)
# status variable must be a factor with first level denoting right censoring
m < upData(mgus1, stop = stop / 365.25, units=c(stop='years'),
labels=c(stop='Followup Time'), subset=start == 0)
f < npsurv(Surv(stop, event) ~ 1, data=m)
# CI curves are always plotted from 0 upwards, rather than 1 down
plot(f, fun='event', xmax=20, mark.time=FALSE,
col=2:3, xlab="Years post diagnosis of MGUS")
text(10, .4, "Competing Risk: death", col=3)
text(16, .15,"Competing Risk: progression", col=2)
# Use survplot for enhanced displays of cumulative incidence curves for
# competing risks
survplot(f, state='pcm', n.risk=TRUE, xlim=c(0, 20), ylim=c(0, .5), col=2)
survplot(f, state='death', add=TRUE, col=3)
f < npsurv(Surv(stop, event) ~ sex, data=m)
survplot(f, state='death', n.risk=TRUE, conf='diffbands')
Fits the usual weighted or unweighted linear regression model using the
same fitting routines used by lm
, but also storing the variancecovariance
matrix var
and using traditional dummyvariable coding for categorical
factors.
Also fits unweighted models using penalized least squares, with the same
penalization options as in the lrm
function. For penalized estimation,
there is a fitter function call lm.pfit
.
ols(formula, data=environment(formula), weights, subset, na.action=na.delete,
method="qr", model=FALSE,
x=FALSE, y=FALSE, se.fit=FALSE, linear.predictors=TRUE,
penalty=0, penalty.matrix, tol=1e7, sigma,
var.penalty=c('simple','sandwich'), ...)
formula 
an S formula object, e.g.

data 
name of an S data frame containing all needed variables. Omit this to use a data frame already in the S “search list”. 
weights 
an optional vector of weights to be used in the fitting
process. If specified, weighted least squares is used with
weights 
subset 
an expression defining a subset of the observations to use in the fit. The default
is to use all observations. Specify for example 
na.action 
specifies an S function to handle missing data. The default is the function 
method 
specifies a particular fitting method, or 
model 
default is 
x 
default is 
y 
default is 
se.fit 
default is 
linear.predictors 
set to 
penalty 
see 
penalty.matrix 
see 
tol 
tolerance for information matrix singularity 
sigma 
If 
var.penalty 
the type of variancecovariance matrix to be stored in the 
... 
For penalized estimation, the penalty factor on the log likelihood is
$0.5 \beta' P \beta / \sigma^2$
, where $P$
is defined above.
The penalized maximum likelihood estimate (penalized least squares
or ridge estimate) of $\beta$
is $(X'X + P)^{1} X'Y$
.
The maximum likelihood estimate of $\sigma^2$
is $(sse + \beta'
P \beta) / n$
, where
sse
is the sum of squared errors (residuals).
The effective.df.diagonal
vector is the
diagonal of the matrix $X'X/(sse/n) \sigma^{2} (X'X + P)^{1}$
.
the same objects returned from lm
(unless penalty
or penalty.matrix
are given  then an
abbreviated list is returned since lm.pfit
is used as a fitter)
plus the design attributes
(see rms
).
Predicted values are always returned, in the element linear.predictors
.
The vectors or matrix stored if y=TRUE
or x=TRUE
have rows deleted according to subset
and
to missing data, and have names or row names that come from the
data frame used as input data. If penalty
or penalty.matrix
is given,
the var
matrix
returned is an improved variancecovariance matrix
for the penalized regression coefficient estimates. If
var.penalty="sandwich"
(not the default, as limited simulation
studies have found it provides variance estimates that are too low) it
is defined as
$\sigma^{2} (X'X + P)^{1} X'X (X'X + P)^{1}$
, where $P$
is
penalty factors * penalty.matrix
, with a column and row of zeros
added for the
intercept. When var.penalty="simple"
(the default), var
is
$\sigma^{2} (X'X + P)^{1}$
.
The returned list has a vector stats
with named elements
n, Model L.R., d.f., R2, g, Sigma
. Model L.R.
is the model
likelihood ratio $\chi^2$
statistic, and R2
is
$R^2$
. For penalized estimation, d.f.
is the
effective degrees of freedom, which is the sum of the elements of another
vector returned, effective.df.diagonal
, minus one for the
intercept.
g
is the $g$
index.
Sigma
is the penalized maximum likelihood estimate (see below).
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
rms
, rms.trans
, anova.rms
,
summary.rms
, predict.rms
,
fastbw
, validate
, calibrate
,
Predict
, specs.rms
, cph
,
lrm
, which.influence
, lm
,
summary.lm
, print.ols
,
residuals.ols
, latex.ols
,
na.delete
, na.detail.response
,
datadist
, pentrace
, vif
,
abs.error.pred
set.seed(1)
x1 < runif(200)
x2 < sample(0:3, 200, TRUE)
distance < (x1 + x2/3 + rnorm(200))^2
d < datadist(x1,x2)
options(datadist="d") # No d > no summary, plot without giving all details
f < ols(sqrt(distance) ~ rcs(x1,4) + scored(x2), x=TRUE)
# could use d < datadist(f); options(datadist="d") at this point,
# but predictor summaries would not be stored in the fit object for
# use with Predict, summary.rms. In that case, the original
# dataset or d would need to be accessed later, or all variable values
# would have to be specified to summary, plot
anova(f)
which.influence(f)
summary(f)
summary.lm(f) # will only work if penalty and penalty.matrix not used
# Fit a complex model and approximate it with a simple one
x1 < runif(200)
x2 < runif(200)
x3 < runif(200)
x4 < runif(200)
y < x1 + x2 + rnorm(200)
f < ols(y ~ rcs(x1,4) + x2 + x3 + x4)
pred < fitted(f) # or predict(f) or f$linear.predictors
f2 < ols(pred ~ rcs(x1,4) + x2 + x3 + x4, sigma=1)
# sigma=1 prevents numerical problems resulting from R2=1
fastbw(f2, aics=100000)
# This will find the best 1variable model, best 2variable model, etc.
# in predicting the predicted values from the original model
options(datadist=NULL)
Fits ordinal cumulative probability models for continuous or ordinal
response variables, efficiently allowing for a large number of
intercepts by capitalizing on the information matrix being sparse.
Five different distribution functions are implemented, with the
default being the logistic (i.e., the proportional odds
model). The ordinal cumulative probability models are stated in terms
of exceedance probabilities ($Prob[Y \ge y  X]$
) so that as with
OLS larger predicted values are associated with larger Y
. This is
important to note for the asymmetric distributions given by the
loglog and complementary loglog families, for which negating the
linear predictor does not result in $Prob[Y < y  X]$
. The
family
argument is defined in orm.fit
. The model
assumes that the inverse of the assumed cumulative distribution
function, when applied to one minus the true cumulative distribution function
and plotted on the $y$
axis (with the original $y$
on the
$x$
axis) yields parallel curves (though not necessarily linear).
This can be checked by plotting the inverse cumulative probability
function of one minus the empirical distribution function, stratified
by X
, and assessing parallelism. Note that parametric
regression models make the much stronger assumption of linearity of
such inverse functions.
For the print
method, format of output is controlled by the
user previously running options(prType="lang")
where
lang
is "plain"
(the default), "latex"
, or
"html"
. When using html with Quarto or RMarkdown,
results='asis'
need not be written in the chunk header.
Quantile.orm
creates an R function that computes an estimate of
a given quantile for a given value of the linear predictor (which was
assumed to use thefirst intercept). It uses a linear
interpolation method by default, but you can override that to use a
discrete method by specifying method="discrete"
when calling
the function generated by Quantile
.
Optionally a normal approximation for a confidence
interval for quantiles will be computed using the delta method, if
conf.int > 0
is specified to the function generated from calling
Quantile
and you specify X
. In that case, a
"lims"
attribute is included
in the result computed by the derived quantile function.
orm(formula, data=environment(formula),
subset, na.action=na.delete, method="orm.fit",
model=FALSE, x=FALSE, y=FALSE, linear.predictors=TRUE, se.fit=FALSE,
penalty=0, penalty.matrix, tol=1e7, eps=0.005,
var.penalty=c('simple','sandwich'), scale=FALSE, ...)
## S3 method for class 'orm'
print(x, digits=4, r2=c(0,2,4), coefs=TRUE, pg=FALSE,
intercepts=x$non.slopes < 10, title, ...)
## S3 method for class 'orm'
Quantile(object, codes=FALSE, ...)
formula 
a formula object. An 
data 
data frame to use. Default is the current frame. 
subset 
logical expression or vector of subscripts defining a subset of observations to analyze 
na.action 
function to handle 
method 
name of fitting function. Only allowable choice at present is 
model 
causes the model frame to be returned in the fit object 
x 
causes the expanded design matrix (with missings excluded)
to be returned under the name 
y 
causes the response variable (with missings excluded) to be returned
under the name 
linear.predictors 
causes the predicted X beta (with missings excluded) to be returned
under the name 
se.fit 
causes the standard errors of the fitted values (on the linear predictor
scale) to be returned under the name 
penalty 
see 
penalty.matrix 
see 
tol 
singularity criterion (see 
eps 
difference in 
var.penalty 
see 
scale 
set to 
... 
arguments that are passed to 
digits 
number of significant digits to use 
r2 
vector of integers specifying which R^2 measures to print,
with 0 for Nagelkerke R^2 and 1:4 corresponding to the 4 measures
computed by 
pg 
set to 
coefs 
specify 
intercepts 
By default, intercepts are only printed if there are
fewer than 10 of them. Otherwise this is controlled by specifying

title 
a character string title to be passed to 
object 
an object created by 
codes 
if 
The returned fit object of orm
contains the following components
in addition to the ones mentioned under the optional arguments.
call 
calling expression 
freq 
table of frequencies for 
stats 
vector with the following elements: number of observations used in the
fit, number of unique 
fail 
set to 
coefficients 
estimated parameters 
var 
estimated variancecovariance matrix (inverse of information matrix)
for the middle intercept and regression coefficients. See

effective.df.diagonal 
see 
family 
the character string for 
trans 
a list of functions for the choice of 
deviance 
2 log likelihoods (counting penalty components) When an offset variable is present, three deviances are computed: for intercept(s) only, for intercepts+offset, and for intercepts+offset+predictors. When there is no offset variable, the vector contains deviances for the intercept(s)only model and the model with intercept(s) and predictors. 
non.slopes 
number of intercepts in model 
interceptRef 
the index of the middle (median) intercept used in
computing the linear predictor and 
penalty 
see 
penalty.matrix 
the penalty matrix actually used in the estimation 
info.matrix 
a sparse matrix representation of type

Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
For the Quantile
function:
Qi Liu and Shengxin Tu
Department of Biostatistics, Vanderbilt University
Sall J: A monotone regression smoother based on ordinal cumulative logistic regression, 1991.
Le Cessie S, Van Houwelingen JC: Ridge estimators in logistic regression. Applied Statistics 41:191–201, 1992.
Verweij PJM, Van Houwelingen JC: Penalized likelihood in Cox regression. Stat in Med 13:2427–2436, 1994.
Gray RJ: Flexible methods for analyzing survival data using splines, with applications to breast cancer prognosis. JASA 87:942–951, 1992.
Shao J: Linear model selection by crossvalidation. JASA 88:486–494, 1993.
Verweij PJM, Van Houwelingen JC: Crossvalidation in survival analysis. Stat in Med 12:2305–2314, 1993.
Harrell FE: Model uncertainty, penalization, and parsimony. Available from https://hbiostat.org/talks/iscb98.pdf.
orm.fit
, predict.orm
, solve
,
rms.trans
, rms
, polr
,
latex.orm
, vcov.orm
,
num.intercepts
,
residuals.orm
, na.delete
,
na.detail.response
,
pentrace
, rmsMisc
, vif
,
predab.resample
,
validate.orm
, calibrate
,
Mean.orm
, gIndex
, prModFit
require(ggplot2)
set.seed(1)
n < 100
y < round(runif(n), 2)
x1 < sample(c(1,0,1), n, TRUE)
x2 < sample(c(1,0,1), n, TRUE)
f < lrm(y ~ x1 + x2, eps=1e5)
g < orm(y ~ x1 + x2, eps=1e5)
max(abs(coef(g)  coef(f)))
w < vcov(g, intercepts='all') / vcov(f)  1
max(abs(w))
set.seed(1)
n < 300
x1 < c(rep(0,150), rep(1,150))
y < rnorm(n) + 3*x1
g < orm(y ~ x1)
g
k < coef(g)
i < num.intercepts(g)
h < orm(y ~ x1, family=probit)
ll < orm(y ~ x1, family=loglog)
cll < orm(y ~ x1, family=cloglog)
cau < orm(y ~ x1, family=cauchit)
x < 1:i
z < list(logistic=list(x=x, y=coef(g)[1:i]),
probit =list(x=x, y=coef(h)[1:i]),
loglog =list(x=x, y=coef(ll)[1:i]),
cloglog =list(x=x, y=coef(cll)[1:i]))
labcurve(z, pl=TRUE, col=1:4, ylab='Intercept')
tapply(y, x1, mean)
m < Mean(g)
m(w < k[1] + k['x1']*c(0,1))
mh < Mean(h)
wh < coef(h)[1] + coef(h)['x1']*c(0,1)
mh(wh)
qu < Quantile(g)
# Compare model estimated and empirical quantiles
cq < function(y) {
cat(qu(.1, w), tapply(y, x1, quantile, probs=.1), '\n')
cat(qu(.5, w), tapply(y, x1, quantile, probs=.5), '\n')
cat(qu(.9, w), tapply(y, x1, quantile, probs=.9), '\n')
}
cq(y)
# Try on lognormal model
g < orm(exp(y) ~ x1)
g
k < coef(g)
plot(k[1:i])
m < Mean(g)
m(w < k[1] + k['x1']*c(0,1))
tapply(exp(y), x1, mean)
qu < Quantile(g)
cq(exp(y))
# Compare predicted mean with ols for a continuous x
set.seed(3)
n < 200
x1 < rnorm(n)
y < x1 + rnorm(n)
dd < datadist(x1); options(datadist='dd')
f < ols(y ~ x1)
g < orm(y ~ x1, family=probit)
h < orm(y ~ x1, family=logistic)
w < orm(y ~ x1, family=cloglog)
mg < Mean(g); mh < Mean(h); mw < Mean(w)
r < rbind(ols = Predict(f, conf.int=FALSE),
probit = Predict(g, conf.int=FALSE, fun=mg),
logistic = Predict(h, conf.int=FALSE, fun=mh),
cloglog = Predict(w, conf.int=FALSE, fun=mw))
plot(r, groups='.set.')
# Compare predicted 0.8 quantile with quantile regression
qu < Quantile(g)
qu80 < function(lp) qu(.8, lp)
f < Rq(y ~ x1, tau=.8)
r < rbind(probit = Predict(g, conf.int=FALSE, fun=qu80),
quantreg = Predict(f, conf.int=FALSE))
plot(r, groups='.set.')
# Verify transformation invariance of ordinal regression
ga < orm(exp(y) ~ x1, family=probit)
qua < Quantile(ga)
qua80 < function(lp) log(qua(.8, lp))
r < rbind(logprobit = Predict(ga, conf.int=FALSE, fun=qua80),
probit = Predict(g, conf.int=FALSE, fun=qu80))
plot(r, groups='.set.')
# Try the same with quantile regression. Need to transform x1
fa < Rq(exp(y) ~ rcs(x1,5), tau=.8)
r < rbind(qr = Predict(f, conf.int=FALSE),
logqr = Predict(fa, conf.int=FALSE, fun=log))
plot(r, groups='.set.')
# Make a plot of Pr(Y >= y) vs. a continuous covariate for 3 levels
# of y and also against a binary covariate
set.seed(1)
n < 1000
age < rnorm(n, 50, 15)
sex < sample(c('m', 'f'), 1000, TRUE)
Y < runif(n)
dd < datadist(age, sex); options(datadist='dd')
f < orm(Y ~ age + sex)
# Use ExProb function to derive an R function to compute
# P(Y >= y  X)
ex < ExProb(f)
ex1 < function(x) ex(x, y=0.25)
ex2 < function(x) ex(x, y=0.5)
ex3 < function(x) ex(x, y=0.75)
p1 < Predict(f, age, sex, fun=ex1)
p2 < Predict(f, age, sex, fun=ex2)
p3 < Predict(f, age, sex, fun=ex3)
p < rbind('P(Y >= 0.25)' = p1,
'P(Y >= 0.5)' = p2,
'P(Y >= 0.75)' = p3)
ggplot(p)
# Make plot with two curves (by sex) with y on the xaxis, and
# estimated P(Y >= y  sex, age=median) on the yaxis
ys < seq(min(Y), max(Y), length=100)
g < function(sx) as.vector(ex(y=ys, Predict(f, sex=sx)$yhat)$prob)
d < rbind(data.frame(sex='m', y=ys, p=g('m')),
data.frame(sex='f', y=ys, p=g('f')))
ggplot(d, aes(x=y, y=p, color=sex)) + geom_line() +
ylab(expression(P(Y >= y))) +
guides(color=guide_legend(title='Sex')) +
theme(legend.position='bottom')
options(datadist=NULL)
## Not run:
## Simulate power and type I error for orm logistic and probit regression
## for likelihood ratio, Wald, and score chisquare tests, and compare
## with ttest
require(rms)
set.seed(5)
nsim < 2000
r < NULL
for(beta in c(0, .4)) {
for(n in c(10, 50, 300)) {
cat('beta=', beta, ' n=', n, '\n\n')
plogistic < pprobit < plogistics < pprobits < plogisticw <
pprobitw < ptt < numeric(nsim)
x < c(rep(0, n/2), rep(1, n/2))
pb < setPb(nsim, every=25, label=paste('beta=', beta, ' n=', n))
for(j in 1:nsim) {
pb(j)
y < beta*x + rnorm(n)
tt < t.test(y ~ x)
ptt[j] < tt$p.value
f < orm(y ~ x)
plogistic[j] < f$stats['P']
plogistics[j] < f$stats['Score P']
plogisticw[j] < 1  pchisq(coef(f)['x']^2 / vcov(f)[2,2], 1)
f < orm(y ~ x, family=probit)
pprobit[j] < f$stats['P']
pprobits[j] < f$stats['Score P']
pprobitw[j] < 1  pchisq(coef(f)['x']^2 / vcov(f)[2,2], 1)
}
if(beta == 0) plot(ecdf(plogistic))
r < rbind(r, data.frame(beta = beta, n=n,
ttest = mean(ptt < 0.05),
logisticlr = mean(plogistic < 0.05),
logisticscore= mean(plogistics < 0.05),
logisticwald = mean(plogisticw < 0.05),
probit = mean(pprobit < 0.05),
probitscore = mean(pprobits < 0.05),
probitwald = mean(pprobitw < 0.05)))
}
}
print(r)
# beta n ttest logisticlr logisticscore logisticwald probit probitscore probitwald
#1 0.0 10 0.0435 0.1060 0.0655 0.043 0.0920 0.0920 0.0820
#2 0.0 50 0.0515 0.0635 0.0615 0.060 0.0620 0.0620 0.0620
#3 0.0 300 0.0595 0.0595 0.0590 0.059 0.0605 0.0605 0.0605
#4 0.4 10 0.0755 0.1595 0.1070 0.074 0.1430 0.1430 0.1285
#5 0.4 50 0.2950 0.2960 0.2935 0.288 0.3120 0.3120 0.3120
#6 0.4 300 0.9240 0.9215 0.9205 0.920 0.9230 0.9230 0.9230
## End(Not run)
Fits ordinal cumulative probability models for continuous or ordinal response variables, efficiently allowing for a large number of intercepts by capitalizing on the information matrix being sparse. Five different distribution functions are implemented, with the default being the logistic (yielding the proportional odds model). Penalized estimation will be implemented in the future. Weights are not implemented. The optimization method is NewtonRaphson with stephalving. Execution time is linear in the number of intercepts.
orm.fit(x=NULL, y, family='logistic',
offset=0., initial, maxit=12L, eps=.005, tol=1e7, trace=FALSE,
penalty.matrix=NULL, scale=FALSE, y.precision = 7)
x 
design matrix with no column for an intercept 
y 
response vector, numeric, factor, or character. The ordering of levels
is assumed from 
family 
the distribution family, corresponding to logistic (the
default), Gaussian, Cauchy, Gumbel maximum ( 
offset 
optional numeric vector containing an offset on the logit scale 
initial 
vector of initial parameter estimates, beginning with the
intercepts. If 
maxit 
maximum no. iterations (default= 
eps 
difference in 2 log likelihood is below 1E9, convergence is still declared. This handles the case where the initial estimates are MLEs, to prevent endless stephalving. 
tol 
Singularity criterion. Default is 1e7 
trace 
set to 
penalty.matrix 
a selfcontained readytouse penalty matrix  see 
scale 
set to 
y.precision 
When ‘y’ is numeric, values may need to be rounded
to avoid unpredictable behavior with 
a list with the following components:
call 
calling expression 
freq 
table of frequencies for 
yunique 
vector of sorted unique values of 
stats 
vector with the following elements: number of observations used in the
fit, number of unique 
fail 
set to 
coefficients 
estimated parameters 
var 
estimated variancecovariance matrix (inverse of information matrix).
Note that in the case of penalized estimation, 
family , trans

see 
deviance 
2 log likelihoods. When an offset variable is present, three deviances are computed: for intercept(s) only, for intercepts+offset, and for intercepts+offset+predictors. When there is no offset variable, the vector contains deviances for the intercept(s)only model and the model with intercept(s) and predictors. 
non.slopes 
number of intercepts in model 
interceptRef 
the index of the middle (median) intercept used in
computing the linear predictor and 
linear.predictors 
the linear predictor using the first intercept 
penalty.matrix 
see above 
info.matrix 
see 
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
#Fit an additive logistic model containing numeric predictors age,
#blood.pressure, and sex, assumed to be already properly coded and
#transformed
#
# fit < orm.fit(cbind(age,blood.pressure,sex), death)
For an ordinary unpenalized fit from lrm
or ols
and for a vector or list of penalties,
fits a series of logistic or linear models using penalized maximum likelihood
estimation, and saves the effective degrees of freedom, Akaike Information
Criterion ($AIC$
), Schwarz Bayesian Information Criterion ($BIC$
), and
Hurvich and Tsai's corrected $AIC$
($AIC_c$
). Optionally
pentrace
can
use the nlminb
function to solve for the optimum penalty factor or
combination of factors penalizing different kinds of terms in the model.
The effective.df
function prints the original and effective
degrees of freedom for a penalized fit or for an unpenalized fit and
the best penalization determined from a previous invocation of
pentrace
if method="grid"
(the default).
The effective d.f. is computed separately for each class of terms in
the model (e.g., interaction, nonlinear).
A plot
method exists to plot the results, and a print
method exists
to print the most pertinent components. Both $AIC$
and $BIC$
may be plotted if
there is only one penalty factor type specified in penalty
. Otherwise,
the first two types of penalty factors are plotted, showing only the $AIC$
.
pentrace(fit, penalty, penalty.matrix,
method=c('grid','optimize'),
which=c('aic.c','aic','bic'), target.df=NULL,
fitter, pr=FALSE, tol=1e7,
keep.coef=FALSE, complex.more=TRUE, verbose=FALSE, maxit=12,
subset, noaddzero=FALSE)
effective.df(fit, object)
## S3 method for class 'pentrace'
print(x, ...)
## S3 method for class 'pentrace'
plot(x, method=c('points','image'),
which=c('effective.df','aic','aic.c','bic'), pch=2, add=FALSE,
ylim, ...)
fit 
a result from 
penalty 
can be a vector or a list. If it is a vector, all types of terms in
the model will be penalized by the same amount, specified by elements in

object 
an object returned by 
penalty.matrix 
see 
method 
The default is 
which 
the objective to maximize for either 
target.df 
applies only to 
fitter 
a fitting function. Default is 
pr 
set to 
tol 
tolerance for declaring a matrix singular (see 
keep.coef 
set to 
complex.more 
By default if 
verbose 
set to 
maxit 
maximum number of iterations to allow in a model fit (default=12).
This is passed to the appropriate fitter function with the correct
argument name. Increase 
subset 
a logical or integer vector specifying rows of the design and response
matrices to subset in fitting models. This is most useful for
bootstrapping 
noaddzero 
set to 
x 
a result from 
pch 
used for 
add 
set to 
ylim 
2vector of yaxis limits for plots other than effective d.f. 
... 
other arguments passed to 
a list of class "pentrace"
with elements penalty, df, objective, fit, var.adj, diag, results.all
, and
optionally Coefficients
.
The first 6 elements correspond to the fit that had the best objective
as named in the which
argument, from the sequence of fits tried.
Here fit
is the fit object from fitter
which was a penalized fit,
diag
is the diagonal of the matrix used to compute the effective
d.f., and var.adj
is Gray (1992) Equation 2.9, which is an improved
covariance matrix for the penalized beta. results.all
is a data
frame whose first few variables are the components of penalty
and
whose other columns are df, aic, bic, aic.c
. results.all
thus
contains a summary of results for all fits attempted. When
method="optimize"
, only two components are returned: penalty
and
objective
, and the object does not have a class.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
Gray RJ: Flexible methods for analyzing survival data using splines, with applications to breast cancer prognosis. JASA 87:942–951, 1992.
Hurvich CM, Tsai, CL: Regression and time series model selection in small samples. Biometrika 76:297–307, 1989.
lrm
, ols
, solvet
, rmsMisc
, image
n < 1000 # define sample size
set.seed(17) # so can reproduce the results
age < rnorm(n, 50, 10)
blood.pressure < rnorm(n, 120, 15)
cholesterol < rnorm(n, 200, 25)
sex < factor(sample(c('female','male'), n,TRUE))
# Specify population model for log odds that Y=1
L < .4*(sex=='male') + .045*(age50) +
(log(cholesterol  10)5.2)*(2*(sex=='female') + 2*(sex=='male'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(L)]
y < ifelse(runif(n) < plogis(L), 1, 0)
f < lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)),
x=TRUE, y=TRUE)
p < pentrace(f, seq(.2,1,by=.05))
plot(p)
p$diag # may learn something about fractional effective d.f.
# for each original parameter
pentrace(f, list(simple=c(0,.2,.4), nonlinear=c(0,.2,.4,.8,1)))
# Bootstrap pentrace 5 times, making a plot of corrected AIC plot with 5 reps
n < nrow(f$x)
plot(pentrace(f, seq(.2,1,by=.05)), which='aic.c',
col=1, ylim=c(30,120)) #original in black
for(j in 1:5)
plot(pentrace(f, seq(.2,1,by=.05), subset=sample(n,n,TRUE)),
which='aic.c', col=j+1, add=TRUE)
# Find penalty giving optimum corrected AIC. Initial guess is 1.0
# Not implemented yet
# pentrace(f, 1, method='optimize')
# Find penalty reducing total regression d.f. effectively to 5
# pentrace(f, 1, target.df=5)
# Refit with penalty giving best aic.c without differential penalization
f < update(f, penalty=p$penalty)
effective.df(f)
Plot Bayesian Contrast Posterior Densities
## S3 method for class 'contrast.rms'
plot(
x,
bivar = FALSE,
bivarmethod = c("ellipse", "kernel"),
prob = 0.95,
which = c("both", "diff", "ind"),
nrow = NULL,
ncol = NULL,
...
)
x 
the result of 
bivar 
set to 
bivarmethod 

prob 
posterior coverage probability for HPD interval or 2d contour 
which 
applies when plotting the result of 
nrow 

ncol 
likewise 
... 
unused 
If there are exactly two contrasts and bivar=TRUE
plots an elliptical or kernal (based on bivarmethod
posterior density contour with probability prob
). Otherwise plots a series of posterior densities of contrasts along with HPD intervals, posterior means, and medians. When the result being plotted comes from contrast
with fun=
specified, both the two individual estimates and their difference are plotted.
ggplot2
object
Frank Harrell
Uses lattice
graphics to plot the effect of one or two predictors
on the linear predictor or X beta scale, or on some transformation of
that scale. The first argument specifies the result of the
Predict
function. The predictor is always plotted in its
original coding. plot.Predict
uses the
xYplot
function unless formula
is omitted and the xaxis
variable is a factor, in which case it reverses the x and yaxes and
uses the Dotplot
function.
If data
is given, a rug plot is drawn showing
the location/density of data values for the $x$
axis variable. If
there is a groups
(superposition) variable that generated separate
curves, the data density specific to each class of points is shown.
This assumes that the second variable was a factor variable. The rug plots
are drawn by scat1d
. When the same predictor is used on all
$x$
axes, and multiple panels are drawn, you can use
subdata
to specify an expression to subset according to other
criteria in addition.
To plot effects instead of estimates (e.g., treatment differences as a
function of interacting factors) see contrast.rms
and
summary.rms
.
pantext
creates a lattice
panel function for including
text such as that produced by print.anova.rms
inside a panel or
in a base graphic.
## S3 method for class 'Predict'
plot(x, formula, groups=NULL,
cond=NULL, varypred=FALSE, subset,
xlim, ylim, xlab, ylab,
data=NULL, subdata, anova=NULL, pval=FALSE, cex.anova=.85,
col.fill=gray(seq(.825, .55, length=5)),
adj.subtitle, cex.adj, cex.axis, perim=NULL, digits=4, nlevels=3,
nlines=FALSE, addpanel, scat1d.opts=list(frac=0.025, lwd=0.3),
type=NULL, yscale=NULL, scaletrans=function(z) z, ...)
pantext(object, x, y, cex=.5, adj=0, fontfamily="Courier", lattice=TRUE)
x 
a data frame created by 
formula 
the right hand side of a 
groups 
an optional name of one of the variables in 
cond 
when plotting effects of different predictors, 
varypred 
set to 
subset 
a subsetting expression for restricting the rows of

xlim 
This parameter is seldom used, as limits are usually controlled with

ylim 
Range for plotting on response variable axis. Computed by default. 
xlab 
Label for 
ylab 
Label for 
data 
a data frame containing the original raw data on which the
regression model were based, or at least containing the 
subdata 
if 
anova 
an object returned by 
pval 
specify 
cex.anova 
character size for the test statistic printed on the panel 
col.fill 
a vector of colors used to fill confidence bands for successive superposed groups. Default is inceasingly dark gray scale. 
adj.subtitle 
Set to 
cex.adj 

cex.axis 