Title: Regression Modeling Strategies
Description: Regression modeling, testing, estimation, validation, graphics, prediction, and typesetting by storing enhanced model design attributes in the fit. 'rms' is a collection of functions that assist with and streamline modeling. It also contains functions for binary and ordinal logistic regression models, ordinal models for continuous Y with a variety of distribution families, and the Buckley-James multiple regression model for right-censored responses, and it implements penalized maximum likelihood estimation for logistic and ordinary linear models. 'rms' works with almost any regression model, but it was especially written to work with binary or ordinal regression models, Cox regression, accelerated failure time models, ordinary linear models, the Buckley-James model, generalized least squares for serially or spatially correlated observations, generalized linear models, and quantile regression.
Authors: Frank E Harrell Jr [aut, cre]
Maintainer: Frank E Harrell Jr <[email protected]>
License: GPL (>= 2)
Version: 6.8-2
Built: 2024-10-23 06:52:54 UTC
Source: CRAN
The anova
function automatically tests most meaningful hypotheses
in a design. For example, suppose that age and cholesterol are
predictors, and that a general interaction is modeled using a restricted
spline surface. anova prints Wald statistics (F statistics for an ols fit) for testing linearity of age, linearity of
cholesterol, age effect (age + age by cholesterol interaction),
cholesterol effect (cholesterol + age by cholesterol interaction),
linearity of the age by cholesterol interaction (i.e., adequacy of the
simple age * cholesterol 1 d.f. product), linearity of the interaction
in age alone, and linearity of the interaction in cholesterol
alone. Joint tests of all interaction terms in the model and all
nonlinear terms in the model are also performed. For any multiple
d.f. effects for continuous variables that were not modeled through
rcs
, pol
, lsp
, etc., tests of linearity will be
omitted. This applies to matrix predictors produced by e.g.
poly
or ns
.
For lrm, orm, cph, psm
and Glm
fits, the better likelihood
ratio chi-square tests may be obtained by specifying test='LR'
.
Fits must use x=TRUE, y=TRUE
to run LR tests. The tests are run
fairly efficiently by subsetting the design matrix rather than
recreating it.
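For example, a minimal sketch (assuming a binary response y and predictors x1 and x2) contrasting the two kinds of tests:

f <- lrm(y ~ rcs(x1, 4) + x2, x=TRUE, y=TRUE)   # x, y stored so LR tests can run
anova(f)              # Wald chi-square tests
anova(f, test='LR')   # likelihood ratio chi-square tests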
print.anova.rms is the printing method. plot.anova.rms draws dot charts depicting the importance of variables in the model, as measured by Wald or LR chi-square, chi-square minus d.f., AIC, P-values, partial R^2, R^2 for the whole model after deleting the effects in question, or proportion of overall model R^2 that is due to each predictor.
latex.anova.rms
is the latex
method. It
substitutes Greek/math symbols in column headings, uses boldface for
TOTAL
lines, and constructs a caption. Then it passes the result
to latex.default
for conversion to LaTeX.
When the anova table has been converted to account for missing data
imputation by processMI
, a separate function prmiInfo
can
be used to print information related to imputation adjustments.
For Bayesian models such as blrm
, anova
computes relative
explained variation indexes (REV) based on approximate Wald statistics.
This uses the variance-covariance matrix of all of the posterior draws,
and the individual draws of betas, plus an overall summary from the
posterior mode/mean/median beta. Wald chi-squares assuming multivariate
normality of betas are computed just as with frequentist models, and for
each draw (or for the summary) the ratio of the partial Wald chi-square
to the total Wald statistic for the model is computed as REV.
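A minimal sketch (assuming the rmsb package, which provides blrm, is installed):

## Not run:
require(rmsb)                     # assumed to supply blrm
bf <- blrm(y ~ rcs(x1, 4) + x2)   # Bayesian logistic model
anova(bf)   # REV for each predictor, with HPD intervals from ns posterior draws
## End(Not run)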
The print
method calls latex
or html
methods
depending on options(prType=)
. For
latex
a table
environment is not used and an ordinary
tabular
is produced. When using html with Quarto or RMarkdown,
results='asis'
need not be written in the chunk header.
html.anova.rms
just calls latex.anova.rms
.
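For example, a minimal sketch of switching the output format (assuming f is a fit from one of the examples):

options(prType='html')   # or 'latex'; default is 'plain'
anova(f)                 # print method now dispatches to the html method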
## S3 method for class 'rms'
anova(object, ..., main.effect=FALSE, tol=1e-9,
      test=c('F','Chisq','LR'), india=TRUE, indnl=TRUE, ss=TRUE,
      vnames=c('names','labels'),
      posterior.summary=c('mean', 'median', 'mode'),
      ns=500, cint=0.95)

## S3 method for class 'anova.rms'
print(x, which=c('none','subscripts','names','dots'),
      table.env=FALSE, ...)

## S3 method for class 'anova.rms'
plot(x,
     what=c("chisqminusdf","chisq","aic","P","partial R2",
            "remaining R2", "proportion R2", "proportion chisq"),
     xlab=NULL, pch=16, rm.totals=TRUE, rm.ia=FALSE, rm.other=NULL,
     newnames, sort=c("descending","ascending","none"),
     margin=c('chisq','P'), pl=TRUE, trans=NULL, ntrans=40,
     height=NULL, width=NULL, ...)

## S3 method for class 'anova.rms'
latex(object, title, dec.chisq=2, dec.F=2, dec.ss=NA, dec.ms=NA,
      dec.P=4, dec.REV=3, table.env=TRUE, caption=NULL, fontsize=1,
      params, ...)

## S3 method for class 'anova.rms'
html(object, ...)
object |
a |
... |
If omitted, all variables are tested, yielding tests for individual factors
and for pooled effects. Specify a subset of the variables to obtain tests
for only those factors, with a pooled test for the combined effects
of all factors listed. Names may be abbreviated. For example, specify
Can be optional graphical parameters to send to
For |
main.effect |
Set to |
tol |
singularity criterion for use in matrix inversion |
test |
For an |
india |
set to |
indnl |
set to |
ss |
For an |
vnames |
set to |
posterior.summary |
specifies whether the posterior mode/mean/median beta are to be used as a measure of central tendency of the posterior distribution, for use in relative explained variation from Bayesian models |
ns |
number of random samples from the posterior draws to use for REV highest posterior density intervals |
cint |
HPD interval probability |
x |
for |
which |
If |
what |
what type of statistic to plot. The default is the |
xlab |
x-axis label, default is constructed according to |
pch |
character for plotting dots in dot charts. Default is 16 (solid dot). |
rm.totals |
set to |
rm.ia |
set to |
rm.other |
a list of other predictor names to omit from the chart |
newnames |
a list of substitute predictor names to use, after omitting any. |
sort |
default is to sort bars in descending order of the summary statistic. Available options: 'ascending', 'descending', 'none'. |
margin |
set to a vector of character strings to write text for
selected statistics in the right margin of the dot chart. The
character strings can be any combination of |
pl |
set to |
trans |
set to a function to apply that transformation to the statistics
being plotted, and to truncate negative values at zero. A good choice
is |
ntrans |
|
height , width
|
height and width of |
title |
title to pass to |
dec.chisq |
number of places to the right of the decimal place for typesetting
|
dec.F |
digits to the right for |
dec.ss |
digits to the right for sums of squares (default is |
dec.ms |
digits to the right for mean squares (default is |
dec.P |
digits to the right for |
dec.REV |
digits to the right for REV |
table.env |
see |
caption |
caption for table if |
fontsize |
font size for html output; default is 1 for |
params |
used internally when called through print. |
If the statistics being plotted with plot.anova.rms
are few in
number and one of them is negative or zero, plot.anova.rms
will quit because of an error in dotchart2
.
The latex
method requires LaTeX packages relsize
and
needspace
.
anova.rms returns a matrix of class anova.rms containing factors as rows and chi-square, d.f., and P-values as columns (or d.f., partial SS, MS, F, and P for an ols fit). An attribute vinfo provides a list of the variables involved in each row and the type of test done.
plot.anova.rms
invisibly returns the vector of quantities
plotted. This vector has a names attribute describing the terms for
which the statistics in the vector are calculated.
print
prints, latex
creates a
file with a name of the form "title.tex"
(see the title
argument above).
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
prmiInfo
,
rms
, rmsMisc
, lrtest
,
rms.trans
, summary.rms
, plot.Predict
,
ggplot.Predict
, solvet
,
locator
,
dotchart2
, latex
,
xYplot
, anova.lm
,
contrast.rms
, pantext
require(ggplot2)
n <- 1000    # define sample size
set.seed(17) # so can reproduce the results
treat <- factor(sample(c('a','b','c'), n,TRUE))
num.diseases <- sample(0:4, n,TRUE)
age <- rnorm(n, 50, 10)
cholesterol <- rnorm(n, 200, 25)
weight <- rnorm(n, 150, 20)
sex <- factor(sample(c('female','male'), n,TRUE))
label(age) <- 'Age'   # label is in Hmisc
label(num.diseases) <- 'Number of Comorbid Diseases'
label(cholesterol) <- 'Total Cholesterol'
label(weight) <- 'Weight, lbs.'
label(sex) <- 'Sex'
units(cholesterol) <- 'mg/dl'   # uses units.default in Hmisc

# Specify population model for log odds that Y=1
L <- .1*(num.diseases-2) + .045*(age-50) +
  (log(cholesterol - 10)-5.2)*(-2*(treat=='a') +
                               3.5*(treat=='b')+2*(treat=='c'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)

fit <- lrm(y ~ treat + scored(num.diseases) + rcs(age) +
             log(cholesterol+10) + treat:log(cholesterol+10),
           x=TRUE, y=TRUE)   # x, y needed for test='LR'
a <- anova(fit)                      # Test all factors
b <- anova(fit, treat, cholesterol)  # Test these 2 by themselves
                                     # to get their pooled effects
a
b
a2 <- anova(fit, test='LR')
b2 <- anova(fit, treat, cholesterol, test='LR')
a2
b2

# Add a new line to the plot with combined effects
s <- rbind(a2, 'treat+cholesterol'=b2['TOTAL',])
class(s) <- 'anova.rms'
plot(s, margin=c('chisq', 'proportion chisq'))

g <- lrm(y ~ treat*rcs(age))
dd <- datadist(treat, num.diseases, age, cholesterol)
options(datadist='dd')
p <- Predict(g, age, treat="b")
s <- anova(g)
tx <- paste(capture.output(s), collapse='\n')
ggplot(p) + annotate('text', x=27, y=3.2, family='mono',
                     label=tx, hjust=0, vjust=1, size=1.5)
plot(s, margin=c('chisq', 'proportion chisq'))
# new plot - dot chart of chisq-d.f. with 2 other stats in right margin
# latex(s)   # nice printout - creates anova.g.tex
options(datadist=NULL)

# Simulate data from a given model, and display exactly which
# hypotheses are being tested
set.seed(123)
age <- rnorm(500, 50, 15)
treat <- factor(sample(c('a','b','c'), 500, TRUE))
bp <- rnorm(500, 120, 10)
y <- ifelse(treat=='a', (age-50)*.05, abs(age-50)*.08) + 3*(treat=='c') +
  pmax(bp, 100)*.09 + rnorm(500)
f <- ols(y ~ treat*lsp(age,50) + rcs(bp,4))
print(names(coef(f)), quote=FALSE)
specs(f)
anova(f)
an <- anova(f)
options(digits=3)
print(an, 'subscripts')
print(an, 'dots')

an <- anova(f, test='Chisq', ss=FALSE)
# plot(0:1)   # make some plot
# tab <- pantext(an, 1.2, .6, lattice=FALSE, fontfamily='Helvetica')
# create function to write table; usually omit fontfamily
# tab()   # execute it; could do tab(cex=.65)
plot(an)   # new plot - dot chart of chisq-d.f.
# Specify plot(an, trans=sqrt) to use a square root scale for this plot
# latex(an)   # nice printout - creates anova.f.tex

## Example to save partial R^2 for all predictors, along with overall
## R^2, from two separate fits, and to combine them with ggplot2
require(ggplot2)
set.seed(1)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
y <- (x1-.5)^2 + x2 + runif(n)
group <- c(rep('a', n/2), rep('b', n/2))
A <- NULL
for(g in c('a','b')) {
  f <- ols(y ~ pol(x1,2) + pol(x2,2) + pol(x1,2) %ia% pol(x2,2),
           subset=group==g)
  a <- plot(anova(f),
            what='partial R2', pl=FALSE, rm.totals=FALSE, sort='none')
  a <- a[-grep('NONLINEAR', names(a))]
  d <- data.frame(group=g, Variable=factor(names(a), names(a)),
                  partialR2=unname(a))
  A <- rbind(A, d)
}
ggplot(A, aes(x=partialR2, y=Variable)) + geom_point() +
  facet_wrap(~ group) + xlab(ex <- expression(partial~R^2)) +
  scale_y_discrete(limits=rev)
ggplot(A, aes(x=partialR2, y=Variable, color=group)) + geom_point() +
  xlab(ex <- expression(partial~R^2)) + scale_y_discrete(limits=rev)

# Suppose that a researcher wants to make a big deal about a variable
# because it has the highest adjusted chi-square.  We use the
# bootstrap to derive 0.95 confidence intervals for the ranks of all
# the effects in the model.  We use the plot method for anova, with
# pl=FALSE to suppress actual plotting of chi-square - d.f. for each
# bootstrap repetition.
# It is important to tell plot.anova.rms not to sort the results, or
# every bootstrap replication would have ranks of 1,2,3,... for the stats.
n <- 300
set.seed(1)
d <- data.frame(x1=runif(n), x2=runif(n),  x3=runif(n),  x4=runif(n),
                x5=runif(n), x6=runif(n),  x7=runif(n),  x8=runif(n),
                x9=runif(n), x10=runif(n), x11=runif(n), x12=runif(n))
d$y <- with(d, 1*x1 + 2*x2 + 3*x3 + 4*x4 + 5*x5 + 6*x6 + 7*x7 +
               8*x8 + 9*x9 + 10*x10 + 11*x11 + 12*x12 + 9*rnorm(n))
f <- ols(y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12, data=d)
B <- 20   # actually use B=1000
ranks <- matrix(NA, nrow=B, ncol=12)
rankvars <- function(fit) rank(plot(anova(fit), sort='none', pl=FALSE))
Rank <- rankvars(f)
for(i in 1:B) {
  j <- sample(1:n, n, TRUE)
  bootfit <- update(f, data=d, subset=j)
  ranks[i,] <- rankvars(bootfit)
}
lim <- t(apply(ranks, 2, quantile, probs=c(.025,.975)))
predictor <- factor(names(Rank), names(Rank))
w <- data.frame(predictor, Rank, lower=lim[,1], upper=lim[,2])
ggplot(w, aes(x=predictor, y=Rank)) + geom_point() + coord_flip() +
  scale_y_continuous(breaks=1:12) +
  geom_errorbar(aes(ymin=lim[,1], ymax=lim[,2]), width=0)
bj
fits the Buckley-James distribution-free least squares multiple
regression model to a possibly right-censored response variable.
This model reduces to ordinary least squares if
there is no censoring. By default, model fitting is done after
taking logs of the response variable.
bj
uses the rms
class
for automatic anova
, fastbw
, validate
, Function
, nomogram
,
summary
, plot
, bootcov
, and other functions. The bootcov
function may be worth using with bj
fits, as the properties of the
Buckley-James covariance matrix estimator are not fully known for
strange censoring patterns.
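A minimal sketch (assuming f is a bj fit stored with x=TRUE, y=TRUE):

b <- bootcov(f, B=200)   # bootstrap covariance matrix for the bj fit
anova(b)                 # Wald tests using the bootstrap covariances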
For the print
method, format of output is controlled by the
user previously running options(prType="lang")
where
lang
is "plain"
(the default), "latex"
, or
"html"
. When using html with Quarto or RMarkdown,
results='asis'
need not be written in the chunk header.
The residuals.bj
function exists mainly to compute
residuals and to censor them (i.e., return them as
Surv
objects) just as the original
failure time variable was censored. These residuals are useful for
checking to see if the model also satisfies certain distributional assumptions.
To get these residuals, the fit must have specified y=TRUE
.
The bjplot
function is a special plotting function for objects
created by bj
with x=TRUE, y=TRUE
in effect. It produces three
scatterplots for every covariate in the model: the first plots the
original situation, where censored data are distinguished from
non-censored data by a different plotting symbol. In the second plot,
called a renovated plot, vertical lines show how censored data were
changed by the procedure, and the third is equal to the second, but
without vertical lines. Imputed data are again distinguished from the
non-censored by a different symbol.
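A minimal sketch, using the variables created in the Examples below:

f <- bj(Surv(ftime, stroke) ~ rcs(age,3) + hospital, x=TRUE, y=TRUE)
bjplot(f, which=1)   # original, renovated, and line-free plots for age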
The validate
method for bj
validates the Somers' Dxy
rank
correlation between predicted and observed responses, accounting for censoring.
The primary fitting function for bj
is bj.fit
, which does not
allow missing data and expects a full design matrix as input.
bj(formula, data=environment(formula), subset, na.action=na.delete,
   link="log", control, method='fit', x=FALSE, y=FALSE, time.inc)

## S3 method for class 'bj'
print(x, digits=4, long=FALSE, coefs=TRUE,
      title="Buckley-James Censored Data Regression", ...)

## S3 method for class 'bj'
residuals(object, type=c("censored","censored.normalized"), ...)

bjplot(fit, which=1:dim(X)[[2]])

## S3 method for class 'bj'
validate(fit, method="boot", B=40,
         bw=FALSE, rule="aic", type="residual", sls=.05, aics=0,
         force=NULL, estimates=TRUE, pr=FALSE,
         tol=1e-7, rel.tolerance=1e-3, maxiter=15, ...)

bj.fit(x, y, control)
formula |
an S statistical model formula. Interactions up to third order are
supported. The left hand side must be a |
data , subset , na.action
|
the usual statistical model fitting arguments |
fit |
a fit created by |
x |
a design matrix with or without a first column of ones, to pass
to |
y |
a |
link |
set to, for example, |
control |
a list containing any or all of the following components: |
method |
set to |
time.inc |
setting for default time spacing.
Default is 30 if time variable has |
digits |
number of significant digits to print if not 4. |
long |
set to |
coefs |
specify |
title |
a character string title to be passed to |
object |
the result of |
type |
type of residual desired. Default is censored unnormalized residuals,
defined as link(Y) - linear.predictors, where the
link function was usually the log function. You can specify
|
which |
vector of integers or character strings naming elements of the design
matrix (the names of the original predictors if they entered the model
linearly) for which to have |
B , bw , rule , sls , aics , force , estimates , pr , tol , rel.tolerance , maxiter
|
see
|
... |
ignored for |
The program implements the algorithm as described in the original article by Buckley & James. It also uses the original Buckley & James prescription for computing the variance/covariance estimator. This estimator is based on non-censored observations only and has no theoretical justification, but it has been shown in simulation studies to behave well, and our experience confirms this view. Convergence is rather slow with this method, so you may want to increase the number of iterations; often, particularly with heavy censoring, 100 iterations is not too many. Sometimes the method will not converge, but will instead enter a loop of repeating values (this is due to the discrete nature of the Kaplan-Meier estimator and usually happens with small sample sizes). The program will look for such a loop, return the average betas, issue a warning message, and give the size of the cycle (usually less than 6).
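To raise the iteration limit, pass a control list; a hedged sketch (the component name iter.max is an assumption here — consult the control argument documentation):

f <- bj(Surv(ftime, stroke) ~ age, x=TRUE, y=TRUE,
        control=list(iter.max=100))   # iter.max assumed; see 'control' above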
bj returns a fit object with information similar to what survreg, psm, and cph would store, as well as what rms stores, plus units and time.inc.
residuals.bj
returns a Surv
object. One of the components of the
fit
object produced by bj
(and bj.fit
) is a vector called
stats
which contains the following named elements: "Obs", "Events", "d.f.", "error d.f.", "sigma", "g". Here sigma is the estimate of the residual standard deviation, and g is the g-index. If the link function is "log", the g-index on the anti-log scale is also returned as gr.
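For example, a minimal sketch of extracting these statistics from a fit f:

f$stats['sigma']   # residual standard deviation estimate
f$stats['g']       # g-index on the link scale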
Janez Stare
Department of Biomedical Informatics
Ljubljana University
Ljubljana, Slovenia
[email protected]
Harald Heinzl
Department of Medical Computer Sciences
Vienna University
Vienna, Austria
[email protected]
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
Buckley JJ, James IR. Linear regression with censored data. Biometrika 1979; 66:429–36.
Miller RG, Halpern J. Regression with censored data. Biometrika 1982; 69: 521–31.
James IR, Smith PJ. Consistency results for linear regression with censored data. Ann Statist 1984; 12: 590–600.
Lai TL, Ying Z. Large sample theory of a modified Buckley-James estimator for regression analysis with censored data. Ann Statist 1991; 19: 1370–402.
Hillis SL. Residual plots for the censored data linear regression model. Stat in Med 1995; 14: 2023–2036.
Jin Z, Lin DY, Ying Z. On least-squares regression with censored data. Biometrika 2006; 93:147–161.
rms
, psm
, survreg
,
cph
, Surv
,
na.delete
,
na.detail.response
, datadist
,
rcorr.cens
, GiniMd
,
prModFit
, dxy.cens
require(survival)
suppressWarnings(RNGversion("3.5.0"))
set.seed(1)
ftime <- 10*rexp(200)
stroke <- ifelse(ftime > 10, 0, 1)
ftime <- pmin(ftime, 10)
units(ftime) <- "Month"
age <- rnorm(200, 70, 10)
hospital <- factor(sample(c('a','b'),200,TRUE))
dd <- datadist(age, hospital)
options(datadist="dd")

# Prior to rms 6.0 and R 4.0 the following worked with 5 knots
f <- bj(Surv(ftime, stroke) ~ rcs(age,3) + hospital, x=TRUE, y=TRUE)
# add link="identity" to use a censored normal regression model instead
# of a lognormal one
anova(f)
fastbw(f)
validate(f, B=15)
plot(Predict(f, age, hospital))   # needs datadist since no explicit age, hosp.
coef(f)   # look at regression coefficients
coef(psm(Surv(ftime, stroke) ~ rcs(age,3) + hospital, dist='lognormal'))
# compare with coefficients from likelihood-based
# log-normal regression model
# use dist='gau' not under R

r <- resid(f, 'censored.normalized')
survplot(npsurv(r ~ 1), conf='none')
# plot Kaplan-Meier estimate of
# survival function of standardized residuals
survplot(npsurv(r ~ cut2(age, g=2)), conf='none')
# may desire both strata to be n(0,1)
options(datadist=NULL)
This function constructs an object resembling one produced by the
boot
package's boot
function, and runs that package's
boot.ci
function to compute BCa and percentile confidence limits.
bootBCa
can provide separate confidence limits for a vector of
statistics when estimate
has length greater than 1. In that
case, estimates
must have the same number of columns as
estimate
has values.
bootBCa(estimate, estimates, type=c('percentile','bca','basic'), n, seed, conf.int = 0.95)
estimate |
original whole-sample estimate |
estimates |
vector of bootstrap estimates, or a matrix with one column per statistic |
type |
type of confidence interval, defaulting to nonparametric percentile |
n |
original number of observations |
seed |
|
conf.int |
confidence level |
a 2-vector if estimate
is of length 1, otherwise a matrix
with 2 rows and number of columns equal to the length of
estimate
You can use if(!exists('.Random.seed')) runif(1)
before running
your bootstrap to make sure that .Random.seed
will be available
to bootBCa
.
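A minimal sketch of the pattern described above:

if (!exists('.Random.seed')) runif(1)   # guarantee .Random.seed exists
seed <- .Random.seed                    # capture before bootstrapping
b <- bootcov(f)                         # later: bootBCa(..., seed=seed)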
Frank Harrell
## Not run:
x1 <- runif(100); x2 <- runif(100); y <- sample(0:1, 100, TRUE)
f <- lrm(y ~ x1 + x2, x=TRUE, y=TRUE)
seed <- .Random.seed
b <- bootcov(f)
# Get estimated log odds at x1=.4, x2=.6
X <- cbind(c(1,1), x1=c(.4,2), x2=c(.6,3))
est <- X %*% coef(f)
ests <- t(X %*% t(b$boot.Coef))
bootBCa(est, ests, n=100, seed=seed)
bootBCa(est, ests, type='bca', n=100, seed=seed)
bootBCa(est, ests, type='basic', n=100, seed=seed)
## End(Not run)
bootcov
computes a bootstrap estimate of the covariance matrix for a set
of regression coefficients from ols
, lrm
, cph
,
psm
, Rq
, and any
other fit where x=TRUE, y=TRUE
was used to store the data used in making
the original regression fit and where an appropriate fitter
function
is provided here. The estimates obtained are not conditional on
the design matrix, but are instead unconditional estimates. For
small sample sizes, this will make a difference as the unconditional
variance estimates are larger. This function will also obtain
bootstrap estimates corrected for cluster sampling (intra-cluster
correlations) when a "working independence" model was used to fit
data which were correlated within clusters. This is done by substituting
cluster sampling with replacement for the usual simple sampling with
replacement. bootcov
has an option (coef.reps
) that causes all
of the regression coefficient estimates from all of the bootstrap
re-samples to be saved, facilitating computation of nonparametric
bootstrap confidence limits and plotting of the distributions of the
coefficient estimates (using histograms and kernel smoothing estimates).
The loglik
option facilitates the calculation of simultaneous
confidence regions from quantities of interest that are functions of
the regression coefficients, using the method of Tibshirani (1996). With Tibshirani's method, one computes the objective criterion (-2 log likelihood evaluated at the bootstrap estimate of beta, but with respect to the original design matrix and response vector) for the original fit as well as for all of the bootstrap fits. The confidence set of the regression coefficients is the set of all coefficients associated with objective function values that are less than or equal to, say, the 0.95 quantile of the vector of B + 1 objective function values. For the coefficients satisfying this condition,
predicted values are computed at a user-specified design matrix X
,
and minima and maxima of these predicted values (over the qualifying
bootstrap repetitions) are computed to derive the final simultaneous
confidence band.
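In outline, a sketch (assuming b came from bootcov(f, loglik=TRUE), X is a user-specified design matrix, and, as in the bootstrap bumping example below, the last element of boot.loglik corresponds to the original fit):

crit <- quantile(b$boot.loglik, 0.95)   # 0.95 quantile of the B + 1 values
ok   <- b$boot.loglik[-length(b$boot.loglik)] <= crit   # qualifying resamples
pred <- X %*% t(b$boot.Coef[ok, , drop=FALSE])          # predictions at X
cbind(lower=apply(pred, 1, min), upper=apply(pred, 1, max))  # simultaneous band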
The bootplot
function takes the output of bootcov
and
either plots a histogram and kernel density
estimate of specified regression coefficients (or linear combinations
of them through the use of a specified design matrix X
), or a
qqnorm
plot of the quantities of interest to check for normality of
the maximum likelihood estimates. bootplot
draws vertical lines at
specified quantiles of the bootstrap distribution, and returns these
quantiles for possible printing by the user. Bootstrap estimates may
optionally be transformed by a user-specified function fun
before
plotting.
The confplot
function also uses the output of bootcov
but to
compute and optionally plot nonparametric bootstrap pointwise confidence
limits or (by default) Tibshirani (1996) simultaneous confidence sets.
A design matrix must be specified to allow confplot
to compute
quantities of interest such as predicted values across a range
of values or differences in predicted values (plots of effects of
changing one or more predictor variable values).
bootplot
and confplot
are actually generic functions, with
the particular functions bootplot.bootcov
and confplot.bootcov
automatically invoked for bootcov
objects.
A service function called histdensity
is also provided (for use with
bootplot
). It runs hist
and density
on the same plot, using twice the default number of classes for hist, and 1.5 times the default width used by density.
A comprehensive example demonstrates the use of all of the functions.
bootcov(fit, cluster, B=200, fitter, coef.reps=TRUE, loglik=FALSE,
        pr=FALSE, maxit=15, eps=0.0001, group=NULL, stat=NULL,
        seed=sample(10000, 1))

bootplot(obj, which=1 : ncol(Coef), X,
         conf.int=c(.9,.95,.99),
         what=c('density', 'qqnorm', 'box'),
         fun=function(x) x, labels., ...)

confplot(obj, X, against,
         method=c('simultaneous','pointwise'),
         conf.int=0.95, fun=function(x)x,
         add=FALSE, lty.conf=2, ...)

histdensity(y, xlab, nclass, width, mult.width=1, ...)
fit |
a fit object containing components |
obj |
an object created by |
X |
a design matrix specified to |
y |
a vector to pass to |
cluster |
a variable indicating groupings. |
B |
number of bootstrap repetitions. Default is 200. |
fitter |
the name of a function with arguments |
coef.reps |
set to |
loglik |
set to |
pr |
set to |
maxit |
maximum number of iterations, to pass to |
eps |
argument to pass to various fitters |
group |
a grouping variable used to stratify the sample upon bootstrapping.
This allows one to handle k-sample problems, i.e., each bootstrap
sample will be forced to select the same number of observations from
each level of group as the number appearing in the original dataset.
You may specify both |
stat |
a single character string specifying the name of a |
seed |
random number seed for |
which |
one or more integers specifying which regression coefficients to
plot for |
conf.int |
a vector (for |
what |
for |
fun |
for |
labels. |
a vector of labels for labeling the axes in plots produced by |
... |
For |
against |
For |
method |
specifies whether |
add |
set to |
lty.conf |
line type for plotting confidence bands in |
xlab |
label for x-axis for |
nclass |
passed to |
width |
passed to |
mult.width |
multiplier by which to adjust the default |
If the fit has a scale parameter (e.g., a fit from psm
), the log
of the individual bootstrap scale estimates are added to the vector of parameter estimates, and a column and row for the log scale are
added to the new covariance matrix (the old covariance matrix also
has this row and column).
For Rq
fits, the tau
, method
, and hs
arguments are taken from the original fit.
a new fit object with class of the original object and with the element
orig.var
added. orig.var
is
the covariance matrix of the original fit. Also, the original var
component is replaced with the new bootstrap estimates. The component
boot.coef
is also added. This contains the mean bootstrap estimates
of regression coefficients (with a log scale element added if
applicable). boot.Coef
is added if coef.reps=TRUE
.
boot.loglik
is added if loglik=TRUE
. If stat
is
specified an additional vector boot.stats
will be contained in
the returned object. B
contains the number of successfully fitted
bootstrap resamples. A component
clusterInfo
is added to contain elements name
and n
holding the name of the cluster
variable and the number of clusters.
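A minimal sketch of inspecting the returned components (assuming g came from bootcov with coef.reps=TRUE):

g$orig.var      # covariance matrix of the original fit
g$var           # bootstrap covariance estimates
g$boot.Coef     # matrix of bootstrap coefficient estimates
g$clusterInfo   # cluster variable name and number of clusters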
bootplot
returns a vector (or matrix) of quantities of interest and
the requested quantiles of them. confplot
returns three vectors:
fitted
, lower
, and upper
.
bootcov
prints progress information if pr=TRUE.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
Bill Pikounis
Biometrics Research Department
Merck Research Laboratories
https://billpikounis.com/wpb/
Feng Z, McLerran D, Grizzle J (1996): A comparison of statistical methods for clustered data analysis with Gaussian error. Stat in Med 15:1793–1806.
Tibshirani R, Knight K (1996): Model search and inference by bootstrap
"bumping". Department of Statistics, University of Toronto. Technical
report available from
http://www-stat.stanford.edu/~tibs/.
Presented at the Joint Statistical Meetings,
Chicago, August 1996.
robcov
, sample
, rms
,
lm.fit
, lrm.fit
,
survival-internal
,
predab.resample
, rmsMisc
,
Predict
, gendata
,
contrast.rms
, Predict
, setPb
,
multiwayvcov::cluster.boot
set.seed(191)
x <- exp(rnorm(200))
logit <- 1 + x/2
y <- ifelse(runif(200) <= plogis(logit), 1, 0)
f <- lrm(y ~ pol(x,2), x=TRUE, y=TRUE)
g <- bootcov(f, B=50, pr=TRUE, seed=3)
anova(g)    # using bootstrap covariance estimates
fastbw(g)   # using bootstrap covariance estimates
beta <- g$boot.Coef[,1]
hist(beta, nclass=15)   # look at normality of parameter estimates
qqnorm(beta)
# bootplot would be better than these last two commands

# A dataset contains a variable number of observations per subject,
# and all observations are laid out in separate rows.  The responses
# represent whether or not a given segment of the coronary arteries
# is occluded.  Segments of arteries may not operate independently
# in the same patient.  We assume a "working independence model" to
# get estimates of the coefficients, i.e., that estimates assuming
# independence are reasonably efficient.  The job is then to get
# unbiased estimates of variances and covariances of these estimates.
set.seed(2)
n.subjects <- 30
ages <- rnorm(n.subjects, 50, 15)
sexes <- factor(sample(c('female','male'), n.subjects, TRUE))
logit <- (ages-50)/5
prob <- plogis(logit)   # true prob not related to sex
id <- sample(1:n.subjects, 300, TRUE)   # subjects sampled multiple times
table(table(id))   # frequencies of number of obs/subject
age <- ages[id]
sex <- sexes[id]
# In truth, observations within subject are independent:
y <- ifelse(runif(300) <= prob[id], 1, 0)
f <- lrm(y ~ lsp(age,50)*sex, x=TRUE, y=TRUE)
g <- bootcov(f, id, B=50, seed=3)   # usually do B=200 or more
diag(g$var)/diag(f$var)
# add ,group=w to re-sample from within each level of w
anova(g)      # cluster-adjusted Wald statistics
# fastbw(g)   # cluster-adjusted backward elimination
plot(Predict(g, age=30:70, sex='female'))   # cluster-adjusted confidence bands

# Get design effects based on inflation of the variances when compared
# with bootstrap estimates which ignore clustering
g2 <- bootcov(f, B=50, seed=3)
diag(g$var)/diag(g2$var)

# Get design effects based on pooled tests of factors in model
anova(g2)[,1] / anova(g)[,1]

# Simulate binary data where there is a strong
# age x sex interaction with linear age effects
# for both sexes, but where, not knowing that,
# we fit a quadratic model.  Use the bootstrap
# to get bootstrap distributions of various
# effects, and to get pointwise and simultaneous
# confidence limits
set.seed(71)
n <- 500
age <- rnorm(n, 50, 10)
sex <- factor(sample(c('female','male'), n, rep=TRUE))
L <- ifelse(sex=='male', 0, .1*(age-50))
y <- ifelse(runif(n)<=plogis(L), 1, 0)
f <- lrm(y ~ sex*pol(age,2), x=TRUE, y=TRUE)
b <- bootcov(f, B=50, loglik=TRUE, pr=TRUE, seed=3)   # better: B=500

par(mfrow=c(2,3))
# Assess normality of regression estimates
bootplot(b, which=1:6, what='qq')
# They appear somewhat non-normal
# Plot histograms and estimated densities
# for 6 coefficients
w <- bootplot(b, which=1:6)
# Print bootstrap quantiles
w$quantiles
# Show box plots for bootstrap reps for all coefficients
bootplot(b, what='box')

# Estimate regression function for females
# for a sequence of ages
ages <- seq(25, 75, length=100)
label(ages) <- 'Age'

# Plot fitted function and pointwise normal-
# theory confidence bands
par(mfrow=c(1,1))
p <- Predict(f, age=ages, sex='female')
plot(p)
# Save curve coordinates for later automatic
# labeling using labcurve in the Hmisc library
curves <- vector('list',8)
curves[[1]] <- with(p, list(x=age, y=lower))
curves[[2]] <- with(p, list(x=age, y=upper))

# Add pointwise normal-distribution confidence
# bands using unconditional variance-covariance
# matrix from the 500 bootstrap reps
p <- Predict(b, age=ages, sex='female')
curves[[3]] <- with(p, list(x=age, y=lower))
curves[[4]] <- with(p, list(x=age, y=upper))

dframe <- expand.grid(sex='female', age=ages)
X <- predict(f, dframe, type='x')   # Full design matrix

# Add pointwise bootstrap nonparametric
# confidence limits
p <- confplot(b, X=X, against=ages, method='pointwise',
              add=TRUE, lty.conf=4)
curves[[5]] <- list(x=ages, y=p$lower)
curves[[6]] <- list(x=ages, y=p$upper)

# Add simultaneous bootstrap confidence band
p <- confplot(b, X=X, against=ages, add=TRUE, lty.conf=5)
curves[[7]] <- list(x=ages, y=p$lower)
curves[[8]] <- list(x=ages, y=p$upper)
lab <- c('a','a','b','b','c','c','d','d')
labcurve(curves, lab, pl=TRUE)

# Now get bootstrap simultaneous confidence set for
# female:male odds ratios for a variety of ages
dframe <- expand.grid(age=ages, sex=c('female','male'))
X <- predict(f, dframe, type='x')   # design matrix
f.minus.m <- X[1:100,] - X[101:200,]
# First 100 rows are for females.  By subtracting design
# matrices we are able to get Xf*Beta - Xm*Beta = (Xf - Xm)*Beta
confplot(b, X=f.minus.m, against=ages, method='pointwise',
         ylab='F:M Log Odds Ratio')
confplot(b, X=f.minus.m, against=ages, lty.conf=3, add=TRUE)

# contrast.rms makes it easier to compute the design matrix for use
# in bootstrapping contrasts:
f.minus.m <- contrast(f, list(sex='female', age=ages),
                         list(sex='male',   age=ages))$X
confplot(b, X=f.minus.m)

# For a quadratic binary logistic regression model use bootstrap
# bumping to estimate coefficients under a monotonicity constraint
set.seed(177)
n <- 400
x <- runif(n)
logit <- 3*(x^2-1)
y <- rbinom(n, size=1, prob=plogis(logit))
f <- lrm(y ~ pol(x,2), x=TRUE, y=TRUE)
k <- coef(f)
k
vertex <- -k[2]/(2*k[3])
vertex
# Outside [0,1] so fit satisfies monotonicity constraint within
# x in [0,1], i.e., original fit is the constrained MLE

g <- bootcov(f, B=50, coef.reps=TRUE, loglik=TRUE, seed=3)
bootcoef <- g$boot.Coef   # 100x3 matrix
vertex <- -bootcoef[,2]/(2*bootcoef[,3])
table(cut2(vertex, c(0,1)))
mono <- !(vertex >= 0 & vertex <= 1)
mean(mono)   # estimate of Prob{monotonicity in [0,1]}
var(bootcoef)          # var-cov matrix for unconstrained estimates
var(bootcoef[mono,])   # for constrained estimates

# Find second-best vector of coefficient estimates, i.e., best
# from among bootstrap estimates
g$boot.Coef[order(g$boot.loglik[-length(g$boot.loglik)])[1],]
# Note closeness to MLE

## Not run:
# Get the bootstrap distribution of the difference in two ROC areas for
# two binary logistic models fitted on the same dataset.  This analysis
# does not adjust for the bias in ROC area (C-index) due to overfitting.
# The same random number seed is used in two runs to enforce pairing.
set.seed(17)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- sample(0:1, 100, TRUE)
f <- lrm(y ~ x1, x=TRUE, y=TRUE)
g <- lrm(y ~ x1 + x2, x=TRUE, y=TRUE)
f <- bootcov(f, stat='C', seed=4)
g <- bootcov(g, stat='C', seed=4)
dif <- g$boot.stats - f$boot.stats
hist(dif)
quantile(dif, c(.025,.25,.5,.75,.975))
# Compute a z-test statistic.  Note that comparing ROC areas is far less
# powerful than likelihood or Brier score-based methods
z <- (g$stats['C'] - f$stats['C'])/sd(dif)
names(z) <- NULL
c(z=z, P=2*pnorm(-abs(z)))
## End(Not run)
Uses lattice graphics and the output from Predict
to plot image,
contour, or perspective plots showing the simultaneous effects of two
continuous predictor variables. Unless formula is provided, the x-axis is constructed from the first variable listed in the call to Predict, and the y-axis variable comes from the second.
The perimeter
function is used to generate the boundary of data
to plot when a 3-d plot is made. It finds the area where there are
sufficient data to generate believable interaction fits.
bplot(x, formula, lfun=lattice::levelplot, xlab, ylab, zlab,
      adj.subtitle=!info$ref.zero, cex.adj=.75, cex.lab=1,
      perim, showperim=FALSE,
      zlim=range(yhat, na.rm=TRUE),
      scales=list(arrows=FALSE),
      xlabrot, ylabrot, zlabrot=90, ...)

perimeter(x, y, xinc=diff(range(x))/10, n=10, lowess.=TRUE)
x |
for |
formula |
a formula of the form |
lfun |
a high-level lattice plotting function that takes formulas of the
form |
xlab |
Character string label for |
ylab |
Character string label for |
zlab |
Character string |
adj.subtitle |
Set to |
cex.adj |
|
cex.lab |
|
perim |
names a matrix created by |
showperim |
set to |
zlim |
Controls the range for plotting in the |
scales |
see |
xlabrot |
rotation angle for the x-axis. Default is 30 for
|
ylabrot |
rotation angle for the y-axis. Default is -40 for
|
zlabrot |
rotation angle for z-axis rotation for
|
... |
other arguments to pass to the lattice function |
y |
second variable of the pair for |
xinc |
increment in |
n |
within intervals of |
lowess. |
set to |
perimeter
is a kind of generalization of datadist
for 2
continuous variables. First, the n
smallest and largest x
values are determined. These form the lowest and highest possible
x
values to display. Then x
is grouped into intervals bounded
by these two numbers, with the interval widths defined by xinc
.
Within each interval, y is sorted and the n-th smallest and largest y are taken as the interval containing sufficient data
density to plot interaction surfaces. The interval is ignored when
there are insufficient y
values. When the data are being
readied for persp
, bplot
uses the approx
function to do
linear interpolation of the y
-boundaries as a function of the
x
values actually used in forming the grid (the values of the
first variable specified to Predict
). To make the perimeter smooth,
specify lowess.=TRUE
to perimeter
.
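A minimal sketch of restricting the display to regions with enough data (assuming fit, age, and cholesterol from the Examples below):

bounds <- perimeter(age, cholesterol, n=10, lowess.=TRUE)
p <- Predict(fit, age, cholesterol)
bplot(p, perim=bounds, showperim=TRUE)   # outline and blank out sparse regions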
perimeter
returns a matrix of class perimeter
. This
outline can be conveniently plotted by lines.perimeter
.
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
datadist
, Predict
,
rms
, rmsMisc
, levelplot
,
contourplot
, wireframe
n <- 1000    # define sample size
set.seed(17) # so can reproduce the results
age <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
cholesterol <- rnorm(n, 200, 25)
sex <- factor(sample(c('female','male'), n,TRUE))
label(age) <- 'Age'   # label is in Hmisc
label(cholesterol) <- 'Total Cholesterol'
label(blood.pressure) <- 'Systolic Blood Pressure'
label(sex) <- 'Sex'
units(cholesterol) <- 'mg/dl'   # uses units.default in Hmisc
units(blood.pressure) <- 'mmHg'

# Specify population model for log odds that Y=1
L <- .4*(sex=='male') + .045*(age-50) +
  (log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)

ddist <- datadist(age, blood.pressure, cholesterol, sex)
options(datadist='ddist')

fit <- lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)),
           x=TRUE, y=TRUE)
p <- Predict(fit, age, cholesterol, sex, np=50)   # vary sex last
require(lattice)
bplot(p)   # image plot for age, cholesterol with color
           # coming from yhat; use default ranges for
           # both continuous predictors; two panels (for sex)
bplot(p, lfun=wireframe)   # same as bplot(p,,wireframe)
# View from different angle, change y label orientation accordingly
# Default is z=40, x=-60
bplot(p,, wireframe, screen=list(z=40, x=-75), ylabrot=-25)
bplot(p,, contourplot)   # contour plot
bounds <- perimeter(age, cholesterol, lowess=TRUE)
plot(age, cholesterol)   # show bivariate data density and perimeter
lines(bounds[,c('x','ymin')]); lines(bounds[,c('x','ymax')])
p <- Predict(fit, age, cholesterol)   # use only one sex
bplot(p, perim=bounds)   # draws image() plot
                         # don't show estimates where data are sparse
                         # doesn't make sense here since vars don't interact
bplot(p, plogis(yhat) ~ age*cholesterol)   # Probability scale
options(datadist=NULL)
Uses bootstrapping or cross-validation to get bias-corrected (overfitting-corrected) estimates of predicted vs. observed values based on subsetting predictions into intervals (for survival models) or on nonparametric smoothers (for other models). There are calibration functions for Cox (cph), parametric survival models (psm), binary and ordinal logistic models (lrm), and ordinary least squares (ols). For survival models, "predicted" means predicted survival probability at a single time point, and "observed" refers to the corresponding Kaplan-Meier survival estimate, stratifying on intervals of predicted survival, or, if the polspline package is installed, the predicted survival probability as a function of transformed predicted survival probability using the flexible hazard regression approach (see the val.surv function for details). For logistic and linear models, a nonparametric calibration curve is estimated over a sequence of predicted values. The fit must have specified x=TRUE, y=TRUE. The print and plot methods for lrm and ols models (which use calibrate.default) print the mean absolute error in predictions, the mean squared error, and the 0.9 quantile of the absolute error. Here, error refers to the difference between the predicted values and the corresponding bias-corrected calibrated values.

Below, the second, third, and fourth invocations of calibrate are, respectively, for ols and lrm, cph, and psm. The first and second plot invocations are, respectively, for lrm and ols fits and for all other fits.
calibrate(fit, ...)

## Default S3 method:
calibrate(fit, predy,
  method=c("boot","crossvalidation",".632","randomization"),
  B=40, bw=FALSE, rule=c("aic","p"),
  type=c("residual","individual"),
  sls=.05, aics=0, force=NULL, estimates=TRUE,
  pr=FALSE, kint, smoother="lowess", digits=NULL, ...)

## S3 method for class 'cph'
calibrate(fit, cmethod=c('hare', 'KM'),
  method="boot", u, m=150, pred, cuts, B=40,
  bw=FALSE, rule="aic", type="residual", sls=0.05, aics=0, force=NULL,
  estimates=TRUE, pr=FALSE, what="observed-predicted",
  tol=1e-12, maxdim=5, ...)

## S3 method for class 'psm'
calibrate(fit, cmethod=c('hare', 'KM'),
  method="boot", u, m=150, pred, cuts, B=40,
  bw=FALSE, rule="aic", type="residual", sls=.05, aics=0, force=NULL,
  estimates=TRUE, pr=FALSE, what="observed-predicted",
  tol=1e-12, maxiter=15, rel.tolerance=1e-5, maxdim=5, ...)

## S3 method for class 'calibrate'
print(x, B=Inf, ...)

## S3 method for class 'calibrate.default'
print(x, B=Inf, ...)

## S3 method for class 'calibrate'
plot(x, xlab, ylab, subtitles=TRUE, conf.int=TRUE,
     cex.subtitles=.75, riskdist=TRUE, add=FALSE,
     scat1d.opts=list(nhistSpike=200), par.corrected=NULL, ...)

## S3 method for class 'calibrate.default'
plot(x, xlab, ylab, xlim, ylim, legend=TRUE, subtitles=TRUE,
     cex.subtitles=.75, riskdist=TRUE,
     scat1d.opts=list(nhistSpike=200), ...)
fit |
a fit from |
x |
an object created by |
method , B , bw , rule , type , sls , aics , force , estimates
|
see |
cmethod |
method for validating survival predictions using
right-censored data. The default is |
u |
the time point for which to validate predictions for survival
models. For |
m |
group predicted |
pred |
vector of predicted survival probabilities at which to evaluate the
calibration curve. By default, the low and high prediction values
from |
cuts |
actual cut points for predicted survival probabilities. You may
specify only one of |
pr |
set to |
what |
The default is |
tol |
criterion for matrix singularity (default is |
maxdim |
see |
maxiter |
for |
rel.tolerance |
parameter passed to
|
predy |
a scalar or vector of predicted values to calibrate (for |
kint |
For an ordinal logistic model the default predicted
probability that |
smoother |
a function in two variables which produces |
digits |
If specified, predicted values are rounded to
|
... |
other arguments to pass to |
xlab |
defaults to "Predicted x-units Survival" or to a suitable label for other models |
ylab |
defaults to "Fraction Surviving x-units" or to a suitable label for other models |
xlim , ylim
|
2-vectors specifying x- and y-axis limits, if not using defaults |
subtitles |
set to |
conf.int |
set to |
cex.subtitles |
character size for plotting subtitles |
riskdist |
set to |
add |
set to |
scat1d.opts |
a list specifying options to send to |
par.corrected |
a list specifying graphics parameters |
legend |
set to |
If the fit was created using penalized maximum likelihood estimation,
the same penalty
and penalty.scale
parameters are used during
validation.
matrix specifying mean predicted survival in each interval, the
corresponding estimated bias-corrected Kaplan-Meier estimates,
number of subjects, and other statistics. For linear and logistic models,
the matrix instead has rows corresponding to the prediction points, and
the vector of predicted values being validated is returned as an attribute.
The returned object has class "calibrate"
or
"calibrate.default"
.
plot.calibrate.default
invisibly returns the vector of estimated
prediction errors corresponding to the dataset used to fit the model.
As a side effect, calibrate prints, and stores an object pred.obs or .orig.cal.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
validate
, predab.resample
,
groupkm
, errbar
,
scat1d
, cph
, psm
,
lowess
,fit.mult.impute
,
processMI
require(survival)
set.seed(1)
n <- 200
d.time <- rexp(n)
x1 <- runif(n)
x2 <- factor(sample(c('a', 'b', 'c'), n, TRUE))
f <- cph(Surv(d.time) ~ pol(x1,2) * x2, x=TRUE, y=TRUE,
         surv=TRUE, time.inc=1.5)
#or f <- psm(S ~ ...)
pa <- requireNamespace('polspline')
if(pa) {
  cal <- calibrate(f, u=1.5, B=20)  # cmethod='hare'
  plot(cal)
}
cal <- calibrate(f, u=1.5, cmethod='KM', m=50, B=20)  # usually B=200 or 300
plot(cal, add=pa)

set.seed(1)
y  <- sample(0:2, n, TRUE)
x1 <- runif(n)
x2 <- runif(n)
x3 <- runif(n)
x4 <- runif(n)
f  <- lrm(y ~ x1 + x2 + x3 * x4, x=TRUE, y=TRUE)
cal <- calibrate(f, kint=2, predy=seq(.2, .8, length=60),
                 group=y)
# group= does k-sample validation: make resamples have same
# numbers of subjects in each level of y as original sample
plot(cal)

#See the example for the validate function for a method of validating
#continuation ratio ordinal logistic models.  You can do the same
#thing for calibrate
This function computes one or more contrasts of the estimated regression coefficients in a fit from one of the functions in rms, along with standard errors, confidence limits, t or Z statistics, and P-values. General contrasts are handled by obtaining the design matrix for two sets of predictor settings (a, b) and subtracting the corresponding rows of the two design matrices to obtain a new contrast design matrix for testing the a - b differences. This allows for quite general contrasts (e.g., estimated differences in means between a 30 year old female and a 40 year old male).
This can also be used
to obtain a series of contrasts in the presence of interactions (e.g.,
female:male log odds ratios for several ages when the model contains
age by sex interaction). Another use of contrast
is to obtain
center-weighted (Type III test) and subject-weighted (Type II test)
estimates in a model containing treatment by center interactions. For
the latter case, you can specify type="average"
and an optional
weights
vector to average the within-center treatment contrasts.
The design contrast matrix computed by contrast.rms
can be used
by other functions.
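To make the row-subtraction idea concrete, here is a hedged sketch (with illustrative simulated variables, not from the package documentation) that builds the a - b contrast design matrix by hand using predict(..., type='x') and checks it against contrast:

require(rms)
set.seed(3)
age <- rnorm(200, 40, 10)
sex <- factor(sample(c('female','male'), 200, TRUE))
y   <- ifelse(runif(200) < plogis((sex=='male') + (age-40)/10), 1, 0)
dd  <- datadist(age, sex); options(datadist='dd')
f   <- lrm(y ~ age * sex)
Xa  <- predict(f, data.frame(age=30, sex='female'), type='x')
Xb  <- predict(f, data.frame(age=40, sex='male'),   type='x')
X   <- Xa - Xb                  # contrast design matrix for a - b
drop(X %*% coef(f)[-1])         # estimated a - b difference in log odds
# should agree with the Contrast component of:
contrast(f, list(age=30, sex='female'), list(age=40, sex='male'))
options(datadist=NULL)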
When the model was fitted by a Bayesian function such as blrm
,
highest posterior density intervals for contrasts are computed instead, along with the
posterior probability that the contrast is positive.
posterior.summary
specifies whether posterior mean/median/mode is
to be used for contrast point estimates.
contrast.rms
also allows one to specify four settings to
contrast, yielding contrasts that are double differences - the
difference between the first two settings (a
- b
) and the
last two (a2
- b2
). This allows assessment of interactions.
If usebootcoef=TRUE
, the fit was run through bootcov
, and
conf.type="individual"
, the confidence intervals are bootstrap
nonparametric percentile confidence intervals, basic bootstrap, or BCa
intervals, obtained on contrasts evaluated on all bootstrap samples.
By omitting the b
argument, contrast
can be used to obtain
an average or weighted average of a series of predicted values, along
with a confidence interval for this average. This can be useful for
"unconditioning" on one of the predictors (see the next to last
example).
Specifying type="joint"
, and specifying at least as many contrasts
as needed to span the space of a complex test, one can make
multiple degree of freedom tests flexibly and simply. Redundant
contrasts will be ignored in the joint test. See the examples below.
These include an example of an "incomplete interaction test" involving
only two of three levels of a categorical variable (the test also tests
the main effect).
When more than one contrast is computed, the list created by
contrast.rms
is suitable for plotting (with error bars or bands)
with xYplot
or Dotplot
(see the last example before the
type="joint"
examples).
When fit
is the result of a Bayesian model fit and fun
is
specified, contrast.rms
operates altogether differently. a
and b
must both be specified and a2, b2
not specified.
fun
is evaluated on the estimates
separately on a
and b
and the subtraction is deferred. So
even in the absence of interactions, when fun
is nonlinear, the
settings of factors (predictors) will not cancel out and estimates of
differences will be covariate-specific (unless there are no covariates
in the model besides the one being varied to get from a
to b
).
contrast(fit, ...)

## S3 method for class 'rms'
contrast(fit, a, b, a2, b2, ycut=NULL, cnames=NULL,
         fun=NULL, funint=TRUE,
         type=c("individual", "average", "joint"),
         conf.type=c("individual","simultaneous"), usebootcoef=TRUE,
         boot.type=c("percentile","bca","basic"),
         posterior.summary=c('mean', 'median', 'mode'),
         weights="equal", conf.int=0.95, tol=1e-7, expand=TRUE, ...)

## S3 method for class 'contrast.rms'
print(x, X=FALSE,
      fun=function(u)u, jointonly=FALSE, prob=0.95, ...)
fit |
a fit of class |
a |
a list containing settings for all predictors that you do not wish to
set to default (adjust-to) values. Usually you will specify two
variables in this list, one set to a constant and one to a sequence of
values, to obtain contrasts for the sequence of values of an
interacting factor. The |
b |
another list that generates the same number of observations as |
a2 |
an optional third list of settings of predictors |
b2 |
an optional fourth list of settings of predictors. Mandatory
if |
ycut |
used if the fit is a constrained partial proportional odds
model, to specify the single value or vector of values
(corresponding to the multiple contrasts) of the response
variable to use in forming contrasts. When there is
non-proportional odds, odds ratios will vary over levels of the
response variable. When there are multiple contrasts and only
one value is given for |
cnames |
vector of character strings naming the contrasts when
|
fun |
a function to evaluate on the linear predictor for each of
|
type |
set |
conf.type |
The default type of confidence interval computed for a given
individual (1 d.f.) contrast is a pointwise confidence interval. Set
|
usebootcoef |
If |
boot.type |
set to |
posterior.summary |
By default the posterior mean is used.
Specify |
weights |
a numeric vector, used when |
conf.int |
confidence level for confidence intervals for the contrasts (HPD interval probability for Bayesian analyses) |
tol |
tolerance for |
expand |
set to |
... |
passed to |
x |
result of |
X |
set |
funint |
set to |
jointonly |
set to |
prob |
highest posterior density interval probability when the fit
was Bayesian and |
a list of class "contrast.rms"
containing the elements
Contrast
, SE
, Z
, var
, df.residual
Lower
, Upper
, Pvalue
, X
, cnames
, redundant
, which denote the contrast
estimates, standard errors, Z or t-statistics, variance matrix,
residual degrees of freedom (this is NULL
if the model was not
ols
), lower and upper confidence limits, 2-sided P-value, design
matrix, contrast names (or NULL
), and a logical vector denoting
which contrasts are redundant with the other contrasts. If there are
any redundant contrasts, when the results of contrast are printed, an asterisk is printed at the start of the corresponding lines. The object also contains ctype, indicating what method was used to compute confidence intervals.
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
[email protected]
Predict
, gendata
, bootcov
,
summary.rms
, anova.rms
,
require(ggplot2)
set.seed(1)
age <- rnorm(200,40,12)
sex <- factor(sample(c('female','male'),200,TRUE))
logit <- (sex=='male') + (age-40)/5
y <- ifelse(runif(200) <= plogis(logit), 1, 0)
f <- lrm(y ~ pol(age,2)*sex)
anova(f)
# Compare a 30 year old female to a 40 year old male
# (with or without age x sex interaction in the model)
contrast(f, list(sex='female', age=30), list(sex='male', age=40))
# Test for interaction between age and sex, duplicating anova
contrast(f, list(sex='female', age=30),
            list(sex='male',   age=30),
            list(sex='female', age=c(40,50)),
            list(sex='male',   age=c(40,50)), type='joint')
# Duplicate overall sex effect in anova with 3 d.f.
contrast(f, list(sex='female', age=c(30,40,50)),
            list(sex='male',   age=c(30,40,50)), type='joint')
# For females get an array of odds ratios against age=40
k <- contrast(f, list(sex='female', age=30:50),
                 list(sex='female', age=40))
print(k, fun=exp)
# Plot odds ratios with pointwise 0.95 confidence bands using log scale
k <- as.data.frame(k[c('Contrast','Lower','Upper')])
ggplot(k, aes(x=30:50, y=exp(Contrast))) + geom_line() +
  geom_ribbon(aes(ymin=exp(Lower), ymax=exp(Upper)),
              alpha=0.15, linetype=0) +
  scale_y_continuous(trans='log10', n.breaks=10,
              minor_breaks=c(seq(0.1, 1, by=.1), seq(1, 10, by=.5))) +
  xlab('Age') + ylab('OR against age 40')

# For a model containing two treatments, centers, and treatment
# x center interaction, get 0.95 confidence intervals separately
# by center
center <- factor(sample(letters[1 : 8], 500, TRUE))
treat  <- factor(sample(c('a','b'), 500, TRUE))
y      <- 8*(treat == 'b') + rnorm(500, 100, 20)
f <- ols(y ~ treat*center)
lc <- levels(center)
contrast(f, list(treat='b', center=lc),
            list(treat='a', center=lc))
# Get 'Type III' contrast: average b - a treatment effect over
# centers, weighting centers equally (which is almost always
# an unreasonable thing to do)
contrast(f, list(treat='b', center=lc),
            list(treat='a', center=lc),
         type='average')
# Get 'Type II' contrast, weighting centers by the number of
# subjects per center.  Print the design contrast matrix used.
k <- contrast(f, list(treat='b', center=lc),
                 list(treat='a', center=lc),
              type='average', weights=table(center))
print(k, X=TRUE)
# Note: If other variables had interacted with either treat
# or center, we may want to list settings for these variables
# inside the list()'s, so as to not use default settings

# For a 4-treatment study, get all comparisons with treatment 'a'
treat  <- factor(sample(c('a','b','c','d'), 500, TRUE))
y      <- 8*(treat == 'b') + rnorm(500, 100, 20)
dd <- datadist(treat, center); options(datadist='dd')
f <- ols(y ~ treat*center)
lt <- levels(treat)
contrast(f, list(treat=lt[-1]),
            list(treat=lt[ 1]),
         cnames=paste(lt[-1], lt[1], sep=':'), conf.int=1 - .05 / 3)
# Compare each treatment with average of all others
for(i in 1 : length(lt)) {
  cat('Comparing with', lt[i], '\n\n')
  print(contrast(f, list(treat=lt[-i]),
                    list(treat=lt[ i]), type='average'))
}
options(datadist=NULL)

# Six ways to get the same thing, for a variable that
# appears linearly in a model and does not interact with
# any other variables.  We estimate the change in y per
# unit change in a predictor x1.  Methods 4, 5 also
# provide confidence limits.  Method 6 computes nonparametric
# bootstrap confidence limits.  Methods 2-6 can work
# for models that are nonlinear or non-additive in x1.
# For that case more care is needed in choice of settings
# for x1 and the variables that interact with x1.

## Not run: 
coef(fit)['x1']                            # method 1
diff(predict(fit, gendata(x1=c(0,1))))     # method 2
g <- Function(fit)                         # method 3
g(x1=1) - g(x1=0)
summary(fit, x1=c(0,1))                    # method 4
k <- contrast(fit, list(x1=1), list(x1=0)) # method 5
print(k, X=TRUE)
fit <- update(fit, x=TRUE, y=TRUE)         # method 6
b <- bootcov(fit, B=500)
contrast(fit, list(x1=1), list(x1=0))

# In a model containing age, race, and sex,
# compute an estimate of the mean response for a
# 50 year old male, averaged over the races using
# observed frequencies for the races as weights
f <- ols(y ~ age + race + sex)
contrast(f, list(age=50, sex='male', race=levels(race)),
         type='average', weights=table(race))

# For a Bayesian model get the highest posterior interval for the
# difference in two nonlinear functions of predicted values
# Start with the mean from a proportional odds model
g <- blrm(y ~ x)
M <- Mean(g)
contrast(g, list(x=1), list(x=0), fun=M)

# For the median we have to make sure that contrast can pass the
# per-posterior-draw vector of intercepts through
qu <- Quantile(g)
med <- function(lp, intercepts) qu(0.5, lp, intercepts=intercepts)
contrast(g, list(x=1), list(x=0), fun=med)

## End(Not run)

# Plot the treatment effect (drug - placebo) as a function of age
# and sex in a model in which age nonlinearly interacts with treatment
# for females only
set.seed(1)
n <- 800
treat <- factor(sample(c('drug','placebo'), n,TRUE))
sex   <- factor(sample(c('female','male'), n,TRUE))
age   <- rnorm(n, 50, 10)
y     <- .05*age + (sex=='female')*(treat=='drug')*.05*abs(age-50) + rnorm(n)
f     <- ols(y ~ rcs(age,4)*treat*sex)
d     <- datadist(age, treat, sex); options(datadist='d')
# show separate estimates by treatment and sex
require(ggplot2)
ggplot(Predict(f, age, treat, sex='female'))
ggplot(Predict(f, age, treat, sex='male'))
ages  <- seq(35,65,by=5); sexes <- c('female','male')
w     <- contrast(f, list(treat='drug',    age=ages, sex=sexes),
                     list(treat='placebo', age=ages, sex=sexes))
# add conf.type="simultaneous" to adjust for having done 14 contrasts
xYplot(Cbind(Contrast, Lower, Upper) ~ age | sex, data=w,
       ylab='Drug - Placebo')
w <- as.data.frame(w[c('age','sex','Contrast','Lower','Upper')])
ggplot(w, aes(x=age, y=Contrast)) + geom_point() + facet_grid(sex ~ .) +
       geom_errorbar(aes(ymin=Lower, ymax=Upper), width=0)
ggplot(w, aes(x=age, y=Contrast)) + geom_line() + facet_grid(sex ~ .) +
       geom_ribbon(aes(ymin=Lower, ymax=Upper), width=0,
                   alpha=0.15, linetype=0)
xYplot(Cbind(Contrast, Lower, Upper) ~ age, groups=sex, data=w,
       ylab='Drug - Placebo', method='alt bars')
options(datadist=NULL)

# Examples of type='joint' contrast tests
set.seed(1)
x1 <- rnorm(100)
x2 <- factor(sample(c('a','b','c'), 100, TRUE))
dd <- datadist(x1, x2); options(datadist='dd')
y  <- x1 + (x2=='b') + rnorm(100)

# First replicate a test statistic from anova()
f <- ols(y ~ x2)
anova(f)
contrast(f, list(x2=c('b','c')), list(x2='a'), type='joint')

# Repeat with a redundancy; compare a vs b, a vs c, b vs c
contrast(f, list(x2=c('a','a','b')), list(x2=c('b','c','c')), type='joint')

# Get a test of association of a continuous predictor with y
# First assume linearity, then cubic
f <- lrm(y>0 ~ x1 + x2)
anova(f)
contrast(f, list(x1=1), list(x1=0), type='joint')  # a minimum set of contrasts
xs <- seq(-2, 2, length=20)
contrast(f, list(x1=0), list(x1=xs), type='joint')
# All contrasts were redundant except for the first, because of
# linearity assumption

f <- lrm(y>0 ~ pol(x1,3) + x2)
anova(f)
contrast(f, list(x1=0), list(x1=xs), type='joint')
print(contrast(f, list(x1=0), list(x1=xs), type='joint'), jointonly=TRUE)
# All contrasts were redundant except for the first 3, because of
# cubic regression assumption

# Now do something that is difficult to do without cryptic contrast
# matrix operations: Allow each of the three x2 groups to have a different
# shape for the x1 effect where x1 is quadratic.  Test whether there is
# a difference in mean levels of y for x2='b' vs. 'c' or whether
# the shape or slope of x1 is different between x2='b' and x2='c' regardless
# of how they differ when x2='a'.  In other words, test whether the mean
# response differs between group b and c at any value of x1.
# This is a 3 d.f. test (intercept, linear, quadratic effects) and is
# a better approach than subsetting the data to remove x2='a' then
# fitting a simpler model, as it uses a better estimate of sigma from
# all the data.
f <- ols(y ~ pol(x1,2) * x2)
anova(f)
contrast(f, list(x1=xs, x2='b'),
            list(x1=xs, x2='c'), type='joint')
# Note: If using a spline fit, there should be at least one value of
# x1 between any two knots and beyond the outer knots.
options(datadist=NULL)
Modification of Therneau's coxph function to fit the Cox model and its extension, the Andersen-Gill model. The latter allows for interval time-dependent covariables, time-dependent strata, and repeated events. The Survival method for an object created by cph returns an S function for computing estimates of the survival function. The Quantile method for cph returns an S function for computing quantiles of survival time (median, by default). The Mean method returns a function for computing the mean survival time. This function issues a warning if the last follow-up time is uncensored, unless a restricted mean is explicitly requested.
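Before the formal usage, a small sketch (simulated data, not from the package documentation) of how the three generated functions are evaluated:

require(rms); require(survival)
set.seed(4)
ftime <- rexp(300)
e     <- ifelse(runif(300) < .8, 1, 0)
x     <- rnorm(300)
f <- cph(Surv(ftime, e) ~ x, surv=TRUE, x=TRUE, y=TRUE)
g   <- Survival(f)   # g(times, lp) -> estimated S(times | lp)
med <- Quantile(f)   # med(q, lp)   -> q quantile of survival time
m   <- Mean(f)       # m(lp)        -> mean survival time (may warn, as noted above)
lp <- 0              # linear predictor at its centered (mean) value
g(c(.5, 1), lp)
med(.5, lp)
m(lp)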
cph(formula = formula(data), data=environment(formula),
    weights, subset, na.action=na.delete,
    method=c("efron","breslow","exact","model.frame","model.matrix"),
    singular.ok=FALSE, robust=FALSE,
    model=FALSE, x=FALSE, y=FALSE, se.fit=FALSE,
    linear.predictors=TRUE, residuals=TRUE, nonames=FALSE,
    eps=1e-4, init, iter.max=10, tol=1e-9, surv=FALSE, time.inc,
    type=NULL, vartype=NULL, debug=FALSE, ...)

## S3 method for class 'cph'
Survival(object, ...)
# Evaluate result as g(times, lp, stratum=1, type=c("step","polygon"))

## S3 method for class 'cph'
Quantile(object, ...)
# Evaluate like h(q, lp, stratum=1, type=c("step","polygon"))

## S3 method for class 'cph'
Mean(object, method=c("exact","approximate"),
     type=c("step","polygon"), n=75, tmax, ...)
# E.g. m(lp, stratum=1, type=c("step","polygon"), tmax, ...)
formula |
an S formula object with a |
object |
an object created by |
data |
name of an S data frame containing all needed variables. Omit this to use a data frame already in the S “search list”. |
weights |
case weights |
subset |
an expression defining a subset of the observations to use in the fit. The default
is to use all observations. Specify for example |
na.action |
specifies an S function to handle missing data. The default is the function |
method |
for For |
singular.ok |
If |
robust |
if |
model |
default is |
x |
default is |
y |
default is |
se.fit |
default is |
linear.predictors |
set to |
residuals |
set to |
nonames |
set to |
eps |
convergence criterion - change in log likelihood. |
init |
vector of initial parameter estimates. Defaults to all zeros.
Special residuals can be obtained by setting some elements of |
iter.max |
maximum number of iterations to allow. Set to |
tol |
tolerance for declaring singularity for matrix inversion (available only when survival5 or later package is in effect) |
surv |
set to |
time.inc |
time increment used in deriving |
type |
(for For |
vartype |
see |
debug |
set to |
... |
other arguments passed to |
times |
a scalar or vector of times at which to evaluate the survival estimates |
lp |
a scalar or vector of linear predictors (including the centering constant) at which to evaluate the survival estimates |
stratum |
a scalar stratum number or name (e.g., |
q |
a scalar quantile or a vector of quantiles to compute |
n |
the number of points at which to evaluate the mean survival time, for
|
tmax |
For |
If there is any strata by covariable interaction in the model such that the mean X beta varies greatly over strata, method="approximate" may not yield very accurate estimates of the mean in Mean.cph. For method="approximate", if you ask for an estimate of the mean for a linear predictor value that was outside the range of linear predictors stored with the fit, the mean for that observation will be NA.
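A hedged sketch of the approximate method, reusing f from the earlier sketch: the mean is tabulated at n linear predictor values from the fit, so out-of-range values give NA, as noted above:

m.approx <- Mean(f, method="approximate", n=75)
m.approx(c(-0.5, 0, 0.5))   # lp values outside the stored range return NA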
For Survival
, Quantile
, or Mean
, an S function is returned. Otherwise,
in addition to what is listed below, formula/design information and
the components
maxtime, time.inc, units, model, x, y, se.fit
are stored, the last 5
depending on the settings of options by the same names.
The vectors or matrix stored if y=TRUE
or x=TRUE
have rows deleted according to subset
and
to missing data, and have names or row names that come from the
data frame used as input data.
n |
table with one row per stratum containing number of censored and uncensored observations |
coef |
vector of regression coefficients |
stats |
vector containing the named elements |
var |
variance/covariance matrix of coefficients |
linear.predictors |
values of predicted X beta for observations used in fit, normalized to have overall mean zero, then having any offsets added |
resid |
martingale residuals |
loglik |
log likelihood at initial and final parameter values |
score |
value of score statistic at initial values of parameters |
times |
lists of times (if |
surv |
lists of underlying survival probability estimates |
std.err |
lists of standard errors of estimate log-log survival |
surv.summary |
a 3 dimensional array if |
center |
centering constant, equal to overall mean of X beta. |
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
coxph
, survival-internal
,
Surv
, residuals.cph
,
cox.zph
, survfit.cph
,
survest.cph
, survfit.coxph
,
survplot
, datadist
,
rms
, rms.trans
, anova.rms
,
summary.rms
, Predict
,
fastbw
, validate
, calibrate
,
plot.Predict
, ggplot.Predict
,
specs.rms
, lrm
, which.influence
,
na.delete
,
na.detail.response
, print.cph
,
latex.cph
, vif
, ie.setup
,
GiniMd
, dxy.cens
,
concordance
# Simulate data from a population model in which the log hazard
# function is linear in age and there is no age x sex interaction
require(survival)
require(ggplot2)
n <- 1000
set.seed(731)
age <- 50 + 12*rnorm(n)
label(age) <- "Age"
sex <- factor(sample(c('Male','Female'), n,
              rep=TRUE, prob=c(.6, .4)))
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50)+.8*(sex=='Female'))
dt <- -log(runif(n))/h
label(dt) <- 'Follow-up Time'
e <- ifelse(dt <= cens,1,0)
dt <- pmin(dt, cens)
units(dt) <- "Year"
dd <- datadist(age, sex)
options(datadist='dd')
S <- Surv(dt,e)

f <- cph(S ~ rcs(age,4) + sex, x=TRUE, y=TRUE)
cox.zph(f, "rank")            # tests of PH
anova(f)
ggplot(Predict(f, age, sex))  # plot age effect, 2 curves for 2 sexes
survplot(f, sex)              # time on x-axis, curves for x2
res <- resid(f, "scaledsch")
time <- as.numeric(dimnames(res)[[1]])
z <- loess(res[,4] ~ time, span=0.50)  # residuals for sex
plot(time, fitted(z))
lines(supsmu(time, res[,4]),lty=2)
plot(cox.zph(f,"identity"))   #Easier approach for last few lines
# latex(f)

f <- cph(S ~ age + strat(sex), surv=TRUE)
g <- Survival(f)   # g is a function
g(seq(.1,1,by=.1), stratum="sex=Male", type="poly")  #could use stratum=2
med <- Quantile(f)
plot(Predict(f, age, fun=function(x) med(lp=x)))  #plot median survival

# Fit a model that is quadratic in age, interacting with sex as strata
# Compare standard errors of linear predictor values with those from
# coxph
# Use more stringent convergence criteria to match with coxph
f <- cph(S ~ pol(age,2)*strat(sex), x=TRUE, eps=1e-9, iter.max=20)
coef(f)
se <- predict(f, se.fit=TRUE)$se.fit
require(lattice)
xyplot(se ~ age | sex, main='From cph')
a <- c(30,50,70)
comb <- data.frame(age=rep(a, each=2),
                   sex=rep(levels(sex), 3))
p <- predict(f, comb, se.fit=TRUE)
comb$yhat <- p$linear.predictors
comb$se   <- p$se.fit
z <- qnorm(.975)
comb$lower <- p$linear.predictors - z*p$se.fit
comb$upper <- p$linear.predictors + z*p$se.fit
comb

age2 <- age^2
f2 <- coxph(S ~ (age + age2)*strata(sex))
coef(f2)
se <- predict(f2, se.fit=TRUE)$se.fit
xyplot(se ~ age | sex, main='From coxph')
comb <- data.frame(age=rep(a, each=2), age2=rep(a, each=2)^2,
                   sex=rep(levels(sex), 3))
p <- predict(f2, newdata=comb, se.fit=TRUE)
comb$yhat <- p$fit
comb$se   <- p$se.fit
comb$lower <- p$fit - z*p$se.fit
comb$upper <- p$fit + z*p$se.fit
comb

# g <- cph(Surv(hospital.charges) ~ age, surv=TRUE)
# Cox model very useful for analyzing highly skewed data, censored or not
# m <- Mean(g)
# m(0)   # Predicted mean charge for reference age

#Fit a time-dependent covariable representing the instantaneous effect
#of an intervening non-fatal event
rm(age)
set.seed(121)
dframe <- data.frame(failure.time=1:10, event=rep(0:1,5),
                     ie.time=c(NA,1.5,2.5,NA,3,4,NA,5,5,5),
                     age=sample(40:80,10,rep=TRUE))
z <- ie.setup(dframe$failure.time, dframe$event, dframe$ie.time)
S <- z$S
ie.status <- z$ie.status
attach(dframe[z$subs,])   # replicates all variables

f <- cph(S ~ age + ie.status, x=TRUE, y=TRUE)
#Must use x=TRUE,y=TRUE to get survival curves with time-dep. covariables

#Get estimated survival curve for a 50-year old who has an intervening
#non-fatal event at 5 days
new <- data.frame(S=Surv(c(0,5), c(5,999), c(FALSE,FALSE)), age=rep(50,2),
                  ie.status=c(0,1))
g <- survfit(f, new)
plot(c(0,g$time), c(1,g$surv[,2]), type='s',
     xlab='Days', ylab='Survival Prob.')
# Not certain about what columns represent in g$surv for survival5
# but appears to be for different ie.status
#or:
#g <- survest(f, new)
#plot(g$time, g$surv, type='s', xlab='Days', ylab='Survival Prob.')

#Compare with estimates when there is no intervening event
new2 <- data.frame(S=Surv(c(0,5), c(5, 999), c(FALSE,FALSE)), age=rep(50,2),
                   ie.status=c(0,0))
g2 <- survfit(f, new2)
lines(c(0,g2$time), c(1,g2$surv[,2]), type='s', lty=2)
#or:
#g2 <- survest(f, new2)
#lines(g2$time, g2$surv, type='s', lty=2)
detach("dframe[z$subs, ]")
options(datadist=NULL)
Creates several new variables which help set up a dataset with an ordinal response variable for use in fitting a forward continuation ratio (CR) model. The CR model can be fitted with binary logistic regression if each input observation is replicated the proper number of times according to the y value, a new binary y is computed that has at most one y=1 per subject, and if a cohort variable is used to define the current qualifying condition for a cohort of subjects, e.g., y>=2. cr.setup creates the needed auxiliary variables. See predab.resample and validate.lrm for information about validating CR models (e.g., using the bootstrap to sample with replacement from the original subjects instead of the records used in the fit, validating the model separately for user-specified values of cohort).
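A tiny sketch of the replication cr.setup performs, for an ordinal y with levels 0, 1, 2:

require(rms)
y <- c(0, 1, 2)
u <- cr.setup(y)
data.frame(y=u$y, cohort=u$cohort, subs=u$subs)  # one row per expanded record
u$reps   # number of binary records generated per original observation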
cr.setup(y)
y |
a character, numeric, |
a list with components y, cohort, subs, reps
. y
is a new binary
variable that is to be used in the binary logistic fit. cohort
is
a factor
vector specifying which cohort condition currently applies.
subs
is a vector of subscripts that can be used to replicate other
variables the same way y
was replicated. reps
specifies how many
times each original observation was replicated. y, cohort, subs
are
all the same length and are longer than the original y
vector.
reps
is the same length as the original y
vector.
The subs
vector is suitable for passing to validate.lrm
or calibrate
,
which pass this vector under the name cluster
on to predab.resample
so that bootstrapping can be
done by sampling with replacement from the original subjects rather than
from the individual records created by cr.setup
.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
Berridge DM, Whitehead J: Analysis of failure time data with ordinal categories of response. Stat in Med 10:1703–1710, 1991.
y <- c(NA, 10, 21, 32, 32)
cr.setup(y)

set.seed(171)
y <- sample(0:2, 100, rep=TRUE)
sex <- sample(c("f","m"),100,rep=TRUE)
sex <- factor(sex)
table(sex, y)
options(digits=5)
tapply(y==0, sex, mean)
tapply(y==1, sex, mean)
tapply(y==2, sex, mean)
cohort <- y>=1
tapply(y[cohort]==1, sex[cohort], mean)

u <- cr.setup(y)
Y <- u$y
cohort <- u$cohort
sex <- sex[u$subs]

lrm(Y ~ cohort + sex)

f <- lrm(Y ~ cohort*sex)  # saturated model - has to fit all data cells
f

#Prob(y=0|female):
# plogis(-.50078)
#Prob(y=0|male):
# plogis(-.50078+.11301)
#Prob(y=1|y>=1, female):
# plogis(-.50078+.31845)
#Prob(y=1|y>=1, male):
# plogis(-.50078+.31845+.11301-.07379)

combinations <- expand.grid(cohort=levels(cohort), sex=levels(sex))
combinations
p <- predict(f, combinations, type="fitted")
p
p0 <- p[c(1,3)]
p1 <- p[c(2,4)]
p1.unconditional <- (1 - p0) * p1
p1.unconditional
p2.unconditional <- 1 - p0 - p1.unconditional
p2.unconditional

## Not run: 
dd <- datadist(inputdata)   # do this on non-replicated data
options(datadist='dd')
pain.severity <- inputdata$pain.severity
u <- cr.setup(pain.severity)
# inputdata frame has age, sex with pain.severity
attach(inputdata[u$subs,])  # replicate age, sex
# If age, sex already available, could do age <- age[u$subs] etc., or
# age <- rep(age, u$reps), etc.
y      <- u$y
cohort <- u$cohort
dd <- datadist(dd, cohort)  # add to dd
f <- lrm(y ~ cohort + age*sex)                 # ordinary cont. ratio model
g <- lrm(y ~ cohort*sex + age, x=TRUE, y=TRUE) # allow unequal slopes for
                                               # sex across cutoffs
cal <- calibrate(g, cluster=u$subs, subset=cohort=='all')
# subs makes bootstrap sample the correct units, subset causes
# Predicted Prob(pain.severity=0) to be checked for calibration

## End(Not run)
For a given set of variables or a data frame, determines summaries of variables for effect and plotting ranges, values to adjust to, and overall ranges for Predict, plot.Predict, ggplot.Predict, summary.rms, survplot, and nomogram.rms. If datadist is called before a model fit and the resulting object pointed to with options(datadist="name"), the data characteristics will be stored with the fit by Design(), so that later predictions and summaries of the fit will not need to access the original data used in the fit. Alternatively, you can specify the values for each variable in the model when using these functions, or specify the values of some of them and let the functions look up the remainder (of, say, adjustment levels) from an object created by datadist. The best method is probably to run datadist once before any models are fitted, storing the distribution summaries for all potential variables. Adjustment values are 0 for binary variables, the most frequent category (or optionally the first category level) for categorical (factor) variables, the middle level for ordered factor variables, and medians for continuous variables. See the descriptions of q.display and q.effect for how display and effect ranges are chosen for continuous variables.
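As a brief sketch (with illustrative simulated variables), the stored summaries can be inspected directly:

require(rms)
set.seed(5)
age <- rnorm(100, 50, 10)
sex <- factor(sample(c('female','male'), 100, TRUE))
d <- datadist(age, sex)
d$limits                # effect/display ranges and 'Adjust to' values per variable
options(datadist='d')   # point later fits and summaries at these summaries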
datadist(..., data, q.display, q.effect=c(0.25, 0.75),
         adjto.cat=c('mode','first'), n.unique=10)

## S3 method for class 'datadist'
print(x, ...)

# options(datadist="dd")
# used by summary, plot, survplot, sometimes predict
# For dd substitute the name of the result of datadist
... |
a list of variable names, separated by commas, a single data frame, or
a fit with |
data |
a data frame or a search position. If |
q.display |
set of two quantiles for computing the range of continuous variables
to use in displaying regression relationships. Defaults are
|
q.effect |
set of two quantiles for computing the range of continuous variables to use in estimating regression effects. Defaults are c(.25,.75), which yields inter-quartile-range odds ratios, etc. |
adjto.cat |
default is |
n.unique |
variables having |
x |
result of |
For categorical variables, the 7 limits are set to character strings
(factors) which correspond to
c(NA,adjto.level,NA,1,k,1,k)
, where k
is the number of levels.
For ordered variables with numeric levels, the limits are set to
c(L,M,H,L,H,L,H)
, where L
is the lowest level, M
is the middle
level, and H
is the highest level.
a list of class "datadist"
with the following components
limits |
a |
values |
a named list, with one vector of unique values for each numeric
variable having no more than |
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
rms
, rms.trans
, describe
, Predict
, summary.rms
## Not run: 
d <- datadist(data=1)        # use all variables in search pos. 1
d <- datadist(x1, x2, x3)
page(d)                      # if your options(pager) leaves up a pop-up
                             # window, this is a useful guide in analyses
d <- datadist(data=2)        # all variables in search pos. 2
d <- datadist(data=my.data.frame)
d <- datadist(my.data.frame) # same as previous. Run for all potential vars.
d <- datadist(x2, x3, data=my.data.frame)   # combine variables
d <- datadist(x2, x3, q.effect=c(.1,.9), q.display=c(0,1))
# uses inter-decile range odds ratios,
# total range of variables for regression function plots
d <- datadist(d, z)          # add a new variable to an existing datadist
options(datadist="d")        # often a good idea, to store info with fit
f <- ols(y ~ x1*x2*x3)

options(datadist=NULL)       # default at start of session
f <- ols(y ~ x1*x2)
d <- datadist(f)             # info not stored in `f'
d$limits["Adjust to","x1"] <- .5   # reset adjustment level to .5
options(datadist="d")

f <- lrm(y ~ x1*x2, data=mydata)
d <- datadist(f, data=mydata)
options(datadist="d")

f <- lrm(y ~ x1*x2)          # datadist not used - specify all values for
summary(f, x1=c(200,500,800), x2=c(1,3,5))   # obtaining predictions
plot(Predict(f, x1=200:800, x2=3))           # or ggplot()

# Change reference value to get a relative odds plot for a logistic model
d$limits$age[2] <- 30        # make 30 the reference value for age
# Could also do: d$limits["Adjust to","age"] <- 30
fit <- update(fit)           # make new reference value take effect
plot(Predict(fit, age, ref.zero=TRUE, fun=exp),
     ylab='Age=x:Age=30 Odds Ratio')         # or ggplot()

## End(Not run)
For an orm
object generates a function for computing the
estimates of the function Prob(Y>=y) given one or more values of the
linear predictor using the reference (median) intercept. This
function can optionally be evaluated at only a set of user-specified
y
values, otherwise a right-step function is returned. There
is a plot method for plotting the step functions, and if more than one
linear predictor was evaluated multiple step functions are drawn.
ExProb
is especially useful for nomogram
.
Optionally a normal approximation for a confidence
interval for exceedance probabilities will be computed using the delta
method, if
conf.int > 0
is specified to the function generated from calling
ExProb
. In that case, a "lims"
attribute is included
in the result computed by the derived cumulative probability function.
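A minimal sketch of requesting these limits (not part of the original examples; f is a hypothetical orm fit and lp a vector of linear predictor values):

ex <- ExProb(f)               # build the exceedance probability function
w  <- ex(lp, conf.int=0.95)   # request delta-method confidence limits
attr(w, 'lims')               # the "lims" attribute holds the limits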
ExProb(object, ...)

## S3 method for class 'orm'
ExProb(object, codes = FALSE, ...)

## S3 method for class 'ExProb'
plot(x, ..., data=NULL, xlim=NULL, xlab=x$yname,
     ylab=expression(Prob(Y>=y)), col=par('col'), col.vert='gray85',
     pch=20, pch.data=21, lwd=par('lwd'), lwd.data=lwd, lty.data=2,
     key=TRUE)
object |
a fit object from |
codes |
if |
... |
ignored for |
data |
Specify |
x |
an object created by running the function created by |
xlim |
limits for x-axis; default is range of observed |
xlab |
x-axis label |
ylab |
y-axis label |
col |
color for horizontal lines and points |
col.vert |
color for vertical discontinuities |
pch |
plotting symbol for predicted curves |
lwd |
line width for predicted curves |
pch.data, lwd.data, lty.data |
plotting parameters for data |
key |
set to |
ExProb
returns an R function. Running the function returns an
object of class "ExProb"
.
Frank Harrell and Shengxin Tu
set.seed(1)
x1 <- runif(200)
yvar <- x1 + runif(200)
f <- orm(yvar ~ x1)
d <- ExProb(f)
lp <- predict(f, newdata=data.frame(x1=c(.2,.8)))
w <- d(lp)
s1 <- abs(x1 - .2) < .1
s2 <- abs(x1 - .8) < .1
plot(w, data=data.frame(x1=c(rep(.2, sum(s1)), rep(.8, sum(s2))),
                        yvar=c(yvar[s1], yvar[s2])))

qu <- Quantile(f)
abline(h=c(.1,.5), col='gray80')
abline(v=qu(.5, lp), col='gray80')
abline(v=qu(.9, lp), col='green')
Performs a slightly inefficient but numerically stable version of fast
backward elimination on factors, using a method based on Lawless and Singhal
(1978).
This method uses the fitted complete model and computes approximate Wald
statistics by computing conditional (restricted) maximum likelihood estimates
assuming multivariate normality of estimates.
fastbw
deletes factors, not columns of the design matrix. Factors requiring multiple d.f. will be retained or dropped as a group.
The function prints the deletion statistics for each variable in
turn, and prints approximate parameter estimates for the model after
deleting variables. The approximation is better when the number of
factors deleted is not large. For ols
, the approximation is exact for
regression coefficients, and standard errors are only off by a factor
equal to the ratio of the mean squared error estimate for the reduced
model to the original mean squared error estimate for the full model.
If the fit was from ols, fastbw will compute the usual
R^2 statistic for each model.
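A sketch of the stopping rules (not from the original examples; the fit and predictors are hypothetical, and k.aic=log(n) as a BIC-style penalty is an assumption based on the k.aic description below):

f <- lrm(y ~ x1 + x2 + x3)
fastbw(f)                       # default: AIC-based deletion
fastbw(f, rule='p', sls=0.10)   # p-value rule: factors stay if P < 0.10
fastbw(f, k.aic=log(200))       # BIC-like penalty when n=200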
fastbw(fit, rule=c("aic", "p"),
       type=c("residual", "individual", "total"), sls=.05, aics=0,
       eps=1e-9, k.aic=2, force=NULL)

## S3 method for class 'fastbw'
print(x, digits=4, estimates=TRUE, ...)
fit |
fit object with |
rule |
Stopping rule. Defaults to |
type |
Type of statistic on which to base the stopping rule. Default is
|
sls |
Significance level for staying in a model if |
aics |
For |
eps |
Singularity criterion, default is |
k.aic |
multiplier to compute AIC, default is 2. To use BIC, set |
force |
a vector of integers specifying parameters forced to be in the model, not counting intercept(s) |
x |
result of |
digits |
number of significant digits to print |
estimates |
set to |
... |
ignored |
a list with an attribute kept
if bw=TRUE
, and the
following components:
result |
matrix of statistics with rows in order of deletion. |
names.kept |
names of factors kept in final model. |
factors.kept |
the subscripts of factors kept in the final model |
factors.deleted |
opposite of |
parms.kept |
column numbers in design matrix corresponding to parameters kept in the final model. |
parms.deleted |
opposite of |
coefficients |
vector of approximate coefficients of reduced model. |
var |
approximate covariance matrix for reduced model. |
Coefficients |
matrix of coefficients of all models. Rows correspond to the successive models examined and columns correspond to the coefficients in the full model. For variables not in a particular sub-model (row), the coefficients are zero. |
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
Lawless, J. F. and Singhal, K. (1978): Efficient screening of nonnormal regression models. Biometrics 34:318–327.
rms, ols, lrm, cph, psm, validate, solvet, rmsMisc
## Not run:
fastbw(fit, optional.arguments)   # print results
z <- fastbw(fit, optional.args)   # typically used in simulations
lm.fit(X[,z$parms.kept], Y)       # least squares fit of reduced model
## End(Not run)
Function
is a class of functions for creating other S functions.
Function.rms
is the method for creating S functions to compute
X beta, based on a model fitted with rms
in effect.
Like latexrms
, Function.rms
simplifies restricted cubic
spline functions and factors out terms in second-order interactions.
Function.rms
will not work for models that have third-order
interactions involving restricted cubic splines.
Function.cph
is a particular method for handling fits from
cph
, for which an intercept (the negative of the centering
constant) is added to
the model. sascode
is a function that takes an S function such
as one created by Function
and does most of the editing
to turn the function definition into
a fragment of SAS code for computing X beta from the fitted model, along
with assignment statements that initialize predictors to reference
values.
perlcode
similarly creates Perl code to evaluate a fitted
regression model.
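A sketch of the full pipeline (not from the original examples; f is a hypothetical rms fit):

g <- Function(f)         # R function that computes X beta from the fit
g                        # inspect the generated function
sascode(g)               # write a SAS fragment to standard output
cat(perlcode(g), '\n')   # Perl equivalent, returned as a character string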
## S3 method for class 'rms'
Function(object, intercept=NULL, digits=max(8, .Options$digits),
         posterior.summary=c('mean', 'median', 'mode'), ...)

## S3 method for class 'cph'
Function(object, intercept=-object$center, ...)

# Use result as fun(predictor1=value1, predictor2=value2, ...)

sascode(object, file='', append=FALSE)

perlcode(object)
object |
a fit created with |
intercept |
an intercept value to use (not allowed to be specified to |
digits |
number of significant digits to use for coefficients and knot locations |
posterior.summary |
if using a Bayesian model fit such as from
|
file |
name of a file in which to write the SAS code. Default is to write to standard output. |
append |
set to |
... |
arguments to pass to |
Function
returns an S-Plus function that can be invoked in any
usual context. The function has one argument per predictor variable,
and the default values of the predictors are set to adjust-to
values
(see datadist
). Multiple predicted X beta values may be calculated
by specifying vectors as arguments to the created function.
All non-scalar argument values must have the same length.
perlcode
returns a character string with embedded newline characters.
Frank Harrell, Jeremy Stephens, and Thomas Dupont
Department of Biostatistics
Vanderbilt University
[email protected]
latexrms, transcan, predict.rms, rms, rms.trans
suppressWarnings(RNGversion("3.5.0"))
set.seed(1331)
x1 <- exp(rnorm(100))
x2 <- factor(sample(c('a','b'),100,rep=TRUE))
dd <- datadist(x1, x2)
options(datadist='dd')
y <- log(x1)^2 + log(x1)*(x2=='b') + rnorm(100)/4
f <- ols(y ~ pol(log(x1),2)*x2)
f$coef
g <- Function(f, digits=5)
g
sascode(g)
cat(perlcode(g), '\n')
g()
g(x1=c(2,3), x2='b')   # could omit x2 since b is default category
predict(f, expand.grid(x1=c(2,3), x2='b'))
g8 <- Function(f)      # default is 8 sig. digits
g8(x1=c(2,3), x2='b')
options(datadist=NULL)

## Not run:
require(survival)
# Make self-contained functions for computing survival probabilities
# using a log-normal regression
f <- psm(Surv(d.time, death) ~ rcs(age,4)*sex, dist='gaussian')
g <- Function(f)
surv <- Survival(f)
# Compute 2 and 5-year survival estimates for 50 year old male
surv(c(2,5), g(age=50, sex='male'))
## End(Not run)
If nobs
is not specified, allows user to specify predictor settings
by e.g. age=50, sex="male"
, and any omitted predictors are set to
reference values (default=median for continuous variables, first level
for categorical ones - see datadist
). If any predictor has more than one
value given, expand.grid
is called to generate all possible combinations
of values, unless expand=FALSE
. If nobs
is given, a data
frame is first generated which has
nobs
of adjust-to values duplicated. Then an editor window is opened
which allows the user to subset the variable names down to ones which she
intends to vary (this streamlines the data.ed
step). Then, if any
predictors kept are discrete and viewvals=TRUE
, a window (using page
)
is opened defining the possible values of this subset, to facilitate
data editing. Then the data.ed
function is invoked to allow interactive
overriding of predictor settings in the nobs
rows. The subset of
variables are combined with the other predictors which were not
displayed with data.ed
, and a final full data frame is returned.
gendata
is most useful for creating a newdata
data frame to pass
to predict
.
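A sketch of equivalent calls (not from the original examples; f, age, and sex are hypothetical):

gendata(f)                               # all predictors at adjust-to values
gendata(f, age=c(50,60), sex='female')   # expand.grid over the settings
gendata(f, factors=list(age=c(50,60), sex='female'))   # same, via factors=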
gendata(fit, ..., nobs, viewvals=FALSE, expand=TRUE, factors)
fit |
a fit object created with |
... |
predictor settings, if |
nobs |
number of observations to create if doing it interactively using X-windows |
viewvals |
if |
expand |
set to |
factors |
a list containing predictor settings with their names. This is an
alternative to specifying the variables separately in .... Unlike the
usage of ..., variables getting default ranges in |
if you have a variable in ...
that is named n, no, nob, or nobs, add nobs=FALSE
to the invocation to prevent that variable
from being misrecognized as nobs
a data frame with all predictors, and an attribute names.subset
if
nobs
is specified. This attribute contains the vector of variable
names for predictors which were passed to de
and hence were
allowed to vary. If neither nobs
nor any predictor settings were
given, returns a data frame with adjust-to values.
optionally writes to the terminal, opens X-windows, and generates a
temporary file using sink
.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
predict.rms
, survest.cph
,
survest.psm
, rmsMisc
,
expand.grid
, de
, page
,
print.datadist
, Predict
set.seed(1)
age <- rnorm(200, 50, 10)
sex <- factor(sample(c('female','male'),200,TRUE))
race <- factor(sample(c('a','b','c','d'),200,TRUE))
y <- sample(0:1, 200, TRUE)
dd <- datadist(age,sex,race)
options(datadist="dd")
f <- lrm(y ~ age*sex + race)
gendata(f)
gendata(f, age=50)
d <- gendata(f, age=50, sex="female")   # leave race=reference category
d <- gendata(f, age=c(50,60), race=c("b","a"))   # 4 obs.
d$Predicted <- predict(f, d, type="fitted")
d   # Predicted column prints at the far right
options(datadist=NULL)

## Not run:
d <- gendata(f, nobs=5, view=TRUE)   # 5 interactively defined obs.
d[,attr(d,"names.subset")]           # print variables which varied
predict(f, d)
## End(Not run)
Uses ggplot2
graphics to plot the effect of one or two predictors
on the linear predictor or X beta scale, or on some transformation of
that scale. The first argument specifies the result of the
Predict
function. The predictor is always plotted in its
original coding.
If rdata
is given, a spike histogram is drawn showing
the location/density of data values for the x-axis variable. If
there is a
groups
(superposition) variable that generated separate
curves, the data density specific to each class of points is shown.
This assumes that the second variable was a factor variable. The histograms
are drawn by histSpikeg
.
To plot effects instead of estimates (e.g., treatment differences as a
function of interacting factors) see contrast.rms
and
summary.rms
.
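A minimal sketch (not from the original examples; f is a hypothetical rms fit with a datadist in effect and d a raw data frame):

p <- Predict(f)      # predictions for all predictors
ggplot(p)            # one panel per predictor
ggplot(p, rdata=d)   # add spike histograms of the raw predictor values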
## S3 method for class 'Predict'
ggplot(data, mapping, formula=NULL, groups=NULL,
       aestype=c('color', 'linetype'), conf=c('fill', 'lines'),
       conflinetype=1, varypred=FALSE,
       sepdiscrete=c('no', 'list', 'vertical', 'horizontal'),
       subset, xlim., ylim., xlab, ylab,
       colorscale=function(...) scale_color_manual(...,
         values=c("#000000", "#E69F00", "#56B4E9", "#009E73",
                  "#F0E442", "#0072B2", "#D55E00", "#CC79A7")),
       colfill='black', rdata=NULL, anova=NULL, pval=FALSE, size.anova=4,
       adj.subtitle, size.adj=2.5, perim=NULL, nlevels=3,
       flipxdiscrete=TRUE, legend.position='right', legend.label=NULL,
       vnames=c('labels','names'), abbrev=FALSE, minlength=6,
       layout=NULL, addlayer,
       histSpike.opts=list(frac=function(f) 0.01 +
         0.02 * sqrt(f - 1)/sqrt(max(f, 2) - 1), side=1, nint=100),
       type=NULL, ggexpr=FALSE, height=NULL, width=NULL, ...,
       environment)
data |
a data frame created by |
mapping |
kept because of |
formula |
a |
groups |
an optional character string containing the
name of one of the variables in |
aestype |
a string vector of aesthetic names corresponding to
variables in the |
conf |
specify |
conflinetype |
specify an alternative |
varypred |
set to |
sepdiscrete |
set to something other than |
subset |
a subsetting expression for restricting the rows of
|
xlim. |
This parameter is seldom used, as limits are usually controlled with
|
ylim. |
Range for plotting on response variable axis. Computed by default.
Usually specified using its legal definition |
xlab |
Label for |
ylab |
Label for |
colorscale |
a |
colfill |
a single character string or number specifying the fill color
to use for |
rdata |
a data frame containing the original raw data on which the
regression model were based, or at least containing the |
anova |
an object returned by |
pval |
specify |
size.anova |
character size for the test statistic printed on the panel, mm |
adj.subtitle |
Set to |
size.adj |
Size of adjustment settings in subtitles in mm. Default is 2.5. |
perim |
|
nlevels |
when |
flipxdiscrete |
see |
legend.position |
|
legend.label |
if omitted, group variable labels will be used to
label the legend. Specify |
vnames |
applies to the case where multiple plots are produced
separately by predictor. Set to |
abbrev |
set to TRUE to abbreviate levels of predictors that are
categorical to a minimum length of |
minlength |
see |
layout |
for multi-panel plots a 2-vector specifying the number of rows and number of columns. If omitted will be computed from the number of panels to make as square as possible. |
addlayer |
a |
histSpike.opts |
a list containing named elements that specifies
parameters to |
type |
a value ( |
ggexpr |
set to |
height, width |
used if |
... |
ignored |
environment |
ignored; used to satisfy rules because of the generic ggplot |
an object of class "ggplot2"
ready for printing. For the
case where predictors were not specified to Predict
,
sepdiscrete=TRUE
, and there were both continuous and discrete
predictors in the model, a list of two graphics objects is returned.
If plotting the effects of all predictors you can reorder the
panels using for example p <- Predict(fit); p$.predictor. <-
factor(p$.predictor., v)
where v
is a vector of predictor
names specified in the desired order.
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
Fox J, Hong J (2009): Effect displays in R for multinomial and proportional-odds logit models: Extensions to the effects package. J Stat Software 32 No. 1.
Predict, rbind.Predict, datadist, predictrms, anova.rms, contrast.rms, summary.rms, rms, rmsMisc, plot.Predict, labcurve, histSpikeg, ggplot, Overview
require(ggplot2)
n <- 350     # define sample size
set.seed(17) # so can reproduce the results
age            <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
cholesterol    <- rnorm(n, 200, 25)
sex            <- factor(sample(c('female','male'), n, TRUE))
label(age)            <- 'Age'    # label is in Hmisc
label(cholesterol)    <- 'Total Cholesterol'
label(blood.pressure) <- 'Systolic Blood Pressure'
label(sex)            <- 'Sex'
units(cholesterol)    <- 'mg/dl'  # uses units.default in Hmisc
units(blood.pressure) <- 'mmHg'

# Specify population model for log odds that Y=1
L <- .4*(sex=='male') + .045*(age-50) +
  (log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male')) +
  .01 * (blood.pressure - 120)
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)

ddist <- datadist(age, blood.pressure, cholesterol, sex)
options(datadist='ddist')

fit <- lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)),
           x=TRUE, y=TRUE)
an <- anova(fit)

# Plot effects in two vertical sub-panels with continuous predictors on top
# ggplot(Predict(fit), sepdiscrete='vertical')
# Plot effects of all 4 predictors with test statistics from anova, and P
ggplot(Predict(fit), anova=an, pval=TRUE)
# ggplot(Predict(fit), rdata=llist(blood.pressure, age))
# spike histogram plot for two of the predictors

# p <- Predict(fit, name=c('age','cholesterol'))   # Make 2 plots
# ggplot(p)

# p <- Predict(fit, age=seq(20,80,length=100), sex, conf.int=FALSE)
#          # Plot relationship between age and log
#          # odds, separate curve for each sex,
# ggplot(p, subset=sex=='female' | age > 30)
# No confidence interval, suppress estimates for males <= 30

# p <- Predict(fit, age, sex)
# ggplot(p, rdata=llist(age,sex))
#          # rdata= allows rug plots (1-dimensional scatterplots)
#          # on each sex's curve, with sex-specific density of age
#          # If data were in data frame could have used that

# p <- Predict(fit, age=seq(20,80,length=100), sex='male', fun=plogis)
#          # works if datadist not used
# ggplot(p, ylab=expression(hat(P)))
#          # plot predicted probability in place of log odds

# per <- function(x, y) x >= 30
# ggplot(p, perim=per)   # suppress output for age < 30 but leave scale alone

# Do ggplot2 faceting a few different ways
p <- Predict(fit, age, sex, blood.pressure=c(120,140,160),
             cholesterol=c(180,200,215))
# ggplot(p)
ggplot(p, cholesterol ~ blood.pressure)
# ggplot(p, ~ cholesterol + blood.pressure)
# color for sex, line type for blood.pressure:
ggplot(p, groups=c('sex', 'blood.pressure'))
# Add legend.position='top' to allow wider plot
# Map blood.pressure to line thickness instead of line type:
# ggplot(p, groups=c('sex', 'blood.pressure'), aestype=c('color', 'size'))

# Plot the age effect as an odds ratio
# comparing the age shown on the x-axis to age=30 years
# ddist$limits$age[2] <- 30   # make 30 the reference value for age
# Could also do: ddist$limits["Adjust to","age"] <- 30
# fit <- update(fit)          # make new reference value take effect
# p <- Predict(fit, age, ref.zero=TRUE, fun=exp)
# ggplot(p, ylab='Age=x:Age=30 Odds Ratio',
#        addlayer=geom_hline(yintercept=1, col=gray(.8)) +
#                 geom_vline(xintercept=30, col=gray(.8)) +
#                 scale_y_continuous(trans='log',
#                                    breaks=c(.5, 1, 2, 4, 8)))

# Compute predictions for three predictors, with superpositioning or
# conditioning on sex, combined into one graph
p1 <- Predict(fit, age, sex)
p2 <- Predict(fit, cholesterol, sex)
p3 <- Predict(fit, blood.pressure, sex)
p <- rbind(age=p1, cholesterol=p2, blood.pressure=p3)
ggplot(p, groups='sex', varypred=TRUE, adj.subtitle=FALSE)
# ggplot(p, groups='sex', varypred=TRUE,
#        adj.subtitle=FALSE, sepdiscrete='vert')

## Not run:
# For males at the median blood pressure and cholesterol, plot 3 types
# of confidence intervals for the probability on one plot, for varying age
ages <- seq(20, 80, length=100)
p1 <- Predict(fit, age=ages, sex='male', fun=plogis)   # standard pointwise
p2 <- Predict(fit, age=ages, sex='male', fun=plogis,
              conf.type='simultaneous')                # simultaneous
p3 <- Predict(fit, age=c(60,65,70), sex='male', fun=plogis,
              conf.type='simultaneous')                # simultaneous 3 pts
# The previous only adjusts for a multiplicity of 3 points instead of 100
f <- update(fit, x=TRUE, y=TRUE)
g <- bootcov(f, B=500, coef.reps=TRUE)
p4 <- Predict(g, age=ages, sex='male', fun=plogis)     # bootstrap percentile
p <- rbind(Pointwise=p1, 'Simultaneous 100 ages'=p2,
           'Simultaneous 3 ages'=p3, 'Bootstrap nonparametric'=p4)
# as.data.frame so will call built-in ggplot
ggplot(as.data.frame(p), aes(x=age, y=yhat)) + geom_line() +
  geom_ribbon(data=p, aes(ymin=lower, ymax=upper), alpha=0.2, linetype=0) +
  facet_wrap(~ .set., ncol=2)

# Plots for a parametric survival model
n <- 1000
set.seed(731)
age <- 50 + 12*rnorm(n)
label(age) <- "Age"
sex <- factor(sample(c('Male','Female'), n, rep=TRUE, prob=c(.6, .4)))
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50)+.8*(sex=='Female'))
t <- -log(runif(n))/h
label(t) <- 'Follow-up Time'
e <- ifelse(t<=cens,1,0)
t <- pmin(t, cens)
units(t) <- "Year"
ddist <- datadist(age, sex)
require(survival)
Srv <- Surv(t,e)

# Fit log-normal survival model and plot median survival time vs. age
f <- psm(Srv ~ rcs(age), dist='lognormal')
med <- Quantile(f)   # Creates function to compute quantiles
                     # (median by default)
p <- Predict(f, age, fun=function(x) med(lp=x))
ggplot(p, ylab="Median Survival Time")
# Note: confidence intervals from this method are approximate since
# they don't take into account estimation of scale parameter

# Fit an ols model to log(y) and plot the relationship between x1
# and the predicted mean(y) on the original scale without assuming
# normality of residuals; use the smearing estimator
# See help file for rbind.Predict for a method of showing two
# types of confidence intervals simultaneously.
# Add raw data scatterplot to graph
set.seed(1)
x1 <- runif(300)
x2 <- runif(300)
ddist <- datadist(x1, x2); options(datadist='ddist')
y <- exp(x1 + x2 - 1 + rnorm(300))
f <- ols(log(y) ~ pol(x1,2) + x2)
r <- resid(f)
smean <- function(yhat) smearingEst(yhat, exp, res, statistic='mean')
formals(smean) <- list(yhat=numeric(0), res=r[! is.na(r)])
#smean$res <- r[! is.na(r)]   # define default res argument to function
ggplot(Predict(f, x1, fun=smean), ylab='Predicted Mean on y-scale',
       addlayer=geom_point(aes(x=x1, y=y), data.frame(x1, y)))
# Had ggplot not added a subtitle (i.e., if x2 were not present), you
# could have done ggplot(Predict(), ylab=...) + geom_point(...)
## End(Not run)

# Make an 'interaction plot', forcing the x-axis variable to be
# plotted at integer values but labeled with category levels
n <- 100
set.seed(1)
gender <- c(rep('male', n), rep('female',n))
m <- sample(c('a','b'), 2*n, TRUE)
d <- datadist(gender, m); options(datadist='d')
anxiety <- runif(2*n) + .2*(gender=='female') + .4*(gender=='female' & m=='b')
tapply(anxiety, llist(gender,m), mean)
f <- ols(anxiety ~ gender*m)
p <- Predict(f, gender, m)
# ggplot(p)   # horizontal dot chart; usually preferred for categorical
#             # predictors
# ggplot(p, flipxdiscrete=FALSE)   # back to vertical
ggplot(p, groups='gender')
ggplot(p, ~ m, groups=FALSE, flipxdiscrete=FALSE)
options(datadist=NULL)

## Not run:
# Example in which separate curves are shown for 4 income values
# For each curve the estimated percentage of voters voting for
# the democratic party is plotted against the percent of voters
# who graduated from college. Data are county-level percents.
incomes <- seq(22900, 32800, length=4)   # equally spaced to outer quintiles
p <- Predict(f, college, income=incomes, conf.int=FALSE)
ggplot(p, xlim=c(0,35), ylim=c(30,55))

# Erase end portions of each curve where there are fewer than 10 counties
# having percent of college graduates to the left of the x-coordinate
# being plotted, for the subset of counties having median family income
# within 1650 of the target income for the curve
show.pts <- function(college.pts, income.pt) {
  s <- abs(income - income.pt) < 1650   # assumes income known to top frame
  x <- college[s]
  x <- sort(x[!is.na(x)])
  n <- length(x)
  low <- x[10]; high <- x[n-9]
  college.pts >= low & college.pts <= high
}
ggplot(p, xlim=c(0,35), ylim=c(30,55), perim=show.pts)

# Rename variables for better plotting of a long list of predictors
f <- ...
p <- Predict(f)
re <- c(trt='treatment', diabet='diabetes', sbp='systolic blood pressure')
for(n in names(re)) {
  names(p)[names(p)==n] <- re[n]
  p$.predictor.[p$.predictor.==n] <- re[n]
}
ggplot(p)
## End(Not run)
gIndex computes the total g-index for a model based on
the vector of linear predictors, and the partial g-index for
each predictor in a model. The latter is computed by summing all the
terms involving each variable, weighted by their regression
coefficients, then computing Gini's mean difference on this sum. For
example, a regression model having age and sex and age*sex on the
right hand side, with corresponding regression coefficients b1, b2, b3,
will have the g-index for age computed from Gini's mean
difference on the product of age times (b1 + b3*w) where w
is an indicator set to one for observations with sex not equal
to the reference value. When there are nonlinear terms associated
with a predictor, these terms will also be combined.
A print method is defined, and there is a plot method for displaying
g-indexes using a dot chart.
These functions use Hmisc::GiniMd.
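A sketch of the relationship to GiniMd (not from the original examples; f is a hypothetical rms fit):

g <- gIndex(f)                # total and partial g-indexes
g['Total', ]                  # the total g-index
GiniMd(f$linear.predictors)   # the same quantity computed directly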
gIndex(object, partials=TRUE, type=c('ccterms', 'cterms', 'terms'),
       lplabel=if(length(object$scale) && is.character(object$scale))
         object$scale[1] else 'X*Beta',
       fun, funlabel=if(missing(fun)) character(0) else
         deparse(substitute(fun)),
       postfun=if(length(object$scale)==2) exp else NULL,
       postlabel=if(length(postfun))
         ifelse(missing(postfun),
                if((length(object$scale) > 1) &&
                   is.character(object$scale)) object$scale[2]
                else 'Anti-log',
                deparse(substitute(postfun))) else character(0),
       ...)

## S3 method for class 'gIndex'
print(x, digits=4, abbrev=FALSE, vnames=c("names","labels"), ...)

## S3 method for class 'gIndex'
plot(x, what=c('pre', 'post'), xlab=NULL, pch=16, rm.totals=FALSE,
     sort=c('descending', 'ascending', 'none'), ...)
object |
result of an |
partials |
set to |
type |
defaults to |
lplabel |
a replacement for default values such as
|
fun |
an optional function to transform the linear predictors
before computing the total (only) |
funlabel |
a character string label for |
postfun |
a function to transform |
postlabel |
a label for |
... |
For |
x |
an object created by |
digits |
causes rounding to the |
abbrev |
set to |
vnames |
set to |
what |
set to |
xlab |
|
pch |
plotting character for point |
rm.totals |
set to |
sort |
specifies how to sort predictors by |
For stratification factors in a Cox proportional hazards model, there is
no contribution of variation towards computing a partial g-index
except from terms that interact with the stratification variable.
gIndex
returns a matrix of class "gIndex"
with auxiliary
information stored as attributes, such as variable labels.
GiniMd
returns a scalar.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
David HA (1968): Gini's mean difference rediscovered. Biometrika 55:573–575.
set.seed(1)
n <- 40
x <- 1:n
w <- factor(sample(c('a','b'), n, TRUE))
u <- factor(sample(c('A','B'), n, TRUE))
y <- .01*x + .2*(w=='b') + .3*(u=='B') + .2*(w=='b' & u=='B') + rnorm(n)/5
dd <- datadist(x,w,u); options(datadist='dd')
f <- ols(y ~ x*w*u, x=TRUE, y=TRUE)
f
anova(f)
z <- list()
for(type in c('terms','cterms','ccterms')) {
  zc <- predict(f, type=type)
  cat('type:', type, '\n')
  print(zc)
  z[[type]] <- zc
}
zc <- z$cterms
GiniMd(zc[, 1])
GiniMd(zc[, 2])
GiniMd(zc[, 3])
GiniMd(f$linear.predictors)
g <- gIndex(f)
g
g['Total',]
gIndex(f, partials=FALSE)
gIndex(f, type='cterms')
gIndex(f, type='terms')

y <- y > .8
f <- lrm(y ~ x * w * u, x=TRUE, y=TRUE)
gIndex(f, fun=plogis, funlabel='Prob[y=1]')

# Manual calculation of combined main effect + interaction effect of
# sex in a 2x2 design with treatments A B, sexes F M,
# model -.1 + .3*(treat=='B') + .5*(sex=='M') + .4*(treat=='B' & sex=='M')
set.seed(1)
X <- expand.grid(treat=c('A','B'), sex=c('F', 'M'))
a <- 3; b <- 7; c <- 13; d <- 5
X <- rbind(X[rep(1, a),], X[rep(2, b),], X[rep(3, c),], X[rep(4, d),])
y <- with(X, -.1 + .3*(treat=='B') + .5*(sex=='M') + .4*(treat=='B' & sex=='M'))
f <- ols(y ~ treat*sex, data=X, x=TRUE)
gIndex(f, type='cterms')
k <- coef(f)
b1 <- k[2]; b2 <- k[3]; b3 <- k[4]
n <- nrow(X)
( (a+b)*c*abs(b2) + (a+b)*d*abs(b2+b3) + c*d*abs(b3) ) / (n*(n-1)/2)

# Manual calculation for combined age effect in a model with sex,
# age, and age*sex interaction
a <- 13; b <- 7
sex  <- c(rep('female',a), rep('male',b))
agef <- round(runif(a, 20, 30))
agem <- round(runif(b, 20, 40))
age  <- c(agef, agem)
y <- (sex=='male') + age/10 - (sex=='male')*age/20
f <- ols(y ~ sex*age, x=TRUE)
f
gIndex(f, type='cterms')
k <- coef(f)
b1 <- k[2]; b2 <- k[3]; b3 <- k[4]
n <- a + b
sp <- function(w, z=w) sum(outer(w, z, function(u, v) abs(u-v)))
( abs(b2)*sp(agef) + abs(b2+b3)*sp(agem) + 2*sp(b2*agef, (b2+b3)*agem) ) /
  (n*(n-1))
( abs(b2)*GiniMd(agef)*a*(a-1) + abs(b2+b3)*GiniMd(agem)*b*(b-1) +
  2*sp(b2*agef, (b2+b3)*agem) ) / (n*(n-1))

## Not run:
# Compare partial and total g-indexes over many random fits
plot(NA, NA, xlim=c(0,3), ylim=c(0,3), xlab='Global',
     ylab='x1 (black)  x2 (red)  x3 (green)  x4 (blue)')
abline(a=0, b=1, col=gray(.9))
big <- integer(3)
n <- 50   # try with n=7 - see lots of exceptions esp. for interacting var
for(i in 1:100) {
  x1 <- runif(n)
  x2 <- runif(n)
  x3 <- runif(n)
  x4 <- runif(n)
  y <- x1 + x2 + x3 + x4 + 2*runif(n)
  f <- ols(y ~ x1*x2+x3+x4, x=TRUE)
  # f <- ols(y ~ x1+x2+x3+x4, x=TRUE)   # also try this
  w <- gIndex(f)[,1]
  gt <- w['Total']
  points(gt, w['x1, x2'])
  points(gt, w['x3'], col='green')
  points(gt, w['x4'], col='blue')
  big[1] <- big[1] + (w['x1, x2'] > gt)
  big[2] <- big[2] + (w['x3'] > gt)
  big[3] <- big[3] + (w['x4'] > gt)
}
print(big)
## End(Not run)

options(datadist=NULL)
This function saves rms
attributes with the fit object so that
anova.rms
, Predict
, etc. can be used just as with ols
and other fits. No validate
or calibrate
methods exist for
Glm
though.
Glm(formula, family = gaussian, data = environment(formula), weights,
    subset, na.action = na.delete, start = NULL, offset = NULL,
    control = glm.control(...), model = TRUE, method = "glm.fit",
    x = FALSE, y = TRUE, contrasts = NULL, ...)
formula, family, data, weights, subset, na.action, start, offset, control, model, method, x, y, contrasts |
see stats::glm() |
... |
ignored |
For the print
method, format of output is controlled by the user
previously running options(prType="lang")
where lang
is
"plain"
(the default), "latex"
, or "html"
.
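A sketch of selecting the print format (not from the original examples; f is a hypothetical Glm fit):

options(prType='html')   # or 'latex'; 'plain' is the default
f                        # the print method renders in the chosen format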
a fit object like that produced by stats::glm()
but with
rms
attributes and a class
of "rms"
, "Glm"
,
"glm"
, and "lm"
. The g
element of the fit object is
the g-index.
stats::glm(), Hmisc::GiniMd(), prModFit(), stats::residuals.glm
## Dobson (1990) Page 93: Randomized Controlled Trial:
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
f <- glm(counts ~ outcome + treatment, family=poisson())
f
anova(f)
summary(f)
f <- Glm(counts ~ outcome + treatment, family=poisson())
# could have had rcs( ) etc. if there were continuous predictors
f
anova(f)
summary(f, outcome=c('1','2','3'), treatment=c('1','2','3'))
This function fits a linear model using generalized least
squares. The errors are allowed to be correlated and/or have unequal
variances. Gls
is a slightly enhanced version of the
Pinheiro and Bates gls
function in the nlme
package to
make it easy to use with the rms package and to implement cluster
bootstrapping (primarily for nonparametric estimates of the
variance-covariance matrix of the parameter estimates and for
nonparametric confidence limits of correlation parameters).
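A sketch of the cluster bootstrap (not from the original examples; data frame d with time tim, treatment trt, and subject identifier id are hypothetical names):

f <- Gls(y ~ pol(tim,2) + trt, correlation=corCAR1(form= ~tim | id),
         data=d, B=100)   # refit on 100 resamples of whole subjects
f$var                     # nonparametric bootstrap covariance matrix
f$boot.Corr               # bootstrapped correlation parameters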
For the print
method, format of output is controlled by the
user previously running options(prType="lang")
where
lang
is "plain"
(the default), "latex"
, or
"html"
. When using html with Quarto or RMarkdown,
results='asis'
need not be written in the chunk header.
Gls(model, data, correlation, weights, subset, method, na.action=na.omit,
    control, verbose, B=0, dupCluster=FALSE, pr=FALSE, x=FALSE)

## S3 method for class 'Gls'
print(x, digits=4, coefs=TRUE, title, ...)
model |
a two-sided linear formula object describing the
model, with the response on the left of a |
data |
an optional data frame containing the variables named in
|
correlation |
an optional |
weights |
an optional |
subset |
an optional expression indicating which subset of the rows of
|
method |
a character string. If |
na.action |
a function that indicates what should happen when the
data contain |
control |
a list of control values for the estimation algorithm to
replace the default values returned by the function |
verbose |
an optional logical value. If |
B |
number of bootstrap resamples to fit and store, default is none |
dupCluster |
set to |
pr |
set to |
x |
for |
digits |
number of significant digits to print |
coefs |
specify |
title |
a character string title to be passed to |
... |
ignored |
The na.delete
function will not work with
Gls
due to some nuance in the model.frame.default
function. This probably relates to na.delete
storing extra
information in the "na.action"
attribute of the returned data
frame.
an object of classes Gls
, rms
, and gls
representing the linear model
fit. Generic functions such as print
, plot
,
ggplot
, and summary
have methods to show the results of
the fit. See
glsObject
for the components of the fit. The functions
resid
, coef
, and fitted
can be used to extract
some of its components. Gls
returns the following components
not returned by gls
: Design
, assign
,
formula
(see arguments), B
(see
arguments), bootCoef
(matrix of B
bootstrapped
coefficients), boot.Corr
(vector of bootstrapped correlation
parameters), Nboot
(vector of total sample sizes used in each
bootstrap; may vary if there are unbalanced clusters), and var
(sample variance-covariance matrix of bootstrapped coefficients). The
g-index is also stored in the returned object under the name
"g"
.
Jose Pinheiro, Douglas Bates [email protected], Saikat DebRoy, Deepayan Sarkar, R-core [email protected], Frank Harrell [email protected], Patrick Aboyoun
Pinheiro J, Bates D (2000): Mixed effects models in S and S-Plus. New York: Springer-Verlag.
gls, glsControl, glsObject, varFunc, corClasses, varClasses, GiniMd, prModFit, logLik.Gls
## Not run:
require(ggplot2)
ns  <- 20   # no. subjects
nt  <- 10   # no. time points/subject
B   <- 10   # no. bootstrap resamples
            # usually do 100 for variances, 1000 for nonparametric CLs
rho <- .5   # AR(1) correlation parameter
V <- matrix(0, nrow=nt, ncol=nt)
V <- rho^abs(row(V)-col(V))   # per-subject correlation/covariance matrix

d <- expand.grid(tim=1:nt, id=1:ns)
d$trt <- factor(ifelse(d$id <= ns/2, 'a', 'b'))
true.beta <- c(Intercept=0,tim=.1,'tim^2'=0,'trt=b'=1)
d$ey <- true.beta['Intercept'] + true.beta['tim']*d$tim +
  true.beta['tim^2']*(d$tim^2) + true.beta['trt=b']*(d$trt=='b')
set.seed(13)
library(MASS)   # needed for mvrnorm
d$y <- d$ey + as.vector(t(mvrnorm(n=ns, mu=rep(0,nt), Sigma=V)))

dd <- datadist(d); options(datadist='dd')
f <- Gls(y ~ pol(tim,2) + trt, correlation=corCAR1(form= ~tim | id),
         data=d, B=B)
f
AIC(f)
f$var      # bootstrap variances
f$varBeta  # original variances
summary(f)
anova(f)
ggplot(Predict(f, tim, trt))
# v <- Variogram(f, form=~tim|id, data=d)
nlme:::summary.gls(f)$tTable   # print matrix of estimates etc.

options(datadist=NULL)
## End(Not run)
Function to divide x
(e.g. age, or predicted survival at time
u
created by survest
) into g
quantile groups, get
Kaplan-Meier estimates at time u
(a scalar), and to return a
matrix with columns x
=mean x
in quantile, n
=number
of subjects, events
=no. events, and KM
=K-M survival at
time u
, std.err
= s.e. of -log K-M. Confidence intervals
are based on -log S(t). Instead of supplying g
, the user can
supply the minimum number of subjects to have in the quantile group
(m
, default=50). If cuts
is given
(e.g. cuts=c(0,.1,.2,...,.9,1)
), it overrides m
and
g
. Calls Therneau's survfitKM
in the survival
package to get Kaplan-Meier estimates and standard errors.
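A sketch of the two ways of defining groups (not from the original examples; age, d.time, and event indicator e are hypothetical names):

require(survival)
groupkm(age, Surv(d.time, e), m=100, u=5)                # >= 100 subjects/group
groupkm(age, Surv(d.time, e), cuts=c(20,40,60,80), u=5)  # explicit cuts in age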
groupkm(x, Srv, m=50, g, cuts, u, pl=FALSE, loglog=FALSE, conf.int=.95,
        xlab, ylab, lty=1, add=FALSE, cex.subtitle=.7, ...)
x |
variable to stratify |
Srv |
a |
m |
desired minimum number of observations in a group |
g |
number of quantile groups |
cuts |
actual cuts in |
u |
time for which to estimate survival |
pl |
TRUE to plot results |
loglog |
set to |
conf.int |
defaults to |
xlab |
if |
ylab |
if |
lty |
line type for primary line connecting estimates |
add |
set to |
cex.subtitle |
character size for subtitle. Default is |
... |
plotting parameters to pass to the plot and errbar functions |
matrix with columns named x
(mean predictor value in interval), n
(sample size
in interval), events
(number of events in interval), KM
(Kaplan-Meier
estimate), std.err
(standard error of -log KM
)
survfit, errbar, cut2, Surv, units
require(survival)
n <- 1000
set.seed(731)
age <- 50 + 12*rnorm(n)
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50))
d.time <- -log(runif(n))/h
label(d.time) <- 'Follow-up Time'
e <- ifelse(d.time <= cens,1,0)
d.time <- pmin(d.time, cens)
units(d.time) <- "Year"
groupkm(age, Surv(d.time, e), g=10, u=5, pl=TRUE)
# Plot 5-year K-M survival estimates and 0.95 confidence bars by
# decile of age.  If omit g=10, will have >= 50 obs./group.
The hazard.ratio.plot
function repeatedly estimates Cox
regression coefficients and confidence limits within time intervals.
The log hazard ratios are plotted against the mean failure/censoring
time within the interval. Unless times
is specified, the number of
time intervals will be max(round(d/e), 2), where d is the
total number
of events in the sample. Efron's likelihood is used for estimating
Cox regression coefficients (using
coxph.fit
). In the case of
tied failure times, some intervals may have a point in common.
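A sketch of the two ways of defining intervals (not from the original examples; age, d.time, and event indicator e are hypothetical names):

require(survival)
hazard.ratio.plot(age, Surv(d.time, e), e=20)              # ~20 events/interval
hazard.ratio.plot(age, Surv(d.time, e), times=c(1,2,4,8))  # explicit endpoints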
hazard.ratio.plot(x, Srv, which, times=, e=30, subset, conf.int=.95,
                  legendloc=NULL, smooth=TRUE, pr=FALSE, pl=TRUE,
                  add=FALSE, ylim, cex=.5, xlab="t", ylab,
                  antilog=FALSE, ...)
x |
a vector or matrix of predictors |
Srv |
a |
which |
a vector of column numbers of |
times |
optional vector of time interval endpoints.
Example: |
e |
number of events per time interval if times not given |
subset |
vector used for subsetting the entire analysis,
e.g. |
conf.int |
confidence interval coverage |
legendloc |
location for legend. Omit to use mouse, |
smooth |
also plot the super–smoothed version of the log hazard ratios |
pr |
defaults to |
pl |
defaults to |
add |
add this plot to an already existing plot |
ylim |
vector of |
cex |
character size for legend information, default is 0.5 |
xlab |
label for |
ylab |
label for |
antilog |
default is |
... |
optional graphical parameters |
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
cox.zph, residuals.cph, survival-internal, cph, coxph, Surv
require(survival)
n <- 500
set.seed(1)
age <- 50 + 12*rnorm(n)
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50))
d.time <- -log(runif(n))/h
label(d.time) <- 'Follow-up Time'
e <- ifelse(d.time <= cens, 1, 0)
d.time <- pmin(d.time, cens)
units(d.time) <- "Year"
hazard.ratio.plot(age, Surv(d.time,e), e=20, legendloc='ll')
Creates several new variables which help set up a dataset for modeling
with cph
or coxph
when there is a single binary time-dependent
covariable which turns on at a given time, and stays on. This is
typical when analyzing the impact of an intervening event.
ie.setup creates a Surv object using the start time, stop time format. It also creates a binary indicator for the intervening event, and a variable called subs that is useful when attaching a data frame. subs has observation numbers duplicated for subjects having an intervening event, so those subjects' baseline covariables (that are not time-dependent) can be duplicated correctly.
ie.setup(failure.time, event, ie.time, break.ties=FALSE)
failure.time | a numeric variable containing the event or censoring times for the terminating event
event | a binary (0/1) variable specifying whether observations had the terminating event (event=1) or were censored (event=0)
ie.time | intervening event times. For subjects having no intervening events, the corresponding values of ie.time must be NA.
break.ties | Occasionally intervening events are recorded as happening at exactly the same time as the termination of follow-up for some subjects. The start/stop Surv setup will not allow this. Set break.ties=TRUE to break such ties by adding a small random quantity to the tied ie.time values.
a list with components S, ie.status, subs, reps. S is a Surv object containing start and stop times for intervals of observation, along with event indicators. ie.status is one if the intervening event has occurred at the start of the interval, zero otherwise. subs is a vector of subscripts that can be used to replicate other variables the same way S was replicated. reps specifies how many times each original observation was replicated. S, ie.status, and subs are all the same length (S has that many rows) and are longer than the original failure.time vector; reps is the same length as the original failure.time vector.
The subs
vector is suitable for passing to validate.lrm
or calibrate
,
which pass this vector under the name cluster
on to predab.resample
so that bootstrapping can be
done by sampling with replacement from the original subjects rather than
from the individual records created by ie.setup
.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
cph, coxph, Surv, cr.setup, predab.resample
failure.time <- c(1 , 2, 3)
event        <- c(1 , 1, 0)
ie.time      <- c(NA, 1.5, 2.5)
z <- ie.setup(failure.time, event, ie.time)
S <- z$S
S
ie.status <- z$ie.status
ie.status
z$subs
z$reps
## Not run: 
attach(input.data.frame[z$subs,])   #replicates all variables
f <- cph(S ~ age + sex + ie.status)
# Instead of duplicating rows of data frame, could do this:
attach(input.data.frame)
z <- ie.setup(failure.time, event, ie.time)
s <- z$subs
age <- age[s]
sex <- sex[s]
f <- cph(S ~ age + sex + ie.status)
## End(Not run)
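A hedged sketch of the clustered bootstrap described above (the fit f is hypothetical and B is illustrative; z is from the example):

## Not run: 
# f is a fit made on the records replicated by ie.setup, with x=TRUE, y=TRUE
validate(f, B=100, cluster=z$subs)   # resample original subjects, not records
## End(Not run)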
Checks the impact of the proportional odds (PO) assumption by comparing predicted cell probabilities from a PO model with those from a multinomial or partial proportional odds logistic model that relaxes the assumption. For a given model formula, fits the model with lrm and with either nnet::multinom, VGAM::vglm, or both, and obtains predicted cell probabilities for the PO and relaxed models on the newdata data frame. A print method formats the output.
impactPO(formula, relax = if (missing(nonpo)) "multinomial" else "both",
         nonpo, newdata, data = environment(formula), minfreq = 15, B = 0, ...)
formula | a model formula. To work properly with multinom, the terms should have completely specified knot locations if a spline function is used.
relax | defaults to "multinomial" if nonpo is missing, otherwise "both"
nonpo | a formula with no left hand side variable, specifying the variable or variables for which PO is not assumed. Specifying nonpo results in a relaxed fit that is a partial PO model fitted with VGAM::vglm.
newdata | a data frame or data table with one row per covariate setting for which predictions are to be made
data | data frame containing variables to fit; default is the frame in which formula is found
minfreq | minimum sample size to allow for the least frequent category of the dependent variable. If the observed minimum frequency is less than this, the Hmisc::combine.levels() function is called to combine consecutive levels of Y (see Details).
B | number of bootstrap resamples to do to get confidence intervals for differences in predicted probabilities for relaxed methods vs. PO model fits. Default is not to run the bootstrap. When running the bootstrap make sure that all model variables are explicitly in data=.
... | other parameters to pass to lrm and the relaxed-model fitting functions
Since partial proportional odds models and especially multinomial logistic models can have many parameters, it is not feasible to use this model comparison approach when the number of levels of the dependent variable Y is large. By default, the function will use Hmisc::combine.levels()
to combine consecutive levels if the lowest frequency category of Y has fewer than minfreq
observations.
an impactPO object, which is a list with elements estimates, stats, mad, newdata, nboot, and boot. estimates is a data frame containing the variables and values in newdata in a tall and thin format, with additional variables method ("PO", "Multinomial", "PPO"), y (current level of the dependent variable), and Probability (predicted cell probability for the covariate values and the value of y in the current row). stats is a data frame containing the following: Deviance, the model deviance; d.f., the total number of parameters counting intercepts; AIC; p, the number of regression coefficients; LR chi^2, the likelihood ratio chi-square statistic for testing the predictors; LR - p, a chance-corrected LR chi-square; LR chi^2 test for PO, the likelihood ratio chi-square test statistic for testing the PO assumption (by comparing -2 log likelihood for a relaxed model to that of a fully PO model); d.f., the degrees of freedom for this test; Pr(>chi^2), the P-value for this test; MCS R2, the Maddala-Cox-Snell R2 using the actual sample size; MCS R2 adj, MCS R2 adjusted for estimating p regression coefficients by subtracting p from LR; McFadden R2; McFadden R2 adj, an AIC-like adjustment proposed by McFadden without full justification; and Mean |difference| from PO, the overall mean absolute difference between predicted probabilities over all categories of Y and over all covariate settings. mad contains newdata and, separately by rows in newdata, the mean absolute difference (over Y categories) between estimated probabilities by the indicated relaxed model and those from the PO model. nboot is the number of successful bootstrap repetitions, and boot is a 4-way array with dimensions represented by the nboot resamples, the number of rows in newdata, the number of outcome levels, and elements for PPO and multinomial. For the modifications of the Maddala-Cox-Snell indexes see Hmisc::R2Measures.
Frank Harrell [email protected]
nnet::multinom(), VGAM::vglm(), lrm(), Hmisc::propsPO(), Hmisc::R2Measures(), Hmisc::combine.levels()
## Not run: 
set.seed(1)
age <- rnorm(500, 50, 10)
sex <- sample(c('female', 'male'), 500, TRUE)
y   <- sample(0:4, 500, TRUE)
d   <- expand.grid(age=50, sex=c('female', 'male'))
w   <- impactPO(y ~ age + sex, nonpo = ~ sex, newdata=d)
w
# Note that PO model is a better model than multinomial (lower AIC)
# since multinomial model's improvement in fit is low in comparison
# with number of additional parameters estimated.  Same for PO model
# in comparison with partial PO model.

# Reverse levels of y so stacked bars have higher y located higher
revo <- function(z) {
  z <- as.factor(z)
  factor(z, levels=rev(levels(as.factor(z))))
}
require(ggplot2)
ggplot(w$estimates, aes(x=method, y=Probability, fill=revo(y))) +
  facet_wrap(~ sex) + geom_col() +
  xlab('') + guides(fill=guide_legend(title=''))

# Now vary 2 predictors
d <- expand.grid(sex=c('female', 'male'), age=c(40, 60))
w <- impactPO(y ~ age + sex, nonpo = ~ sex, newdata=d)
w
ggplot(w$estimates, aes(x=method, y=Probability, fill=revo(y))) +
  facet_grid(age ~ sex) + geom_col() +
  xlab('') + guides(fill=guide_legend(title=''))
## End(Not run)
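The example above does not exercise the bootstrap; here is a hedged sketch (B and the data frame name dat are illustrative, and data= is given explicitly as the B argument note advises):

## Not run: 
dat <- data.frame(y, age, sex)
w <- impactPO(y ~ age + sex, nonpo = ~ sex, newdata=d, data=dat, B=300)
dim(w$boot)   # nboot x rows of newdata x outcome levels x (PPO, multinomial)
## End(Not run)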
Surv
and ggplot
are imported from, respectively, the
survival
and ggplot2
packages and are exported from
rms
so that the user does not have to attach these packages to do
simple things.
Surv(time, time2, event,
     type = c("right", "left", "interval", "counting", "interval2", "mstate"),
     origin = 0)
ggplot(data = NULL, mapping = aes(), ..., environment = parent.frame())
time, time2, event, type, origin | see survival::Surv
data, mapping, ..., environment | see ggplot2::ggplot
see documentation in the original packages
## Not run: 
f <- psm(Surv(dtime, death) ~ x1 + x2 + sex + race, dist='gau')
ggplot(Predict(f))
## End(Not run)
Creates a file containing a LaTeX representation of the fitted model.
## S3 method for class 'cph'
latex(object, title, file='', append=FALSE, surv=TRUE, maxt=FALSE, which=NULL,
      varnames, columns=65, inline=FALSE, before=if(inline)"" else "& &",
      after="", dec=3, pretrans=TRUE, caption, digits=.Options$digits,
      size="", ...)   # for cph fit

## S3 method for class 'lrm'
latex(object, title, file, append, which, varnames, columns, inline, before,
      after, pretrans, caption, digits=.Options$digits, size="", ...)  # for lrm fit

## S3 method for class 'ols'
latex(object, title, file, append, which, varnames, columns, inline, before,
      after, pretrans, caption, digits=.Options$digits, size="", ...)  # ols fit

## S3 method for class 'orm'
latex(object, title, file, append, which, varnames, columns, inline, before,
      after, pretrans, caption, digits=.Options$digits, size="",
      intercepts=nrp < 10, ...)  # for orm fit

## S3 method for class 'pphsm'
latex(object, title, file, append, which=NULL, varnames, columns, inline,
      before, after, pretrans, caption, digits=.Options$digits, size="", ...)  # pphsm fit

## S3 method for class 'psm'
latex(object, title, file, append, which=NULL, varnames, columns, inline,
      before, after, pretrans, caption, digits=.Options$digits, size="", ...)  # psm fit
object | a fit object created by a rms fitting function
title | ignored
file, append | see latex.default; the defaults write to the console (file='') without appending
surv | if surv=TRUE was specified to cph, a table of the underlying survival estimates stored with the fit will also be typeset
maxt | if maxt=TRUE and the maximum follow-up time in the data (object$maxtime) exceeds the last time in the stored survival summary, underlying survival estimates at object$maxtime will be added to the table
which, varnames, columns, inline, before, dec, pretrans | see latexrms
after | if not an empty string, added to end of markup if inline=TRUE
caption | a character string specifying a title for the equation to be centered and typeset in bold face. Default is no title.
digits | see latexrms
size | a LaTeX size to use, without the slash. Default is the prevailing size.
intercepts | for orm fits. Default is to print the intercepts only if there are fewer than 10 of them; set to TRUE or FALSE to override.
... | ignored
the name of the created file, with class c("latex","file"). This object works with latex viewing and printing commands in Hmisc. If file='' and options(prType=x) is in effect, where x is "html", "markdown", or "md", the result is run through knitr::asis_output so that it will be rendered correctly no matter which options are in effect in the chunk header.
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
latexrms, rcspline.restate, latex
## Not run: 
require(survival)
units(ftime) <- "Day"
f <- cph(Surv(ftime, death) ~ rcs(age)+sex, surv=TRUE, time.inc=60)
w <- latex(f, file='f.tex')   #Interprets fitted model and makes table of S0(t)
                              #for t=0,60,120,180,...
w          #displays image, if viewer installed and file given above
latex(f)   # send LaTeX code to the console for knitr
options(prType='html')
latex(f)   # for use with knitr and R Markdown/Quarto using MathJax
## End(Not run)
Creates a file containing a LaTeX representation of the fitted model. For
model-specific typesetting there is latex.lrm
, latex.cph
,
latex.psm
and latex.ols
. latex.cph
has some
arguments that are specific to cph
models.
latexrms is the core function, called internally by the model-specific methods (latex.cph, latex.ols, etc.). html and R Markdown-compatible markup (using MathJax) is written if options(prType='html').
latexrms(object, file='', append=FALSE, which=1:p, varnames, columns=65,
         prefix=NULL, inline=FALSE, before=if(inline)"" else "& &", after="",
         intercept, pretrans=TRUE, digits=.Options$digits, size="")
object | a fit object created by a fitting function in the rms series
file | name of .tex file to create; the default ('') writes to the console
append | whether or not to append to an existing file
which | a vector of subscripts (corresponding to object$Design$name) specifying a submodel to print. Default is to describe the whole model.
varnames | variable names to substitute for non-interactions. Order must correspond to which.
columns | maximum number of columns of printing characters to allow before outputting a LaTeX newline command
prefix | if given, a LaTeX \lefteqn command of the form \lefteqn{prefix =} will be inserted to print a left-hand side of the equation
inline | Set to TRUE to create text for insertion in an in-line equation rather than a display equation
before | a character string to place before each line of output. Use the default for a LaTeX eqnarray environment.
after | a character string to place after the output if inline=TRUE
intercept | a special intercept value to include that is not part of the standard model parameters (e.g., centering constant in Cox model). Only allowed in the cph context.
pretrans | if any spline or polynomial-expanded variables are themselves transformed, a table of pre-transformations will be formed unless pretrans=FALSE
digits | number of digits of precision to use in formatting coefficients and other numbers
size | a LaTeX font size to use for the output, without the slash. Default is current size.
latexrms
returns a character vector if file=''
,
otherwise writes the output to file
. For particular model
fits, the latex
method returns the result of running
knitr::asis_output
on the LaTeX or HTML code if file=''
,
options(prType)
was set but not to 'plain'
, and if
knitr
is currently running. This causes correct output to be
rendered whether or not results='asis'
appeared in the R
Markdown or Quarto chunk header.
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
## Not run: 
f <- lrm(death ~ rcs(age)+sex)
w <- latex(f, file='f.tex')
w          # displays, using e.g. xdvi
latex(f)   # send LaTeX code to console, as for knitr
options(prType='html')
latex(f)   # emit html and latex for knitr html and html notebooks
## End(Not run)
Fit binary and proportional odds ordinal
logistic regression models using maximum likelihood estimation or
penalized maximum likelihood estimation. See cr.setup
for how to
fit forward continuation ratio models with lrm
.
For the print
method, format of output is controlled by the
user previously running options(prType="lang")
where
lang
is "plain"
(the default), "latex"
, or
"html"
. When using html with Quarto or RMarkdown,
results='asis'
need not be written in the chunk header.
lrm(formula, data=environment(formula),
    subset, na.action=na.delete, method="lrm.fit",
    model=FALSE, x=FALSE, y=FALSE, linear.predictors=TRUE, se.fit=FALSE,
    penalty=0, penalty.matrix, tol=1e-7, strata.penalty=0,
    var.penalty=c('simple','sandwich'), weights, normwt, scale=FALSE, ...)

## S3 method for class 'lrm'
print(x, digits=4, r2=c(0,2,4), strata.coefs=FALSE,
      coefs=TRUE, pg=FALSE, title='Logistic Regression Model', ...)
formula | a formula object. An offset term can be included.
data | data frame to use. Default is the current frame.
subset | logical expression or vector of subscripts defining a subset of observations to analyze
na.action | function to handle NAs in the data. Default is na.delete, which deletes any observation having response or predictor missing, while preserving the attributes of the predictors and maintaining frequencies of deletions due to each variable in the model.
method | name of fitting function. Only allowable choice at present is lrm.fit.
model | causes the model frame to be returned in the fit object
x | causes the expanded design matrix (with missings excluded) to be returned under the name x
y | causes the response variable (with missings excluded) to be returned under the name y
linear.predictors | causes the predicted X beta (with missings excluded) to be returned under the name linear.predictors
se.fit | causes the standard errors of the fitted values to be returned under the name se.fit
penalty | The penalty factor subtracted from the log likelihood is 0.5 penalty beta' P beta, where beta is the vector of non-intercept regression coefficients and P is penalty.matrix. To penalize different types of terms differently, penalty may instead be a list with elements such as simple and nonlinear (see the penalized-MLE example below).
penalty.matrix | specifies the symmetric penalty matrix for non-intercept terms. The default matrix for continuous predictors has the variance of the columns of the design matrix in its diagonal elements so that the penalty to the log likelihood is unitless. For main effects of categorical predictors with k levels, the matrix penalizes the sum of squares of the k-1 coefficients about their mean.
tol | singularity criterion (see lrm.fit)
strata.penalty | scalar penalty factor for the stratification factor, for the experimental strat variable
var.penalty | the type of variance-covariance matrix to be stored in the var component of the fit when penalization is used: 'simple' (the default) stores the inverse of the penalized information matrix; 'sandwich' stores a sandwich-type estimator.
weights | a vector (same length as y) of possibly fractional case weights
normwt | set to TRUE to scale weights so they sum to the number of observations; useful for sample weights as opposed to frequency weights
scale | set to TRUE to center and scale columns of the design matrix before fitting, back-solving afterwards for the coefficients and covariance matrix on the original scale
... | arguments that are passed to lrm.fit
digits | number of significant digits to use
r2 | vector of integers specifying which R^2 measures to print, with 0 for Nagelkerke R^2 and 1:4 corresponding to the 4 measures computed by Hmisc::R2Measures. Default is c(0,2,4).
strata.coefs | set to TRUE to print the (experimental) strata coefficients
coefs | specify coefs=FALSE to suppress printing the table of model coefficients and standard errors, or coefs=n to print only the first n regression coefficients
pg | set to TRUE to print g-indexes
title | a character string title to be passed to prModFit
The returned fit object of lrm
contains the following components
in addition to the ones mentioned under the optional arguments.
call | calling expression
freq | table of frequencies for Y in order of increasing Y
stats | vector with the following elements: number of observations used in the fit, maximum absolute value of first derivative of log likelihood, model likelihood ratio chi-square, d.f., P-value, c index (area under the ROC curve), Somers' Dxy, Goodman-Kruskal gamma and Kendall's tau-a rank correlations between predicted probabilities and observed response, the Nagelkerke R^2 index, the Brier score, and the g-indexes. In the case of penalized estimation, the model likelihood ratio chi-square is computed without the penalty, and d.f. is the effective d.f.
fail | set to TRUE if convergence failed
coefficients | estimated parameters
var | estimated variance-covariance matrix (inverse of information matrix). If penalization is used, var is as described under the var.penalty argument.
effective.df.diagonal | is returned if penalty>0; the vector whose sum is the effective d.f. of the model (counting intercept terms)
u | vector of first derivatives of log-likelihood
deviance | -2 log likelihoods (counting penalty components). When an offset variable is present, three deviances are computed: for intercept(s) only, for intercepts+offset, and for intercepts+offset+predictors. When there is no offset variable, the vector contains deviances for the intercept(s)-only model and the model with intercept(s) and predictors.
est | vector of column numbers of X fitted (intercepts are not counted)
non.slopes | number of intercepts in model
penalty | see above
penalty.matrix | the penalty matrix actually used in the estimation
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
Le Cessie S, Van Houwelingen JC: Ridge estimators in logistic regression. Applied Statistics 41:191–201, 1992.
Verweij PJM, Van Houwelingen JC: Penalized likelihood in Cox regression. Stat in Med 13:2427–2436, 1994.
Gray RJ: Flexible methods for analyzing survival data using splines, with applications to breast cancer prognosis. JASA 87:942–951, 1992.
Shao J: Linear model selection by cross-validation. JASA 88:486–494, 1993.
Verweij PJM, Van Houwelingen JC: Crossvalidation in survival analysis. Stat in Med 12:2305–2314, 1993.
Harrell FE: Model uncertainty, penalization, and parsimony. ISCB Presentation on UVa Web page, 1998.
lrm.fit, predict.lrm, rms.trans, rms, glm, latex.lrm, residuals.lrm, na.delete, na.detail.response, pentrace, rmsMisc, vif, cr.setup, predab.resample, validate.lrm, calibrate, Mean.lrm, gIndex, prModFit
#Fit a logistic model containing predictors age, blood.pressure, sex
#and cholesterol, with age fitted with a smooth 5-knot restricted cubic
#spline function and a different shape of the age relationship for males
#and females.  As an intermediate step, predict mean cholesterol from
#age using a proportional odds ordinal logistic model
#
require(ggplot2)
n <- 1000    # define sample size
set.seed(17) # so can reproduce the results
age            <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
cholesterol    <- rnorm(n, 200, 25)
sex            <- factor(sample(c('female','male'), n, TRUE))
label(age)            <- 'Age'               # label is in Hmisc
label(cholesterol)    <- 'Total Cholesterol'
label(blood.pressure) <- 'Systolic Blood Pressure'
label(sex)            <- 'Sex'
units(cholesterol)    <- 'mg/dl'             # uses units.default in Hmisc
units(blood.pressure) <- 'mmHg'

#To use prop. odds model, avoid using a huge number of intercepts by
#grouping cholesterol into 40-tiles
ch <- cut2(cholesterol, g=40, levels.mean=TRUE)  # use mean values in intervals
table(ch)
f <- lrm(ch ~ age)
options(prType='latex')
print(f, coefs=4)   # write latex code to console
m <- Mean(f)        # see help file for Mean.lrm
d <- data.frame(age=seq(0, 90, by=10))
m(predict(f, d))
# Repeat using ols
f <- ols(cholesterol ~ age)
predict(f, d)

# Specify population model for log odds that Y=1
L <- .4*(sex=='male') + .045*(age-50) +
     (log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)
cholesterol[1:3] <- NA   # 3 missings, at random

ddist <- datadist(age, blood.pressure, cholesterol, sex)
options(datadist='ddist')

fit <- lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)),
           x=TRUE, y=TRUE)
# x=TRUE, y=TRUE allows use of resid(), which.influence below
# could define d <- datadist(fit) after lrm(), but data distribution
# summary would not be stored with fit, so later uses of Predict
# or summary.rms would require access to the original dataset or
# d or specifying all variable values to summary, Predict, nomogram
anova(fit)
p <- Predict(fit, age, sex)
ggplot(p)   # or plot()
ggplot(Predict(fit, age=20:70, sex="male"))   # need if datadist not used
print(cbind(resid(fit,"dfbetas"), resid(fit,"dffits"))[1:20,])
which.influence(fit, .3)
# latex(fit)   #print nice statement of fitted model
#
#Repeat this fit using penalized MLE, penalizing complex terms
#(for nonlinear or interaction effects)
#
fitp <- update(fit, penalty=list(simple=0, nonlinear=10), x=TRUE, y=TRUE)
effective.df(fitp)
# or lrm(y ~ ..., penalty=...)

#Get fits for a variety of penalties and assess predictive accuracy
#in a new data set.  Program efficiently so that complex design
#matrices are only created once.

set.seed(201)
x1 <- rnorm(500)
x2 <- rnorm(500)
x3 <- sample(0:1, 500, rep=TRUE)
L  <- x1 + abs(x2) + x3
y  <- ifelse(runif(500) <= plogis(L), 1, 0)
new.data <- data.frame(x1, x2, x3, y)[301:500,]
#
for(penlty in seq(0, .15, by=.005)) {
  if(penlty == 0) {
    f <- lrm(y ~ rcs(x1,4) + rcs(x2,6)*x3, subset=1:300, x=TRUE, y=TRUE)
    # True model is linear in x1 and has no interaction
    X <- f$x   # saves time for future runs - don't have to use rcs etc.
    Y <- f$y   # this also deletes rows with NAs (if there were any)
    penalty.matrix <- diag(diag(var(X)))
    Xnew <- predict(f, new.data, type="x")  # expand design matrix for new data
    Ynew <- new.data$y
  } else f <- lrm.fit(X, Y, penalty.matrix=penlty*penalty.matrix)
  #
  cat("\nPenalty :", penlty, "\n")
  pred.logit <- f$coef[1] + (Xnew %*% f$coef[-1])
  pred <- plogis(pred.logit)
  C.index  <- somers2(pred, Ynew)["C"]
  Brier    <- mean((pred - Ynew)^2)
  Deviance <- -2*sum( Ynew*log(pred) + (1-Ynew)*log(1-pred) )
  cat("ROC area:", format(C.index), " Brier score:", format(Brier),
      " -2 Log L:", format(Deviance), "\n")
}
#penalty=0.045 gave lowest -2 Log L, Brier, ROC in test sample for S+
#
#Use bootstrap validation to estimate predictive accuracy of
#logistic models with various penalties
#To see how noisy cross-validation estimates can be, change the
#validate(f, ...) to validate(f, method="cross", B=10) for example.
#You will see tremendous variation in accuracy with minute changes in
#the penalty.  This comes from the error inherent in using 10-fold
#cross validation but also because we are not fixing the splits.
#20-fold cross validation was even worse for some
#indexes because of the small test sample size.  Stability would be
#obtained by using the same sample splits for all penalty values
#(see above), but then we wouldn't be sure that the choice of the
#best penalty is not specific to how the sample was split.  This
#problem is addressed in the last example.
#
penalties <- seq(0, .7, length=3)   # really use by=.02
index <- matrix(NA, nrow=length(penalties), ncol=11,
                dimnames=list(format(penalties),
                  c("Dxy","R2","Intercept","Slope","Emax","D","U","Q","B","g","gp")))
i <- 0
for(penlty in penalties) {
  cat(penlty, "")
  i <- i + 1
  if(penlty == 0) {
    f <- lrm(y ~ rcs(x1,4) + rcs(x2,6)*x3, x=TRUE, y=TRUE)  # fit whole sample
    X <- f$x
    Y <- f$y
    penalty.matrix <- diag(diag(var(X)))   # save time - only do once
  } else f <- lrm(Y ~ X, penalty=penlty,
                  penalty.matrix=penalty.matrix, x=TRUE, y=TRUE)
  val <- validate(f, method="boot", B=20)  # use larger B in practice
  index[i,] <- val[, "index.corrected"]
}
par(mfrow=c(3,3))
for(i in 1:9) {
  plot(penalties, index[,i], xlab="Penalty", ylab=dimnames(index)[[2]][i])
  lines(lowess(penalties, index[,i]))
}
options(datadist=NULL)

# Example of weighted analysis
x <- 1:5
y <- c(0, 1, 0, 1, 0)
reps <- c(1, 2, 3, 2, 1)
lrm(y ~ x, weights=reps)
x <- rep(x, reps)
y <- rep(y, reps)
lrm(y ~ x)   # same as above
#
#Study performance of a modified AIC which uses the effective d.f.
#See Verweij and Van Houwelingen (1994) Eq. (6).  Here AIC=chisq-2*df.
#Also try as effective d.f. equation (4) of the previous reference.
#Also study performance of Shao's cross-validation technique (which was
#designed to pick the "right" set of variables, and uses a much smaller
#training sample than most methods).  Compare cross-validated deviance
#vs. penalty to the gold standard accuracy on a 7500 observation dataset.
#Note that if you only want to get AIC or Schwarz Bayesian information
#criterion, all you need is to invoke the pentrace function.
#NOTE: the effective.df( ) function is used in practice
#
## Not run: 
for(seed in c(339,777,22,111,3)) {
  # study performance for several datasets
  set.seed(seed)
  n <- 175; p <- 8
  X <- matrix(rnorm(n*p), ncol=p)            # p normal(0,1) predictors
  Coef <- c(-.1,.2,-.3,.4,-.5,.6,-.65,.7)    # true population coefficients
  L <- X %*% Coef                            # intercept is zero
  Y <- ifelse(runif(n) <= plogis(L), 1, 0)
  pm <- diag(diag(var(X)))
  #Generate a large validation sample to use as a gold standard
  n.val <- 7500
  X.val <- matrix(rnorm(n.val*p), ncol=p)
  L.val <- X.val %*% Coef
  Y.val <- ifelse(runif(n.val) <= plogis(L.val), 1, 0)
  #
  Penalty <- seq(0, 30, by=1)
  reps <- length(Penalty)
  effective.df <- effective.df2 <- aic <- aic2 <- deviance.val <-
    Lpenalty <- single(reps)
  n.t <- round(n^.75)
  ncv <- c(10,20,30,40)   # try various no. of reps in cross-val.
  deviance <- matrix(NA, nrow=reps, ncol=length(ncv))
  #If model were complex, could have started things off by getting X, Y
  #penalty.matrix from an initial lrm fit to save time
  #
  for(i in 1:reps) {
    pen <- Penalty[i]
    cat(format(pen), "")
    f.full <- lrm.fit(X, Y, penalty.matrix=pen*pm)
    Lpenalty[i] <- pen * t(f.full$coef[-1]) %*% pm %*% f.full$coef[-1]
    f.full.nopenalty <- lrm.fit(X, Y, initial=f.full$coef, maxit=1)
    info.matrix.unpenalized <- solve(f.full.nopenalty$var)
    effective.df[i] <- sum(diag(info.matrix.unpenalized %*% f.full$var)) - 1
    lrchisq <- f.full.nopenalty$stats["Model L.R."]
    # lrm does all this penalty adjustment automatically (for var, d.f.,
    # chi-square)
    aic[i] <- lrchisq - 2*effective.df[i]
    #
    pred <- plogis(f.full$linear.predictors)
    score.matrix <- cbind(1, X) * (Y - pred)
    sum.u.uprime <- t(score.matrix) %*% score.matrix
    effective.df2[i] <- sum(diag(f.full$var %*% sum.u.uprime))
    aic2[i] <- lrchisq - 2*effective.df2[i]
    #
    #Shao suggested averaging 2*n cross-validations, but let's do only 40
    #and stop along the way to see if fewer is OK
    dev <- 0
    for(j in 1:max(ncv)) {
      s    <- sample(1:n, n.t)
      cof  <- lrm.fit(X[s,], Y[s], penalty.matrix=pen*pm)$coef
      pred <- cof[1] + (X[-s,] %*% cof[-1])
      dev  <- dev - 2*sum(Y[-s]*pred + log(1-plogis(pred)))
      for(k in 1:length(ncv)) if(j == ncv[k]) deviance[i,k] <- dev/j
    }
    #
    pred.val <- f.full$coef[1] + (X.val %*% f.full$coef[-1])
    prob.val <- plogis(pred.val)
    deviance.val[i] <- -2*sum(Y.val*pred.val + log(1-prob.val))
  }
  postscript(hor=TRUE)   # along with graphics.off() below, allow plots
  par(mfrow=c(2,4))      # to be printed as they are finished
  plot(Penalty, effective.df, type="l")
  lines(Penalty, effective.df2, lty=2)
  plot(Penalty, Lpenalty, type="l")
  title("Penalty on -2 log L")
  plot(Penalty, aic, type="l")
  lines(Penalty, aic2, lty=2)
  for(k in 1:length(ncv)) {
    plot(Penalty, deviance[,k], ylab="deviance")
    title(paste(ncv[k], "reps"))
    lines(supsmu(Penalty, deviance[,k]))
  }
  plot(Penalty, deviance.val, type="l")
  title("Gold Standard (n=7500)")
  title(sub=format(seed), adj=1, cex=.5)
  graphics.off()
}
## End(Not run)
#The results showed that to obtain a clear picture of the penalty-
#accuracy relationship one needs 30 or 40 reps in the cross-validation.
#For 4 of 5 samples, though, the super smoother was able to detect
#an accurate penalty giving the best (lowest) deviance using 10-fold
#cross-validation.  Cross-validation would have worked better had
#the same splits been used for all penalties.
#The AIC methods worked just as well and are much quicker to compute.
#The first AIC based on the effective d.f. in Gray's Eq. 2.9
#(Verweij and Van Houwelingen (1994) Eq. 5 (note typo)) worked best.
Fits a binary or ordinal logistic model for a given design matrix and response vector with no missing values in either. Ordinary or penalized maximum likelihood estimation is used.
lrm.fit(x, y, offset=0, initial, est, maxit=12, eps=.025, tol=1e-7,
        trace=FALSE, penalty.matrix=NULL, weights=NULL, normwt=FALSE,
        scale=FALSE)
x | design matrix with no column for an intercept
y | response vector, numeric, categorical, or character
offset | optional numeric vector containing an offset on the logit scale
initial | vector of initial parameter estimates, beginning with the intercept
est | indexes of x to fit in the model (default is all columns of x)
maxit | maximum no. iterations (default=12)
eps | difference in -2 log likelihood for declaring convergence; default is .025
tol | singularity criterion. Default is 1e-7.
trace | set to TRUE to print convergence information at each iteration
penalty.matrix | a self-contained ready-to-use penalty matrix - see lrm
weights | a vector (same length as y) of possibly fractional case weights
normwt | set to TRUE to scale weights so they sum to the number of observations; useful for sample weights as opposed to frequency weights
scale | set to TRUE to center and scale columns of x before fitting, back-solving afterwards for the coefficients and covariance matrix on the original scale
a list with the following components:
call | calling expression
freq | table of frequencies for y in order of increasing y
stats | vector with the following elements: number of observations used in the fit, maximum absolute value of first derivative of log likelihood, model likelihood ratio chi-square, d.f., P-value, c index (area under the ROC curve), Somers' Dxy, Goodman-Kruskal gamma and Kendall's tau-a rank correlations between predicted probabilities and observed response, the Nagelkerke R^2 index, the Brier score, and the g-indexes
fail | set to TRUE if convergence failed
coefficients | estimated parameters
var | estimated variance-covariance matrix (inverse of information matrix). Note that in the case of penalized estimation, var is not the improved sandwich-type estimator (which lrm can compute).
u | vector of first derivatives of log-likelihood
deviance | -2 log likelihoods. When an offset variable is present, three deviances are computed: for intercept(s) only, for intercepts+offset, and for intercepts+offset+predictors. When there is no offset variable, the vector contains deviances for the intercept(s)-only model and the model with intercept(s) and predictors.
est | vector of column numbers of X fitted (intercepts are not counted)
non.slopes | number of intercepts in model
penalty.matrix | see above
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
lrm, glm, matinv, solvet, cr.setup, gIndex
#Fit an additive logistic model containing numeric predictors age,
#blood.pressure, and sex, assumed to be already properly coded and
#transformed
#
# fit <- lrm.fit(cbind(age,blood.pressure,sex), death)
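A minimal runnable sketch on simulated data (names and coefficient values are illustrative; coefficients are assumed to be named after the columns of x, and "Model L.R." is the stats element used elsewhere in this documentation):

set.seed(3)
n   <- 300
age <- rnorm(n, 50, 10)
sex <- sample(0:1, n, TRUE)
y   <- ifelse(runif(n) < plogis(-1 + .03*(age - 50) + .5*sex), 1, 0)
f   <- lrm.fit(cbind(age=age, sex=sex), y)
f$coefficients          # intercept and two slopes
f$stats["Model L.R."]   # model likelihood ratio chi-square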
Bare Bones Logistic Regression Fit
lrm.fit.bare(x, y, maxit = 12, eps = 0.025, tol = 1e-07)
x | a vector or matrix of covariate values
y | a numeric or factor vector representing the dependent variable
maxit | maximum number of iterations
eps | stopping criterion (change in -2 log likelihood)
tol | matrix inversion tolerance for singularities
This is a stripped-down version of the lrm.fit() function that computes only the regression coefficients, variance-covariance matrix, and log likelihood (for the null and fitted models) and does not compute any model fit indexes etc. This is for speed in simulations or with bootstrapping. Missing data are not allowed. The function handles binary and ordinal logistic regression (proportional odds model).
a list with elements coefficients, var, fail, freq, deviance
Frank Harrell
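A minimal hedged sketch of the intended use in a simulation loop (the setup and number of repetitions are illustrative; coefficient naming by the columns of x is assumed, as with lrm.fit):

set.seed(1)
n <- 200
x <- cbind(x1=rnorm(n), x2=rnorm(n))
b1 <- numeric(500)
for(i in 1:500) {
  y <- ifelse(runif(n) < plogis(x[, 1]), 1, 0)   # true coefficient of x1 is 1
  f <- lrm.fit.bare(x, y)
  if(! f$fail) b1[i] <- f$coefficients['x1']
}
mean(b1)   # should be near 1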
Update Model LR Statistics After Multiple Imputation
LRupdate(fit, anova)
fit | an rms fit object
anova | the result of processMI(fit, 'anova')
For fits from lrm, orm, cph, or psm that were created using fit.mult.impute with lrt=TRUE (or equivalent options), and for which anova was obtained using processMI(fit, 'anova') to compute imputation-adjusted LR statistics, LRupdate uses the last line of the anova result (containing the overall model LR chi-square) to update Model L.R. in the fit stats component, and to adjust any of the new R-square measures in stats.
For models using Nagelkerke's R-squared, these are set to NA
as they would need to be recomputed with a new intercept-only log-likelihood, which is not computed by anova
. For ols
models, R-squared is left alone as it is sample-size-independent and print.ols
prints the correct adjusted R-squared due to fit.mult.impute
correcting the residual d.f. in stacked fits.
new fit object like fit
but with the substitutions made
Frank Harrell
processMI.fit.mult.impute(), Hmisc::R2Measures()
## Not run: 
a <- aregImpute(~ y + x1 + x2, n.impute=30, data=d)
f <- fit.mult.impute(y ~ x1 + x2, lrm, a, data=d, lrt=TRUE)
a <- processMI(f, 'anova')
f <- LRupdate(f, a)
print(f, r2=1:4)   # print all imputation-corrected R2 measures
## End(Not run)
This function inverts or partially inverts a matrix using pivoting (the sweep operator). It is useful for sequential model-building.
matinv(a, which, negate=TRUE, eps=1e-12)
a | square matrix to invert or partially invert. May have been inverted or partially inverted previously by matinv, in which case its "swept" attribute is updated. Will un-invert if already inverted.
which | vector of column/row numbers in a to invert. Default is all, for total inverse.
negate | So that the algorithm can keep track of which pivots have been swept as well as roundoff errors, it actually returns the negative of the inverse or partial inverse. By default, these elements are negated to give the usual expected result. Set negate=FALSE if you will be passing the result right back into matinv; otherwise, negate the submatrix before sending it back to matinv.
eps | singularity criterion
a square matrix, with attributes "rank" and "swept".
Clarke MRB (1982). Algorithm AS 178: The Gauss-Jordan sweep operator with detection of collinearity. Appl Statist 31:166–9.
Ridout MS, Cobby JM (1989). Algorithm AS R78: A remark on algorithm AS 178: The Gauss-Jordan sweep operator with detection of collinearity. Appl Statist 38:420–2.
a <- diag(1:3)
a.inv1 <- matinv(a, 1, negate=FALSE)          #Invert with respect to a[1,1]
a.inv1
a.inv <- -matinv(a.inv1, 2:3, negate=FALSE)   #Finish the job
a.inv
solve(a)
Draws a partial nomogram that can be used to manually obtain predicted
values from a regression model that was fitted with rms
.
The nomogram does not have lines representing sums, but it has a reference
line for reading scoring points (default range 0–100). Once the reader
manually totals the points, the predicted values can be read at the bottom.
Non-monotonic transformations of continuous variables are handled (scales
wrap around), as
are transformations which have flat sections (tick marks are labeled
with ranges). If interactions are in the model, one variable
is picked as the “axis variable”, and separate axes are constructed for
each level of the interacting factors (preference is given automatically
to using any discrete factors to construct separate axes) and
levels of factors which are indirectly related to interacting
factors (see DETAILS). Thus the nomogram is designed so that only
one axis is actually read for each variable, since the variable
combinations are disjoint. For
categorical interacting factors, the default is to construct axes for
all levels.
The user may specify
coordinates of each predictor to label on its axis, or use default values.
If a factor interacts with other factors, settings for one or more of
the interacting factors may be specified separately (this is mandatory
for continuous variables). Optional confidence intervals will be
drawn for individual scores as well as for the linear predictor.
If more than one confidence level is chosen, multiple levels may be
displayed using different colors or gray scales. Functions of the
linear predictors may be added to the nomogram.
The datadist
object that was in effect when the model
was fit is used to specify the limits of the axis for continuous
predictors when the user does not specify tick mark locations in the
nomogram
call.
print.nomogram
prints axis information stored in an object returned
by nomogram
. This is useful in producing tables of point assignments
by levels of predictors. It also prints how many linear predictor
units there are per point and the number of points per unit change in
the linear predictor.
legend.nomabbrev
draws legends describing abbreviations used for
labeling tick marks for levels of categorical predictors.
nomogram(fit, ..., adj.to, lp=TRUE, lp.at=NULL, fun=NULL,
         fun.at=NULL, fun.lp.at=NULL, funlabel="Predicted Value",
         interact=NULL, kint=NULL, conf.int=FALSE,
         conf.lp=c("representative", "all", "none"),
         est.all=TRUE, posterior.summary=c('mean', 'median', 'mode'),
         abbrev=FALSE, minlength=4, maxscale=100, nint=10,
         vnames=c("labels","names"),
         varname.label=TRUE, varname.label.sep="=",
         omit=NULL, verbose=FALSE)

## S3 method for class 'nomogram'
print(x, dec=0, ...)

## S3 method for class 'nomogram'
plot(x, lplabel="Linear Predictor", fun.side,
     col.conf=c(1, 0.3), conf.space=c(.08,.2),
     label.every=1, force.label=FALSE,
     xfrac=.35, cex.axis=.85, cex.var=1,
     col.grid=NULL, varname.label=TRUE, varname.label.sep="=",
     ia.space=.7, tck=NA, tcl=-0.25, lmgp=.4, naxes,
     points.label='Points', total.points.label='Total Points',
     total.sep.page=FALSE, total.fun, cap.labels=FALSE, ...)

legend.nomabbrev(object, which, x, y, ncol=3, ...)
fit | a regression model fit that was created with rms, usually with options(datadist=) in effect
... | settings of variables to use in constructing axes. If datadist was in effect, the default is to use pretty(total range, nint) for continuous variables and all different values for discrete variables.
adj.to | If you didn't define datadist for all predictors, you will have to define adjustment settings for the undefined ones, e.g. adj.to=list(age=50, sex="female")
lp | Set to FALSE to suppress creation of an axis for scoring X beta
lp.at | If lp=TRUE, specifies a vector of settings of X beta. Default is to use pretty(range of linear predictors, nint).
fun | an optional function to transform the linear predictors, and to plot on another axis. If more than one transformation is plotted, put them in a list, e.g. list(function(x) x/2, function(x) 2*x).
fun.at | function values to label on axis. Default is fun evaluated at lp.at. If more than one fun was specified, a vector for fun.at applies to all functions, and a list of vectors specifies values for each function separately.
fun.lp.at | If you want to evaluate one of the functions at a different set of linear predictor values than may have been used in constructing the linear predictor axis, specify a vector or list of vectors of linear predictor values at which to evaluate the function. This is especially useful for discrete functions. The presence of this attribute also does away with the need for lp.at to cover the values needed by fun.
funlabel | label for fun axis. Default is "Predicted Value". If more than one function was given, specify a vector of labels.
interact | When a continuous variable interacts with a discrete one, axes are constructed so that the continuous variable moves within the axis, and separate axes represent levels of interacting factors. For interactions between two continuous variables, all but the axis variable must have discrete levels defined in interact. For discrete interacting factors, you may specify the levels to use in constructing the multiple axes.
kint | for models such as the ordinal models with multiple intercepts, specifies which one to use in evaluating the linear predictor. Default is to use the middle intercept.
conf.int | confidence levels to display for each scoring. Default is FALSE to display no confidence limits; TRUE is equivalent to c(0.7, 0.9).
conf.lp | default is "representative" to show confidence limits only for representative grouped values of the linear predictor; specify "all" to show limits for all evaluated values, or "none" to suppress them
est.all | To plot axes for only the subset of variables named in ..., set est.all=FALSE
posterior.summary | when operating on a Bayesian model such as a result of blrm, specifies whether to use the posterior mean (the default), median, or mode of the parameter values in constructing the nomogram
abbrev | Set to TRUE to use the abbreviate function to abbreviate levels of categorical factors when labeling tick marks
minlength | applies if abbrev=TRUE; the minimum abbreviation length passed to the abbreviate function. Setting minlength=1 labels tick marks with single letters.
maxscale | default maximum point score is 100
nint | number of intervals to label for axes representing continuous variables. See pretty.
vnames | By default, variable labels are used to label axes. Set vnames="names" to instead use variable names.
omit | vector of character strings containing names of variables for which to suppress drawing axes. Default is to show all variables.
verbose | set to TRUE to get printed output detailing how tick marks were chosen and labeled for function axes
x | an object created by nomogram
dec | number of digits to the right of the decimal point, for rounding point scores in print.nomogram. Default is to round to the nearest whole number of points.
lplabel | label for linear predictor axis. Default is "Linear Predictor".
fun.side | a vector or list of vectors of side parameters for the axis function for labeling function values: 1 to place a label below the axis, 3 to place it above
col.conf | colors corresponding to the confidence levels in conf.int
conf.space | a 2-element vector with the vertical range within which to draw confidence bars, in units of 1=spacing between main bars. Four heights are used within this range (8 for the linear predictor if more than 16 unique values were evaluated), cycling them among separate confidence intervals to reduce overlapping.
label.every | Specify label.every=i to label on every ith tick mark
force.label | set to TRUE to force every tick mark intended to be labeled to have a label plotted, whether or not the labels would collide
xfrac | fraction of horizontal plot to set aside for axis titles
cex.axis | character size for tick mark labels
cex.var | character size for axis titles (variable names)
col.grid | If left unspecified, no vertical reference lines are drawn. Specify a vector of length one (to use the same color for both minor and major reference lines) or two (corresponding to the color for the major and minor divisions, respectively) containing colors, to cause vertical reference lines to the top points scale to be drawn. For R, a good choice is col.grid=gray(c(0.8, 0.95)).
varname.label | In constructing axis titles for interactions, the default is to add (interacting.varname=level) after the axis name. Specify varname.label=FALSE to instead use "(level)".
varname.label.sep | If varname.label=TRUE, you can change the separator from "=" by specifying this parameter
ia.space | When multiple axes are drawn for levels of interacting factors, the default is to group combinations related to a main effect. This is done by spacing the axes for the second to last of these within a group only 0.7 (by default) of the way down as compared with the normal space of 1 unit.
tck | see tck under par
tcl | length of tick marks in nomogram
lmgp | spacing between numeric axis labels and axis (see par for mgp)
naxes | maximum number of axes to allow on one plot. If the nomogram requires more than one "page", the "Points" axis will be repeated at the top of each page when necessary.
points.label | a character string giving the axis label for the points scale
total.points.label | a character string giving the axis label for the total points scale
total.sep.page | set to TRUE to force the total points and later axes to be placed on a separate page
total.fun | a user-provided function that will be executed before the total points axis is drawn. Default is not to execute a function. This is useful e.g. when total.sep.page=TRUE and you wish to use locator to find coordinates for positioning an abbreviation legend before a new page is started.
cap.labels | logical: should the factor labels have their first letter capitalized?
object | the result returned from nomogram
which | a character string giving the name of a variable for which to draw a legend with abbreviations of factor levels
y | y-coordinate to pass to the legend function; the top of the legend box
ncol | the number of columns to form in drawing the legend
A variable is considered to be discrete if it is categorical or ordered, or if datadist stored values for it (meaning it had fewer than 11 unique values).
A variable is said to be indirectly related to another variable if the two are related by some interaction. For example, if a model has variables a, b, c, d, and the interactions are a:c and c:d, variable d is indirectly related to variable a. The complete list of variables related to a is c, d. If an axis is made for variable a, several axes will actually be drawn, one for each combination of c and d specified in interact.
Note that with a caliper, it is easy to continually add point scores for individual predictors, and then to place the caliper on the upper “Points” axis (with extrapolation if needed). Then transfer these points to the “Total Points” axis. In this way, points can be added without writing them down.
Confidence limits for an individual predictor score are really confidence limits for the entire linear predictor, with other predictors set to adjustment values. If lp=TRUE, all confidence bars for all linear predictor values evaluated are drawn. When multiple confidence bars of differing widths appear at the same linear predictor value, this indicates that precision depends on how that linear predictor value was arrived at (e.g., a certain value may be realized from a setting of a certain predictor that was associated with a large standard error on the regression coefficients for that predictor).
On occasion, you may want to reverse the regression coefficients of a model to make the "points" scales reverse direction. For parametric survival models, which are stated in terms of increasing regression effects meaning longer survival (the opposite of a Cox model), just do something like fit$coefficients <- -fit$coefficients before invoking nomogram, and if you add function axes, negate the function arguments. For the Cox model, you also need to negate fit$center. If you omit lp.at, also negate fit$linear.predictors.
a list of class "nomogram" that contains information used in plotting the axes. If you specified abbrev=TRUE, a list called abbrev is also returned that gives the abbreviations used for tick mark labels, if any. This list is useful for making legends and is used by legend.nomabbrev (see the last example). The returned list also has components called total.points, lp, and the function axis names. These components have components x (the at argument vector given to axis), y (pos for axis), and x.real, the x-coordinates appearing on tick mark labels. An often useful result is stored in the list of data for each axis variable, namely the exact number of points that correspond to each tick mark on that variable's axis; a sketch of extracting these follows.
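For example (a sketch; the axis name age is assumed):

nom <- nomogram(f)    # f: a previously fitted rms model
names(nom)            # axis variables plus total.points, lp, function axes
nom$age$points        # points corresponding to each tick mark on the age axis
nom$total.points      # components x, y, x.real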
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
Banks J: Nomograms. Encyclopedia of Statistical Sciences, Vol 6. Editors: S Kotz and NL Johnson. New York: Wiley; 1985.
Lubsen J, Pool J, van der Does, E: A practical device for the application of a diagnostic or prognostic function. Meth. Inform. Med. 17:127–129; 1978.
Wikipedia: Nomogram, https://en.wikipedia.org/wiki/Nomogram.
rms, plot.Predict, ggplot.Predict, plot.summary.rms, axis, pretty, approx, latexrms, rmsMisc
n <- 1000    # define sample size
set.seed(17) # so can reproduce the results
d <- data.frame(age = rnorm(n, 50, 10),
                blood.pressure = rnorm(n, 120, 15),
                cholesterol = rnorm(n, 200, 25),
                sex = factor(sample(c('female','male'), n, TRUE)))

# Specify population model for log odds that Y=1
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
d <- upData(d,
            L = .4*(sex=='male') + .045*(age-50) +
                (log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male')),
            y = ifelse(runif(n) < plogis(L), 1, 0))

ddist <- datadist(d); options(datadist='ddist')

f <- lrm(y ~ lsp(age,50) + sex * rcs(cholesterol, 4) + blood.pressure,
         data=d)
nom <- nomogram(f, fun=function(x) 1/(1+exp(-x)),  # or fun=plogis
                fun.at=c(.001,.01,.05,seq(.1,.9,by=.1),.95,.99,.999),
                funlabel="Risk of Death")
# Instead of fun.at, could have specified fun.lp.at=logit of
# sequence above - faster and slightly more accurate
plot(nom, xfrac=.45)
print(nom)

nom <- nomogram(f, age=seq(10,90,by=10))
plot(nom, xfrac=.45)

g <- lrm(y ~ sex + rcs(age, 3) * rcs(cholesterol, 3), data=d)
nom <- nomogram(g, interact=list(age=c(20,40,60)),
                conf.int=c(.7,.9,.95))
plot(nom, col.conf=c(1,.5,.2), naxes=7)

require(survival)
w <- upData(d,
            cens = 15 * runif(n),
            # note: lowercase 'female' matches the factor levels defined above
            h = .02 * exp(.04 * (age - 50) + .8 * (sex == 'female')),
            d.time = -log(runif(n)) / h,
            death = ifelse(d.time <= cens, 1, 0),
            d.time = pmin(d.time, cens))
f <- psm(Surv(d.time, death) ~ sex * age, data=w, dist='lognormal')
med <- Quantile(f)
surv <- Survival(f)  # This would also work if f was from cph
plot(nomogram(f, fun=function(x) med(lp=x),
              funlabel="Median Survival Time"))
nom <- nomogram(f, fun=list(function(x) surv(3, x),
                            function(x) surv(6, x)),
                funlabel=c("3-Month Survival Probability",
                           "6-month Survival Probability"))
plot(nom, xfrac=.7)

## Not run:
nom <- nomogram(fit.with.categorical.predictors, abbrev=TRUE, minlength=1)
nom$x1$points    # print points assigned to each level of x1 for its axis
# Add legend for abbreviations for category levels
abb <- attr(nom, 'info')$abbrev$treatment
legend(locator(1), abb$full, pch=paste(abb$abbrev, collapse=''),
       ncol=2, bty='n')   # this only works for 1-letter abbreviations
# Or use the legend.nomabbrev function:
legend.nomabbrev(nom, 'treatment', locator(1), ncol=2, bty='n')
## End(Not run)

# Make a nomogram with axes predicting probabilities Y>=j for all j=1-3
# in an ordinal logistic model, where Y=0,1,2,3
w <- upData(w, Y = ifelse(y==0, 0, sample(1:3, length(y), TRUE)))
g <- lrm(Y ~ age + rcs(cholesterol, 4) * sex, data=w)
fun2 <- function(x) plogis(x - g$coef[1] + g$coef[2])
fun3 <- function(x) plogis(x - g$coef[1] + g$coef[3])
f <- Newlabels(g, c(age='Age in Years'))
# see Design.Misc, which also has Newlevels to change
# labels for levels of categorical variables
g <- nomogram(f, fun=list('Prob Y>=1'=plogis, 'Prob Y>=2'=fun2,
                          'Prob Y=3'=fun3),
              fun.at=c(.01,.05,seq(.1,.9,by=.1),.95,.99))
plot(g, lmgp=.2, cex.axis=.6)
options(datadist=NULL)
Computes an estimate of a survival curve for censored data
using either the Kaplan-Meier or the Fleming-Harrington method
or computes the predicted survivor function.
For competing risks data it computes the cumulative incidence curve.
This calls the survival package's survfit.formula function. Attributes of the event time variable are saved (label and units of measurement).
For competing risks the second argument for Surv should be the event state variable, and it should be a factor variable with the first factor level denoting right-censored observations.
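A sketch of that coding (the variable names ftime and state are assumed):

# the first factor level must denote right-censored observations
state <- factor(state, levels=c('censored', 'pcm', 'death'))
f <- npsurv(Surv(ftime, state) ~ 1)   # cumulative incidence for each cause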
npsurv(formula, data=environment(formula), subset, weights, na.action=na.delete, ...)
formula |
a formula object, which must have a |
data , subset , weights , na.action
|
see |
... |
see |
see survfit.formula
for details
an object of class "npsurv"
and "survfit"
.
See survfit.object
for details. Methods defined for survfit
objects are print
, summary
, plot
,lines
, and
points
.
Thomas Lumley [email protected] and Terry Therneau
survfit.cph for survival curves from Cox models. print, plot, lines, coxph, strata, survplot
require(survival)
# fit a Kaplan-Meier and plot it
fit <- npsurv(Surv(time, status) ~ x, data = aml)
plot(fit, lty = 2:3)
legend(100, .8, c("Maintained", "Nonmaintained"), lty = 2:3)

# Here is the data set from Turnbull
# There are no interval censored subjects, only left-censored (status=2),
# right-censored (status 0) and observed events (status 1)
#
#                        Time
#                       1    2    3    4
# Type of observation
#   death              12    6    2    3
#   losses              3    2    0    3
#   late entry          2    4    2    5
#
tdata <- data.frame(time   = c(1,1,1,2,2,2,3,3,3,4,4,4),
                    status = rep(c(1,0,2), 4),
                    n      = c(12,3,2,6,2,4,2,0,2,3,3,5))
fit <- npsurv(Surv(time, time, status, type='interval') ~ 1,
              data=tdata, weights=n)

#
# Time to progression/death for patients with monoclonal gammopathy
# Competing risk curves (cumulative incidence)
# status variable must be a factor with first level denoting right censoring
m <- upData(mgus1, stop = stop / 365.25, units=c(stop='years'),
            labels=c(stop='Follow-up Time'), subset=start == 0)
f <- npsurv(Surv(stop, event) ~ 1, data=m)

# CI curves are always plotted from 0 upwards, rather than 1 down
plot(f, fun='event', xmax=20, mark.time=FALSE, col=2:3,
     xlab="Years post diagnosis of MGUS")
text(10, .4,  "Competing Risk: death",       col=3)
text(16, .15, "Competing Risk: progression", col=2)

# Use survplot for enhanced displays of cumulative incidence curves for
# competing risks
survplot(f, state='pcm', n.risk=TRUE, xlim=c(0, 20), ylim=c(0, .5), col=2)
survplot(f, state='death', add=TRUE, col=3)
f <- npsurv(Surv(stop, event) ~ sex, data=m)
survplot(f, state='death', n.risk=TRUE, conf='diffbands')
Fits the usual weighted or unweighted linear regression model using the same fitting routines used by lm, but also storing the variance-covariance matrix var and using traditional dummy-variable coding for categorical factors. Also fits unweighted models using penalized least squares, with the same penalization options as in the lrm function. For penalized estimation, there is a fitter function called lm.pfit.
ols(formula, data=environment(formula), weights, subset,
    na.action=na.delete, method="qr", model=FALSE,
    x=FALSE, y=FALSE, se.fit=FALSE, linear.predictors=TRUE,
    penalty=0, penalty.matrix, tol=1e-7, sigma,
    var.penalty=c('simple','sandwich'), ...)
formula |
an S formula object, e.g.
|
data |
name of an S data frame containing all needed variables. Omit this to use a data frame already in the S “search list”. |
weights |
an optional vector of weights to be used in the fitting
process. If specified, weighted least squares is used with
weights |
subset |
an expression defining a subset of the observations to use in the fit. The default
is to use all observations. Specify for example |
na.action |
specifies an S function to handle missing data. The default is the function |
method |
specifies a particular fitting method, or |
model |
default is |
x |
default is |
y |
default is |
se.fit |
default is |
linear.predictors |
set to |
penalty |
see |
penalty.matrix |
see |
tol |
tolerance for information matrix singularity |
sigma |
If |
var.penalty |
the type of variance-covariance matrix to be stored in the |
... |
For penalized estimation, the penalty factor on the log likelihood is -0.5 beta' P beta / sigma^2, where P is the penalty matrix defined above. The penalized maximum likelihood estimate (penalized least squares or ridge estimate) of beta is (X'X + P)^-1 X'Y. The maximum likelihood estimate of sigma^2 is (sse + beta' P beta) / n, where sse is the sum of squared errors (residuals). The effective.df.diagonal vector is the diagonal of the matrix X'X (X'X + P)^-1.
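As a rough illustration of the effective d.f. (a sketch, not from the original documentation):

set.seed(1)
x1 <- rnorm(50); x2 <- rnorm(50)
y  <- x1 + rnorm(50)
f  <- ols(y ~ x1 + x2, penalty=2)
sum(f$effective.df.diagonal)   # effective d.f. including the intercept
f$stats['d.f.']                # the same quantity minus 1 (see below)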
the same objects returned from lm (unless penalty or penalty.matrix are given, in which case an abbreviated list is returned since lm.pfit is used as a fitter) plus the design attributes (see rms). Predicted values are always returned, in the element linear.predictors. The vectors or matrix stored if y=TRUE or x=TRUE have rows deleted according to subset and to missing data, and have names or row names that come from the data frame used as input data. If penalty or penalty.matrix is given, the var matrix returned is an improved variance-covariance matrix for the penalized regression coefficient estimates. If var.penalty="sandwich" (not the default, as limited simulation studies have found it provides variance estimates that are too low) it is defined as sigma^2 (X'X + P)^-1 X'X (X'X + P)^-1, where P is penalty factors * penalty.matrix, with a column and row of zeros added for the intercept. When var.penalty="simple" (the default), var is sigma^2 (X'X + P)^-1.
The returned list has a vector stats with named elements n, Model L.R., d.f., R2, g, Sigma. Model L.R. is the model likelihood ratio chi-square statistic, and R2 is the coefficient of determination R^2. For penalized estimation, d.f. is the effective degrees of freedom, which is the sum of the elements of another vector returned, effective.df.diagonal, minus one for the intercept. g is the g-index. Sigma is the penalized maximum likelihood estimate (see below).
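For example, to extract these summaries (a sketch reusing x1, x2, y simulated above):

f <- ols(y ~ x1 + x2)
f$stats[c('n', 'Model L.R.', 'd.f.', 'R2', 'g', 'Sigma')]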
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
rms, rms.trans, anova.rms, summary.rms, predict.rms, fastbw, validate, calibrate, Predict, specs.rms, cph, lrm, which.influence, lm, summary.lm, print.ols, residuals.ols, latex.ols, na.delete, na.detail.response, datadist, pentrace, vif, abs.error.pred
set.seed(1)
x1 <- runif(200)
x2 <- sample(0:3, 200, TRUE)
distance <- (x1 + x2/3 + rnorm(200))^2
d <- datadist(x1, x2)
options(datadist="d")   # No d -> no summary, plot without giving all details

f <- ols(sqrt(distance) ~ rcs(x1,4) + scored(x2), x=TRUE)
# could use d <- datadist(f); options(datadist="d") at this point,
# but predictor summaries would not be stored in the fit object for
# use with Predict, summary.rms.  In that case, the original
# dataset or d would need to be accessed later, or all variable values
# would have to be specified to summary, plot
anova(f)
which.influence(f)
summary(f)
summary.lm(f)   # will only work if penalty and penalty.matrix not used

# Fit a complex model and approximate it with a simple one
x1 <- runif(200)
x2 <- runif(200)
x3 <- runif(200)
x4 <- runif(200)
y <- x1 + x2 + rnorm(200)
f <- ols(y ~ rcs(x1,4) + x2 + x3 + x4)
pred <- fitted(f)   # or predict(f) or f$linear.predictors
f2 <- ols(pred ~ rcs(x1,4) + x2 + x3 + x4, sigma=1)
# sigma=1 prevents numerical problems resulting from R2=1
fastbw(f2, aics=100000)
# This will find the best 1-variable model, best 2-variable model, etc.
# in predicting the predicted values from the original model
options(datadist=NULL)
Fits ordinal cumulative probability models for continuous or ordinal response variables, efficiently allowing for a large number of intercepts by capitalizing on the information matrix being sparse. Five different distribution functions are implemented, with the default being the logistic (i.e., the proportional odds model). The ordinal cumulative probability models are stated in terms of exceedance probabilities (Prob(Y >= y | X)) so that, as with OLS, larger predicted values are associated with larger Y. This is important to note for the asymmetric distributions given by the log-log and complementary log-log families, for which negating the linear predictor does not result in Prob(Y < y | X). The family argument is defined in orm.fit. The model assumes that the inverse of the assumed cumulative distribution function, when applied to one minus the true cumulative distribution function and plotted on the y-axis (with the original y on the x-axis), yields parallel curves (though not necessarily linear). This can be checked by plotting the inverse cumulative probability function of one minus the empirical distribution function, stratified by X, and assessing parallelism. Note that parametric regression models make the much stronger assumption of linearity of such inverse functions.
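A minimal sketch of this graphical check for the logistic family, assuming a numeric response y and a two-level factor g:

plot(range(y), c(-4, 4), type='n', xlab='y', ylab='logit Pr(Y >= y)')
for(i in seq_along(levels(g))) {
  yi <- y[g == levels(g)[i]]
  ys <- sort(unique(yi))
  p  <- sapply(ys, function(u) mean(yi >= u))   # empirical exceedance probs
  lines(ys, qlogis(pmin(pmax(p, .001), .999)), col=i)
}
# roughly parallel curves support the proportional odds assumption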
For the print method, format of output is controlled by the user previously running options(prType="lang") where lang is "plain" (the default), "latex", or "html". When using html with Quarto or RMarkdown, results='asis' need not be written in the chunk header.
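For example (a sketch):

options(prType='html')   # or 'latex'; the default is 'plain'
f <- orm(y ~ x1)
f                        # print method now renders HTML markup
options(prType='plain')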
Quantile.orm creates an R function that computes an estimate of a given quantile for a given value of the linear predictor (which was assumed to use the first intercept). It uses a linear interpolation method by default, but you can override that to use a discrete method by specifying method="discrete" when calling the function generated by Quantile. Optionally a normal approximation for a confidence interval for quantiles will be computed using the delta method, if conf.int > 0 is specified to the function generated from calling Quantile and you specify X. In that case, a "lims" attribute is included in the result computed by the derived quantile function.
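A sketch of that usage, assuming a fit created with x=TRUE, y=TRUE (argument conventions of the generated function are assumed from the description above):

f  <- orm(y ~ x1, x=TRUE, y=TRUE)
qu <- Quantile(f)
qu(0.5, lp=predict(f))   # estimated medians
# delta-method limits are attached when conf.int and X are given:
est <- qu(0.5, lp=predict(f), X=f$x, conf.int=0.95)
attr(est, 'lims')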
orm(formula, data=environment(formula), subset, na.action=na.delete,
    method="orm.fit", model=FALSE, x=FALSE, y=FALSE,
    linear.predictors=TRUE, se.fit=FALSE, penalty=0, penalty.matrix,
    tol=1e-7, eps=0.005, var.penalty=c('simple','sandwich'),
    scale=FALSE, ...)

## S3 method for class 'orm'
print(x, digits=4, r2=c(0,2,4), coefs=TRUE, pg=FALSE,
      intercepts=x$non.slopes < 10, title, ...)

## S3 method for class 'orm'
Quantile(object, codes=FALSE, ...)
formula |
a formula object. An |
data |
data frame to use. Default is the current frame. |
subset |
logical expression or vector of subscripts defining a subset of observations to analyze |
na.action |
function to handle |
method |
name of fitting function. Only allowable choice at present is |
model |
causes the model frame to be returned in the fit object |
x |
causes the expanded design matrix (with missings excluded)
to be returned under the name |
y |
causes the response variable (with missings excluded) to be returned
under the name |
linear.predictors |
causes the predicted X beta (with missings excluded) to be returned
under the name |
se.fit |
causes the standard errors of the fitted values (on the linear predictor
scale) to be returned under the name |
penalty |
see |
penalty.matrix |
see |
tol |
singularity criterion (see |
eps |
difference in |
var.penalty |
see |
scale |
set to |
... |
arguments that are passed to |
digits |
number of significant digits to use |
r2 |
vector of integers specifying which R^2 measures to print,
with 0 for Nagelkerke R^2 and 1:4 corresponding to the 4 measures
computed by |
pg |
set to |
coefs |
specify |
intercepts |
By default, intercepts are only printed if there are
fewer than 10 of them. Otherwise this is controlled by specifying
|
title |
a character string title to be passed to |
object |
an object created by |
codes |
if |
The returned fit object of orm
contains the following components
in addition to the ones mentioned under the optional arguments.
call |
calling expression |
freq |
table of frequencies for |
stats |
vector with the following elements: number of observations used in the
fit, number of unique |
fail |
set to |
coefficients |
estimated parameters |
var |
estimated variance-covariance matrix (inverse of information matrix)
for the middle intercept and regression coefficients. See
|
effective.df.diagonal |
see |
family |
the character string for |
trans |
a list of functions for the choice of |
deviance |
-2 log likelihoods (counting penalty components) When an offset variable is present, three deviances are computed: for intercept(s) only, for intercepts+offset, and for intercepts+offset+predictors. When there is no offset variable, the vector contains deviances for the intercept(s)-only model and the model with intercept(s) and predictors. |
non.slopes |
number of intercepts in model |
interceptRef |
the index of the middle (median) intercept used in
computing the linear predictor and |
penalty |
see |
penalty.matrix |
the penalty matrix actually used in the estimation |
info.matrix |
a sparse matrix representation of type
|
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
For the Quantile
function:
Qi Liu and Shengxin Tu
Department of Biostatistics, Vanderbilt University
Sall J: A monotone regression smoother based on ordinal cumulative logistic regression, 1991.
Le Cessie S, Van Houwelingen JC: Ridge estimators in logistic regression. Applied Statistics 41:191–201, 1992.
Verweij PJM, Van Houwelingen JC: Penalized likelihood in Cox regression. Stat in Med 13:2427–2436, 1994.
Gray RJ: Flexible methods for analyzing survival data using splines, with applications to breast cancer prognosis. JASA 87:942–951, 1992.
Shao J: Linear model selection by cross-validation. JASA 88:486–494, 1993.
Verweij PJM, Van Houwelingen JC: Crossvalidation in survival analysis. Stat in Med 12:2305–2314, 1993.
Harrell FE: Model uncertainty, penalization, and parsimony. Available from https://hbiostat.org/talks/iscb98.pdf.
orm.fit, predict.orm, solve, rms.trans, rms, polr, latex.orm, vcov.orm, num.intercepts, residuals.orm, na.delete, na.detail.response, pentrace, rmsMisc, vif, predab.resample, validate.orm, calibrate, Mean.orm, gIndex, prModFit
require(ggplot2)
set.seed(1)
n  <- 100
y  <- round(runif(n), 2)
x1 <- sample(c(-1,0,1), n, TRUE)
x2 <- sample(c(-1,0,1), n, TRUE)
f <- lrm(y ~ x1 + x2, eps=1e-5)
g <- orm(y ~ x1 + x2, eps=1e-5)
max(abs(coef(g) - coef(f)))
w <- vcov(g, intercepts='all') / vcov(f) - 1
max(abs(w))

set.seed(1)
n  <- 300
x1 <- c(rep(0,150), rep(1,150))
y  <- rnorm(n) + 3*x1
g <- orm(y ~ x1)
g
k <- coef(g)
i <- num.intercepts(g)
h   <- orm(y ~ x1, family=probit)
ll  <- orm(y ~ x1, family=loglog)
cll <- orm(y ~ x1, family=cloglog)
cau <- orm(y ~ x1, family=cauchit)
x <- 1:i
z <- list(logistic = list(x=x, y=coef(g)[1:i]),
          probit   = list(x=x, y=coef(h)[1:i]),
          loglog   = list(x=x, y=coef(ll)[1:i]),
          cloglog  = list(x=x, y=coef(cll)[1:i]))
labcurve(z, pl=TRUE, col=1:4, ylab='Intercept')

tapply(y, x1, mean)
m <- Mean(g)
m(w <- k[1] + k['x1']*c(0,1))
mh <- Mean(h)
wh <- coef(h)[1] + coef(h)['x1']*c(0,1)
mh(wh)

qu <- Quantile(g)
# Compare model estimated and empirical quantiles
cq <- function(y) {
  cat(qu(.1, w), tapply(y, x1, quantile, probs=.1), '\n')
  cat(qu(.5, w), tapply(y, x1, quantile, probs=.5), '\n')
  cat(qu(.9, w), tapply(y, x1, quantile, probs=.9), '\n')
}
cq(y)

# Try on log-normal model
g <- orm(exp(y) ~ x1)
g
k <- coef(g)
plot(k[1:i])
m <- Mean(g)
m(w <- k[1] + k['x1']*c(0,1))
tapply(exp(y), x1, mean)
qu <- Quantile(g)
cq(exp(y))

# Compare predicted mean with ols for a continuous x
set.seed(3)
n  <- 200
x1 <- rnorm(n)
y  <- x1 + rnorm(n)
dd <- datadist(x1); options(datadist='dd')
f <- ols(y ~ x1)
g <- orm(y ~ x1, family=probit)
h <- orm(y ~ x1, family=logistic)
w <- orm(y ~ x1, family=cloglog)
mg <- Mean(g); mh <- Mean(h); mw <- Mean(w)
r <- rbind(ols      = Predict(f, conf.int=FALSE),
           probit   = Predict(g, conf.int=FALSE, fun=mg),
           logistic = Predict(h, conf.int=FALSE, fun=mh),
           cloglog  = Predict(w, conf.int=FALSE, fun=mw))
plot(r, groups='.set.')

# Compare predicted 0.8 quantile with quantile regression
qu <- Quantile(g)
qu80 <- function(lp) qu(.8, lp)
f <- Rq(y ~ x1, tau=.8)
r <- rbind(probit   = Predict(g, conf.int=FALSE, fun=qu80),
           quantreg = Predict(f, conf.int=FALSE))
plot(r, groups='.set.')

# Verify transformation invariance of ordinal regression
ga <- orm(exp(y) ~ x1, family=probit)
qua <- Quantile(ga)
qua80 <- function(lp) log(qua(.8, lp))
r <- rbind(logprobit = Predict(ga, conf.int=FALSE, fun=qua80),
           probit    = Predict(g,  conf.int=FALSE, fun=qu80))
plot(r, groups='.set.')

# Try the same with quantile regression.  Need to transform x1
fa <- Rq(exp(y) ~ rcs(x1,5), tau=.8)
r <- rbind(qr    = Predict(f, conf.int=FALSE),
           logqr = Predict(fa, conf.int=FALSE, fun=log))
plot(r, groups='.set.')

# Make a plot of Pr(Y >= y) vs. a continuous covariate for 3 levels
# of y and also against a binary covariate
set.seed(1)
n   <- 1000
age <- rnorm(n, 50, 15)
sex <- sample(c('m', 'f'), 1000, TRUE)
Y   <- runif(n)
dd  <- datadist(age, sex); options(datadist='dd')
f <- orm(Y ~ age + sex)
# Use ExProb function to derive an R function to compute
# P(Y >= y | X)
ex  <- ExProb(f)
ex1 <- function(x) ex(x, y=0.25)
ex2 <- function(x) ex(x, y=0.5)
ex3 <- function(x) ex(x, y=0.75)
p1 <- Predict(f, age, sex, fun=ex1)
p2 <- Predict(f, age, sex, fun=ex2)
p3 <- Predict(f, age, sex, fun=ex3)
p <- rbind('P(Y >= 0.25)' = p1,
           'P(Y >= 0.5)'  = p2,
           'P(Y >= 0.75)' = p3)
ggplot(p)

# Make plot with two curves (by sex) with y on the x-axis, and
# estimated P(Y >= y | sex, age=median) on the y-axis
ys <- seq(min(Y), max(Y), length=100)
g <- function(sx) as.vector(ex(y=ys, Predict(f, sex=sx)$yhat)$prob)
d <- rbind(data.frame(sex='m', y=ys, p=g('m')),
           data.frame(sex='f', y=ys, p=g('f')))
ggplot(d, aes(x=y, y=p, color=sex)) + geom_line() +
  ylab(expression(P(Y >= y))) +
  guides(color=guide_legend(title='Sex')) +
  theme(legend.position='bottom')

options(datadist=NULL)

## Not run:
## Simulate power and type I error for orm logistic and probit regression
## for likelihood ratio, Wald, and score chi-square tests, and compare
## with t-test
require(rms)
set.seed(5)
nsim <- 2000
r <- NULL
for(beta in c(0, .4)) {
  for(n in c(10, 50, 300)) {
    cat('beta=', beta, ' n=', n, '\n\n')
    plogistic <- pprobit <- plogistics <- pprobits <- plogisticw <-
      pprobitw <- ptt <- numeric(nsim)
    x <- c(rep(0, n/2), rep(1, n/2))
    pb <- setPb(nsim, every=25, label=paste('beta=', beta, ' n=', n))
    for(j in 1:nsim) {
      pb(j)
      y <- beta*x + rnorm(n)
      tt <- t.test(y ~ x)
      ptt[j] <- tt$p.value
      f <- orm(y ~ x)
      plogistic[j]  <- f$stats['P']
      plogistics[j] <- f$stats['Score P']
      plogisticw[j] <- 1 - pchisq(coef(f)['x']^2 / vcov(f)[2,2], 1)
      f <- orm(y ~ x, family=probit)
      pprobit[j]  <- f$stats['P']
      pprobits[j] <- f$stats['Score P']
      pprobitw[j] <- 1 - pchisq(coef(f)['x']^2 / vcov(f)[2,2], 1)
    }
    if(beta == 0) plot(ecdf(plogistic))
    r <- rbind(r, data.frame(beta = beta, n = n,
                             ttest         = mean(ptt        < 0.05),
                             logisticlr    = mean(plogistic  < 0.05),
                             logisticscore = mean(plogistics < 0.05),
                             logisticwald  = mean(plogisticw < 0.05),
                             probit        = mean(pprobit    < 0.05),
                             probitscore   = mean(pprobits   < 0.05),
                             probitwald    = mean(pprobitw   < 0.05)))
  }
}
print(r)
#   beta   n  ttest logisticlr logisticscore logisticwald probit probitscore probitwald
# 1  0.0  10 0.0435     0.1060        0.0655        0.043 0.0920      0.0920     0.0820
# 2  0.0  50 0.0515     0.0635        0.0615        0.060 0.0620      0.0620     0.0620
# 3  0.0 300 0.0595     0.0595        0.0590        0.059 0.0605      0.0605     0.0605
# 4  0.4  10 0.0755     0.1595        0.1070        0.074 0.1430      0.1430     0.1285
# 5  0.4  50 0.2950     0.2960        0.2935        0.288 0.3120      0.3120     0.3120
# 6  0.4 300 0.9240     0.9215        0.9205        0.920 0.9230      0.9230     0.9230
## End(Not run)
Fits ordinal cumulative probability models for continuous or ordinal response variables, efficiently allowing for a large number of intercepts by capitalizing on the information matrix being sparse. Five different distribution functions are implemented, with the default being the logistic (yielding the proportional odds model). Penalized estimation will be implemented in the future. Weights are not implemented. The optimization method is Newton-Raphson with step-halving. Execution time is linear in the number of intercepts.
orm.fit(x=NULL, y, family='logistic', offset=0., initial, maxit=12L,
        eps=.005, tol=1e-7, trace=FALSE, penalty.matrix=NULL,
        scale=FALSE, y.precision=7)
x |
design matrix with no column for an intercept |
y |
response vector, numeric, factor, or character. The ordering of levels
is assumed from |
family |
the distribution family, corresponding to logistic (the
default), Gaussian, Cauchy, Gumbel maximum ( |
offset |
optional numeric vector containing an offset on the logit scale |
initial |
vector of initial parameter estimates, beginning with the
intercepts. If |
maxit |
maximum no. iterations (default= |
eps |
difference in -2 log likelihood for declaring convergence. Default is .005. If the -2 log likelihood gets worse by eps/10 while the maximum absolute first derivative of -2 log likelihood is below 1E-9, convergence is still declared. This handles the case where the initial estimates are MLEs, to prevent endless step-halving. |
tol |
Singularity criterion. Default is 1e-7 |
trace |
set to |
penalty.matrix |
a self-contained ready-to-use penalty matrix - see |
scale |
set to |
y.precision |
When ‘y’ is numeric, values may need to be rounded
to avoid unpredictable behavior with |
a list with the following components:
call |
calling expression |
freq |
table of frequencies for |
yunique |
vector of sorted unique values of |
stats |
vector with the following elements: number of observations used in the
fit, number of unique |
fail |
set to |
coefficients |
estimated parameters |
var |
estimated variance-covariance matrix (inverse of information matrix).
Note that in the case of penalized estimation, |
family , trans
|
see |
deviance |
-2 log likelihoods. When an offset variable is present, three deviances are computed: for intercept(s) only, for intercepts+offset, and for intercepts+offset+predictors. When there is no offset variable, the vector contains deviances for the intercept(s)-only model and the model with intercept(s) and predictors. |
non.slopes |
number of intercepts in model |
interceptRef |
the index of the middle (median) intercept used in
computing the linear predictor and |
linear.predictors |
the linear predictor using the first intercept |
penalty.matrix |
see above |
info.matrix |
see |
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
# Fit an additive logistic model containing numeric predictors age,
# blood.pressure, and sex, assumed to be already properly coded and
# transformed
#
# fit <- orm.fit(cbind(age, blood.pressure, sex), death)
For an ordinary unpenalized fit from lrm or ols and for a vector or list of penalties, fits a series of logistic or linear models using penalized maximum likelihood estimation, and saves the effective degrees of freedom, Akaike Information Criterion (AIC), Schwarz Bayesian Information Criterion (BIC), and Hurvich and Tsai's corrected AIC (AIC_c). Optionally pentrace can use the nlminb function to solve for the optimum penalty factor or combination of factors penalizing different kinds of terms in the model.
The effective.df function prints the original and effective degrees of freedom for a penalized fit or for an unpenalized fit and the best penalization determined from a previous invocation of pentrace if method="grid" (the default). The effective d.f. is computed separately for each class of terms in the model (e.g., interaction, nonlinear).
A plot method exists to plot the results, and a print method exists to print the most pertinent components. Both AIC and BIC may be plotted if there is only one penalty factor type specified in penalty. Otherwise, the first two types of penalty factors are plotted, showing only the AIC.
pentrace(fit, penalty, penalty.matrix,
         method=c('grid','optimize'),
         which=c('aic.c','aic','bic'), target.df=NULL,
         fitter, pr=FALSE, tol=1e-7,
         keep.coef=FALSE, complex.more=TRUE, verbose=FALSE,
         maxit=12, subset, noaddzero=FALSE)

effective.df(fit, object)

## S3 method for class 'pentrace'
print(x, ...)

## S3 method for class 'pentrace'
plot(x, method=c('points','image'),
     which=c('effective.df','aic','aic.c','bic'),
     pch=2, add=FALSE, ylim, ...)
fit |
a result from |
penalty |
can be a vector or a list. If it is a vector, all types of terms in
the model will be penalized by the same amount, specified by elements in
|
object |
an object returned by |
penalty.matrix |
see |
method |
The default is |
which |
the objective to maximize for either |
target.df |
applies only to |
fitter |
a fitting function. Default is |
pr |
set to |
tol |
tolerance for declaring a matrix singular (see |
keep.coef |
set to |
complex.more |
By default if |
verbose |
set to |
maxit |
maximum number of iterations to allow in a model fit (default=12).
This is passed to the appropriate fitter function with the correct
argument name. Increase |
subset |
a logical or integer vector specifying rows of the design and response
matrices to subset in fitting models. This is most useful for
bootstrapping |
noaddzero |
set to |
x |
a result from |
pch |
used for |
add |
set to |
ylim |
2-vector of y-axis limits for plots other than effective d.f. |
... |
other arguments passed to |
a list of class "pentrace" with elements penalty, df, objective, fit, var.adj, diag, results.all, and optionally Coefficients. The first 6 elements correspond to the fit that had the best objective as named in the which argument, from the sequence of fits tried. Here fit is the fit object from fitter which was a penalized fit, diag is the diagonal of the matrix used to compute the effective d.f., and var.adj is Gray (1992) Equation 2.9, which is an improved covariance matrix for the penalized beta. results.all is a data frame whose first few variables are the components of penalty and whose other columns are df, aic, bic, aic.c. results.all thus contains a summary of results for all fits attempted. When method="optimize", only two components are returned: penalty and objective, and the object does not have a class.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
Gray RJ: Flexible methods for analyzing survival data using splines, with applications to breast cancer prognosis. JASA 87:942–951, 1992.
Hurvich CM, Tsai, CL: Regression and time series model selection in small samples. Biometrika 76:297–307, 1989.
lrm, ols, solvet, rmsMisc, image
n <- 1000    # define sample size
set.seed(17) # so can reproduce the results
age            <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
cholesterol    <- rnorm(n, 200, 25)
sex            <- factor(sample(c('female','male'), n, TRUE))

# Specify population model for log odds that Y=1
L <- .4*(sex=='male') + .045*(age-50) +
     (log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)

f <- lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)),
         x=TRUE, y=TRUE)
p <- pentrace(f, seq(.2,1,by=.05))
plot(p)
p$diag   # may learn something about fractional effective d.f.
         # for each original parameter
pentrace(f, list(simple=c(0,.2,.4), nonlinear=c(0,.2,.4,.8,1)))

# Bootstrap pentrace 5 times, making a plot of corrected AIC plot with 5 reps
n <- nrow(f$x)
plot(pentrace(f, seq(.2,1,by=.05)), which='aic.c',
     col=1, ylim=c(30,120))   # original in black
for(j in 1:5)
  plot(pentrace(f, seq(.2,1,by=.05), subset=sample(n,n,TRUE)),
       which='aic.c', col=j+1, add=TRUE)

# Find penalty giving optimum corrected AIC.  Initial guess is 1.0
# Not implemented yet
# pentrace(f, 1, method='optimize')

# Find penalty reducing total regression d.f. effectively to 5
# pentrace(f, 1, target.df=5)

# Re-fit with penalty giving best aic.c without differential penalization
f <- update(f, penalty=p$penalty)
effective.df(f)
Plot Bayesian Contrast Posterior Densities
## S3 method for class 'contrast.rms'
plot(x, bivar=FALSE, bivarmethod=c("ellipse", "kernel"),
     prob=0.95, which=c("both", "diff", "ind"),
     nrow=NULL, ncol=NULL, ...)
x |
the result of |
bivar |
set to |
bivarmethod |
|
prob |
posterior coverage probability for HPD interval or 2-d contour |
which |
applies when plotting the result of |
nrow |
|
ncol |
likewise |
... |
unused |
If there are exactly two contrasts and bivar=TRUE, plots an elliptical or kernel (depending on bivarmethod) posterior density contour with coverage probability prob. Otherwise plots a series of posterior densities of contrasts along with HPD intervals, posterior means, and medians. When the result being plotted comes from contrast with fun= specified, both the two individual estimates and their difference are plotted.
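A sketch of typical use (assumes the rmsb package's blrm and the variable names shown):

require(rmsb)
b <- blrm(y ~ trt + age)    # Bayesian logistic model
k <- contrast(b, list(trt='B'), list(trt='A'))
plot(k)                     # posterior density with HPD interval, mean, median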
ggplot2
object
Frank Harrell
Uses lattice graphics to plot the effect of one or two predictors on the linear predictor or X beta scale, or on some transformation of that scale. The first argument specifies the result of the Predict function. The predictor is always plotted in its original coding. plot.Predict uses the xYplot function unless formula is omitted and the x-axis variable is a factor, in which case it reverses the x- and y-axes and uses the Dotplot function.
If data is given, a rug plot is drawn showing the location/density of data values for the x-axis variable. If there is a groups (superposition) variable that generated separate curves, the data density specific to each class of points is shown. This assumes that the second variable was a factor variable. The rug plots are drawn by scat1d. When the same predictor is used on all x-axes, and multiple panels are drawn, you can use subdata to specify an expression to subset according to other criteria in addition.
To plot effects instead of estimates (e.g., treatment differences as a function of interacting factors) see contrast.rms and summary.rms.
pantext creates a lattice panel function for including text such as that produced by print.anova.rms inside a panel or in a base graphic.
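One plausible way to combine pantext with plot.Predict's addpanel argument (a sketch; the coordinate conventions are assumed, not taken from the original text):

p   <- Predict(fit, age)                  # fit: an existing rms model
ann <- pantext(anova(fit), x=30, y=1.5)   # returns a lattice panel function
plot(p, addpanel=ann)                     # anova text drawn inside the panel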
## S3 method for class 'Predict'
plot(x, formula, groups=NULL, cond=NULL, varypred=FALSE, subset,
     xlim, ylim, xlab, ylab, data=NULL, subdata, anova=NULL,
     pval=FALSE, cex.anova=.85,
     col.fill=gray(seq(.825, .55, length=5)),
     adj.subtitle, cex.adj, cex.axis, perim=NULL, digits=4,
     nlevels=3, nlines=FALSE, addpanel,
     scat1d.opts=list(frac=0.025, lwd=0.3), type=NULL,
     yscale=NULL, scaletrans=function(z) z, ...)

pantext(object, x, y, cex=.5, adj=0, fontfamily="Courier", lattice=TRUE)
x |
a data frame created by |
formula |
the right hand side of a |
groups |
an optional name of one of the variables in |
cond |
when plotting effects of different predictors, |
varypred |
set to |
subset |
a subsetting expression for restricting the rows of
|
xlim |
This parameter is seldom used, as limits are usually controlled with
|
ylim |
Range for plotting on response variable axis. Computed by default. |
xlab |
Label for |
ylab |
Label for |
data |
a data frame containing the original raw data on which the regression model was based, or at least containing the |
subdata |
if |
anova |
an object returned by |
pval |
specify |
cex.anova |
character size for the test statistic printed on the panel |
col.fill |
a vector of colors used to fill confidence bands for successive superposed groups. Default is an increasingly dark gray scale. |
adj.subtitle |
Set to |
cex.adj |
|
cex.axis |
|
perim |
|
digits |
Controls how numeric variables used for panel labels are formatted. The default is 4 significant digits. |
nlevels |
when |
nlines |
If |
addpanel |
an additional panel function to call along with panel
functions used for |
scat1d.opts |
a list containing named elements that specifies
parameters to |
type |
a value ( |
yscale |
a |
scaletrans |
a function that operates on the |
... |
extra arguments to pass to |
object |
an object having a |
y |
y-coordinate for placing text in a |
cex |
character expansion size for |
adj |
text justification. Default is left justified. |
fontfamily |
font family for |
lattice |
set to |
When a groups (superpositioning) variable was used, you can issue the command Key(...) after printing the result of plot.Predict, to draw a key for the groups.
a lattice object ready to print for rendering.
If plotting the effects of all predictors you can reorder the panels using, for example, p <- Predict(fit); p$.predictor. <- factor(p$.predictor., v), where v is a vector of predictor names specified in the desired order.
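For example (a sketch; predictor names assumed from the examples below):

p <- Predict(fit)
v <- c('cholesterol', 'age', 'blood.pressure', 'sex')   # desired panel order
p$.predictor. <- factor(p$.predictor., v)
plot(p)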
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
Fox J, Hong J (2009): Effect displays in R for multinomial and proportional-odds logit models: Extensions to the effects package. J Stat Software 32 No. 1.
Predict, ggplot.Predict, plotp.Predict, rbind.Predict, datadist, predictrms, anova.rms, contrast.rms, summary.rms, rms, rmsMisc, labcurve, scat1d, xYplot, Overview
n <- 1000    # define sample size
set.seed(17) # so can reproduce the results
age            <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
cholesterol    <- rnorm(n, 200, 25)
sex            <- factor(sample(c('female','male'), n, TRUE))
label(age)            <- 'Age'    # label is in Hmisc
label(cholesterol)    <- 'Total Cholesterol'
label(blood.pressure) <- 'Systolic Blood Pressure'
label(sex)            <- 'Sex'
units(cholesterol)    <- 'mg/dl'  # uses units.default in Hmisc
units(blood.pressure) <- 'mmHg'

# Specify population model for log odds that Y=1
L <- .4*(sex=='male') + .045*(age-50) +
     (log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)

ddist <- datadist(age, blood.pressure, cholesterol, sex)
options(datadist='ddist')

fit <- lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)),
           x=TRUE, y=TRUE)
an <- anova(fit)
# Plot effects of all 4 predictors with test statistics from anova, and P
plot(Predict(fit), anova=an, pval=TRUE)
plot(Predict(fit), data=llist(blood.pressure,age))
# rug plot for two of the predictors

p <- Predict(fit, name=c('age','cholesterol'))   # Make 2 plots
plot(p)

p <- Predict(fit, age=seq(20,80,length=100), sex, conf.int=FALSE)
# Plot relationship between age and log odds, separate curve for each sex
plot(p, subset=sex=='female' | age > 30)
# No confidence interval, suppress estimates for males <= 30

p <- Predict(fit, age, sex)
plot(p, label.curves=FALSE, data=llist(age,sex))
# use label.curves=list(keys=c('a','b')) to use 1-letter abbreviations
# data= allows rug plots (1-dimensional scatterplots)
# on each sex's curve, with sex-specific density of age
# If data were in data frame could have used that

p <- Predict(fit, age=seq(20,80,length=100), sex='male', fun=plogis)
# works if datadist not used
plot(p, ylab=expression(hat(P)))
# plot predicted probability in place of log odds

per <- function(x, y) x >= 30
plot(p, perim=per)
# suppress output for age < 30 but leave scale alone

# Take charge of the plot setup by specifying a lattice formula
p <- Predict(fit, age, blood.pressure=c(120,140,160),
             cholesterol=c(180,200,215), sex)
plot(p, ~ age | blood.pressure*cholesterol, subset=sex=='male')
# plot(p, ~ age | cholesterol*blood.pressure, subset=sex=='female')
# plot(p, ~ blood.pressure|cholesterol*round(age,-1), subset=sex=='male')
plot(p)

# Plot the age effect as an odds ratio
# comparing the age shown on the x-axis to age=30 years
ddist$limits$age[2] <- 30   # make 30 the reference value for age
# Could also do: ddist$limits["Adjust to","age"] <- 30
fit <- update(fit)   # make new reference value take effect
p <- Predict(fit, age, ref.zero=TRUE, fun=exp)
plot(p, ylab='Age=x:Age=30 Odds Ratio',
     abline=list(list(h=1, lty=2, col=2), list(v=30, lty=2, col=2)))

# Compute predictions for three predictors, with superpositioning or
# conditioning on sex, combined into one graph
p1 <- Predict(fit, age, sex)
p2 <- Predict(fit, cholesterol, sex)
p3 <- Predict(fit, blood.pressure, sex)
p <- rbind(age=p1, cholesterol=p2, blood.pressure=p3)
plot(p, groups='sex', varypred=TRUE, adj.subtitle=FALSE)
plot(p, cond='sex',   varypred=TRUE, adj.subtitle=FALSE)

## Not run:
# For males at the median blood pressure and cholesterol, plot 3 types
# of confidence intervals for the probability on one plot, for varying age
ages <- seq(20, 80, length=100)
p1 <- Predict(fit, age=ages, sex='male', fun=plogis)   # standard pointwise
p2 <- Predict(fit, age=ages, sex='male', fun=plogis,
              conf.type='simultaneous')                # simultaneous
p3 <- Predict(fit, age=c(60,65,70), sex='male', fun=plogis,
              conf.type='simultaneous')                # simultaneous 3 pts
# The previous only adjusts for a multiplicity of 3 points instead of 100
f <- update(fit, x=TRUE, y=TRUE)
g <- bootcov(f, B=500, coef.reps=TRUE)
p4 <- Predict(g, age=ages, sex='male', fun=plogis)     # bootstrap percentile
p <- rbind(Pointwise=p1,
           'Simultaneous 100 ages'=p2,
           'Simultaneous 3 ages'=p3,
           'Bootstrap nonparametric'=p4)
xYplot(Cbind(yhat, lower, upper) ~ age, groups=.set., data=p,
       type='l', method='bands', label.curve=list(keys='lines'))
## End(Not run)

# Plots for a parametric survival model
require(survival)
n <- 1000
set.seed(731)
age <- 50 + 12*rnorm(n)
label(age) <- "Age"
sex <- factor(sample(c('Male','Female'), n, rep=TRUE, prob=c(.6, .4)))
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50)+.8*(sex=='Female'))
t <- -log(runif(n))/h
label(t) <- 'Follow-up Time'
e <- ifelse(t<=cens,1,0)
t <- pmin(t, cens)
units(t) <- "Year"
ddist <- datadist(age, sex)
Srv <- Surv(t,e)

# Fit log-normal survival model and plot median survival time vs. age
f <- psm(Srv ~ rcs(age), dist='lognormal')
med <- Quantile(f)   # Creates function to compute quantiles
                     # (median by default)
p <- Predict(f, age, fun=function(x) med(lp=x))
plot(p, ylab="Median Survival Time")
# Note: confidence intervals from this method are approximate since
# they don't take into account estimation of scale parameter

# Fit an ols model to log(y) and plot the relationship between x1
# and the predicted mean(y) on the original scale without assuming
# normality of residuals; use the smearing estimator
# See help file for rbind.Predict for a method of showing two
# types of confidence intervals simultaneously.
set.seed(1)
x1 <- runif(300)
x2 <- runif(300)
ddist <- datadist(x1,x2)
y <- exp(x1+x2-1+rnorm(300))
f <- ols(log(y) ~ pol(x1,2)+x2)
r <- resid(f)
smean <- function(yhat) smearingEst(yhat, exp, res, statistic='mean')
formals(smean) <- list(yhat=numeric(0), res=r[!is.na(r)])
#smean$res <- r[!is.na(r)]   # define default res argument to function
plot(Predict(f, x1, fun=smean), ylab='Predicted Mean on y-scale')

# Make an 'interaction plot', forcing the x-axis variable to be
# plotted at integer values but labeled with category levels
n <- 100
set.seed(1)
gender <- c(rep('male', n), rep('female',n))
m <- sample(c('a','b'), 2*n, TRUE)
d <- datadist(gender, m); options(datadist='d')
anxiety <- runif(2*n) + .2*(gender=='female') + .4*(gender=='female' & m=='b')
tapply(anxiety, llist(gender,m), mean)
f <- ols(anxiety ~ gender*m)
p <- Predict(f, gender, m)
plot(p)   # horizontal dot chart; usually preferred for categorical predictors
Key(.5, .5)
plot(p, ~gender, groups='m', nlines=TRUE)
plot(p, ~m, groups='gender', nlines=TRUE)
plot(p, ~gender|m, nlines=TRUE)

options(datadist=NULL)

## Not run:
# Example in which separate curves are shown for 4 income values
# For each curve the estimated percentage of voters voting for
# the democratic party is plotted against the percent of voters
# who graduated from college. Data are county-level percents.
incomes <- seq(22900, 32800, length=4)
# equally spaced to outer quintiles
p <- Predict(f, college, income=incomes, conf.int=FALSE)
plot(p, xlim=c(0,35), ylim=c(30,55))

# Erase end portions of each curve where there are fewer than 10 counties
# having percent of college graduates to the left of the x-coordinate being
# plotted, for the subset of counties having median family income within
# 1650 of the target income for the curve
show.pts <- function(college.pts, income.pt) {
  s <- abs(income - income.pt) < 1650  # assumes income known to top frame
  x <- college[s]
  x <- sort(x[!is.na(x)])
  n <- length(x)
  low <- x[10]; high <- x[n-9]
  college.pts >= low & college.pts <= high
}
plot(p, xlim=c(0,35), ylim=c(30,55), perim=show.pts)

# Rename variables for better plotting of a long list of predictors
f <- ...
p <- Predict(f)
re <- c(trt='treatment', diabet='diabetes', sbp='systolic blood pressure')
for(n in names(re)) {
  names(p)[names(p)==n] <- re[n]
  p$.predictor.[p$.predictor.==n] <- re[n]
}
plot(p)
## End(Not run)
Plot rexVar Result
## S3 method for class 'rexVar'
plot(
  x,
  xlab = "Relative Explained Variation",
  xlim = NULL,
  pch = 16,
  sort = c("descending", "ascending", "none"),
  margin = FALSE,
  height = NULL,
  width = NULL,
  ...
)
x |
a vector or matrix created by |
xlab |
x-axis label |
xlim |
x-axis limits; defaults to range of all values (limits and point estimates) |
pch |
plotting symbol for dot |
sort |
defaults to sorted predictors in descending order of relative explained variation. Can set to |
margin |
set to |
height |
optional height in pixels for |
width |
likewise optional width |
... |
arguments passed to |
Makes a dot chart displaying the results of rexVar. Base graphics are used unless options(grType='plotly') is in effect, in which case a plotly graphic is produced with hovertext.
The value is a plotly graphics object if using plotly.
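As a quick orientation, here is a minimal sketch (not from the package examples; the fit, the data frame d, and the exact rexVar interface are assumptions to be checked against the rexVar help page):

f <- lrm(y ~ rcs(age, 4) + sex, x=TRUE, y=TRUE, data=d)  # d: hypothetical data frame
r <- rexVar(f, data=d)     # relative explained variation per predictor
plot(r)                    # base-graphics dot chart, sorted descending
options(grType='plotly')   # subsequent calls return a plotly object
plot(r, height=400)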
Frank Harrell
Separately for each predictor variable X in a formula, plots the mean of X vs. levels of Y. Then under the proportional odds assumption, the expected value of the predictor for each Y value is also plotted (as a dotted line). This plot is useful for assessing the ordinality assumption for Y separately for each X, and for assessing the proportional odds assumption in a simple univariable way. If several predictors do not distinguish adjacent categories of Y, those levels may need to be pooled. This display assumes that each predictor is linearly related to the log odds of each event in the proportional odds model. There is also an option to plot the expected means assuming a forward continuation ratio model.
## S3 method for class 'xmean.ordinaly'
plot(x, data, subset, na.action, subn=TRUE, cr=FALSE,
     topcats=1, cex.points=.75, ...)
x |
an S formula. Response variable is treated as ordinal. For categorical predictors, a binary version of the variable is substituted, specifying whether or not the variable equals the modal category. Interactions or non-linear effects are not allowed. |
data |
a data frame or frame number |
subset |
vector of subscripts or logical vector describing subset of data to analyze |
na.action |
defaults to |
subn |
set to |
cr |
set to |
topcats |
When a predictor is categorical, by default only the
proportion of observations in the overall most frequent category will
be plotted against response variable strata. Specify a higher value
of |
cex.points |
if |
... |
other arguments passed to |
plots
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
Harrell FE et al. (1998): Development of a clinical prediction model for an ordinal outcome. Stat in Med 17:909–44.
lrm
, residuals.lrm
, cr.setup
,
summary.formula
, biVar
.
# Simulate data from a population proportional odds model
set.seed(1)
n <- 400
age <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
region <- factor(sample(c('north','south','east','west'), n, replace=TRUE))
L <- .2*(age-50) + .1*(blood.pressure-120)
p12 <- plogis(L)    # Pr(Y>=1)
p2  <- plogis(L-1)  # Pr(Y=2)
p   <- cbind(1-p12, p12-p2, p2)   # individual class probabilities
# Cumulative probabilities:
cp  <- matrix(cumsum(t(p)) - rep(0:(n-1), rep(3,n)), byrow=TRUE, ncol=3)
y   <- (cp < runif(n)) %*% rep(1,3)
# Thanks to Dave Krantz <[email protected]> for this trick

par(mfrow=c(2,2))
plot.xmean.ordinaly(y ~ age + blood.pressure + region, cr=TRUE, topcats=2)
par(mfrow=c(1,1))
# Note that for unimportant predictors we don't care very much about the
# shapes of these plots. Use the Hmisc chiSquare function to compute
# Pearson chi-square statistics to rank the variables by unadjusted
# importance without assuming any ordering of the response:
chiSquare(y ~ age + blood.pressure + region, g=3)
chiSquare(y ~ age + blood.pressure + region, g=5)
Uses plotly
graphics (without using ggplot2) to plot the effect
of one or two predictors
on the linear predictor or X beta scale, or on some transformation of
that scale. The first argument specifies the result of the
Predict
function. The predictor is always plotted in its
original coding. Hover text shows point estimates, confidence
intervals, and on the leftmost x-point, adjustment variable settings.
If Predict
was run with no variable settings, so that each
predictor is varied one at a time, the result of plotp.Predict
is a list with two elements. The first, named Continuous
, is a
plotly
object containing a single graphic with all the
continuous predictors varying. The second, named Categorical
,
is a plotly
object containing a single graphic with all the
categorical predictors varying. If there are no categorical
predictors, the value returned by plotp.Predict
is a single
plotly
object and not a list of objects.
If rdata
is given, a spike histogram is drawn showing
the location/density of data values for the x-axis variable. If
there is a superposition variable that generated separate
curves, the data density specific to each class of points is shown.
The histograms are drawn by
histSpikeg
.
To plot effects instead of estimates (e.g., treatment differences as a
function of interacting factors) see contrast.rms
and
summary.rms
.
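For instance, a minimal sketch of the contrast.rms approach (variable names follow the examples below; the post-processing shown is one of several possibilities):

# Estimate the male:female log odds difference as a function of age
k <- contrast(fit, list(sex='male',   age=30:70),
                   list(sex='female', age=30:70))
k    # prints estimates with confidence limits
w <- as.data.frame(k[c('age','Contrast','Lower','Upper')])
xYplot(Cbind(Contrast, Lower, Upper) ~ age, data=w, type='l')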
Unlike ggplot.Predict
, plotp.Predict
does not handle
groups
, anova
, or perim
arguments.
## S3 method for class 'Predict'
plotp(data, subset, xlim, ylim, xlab, ylab,
      rdata=NULL, nlevels=3, vnames=c('labels','names'),
      histSpike.opts=list(frac=function(f) 0.01 +
        0.02 * sqrt(f - 1)/sqrt(max(f, 2) - 1), side=1, nint=100),
      ncols=3, width=800, ...)
data |
a data frame created by |
subset |
a subsetting expression for restricting the rows of
|
xlim |
ignored unless predictors were specified to |
ylim |
Range for plotting on response variable axis. Computed by default and includes the confidence limits. |
xlab |
Label for |
ylab |
Label for |
rdata |
a data frame containing the original raw data on which the
regression model was based, or at least containing the |
nlevels |
A non-numeric x-axis variable with |
vnames |
applies to the case where multiple plots are produced
separately by predictor. Set to |
histSpike.opts |
a list containing named elements that specifies
parameters to |
ncols |
number of columns of plots to use when plotting multiple continuous predictors |
width |
width in pixels for |
... |
ignored |
a plotly
object or a list containing two elements, each
one a plotly
object
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
Fox J, Hong J (2009): Effect displays in R for multinomial and proportional-odds logit models: Extensions to the effects package. J Stat Software 32 No. 1.
Predict
, rbind.Predict
,
datadist
, predictrms
,
contrast.rms
, summary.rms
,
rms
, rmsMisc
, plot.Predict
,
ggplot.Predict
,
histSpikeg
,
Overview
## Not run:
n <- 350     # define sample size
set.seed(17) # so can reproduce the results
age            <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
cholesterol    <- rnorm(n, 200, 25)
sex            <- factor(sample(c('female','male'), n, TRUE))
label(age)            <- 'Age'    # label is in Hmisc
label(cholesterol)    <- 'Total Cholesterol'
label(blood.pressure) <- 'Systolic Blood Pressure'
label(sex)            <- 'Sex'
units(cholesterol)    <- 'mg/dl'  # uses units.default in Hmisc
units(blood.pressure) <- 'mmHg'

# Specify population model for log odds that Y=1
L <- .4*(sex=='male') + .045*(age-50) +
     (log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male')) +
     .01 * (blood.pressure - 120)
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)

ddist <- datadist(age, blood.pressure, cholesterol, sex)
options(datadist='ddist')

fit <- lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)),
           x=TRUE, y=TRUE)

p <- plotp(Predict(fit))
p$Continuous
p$Categorical
# When using Rmarkdown html notebook, best to use
# prList(p) to render the two objects

plotp(Predict(fit), rdata=llist(blood.pressure, age))$Continuous
# spike histogram plot for two of the predictors

p <- Predict(fit, name=c('age','cholesterol'))   # Make 2 plots
plotp(p)

p <- Predict(fit, age, sex)
plotp(p, rdata=llist(age,sex))
# rdata= allows rug plots (1-dimensional scatterplots)
# on each sex's curve, with sex-specific density of age
# If data were in data frame could have used that

p <- Predict(fit, age=seq(20,80,length=100), sex='male', fun=plogis)
# works if datadist not used
plotp(p, ylab='P')
# plot predicted probability in place of log odds

# Compute predictions for three predictors, with superpositioning or
# conditioning on sex, combined into one graph
p1 <- Predict(fit, age, sex)
p2 <- Predict(fit, cholesterol, sex)
p3 <- Predict(fit, blood.pressure, sex)
p <- rbind(age=p1, cholesterol=p2, blood.pressure=p3)
plotp(p, ncols=2, rdata=llist(age, cholesterol, sex))
## End(Not run)
Based on code and strategies from Frank Harrell's canonical 'Regression Modeling Strategies' text
poma(mod.orm, cutval, minfreq = 15, ...)
mod.orm |
Model fit of class 'orm' or 'lrm'. For 'fit.mult.impute' objects, 'poma' will refit model on a singly-imputed data-set |
cutval |
Numeric vector; sequence of observed values to cut outcome |
minfreq |
Numeric vector; an 'impactPO' argument which specifies the minimum sample size to allow for the least frequent category of the dependent variable. |
... |
parameters to pass to 'impactPO' function such as 'newdata', 'nonpo', and 'B'. |
Strategy 1: Compare PO model fit with models that relax the PO assumption (for discrete response variable)
Strategy 2: Apply different link functions to Prob of Binary Ys (defined by cutval). Regress transformed outcome on combined X and assess constancy of slopes (betas) across cut-points
Strategy 3: Generate score residual plot for each predictor (for response variable with <10 unique levels)
Strategy 4: Assess parallelism of link function transformed inverse CDFs curves for different XBeta levels (for response variables with >=10 unique levels)
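A minimal sketch of Strategy 2's cutval argument (the cut-points chosen here are arbitrary values within the observed mpg range, purely for illustration):

mod.orm <- orm(mpg ~ cyl + hp, x=TRUE, y=TRUE, data=mtcars)
poma(mod.orm, cutval=seq(15, 30, by=5))  # compare link-transformed slopes
                                         # across Prob(Y >= cut) binary fits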
Yong Hao Pua <[email protected]>
Harrell FE. *Regression Modeling Strategies: with applications to linear models,
logistic and ordinal regression, and survival analysis.* New York: Springer Science, LLC, 2015.
Harrell FE. Statistical Thinking - Assessing the Proportional Odds Assumption and Its Impact. https://www.fharrell.com/post/impactpo/. Published March 9, 2022. Accessed January 13, 2023.
[rms::impactPO()]
## Not run:
## orm model (response variable has fewer than 10 unique levels)
mod.orm <- orm(carb ~ cyl + hp, x = TRUE, y = TRUE, data = mtcars)
poma(mod.orm)

## runs rms::impactPO when its args are supplied
## More examples: (https://yhpua.github.io/poma/)
d <- expand.grid(hp = c(90, 180), vs = c(0, 1))
mod.orm <- orm(cyl ~ vs + hp, x = TRUE, y = TRUE, data = mtcars)
poma(mod.orm, newdata = d)

## orm model (response variable has >=10 unique levels)
mod.orm <- orm(mpg ~ cyl + hp, x=TRUE, y=TRUE, data = mtcars)
poma(mod.orm)

## orm model using imputation
dat <- mtcars
## introduce NAs
dat[sample(rownames(dat), 10), "cyl"] <- NA
im <- aregImpute(~ cyl + wt + mpg + am, data = dat)
aa <- fit.mult.impute(mpg ~ cyl + wt, xtrans = im, data = dat, fitter = orm)
poma(aa)
## End(Not run)
Translates an accelerated failure time (AFT) model fitted by
psm
to proportional hazards form, if the fitted model was
a Weibull or exponential model (extreme value distribution with
"log" link).
pphsm(fit)
## S3 method for class 'pphsm'
print(x, digits=max(options()$digits - 4, 3), correlation=TRUE, ...)
## S3 method for class 'pphsm'
vcov(object, ...)
fit |
fit object created by |
x |
result of |
digits |
how many significant digits are to be used for the returned value |
correlation |
set to |
... |
ignored |
object |
a pphsm object |
a new fit object with transformed parameter estimates
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
require(survival)
set.seed(1)
S <- Surv(runif(100))
x <- runif(100)
dd <- datadist(x); options(datadist='dd')
f <- psm(S ~ x, dist="exponential")
summary(f)        # effects on log(T) scale
f.ph <- pphsm(f)
## Not run: summary(f.ph)  # effects on hazard ratio scale
options(datadist=NULL)
predab.resample
is a general-purpose
function that is used by functions for specific models.
It computes estimates of optimism of, and bias-corrected estimates of a vector
of indexes of predictive accuracy, for a model with a specified
design matrix, with or without fast backward step-down of predictors. If bw=TRUE
, the design
matrix x
must have been created by ols
, lrm
, or cph
.
If bw=TRUE
, predab.resample
stores as the kept
attribute a logical matrix encoding which
factors were selected at each repetition.
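A minimal sketch of retrieving that matrix, here through validate (which calls predab.resample); the predictor names are hypothetical and the attribute access follows the description above:

f <- ols(y ~ x1 + x2 + x3, x=TRUE, y=TRUE)
v <- validate(f, bw=TRUE, B=100)  # fast backward step-down in each resample
k <- attr(v, 'kept')              # logical matrix: which factors survived
colMeans(k)                       # proportion of resamples keeping each factor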
predab.resample(fit.orig, fit, measure,
                method=c("boot","crossvalidation",".632","randomization"),
                bw=FALSE, B=50, pr=FALSE, prmodsel=TRUE,
                rule="aic", type="residual", sls=.05, aics=0,
                tol=1e-12, force=NULL, estimates=TRUE,
                non.slopes.in.x=TRUE, kint=1,
                cluster, subset, group=NULL,
                allow.varying.intercepts=FALSE, debug=FALSE, ...)
fit.orig |
object containing the original full-sample fit, with the |
fit |
a function to fit the model, either the original model fit, or a fit in a
sample. fit has as arguments |
measure |
a function to compute a vector of indexes of predictive accuracy for a given fit.
For |
method |
The default is |
bw |
Set to |
B |
Number of repetitions, default=50. For |
pr |
|
prmodsel |
set to |
rule |
Stopping rule for fastbw, |
type |
Type of statistic to use in stopping rule for fastbw, |
sls |
Significance level for stopping in fastbw if |
aics |
Stopping criteria for |
tol |
Tolerance for singularity checking. Is passed to |
force |
see |
estimates |
see |
non.slopes.in.x |
set to |
kint |
For multiple intercept models such as the ordinal logistic model, you may
specify which intercept to use as |
cluster |
Vector containing cluster identifiers. This can be specified only if
|
subset |
specify a vector of positive or negative integers or a logical vector when
you want to have the |
group |
a grouping variable used to stratify the sample upon bootstrapping. This allows one to handle k-sample problems, i.e., each bootstrap sample will be forced to select the same number of observations from each level of group as the number appearing in the original dataset. |
allow.varying.intercepts |
set to |
debug |
set to |
... |
The user may add other arguments here that are passed to |
For method=".632"
, the program stops with an error if every observation
is not omitted at least once from a bootstrap sample. Efron's ".632" method
was developed for measures that are formulated in terms of per-observation
contributions. In general, error measures (e.g., ROC areas) cannot be
written in this way, so this function uses a heuristic extension to
Efron's formulation in which it is assumed that the average error measure
omitting the i
th observation is the same as the average error measure
omitting any other observation. Then weights are derived
for each bootstrap repetition and weighted averages over the B
repetitions
can easily be computed.
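For orientation, the flavor of the computation in the equal-contribution case (a simplified sketch of Efron's original estimator, not the exact weighting predab.resample derives):

# .632 estimate: blend apparent (training) error with out-of-bootstrap error
err.632 <- function(err.apparent, err.oob)
  0.368 * err.apparent + 0.632 * err.oob
# err.oob averages the error over observations omitted from each
# bootstrap sample, across the B repetitions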
a matrix of class "validate"
with rows corresponding
to indexes computed by measure
, and the following columns:
index.orig |
indexes in original overall fit |
training |
average indexes in training samples |
test |
average indexes in test samples |
optimism |
average |
index.corrected |
|
n |
number of successful repetitions with the given index non-missing |
Also contains an attribute keepinfo
if measure
returned
such an attribute when run on the original fit.
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
Efron B, Tibshirani R (1997). Improvements on cross-validation: The .632+ bootstrap method. JASA 92:548–560.
rms
, validate
, fastbw
,
lrm
, ols
, cph
,
bootcov
, setPb
# See the code for validate.ols for an example of the use of
# predab.resample
Predict
allows the user to easily specify which predictors are to
vary. When the vector of values over which a predictor should vary is
not specified, the
range will be all levels of a categorical predictor or equally-spaced
points between the datadist
"Low:prediction"
and
"High:prediction"
values for the variable (datadist
by
default uses the 10th smallest and 10th largest predictor values in the
dataset). Predicted values are
the linear predictor (X beta), a user-specified transformation of that
scale, or estimated probability of surviving past a fixed single time
point given the linear predictor. Predict
is usually used for
plotting predicted values but there is also a print
method.
When the first argument to Predict
is a fit object created by
bootcov
with coef.reps=TRUE
, confidence limits come from
the stored matrix of bootstrap repetitions of coefficients, using
bootstrap percentile nonparametric confidence limits, basic bootstrap,
or BCa limits. Such confidence
intervals do not make distributional assumptions. You can force
Predict
to instead use the bootstrap covariance matrix by setting
usebootcoef=FALSE
. If coef.reps
was FALSE
,
usebootcoef=FALSE
is the default.
There are ggplot
, plotp
, and plot
methods for
Predict
objects that make it easy to show predicted values and
confidence bands.
The rbind
method for Predict
objects allows you to create
separate sets of predictions under different situations and to combine
them into one set for feeding to plot.Predict
,
ggplot.Predict
, or plotp.Predict
. For example you
might want to plot confidence intervals for means and for individuals
using ols
, and have the two types of confidence bands be
superposed onto one plot or placed into two panels. Another use for
rbind
is to combine predictions from quantile regression models
that predicted three different quantiles.
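A sketch of the quantile-regression use (assuming a single predictor x and the rms Rq wrapper; the paneling idiom follows the examples below):

set.seed(1)
x <- runif(200); y <- x + rnorm(200)
dd <- datadist(x); options(datadist='dd')
f25 <- Rq(y ~ rcs(x, 4), tau=.25)
f50 <- Rq(y ~ rcs(x, 4), tau=.50)
f75 <- Rq(y ~ rcs(x, 4), tau=.75)
p <- rbind(Q1=Predict(f25, x), Median=Predict(f50, x), Q3=Predict(f75, x))
plot(p, ~ x | .set.)   # one panel per quantile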
If conf.type="simultaneous"
, simultaneous (over all requested
predictions) confidence limits are computed. See the
predictrms
function for details.
If fun
is given, conf.int
> 0, the model is not a
Bayesian model, and the bootstrap was not used, fun
may return a
limits
attribute when fun has
computed its own confidence
limits. These confidence limits will be functions of the design matrix,
not just the linear predictor.
Predict(object, ..., fun=NULL, funint=TRUE,
        type = c("predictions", "model.frame", "x"),
        np = 200, conf.int = 0.95,
        conf.type = c("mean", "individual", "simultaneous"),
        usebootcoef=TRUE, boot.type=c("percentile", "bca", "basic"),
        posterior.summary=c('mean', 'median', 'mode'),
        adj.zero = FALSE, ref.zero = FALSE,
        kint=NULL, ycut=NULL, time = NULL, loglog = FALSE, digits=4,
        name, factors=NULL, offset=NULL)

## S3 method for class 'Predict'
print(x, ...)

## S3 method for class 'Predict'
rbind(..., rename)
object |
an |
... |
One or more variables to vary, or single-valued adjustment values.
Specify a variable name without an equal sign to use the default
display range, or any range
you choose (e.g. |
fun |
an optional transformation of the linear predictor.
Specify |
funint |
set to |
type |
defaults to providing predictions. Set to |
np |
the number of equally-spaced points computed for continuous
predictors that vary, i.e., when the specified value is |
conf.int |
confidence level (highest posterior density interval probability for
Bayesian models). Default is 0.95. Specify |
conf.type |
type of confidence interval. Default is |
usebootcoef |
set to |
boot.type |
set to |
posterior.summary |
defaults to using the posterior mean of the
regression coefficients. Specify |
adj.zero |
Set to |
ref.zero |
Set to |
kint |
This is only useful in a multiple intercept model such as the ordinal
logistic model. To use the second of three intercepts, for example,
specify |
ycut |
for an ordinal model specifies the Y cutoff to use in
evaluating departures from proportional odds, when the constrained
partial proportional odds model is used. When omitted, |
time |
Specify a single time |
loglog |
Specify |
digits |
Controls how “adjust-to” values are plotted. The default is 4 significant digits. |
name |
Instead of specifying the variables to vary in the
|
factors |
an alternate way of specifying ..., mainly for use by
|
offset |
a list containing one value for one variable, which is mandatory if the model included an offset term. The variable name must match the innermost variable name in the offset term. The single offset is added to all predicted values. |
x |
an object created by |
rename |
If you are concatenating predictor sets using |
When there are no intercepts in the fitted model, plot subtracts adjustment values from each factor while computing variances for confidence limits.
Specifying time
will not work for Cox models with time-dependent
covariables. Use survest
or survfit
for that purpose.
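A minimal sketch of the survest alternative (the fit and covariate names are hypothetical):

f <- cph(Surv(t, e) ~ age + sex, x=TRUE, y=TRUE, surv=TRUE)
survest(f, newdata=data.frame(age=50, sex='Female'), times=5)
# estimated S(5 | age=50, sex=Female) with standard error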
a data frame containing all model predictors and the computed values
yhat
, lower
, upper
, the latter two if confidence
intervals were requested. The data frame has an additional
class
"Predict"
. If name
is specified or no
predictors are specified in ..., the resulting data frame has an
additional variable called .predictor.
specifying which
predictor is currently being varied. .predictor.
is handy for
use as a paneling variable in lattice
or ggplot2
graphics.
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
plot.Predict
, ggplot.Predict
,
plotp.Predict
,
datadist
, predictrms
,
contrast.rms
, summary.rms
,
rms
, rms.trans
, survest
,
survplot
, rmsMisc
,
transace
, rbind
, bootcov
,
bootBCa
, boot.ci
n <- 1000    # define sample size
set.seed(17) # so can reproduce the results
age            <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
cholesterol    <- rnorm(n, 200, 25)
sex            <- factor(sample(c('female','male'), n, TRUE))
label(age)            <- 'Age'    # label is in Hmisc
label(cholesterol)    <- 'Total Cholesterol'
label(blood.pressure) <- 'Systolic Blood Pressure'
label(sex)            <- 'Sex'
units(cholesterol)    <- 'mg/dl'  # uses units.default in Hmisc
units(blood.pressure) <- 'mmHg'

# Specify population model for log odds that Y=1
L <- .4*(sex=='male') + .045*(age-50) +
     (log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)

ddist <- datadist(age, blood.pressure, cholesterol, sex)
options(datadist='ddist')

fit <- lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)))

Predict(fit, age, cholesterol, np=4)
Predict(fit, age=seq(20,80,by=10), sex, conf.int=FALSE)
Predict(fit, age=seq(20,80,by=10), sex='male')  # works if datadist not used
# Get simultaneous confidence limits accounting for making 7 estimates
# Predict(fit, age=seq(20,80,by=10), sex='male', conf.type='simult')
# (this needs the multcomp package)

ddist$limits$age[2] <- 30   # make 30 the reference value for age
# Could also do: ddist$limits["Adjust to","age"] <- 30
fit <- update(fit)   # make new reference value take effect
Predict(fit, age, ref.zero=TRUE, fun=exp)

# Make two curves, and plot the predicted curves as two trellis panels
w <- Predict(fit, age, sex)
require(lattice)
xyplot(yhat ~ age | sex, data=w, type='l')
# To add confidence bands we need to use the Hmisc xYplot function in
# place of xyplot
xYplot(Cbind(yhat,lower,upper) ~ age | sex, data=w,
       method='filled bands', type='l', col.fill=gray(.95))
# If non-displayed variables were in the model, add a subtitle to show
# their settings using
title(sub=paste('Adjusted to', attr(w,'info')$adjust), adj=0)
# Easier: feed w into plot.Predict, ggplot.Predict, plotp.Predict

## Not run:
# Predictions from a parametric survival model
require(survival)
n <- 1000
set.seed(731)
age <- 50 + 12*rnorm(n)
label(age) <- "Age"
sex <- factor(sample(c('Male','Female'), n, rep=TRUE, prob=c(.6, .4)))
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50)+.8*(sex=='Female'))
t <- -log(runif(n))/h
label(t) <- 'Follow-up Time'
e <- ifelse(t<=cens,1,0)
t <- pmin(t, cens)
units(t) <- "Year"
ddist <- datadist(age, sex)
Srv <- Surv(t,e)

# Fit log-normal survival model and plot median survival time vs. age
f <- psm(Srv ~ rcs(age), dist='lognormal')
med <- Quantile(f)   # Creates function to compute quantiles
                     # (median by default)
Predict(f, age, fun=function(x) med(lp=x))
# Note: This works because med() expects the linear predictor (X*beta)
#       as an argument. Would not work if use
#       ref.zero=TRUE or adj.zero=TRUE.
# Also, confidence intervals from this method are approximate since
# they don't take into account estimation of scale parameter

# Fit an ols model to log(y) and plot the relationship between x1
# and the predicted mean(y) on the original scale without assuming
# normality of residuals; use the smearing estimator.  Before doing
# that, show confidence intervals for mean and individual log(y),
# and for the latter, also show bootstrap percentile nonparametric
# pointwise confidence limits
set.seed(1)
x1 <- runif(300)
x2 <- runif(300)
ddist <- datadist(x1,x2); options(datadist='ddist')
y <- exp(x1 + x2 - 1 + rnorm(300))
f <- ols(log(y) ~ pol(x1,2) + x2, x=TRUE, y=TRUE)  # x y for bootcov
fb <- bootcov(f, B=100)
pb <- Predict(fb, x1, x2=c(.25,.75))
p1 <- Predict(f,  x1, x2=c(.25,.75))
p  <- rbind(normal=p1, boot=pb)
plot(p)

p1 <- Predict(f, x1, conf.type='mean')
p2 <- Predict(f, x1, conf.type='individual')
p  <- rbind(mean=p1, individual=p2)
plot(p, label.curve=FALSE)   # uses superposition
plot(p, ~x1 | .set.)         # 2 panels

r <- resid(f)
smean <- function(yhat) smearingEst(yhat, exp, res, statistic='mean')
formals(smean) <- list(yhat=numeric(0), res=r[!is.na(r)])
#smean$res <- r[!is.na(r)]   # define default res argument to function
Predict(f, x1, fun=smean)

## Example using offset
g <- Glm(Y ~ offset(log(N)) + x1 + x2, family=poisson)
Predict(g, offset=list(N=100))
## End(Not run)
options(datadist=NULL)
Computes a variety of types of predicted values for fits from
lrm
and orm
, either from the original dataset or for new
observations. The Mean.lrm
and Mean.orm
functions produce
an R function to compute the predicted mean of a numeric ordered
response variable given the linear predictor, which is assumed to use
the first intercept when it was computed. The returned function has two
optional arguments if confidence intervals are desired: conf.int
and the design matrix X
. When this derived function is called
with nonzero conf.int
, an attribute named limits
is attached
to the estimated mean. This is a list with elements lower
and
upper
containing normal approximations for confidence limits
using the delta method.
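A minimal sketch of requesting those limits (argument names follow the description above; the fit and predictors are hypothetical):

f  <- lrm(y ~ x1 + x2, x=TRUE, y=TRUE)  # ordinal or binary lrm fit
m  <- Mean(f, codes=TRUE)
nd <- data.frame(x1=.5, x2=.2)
lp <- predict(f, nd)              # linear predictor using first intercept
X  <- predict(f, nd, type='x')    # design matrix, needed for delta method
mn <- m(lp, X=X, conf.int=0.95)
attr(mn, 'limits')                # normal-approximation lower and upper limits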
## S3 method for class 'lrm'
predict(object, ...,
        type=c("lp", "fitted", "fitted.ind", "mean", "x", "data.frame",
               "terms", "cterms", "ccterms", "adjto", "adjto.data.frame",
               "model.frame"),
        se.fit=FALSE, codes=FALSE)

## S3 method for class 'orm'
predict(object, ...,
        type=c("lp", "fitted", "fitted.ind", "mean", "x", "data.frame",
               "terms", "cterms", "ccterms", "adjto", "adjto.data.frame",
               "model.frame"),
        se.fit=FALSE, codes=FALSE)

## S3 method for class 'lrm'
Mean(object, codes=FALSE, ...)

## S3 method for class 'orm'
Mean(object, codes=FALSE, ...)
object |
an object created by |
... |
arguments passed to |
type |
See |
se.fit |
applies only to |
codes |
if |
a vector (type="lp"
with se.fit=FALSE
, or
type="mean"
or only one
observation being predicted), a list (with elements linear.predictors
and se.fit
if se.fit=TRUE
), a matrix (type="fitted"
or type="fitted.ind"
), a data frame, or a design matrix. For
Mean.lrm
and Mean.orm
, the result is an R function.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
For the Quantile
function:
Qi Liu and Shengxin Tu
Department of Biostatistics, Vanderbilt University
Hannah M, Quigley P: Presentation of ordinal regression analysis on the original scale. Biometrics 52:771–5; 1996.
lrm
, orm
, predict.rms
,
naresid
, contrast.rms
# See help for predict.rms for several binary logistic
# regression examples

# Examples of predictions from ordinal models
set.seed(1)
y <- factor(sample(1:3, 400, TRUE), 1:3, c('good','better','best'))
x1 <- runif(400)
x2 <- runif(400)
f <- lrm(y ~ rcs(x1,4)*x2, x=TRUE)   # x=TRUE needed for se.fit

# Get 0.95 confidence limits for Prob[better or best]
L <- predict(f, se.fit=TRUE)   # omitted kint= so use 1st intercept
plogis(with(L, linear.predictors + 1.96*cbind(-se.fit,se.fit)))

predict(f, type="fitted.ind")[1:10,]  # gets Prob(better) and all others
d <- data.frame(x1=c(.1,.5), x2=c(.5,.15))
predict(f, d, type="fitted")      # Prob(Y>=j) for new observation
predict(f, d, type="fitted.ind")  # Prob(Y=j)
predict(f, d, type='mean', codes=TRUE)  # predicts mean(y) using codes 1,2,3

m <- Mean(f, codes=TRUE)
lp <- predict(f, d)
m(lp)
# Can use function m as an argument to Predict or nomogram to
# get predicted means instead of log odds or probabilities
dd <- datadist(x1,x2); options(datadist='dd')
m
plot(Predict(f, x1, fun=m), ylab='Predicted Mean')
# Note: Run f through bootcov with coef.reps=TRUE to get proper confidence
# limits for predicted means from the prop. odds model
options(datadist=NULL)
The predict
function is used to obtain a variety of values or
predicted values from either the data used to fit the model (if
type="adjto"
or "adjto.data.frame"
or if x=TRUE
or
linear.predictors=TRUE
were specified to the modeling function), or from
a new dataset. Parameters such as knots and factor levels used in creating
the design matrix in the original fit are "remembered".
See the Function
function for another method for computing the
linear predictors. predictrms
is an internal utility function
that is used by the other prediction functions.
predictrms(fit, newdata=NULL,
           type=c("lp", "x", "data.frame", "terms", "cterms", "ccterms",
                  "adjto", "adjto.data.frame", "model.frame"),
           se.fit=FALSE, conf.int=FALSE,
           conf.type=c('mean', 'individual', 'simultaneous'),
           kint=NULL, na.action=na.keep, expand.na=TRUE,
           center.terms=type=="terms", ref.zero=FALSE,
           posterior.summary=c('mean', 'median', 'mode'),
           second=FALSE, ...)

## S3 method for class 'bj'
predict(object, newdata,
        type=c("lp", "x", "data.frame", "terms", "cterms", "ccterms",
               "adjto", "adjto.data.frame", "model.frame"),
        se.fit=FALSE, conf.int=FALSE,
        conf.type=c('mean','individual','simultaneous'),
        kint=1, na.action=na.keep, expand.na=TRUE,
        center.terms=type=="terms", ...) # for bj

## S3 method for class 'cph'
predict(object, newdata=NULL,
        type=c("lp", "x", "data.frame", "terms", "cterms", "ccterms",
               "adjto", "adjto.data.frame", "model.frame"),
        se.fit=FALSE, conf.int=FALSE,
        conf.type=c('mean','individual','simultaneous'),
        kint=1, na.action=na.keep, expand.na=TRUE,
        center.terms=type=="terms", ...) # cph

## S3 method for class 'Glm'
predict(object, newdata,
        type=c("lp", "x", "data.frame", "terms", "cterms", "ccterms",
               "adjto", "adjto.data.frame", "model.frame"),
        se.fit=FALSE, conf.int=FALSE,
        conf.type=c('mean','individual','simultaneous'),
        kint=1, na.action=na.keep, expand.na=TRUE,
        center.terms=type=="terms", ...) # Glm

## S3 method for class 'Gls'
predict(object, newdata,
        type=c("lp", "x", "data.frame", "terms", "cterms", "ccterms",
               "adjto", "adjto.data.frame", "model.frame"),
        se.fit=FALSE, conf.int=FALSE,
        conf.type=c('mean','individual','simultaneous'),
        kint=1, na.action=na.keep, expand.na=TRUE,
        center.terms=type=="terms", ...) # Gls

## S3 method for class 'ols'
predict(object, newdata,
        type=c("lp", "x", "data.frame", "terms", "cterms", "ccterms",
               "adjto", "adjto.data.frame", "model.frame"),
        se.fit=FALSE, conf.int=FALSE,
        conf.type=c('mean','individual','simultaneous'),
        kint=1, na.action=na.keep, expand.na=TRUE,
        center.terms=type=="terms", ...) # ols

## S3 method for class 'psm'
predict(object, newdata,
        type=c("lp", "x", "data.frame", "terms", "cterms", "ccterms",
               "adjto", "adjto.data.frame", "model.frame"),
        se.fit=FALSE, conf.int=FALSE,
        conf.type=c('mean','individual','simultaneous'),
        kint=1, na.action=na.keep, expand.na=TRUE,
        center.terms=type=="terms", ...) # psm
object, fit
|
a fit object with an |
newdata |
An S data frame, list or a matrix specifying new data for which predictions
are desired. If |
type |
Type of output desired. The default is |
se.fit |
Defaults to |
conf.int |
Specify |
conf.type |
specifies the type of confidence interval. Default is for the mean.
For |
posterior.summary |
when making predictions from a Bayesian model, specifies whether you want the linear predictor to be computed from the posterior mean of parameters (default) or from the posterior median or mode |
second |
set to |
kint |
a single integer specifying the number of the intercept to use in
multiple-intercept models. The default is 1 for |
na.action |
Function to handle missing values in |
expand.na |
set to |
center.terms |
set to |
ref.zero |
Set to |
... |
ignored |
datadist
and options(datadist=)
should be run before predictrms
if using type="adjto"
, type="adjto.data.frame"
, or type="terms"
,
or if the fit is a Cox model fit and you are requesting se.fit=TRUE
.
For these cases, the adjustment values are needed (either for the
returned result or for the correct covariance matrix computation).
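A minimal sketch of this requirement (the data frame d, fit f, and variables are hypothetical; any rms fitter such as lrm could be used):

# Make the adjustment ("adjust-to") values available before requesting
# type='terms' or type='adjto.data.frame'
dd <- datadist(d); options(datadist='dd')
f   <- lrm(y ~ rcs(age, 4) + sex, data=d, x=TRUE, y=TRUE)
tt  <- predict(f, newdata=d, type='terms')    # centered terms; needs datadist
adj <- predict(f, type='adjto.data.frame')    # settings used for centering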
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
plot.Predict
, ggplot.Predict
,
summary.rms
,
rms
, rms.trans
, predict.lrm
,
predict.orm
,
residuals.cph
, datadist
,
gendata
, gIndex
,
Function.rms
, reShape
,
xYplot
, contrast.rms
n <- 1000    # define sample size
set.seed(17) # so can reproduce the results
age <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
cholesterol <- rnorm(n, 200, 25)
sex   <- factor(sample(c('female','male'), n, TRUE))
treat <- factor(sample(c('a','b','c'), n, TRUE))

# Specify population model for log odds that Y=1
L <- .4*(sex=='male') + .045*(age-50) +
  (log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male')) +
  .3*sqrt(blood.pressure-60)-2.3 + 1*(treat=='b')
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)

ddist <- datadist(age, blood.pressure, cholesterol, sex, treat)
options(datadist='ddist')

fit <- lrm(y ~ rcs(blood.pressure,4) + sex * (age + rcs(cholesterol,4)) +
           sex*treat*age)

# Use xYplot to display predictions in 9 panels, with error bars,
# with superposition of two treatments
dat <- expand.grid(treat=levels(treat), sex=levels(sex),
                   age=c(20,40,60), blood.pressure=120,
                   cholesterol=seq(100,300,length=10))
# Add variables linear.predictors and se.fit to dat
dat <- cbind(dat, predict(fit, dat, se.fit=TRUE))
# This is much easier with Predict
# xYplot in Hmisc extends xyplot to allow error bars
xYplot(Cbind(linear.predictors, linear.predictors-1.96*se.fit,
             linear.predictors+1.96*se.fit) ~ cholesterol | sex*age,
       groups=treat, data=dat, type='b')

# Since blood.pressure doesn't interact with anything, we can quickly and
# interactively try various transformations of blood.pressure, taking
# the fitted spline function as the gold standard. We are seeking a
# linearizing transformation even though this may lead to falsely
# narrow confidence intervals if we use this data-dredging-based transformation
bp <- 70:160
logit <- predict(fit, expand.grid(treat="a", sex='male', age=median(age),
                 cholesterol=median(cholesterol), blood.pressure=bp),
                 type="terms")[,"blood.pressure"]
# Note: if age interacted with anything, this would be the age
# "main effect" ignoring interaction terms
# Could also use Predict(f, age=ag)$yhat
# which allows evaluation of the shape for any level of interacting
# factors. When age does not interact with anything, the result from
# predict(f, ..., type="terms") would equal the result from
# plot if all other terms were ignored
plot(bp^.5, logit)       # try square root vs. spline transform.
plot(bp^1.5, logit)      # try 1.5 power
plot(sqrt(bp-60), logit)

# Some approaches to making a plot showing how predicted values
# vary with a continuous predictor on the x-axis, with two other
# predictors varying
combos <- gendata(fit, age=seq(10,100,by=10), cholesterol=c(170,200,230),
                  blood.pressure=c(80,120,160))
# treat, sex not specified -> set to mode
# can also use expand.grid
require(lattice)
combos$pred <- predict(fit, combos)
xyplot(pred ~ age | cholesterol*blood.pressure, data=combos, type='l')
xYplot(pred ~ age | cholesterol, groups=blood.pressure, data=combos, type='l')
Key()   # Key created by xYplot
xYplot(pred ~ age, groups=interaction(cholesterol, blood.pressure),
       data=combos, type='l', lty=1:9)
Key()

# Add upper and lower 0.95 confidence limits for individuals
combos <- cbind(combos, predict(fit, combos, conf.int=.95))
xYplot(Cbind(linear.predictors, lower, upper) ~ age | cholesterol,
       groups=blood.pressure, data=combos, type='b')
Key()

# Plot effects of treatments (all pairwise comparisons) vs.
# levels of interacting factors (age, sex)
d <- gendata(fit, treat=levels(treat), sex=levels(sex), age=seq(30,80,by=10))
x <- predict(fit, d, type="x")
betas <- fit$coef
cov <- vcov(fit, intercepts='none')

i <- d$treat=="a"; xa <- x[i,]; Sex <- d$sex[i]; Age <- d$age[i]
i <- d$treat=="b"; xb <- x[i,]
i <- d$treat=="c"; xc <- x[i,]

doit <- function(xd, lab) {
  xb <- matxv(xd, betas)
  se <- apply((xd %*% cov) * xd, 1, sum)^.5
  q <- qnorm(1-.01/2)   # 0.99 confidence limits
  lower <- xb - q * se; upper <- xb + q * se
  # Get odds ratios instead of linear effects
  xb <- exp(xb); lower <- exp(lower); upper <- exp(upper)
  # First elements of these agree with
  # summary(fit, age=30, sex='female', conf.int=.99)
  for(sx in levels(Sex)) {
    j <- Sex==sx
    errbar(Age[j], xb[j], upper[j], lower[j], xlab="Age",
           ylab=paste(lab, "Odds Ratio"), ylim=c(.1, 20), log='y')
    title(paste("Sex:", sx))
    abline(h=1, lty=2)
  }
}

par(mfrow=c(3,2), oma=c(3,0,3,0))
doit(xb - xa, "b:a")
doit(xc - xa, "c:a")
doit(xc - xb, "c:b")
# NOTE: This is much easier to do using contrast.rms

# Demonstrate type="terms", "cterms", "ccterms"
set.seed(1)
n <- 40
x <- 1:n
w <- factor(sample(c('a', 'b'), n, TRUE))
u <- factor(sample(c('A', 'B'), n, TRUE))
y <- .01*x + .2*(w=='b') + .3*(u=='B') + .2*(w=='b' & u=='B') + rnorm(n)/5
ddist <- datadist(x, w, u)
f <- ols(y ~ x*w*u, x=TRUE, y=TRUE)
f
anova(f)
z <- predict(f, type='terms', center.terms=FALSE)
z[1:5,]
k <- coef(f)
## Manually compute combined terms
wb <- w=='b'
uB <- u=='B'
h  <- k['x * w=b * u=B']*x*wb*uB
tx <- k['x']  *x  + k['x * w=b']*x*wb + k['x * u=B'] *x*uB  + h
tw <- k['w=b']*wb + k['x * w=b']*x*wb + k['w=b * u=B']*wb*uB + h
tu <- k['u=B']*uB + k['x * u=B']*x*uB + k['w=b * u=B']*wb*uB + h
h  <- z[,'x * w * u']   # highest order term is present in all cterms
tx2 <- z[,'x']+z[,'x * w']+z[,'x * u']+h
tw2 <- z[,'w']+z[,'x * w']+z[,'w * u']+h
tu2 <- z[,'u']+z[,'x * u']+z[,'w * u']+h
ae <- function(a, b) all.equal(a, b, check.attributes=FALSE)
ae(tx, tx2)
ae(tw, tw2)
ae(tu, tu2)
zc <- predict(f, type='cterms')
zc[1:5,]
ae(tx, zc[,'x'])
ae(tw, zc[,'w'])
ae(tu, zc[,'u'])
zc <- predict(f, type='ccterms')
# As all factors are indirectly related, ccterms gives overall linear
# predictor except for the intercept
zc[1:5,]
ae(as.vector(zc + coef(f)[1]), f$linear.predictors)

## Not run: 
# A variable state.code has levels "1", "5", "13"
# Get predictions with or without converting variable in newdata to factor
predict(fit, data.frame(state.code=c(5,13)))
predict(fit, data.frame(state.code=factor(c(5,13))))

# Use gendata function (gendata.rms) for interactive specification of
# predictor variable settings (for 10 observations)
df <- gendata(fit, nobs=10, viewvals=TRUE)
df$predicted <- predict(fit, df)      # add variable to data frame df

df <- gendata(fit, age=c(10,20,30))   # leave other variables at ref. vals.
predict(fit, df, type="fitted")

# See reShape (in Hmisc) for an example where predictions corresponding to
# values of one of the varying predictors are reformatted into multiple
# columns of a matrix
## End(Not run)
options(datadist=NULL)
Formatted printing of an object of class cph
. Prints strata
frequencies, parameter estimates, standard errors, z-statistics, numbers
of missing values, etc.
Format of output is controlled by the user previously running
options(prType="lang")
where lang
is "plain"
(the default),
"latex"
, or "html"
. This does not require results='asis'
in knitr
chunk headers.
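A minimal sketch of switching the output format (assumes f is a cph fit):

options(prType='html')    # or 'latex'; 'plain' is the default
print(f, conf.int=.95)    # model summary rendered as html, with 0.95 CIs
options(prType='plain')   # restore plain-text printing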
## S3 method for class 'cph'
print(x, digits=4, r2=c(0,2,4), table=TRUE, conf.int=FALSE,
      coefs=TRUE, pg=FALSE,
      title='Cox Proportional Hazards Model', ...)
x |
fit object |
digits |
number of digits to right of decimal place to print |
r2 |
vector of integers specifying which R^2 measures to print,
with 0 for Nagelkerke R^2 and 1:4 corresponding to the 4 measures
computed by |
conf.int |
set to e.g. .95 to print 0.95 confidence intervals on simple hazard ratios (which are usually meaningless as one-unit changes are seldom relevant and most models contain multiple terms per predictor) |
table |
set to |
coefs |
specify |
pg |
set to |
title |
a character string title to be passed to |
... |
arguments passed to |
Print a 'Glm' Object
## S3 method for class 'Glm'
print(x, digits = 4, coefs = TRUE, title = "General Linear Model", ...)
x |
'Glm' object |
digits |
number of significant digits to print |
coefs |
specify 'coefs=FALSE' to suppress printing the table of model coefficients, standard errors, etc. Specify 'coefs=n' to print only the first 'n' regression coefficients in the model. |
title |
a character string title to be passed to 'prModFit' |
... |
ignored |
Prints a 'Glm' object, optionally in LaTeX or html
Frank Harrell
Prints statistical summaries and optionally predicted values computed by impactPO
, transposing statistical summaries for easy reading
## S3 method for class 'impactPO'
print(x, estimates = nrow(x$estimates) < 16, ...)
x |
an object created by |
estimates |
set to |
... |
ignored |
Frank Harrell
Formatted printing of an object of class ols
using methods taken
from print.lm
and summary.lm
. Prints R-squared, adjusted
R-squared, parameter estimates, standard errors, and t-statistics (Z
statistics if penalized estimation was used). For penalized estimation,
prints the maximum penalized likelihood estimate of the residual
standard deviation (Sigma
) instead of the usual root mean squared
error.
Format of output is controlled by the user previously running
options(prType="lang")
where lang
is "plain"
(the default),
"latex"
, or "html"
. When using html with Quarto or RMarkdown,
results='asis'
need not be written in the chunk header.
## S3 method for class 'ols'
print(x, digits=4, long=FALSE, coefs=TRUE,
      title="Linear Regression Model", ...)
x |
fit object |
digits |
number of significant digits to print |
long |
set to |
coefs |
specify |
title |
a character string title to be passed to |
... |
other parameters to pass to |
Print rexVar Result
## S3 method for class 'rexVar'
print(x, title = "Relative Explained Variation", digits = 3, ...)
x |
a vector or matrix created by |
title |
character string which can be set to |
digits |
passed to |
... |
unused |
Prints the results of an rexVar
call
invisible
Frank Harrell
Print Information About Impact of Imputation
prmiInfo(x)
x |
an object created by |
For the results of processMI.fit.mult.impute
prints or writes html (the latter if options(prType='html')
is in effect) summarizing various correction factors related to missing data multiple imputation.
nothing
Frank Harrell
## Not run: 
a <- aregImpute(...)
f <- fit.mult.impute(...)
v <- processMI(f, 'anova')
prmiInfo(v)
## End(Not run)
Process Special Multiple Imputation Output
processMI(object, ...)
object |
a fit object created by |
... |
ignored |
Processes lists that have one element per imputation
an object that resembles something created by a single fit without multiple imputation
Frank Harrell
Process Special Multiple Imputation Output From fit.mult.impute
## S3 method for class 'fit.mult.impute'
processMI(
  object,
  which = c("validate", "calibrate", "anova"),
  plotall = TRUE,
  nind = 0,
  prmi = TRUE,
  ...
)
object |
a fit object created by |
which |
specifies which component of the extra output should be processed |
plotall |
set to |
nind |
set to a positive integer to use base graphics to plot a matrix of graphs, one each for the first |
prmi |
set to |
... |
ignored |
Processes a funresults
object stored in a fit object created by fit.mult.impute
when its fun
argument was used. These objects are typically named validate
or calibrate
and represent bootstrap or cross-validations run separately for each imputation.
For which='anova'
assumes that the fun
given to fit.mult.impute
runs anova(fit, test='LR')
to get likelihood ratio tests, and that method='stack'
was specified also so that a final anova
was run on the stacked combination of all completed datasets. The method of Chan and Meng (2022) is used to obtain overall likelihood ratio tests, with each line of the anova
table getting a customized adjustment based on the amount of missing information pertaining to the variables tested in that line. The resulting statistics are chi-square and not F statistics as used by Chan and Meng. This will matter when the estimated denominator degrees of freedom for a variable is small (e.g., less than 50). These d.f. are reported so that the user can take appropriate caution, such as increasing n.impute
for aregImpute
.
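A minimal sketch of this workflow (the data frame d, variables, and n.impute setting are hypothetical; the fun and method usage follows the description above):

a <- aregImpute(~ y + x1 + x2, data=d, n.impute=20)
f <- fit.mult.impute(y ~ x1 + x2, lrm, a, data=d,
                     x=TRUE, y=TRUE,   # x, y needed for LR tests
                     fun=function(fit) anova(fit, test='LR'),
                     method='stack')
v <- processMI(f, which='anova')   # Chan-Meng adjusted LR tests
v
prmiInfo(v)   # print imputation adjustment details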
an object like a validate
, calibrate
, or anova
result obtained when no multiple imputation was done. This object is suitable for print
and plot
methods for these kinds of objects.
Frank Harrell
psm
is a modification of Therneau's survreg
function for
fitting the accelerated failure time family of parametric survival
models. psm
uses the rms
class for automatic
anova
, fastbw
, calibrate
, validate
, and
other functions. Hazard.psm
, Survival.psm
,
Quantile.psm
, and Mean.psm
create S functions that
evaluate the hazard, survival, quantile, and mean (expected value)
functions analytically, as functions of time or probabilities and the
linear predictor values. The Nagelkerke R^2 and adjusted
Maddala-Cox-Snell R^2 are computed. For the latter the notation is
R2(p,m) where p is the number of regression coefficients being
adjusted for and m is the effective sample size (number of uncensored
observations). See R2Measures
for more information.
For the print
method, format of output is controlled by the
user previously running options(prType="lang")
where
lang
is "plain"
(the default), "latex"
, or
"html"
.
The residuals.psm
function exists mainly to compute normalized
(standardized) residuals and to censor them (i.e., return them as
Surv
objects) just as the original failure time variable was
censored. These residuals are useful for checking the underlying
distributional assumption (see the examples). To get these residuals,
the fit must have specified y=TRUE
. A lines
method for these
residuals automatically draws a curve with the assumed standardized
survival distribution. A survplot
method runs the standardized
censored residuals through npsurv
to get Kaplan-Meier estimates,
with optional stratification (automatically grouping a continuous
variable into quantiles) and then through survplot.npsurv
to plot
them. Then lines
is invoked to show the theoretical curve. Other
types of residuals are computed by residuals
using
residuals.survreg
.
psm(formula, data=environment(formula), weights, subset,
    na.action=na.delete, dist="weibull", init=NULL, scale=0,
    control=survreg.control(), parms=NULL,
    model=FALSE, x=FALSE, y=TRUE, time.inc, ...)

## S3 method for class 'psm'
print(x, correlation=FALSE, digits=4, r2=c(0,2,4), coefs=TRUE,
      pg=FALSE, title, ...)

Hazard(object, ...)
## S3 method for class 'psm'
Hazard(object, ...)     # for psm fit
# E.g. lambda <- Hazard(fit)

Survival(object, ...)
## S3 method for class 'psm'
Survival(object, ...)   # for psm
# E.g. survival <- Survival(fit)

## S3 method for class 'psm'
Quantile(object, ...)   # for psm
# E.g. quantsurv <- Quantile(fit)

## S3 method for class 'psm'
Mean(object, ...)       # for psm
# E.g. meant <- Mean(fit)

# lambda(times, lp)     # get hazard function at t=times, xbeta=lp
# survival(times, lp)   # survival function at t=times, lp
# quantsurv(q, lp)      # quantiles of survival time
# meant(lp)             # mean survival time

## S3 method for class 'psm'
residuals(object, type=c("censored.normalized", "response",
          "deviance", "dfbeta", "dfbetas", "working", "ldcase",
          "ldresp", "ldshape", "matrix", "score"), ...)

## S3 method for class 'residuals.psm.censored.normalized'
survplot(fit, x, g=4, col, main, ...)

## S3 method for class 'residuals.psm.censored.normalized'
lines(x, n=100, lty=1, xlim, lwd=3, ...)
# for type="censored.normalized"
formula |
an S statistical model formula. Interactions up to third order are
supported. The left hand side must be a |
object |
a fit created by |
fit |
a fit created by |
data, subset, weights, dist, scale, init, na.action, control
|
see |
parms |
a list of fixed parameters. For the |
model |
set to |
x |
set to |
y |
set to |
time.inc |
setting for default time spacing. Used in constructing time axis
in |
correlation |
set to |
digits |
number of places to print to the right of the decimal point |
r2 |
vector of integers specifying which R^2 measures to print,
with 0 for Nagelkerke R^2 and 1:4 corresponding to the 4 measures
computed by |
coefs |
specify |
pg |
set to |
title |
a character string title to be passed to |
... |
other arguments to fitting routines, or to pass to |
times |
a scalar or vector of times for which to evaluate survival probability or hazard |
lp |
a scalar or vector of linear predictor values at which to evaluate
survival probability or hazard. If both |
q |
a scalar or vector of probabilities. The default is .5, so just the
median survival time is returned. If |
type |
type of residual desired. Default is censored normalized residuals,
defined as (link(Y) - linear.predictors)/scale parameter, where the
link function was usually the log function. See |
n |
number of points to evaluate theoretical standardized survival
function for
|
lty |
line type for |
xlim |
range of times (or transformed times) for which to evaluate the standardized survival function. Default is range in normalized residuals. |
lwd |
line width for theoretical distribution, default is 3 |
g |
number of quantile groups to use for stratifying continuous variables having more than 5 levels |
col |
vector of colors for |
main |
main plot title for |
The object survreg.distributions
contains definitions of properties
of the various survival distributions.
psm
does not trap singularity errors due to the way survreg.fit
does matrix inversion. It will trap non-convergence (thus returning
fit$fail=TRUE
) if you give the argument failure=2
inside the
control
list which is passed to survreg.fit
. For example, use
f <- psm(S ~ x, control=list(failure=2, maxiter=20))
to allow up to
20 iterations and to set f$fail=TRUE
in case of non-convergence.
This is especially useful in simulation work.
psm
returns a fit object with all the information survreg
would store as
well as what rms
stores and units
and time.inc
.
Hazard
, Survival
, and Quantile
return S-functions.
residuals.psm
with type="censored.normalized"
returns a
Surv
object which has a special attribute "theoretical"
which is used by the lines
routine. This is the assumed standardized survival function as a function
of time or transformed time.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
rms
, survreg
,
residuals.survreg
, survreg.object
,
survreg.distributions
,
pphsm
, survplot
, survest
,
Surv
,
na.delete
,
na.detail.response
, datadist
,
latex.psm
, GiniMd
, prModFit
,
ggplot.Predict
, plot.Predict
,
R2Measures
require(survival)
n <- 400
set.seed(1)
age <- rnorm(n, 50, 12)
sex <- factor(sample(c('Female','Male'), n, TRUE))
dd <- datadist(age, sex)
options(datadist='dd')
# Population hazard function:
h <- .02*exp(.06*(age-50)+.8*(sex=='Female'))
d.time <- -log(runif(n))/h
cens   <- 15*runif(n)
death  <- ifelse(d.time <= cens, 1, 0)
d.time <- pmin(d.time, cens)

f <- psm(Surv(d.time,death) ~ sex*pol(age,2), dist='lognormal')
# Log-normal model is a bad fit for proportional hazards data
print(f, r2=0:4, pg=TRUE)
anova(f)
fastbw(f)   # if deletes sex while keeping age*sex ignore the result
f <- update(f, x=TRUE, y=TRUE)   # so can validate, compute certain resids
validate(f, B=10)   # ordinarily use B=300 or more
plot(Predict(f, age, sex))   # needs datadist since no explicit age values given
# Could have used ggplot(Predict(...))
survplot(f, age=c(20,60))    # needs datadist since sex is not set here
# latex(f)

S <- Survival(f)
plot(f$linear.predictors, S(6, f$linear.predictors),
     xlab=expression(X*hat(beta)),
     ylab=expression(S(6,X*hat(beta))))
# plots 6-month survival as a function of linear predictor (X*Beta hat)

times <- seq(0, 24, by=.25)
plot(times, S(times,0), type='l')    # plots survival curve at X*Beta hat=0
lam <- Hazard(f)
plot(times, lam(times,0), type='l')  # similarly for hazard function

med <- Quantile(f)   # new function defaults to computing median only
lp <- seq(-3, 5, by=.1)
plot(lp, med(lp=lp), ylab="Median Survival Time")
med(c(.25,.5), f$linear.predictors)  # prints matrix with 2 columns

# fit a model with no predictors
f <- psm(Surv(d.time,death) ~ 1, dist="weibull")
f
pphsm(f)   # print proportional hazards form
g <- survest(f)
plot(g$time, g$surv, xlab='Time', type='l', ylab=expression(S(t)))

f <- psm(Surv(d.time,death) ~ age, dist="loglogistic", y=TRUE)
r <- resid(f, 'cens')   # note abbreviation
survplot(npsurv(r ~ 1), conf='none')
# plot Kaplan-Meier estimate of
# survival function of standardized residuals
survplot(npsurv(r ~ cut2(age, g=2)), conf='none')
# both strata should be n(0,1)
lines(r)   # add theoretical survival function
# More simply:
survplot(r, age, g=2)

options(datadist=NULL)
Calculates martingale, deviance, score or Schoenfeld residuals
(scaled or unscaled) or influence statistics for a
Cox proportional hazards model. This is a slightly modified version
of Therneau's residuals.coxph
function. It assumes that x=TRUE
and
y=TRUE
were specified to cph
, except for martingale
residuals, which are stored with the fit by default.
## S3 method for class 'cph'
residuals(object,
          type=c("martingale", "deviance", "score", "schoenfeld",
                 "dfbeta", "dfbetas", "scaledsch", "partial"), ...)
object |
a |
type |
character string indicating the type of residual desired;
the default is martingale.
Only enough of the string to determine a unique match is required.
Instead of the usual residuals, |
... |
see |
The object returned will be a vector for martingale and deviance
residuals and matrices for score and schoenfeld residuals, dfbeta, or dfbetas.
There will
be one row of residuals for each row in the input data (without collapse
).
One column of score and Schoenfeld
residuals will be returned for each column in the model.matrix.
The scaled Schoenfeld residuals are used in the cox.zph
function.
The score residuals are each individual's contribution to the score
vector. Two transformations of this are often more useful: dfbeta
is
the approximate change in the coefficient vector if that observation
were dropped, and dfbetas
is the approximate change in the coefficients,
scaled by the standard error for the coefficients.
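A minimal sketch of using these (assuming f is a cph fit created with x=TRUE, y=TRUE):

db <- resid(f, type='dfbetas')   # n x p matrix, scaled by coefficient SEs
apply(abs(db), 2, which.max)     # most influential observation per coefficient
which.influence(f, cutoff=.2)    # rms helper built on dfbetas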
T. Therneau, P. Grambsch, and T. Fleming. "Martingale based residuals for survival models", Biometrika, March 1990.
P. Grambsch, T. Therneau. "Proportional hazards tests and diagnostics based on weighted residuals", unpublished manuscript, Feb 1993.
cph
, coxph
, residuals.coxph
, cox.zph
, naresid
# fit <- cph(Surv(start, stop, event) ~ (age + surgery)* transplant,
#            data=jasa1)
# mresid <- resid(fit, collapse=jasa1$id)

# Get unadjusted relationships for several variables
# Pick one variable that's not missing too much, for fit
require(survival)
n <- 1000    # define sample size
set.seed(17) # so can reproduce the results
age <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
cholesterol <- rnorm(n, 200, 25)
sex <- factor(sample(c('female','male'), n, TRUE))
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50)+.8*(sex=='female'))
d.time <- -log(runif(n))/h
death  <- ifelse(d.time <= cens, 1, 0)
d.time <- pmin(d.time, cens)

f <- cph(Surv(d.time, death) ~ age + blood.pressure + cholesterol,
         iter.max=0)
res <- resid(f)   # This re-inserts rows for NAs, unlike f$resid
yl <- quantile(res, c(10/length(res), 1-10/length(res)), na.rm=TRUE)
# Scale all plots from 10th smallest to 10th largest residual
par(mfrow=c(2,2), oma=c(3,0,3,0))
p <- function(x) {
  s <- !is.na(x+res)
  plot(lowess(x[s], res[s], iter=0), xlab=label(x), ylab="Residual",
       ylim=yl, type="l")
}
p(age); p(blood.pressure); p(cholesterol)
mtext("Smoothed Martingale Residuals", outer=TRUE)

# Assess PH by estimating log relative hazard over time
f <- cph(Surv(d.time,death) ~ age + sex + blood.pressure,
         x=TRUE, y=TRUE)
r <- resid(f, "scaledsch")
tt <- as.numeric(dimnames(r)[[1]])
par(mfrow=c(3,2))
for(i in 1:3) {
  g <- areg.boot(I(r[,i]) ~ tt, B=20)
  plot(g, boot=FALSE)   # shows bootstrap CIs
}
# Focus on 3 graphs on right
# Easier approach:
plot(cox.zph(f))   # invokes plot.cox.zph
par(mfrow=c(1,1))
Residuals for 'Glm'
## S3 method for class 'Glm'
residuals(object, type, ...)
object |
a fit object produced by 'Glm' |
type |
either ''score'' or a 'type' accepted by 'residuals.glm' |
... |
ignored |
This function mainly passes through to 'residuals.glm', but for type='score' computes the matrix of score residuals using code modified from 'sandwich::estfun.glm'.
a vector or matrix
Frank Harrell
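A minimal sketch (simulated data; assumes the rms package is loaded):

set.seed(1)
x <- rnorm(100)
y <- rpois(100, exp(.3 * x))
f <- Glm(y ~ x, family=poisson())
r <- residuals(f, type='score')   # n x p matrix of score residuals
colSums(r)                        # each column sums to ~0 at the MLE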
Residuals from an lrm or orm Fit

For a binary logistic model fit, computes the following residuals, letting P denote the predicted probability of the higher category of Y, X denote the design matrix (with a column of 1s for the intercept), and L denote the logit or linear predictors: ordinary or Li-Shepherd (Y - P), score (X(Y - P)), pearson ((Y - P)/sqrt(P(1 - P))), deviance (for Y = 0 is -sqrt(2|log(1 - P)|), for Y = 1 is sqrt(2|log(P)|)), pseudo dependent variable used in influence statistics (L + (Y - P)/(P(1 - P))), and partial (X_i beta_i + (Y - P)/(P(1 - P))).
Will compute all these residuals for an ordinal logistic model, using as temporary binary responses dichotomizations of Y, along with the corresponding P, the probability that Y >= cutoff. For type="partial", all possible dichotomizations are used, and for type="score", the actual components of the first derivative of the log likelihood are used for an ordinal model. For type="li.shepherd" the residual is Pr(W < Y) - Pr(W > Y), where Y is the observed response and W is a random variable from the fitted distribution. Alternatively, specify type="score.binary" to use binary model score residuals but for all cutpoints of Y (plotted only, not returned). The score.binary, partial, and perhaps score residuals are useful for checking the proportional odds assumption.
If the option pl=TRUE is used to plot the score or score.binary residuals, a score residual plot is made for each column of the design (predictor) matrix, with Y cutoffs on the x-axis and the mean ± 1.96 standard errors of the score residuals on the y-axis. You can instead use a box plot to display these residuals, for both score.binary and score. Proportional odds dictates a horizontal score.binary plot. Partial residual plots use smooth nonparametric estimates, separately for each cutoff of Y. One examines that plot for parallelism of the curves to check the proportional odds assumption, as well as to see if the predictor behaves linearly.
Also computes a variety of influence statistics and the le Cessie - van Houwelingen - Copas - Hosmer unweighted sum of squares test for global goodness of fit, done separately for each cutoff of Y in the case of an ordinal model.
The plot.lrm.partial
function computes partial residuals for a series
of binary logistic model fits that all used the same predictors and that
specified x=TRUE, y=TRUE
. It then computes smoothed partial residual
relationships (using lowess
with iter=0
) and plots them separately
for each predictor, with residual plots from all model fits shown on the
same plot for that predictor.
## S3 method for class 'lrm'
residuals(object,
          type=c("li.shepherd", "ordinary", "score", "score.binary",
                 "pearson", "deviance", "pseudo.dep", "partial",
                 "dfbeta", "dfbetas", "dffit", "dffits", "hat",
                 "gof", "lp1"),
          pl=FALSE, xlim, ylim, kint, label.curves=TRUE, which, ...)

## S3 method for class 'orm'
residuals(object,
          type=c("li.shepherd", "ordinary", "score", "score.binary",
                 "pearson", "deviance", "pseudo.dep", "partial",
                 "dfbeta", "dfbetas", "dffit", "dffits", "hat",
                 "gof", "lp1"),
          pl=FALSE, xlim, ylim, kint, label.curves=TRUE, which, ...)

## S3 method for class 'lrm.partial'
plot(..., labels, center=FALSE, ylim)
object |
object created by |
... |
for |
type |
type of residual desired. Use |
pl |
applies only to |
xlim |
plotting range for x-axis (default = whole range of predictor) |
ylim |
plotting range for y-axis (default = whole range of residuals, range of
all confidence intervals for |
kint |
for an ordinal model for residuals other than |
label.curves |
set to |
which |
a vector of integers specifying column numbers of the design matrix for
which to compute or plot residuals, for
|
labels |
for |
center |
for |
For the goodness-of-fit test, the le Cessie-van Houwelingen normal test statistic for the unweighted sum of squared errors (Brier score times n) is used. For an ordinal response variable, the test for predicting the probability that Y >= j is done separately for all j (except the first). Note that the test statistic can have strange behavior (i.e., it is far too large) if the model has no predictive value.
For most of the values of type
, you must have specified
x=TRUE, y=TRUE
to lrm
or orm
.
There is as yet no literature on interpreting score residual plots for the ordinal model. Simulations when proportional odds is satisfied have still shown a U-shaped residual plot. The series of binary model score residuals for all cutoffs of Y seems to better check the assumptions. See the examples.
The li.shepherd residual is a single value per observation on the probability scale and can be useful for examining linearity, checking for outliers, and measuring residual correlation.
a matrix (type="partial","dfbeta","dfbetas","score"
),
test statistic (type="gof"
), or a vector otherwise.
For partial residuals from an ordinal model, the returned object is a 3-way array (rows of X by columns of X by cutoffs of Y), and NAs deleted during the fit are not re-inserted into the residuals. For score.binary, nothing is returned.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
Landwehr, Pregibon, Shoemaker. JASA 79:61–83, 1984.
le Cessie S, van Houwelingen JC. Biometrics 47:1267–1282, 1991.
Hosmer DW, Hosmer T, Lemeshow S, le Cessie S, Lemeshow S. A comparison of goodness-of-fit tests for the logistic regression model. Stat in Med 16:965–980, 1997.
Copas JB. Applied Statistics 38:71–80, 1989.
Li C, Shepherd BE. Biometrika 99:473-480, 2012.
lrm
, orm
,
naresid
, which.influence
,
loess
, supsmu
, lowess
,
boxplot
, labcurve
set.seed(1)
x1 <- runif(200, -1, 1)
x2 <- runif(200, -1, 1)
L  <- x1^2 - .5 + x2
y  <- ifelse(runif(200) <= plogis(L), 1, 0)
f  <- lrm(y ~ x1 + x2, x=TRUE, y=TRUE)
resid(f)            # add rows for NAs back to data
resid(f, "score")   # also adds back rows
r <- resid(f, "partial")   # for checking transformations of X's
par(mfrow=c(1,2))
for(i in 1:2) {
  xx <- if(i==1) x1 else x2
  plot(xx, r[,i], xlab=c('x1','x2')[i])
  lines(lowess(xx, r[,i]))
}
resid(f, "partial", pl="loess")   # same as last 3 lines
resid(f, "partial", pl=TRUE)   # plots for all columns of X using supsmu
resid(f, "gof")                # global test of goodness of fit
lp1 <- resid(f, "lp1")         # approx. leave-out-1 linear predictors
-2*sum(y*lp1 + log(1-plogis(lp1)))   # approx leave-out-1 deviance
                                     # formula assumes y is binary

# Simulate data from a population proportional odds model
set.seed(1)
n <- 400
age <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
L <- .05*(age-50) + .03*(blood.pressure-120)
p12 <- plogis(L)     # Pr(Y>=1)
p2  <- plogis(L-1)   # Pr(Y=2)
p   <- cbind(1-p12, p12-p2, p2)   # individual class probabilities
# Cumulative probabilities:
cp  <- matrix(cumsum(t(p)) - rep(0:(n-1), rep(3,n)), byrow=TRUE, ncol=3)
# simulate multinomial with varying probs:
y <- (cp < runif(n)) %*% rep(1,3)
y <- as.vector(y)
# Thanks to Dave Krantz for this trick
f <- lrm(y ~ age + blood.pressure, x=TRUE, y=TRUE)
par(mfrow=c(2,2))
resid(f, 'score.binary', pl=TRUE)   # plot score residuals
resid(f, 'partial', pl=TRUE)        # plot partial residuals
resid(f, 'gof')   # test GOF for each level separately

# Show use of Li-Shepherd residuals
f.wrong <- lrm(y ~ blood.pressure, x=TRUE, y=TRUE)
par(mfrow=c(2,1))
# li.shepherd residuals from model without age
plot(age, resid(f.wrong, type="li.shepherd"),
     ylab="li.shepherd residual")
lines(lowess(age, resid(f.wrong, type="li.shepherd")))
# li.shepherd residuals from model including age
plot(age, resid(f, type="li.shepherd"),
     ylab="li.shepherd residual")
lines(lowess(age, resid(f, type="li.shepherd")))

# Make a series of binary fits and draw 2 partial residual plots
#
f1 <- lrm(y>=1 ~ age + blood.pressure, x=TRUE, y=TRUE)
f2 <- update(f1, y==2 ~ .)
par(mfrow=c(2,1))
plot.lrm.partial(f1, f2)

# Simulate data from both a proportional odds and a non-proportional
# odds population model.  Check how 3 kinds of residuals detect
# non-prop. odds
set.seed(71)
n <- 400
x <- rnorm(n)
par(mfrow=c(2,3))
for(j in 1:2) {   # 1: prop.odds  2: non-prop. odds
  if(j==1) {
    L <- matrix(c(1.4,.4,-.1,-.5,-.9), nrow=n, ncol=5, byrow=TRUE) + x / 2
  } else {
    # Slopes and intercepts for cutoffs of 1:5 :
    slopes <- c(.7,.5,.3,.3,0)
    ints   <- c(2.5,1.2,0,-1.2,-2.5)
    L <- matrix(ints, nrow=n, ncol=5, byrow=TRUE) +
         matrix(slopes, nrow=n, ncol=5, byrow=TRUE) * x
  }
  p <- plogis(L)
  # Cell probabilities
  p <- cbind(1-p[,1], p[,1]-p[,2], p[,2]-p[,3], p[,3]-p[,4],
             p[,4]-p[,5], p[,5])
  # Cumulative probabilities from left to right
  cp <- matrix(cumsum(t(p)) - rep(0:(n-1), rep(6,n)), byrow=TRUE, ncol=6)
  y  <- (cp < runif(n)) %*% rep(1,6)

  f <- lrm(y ~ x, x=TRUE, y=TRUE)
  for(cutoff in 1:5) print(lrm(y >= cutoff ~ x)$coef)

  print(resid(f,'gof'))
  resid(f, 'score', pl=TRUE)
  # Note that full ordinal model score residuals exhibit a
  # U-shaped pattern even under prop. odds
  ti <- if(j==2) 'Non-Proportional Odds\nSlopes=.7 .5 .3 .3 0' else
    'True Proportional Odds\nOrdinal Model Score Residuals'
  title(ti)
  resid(f, 'score.binary', pl=TRUE)
  if(j==1) ti <- 'True Proportional Odds\nBinary Score Residuals'
  title(ti)
  resid(f, 'partial', pl=TRUE)
  if(j==1) ti <- 'True Proportional Odds\nPartial Residuals'
  title(ti)
}
par(mfrow=c(1,1))

# Shepherd-Li residuals from orm.  Thanks: Qi Liu
set.seed(3)
n  <- 100
x1 <- rnorm(n)
y  <- x1 + rnorm(n)
g  <- orm(y ~ x1, family=probit, x=TRUE, y=TRUE)
g.resid <- resid(g)
plot(x1, g.resid, cex=0.4); lines(lowess(x1, g.resid))
abline(h=0, col=2, lty=2)

set.seed(3)
n  <- 100
x1 <- rnorm(n)
y  <- x1 + x1^2 + rnorm(n)
# model misspecification, the square term is left out in the model
g  <- orm(y ~ x1, family=probit, x=TRUE, y=TRUE)
g.resid <- resid(g)
plot(x1, g.resid, cex=0.4); lines(lowess(x1, g.resid))
abline(h=0, col=2, lty=2)

## Not run: 
# Get data used in Hosmer et al. paper and reproduce their calculations
v <- Cs(id, low, age, lwt, race, smoke, ptl, ht, ui, ftv, bwt)
d <- read.table("http://www.umass.edu/statdata/statdata/data/lowbwt.dat",
                skip=6, col.names=v)
d <- upData(d, race=factor(race, 1:3, c('white','black','other')))
f <- lrm(low ~ age + lwt + race + smoke, data=d, x=TRUE, y=TRUE)
f
resid(f, 'gof')
# Their Table 7 Line 2 found sum of squared errors=36.91, expected
# value under H0=36.45, variance=.065, P=.071
# We got 36.90, 36.45, SD=.26055 (var=.068), P=.085
# Note that two logistic regression coefficients differed a bit
# from their Table 1
## End(Not run)
Computes various residuals and measures of influence for a
fit from ols
.
## S3 method for class 'ols'
residuals(object,
          type=c("ordinary", "score", "dfbeta", "dfbetas", "dffit",
                 "dffits", "hat", "hscore", "influence.measures",
                 "studentized"), ...)
object |
object created by |
type |
type of residual desired. |
... |
ignored |
a matrix or vector, with places for observations that were originally deleted by ols held by NAs
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
lm.influence
, ols
,
which.influence
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
x1[1] <- 100
y <- x1 + x2 + rnorm(100)
f <- ols(y ~ x1 + x2, x=TRUE, y=TRUE)
resid(f, "dfbetas")
which.influence(f)
i <- resid(f, 'influence.measures')   # dfbeta, dffit, etc.
Relative Explained Variation
rexVar(object, data, ns = 500, cint = 0.95)
object |
a fit from |
data |
a data frame, data table, or list providing the predictors used in the original fit |
ns |
maximum number of bootstrap repetitions or posterior draws to use |
cint |
confidence interval coverage probability for nonparametric bootstrap percentile intervals, or probability for a Bayesian highest posterior density interval for the relative explained variations. |
Computes measures of relative explained variation for each predictor in an rms
or rmsb
model fit object
. This is similar to plot(anova(fit), what='proportion R2')
. For an ols
model the result is exactly that. Uncertainty intervals are computed if the model fit is from rmsb
or was run through bootcov()
with coef.reps=TRUE
. The results may be printed, and there is also a plot
method.
When object is not an ols fit, the linear predictor from the fit in object is predicted from the original predictors, resulting in a linear model with R^2 = 1.0. The partial R^2 for each predictor from a new ols fit is the relative explained variation. The process is repeated when bootstrap coefficient repetitions or posterior draws are present, to get uncertainty intervals. So relative explained variation is the proportion of variation in the initial model's predicted values (on the linear predictor scale) that is due to each predictor.
Nonlinear and interaction terms are pooled with main linear effect of predictors, so relative explained variation for a predictor measures its total impact on predicted values, either as main effects or effect modifiers (interaction components).
a vector (if bootstrapping or Bayesian posterior sampling was not done) or a matrix otherwise, with rows corresponding to predictors and columns REV
, Lower
, Upper
. The returned object is of class rexVar
.
Frank Harrell
set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y <- x1 + x2 + rnorm(n) / 2.
d <- data.frame(x1, x2, x3, y)
dd <- datadist(d); options(datadist='dd')
f <- ols(y ~ pol(x1, 2) * pol(x2, 2) + x3, data=d, x=TRUE, y=TRUE)
plot(anova(f), what='proportion R2', pl=FALSE)
rexVar(f)
g <- bootcov(f, B=20, coef.reps=TRUE)
rexVar(g, data=d)
f <- orm(y ~ pol(x1,2) * pol(x2, 2) + x3, data=d, x=TRUE, y=TRUE)
rexVar(f, data=d)
g <- bootcov(f, B=20, coef.reps=TRUE)
rexVar(g, data=d)
## Not run: 
require(rmsb)
h <- blrm(y ~ pol(x1,2) * pol(x2, 2) + x3, data=d)
rexVar(h, data=d)
## End(Not run)
options(datadist=NULL)
This is a series of special transformation functions (asis, pol, lsp, rcs, catg, scored, strat, matrx), fitting functions (e.g., lrm, cph, psm, or ols), and generic analysis functions (anova.rms, summary.rms, Predict, plot.Predict, ggplot.Predict, survplot, fastbw, validate, calibrate, specs.rms, which.influence, latexrms, nomogram, datadist, gendata) that help automate many analysis steps, e.g., fitting restricted interactions and multiple stratification variables, analysis of variance (with tests of linearity of each factor and pooled tests), plotting effects of variables in the model, estimating and graphing effects of variables that appear nonlinearly in the model using e.g. inter-quartile-range hazard ratios, bootstrapping model fits, and constructing nomograms for obtaining predictions manually.

Behind the scenes is the Design function, which stores extra attributes; Design() is not intended to be called by users. Design causes detailed design attributes and descriptions of the distribution of predictors to be stored in an attribute of the terms component called Design.
modelData is a replacement for model.frame.default that is much streamlined and prepares data for Design(). If a second formula is present, modelData ensures that missing data deletions are the same for both formulas, and produces a second model frame for formula2 as the data2 attribute of the main returned data frame.
modelData(data=environment(formula), formula, formula2=NULL,
          weights, subset, na.action=na.delete, dotexpand=TRUE,
          callenv=parent.frame(n=2))

Design(mf, formula=NULL, specials=NULL, allow.offset=TRUE, intercept=1)
# not to be called by the user; called by fitting routines

# dist <- datadist(x1,x2,sex,age,race,bp)
# or dist <- datadist(my.data.frame)
# Can omit call to datadist if not using summary.rms, Predict,
# survplot.rms, or if all variable settings are given to them
# options(datadist="dist")
# f <- fitting.function(formula = y ~ rcs(x1,4) + rcs(x2,5) + x1%ia%x2 +
#                       rcs(x1,4)%ia%rcs(x2,5) +
#                       strat(sex)*age + strat(race)*bp)
# See rms.trans for rcs, strat, etc.
# %ia% is restricted interaction - not doubly nonlinear
# for x1 by x2 this uses the simple product only, but pools x1*x2
# effect with nonlinear function for overall tests
# specs(f)
# anova(f)
# summary(f)
# fastbw(f)
# pred <- predict(f, newdata=expand.grid(x1=1:10,x2=3,sex="male",
#                                        age=50,race="black"))
# pred <- predict(f, newdata=gendata(f, x1=1:10, x2=3, sex="male"))
# This leaves unspecified variables set to reference values from datadist
# pred.combos <- gendata(f, nobs=10)
# Use X-windows to edit predictor settings
# predict(f, newdata=pred.combos)
# plot(Predict(f, x1))   # or ggplot(...)
# latex(f)
# nomogram(f)
data | a data frame or calling environment |
formula | model formula |
formula2 | an optional second model formula (see for example |
weights | a weight variable or expression |
subset | a subsetting expression evaluated in the calling frame or |
na.action | NA handling function, ideally one such as na.delete |
specials | a character vector specifying which function evaluations appearing in |
dotexpand | set to |
callenv | the parent frame that called the fitting function |
mf | a model frame |
allow.offset | set to |
intercept | 1 if an ordinary intercept is present, 0 otherwise |
a data frame augmented with additional information about the predictors and model formulation
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
rms.trans, rmsMisc, cph, lrm, ols, specs.rms, anova.rms, summary.rms, Predict, gendata, fastbw, predictrms, validate, calibrate, which.influence, latex, latexrms, model.frame.default, datadist, describe, nomogram, vif, dataRep
## Not run:
require(rms)
require(ggplot2)
require(survival)
dist <- datadist(data=2)
# can omit if not using summary, (gg)plot, survplot,
# or if specify all variable values to them. Can
# also defer. data=2: get distribution summaries
# for all variables in search position 2
# run datadist once, for all candidate variables
dist <- datadist(age,race,bp,sex,height)   # alternative
options(datadist="dist")
f <- cph(Surv(d.time, death) ~ rcs(age,4)*strat(race) +
         bp*strat(sex) + lsp(height,60), x=TRUE, y=TRUE)
anova(f)
anova(f, age, height)        # Joint test of 2 vars
fastbw(f)
summary(f, sex="female")     # Adjust sex to "female" when testing
                             # interacting factor bp
bplot(Predict(f, age, height))   # 3-D plot
ggplot(Predict(f, age=10:70, height=60))
latex(f)                     # LaTeX representation of fit

f <- lm(y ~ x)    # Can use with any fitting function that
                  # calls model.frame.default, e.g. lm, glm
specs.rms(f)      # Use .rms since class(f)="lm"
anova(f)          # Works since Varcov(f) (=Varcov.lm(f)) works
fastbw(f)
options(datadist=NULL)
f <- ols(y ~ x1*x2)   # Saves enough information to do fastbw, anova
anova(f)              # Will not do Predict since distributions
fastbw(f)             # of predictors not saved
plot(f, x1=seq(100,300,by=.5), x2=.5)
# all values defined - don't need datadist
dist <- datadist(x1,x2)   # Equivalent to datadist(f)
options(datadist="dist")
plot(f, x1, x2=.5)        # Now you can do plot, summary
plot(nomogram(f, interact=list(x2=c(.2,.7))))
## End(Not run)
This is a series of functions (asis, pol, lsp, rcs, catg, scored, strat, matrx, gTrans, and %ia%) that set up special attributes (such as knots and nonlinear term indicators) that are carried through to fits (using for example lrm, cph, ols, psm). anova.rms, summary.rms, Predict, survplot, fastbw, validate, specs, which.influence, nomogram and latex.rms use these attributes to automate certain analyses (e.g., automatic tests of linearity for each predictor are done by anova.rms). Many of the functions are called implicitly. Some S functions such as ns derive data-dependent transformations that are not always "remembered" when predicted values are later computed, so the predictions may be incorrect. The functions listed here solve that problem when used in the rms context.
asis is the identity transformation, pol is an ordinary (non-orthogonal) polynomial, and rcs is a linear tail-restricted cubic spline function (natural spline), for which the rcspline.eval function generates the design matrix; the presence of system option rcspc causes rcspline.eval to be invoked with pc=TRUE, and the presence of system option fractied causes this value to be passed to rcspline.eval as the fractied argument. catg is for a categorical variable, scored is for an ordered categorical variable, strat is for a stratification factor in a Cox model, matrx is for a matrix predictor, and %ia% represents restricted interactions in which products involving nonlinear effects on both variables are not included in the model. asis, catg, scored, matrx are seldom invoked explicitly by the user (usually only to specify label or name).
gTrans is a general multiple-parameter transformation function. It can be used to specify new polynomial bases, smooth relationships with a discontinuity at one or more values of x, or grouped categorical variables, e.g., a categorical variable with 5 levels where you want to combine two of the levels, spending only 3 degrees of freedom in all, but see plots of predicted values where the two combined categories are kept separate yet have equal effect estimates. The first argument to gTrans is a regular numeric, character, or factor variable. The next argument is a function that transforms a vector into a matrix. If the basis functions are to include a linear term it is up to the user to include the original x as one of the columns. Column names are assigned automatically, but any column names specified by the user will override the default names. If you want to signal which terms correspond to linear and which correspond to nonlinear effects for the purpose of running anova.rms, add an integer vector attribute nonlinear to the resulting matrix. This vector specifies the column numbers corresponding to nonlinear effects; the default is to assume a column is a linear effect. The parms attribute stored with a gTrans result is a character vector version of the function, so as not to waste space carrying along any environment information. If you will be using the latex method for typesetting the fitted model, you must also include a tex attribute in the produced matrix. This must be a function of a single character string argument (that will ultimately contain the name of the predictor in LaTeX notation) and must produce a vector of LaTeX character strings. See https://hbiostat.org/R/examples/gTrans/gTrans.html for several examples of the use of gTrans, including the use of nonlinear and tex.
A makepredictcall method is defined so that usage of the transformation functions outside of rms fitting functions will work for getting predicted values. Thanks to Terry Therneau for the code.
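For instance (a sketch, not one of the package's own examples; data are simulated for illustration), knots chosen by rcs inside a non-rms fit are carried into later predictions:

# Assumes rms is attached
require(rms)
set.seed(1)
d <- data.frame(x = runif(100), y = rnorm(100))
f <- glm(y ~ rcs(x, 4), data = d)                # ordinary glm fit
predict(f, newdata = data.frame(x = c(.2, .8)))  # knot locations are remembered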
In the list below, functions asis through gTrans can have arguments x, parms, label, name, except that parms does not apply to asis, matrx, strat.
asis(...)
matrx(...)
pol(...)
lsp(...)
rcs(...)
catg(...)
scored(...)
strat(...)
gTrans(...)
x1 %ia% x2
## S3 method for class 'rms'
makepredictcall(var, call)
... | The arguments ... above contain the following. |
x1, x2 | two continuous variables for which to form a non-doubly-nonlinear interaction |
var | a model term passed from a (usually non-rms) function |
call | call object for a model term |
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
rcspline.eval, rcspline.restate, rms, cph, lrm, ols, datadist, makepredictcall
## Not run:
options(knots=4, poly.degree=2)
# To get the old behavior of rcspline.eval knot placement (which didn't handle
# clumping at the lowest or highest value of the predictor very well):
# options(fractied = 1.0)   # see rcspline.eval for details
country <- factor(country.codes)
blood.pressure <- cbind(sbp=systolic.bp, dbp=diastolic.bp)
fit <- lrm(Y ~ sqrt(x1)*rcs(x2) + rcs(x3,c(5,10,15)) +
           lsp(x4,c(10,20)) + country + blood.pressure + poly(age,2))
# sqrt(x1) is an implicit asis variable, but limits of x1, not sqrt(x1)
# are used for later plotting and effect estimation
# x2 fitted with restricted cubic spline with 4 default knots
# x3 fitted with r.c.s. with 3 specified knots
# x4 fitted with linear spline with 2 specified knots
# country is an implied catg variable
# blood.pressure is an implied matrx variable
# since poly is not an rms function (pol is), it creates a
# matrx type variable with no automatic linearity testing
# or plotting
f1 <- lrm(y ~ rcs(x1) + rcs(x2) + rcs(x1) %ia% rcs(x2))
# %ia% restricts interactions. Here it removes terms nonlinear in
# both x1 and x2
f2 <- lrm(y ~ rcs(x1) + rcs(x2) + x1 %ia% rcs(x2))
# interaction linear in x1
f3 <- lrm(y ~ rcs(x1) + rcs(x2) + x1 %ia% x2)
# simple product interaction (doubly linear)
# Use x1 %ia% x2 instead of x1:x2 because x1 %ia% x2 triggers
# anova to pool x1*x2 term into x1 terms to test total effect
# of x1
#
# Examples of gTrans
#
# Linear relationship with a discontinuity at zero:
ldisc <- function(x) {z <- cbind(x == 0, x); attr(z, 'nonlinear') <- 1; z}
gTrans(x, ldisc)
# Duplicate pol(x, 2):
pol2 <- function(x) {z <- cbind(x, x^2); attr(z, 'nonlinear') <- 2; z}
gTrans(x, pol2)
# Linear spline with a knot at x=10 with the new slope taking effect
# until x=20 and the spline turning flat at that point but with a
# discontinuous vertical shift
# tex is only needed if you will be using latex(fit)
dspl <- function(x) {
  z <- cbind(x, pmax(pmin(x, 20) - 10, 0), x > 20)
  attr(z, 'nonlinear') <- 2:3
  attr(z, 'tex') <- function(x) sprintf(c('%s', '(\\min(%s, 20) - 10)_{+}',
                                          '[%s > 20]'), x)
  z
}
gTrans(x, dspl)
## End(Not run)
These functions are used internally by anova.rms, fastbw, etc., to retrieve various attributes of a design. These functions allow some fitting functions not in the rms series (e.g., lm, glm) to be used with rms.Design, fastbw, and similar functions.
For vcov, there are several methods. The method for orm fits is a bit different because the covariance matrix stored in the fit object only deals with the middle intercept; see the intercepts argument for more options. There is a method for lrm that also allows non-default intercept(s) to be selected (the default is the first).
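As a small illustration (a sketch, assuming rms is attached; data are simulated):

require(rms)
set.seed(1)
x <- rnorm(100); y <- x + rnorm(100)
f <- orm(y ~ x)
dim(vcov(f))                    # default intercepts='mid': middle intercept only
dim(vcov(f, intercepts='all'))  # covariance matrix for all intercepts and slopes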
The oos.loglik function for each type of model implemented computes the -2 log likelihood for out-of-sample data (i.e., data not necessarily used to fit the model) evaluated at the parameter estimates from a model fit. Vectors for the model's linear predictors and response variable must be given. oos.loglik is used primarily by bootcov.
The Getlim function retrieves distribution summaries from the fit or from a datadist object. It handles getting summaries from both sources to fill in characteristics for variables that were not defined during the model fit. Getlimi returns the summary for an individual model variable.
Mean is a generic function that creates an R function that calculates the expected value of the response variable given a fit from rms or rmsb.
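For example (a sketch, assuming rms is attached; data are simulated), the created function takes linear predictor values and returns estimated means:

require(rms)
set.seed(1)
x <- rnorm(200); y <- x + rnorm(200)
f <- orm(y ~ x)
m <- Mean(f)                                        # function of the linear predictor
m(predict(f, newdata=data.frame(x=c(-1, 0, 1))))    # estimated E(Y|X) at three x values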
The related.predictors function returns a list containing variable numbers that are directly or indirectly related to each predictor. The interactions.containing function returns indexes of interaction effects containing a given predictor. The param.order function returns a vector of logical indicators for whether parameters are associated with certain types of effects (nonlinear, interaction, nonlinear interaction). combineRelatedPredictors creates a list of inter-connected main effects and interactions for use with predictrms with type='ccterms' (useful for gIndex).
The Penalty.matrix function builds a default penalty matrix for non-intercept term(s) for use in penalized maximum likelihood estimation. The Penalty.setup function takes a constant or list describing penalty factors for each type of term in the model and generates the proper vector of penalty multipliers for the current model.
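A small sketch (assuming rms is attached and that, as in current rms fits, the design attributes and design matrix are available as f$Design and f$x):

require(rms)
set.seed(1)
d <- data.frame(x = rnorm(30),
                g = factor(sample(c('a','b','c'), 30, replace=TRUE)),
                y = rnorm(30))
f <- ols(y ~ x + g, data=d, x=TRUE)
Penalty.matrix(f$Design, f$x)   # diagonal entry for x; 2x2 block for the two g dummies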
logLik.rms returns the maximized log likelihood for the model, whereas AIC.rms returns the AIC. The latter function has an optional argument for computing AIC on a "chi-square" scale (model likelihood ratio chi-square minus twice the regression degrees of freedom). logLik.ols handles the case for ols, just by invoking logLik.lm in the stats package. logLik.Gls is also defined.
nobs.rms returns the number of observations used in the fit.
The lrtest function does likelihood ratio tests for two nested models, from fits that have stats components with "Model L.R." values. For models such as psm, survreg, ols, lm which have scale parameters, it is assumed that the scale parameter for the smaller model is fixed at the estimate from the larger model (see the example).

univarLR takes a multivariable model fit object from rms and re-fits a sequence of models containing one predictor at a time. It prints a table of likelihood ratio statistics from these fits.
The Newlabels function is used to override the variable labels in a fit object. Likewise, Newlevels can be used to create a new fit object with levels of categorical predictors changed. These two functions are especially useful when constructing nomograms.
rmsArgs handles ... arguments to functions such as Predict, summary.rms, and nomogram so that variables to vary may be specified without values (after an equals sign).
prModFit is the workhorse for the print methods for the highest-level rms model fitting functions, handling plain, html, and LaTeX printing; the latter two result in html or LaTeX code written to the console, ready for knitr. The work of printing summary statistics is done by prStats, which uses the Hmisc print.char.matrix function to print overall model statistics if options(prType=) was not set to "latex" or "html"; otherwise it generates customized LaTeX or html code. The LaTeX longtable and epic packages must be in effect to use LaTeX.
reListclean allows one to rename a subset of a named list, ignoring the previous names and not concatenating them as R does. It also removes NULL elements and (by default) elements that are NA, as when an optional named element is fetched that doesn't exist. It has an argument dec whose elements are correspondingly removed, then dec is appended to the result vector.
formatNP is a function to format a vector of numerics. If digits is specified, formatNP will make sure that the formatted representation has digits positions to the right of the decimal place. If lang="latex" it will translate any scientific notation to LaTeX math form; if lang="html" it will convert to html. If pvalue=TRUE, it will replace formatted values with "< 0.0001" (if digits=4).
latex.naprint.delete will, if appropriate, use LaTeX to draw a dot chart of the frequency of variable NAs related to model fits. html.naprint.delete does the same thing in the RStudio R markdown context, using Hmisc::dotchartp (which uses plotly) for drawing any needed dot chart.
removeFormulaTerms removes one or more terms from a model formula, using strictly character manipulation. This handles problems such as [.terms removing offset() if you subset on anything. The function can also be used to remove the dependent variable(s) from the formula.
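A small sketch (assuming rms is attached; the formula and variable names are illustrative):

require(rms)
form <- y ~ x1 + offset(log(t)) + x2
removeFormulaTerms(form, which='offset')         # drop the offset() term
removeFormulaTerms(form, delete.response=TRUE)   # drop the dependent variable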
## S3 method for class 'rms'
vcov(object, regcoef.only=TRUE, intercepts='all', ...)
## S3 method for class 'cph'
vcov(object, regcoef.only=TRUE, ...)
## S3 method for class 'Glm'
vcov(object, regcoef.only=TRUE, intercepts='all', ...)
## S3 method for class 'Gls'
vcov(object, intercepts='all', ...)
## S3 method for class 'lrm'
vcov(object, regcoef.only=TRUE, intercepts='all', ...)
## S3 method for class 'ols'
vcov(object, regcoef.only=TRUE, ...)
## S3 method for class 'orm'
vcov(object, regcoef.only=TRUE, intercepts='mid', ...)
## S3 method for class 'psm'
vcov(object, regcoef.only=TRUE, ...)

# Given Design attributes and number of intercepts, creates an R
# format assign list
DesignAssign(atr, non.slopes, Terms)

oos.loglik(fit, ...)
## S3 method for class 'ols'
oos.loglik(fit, lp, y, ...)
## S3 method for class 'lrm'
oos.loglik(fit, lp, y, ...)
## S3 method for class 'cph'
oos.loglik(fit, lp, y, ...)
## S3 method for class 'psm'
oos.loglik(fit, lp, y, ...)
## S3 method for class 'Glm'
oos.loglik(fit, lp, y, ...)

Getlim(at, allow.null=FALSE, need.all=TRUE)
Getlimi(name, Limval, need.all=TRUE)
related.predictors(at, type=c("all","direct"))
interactions.containing(at, pred)
combineRelatedPredictors(at)
param.order(at, term.order)
Penalty.matrix(at, X)
Penalty.setup(at, penalty)
## S3 method for class 'Gls'
logLik(object, ...)
## S3 method for class 'ols'
logLik(object, ...)
## S3 method for class 'rms'
logLik(object, ...)
## S3 method for class 'rms'
AIC(object, ..., k=2, type=c('loglik', 'chisq'))
## S3 method for class 'rms'
nobs(object, ...)
lrtest(fit1, fit2)
## S3 method for class 'lrtest'
print(x, ...)
univarLR(fit)
Newlabels(fit, ...)
Newlevels(fit, ...)
## S3 method for class 'rms'
Newlabels(fit, labels, ...)
## S3 method for class 'rms'
Newlevels(fit, levels, ...)
prModFit(x, title, w, digits=4, coefs=TRUE, footer=NULL,
         lines.page=40, long=TRUE, needspace, subtitle=NULL, ...)
prStats(labels, w, lang=c("plain", "latex", "html"))
reListclean(..., dec=NULL, na.rm=TRUE)
formatNP(x, digits=NULL, pvalue=FALSE, lang=c("plain", "latex", "html"))
## S3 method for class 'naprint.delete'
latex(object, file="", append=TRUE, ...)
## S3 method for class 'naprint.delete'
html(object, ...)
removeFormulaTerms(form, which=NULL, delete.response=FALSE)
fit | result of a fitting function |
object | result of a fitting function |
regcoef.only | For fits such as parametric survival models which have a final row and column of the covariance matrix for a non-regression parameter such as a log(scale) parameter, setting |
intercepts | set to |
at | |
pred | index of a predictor variable (main effect) |
fit1, fit2 | fit objects from |
lp | linear predictor vector for |
y | values of a new vector of responses passed to |
name | the name of a variable in the model |
Limval | an object returned by |
allow.null | prevents |
need.all | set to |
type | For |
term.order | 1 for all parameters, 2 for all parameters associated with either nonlinear or interaction effects, 3 for nonlinear effects (main or interaction), 4 for interaction effects, 5 for nonlinear interaction effects |
X | a design matrix, not including columns for intercepts |
penalty | a vector or list specifying penalty multipliers for types of model terms |
k | the multiplier of the degrees of freedom to be used in computing AIC; the default is 2 |
x | a result of |
labels | a character vector specifying new labels for variables in a fit. To give new labels for all variables, you can specify |
levels | a list of named vectors specifying new level labels for categorical predictors. This will override |
title | a single character string used to specify an overall title for the regression fit, which is printed first by |
w | For |
digits | number of digits to the right of the decimal point, for formatting numeric values in printed output |
coefs | specify |
footer | a character string to appear at the bottom of the regression model output |
file | name of file to which to write model output |
append | specify |
lang | specifies the typesetting language: plain text, LaTeX, or html |
lines.page | see |
long | set to |
needspace | optional character string to insert inside a LaTeX needspace macro call before the statistics table and before the coefficient matrix, to avoid bad page splits. This assumes the LaTeX needspace style is available. Example: |
subtitle | optional vector of character strings containing subtitles that will appear under |
dec | vector of decimal places used for rounding |
na.rm | set to |
pvalue | set to |
form | a formula object |
which | a vector of one or more character strings specifying the names of functions that are called from a formula, e.g., |
delete.response | set to |
atr, non.slopes, Terms | |
... | other arguments. For |
vcov returns a variance-covariance matrix. oos.loglik returns a scalar -2 log likelihood value. Getlim returns a list with components limits and values, either stored in fit or retrieved from the object created by datadist and pointed to in options(datadist=). related.predictors and combineRelatedPredictors return a list of vectors, and interactions.containing returns a vector. param.order returns a logical vector corresponding to non-strata terms in the model.

Penalty.matrix returns a symmetric matrix with dimension equal to the number of slopes in the model. For all but categorical predictor main effect elements, the matrix is diagonal with values equal to the variances of the columns of X. For segments corresponding to the c-1 dummy variables for c-category predictors, Penalty.matrix contains a (c-1) x (c-1) sub-matrix that is constructed so that a quadratic form with Penalty.matrix in the middle computes the sum of squared differences in parameter values about the mean, including a portion for the reference cell in which the parameter is by definition zero.

Newlabels returns a new fit object with the labels adjusted. reListclean returns a vector of named (by its arguments) elements. formatNP returns a character vector. removeFormulaTerms returns a formula object.
rms, fastbw, anova.rms, summary.lm, summary.glm, datadist, vif, bootcov, latex, latexTabular, latexSN, print.char.matrix
## Not run:
f <- psm(S ~ x1 + x2 + sex + race, dist='gau')
g <- psm(S ~ x1 + sex + race, dist='gau',
         fixed=list(scale=exp(f$parms)))
lrtest(f, g)

g <- Newlabels(f, c(x2='Label for x2'))
g <- Newlevels(g, list(sex=c('Male','Female'), race=c('B','W')))
nomogram(g)
## End(Not run)
rms is the package that goes along with the book Regression Modeling Strategies. rms does regression modeling, testing, estimation, validation, graphics, prediction, and typesetting by storing enhanced model design attributes in the fit. rms is a re-written version of the Design package that has improved graphics and duplicates very little code in the survival package.
The package is a collection of about 180 functions that assist and streamline modeling, especially for biostatistical and epidemiologic applications. It also contains functions for binary and ordinal logistic regression models and the Buckley-James multiple regression model for right-censored responses, and implements penalized maximum likelihood estimation for logistic and ordinary linear models. rms works with almost any regression model, but it was especially written to work with logistic regression, Cox regression, accelerated failure time models, ordinary linear models, the Buckley-James model, generalized least squares for longitudinal data (using the nlme package), generalized linear models, and quantile regression (using the quantreg package). rms requires the Hmisc package to be installed. Note that Hmisc has several functions useful for data analysis (especially data reduction and imputation).
Older references below pertaining to the Design package are relevant to rms.
To make use of automatic typesetting features you must
have LaTeX or one of its variants installed.
Some aspects of rms (e.g., latex
) will not work correctly if
options(contrasts=)
other than c("contr.treatment",
"contr.poly")
are used.
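If in doubt, set the default contrasts explicitly before fitting:

options(contrasts = c("contr.treatment", "contr.poly"))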
rms relies on a wealth of survival analysis functions written by Terry Therneau of Mayo Clinic. Front-ends have been written for several of Therneau's functions, and other functions have been slightly modified.
Ordinary linear regression models
Binary and ordinal logistic models (proportional odds and continuation ratio models, probit, log-log, complementary log-log including ordinal cumulative probability models for continuous Y, efficiently handling thousands of distinct Y values using full likelihood methods)
Bayesian binary and ordinal regression models, partial proportional odds model, and random effects
Cox model
Parametric survival models in the accelerated failure time class
Buckley-James least-squares linear regression model with possibly right-censored responses
Generalized linear model
Quantile regression
Generalized least squares
Bootstrap model validation to obtain unbiased estimates of model performance without requiring a separate validation sample
Automatic Wald tests of all effects in the model that are not parameterization-dependent (e.g., tests of nonlinearity of main effects when the variable does not interact with other variables, tests of nonlinearity of interaction effects, tests for whether a predictor is important, either as a main effect or as an effect modifier)
Graphical depictions of model estimates (effect plots, odds/hazard ratio plots, nomograms that allow model predictions to be obtained manually even when there are nonlinear effects and interactions in the model)
Various smoothed residual plots, including some new residual plots for verifying ordinal logistic model assumptions
Composing S functions to evaluate the linear predictor (X*beta), hazard function, survival function, and quantile functions analytically from the fitted model
Typesetting of fitted model using LaTeX
Robust covariance matrix estimation (Huber or bootstrap)
Cubic regression splines with linear tail restrictions (natural splines)
Tensor splines
Interactions restricted to not be doubly nonlinear
Penalized maximum likelihood estimation for ordinary linear regression and logistic regression models. Different parts of the model may be penalized by different amounts, e.g., you may want to penalize interaction or nonlinear effects more than main effects or linear effects
Estimation of hazard or odds ratios in presence of nonlinearity and interaction
Sensitivity analysis for an unmeasured binary confounder in a binary logistic model
rms was motivated by the following needs:
need to automatically print interesting Wald tests that can be constructed from the design
tests of linearity with respect to each predictor
tests of linearity of interactions
pooled interaction tests (e.g., all interactions involving race)
pooled tests of effects with higher order effects
test of main effect not meaningful when effect in interaction
pooled test of main effect + interaction effect is meaningful
test of 2nd-order interaction + any 3rd-order interaction containing those factors is meaningful
need to store transformation parameters with the fit
example: knot locations for spline functions
these are "remembered" when getting predictions, unlike standard S or R
for categorical predictors, save levels so that same dummy variables will be generated for predictions; check that all levels in out-of-data predictions were present when model was fitted
need for uniform re-insertion of observations deleted because of NAs
when using predict
without newdata
or when using
resid
need to easily plot the regression effect of any predictor
example: age is represented by a linear spline with knots at 40 and 60y; plot effect of age on log odds of disease, adjusting interacting factors to easily specified constants
vary 2 predictors: plot x1 on x-axis, separate curves for discrete x2 or 3d perspective plot for continuous x2
if predictor is represented as a function in the model, plots
should be with respect to the original variable:
f <- lrm(y ~ log(cholesterol)+age)
plot(Predict(f, cholesterol)) # cholesterol on x-axis, default range
ggplot(Predict(f, cholesterol)) # same using ggplot2
plotp(Predict(f, cholesterol)) # same directly using plotly
need to store summary of distribution of predictors with the fit
plotting limits (default: 10th smallest, 10th largest values or %-tiles)
effect limits (default: .25 and .75 quantiles for continuous vars.)
adjustment values for other predictors (default: median for continuous predictors, most frequent level for categorical ones)
discrete numeric predictors: list of possible values; example: x=0,1,2,3,5 -> by default don't plot prediction at x=4
values are on the inner-most variable, e.g. cholesterol, not log(chol.)
allows estimation/plotting long after original dataset has been deleted
for Cox models, underlying survival also stored with fit, so original data not needed to obtain predicted survival curves
need to automatically print estimates of effects in presence of nonlinearity and interaction
example: age is quadratic, interacting with sex; default effect is inter-quartile-range hazard ratio (for Cox model), for sex=reference level
user-controlled effects: summary(fit, age=c(30,50),
sex="female")
-> odds ratios for logistic model, relative survival time
for accelerated failure time survival models
effects for all variables (e.g. odds ratios) may be plotted with multiple-confidence-level bars
need for prettier and more concise effect names in printouts, especially for expanded nonlinear terms and interaction terms
use inner-most variable name to identify predictors
e.g. for pmin(x^2-3,10)
refer to factor with legal S-name
x
need to recognize that an intercept is not always a simple concept
some models (e.g., Cox) have no intercept
some models (e.g., ordinal logistic) have multiple intercepts
need for automatic high-quality printing of fitted mathematical
model (with dummy variables defined, regression spline terms
simplified, interactions "factored"). Focus is on regression splines
instead of nonparametric smoothers or smoothing splines, so that
explicit formulas for fit may be obtained for use outside S.
rms can also compose S functions to evaluate X*beta from the fitted model analytically, as well as compose SAS code to do this.
need for automatic drawing of nomogram to represent the fitted model
need for automatic bootstrap validation of a fitted model, with only one S command (with respect to calibration and discrimination)
need for robust (Huber sandwich) estimator of covariance matrix, and be able to do all other analysis (e.g., plots, C.L.) using the adjusted covariances
need for robust (bootstrap) estimator of covariance matrix, easily used in other analyses without change
need for Huber sandwich and bootstrap covariance matrices adjusted for cluster sampling
need for routine reporting of how many observations were deleted
by missing values on each predictor (see na.delete
in Hmisc)
need for optional reporting of descriptive statistics for Y stratified by missing status of each X (see na.detail.response)
need for pretty, annotated survival curves, using the same commands for parametric and Cox models
need for ordinal logistic model (proportional odds model, continuation ratio model)
need for estimating and testing general contrasts without having to be conscious of variable coding or parameter order
rms will work with a wide variety of fitting functions, but it is meant especially for the following:
Function | Purpose | Related S Functions
ols | Ordinary least squares linear model | lm
lrm | Binary and ordinal logistic regression model | glm, cr.setup
orm | Ordinal regression model | lrm
blrm | Bayesian binary and ordinal regression |
psm | Accelerated failure time parametric survival model | survreg
cph | Cox proportional hazards regression | coxph
npsurv | Nonparametric survival estimates | survfit.formula
bj | Buckley-James censored least squares linear model | survreg
Glm | Version of glm for use with rms | glm
Gls | Version of gls for use with rms | gls
Rq | Version of rq for use with rms | rq
The following generic functions work with fits with rms in effect:
Function | Purpose | Related Functions
print | Print parameters and statistics of fit |
coef | Fitted regression coefficients |
formula | Formula used in the fit |
specs | Detailed specifications of fit |
robcov | Robust covariance matrix estimates |
bootcov | Bootstrap covariance matrix estimates |
summary | Summary of effects of predictors |
plot.summary | Plot continuously shaded confidence bars for results of summary |
anova | Wald tests of most meaningful hypotheses |
contrast | General contrasts, C.L., tests |
plot.anova | Depict results of anova graphically | dotchart
Predict | Partial predictor effects | predict
plot.Predict | Plot predictor effects using lattice graphics | predict
ggplot | Similar to above but using ggplot2 |
plotp | Similar to above but using plotly |
bplot | 3-D plot of effects of varying two continuous predictors | image, persp, contour
gendata | Generate data frame with predictor combinations (optionally interactively) | expand.grid
predict | Obtain predicted values or design matrix |
fastbw | Fast backward step-down variable selection | step
residuals (or resid) | Residuals, influence statistics from fit |
which.influence | Which observations are overly influential | residuals
sensuc | Sensitivity of one binary predictor in lrm and cph models to an unmeasured binary confounder |
latex | LaTeX representation of fitted model or anova or summary table |
Function | S function analytic representation of a fitted regression model (X*beta) | Function.transcan
hazard | S function analytic representation of a fitted hazard function (for psm) | rcspline.restate
Survival | S function analytic representation of fitted survival function (for psm, cph) |
Quantile | S function analytic representation of fitted function for quantiles of survival time (for psm, cph) |
nomogram | Draws a nomogram for the fitted model | latex, plot, ggplot, plotp
survest | Estimate survival probabilities (for psm, cph) | survfit
survplot | Plot survival curves (psm, cph, npsurv) | plot.survfit
validate | Validate indexes of model fit using resampling | val.prob
calibrate | Estimate calibration curve for model using resampling |
vif | Variance inflation factors for a fit |
naresid | Bring elements corresponding to missing data back into predictions and residuals |
naprint | Print summary of missing values |
pentrace | Find optimum penalty for penalized MLE |
effective.df | Print effective d.f. for each type of variable in model, for penalized fit or pentrace result |
rm.impute | Impute repeated measures data with non-random dropout (experimental, non-functional) | transcan, fit.mult.impute
The following programs demonstrate how the pieces of the rms package work together. A (usually) one-time call to the function datadist requires a pass at the entire data frame to store distribution summaries for potential predictor variables. These summaries contain (by default) the .25 and .75 quantiles of continuous variables (for estimating effects such as odds ratios), the 10th smallest and 10th largest values (or the .1 and .9 quantiles for small n) for plotting ranges for estimated curves, and the total range. For discrete numeric variables (those having only a few unique values), the list of unique values is also stored. Such summaries are used by the summary.rms, Predict, and nomogram.rms functions. You may save time and defer running datadist. In that case, the distribution summary is not stored with the fit object, but it can be gathered before running summary, plot, ggplot, or plotp.
d <- datadist(my.data.frame) # or datadist(x1,x2)
options(datadist="d") # omit this or use options(datadist=NULL)
# if not run datadist yet
f <- ols(y ~ x1 * x2)
anova(f)
fastbw(f)
Predict(f, x2)
predict(f, newdata)
In the Examples section there are three detailed examples using a
fitting function
designed to be used with rms, lrm
(logistic
regression model). In Detailed Example 1 we
create 3 predictor variables and two binary responses for 500 subjects. For the first binary response, dz,
the true model involves only sex
and age
, and there is
a nonlinear interaction between the two because the log
odds is a truncated linear relationship in age
for
females and a quadratic function for males. For the
second binary outcome, dz.bp
, the true population model
also involves systolic blood pressure (sys.bp
) through
a truncated linear relationship. First, nonparametric
estimation of relationships is done using the Hmisc
package's plsmo
function which uses lowess
with outlier
detection turned off for binary responses. Then
parametric modeling is done using restricted cubic
splines. This modeling does not assume that we know
the true transformations for age
or sys.bp
but that
these transformations are smooth (which is not actually
the case in the population).
For Detailed Example 2, suppose that a
categorical variable treat has values "a", "b"
, and
"c"
, an ordinal variable num.diseases
has values
0,1,2,3,4, and that there are two continuous variables,
age
and cholesterol
. age
is fitted with a restricted
cubic spline, while cholesterol
is transformed using
the transformation log(cholesterol - 10)
. Cholesterol
is missing on three subjects, and we impute these using
the overall median cholesterol. We wish to allow for
interaction between treat
and cholesterol
. The
following S program will fit a logistic model,
test all effects in the design, estimate effects, and
plot estimated transformations. The fit for
num.diseases
really considers the variable to be a
5-level categorical variable. The only difference is
that a 3 d.f. test of linearity is done to assess
whether the variable can be re-modeled "asis". Here
we also show statements to attach the rms package
and store predictor characteristics from datadist.
Detailed Example 3 shows some of the survival
analysis capabilities of rms related to the Cox
proportional hazards model. We simulate data for 2000
subjects with 2 predictors, age
and sex
. In the true
population model, the log hazard function is linear in
age
and there is no age by sex interaction. In the
analysis below we do not make use of the linearity in
age. rms makes use of many of Terry Therneau's
survival functions that are built in to S.
The following is a typical sequence of steps that
would be used with rms in conjunction with the Hmisc
transcan
function to do single imputation of all NAs in the
predictors (multiple imputation would be better but would be
harder to do in the context of bootstrap model validation),
fit a model, do backward stepdown to reduce the number of
predictors in the model (with all the severe problems this can
entail), and use the bootstrap to validate this stepwise model,
repeating the variable selection for each re-sample. Here we
take a short cut as the imputation is not repeated within the
bootstrap.
In what follows we (atypically) have only 3 candidate predictors. In practice be sure to have the validate and calibrate functions operate on a model fit that contains all predictors that were involved in previous analyses that used the response variable. Here the imputation is necessary because backward stepdown would otherwise delete observations missing on any candidate variable.
Note that you would have to define x1, x2, x3, y
to run
the following code.
xt <- transcan(~ x1 + x2 + x3, imputed=TRUE)
impute(xt) # imputes any NAs in x1, x2, x3
# Now fit original full model on filled-in data
f <- lrm(y ~ x1 + rcs(x2,4) + x3, x=TRUE, y=TRUE) #x,y allow boot.
fastbw(f)
# derives stepdown model (using default stopping rule)
validate(f, B=100, bw=TRUE) # repeats fastbw 100 times
cal <- calibrate(f, B=100, bw=TRUE) # also repeats fastbw
plot(cal)
Don't have a formula like y ~ age + age^2
.
In S you need to connect related variables using
a function which produces a matrix, such as pol
or
rcs
.
This allows effect estimates (e.g., hazard ratios)
to be computed as well as multiple d.f. tests of
association.
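For example (a sketch; the variable names are illustrative):

# Connect the linear and quadratic terms through pol() so anova can pool them
f <- lrm(y ~ pol(age, 2) + sex)
anova(f)    # includes a 2 d.f. test of association for age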
Don't use poly
or strata
inside formulas used in
rms. Use pol
and strat
instead.
Almost never code your own dummy variables or
interaction variables in S. Let S do this
automatically. Otherwise, anova
can't do its
job.
Almost never transform predictors outside of
the model formula, as then plots of predicted
values vs. predictor values, and other displays,
would not be made on the original scale. Use
instead something like y ~ log(cell.count+1)
,
which will allow cell.count
to appear on x-axes. You can get fancier, e.g.,
y ~ rcs(log(cell.count+1),4)
to fit a restricted
cubic spline with 4 knots in log(cell.count+1)
.
For more complex transformations do something
like
f <- function(x) {
  # ... various 'if' statements, etc.
  log(pmin(x, 50000) + 1)
}
fit1 <- lrm(death ~ f(cell.count))
fit2 <- lrm(death ~ rcs(f(cell.count), 4))
Don't put $
inside variable names used in formulas.
Either attach data frames or use data=
.
Don't forget to use datadist
. Try to use it
at the top of your program so that all model fits
can automatically take advantage of its
distributional summaries for the predictors.
Don't validate
or calibrate
models which were
reduced by dropping "insignificant" predictors.
Proper bootstrap or cross-validation must repeat
any variable selection steps for each re-sample.
Therefore, validate
or calibrate
models
which contain all candidate predictors, and if
you must reduce models, specify the option
bw=TRUE
to validate
or calibrate
.
Dropping of "insignificant" predictors ruins much of the usual statistical inference for regression models (confidence limits, standard errors, P-values, chi-square statistics, ordinary indexes of model performance) and it also results in models which will have worse predictive discrimination.

Use require(rms).
Spline fits
Spanos A, Harrell FE, Durack DT (1989): Differential diagnosis of acute meningitis: An analysis of the predictive value of initial observations. JAMA 2700-2707.
Ohman EM, Armstrong PW, Christenson RH, et al. (1996): Cardiac troponin T levels for risk stratification in acute myocardial ischemia. New Eng J Med 335:1333-1341.
Bootstrap calibration curve for a parametric survival model:
Knaus WA, Harrell FE, Fisher CJ, Wagner DP, et al. (1993): The clinical evaluation of new drugs for sepsis: A prospective study design based on survival analysis. JAMA 270:1233-1241.
Splines, interactions with splines, algebraic form of
fitted model from latex.rms
Knaus WA, Harrell FE, Lynn J, et al. (1995): The SUPPORT prognostic model: Objective estimates of survival for seriously ill hospitalized adults. Annals of Internal Medicine 122:191-203.
Splines, odds ratio chart from fitted model with
nonlinear and interaction terms, use of transcan
for
imputation
Lee KL, Woodlief LH, Topol EJ, Weaver WD, Betriu A. Col J, Simoons M, Aylward P, Van de Werf F, Califf RM. Predictors of 30-day mortality in the era of reperfusion for acute myocardial infarction: results from an international trial of 41,021 patients. Circulation 1995;91:1659-1668.
Splines, external validation of logistic models, prediction rules using point tables
Steyerberg EW, Hargrove YV, et al (2001): Residual mass histology in testicular cancer: development and validation of a clinical prediction rule. Stat in Med 2001;20:3847-3859.
van Gorp MJ, Steyerberg EW, et al (2003): Clinical prediction rule for 30-day mortality in Bjork-Shiley convexo-concave valve replacement. J Clinical Epidemiology 2003;56:1006-1012.
Model fitting, bootstrap validation, missing value imputation
Krijnen P, van Jaarsveld BC, Steyerberg EW, Man in 't Veld AJ, Schalekamp, MADH, Habbema JDF (1998): A clinical prediction rule for renal artery stenosis. Annals of Internal Medicine 129:705-711.
Model fitting, splines, bootstrap validation, nomograms
Kattan MW, Eastham JA, Stapleton AMF, Wheeler TM, Scardino PT. A preoperative nomogram for disease recurrence following radical prostatectomy for prostate cancer. J Natl Ca Inst 1998; 90(10):766-771.
Kattan, MW, Wheeler TM, Scardino PT. A postoperative nomogram for disease recurrence following radical prostatectomy for prostate cancer. J Clin Oncol 1999; 17(5):1499-1507
Kattan MW, Zelefsky MJ, Kupelian PA, Scardino PT, Fuks Z, Leibel SA. A pretreatment nomogram for predicting the outcome of three-dimensional conformal radiotherapy in prostate cancer. J Clin Oncol 2000; 18(19):3252-3259.
Eastham JA, May R, Robertson JL, Sartor O, Kattan MW. Development of a nomogram which predicts the probability of a positive prostate biopsy in men with an abnormal digital rectal examination and a prostate specific antigen between 0 and 4 ng/ml. Urology. (In press).
Kattan MW, Heller G, Brennan MF. A competing-risk nomogram for sarcoma-specific death following local recurrence. Stat in Med 2003; 22; 3515-3525.
Penalized maximum likelihood estimation, regression splines, web site to get predicted values
Smits M, Dippel DWJ, Steyerberg EW, et al. Predicting intracranial traumatic findings on computed tomography in patients with minor head injury: The CHIP prediction rule. Ann Int Med 2007; 146:397-405.
Nomogram with 2- and 5-year survival probability and median survival time (but watch out for the use of univariable screening)
Clark TG, Stewart ME, Altman DG, Smyth JF. A prognostic model for ovarian cancer. Br J Cancer 2001; 85:944-52.
Comprehensive example of parametric survival modeling with an extensive nomogram, time ratio chart, anova chart, survival curves generated using survplot, bootstrap calibration curve
Teno JM, Harrell FE, Knaus WA, et al. Prediction of survival for older hospitalized patients: The HELP survival model. J Am Geriatrics Soc 2000; 48: S16-S24.
Model fitting, imputation, and several nomograms expressed in tabular form
Hasdai D, Holmes DR, et al. Cardiogenic shock complicating acute myocardial infarction: Predictors of death. Am Heart J 1999; 138:21-31.
Ordinal logistic model with bootstrap calibration plot
Wu AW, Yasui U, Alzola CF et al. Predicting functional status outcomes in hospitalized patients aged 80 years and older. J Am Geriatric Society 2000; 48:S6-S15.
Propensity modeling in evaluating medical diagnosis, anova dot chart
Weiss JP, Gruver C, et al. Ordering an echocardiogram for evaluation of left ventricular function: Level of expertise necessary for efficient use. J Am Soc Echocardiography 2000; 13:124-130.
Simulations using rms to study the properties of various modeling strategies
Steyerberg EW, Eijkemans MJC, Habbema JDF. Stepwise selection in small data sets: A simulation study of bias in logistic regression analysis. J Clin Epi 1999; 52:935-942.
Steyerberg WE, Eijekans MJC, Harrell FE, Habbema JDF. Prognostic modeling with logistic regression analysis: In search of a sensible strategy in small data sets. Med Decision Making 2001; 21:45-56.
Statistical methods and references related to rms, along with case studies which includes the rms code which produced the analyses
Harrell FE, Lee KL, Mark DB (1996): Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat in Med 15:361-387.
Harrell FE, Margolis PA, Gove S, Mason KE, Mulholland EK et al. (1998): Development of a clinical prediction model for an ordinal outcome: The World Health Organization ARI Multicentre Study of clinical signs and etiologic agents of pneumonia, sepsis, and meningitis in young infants. Stat in Med 17:909-944.
Bender R, Benner A (2000): Calculating ordinal regression models in SAS and S-Plus. Biometrical J 42:677-699.
The author is willing to help with problems. Send E-mail to [email protected]. To report bugs, please do the following:

If the bug occurs when running a function on a fit object (e.g., anova), attach a dump'd text version of the fit object to your note. If you used datadist but not until after the fit was created, also send the object created by datadist. Example: save(myfit, file="/tmp/myfit.rda") will create an R binary save file that can be attached to the E-mail.

If the bug occurs during a model fit (e.g., with lrm, ols, psm, cph), send the statement causing the error with a save'd version of the data frame used in the fit. If this data frame is very large, reduce it to a small subset that still causes the error.
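For example, a complete bug report for a post-fit failure might package the fit, the datadist object, and a minimal data subset in one file (a sketch; the data, object names, and path below are illustrative):

set.seed(1)
age <- rnorm(100, 50, 10)
sex <- factor(sample(c('f','m'), 100, TRUE))
y   <- rbinom(100, 1, 0.4)
dd  <- datadist(age, sex); options(datadist='dd')
f   <- lrm(y ~ rcs(age, 4) + sex, x=TRUE, y=TRUE)  # fit that triggers the bug
d   <- data.frame(age, sex, y)                     # minimal data subset
save(f, dd, d, file='/tmp/bugreport.rda')          # attach this file to the E-mail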
GENERAL DISCLAIMER This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. In short: you may use this code any way you like, as long as you don't charge money for it, remove this notice, or hold anyone liable for its results. Also, please acknowledge the source and communicate changes to the author.
If this software is used in work presented for publication, kindly reference it, for example: Harrell FE (2009): rms: S functions for biostatistical/epidemiologic modeling, testing, estimation, validation, graphics, and prediction. Programs available from https://hbiostat.org/R/rms/. Be sure to reference other packages used as well as R itself.
Frank E Harrell Jr
Professor of Biostatistics
Vanderbilt University School of Medicine
Nashville, Tennessee
[email protected]
The primary resource for the rms package is Regression Modeling Strategies, second edition by FE Harrell (Springer-Verlag, 2015) and the web page https://hbiostat.org/R/rms/. See also the Statistics in Medicine articles by Harrell et al. listed below for case studies of modeling and model validation using rms.
Several datasets useful for multivariable modeling with rms are found at https://hbiostat.org/data/.
## To run several comprehensive examples, run the following command
## Not run:
demo(all, 'rms')
## End(Not run)
Uses the Huber-White method to adjust the variance-covariance matrix of
a fit from maximum likelihood or least squares, to correct for
heteroscedasticity and for correlated responses from cluster samples.
The method uses the ordinary estimates of regression coefficients and
other parameters of the model, but involves correcting the covariance
matrix for model misspecification and sampling design.
Models currently implemented are models that have a residuals(fit, type="score") function implemented, such as lrm, cph, coxph, and ordinary linear models (ols). The fit must have specified the x=TRUE and y=TRUE options for certain models.
Observations in different clusters are assumed to be independent.
For the special case where every cluster contains one observation, the
corrected covariance matrix returned is the "sandwich" estimator
(see Lin and Wei). This is a consistent estimate of the covariance matrix
even if the model is misspecified (e.g. heteroscedasticity, underdispersion,
wrong covariate form).
For the special case of ols fits, robcov can compute the improved (especially for small samples) Efron estimator that adjusts for natural heterogeneity of residuals (see Long and Ervin (2000) estimator HC3).
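A minimal sketch contrasting the two method choices on a small heteroscedastic ols fit (simulated data; variable names are illustrative):

set.seed(2)
x <- runif(50)
y <- x + (0.5 + x) * rnorm(50)           # heteroscedastic errors
f <- ols(y ~ x, x=TRUE, y=TRUE)
diag(vcov(robcov(f)))                    # Huber-White sandwich variances
diag(vcov(robcov(f, method='efron')))    # Efron (HC3-type) variances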
robcov(fit, cluster, method=c('huber','efron'))
fit: a fit object from the rms series
cluster: a variable indicating groupings
method: can be set to "efron" for ols fits; the default is the Huber-White estimator
a new fit object with the same class as the original fit, and with the element orig.var added. orig.var is the covariance matrix of the original fit. Also, the original var component is replaced with the new Huberized estimates. A component clusterInfo is added to contain elements name and n holding the name of the cluster variable and the number of clusters.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
Huber, PJ. Proc Fifth Berkeley Symposium Math Stat 1:221–33, 1967.
White, H. Econometrica 50:1–25, 1982.
Lin, DY, Wei, LJ. JASA 84:1074–8, 1989.
Rogers, W. Stata Technical Bulletin STB-8, p. 15–17, 1992.
Rogers, W. Stata Release 3 Manual, deff, loneway, huber, hreg, hlogit functions.
Long, JS, Ervin, LH. The American Statistician 54:217–224, 2000.
bootcov, naresid, residuals.cph; http://gforge.se/gmisc interfaces rms to the sandwich package.
# In OLS test against more manual approach
set.seed(1)
n <- 15
x1 <- 1:n
x2 <- sample(1:n)
y <- round(x1 + x2 + 8*rnorm(n))
f <- ols(y ~ x1 + x2, x=TRUE, y=TRUE)
vcov(f)
vcov(robcov(f))
X <- f$x
G <- diag(resid(f)^2)
solve(t(X) %*% X) %*% (t(X) %*% G %*% X) %*% solve(t(X) %*% X)

# Duplicate data and adjust for intra-cluster correlation to see that
# the cluster sandwich estimator completely ignored the duplicates
x1 <- c(x1,x1)
x2 <- c(x2,x2)
y  <- c(y, y)
g <- ols(y ~ x1 + x2, x=TRUE, y=TRUE)
vcov(robcov(g, c(1:n, 1:n)))

# A dataset contains a variable number of observations per subject,
# and all observations are laid out in separate rows. The responses
# represent whether or not a given segment of the coronary arteries
# is occluded. Segments of arteries may not operate independently
# in the same patient. We assume a "working independence model" to
# get estimates of the coefficients, i.e., that estimates assuming
# independence are reasonably efficient. The job is then to get
# unbiased estimates of variances and covariances of these estimates.
n.subjects <- 30
ages  <- rnorm(n.subjects, 50, 15)
sexes <- factor(sample(c('female','male'), n.subjects, TRUE))
logit <- (ages-50)/5
prob  <- plogis(logit)   # true prob not related to sex
id <- sample(1:n.subjects, 300, TRUE)   # subjects sampled multiple times
table(table(id))   # frequencies of number of obs/subject
age <- ages[id]
sex <- sexes[id]
# In truth, observations within subject are independent:
y <- ifelse(runif(300) <= prob[id], 1, 0)
f <- lrm(y ~ lsp(age,50)*sex, x=TRUE, y=TRUE)
g <- robcov(f, id)
diag(g$var)/diag(f$var)
# add ,group=w to re-sample from within each level of w
anova(g)      # cluster-adjusted Wald statistics
# fastbw(g)   # cluster-adjusted backward elimination
plot(Predict(g, age=30:70, sex='female'))   # cluster-adjusted confidence bands
# or use ggplot(...)

# Get design effects based on inflation of the variances when compared
# with bootstrap estimates which ignore clustering
g2 <- robcov(f)
diag(g$var)/diag(g2$var)

# Get design effects based on pooled tests of factors in model
anova(g2)[,1] / anova(g)[,1]

# A dataset contains one observation per subject, but there may be
# heteroscedasticity or other model misspecification. Obtain
# the robust sandwich estimator of the covariance matrix.
# f <- ols(y ~ pol(age,3), x=TRUE, y=TRUE)
# f.adj <- robcov(f)
The Rq function is the rms front-end to the quantreg package's rq function. print and latex methods are also provided, and a fitting function RqFit is defined for use in bootstrapping, etc. Its result is a function definition.
For the print method, the format of output is controlled by the user previously running options(prType="lang"), where lang is "plain" (the default), "latex", or "html". For the latex method, html will actually be used if options(prType='html') is in effect. When using html with Quarto or RMarkdown, results='asis' need not be written in the chunk header.
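A brief sketch of fitting a median regression with Rq and controlling the print format (simulated data; names are illustrative):

set.seed(3)
x1 <- rnorm(50)
y  <- x1 + rnorm(50)
f  <- Rq(y ~ x1)            # tau=0.5: median regression
options(prType='html')      # or 'latex'; 'plain' is the default
f                           # print method renders per prType
options(prType='plain')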
Rq(formula, tau = 0.5, data=environment(formula), subset, weights,
   na.action=na.delete, method = "br", model = FALSE, contrasts = NULL,
   se = "nid", hs = TRUE, x = FALSE, y = FALSE, ...)

## S3 method for class 'Rq'
print(x, digits=4, coefs=TRUE, title, ...)

## S3 method for class 'Rq'
latex(object, file = '', append=FALSE, which, varnames, columns=65,
      inline=FALSE, caption=NULL, ...)

## S3 method for class 'Rq'
predict(object, ..., kint=1, se.fit=FALSE)

RqFit(fit, wallow=TRUE, passdots=FALSE)
formula: model formula
tau: the single quantile to estimate; unlike rq, only one quantile may be estimated per model fit
data, subset, weights, na.action, method, model, contrasts, se, hs: see rq
x: set to TRUE to store the design matrix with the fit
y: set to TRUE to store the response vector with the fit
...: other arguments passed to one of the rq fitting routines
digits: number of significant digits used in formatting results in print.Rq
coefs: specify coefs=FALSE to suppress printing the table of model coefficients, standard errors, etc.
title: a character string title to be passed to prModFit
object: an object created by Rq
file, append, which, varnames, columns, inline, caption: see latexrms
kint: ignored
se.fit: set to TRUE to obtain standard errors of predicted quantiles
fit: an object created by Rq
wallow: set to TRUE if weights are allowed in the current context
passdots: set to TRUE if ... may be passed to the fitter
Rq returns a list of class "rms", "lassorq" or "scadrq", "Rq", and "rq". RqFit returns a function definition. latex.Rq returns an object of class "latex".
The author and developer of methodology in the quantreg package is Roger Koenker.
Frank Harrell
## Not run:
set.seed(1)
n <- 100
x1 <- rnorm(n)
y <- exp(x1 + rnorm(n)/4)
dd <- datadist(x1); options(datadist='dd')
fq2 <- Rq(y ~ pol(x1,2))
anova(fq2)
fq3 <- Rq(y ~ pol(x1,2), tau=.75)
anova(fq3)
pq2 <- Predict(fq2, x1)
pq3 <- Predict(fq3, x1)
p <- rbind(Median=pq2, Q3=pq3)
plot(p, ~ x1 | .set.)
# For superpositioning, with true curves superimposed
a <- function(x, y, ...) {
  x <- unique(x)
  col <- trellis.par.get('superpose.line')$col
  llines(x, exp(x), col=col[1], lty=2)
  llines(x, exp(x + qnorm(.75)/4), col=col[2], lty=2)
}
plot(p, addpanel=a)
## End(Not run)
Performs an analysis of the sensitivity of a binary treatment (X) effect to an unmeasured binary confounder (U) for a fitted binary logistic or an unstratified non-time-dependent Cox survival model (the function works well for the former, not so well for the latter). This is done by fitting a sequence of models with separately created U variables added to the original model. The sequence of models is formed by simultaneously varying a and b, where a measures the association between U and X and b measures the association between U and Y, where Y is the outcome of interest. For Cox models, an approximate solution is used by letting Y represent some binary classification of the event/censoring time and the event indicator. For example, Y could be just the event indicator, ignoring time of the event or censoring, or it could be 1 if a subject failed before one year and 0 otherwise. When for each combination of a and b the vector of binary values U is generated, one of two methods is used to constrain the properties of U. With either method, the overall prevalence of U is constrained to be prev.u. With the default method (or.method="x:u y:u"), U is sampled so that the X:U odds ratio is a and the Y:U odds ratio is b. With the second method, U is sampled according to the model logit(Prob(U=1 | X=x, Y=y)) = alpha + beta*x + gamma*y, where beta = log(a), gamma = log(b), and alpha is determined so that the prevalence of U is prev.u. This second method results in the adjusted (conditional on X) Y:U odds ratio being b, whereas the default method forces the unconditional (marginal) Y:U odds ratio to be b. Rosenbaum uses the default method.
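As a rough illustration of the second method, alpha can be found numerically so that the average of the sampled probabilities equals prev.u. This sketch (a hypothetical helper, not part of rms) mirrors that logic with uniroot:

# Find alpha so that mean Pr(U=1 | X, Y) equals prev.u, with
# beta = log(a) and gamma = log(b) as described above
findAlpha <- function(x, y, a, b, prev.u) {
  beta <- log(a); gamma <- log(b)
  uniroot(function(alpha) mean(plogis(alpha + beta*x + gamma*y)) - prev.u,
          c(-20, 20))$root
}
set.seed(1)
x <- rbinom(200, 1, .5); y <- rbinom(200, 1, .5)
alpha <- findAlpha(x, y, a=3, b=3, prev.u=0.5)
u <- rbinom(200, 1, plogis(alpha + log(3)*x + log(3)*y))
mean(u)   # approximately 0.5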
There is a plot method for plotting objects created by sensuc. Values of a are placed on the x-axis and observed marginal odds or hazards ratios for U (unadjusted ratios) appear on the y-axis. For Cox models, the hazard ratios will not agree exactly with U:event indicator odds ratios, but they can sometimes be made close through judicious choice of the event function. The default plot uses four symbols which differentiate whether, for the a,b combination, the effect of X adjusted for U (and for any other covariables that were in the original model fit) is positive (usually meaning an effect ratio greater than 1) and "significant", merely positive, not positive and nonsignificant, or not positive but significant. There is also an option to draw the numeric value of the X effect ratio at the a,b combination, along with its Z statistic underneath in smaller letters, and an option to draw the effect ratio in one of four colors depending on the significance of the Z statistic.
# fit <- lrm(formula=y ~ x + other.predictors, x=TRUE, y=TRUE)   # or
# fit <- cph(formula=Surv(event.time,event.indicator) ~ x + other.predictors,
#            x=TRUE, y=TRUE)

sensuc(fit,
       or.xu=seq(1, 6, by = 0.5), or.u=or.xu,
       prev.u=0.5, constrain.binary.sample=TRUE,
       or.method=c("x:u y:u","u|x,y"),
       event=function(y) if(is.matrix(y)) y[,ncol(y)] else 1*y)

## S3 method for class 'sensuc'
plot(x, ylim=c((1+trunc(min(x$effect.u)-.01))/
               ifelse(type=='numbers',2,1),
               1+trunc(max(x$effect.u)-.01)),
     xlab='Odds Ratio for X:U',
     ylab=if(x$type=='lrm')'Odds Ratio for Y:U' else 'Hazard Ratio for Y:U',
     digits=2, cex.effect=.75, cex.z=.6*cex.effect,
     delta=diff(par('usr')[3:4])/40,
     type=c('symbols','numbers','colors'),
     pch=c(15,18,5,0), col=c(2,3,1,4), alpha=.05,
     impressive.effect=function(x)x > 1,...)
fit: result of lrm or cph with x=TRUE, y=TRUE
x: result of sensuc
or.xu: vector of possible odds ratios measuring the X:U association
or.u: vector of possible odds ratios measuring the Y:U association; default is or.xu
prev.u: desired prevalence of U (default 0.5)
constrain.binary.sample: by default, the binary U values are sampled so that the sample prevalence exactly matches prev.u; set to FALSE so that only the expected prevalence equals prev.u
or.method: see above
event: a function classifying the response variable into a binary event for the purposes of constraining the association between U and Y; the default (see usage above) uses the event indicator (the last column of a Surv object) or the binary response itself
ylim: y-axis limits for plot.sensuc
xlab: x-axis label
ylab: y-axis label
digits: number of digits to the right of the decimal point for drawing numbers on the plot, for type="numbers" or type="colors"
cex.effect: character size for drawing effect ratios
cex.z: character size for drawing Z statistics
delta: decrement in the y-coordinate used to draw Z values below effect ratios
type: specify "symbols" (the default), "numbers", or "colors" (see above)
pch: 4 plotting characters corresponding to positive and significant effects for X, positive and nonsignificant effects, nonpositive and nonsignificant effects, and nonpositive but significant effects
col: 4 colors as for pch
alpha: significance level (default 0.05)
impressive.effect: a function of the odds or hazard ratio for X returning TRUE for a "positive" effect; by default a positive effect is a ratio exceeding 1
...: optional arguments passed to plot
sensuc returns an object of class "sensuc" with the following elements: OR.xu (vector of desired X:U odds ratios, the a values), OOR.xu (observed marginal X:U odds ratios), OR.u (desired Y:U odds ratios, the b values), effect.x (adjusted odds or hazards ratios for X in a model adjusted for U and all of the other predictors), effect.u (unadjusted Y:U odds or hazards ratios), effect.u.adj (adjusted Y:U odds or hazards ratios), Z (Z statistics), prev.u (input to sensuc), cond.prev.u (matrix with one row per a,b combination, specifying prevalences of U conditional on X and Y combinations), and type ("lrm" or "cph").
Frank Harrell
Mark Conaway
Department of Biostatistics
Vanderbilt University School of Medicine
[email protected], [email protected]
Rosenbaum, Paul R (1995): Observational Studies. New York: Springer-Verlag.
Rosenbaum P, Rubin D (1983): Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. J Roy Statist Soc B 45:212–218.
Lee WC (2011): Bounding the bias of unmeasured factors with confounding and effect-modifying potentials. Stat in Med 30:1007-1017.
set.seed(17)
x <- sample(0:1, 500, TRUE)
y <- sample(0:1, 500, TRUE)
y[1:100] <- x[1:100]   # induce an association between x and y
x2 <- rnorm(500)
f <- lrm(y ~ x + x2, x=TRUE, y=TRUE)
# Note: in absence of U odds ratio for x is exp(2nd coefficient)
g <- sensuc(f, c(1,3))
# Note: If the generated sample of U was typical, the odds ratio for
# x dropped had U been known, where U had an odds ratio
# with x of 3 and an odds ratio with y of 3
plot(g)

# Fit a Cox model and check sensitivity to an unmeasured confounder
# require(survival)
# f <- cph(Surv(d.time,death) ~ treatment + pol(age,2)*sex, x=TRUE, y=TRUE)
# sensuc(f, event=function(y) y[,2] & y[,1] < 365.25)
# Event = failed, with event time before 1 year
# Note: Analysis uses f$y which is a 2-column Surv object
Depending on the prevailing options(showprogress=) and availability of the tcltk package, sets up a progress bar and creates a function for simple updating of the bar as iterations progress. Setting options(showprogress=FALSE) or options(showprogress='none') results in no progress being shown. Setting the option to something other than "tk" or "none" results in the console being used to show the current iteration number and intended number of iterations, the same as if tcltk is not installed. It is not recommended that the "tk" type be used for simulations requiring fewer than 10 seconds for more than 100 iterations, as the time required to update the pop-up window will be more than the time required to do the simulations. This problem can be solved by specifying, for example, every=10 to setPb or to the function created by setPb, or by using options(showevery=10) before setPb is called. If options(showprogress=) is not specified, progress is shown in the console with an iteration counter.
setPb(n, type = c("Monte Carlo Simulation", "Bootstrap",
                  "Cross-Validation"),
      label, usetk = TRUE, onlytk=FALSE, every=1)
n: maximum number of iterations
type: type of simulation. Used for the progress bar title if label is not given.
label: used to customize the bar label if present, overriding type
usetk: set to FALSE to force the console to be used instead of a tcltk pop-up window
onlytk: set to TRUE to suppress console output, using only a tk pop-up if available
every: update the progress indication only for multiples of every iterations
a function that should be called by the user once per iteration, specifying the iteration number as the sole argument
Frank Harrell
tkProgressBar, setTkProgressBar
## Not run:
options(showprogress=TRUE)   # same as ='tk'
pb <- setPb(1000)
for(i in 1:1000) {
  pb(i)   # pb(i, every=10) to only show for multiples of 10
  # your calculations
}
# Force rms functions that do simulations to not report progress
options(showprogress='none')
# For functions that do simulations to use the console instead of pop-up
# even when tcltk is installed
options(showprogress='console')
pb <- setPb(1000, label='Random Sampling')
## End(Not run)
Prints the design specifications, e.g., number of parameters for each factor, levels of categorical factors, knot locations in splines, pre-transformations, etc.
specs(fit, ...)
## S3 method for class 'rms'
specs(fit, long=FALSE, ...)
## S3 method for class 'specs.rms'
print(x, ...)
fit: a fit object created with the rms library in effect
x: an object returned by specs
long: if TRUE, causes the plotting and estimation limits for each factor to be printed
...: ignored
a list containing information about the fit and the predictors as elements
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
rms, rms.trans, latexrms, datadist
set.seed(1)
blood.pressure <- rnorm(200, 120, 15)
dd <- datadist(blood.pressure)
options(datadist='dd')
L <- .03*(blood.pressure-120)
sick <- ifelse(runif(200) <= plogis(L), 1, 0)
f <- lrm(sick ~ rcs(blood.pressure,5))
specs(f)   # find out where 5 knots are placed
g <- Glm(sick ~ rcs(blood.pressure,5), family=binomial)
specs(g, long=TRUE)
options(datadist=NULL)
summary.rms
forms a summary of the effects of each
factor. When summary
is used to estimate odds or hazard ratios for
continuous variables, it allows the levels of interacting factors to be
easily set, as well as allowing the user to choose the interval for the
effect. This method of estimating effects allows for nonlinearity in
the predictor. Factors requiring multiple parameters are handled, as
summary
obtains predicted values at the needed points and takes
differences. By default, inter-quartile range effects (odds ratios,
hazards ratios, etc.) are printed for continuous factors, and all
comparisons with the reference level are made for categorical factors.
print.summary.rms
prints the results, latex.summary.rms
and html.summary.rms
typeset the results, and plot.summary.rms
plots shaded confidence bars to display the results graphically.
The longest confidence bar on each page is labeled with confidence levels
(unless this bar has been ignored due to clip
). By default, the
following confidence levels are all shown: .9, .95, and .99, using
blue of different transparencies. The plot
method currently
ignores bootstrap and Bayesian highest posterior density intervals but approximates
intervals based on standard errors. The html
method is for use
with R Markdown using html.
The print
method will call the latex
or html
method
if options(prType=)
is set to "latex"
or "html"
.
For "latex"
printing through print()
, the LaTeX table
environment is turned off. When using html with Quarto or RMarkdown,
results='asis'
need not be written in the chunk header.
If usebootcoef=TRUE
and the fit was run through bootcov
,
the confidence intervals are bootstrap nonparametric percentile
confidence intervals, basic bootstrap, or BCa intervals, obtained on contrasts
evaluated on all bootstrap samples.
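For instance (a sketch; the model and B are illustrative, and the fit must store the design information), using simulated variables such as those in the example below:

f <- lrm(y ~ rcs(age, 4) + sex, x=TRUE, y=TRUE)
b <- bootcov(f, B=200)           # stores bootstrap coefficients
summary(b)                       # percentile bootstrap intervals (default)
summary(b, boot.type='bca')      # BCa intervals
summary(b, usebootcoef=FALSE)    # use bootstrap covariance matrix instead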
If options(grType='plotly')
is in effect and the plotly
package is installed, plot
is used instead of base graphics to
draw the point estimates and confidence limits when the plot
method for summary
is called. Colors and other graphical
arguments to plot.summary
are ignored in this case. Various
special effects are implemented such as only drawing 0.95 confidence
limits by default but including a legend that allows the other CLs to be
activated. Hovering over point estimates shows adjustment values if
there are any. nbar is not implemented for plotly.
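For example, a hedged sketch (fit as in the examples below):

options(grType='plotly')   # requires the plotly package
plot(summary(fit))         # interactive chart; legend toggles other CLs
options(grType=NULL)       # revert to base graphics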
## S3 method for class 'rms'
summary(object, ..., ycut=NULL, est.all=TRUE, antilog, conf.int=.95,
        abbrev=FALSE, vnames=c("names","labels"),
        conf.type=c('individual','simultaneous'),
        usebootcoef=TRUE, boot.type=c("percentile","bca","basic"),
        posterior.summary=c('mean', 'median', 'mode'), verbose=FALSE)

## S3 method for class 'summary.rms'
print(x, ..., table.env=FALSE)

## S3 method for class 'summary.rms'
latex(object, title, table.env=TRUE, ...)

## S3 method for class 'summary.rms'
html(object, digits=4, dec=NULL, ...)

## S3 method for class 'summary.rms'
plot(x, at, log=FALSE, q=c(0.9, 0.95, 0.99), xlim, nbar, cex=1,
     nint=10, cex.main=1, clip=c(-1e30,1e30), main,
     col=rgb(red=.1, green=.1, blue=.8, alpha=c(.1,.4,.7)),
     col.points=rgb(red=.1, green=.1, blue=.8, alpha=1), pch=17,
     lwd=if(length(q) == 1) 3 else 2 : (length(q) + 1),
     digits=4, declim=4, ...)
object: a rms fit object; for summary, options(datadist) should have been set up before the fit, or datadist run afterward with the predictor settings supplied. For latex and html, object is the result of summary.
...: for summary, variables to estimate effects for (defaults to all predictors with default ranges); give a single value such as age=50 to set an adjustment value, or two values such as age=c(40,60) to set the interval for the effect. Also represents other arguments to pass to latex.
ycut: must be specified if the fit is a partial proportional odds model. Specifies the single value of the response variable used to estimate ycut-specific regression effects, e.g., odds ratios.
est.all: set to FALSE to only estimate effects of variables listed. Default is TRUE.
antilog: set to FALSE to suppress printing of anti-logged effect ratios for models with log or logit links
conf.int: defaults to .95 for 95% confidence intervals of effects
abbrev: set to TRUE to use the abbreviate function to shorten levels of categorical predictors
vnames: set to "labels" to use variable labels rather than variable names to label effects
conf.type: the default type of confidence interval computed for a given individual (1 d.f.) contrast is a pointwise confidence interval. Set conf.type="simultaneous" to use the multcomp package to compute confidence intervals with simultaneous (family-wise) coverage.
usebootcoef: if the fit was run through bootcov, the default is to compute confidence limits from the stored bootstrap coefficients; specify usebootcoef=FALSE to instead use the bootstrap covariance matrix
boot.type: set to "bca" to compute BCa confidence limits or "basic" to use the basic bootstrap; the default is percentile intervals
posterior.summary: set to "median" or "mode" to use those summaries of the posterior draws instead of the posterior mean, for Bayesian fits
verbose: set to TRUE to print additional information
x: result of summary
title: title to pass to latex; the default is constructed from the name of the fit object
table.env: see latex
digits, dec: for html.summary.rms; digits is the number of significant digits, and dec, if given, rounds to that many places after the decimal point instead
declim: number of digits to the right of the decimal point to which to round confidence limits for labeling axes
at: vector of coordinates at which to put tick mark labels on the main axis. If log=TRUE, specify at in the original (anti-logged) scale.
log: set to TRUE to plot anti-logged effect ratios on a log scale
q: scalar or vector of confidence coefficients to depict
xlim: x-axis limits for plot in units of the linear predictors (log scale if log=TRUE)
nbar: sets up the plot to leave room for nbar horizontal bars, so that spacing is comparable across plots
cex: cex for text labels
nint: number of tick mark numbers for pretty
cex.main: cex for main title
clip: confidence limits outside the interval clip are ignored
main: main title; the default is inferred from the model (e.g., "Odds Ratio")
col: vector of colors, one per value of q
col.points: color for point estimates
pch: symbol for point estimates. Default is solid triangle.
lwd: line widths for confidence intervals, corresponding to q
For summary.rms, a matrix of class summary.rms with rows corresponding to factors in the model and columns containing the low and high values for the effects, the range for the effects, the effect point estimates (difference in predicted values for high and low factor values), the standard error of this effect estimate, and the lower and upper confidence limits. If fit$scale.pred has a second level, two rows appear for each factor, the second corresponding to anti-logged effects. Non-categorical factors are stored first, and effects for any categorical factors are stored at the end of the returned matrix. The returned matrix also carries attributes scale.pred and adjust. adjust is a character string containing levels of adjustment variables, if there are any interactions. Otherwise it is "". latex.summary.rms returns an object of class c("latex","file"). It requires the latex function in Hmisc.
Frank Harrell
Hui Nian
Department of Biostatistics, Vanderbilt University
[email protected]
datadist, rms, rms.trans, rmsMisc, Misc, pretty, contrast.rms
n <- 1000    # define sample size
set.seed(17) # so can reproduce the results
age            <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
cholesterol    <- rnorm(n, 200, 25)
sex            <- factor(sample(c('female','male'), n, TRUE))
label(age)            <- 'Age'   # label is in Hmisc
label(cholesterol)    <- 'Total Cholesterol'
label(blood.pressure) <- 'Systolic Blood Pressure'
label(sex)            <- 'Sex'
units(cholesterol)    <- 'mg/dl'   # uses units.default in Hmisc
units(blood.pressure) <- 'mmHg'

# Specify population model for log odds that Y=1
L <- .4*(sex=='male') + .045*(age-50) +
     (log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)

ddist <- datadist(age, blood.pressure, cholesterol, sex)
options(datadist='ddist')

fit <- lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)))

s <- summary(fit)   # Estimate effects using default ranges
                    # Gets odds ratio for age=3rd quartile
                    # compared to 1st quartile
## Not run:
latex(s)            # Use LaTeX to print nice version
latex(s, file="")   # Just write LaTeX code to console
html(s)             # html/LaTeX to console for knitr
# Or:
options(prType='latex')
summary(fit)        # prints with LaTeX, table.env=FALSE
options(prType='html')
summary(fit)        # prints with html
## End(Not run)
summary(fit, sex='male', age=60)   # Specify ref. cell and adjustment val
summary(fit, age=c(50,70))         # Estimate effect of increasing age from
                                   # 50 to 70
s <- summary(fit, age=c(50,60,70)) # Increase age from 50 to 70, adjust to
                                   # 60 when estimating effects of other factors
# Could have omitted datadist if specified 3 values for all non-categorical
# variables (1 value for categorical ones - adjustment level)
plot(s, log=TRUE, at=c(.1,.5,1,1.5,2,4,8))
options(datadist=NULL)
Compute survival probabilities and optional confidence limits for Cox survival models. If x=TRUE, y=TRUE were specified to cph, confidence limits use the correct formula for any combination of predictors. Otherwise, if surv=TRUE was specified to cph, confidence limits are based only on standard errors of log(S(t)) at the mean value of the linear predictor X*beta. If the model contained only stratification factors, or if predictions are being requested near the mean of each covariable, this approximation will be accurate. Unless times is given, at most one observation may be predicted.
survest(fit, ...)
## S3 method for class 'cph'
survest(fit, newdata, linear.predictors, x, times, fun,
        loglog=FALSE, conf.int=0.95, type, vartype,
        conf.type=c("log", "log-log", "plain", "none"),
        se.fit=TRUE, what=c('survival','parallel'),
        individual=FALSE, ...)
fit: a model fit from cph
newdata: a data frame containing predictor variable combinations for which predictions are desired
linear.predictors: a vector of linear predictor values (centered) for which predictions are desired. If the model is stratified, the "strata" attribute must be attached to this vector (see example).
x: a design matrix at which to compute estimates, with any strata attached as a "strata" attribute. Only one of newdata, linear.predictors, and x may be specified.
times: a vector of times at which to get predictions. If omitted, predictions are made at all unique failure times in the original input data.
loglog: set to TRUE to transform survival estimates and confidence limits using the log-log transformation
fun: any function to transform the estimates and confidence limits (loglog is a special case)
conf.int: set to FALSE or 0 to suppress confidence limits; default is 0.95
type: see survfit.coxph
vartype: see survfit.coxph
conf.type: specifies the basis for computing confidence limits; "log" is the default
se.fit: set to TRUE (the default) to compute standard errors of the estimates
individual: set to TRUE to have newdata interpreted as specifying a covariable path for a single individual
what: normally use "survival"; specify what="parallel" to estimate survival at a vector of times, one time per subject
...: unused
The result is passed through naresid if newdata, linear.predictors, and x are not specified, to restore placeholders for NAs.
If times is omitted, returns a list with the elements time, n.risk, n.event, surv, call (calling statement), and optionally std.err, upper, lower, conf.type, conf.int. The estimates in this case correspond to one subject. If times is specified, the returned list has possible components time, surv, std.err, lower, and upper. These will be matrices (except for time) if more than one subject is being predicted, with rows representing subjects and columns representing times. If times has only one time, these are reduced to vectors with the number of elements equal to the number of subjects.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
cph, survfit.cph, survfit.coxph, predictrms, survplot
# Simulate data from a population model in which the log hazard
# function is linear in age and there is no age x sex interaction
# Proportional hazards holds for both variables but we
# unnecessarily stratify on sex to see what happens
require(survival)
n <- 1000
set.seed(731)
age <- 50 + 12*rnorm(n)
label(age) <- "Age"
sex <- factor(sample(c('Male','Female'), n, TRUE))
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50)+.8*(sex=='Female'))
dt <- -log(runif(n))/h
label(dt) <- 'Follow-up Time'
e <- ifelse(dt <= cens,1,0)
dt <- pmin(dt, cens)
units(dt) <- "Year"
dd <- datadist(age, sex)
options(datadist='dd')
Srv <- Surv(dt,e)

f <- cph(Srv ~ age*strat(sex), x=TRUE, y=TRUE)   # or surv=TRUE
survest(f, expand.grid(age=c(20,40,60), sex=c("Male","Female")),
        times=c(2,4,6), conf.int=.9)
f <- update(f, surv=TRUE)
lp <- c(0, .5, 1)
f$strata   # check strata names
attr(lp,'strata') <- rep(1,3)   # or rep('sex=Female',3)
survest(f, linear.predictors=lp, times=c(2,4,6))

# Test survest by comparing to survfit.coxph for a more complex model
f <- cph(Srv ~ pol(age,2)*strat(sex), x=TRUE, y=TRUE)
survest(f, data.frame(age=median(age), sex=levels(sex)), times=6)

age2 <- age^2
f2 <- coxph(Srv ~ (age + age2)*strata(sex))
new <- data.frame(age=median(age), age2=median(age)^2, sex='Male')
summary(survfit(f2, new), times=6)
new$sex <- 'Female'
summary(survfit(f2, new), times=6)
options(datadist=NULL)
Computes predicted survival probabilities or hazards and optionally confidence limits (for survival only) for parametric survival models fitted with psm. If getting predictions for more than one observation, times must be specified. For a model without predictors, no input data are specified.
## S3 method for class 'psm'
survest(fit, newdata, linear.predictors, x, times, fun,
        loglog=FALSE, conf.int=0.95,
        what=c("survival","hazard","parallel"), ...)

## S3 method for class 'survest.psm'
print(x, ...)
fit: fit from psm
newdata, linear.predictors, x, times, conf.int: see survest.cph
what: the default is to compute survival probabilities. Set what="hazard" to compute hazard rates, or what="parallel" to estimate survival at a vector of times, one time per subject.
loglog: set to TRUE to transform survival estimates and confidence limits using the log-log transformation
fun: a function to transform estimates and optional confidence intervals
...: unused
Confidence intervals are based on asymptotic normality of the linear predictors. The intervals account for the fact that a scale parameter may have been estimated jointly with beta.
see survest.cph. If the model has no predictors, predictions are made with respect to varying time only, and the returned object is of class "npsurv" so the survival curve can be plotted with survplot.npsurv. If times is omitted, the entire survival curve or hazard from t=0,...,fit$maxtime is estimated, with increments computed to yield 200 points, where fit$maxtime is the maximum survival time in the data used in model fitting. Otherwise, the times vector controls the time points used.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
psm, survreg, rms, survfit, predictrms, survplot, survreg.distributions
# Simulate data from a proportional hazards population model
require(survival)
n <- 1000
set.seed(731)
age <- 50 + 12*rnorm(n)
label(age) <- "Age"
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50))
dt <- -log(runif(n))/h
label(dt) <- 'Follow-up Time'
e <- ifelse(dt <= cens,1,0)
dt <- pmin(dt, cens)
units(dt) <- "Year"
S <- Surv(dt,e)
f <- psm(S ~ lsp(age,c(40,70)))
survest(f, data.frame(age=seq(20,80,by=5)), times=2)

# Get predicted survival curve for 40 year old
survest(f, data.frame(age=40))

# Get hazard function for 40 year old
survest(f, data.frame(age=40), what="hazard")$surv   # still called surv
This is a slightly modified version of Therneau's survfit.coxph function. The difference is that survfit.cph assumes that x=TRUE, y=TRUE were specified to the fit. This assures that the environment in effect at the time of the fit (e.g., automatic knot estimation for spline functions) is the same one used for basing predictions.
## S3 method for class 'cph'
survfit(formula, newdata, se.fit=TRUE, conf.int=0.95,
        individual=FALSE, type=NULL, vartype=NULL,
        conf.type=c('log', "log-log", "plain", "none"), id, ...)
formula: a fit object from cph (the argument is named formula only to match the survfit generic)
newdata, se.fit, conf.int, individual, type, vartype, conf.type, id: see survfit.coxph
...: not used
see survfit.coxph
Plot estimated survival curves, and for parametric survival models, plot
hazard functions. There is an option to print the number of subjects
at risk at the start of each time interval. Curves are automatically
labeled at the points of maximum separation (using the labcurve
function), and there are many other options for labeling that can be
specified with the label.curves
parameter. For example, different
plotting symbols can be placed at constant x-increments and a legend
linking the symbols with category labels can be automatically positioned on
the most empty portion of the plot.
For the case of a two stratum analysis by npsurv
,
survdiffplot
plots the difference in two Kaplan-Meier estimates
along with approximate confidence bands for the differences, with a
reference line at zero. The number of subjects at risk is optionally
plotted. This number is taken as the minimum of the number of subjects
at risk over the two strata. When conf='diffbands'
,
survdiffplot
instead does not make a new plot but adds a shaded
polygon to an existing plot, showing the midpoint of two survival
estimates plus or minus 1/2 the width of the confidence interval for the
difference of two Kaplan-Meier estimates.
survplotp
creates an interactive plotly
graphic with
shaded confidence bands. In the two strata case, it draws the 1/2
confidence bands for the difference in two probabilities centered at the
midpoint of the probability estimates, so that where the two curves
touch this band there is no significant difference (no multiplicity
adjustment is made). For the two strata case, the two individual
confidence bands have entries in the legend but are not displayed until
the user clicks on the legend.
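A short sketch (S and sex as in the examples below):

f <- npsurv(S ~ sex)
survplotp(f, times=c(2, 5))   # interactive graphic; half-width difference
                              # band shown for the two strata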
When the fit was from running npsurv on a multi-state/competing risk Surv object, survplot plots cumulative incidence curves properly accounting for competing risks. You must specify exactly one state/event cause to plot using the state argument. survplot will not plot multiple states on one graph. This can be accomplished using multiple calls with different values of state and specifying add=TRUE for all but the first call.
survplot(fit, ...)
survplotp(fit, ...)

## S3 method for class 'rms'
survplot(fit, ..., xlim,
         ylim=if(loglog) c(-5, 1.5) else if
              (what == "survival" & missing(fun)) c(0, 1),
         xlab, ylab, time.inc, what=c("survival","hazard"),
         type=c("tsiatis","kaplan-meier"),
         conf.type=c("log","log-log","plain","none"),
         conf.int=FALSE, conf=c("bands","bars"), mylim=NULL,
         add=FALSE, label.curves=TRUE, abbrev.label=FALSE,
         levels.only=FALSE, lty, lwd=par("lwd"), col=1,
         col.fill=gray(seq(.95, .75, length=5)),
         adj.subtitle=TRUE, loglog=FALSE, fun, n.risk=FALSE,
         logt=FALSE, dots=FALSE, dotsize=.003, grid=NULL,
         srt.n.risk=0, sep.n.risk=0.056, adj.n.risk=1,
         y.n.risk, cex.n.risk=.6, cex.xlab=par('cex.lab'),
         cex.ylab=cex.xlab, pr=FALSE)

## S3 method for class 'npsurv'
survplot(fit, xlim, ylim, xlab, ylab, time.inc, state=NULL,
         conf=c("bands","bars","diffbands","none"), mylim=NULL,
         add=FALSE, label.curves=TRUE, abbrev.label=FALSE,
         levels.only=FALSE, lty, lwd=par('lwd'), col=1,
         col.fill=gray(seq(.95, .75, length=5)),
         loglog=FALSE, fun, n.risk=FALSE, aehaz=FALSE,
         times=NULL, logt=FALSE, dots=FALSE, dotsize=.003,
         grid=NULL, srt.n.risk=0, sep.n.risk=.056, adj.n.risk=1,
         y.n.risk, cex.n.risk=.6, cex.xlab=par('cex.lab'),
         cex.ylab=cex.xlab, pr=FALSE, ...)

## S3 method for class 'npsurv'
survplotp(fit, xlim, ylim, xlab, ylab, time.inc, state=NULL,
          conf=c("bands", "none"), mylim=NULL, abbrev.label=FALSE,
          col=colorspace::rainbow_hcl, levels.only=TRUE,
          loglog=FALSE, fun=function(y) y, aehaz=FALSE, times=NULL,
          logt=FALSE, pr=FALSE, ...)

survdiffplot(fit, order=1:2, fun=function(y) y,
             xlim, ylim, xlab, ylab="Difference in Survival Probability",
             time.inc, conf.int, conf=c("shaded", "bands","diffbands","none"),
             add=FALSE, lty=1, lwd=par('lwd'), col=1, n.risk=FALSE,
             grid=NULL, srt.n.risk=0, adj.n.risk=1, y.n.risk,
             cex.n.risk=.6, cex.xlab=par('cex.lab'), cex.ylab=cex.xlab,
             convert=function(f) f)
fit: result of fit (cph, psm, npsurv, or survest.psm)
...: list of factors with names used in model. The first factor listed is the factor used to determine different survival curves; any other factors are used to specify single constants to adjust to. Does not apply to fits from npsurv.
xlim: a vector of two numbers specifying the x-axis range for follow-up time. Default is 0 to the maximum follow-up time in the fit.
ylim: y-axis limits. Default is c(0,1) for survival probabilities, or c(-5,1.5) when loglog=TRUE.
xlab: x-axis label. Default is the units attribute of the failure time variable.
ylab: y-axis label. Default depends on what is plotted, e.g., "Survival Probability".
time.inc: time increment for labeling the x-axis and printing numbers at risk. If not specified, the value of time.inc stored with the model fit is used.
state: the state/event cause to use in plotting if the fit was for a multi-state/competing risk Surv object
type: specifies type of estimates, "tsiatis" (the default) or "kaplan-meier"
conf.type: specifies the basis for confidence limits. This argument is ignored for fits from npsurv.
conf.int: default is FALSE; specify e.g. .95 to plot 0.95 confidence bands
conf: "bands" (the default), "bars", "diffbands" (npsurv only), or "none"
mylim: used to curtail the computed y-axis limits
what: defaults to "survival"; specify "hazard" for psm fits to plot the hazard function
add: set to TRUE to add curves to an existing plot
label.curves: default is TRUE to use labcurve to label curves where they are most separated; set to a list to pass options to labcurve, or to FALSE to suppress labeling
abbrev.label: set to TRUE to abbreviate() curve labels
levels.only: set to TRUE to remove the variablename= portion of curve labels
lty: vector of line types to use for different factor levels
lwd: vector of line widths to use for different factor levels. Default is current par("lwd").
col: color for curve; default is 1. A vector assigns different colors to different curves.
col.fill: a vector of colors used in filling confidence bands
adj.subtitle: set to FALSE to suppress the subtitle showing adjustment settings
loglog: set to TRUE to plot log(-log Survival) against log(t)
fun: specifies any function to translate estimates and confidence limits before plotting. If the fit is a multi-state object, the default for fun is function(y) 1 - y, so that cumulative incidence is plotted.
logt: set to TRUE to plot against log(t)
n.risk: set to TRUE to add the number of subjects at risk at each time.inc to the plot
srt.n.risk: angle of rotation for the leftmost number of subjects at risk (since this number may run into the second or into the y-axis). Default is 0.
adj.n.risk: justification for the leftmost number at risk. Default is 1 (right-justified).
sep.n.risk: multiple of (upper y limit - lower y limit) for separating lines of text containing numbers of subjects at risk. Default is 0.056.
y.n.risk: when n.risk=TRUE, a y-coordinate at which to begin writing the numbers at risk, overriding the default placement
cex.n.risk: character size for numbers of subjects at risk (when n.risk is TRUE)
cex.xlab: cex for the x-axis label
cex.ylab: cex for the y-axis label
dots: set to TRUE to plot a grid of dots
dotsize: size of dots in inches
grid: defaults to NULL (no grid); specify TRUE or a color to draw a grid
pr: set to TRUE to print the survival curve estimates
aehaz: set to TRUE to add the number of events and an exponential-distribution hazard rate estimate to curve labels
times: a numeric vector of times at which to compute cumulative incidence probability estimates to add to curve labels
order: an integer vector of length two specifying the order of groups when computing survival differences. The default of 1:2 means the second group's estimate is subtracted from the first's; use 2:1 for the reverse.
convert: a function to convert the output of summary.survfit, e.g., to select the event of interest for multi-state fits
survplot will not work for Cox models with time-dependent covariables. Use survest or survfit for that purpose.
There is a system option mgp.axis.labels to allow the x- and y-axes to have differing mgp graphical parameters (see par). This is important when labels for y-axis tick marks are to be written horizontally (par(las=1)), as a larger gap between the labels and the tick marks is needed. You can set the axis-specific 2nd component of mgp using mgp.axis.labels(c(xvalue,yvalue)).
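For example (a sketch following the call form just described; the values are illustrative):

par(las=1)                    # horizontal y-axis tick labels
mgp.axis.labels(c(.7, 1.3))   # 2nd mgp component for x- and y-axes (Hmisc)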
list with components adjust (text string specifying adjustment levels) and curve.labels (vector of text strings corresponding to levels of factor used to distinguish curves). For npsurv, the returned value is the vector of strata labels, or NULL if there are no strata.

As a side effect, plots are drawn. If par()$mar[4] < 4, survplot issues par(mar=) to increment mar[4] by 2 if n.risk=TRUE and add=FALSE. The user may want to reset par(mar) in this case to not leave such a wide right margin for plots. You usually would issue par(mar=c(5,4,4,2)+.1).
Boers M (2004): Null bar and null zone are better than the error bar to compare group means in graphs. J Clin Epi 57:712-715.
datadist, rms, cph, psm, survest, predictrms, plot.Predict, ggplot.Predict, units, errbar, survfit, survreg.distributions, labcurve, mgp.axis, par
# Simulate data from a population model in which the log hazard
# function is linear in age and there is no age x sex interaction
require(survival)
n <- 1000
set.seed(731)
age <- 50 + 12*rnorm(n)
label(age) <- "Age"
sex <- factor(sample(c('male','female'), n, TRUE))
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50)+.8*(sex=='female'))
dt <- -log(runif(n))/h
label(dt) <- 'Follow-up Time'
e <- ifelse(dt <= cens,1,0)
dt <- pmin(dt, cens)
units(dt) <- "Year"
dd <- datadist(age, sex)
options(datadist='dd')
S <- Surv(dt,e)

# When age is in the model by itself and we predict at the mean age,
# approximate confidence intervals are ok
f <- cph(S ~ age, surv=TRUE)
survplot(f, age=mean(age), conf.int=.95)
g <- cph(S ~ age, x=TRUE, y=TRUE)
survplot(g, age=mean(age), conf.int=.95, add=TRUE, col='red', conf='bars')

# Repeat for an age far from the mean; not ok
survplot(f, age=75, conf.int=.95)
survplot(g, age=75, conf.int=.95, add=TRUE, col='red', conf='bars')

# Plot stratified survival curves by sex, adj for quadratic age effect
# with age x sex interaction (2 d.f. interaction)
f <- cph(S ~ pol(age,2)*strat(sex), x=TRUE, y=TRUE)
# or f <- psm(S ~ pol(age,2)*sex)
Predict(f, sex, age=c(30,50,70))
survplot(f, sex, n.risk=TRUE, levels.only=TRUE)   # Adjust age to median
survplot(f, sex, logt=TRUE, loglog=TRUE)   # Check for Weibull-ness (linearity)
survplot(f, sex=c("male","female"), age=50)
# Would have worked without datadist or with an incomplete datadist
survplot(f, sex, label.curves=list(keys=c(2,0), point.inc=2))
# Identify curves with symbols
survplot(f, sex, label.curves=list(keys=c('m','f')))
# Identify curves with single letters

# Plots by quintiles of age, adjusting sex to male
options(digits=3)
survplot(f, age=quantile(age,(1:4)/5), sex="male")

# Plot Kaplan-Meier survival estimates for males
f <- npsurv(S ~ 1, subset=sex=="male")
survplot(f)

# Plot survival for both sexes and show exponential hazard estimates
f <- npsurv(S ~ sex)
survplot(f, aehaz=TRUE)
# Check for log-normal and log-logistic fits
survplot(f, fun=qnorm, ylab="Inverse Normal Transform")
survplot(f, fun=function(y)log(y/(1-y)), ylab="Logit S(t)")

# Plot the difference between sexes
survdiffplot(f)
# Similar but show half-width of confidence intervals centered
# at average of two survival estimates; see Boers (2004)
survplot(f, conf='diffbands')
options(datadist=NULL)

## Not run:
# Time to progression/death for patients with monoclonal gammopathy
# Competing risk curves (cumulative incidence)
# status variable must be a factor with first level denoting right censoring
m <- upData(mgus1, stop = stop / 365.25, units=c(stop='years'),
            labels=c(stop='Follow-up Time'), subset=start == 0)
f <- npsurv(Surv(stop, event) ~ 1, data=m)
# Use survplot for enhanced displays of cumulative incidence curves for
# competing risks
survplot(f, state='pcm', n.risk=TRUE, xlim=c(0, 20), ylim=c(0, .5), col=2)
survplot(f, state='death', aehaz=TRUE, col=3,
         label.curves=list(keys='lines'))
f <- npsurv(Surv(stop, event) ~ sex, data=m)
survplot(f, state='death', aehaz=TRUE, n.risk=TRUE, conf='diffbands',
         label.curves=list(keys='lines'))
## End(Not run)
The val.prob function is useful for validating predicted probabilities against binary events. Given a set of predicted probabilities p or predicted log odds logit, and a vector of binary outcomes y that were not used in developing the predictions p or logit, val.prob computes the following indexes and statistics: Somers' Dxy rank correlation between p and y [Dxy = 2(C - 0.5), where C = ROC area], the Nagelkerke-Cox-Snell-Maddala-Magee R-squared index, the discrimination index D [(logistic model L.R. chi-square - 1)/n], the L.R. chi-square and its P-value, the unreliability index U, a chi-square with 2 d.f. for testing unreliability (H0: intercept=0, slope=1) and its P-value, the quality index Q, the Brier score (average squared difference between p and y), the calibration Intercept and Slope, Emax = the maximum absolute difference between predicted and loess-calibrated probabilities, Eavg, the average of the same, E90, the 0.9 quantile of the same, the Spiegelhalter Z-test for calibration accuracy, and its two-tailed P-value. If pl=TRUE, val.prob plots the fitted logistic calibration curve and optionally a smooth nonparametric fit using lowess(p,y,iter=0) and grouped proportions vs. mean predicted probability in group. If the predicted probabilities or logits are constant, the statistics are returned and no plot is made.
Before rms 4.5-1, Eavg, Emax, and E90 were computed from the linear logistic calibration rather than the loess calibration.
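As a rough illustration of how the loess-based errors are defined (a sketch of the assumed definitions, not val.prob's internals; p and y are the predicted probabilities and binary outcomes described above):

cal  <- lowess(p, y, iter=0)              # nonparametric calibration curve
pc   <- approx(cal, xout=p, ties=mean)$y  # calibrated probability at each p
err  <- abs(p - pc)                       # absolute calibration errors
Emax <- max(err)                          # maximum absolute error
Eavg <- mean(err)                         # average absolute error
E90  <- quantile(err, 0.9)                # 0.9 quantile of the errors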
When group is present, different statistics are computed, different graphs are made, and the object returned by val.prob is different. group specifies a stratification variable. Validations are done separately by levels of group and overall. A print method prints summary statistics and several quantiles of predicted probabilities, and a plot method plots calibration curves with summary statistics superimposed, along with selected quantiles of the predicted probabilities (shown as tick marks on calibration curves). Only the lowess calibration curve is estimated. The statistics computed are the average predicted probability, the observed proportion of events, a 1 d.f. chi-square statistic for testing overall mis-calibration (i.e., a test of the observed vs. the overall average predicted probability of the event) (ChiSq), a 2 d.f. chi-square statistic for testing simultaneously that the intercept of a linear logistic calibration curve is zero and the slope is one (ChiSq2), the average absolute calibration error (average absolute difference between the lowess-estimated calibration curve and the line of identity, labeled Eavg), Eavg divided by the difference between the 0.95 and 0.05 quantiles of predicted probabilities (Eavg/P90), a "median odds ratio", i.e., the anti-log of the median absolute difference between predicted and calibrated predicted log odds of the event (Med OR), the C-index (ROC area), the Brier quadratic error score (B), a chi-square test of goodness of fit based on the Brier score (B ChiSq), and the Brier score computed on calibrated rather than raw predicted probabilities (B cal). The first chi-square test is a test of overall calibration accuracy ("calibration in the large"), and the second will also detect errors such as slope shrinkage caused by overfitting or regression to the mean. See Cox (1970) for both of these score tests. The goodness of fit test based on the (uncalibrated) Brier score is due to Hilden, Habbema, and Bjerregaard (1978) and is discussed in Spiegelhalter (1986). When group is present you can also specify sampling weights (usually frequencies) to obtain weighted calibration curves.

To get the behavior that results from a grouping variable being present without having a grouping variable, use group=TRUE. In the plot method, calibration curves are drawn and labeled by default where they are maximally separated using the labcurve function. The following parameters do not apply when group is present: pl, smooth, logistic.cal, m, g, cuts, emax.lim, legendloc, riskdist, mkh, connect.group, connect.smooth. The following parameters apply to the plot method but not to val.prob: xlab, ylab, lim, statloc, cex.
val.prob(p, y, logit, group, weights=rep(1,length(y)), normwt=FALSE,
         pl=TRUE, smooth=TRUE, logistic.cal=TRUE,
         xlab="Predicted Probability", ylab="Actual Probability",
         lim=c(0, 1), m, g, cuts, emax.lim=c(0,1),
         legendloc=lim[1] + c(0.55 * diff(lim), 0.27 * diff(lim)),
         statloc=c(0,0.99), riskdist=c("predicted", "calibrated"),
         cex=.7, mkh=.02, connect.group=FALSE, connect.smooth=TRUE,
         g.group=4, evaluate=100, nmin=0)

## S3 method for class 'val.prob'
print(x, ...)

## S3 method for class 'val.prob'
plot(x, xlab="Predicted Probability", ylab="Actual Probability",
     lim=c(0,1), statloc=lim, stats=1:12, cex=.5, lwd.overall=4,
     quantiles=c(.05,.95), flag, ...)
p: predicted probability
y: vector of binary outcomes
logit: predicted log odds of outcome; specify either p or logit
group: a grouping variable; if numeric, this variable is grouped into g.group quantile groups
weights: an optional numeric vector of per-observation weights (usually frequencies), used only if group is given
normwt: set to TRUE to make the weights sum to the number of non-missing observations
pl: TRUE to plot calibration curves and optionally statistics
smooth: plot a smooth nonparametric fit to (p, y) using lowess(p,y,iter=0)
logistic.cal: plot the linear logistic calibration fit to (p, y)
xlab: x-axis label, default is "Predicted Probability"
ylab: y-axis label, default is "Actual Probability"
lim: limits for both x and y axes
m: if grouped proportions are desired, average no. observations per group
g: if grouped proportions are desired, number of quantile groups
cuts: if grouped proportions are desired, actual cut points for constructing intervals
emax.lim: vector containing lowest and highest predicted probability over which to compute Emax
legendloc: applies if pl=TRUE; coordinates for positioning the legend (default shown in the usage above)
statloc: coordinates for the table of statistics (default shown in the usage above)
riskdist: which probabilities ("predicted" and/or "calibrated") to depict in the display of the risk distribution
cex: character size for the legend, or for the table of statistics when group is given
mkh: size of symbols for the legend; default is 0.02 (see par)
connect.group: defaults to FALSE; set to TRUE to connect the grouped proportions with lines
connect.smooth: defaults to TRUE to draw the smooth (lowess) curve as a connected line
g.group: number of quantile groups to use when group is given and is numeric
evaluate: number of points at which to store the lowess calibration curve
nmin: applies when group is given; minimum number of observations a group must have for its results to be displayed
x: result of val.prob (with group present)
...: optional arguments for labcurve
stats: vector of column numbers of statistical indexes to write on plot
lwd.overall: line width for plotting the overall calibration curve
quantiles: a vector listing which quantiles should be indicated on each calibration curve using tick marks
flag: a function of the matrix of statistics (rows representing groups) returning a vector of character strings (one value for each group, including "Overall")
The 2 d.f. chi-square test and Med OR exclude predicted or calibrated predicted probabilities that are less than or equal to zero or greater than or equal to one, adjusting the sample size as needed.
val.prob without group returns a vector with the following named elements: Dxy, R2, D, D:Chi-sq, D:p, U, U:Chi-sq, U:p, Q, Brier, Intercept, Slope, S:z, S:p, Emax. When group is present, val.prob returns an object of class val.prob containing a list with summary statistics and calibration curves for all the strata plus "Overall".
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
Harrell FE, Lee KL, Mark DB (1996): Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat in Med 15:361–387.
Harrell FE, Lee KL (1987): Using logistic calibration to assess the accuracy of probability predictions (Technical Report).
Miller ME, Hui SL, Tierney WM (1991): Validation techniques for logistic regression models. Stat in Med 10:1213–1226.
Stallard N (2009): Simple tests for the external validation of mortality prediction scores. Stat in Med 28:377–388.
Harrell FE, Lee KL (1985): A comparison of the discrimination of discriminant analysis and logistic regression under multivariate normality. In Biostatistics: Statistics in Biomedical, Public Health, and Environmental Sciences. The Bernard G. Greenberg Volume, ed. PK Sen. New York: North-Holland, p. 333–343.
Cox DR (1970): The Analysis of Binary Data, 1st edition, section 4.4. London: Methuen.
Spiegelhalter DJ (1986): Probabilistic prediction in patient management. Stat in Med 5:421–433.
Rufibach K (2010): Use of Brier score to assess binary predictions. J Clin Epi 63:938–939.
Tjur T (2009): Coefficients of determination in logistic regression models - A new proposal: The coefficient of discrimination. Am Statist 63:366–372.
validate.lrm, lrm.fit, lrm, labcurve, wtd.stats, scat1d
# Fit logistic model on 100 observations simulated from the actual
# model given by Prob(Y=1 given X1, X2, X3) = 1/(1+exp[-(-1 + 2X1)]),
# where X1 is a random uniform [0,1] variable.  Hence X2 and X3 are
# irrelevant.  After fitting a linear additive model in X1, X2,
# and X3, the coefficients are used to predict Prob(Y=1) on a
# separate sample of 100 observations.  Note that data splitting is
# an inefficient validation method unless n > 20,000.
set.seed(1)
n  <- 200
x1 <- runif(n)
x2 <- runif(n)
x3 <- runif(n)
logit <- 2*(x1-.5)
P  <- 1/(1+exp(-logit))
y  <- ifelse(runif(n)<=P, 1, 0)
d  <- data.frame(x1,x2,x3,y)
f  <- lrm(y ~ x1 + x2 + x3, subset=1:100)
pred.logit <- predict(f, d[101:200,])
phat <- 1/(1+exp(-pred.logit))
val.prob(phat, y[101:200], m=20, cex=.5)  # subgroups of 20 obs.

# Validate predictions more stringently by stratifying on whether
# x1 is above or below the median
v <- val.prob(phat, y[101:200], group=x1[101:200], g.group=2)
v
plot(v)
plot(v, flag=function(stats) ifelse(
  stats[,'ChiSq2'] > qchisq(.95,2) |
  stats[,'B ChiSq'] > qchisq(.95,1), '*', ' ') )
# Stars rows of statistics in plot corresponding to significant
# mis-calibration at the 0.05 level instead of the default, 0.01
plot(val.prob(phat, y[101:200], group=x1[101:200], g.group=2),
     col=1:3)  # 3 colors (1 for overall)

# Weighted calibration curves
# plot(val.prob(pred, y, group=age, weights=freqs))
The val.surv function is useful for validating predicted survival probabilities against right-censored failure times. If u is specified, the hazard regression function hare in the polspline package is used to relate predicted survival probability at time u to observed survival times (and censoring indicators) to estimate the actual survival probability at time u as a function of the estimated survival probability at that time, est.surv. If est.surv is not given, fit must be specified, and the survest function is used to obtain the predicted values (using newdata if it is given, or using the stored linear predictor values if not). hare is given the sole predictor fun(est.surv), where fun is given by the user or is inferred from fit. fun is the function of predicted survival probabilities that one expects to create a linear relationship with the linear predictors.

hare uses an adaptive procedure to find a linear spline of fun(est.surv) in a model where the log hazard is a linear spline in time, and cross-products between the two splines are allowed so as not to assume proportional hazards. Thus hare assumes that the covariate and time functions are smooth but not much else, if the number of events in the dataset is large enough for obtaining a reliable flexible fit. There are special print and plot methods when u is given. In this case, val.surv returns an object of class "val.survh"; otherwise it returns an object of class "val.surv".
If u is not specified, val.surv uses Cox-Snell (1968) residuals on the cumulative probability scale to check the calibration of a survival model against right-censored failure time data. If the predicted survival probability at time t for a subject having predictors X is S(t|X), this method is based on the fact that the predicted probability of failure before time t, 1 - S(t|X), when evaluated at the subject's actual survival time T, has a uniform (0,1) distribution. The quantity 1 - S(T|X) is right-censored when T is. By getting one minus the Kaplan-Meier estimate of the distribution of 1 - S(T|X) and plotting it against the 45 degree line we can check for calibration accuracy. A more stringent assessment can be obtained by stratifying this analysis by an important predictor variable. The theoretical uniform distribution is only an approximation when the survival probabilities are estimates and not population values.
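A minimal sketch of this check, assuming a vector surv.t of predicted S(T|X) evaluated at each subject's own observed time and an event indicator ev (val.surv's internals may differ):

require(survival)
u  <- 1 - surv.t                      # Pr(T <= observed time | X)
km <- survfit(Surv(u, ev) ~ 1)        # u inherits the censoring of T
plot(km$time, 1 - km$surv, type='s',  # 1 - KM estimate = CDF of u
     xlab='1 - S(T|X)', ylab='Cumulative probability')
abline(a=0, b=1, lty=2)               # 45-degree line = perfect calibration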
When censor is specified to val.surv, a different validation is done that is more stringent but that only uses the uncensored failure times. This method is used for type I censoring, when the theoretical censoring times are known for subjects having uncensored failure times. Let T, C, and F denote respectively the failure time, the censoring time, and the cumulative failure time distribution (F = 1 - S). The expected value of F(T|X) is 0.5 when T represents the subject's actual failure time. The expected value for an uncensored time is the expected value of F(T | T <= C, X), which is 0.5 F(C|X). A smooth plot of the difference between F(T|X) and 0.5 F(C|X) for uncensored T should be a flat line through zero if the model is well calibrated, and a smooth plot of the ratio 2 F(T|X)/F(C|X) for uncensored T should be a flat line through one. The smooth plot is obtained by smoothing the (linear predictor, difference or ratio) pairs.
val.surv(fit, newdata, S, est.surv, censor, u, fun,
         lim, evaluate=100, pred, maxdim=5, ...)

## S3 method for class 'val.survh'
print(x, ...)

## S3 method for class 'val.survh'
plot(x, lim, xlab, ylab, riskdist=TRUE, add=FALSE,
     scat1d.opts=list(nhistSpike=200), ...)

## S3 method for class 'val.surv'
plot(x, group, g.group=4, what=c('difference','ratio'),
     type=c('l','b','p'), xlab, ylab, xlim, ylim,
     datadensity=TRUE, ...)
fit: a fit object created by cph or psm
newdata: a data frame for which val.surv should obtain predicted survival probabilities; the default is to use the stored linear predictor values
S: a Surv object
est.surv: a vector of estimated survival probabilities corresponding to times in the first column of S
censor: a vector of censoring times; only the censoring times for uncensored observations are used
u: a single numeric follow-up time
fun: a function that transforms survival probabilities into the scale of the linear predictor; if not given, it is inferred from fit
lim: a 2-vector specifying limits of predicted survival probabilities for obtaining estimated actual probabilities at time u
evaluate: the number of evenly spaced points over the range of predicted probabilities; this defines the points at which calibrated predictions are obtained for plotting
pred: a vector of points at which to evaluate predicted probabilities, overriding lim and evaluate
maxdim: see hare
x: result of val.surv
xlab: x-axis label (a default is constructed if omitted)
ylab: y-axis label
riskdist: set to FALSE to suppress the scat1d display of the distribution of predicted probabilities
add: set to TRUE to add to an existing plot
scat1d.opts: a list of options to pass to scat1d
...: when u is given, additional arguments to pass to hare
group: a grouping variable; if numeric, this variable is grouped into g.group quantile groups
g.group: number of quantile groups to use when group is given and is numeric
what: the quantity to plot when censor was used: the "difference" or the "ratio" described above
type: set to the default "l" to plot lines, "b" for both lines and points, or "p" for points
xlim, ylim: axis limits for plot.val.surv
datadensity: by default, plot.val.surv shows the data density; set to FALSE to suppress it
a list of class "val.surv" or "val.survh"
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
Cox DR, Snell EJ (1968): A general definition of residuals (with discussion). JRSSB 30:248–275.
Kooperberg C, Stone C, Truong Y (1995): Hazard regression. JASA 90:78–94.
May M, Royston P, Egger M, Justice AC, Sterne JAC (2004): Development and validation of a prognostic model for survival time data: application to prognosis of HIV positive patients treated with antiretroviral therapy. Stat in Med 23:2375–2398.
Stallard N (2009): Simple tests for the external validation of mortality prediction scores. Stat in Med 28:377–388.
validate, calibrate, hare, scat1d, cph, psm, groupkm
# Generate failure times from an exponential distribution
require(survival)
set.seed(123)   # so can reproduce results
n <- 1000
age <- 50 + 12*rnorm(n)
sex <- factor(sample(c('Male','Female'), n, rep=TRUE, prob=c(.6, .4)))
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50)+.8*(sex=='Female'))
t <- -log(runif(n))/h
units(t) <- 'Year'
label(t) <- 'Time to Event'
ev <- ifelse(t <= cens, 1, 0)
t <- pmin(t, cens)
S <- Surv(t, ev)

# First validate true model used to generate data

# If hare is available, make a smooth calibration plot for 1-year
# survival probability where we predict 1-year survival using the
# known true population survival probability
# In addition, use groupkm to show that grouping predictions into
# intervals and computing Kaplan-Meier estimates is not as accurate.
if(requireNamespace('polspline')) {
  s1 <- exp(-h*1)
  w <- val.surv(est.surv=s1, S=S, u=1, fun=function(p)log(-log(p)))
  plot(w, lim=c(.85,1), scat1d.opts=list(nhistSpike=200, side=1))
  groupkm(s1, S, m=100, u=1, pl=TRUE, add=TRUE)
}

# Now validate the true model using residuals
w <- val.surv(est.surv=exp(-h*t), S=S)
plot(w)
plot(w, group=sex)   # stratify by sex

# Now fit an exponential model and validate
# Note this is not really a validation as we're using the
# training data here
f <- psm(S ~ age + sex, dist='exponential', y=TRUE)
w <- val.surv(f)
plot(w, group=sex)

# We know the censoring time on every subject, so we can
# compare the predicted Pr[T <= observed T | T > c, X] to
# its expectation 0.5 Pr[T <= C | X] where C = censoring time
# We plot a ratio that should equal one
w <- val.surv(f, censor=cens)
plot(w)
plot(w, group=age, g=3)   # stratify by tertile of age
The validate function, when used on an object created by one of the rms series, does resampling validation of a regression model, with or without backward step-down variable deletion.

The print method will call the latex or html method if options(prType=) is set to "latex" or "html". For "latex" printing through print(), the LaTeX table environment is turned off. When using html with Quarto or RMarkdown, results='asis' need not be written in the chunk header.
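For instance, a minimal sketch of HTML output in a Quarto/RMarkdown chunk (f stands for any rms fit made with x=TRUE, y=TRUE):

options(prType='html')
v <- validate(f, B=150)   # f: an rms fit with x=TRUE, y=TRUE
v                         # the print method emits HTML directly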
# fit <- fitting.function(formula=response ~ terms, x=TRUE, y=TRUE)
validate(fit, method="boot", B=40,
         bw=FALSE, rule="aic", type="residual", sls=0.05, aics=0,
         force=NULL, estimates=TRUE, pr=FALSE, ...)

## S3 method for class 'validate'
print(x, digits=4, B=Inf, ...)

## S3 method for class 'validate'
latex(object, digits=4, B=Inf, file='', append=FALSE,
      title=first.word(deparse(substitute(x))),
      caption=NULL, table.env=FALSE,
      size='normalsize', extracolsize=size, ...)

## S3 method for class 'validate'
html(object, digits=4, B=Inf, caption=NULL, ...)
fit: a fit derived by, e.g., lrm, cph, psm, or ols; the options x=TRUE and y=TRUE must have been specified
method: may be "crossvalidation", "boot" (the default), ".632", or "randomization"
B: number of repetitions; for method="crossvalidation", this is the number of groups of omitted observations
bw: set to TRUE to do fast backward step-down using the fastbw function
rule: applies if bw=TRUE; "aic" to use Akaike's information criterion as a stopping rule, or "p" to use P-values
type: "residual" or "individual", specifying whether the stopping rule is based on the residual chi-square for all deleted variables or on individual factor tests
sls: significance level for a factor to be kept in a model, or for judging the residual chi-square
aics: cutoff on AIC when rule="aic"
force: see fastbw
estimates: see print.fastbw
pr: TRUE to print results of each repetition
...: parameters for each specific validate function, and parameters to pass to predab.resample
x, object: an object produced by one of the validate functions
digits: number of decimal places to print
file: file to write LaTeX output; default is standard output
append: set to TRUE to append LaTeX output to an existing file
title, caption, table.env, extracolsize: see latex.default
size: size of LaTeX output; default is 'normalsize'
validate provides bias-corrected indexes that are specific to each type of model. For validate.cph and validate.psm, see validate.lrm, which is similar.

For validate.cph and validate.psm, there is an extra argument dxy, which if TRUE causes the dxy.cens function to be invoked to compute Somers' Dxy rank correlation at each resample. The values corresponding to the Dxy row are equal to 2(C - 0.5), where C is the C-index or concordance probability.
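For example, converting between the two quantities is a simple identity (shown here with an arbitrary value):

Dxy <- 0.5            # example value from a validate() run
C   <- Dxy/2 + 0.5    # corresponding C-index: 0.75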
For validate.cph with dxy=TRUE, you must specify an argument u if the model is stratified, since survival curves can then cross and X*beta is not 1-1 with predicted survival.

There is also a validate method for tree, which only does cross-validation and which has a different list of arguments.
a matrix with rows corresponding to the statistical indexes and columns for the original index, resample estimates, indexes applied to the whole or omitted sample using the model derived from the resample, average optimism, corrected index, and number of successful re-samples.
prints a summary, and optionally statistics for each re-fit
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
validate.ols, validate.cph, validate.lrm, validate.rpart, predab.resample, fastbw, rms, rms.trans, calibrate, dxy.cens, concordancefit
# See examples for validate.cph, validate.lrm, validate.ols
# Example of validating a parametric survival model:
require(survival)
n <- 1000
set.seed(731)
age <- 50 + 12*rnorm(n)
label(age) <- "Age"
sex <- factor(sample(c('Male','Female'), n, TRUE))
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50)+.8*(sex=='Female'))
dt <- -log(runif(n))/h
e <- ifelse(dt <= cens,1,0)
dt <- pmin(dt, cens)
units(dt) <- "Year"
S <- Surv(dt,e)

f <- psm(S ~ age*sex, x=TRUE, y=TRUE)  # Weibull model

# Validate full model fit
validate(f, B=10)   # usually B=150

# Validate stepwise model with typical (not so good) stopping rule
# bw=TRUE does not preserve hierarchy of terms at present
validate(f, B=10, bw=TRUE, rule="p", sls=.1, type="individual")
This is the version of the validate function specific to models fitted with cph or psm. Also included is a small function dxy.cens that retrieves Dxy and its standard error from the survival package's concordancefit function. This allows very fast computation of Dxy or the c-index even for hundreds of thousands of observations. dxy.cens negates Dxy if the log relative hazard is being predicted. If y is a left-censored Surv object, times are negated and a right-censored object is created, then Dxy is negated.
# fit <- cph(formula=Surv(ftime,event) ~ terms, x=TRUE, y=TRUE, ...)
## S3 method for class 'cph'
validate(fit, method="boot", B=40, bw=FALSE, rule="aic",
         type="residual", sls=.05, aics=0, force=NULL, estimates=TRUE,
         pr=FALSE, dxy=TRUE, u, tol=1e-9, ...)

## S3 method for class 'psm'
validate(fit, method="boot", B=40, bw=FALSE, rule="aic",
         type="residual", sls=.05, aics=0, force=NULL, estimates=TRUE,
         pr=FALSE, dxy=TRUE, tol=1e-12, rel.tolerance=1e-5, maxiter=15, ...)

dxy.cens(x, y, type=c('time','hazard'))
fit: a fit derived by cph or psm; the options x=TRUE and y=TRUE must have been specified
method: see validate
B: number of repetitions; for method="crossvalidation", the number of groups of omitted observations
rel.tolerance, maxiter, bw: see validate and fastbw
rule: applies if bw=TRUE; see validate
type: see validate
sls: significance level for a factor to be kept in a model, or for judging the residual chi-square
aics: cutoff on AIC when rule="aic"
force: see fastbw
estimates: see print.fastbw
pr: TRUE to print results of each repetition
tol, ...: see validate and predab.resample
dxy: set to TRUE (the default) to have dxy.cens compute Dxy at each resample
u: must be specified if the model has any stratification factors and dxy=TRUE
x: a numeric vector
y: a Surv object
Statistics validated include the Nagelkerke R2, Dxy, slope shrinkage, the discrimination index D [(model L.R. chi-square - 1)/L], the unreliability index U = (difference in -2 log likelihood between the uncalibrated X*beta and X*beta with overall slope calibrated to the test sample)/L, and the overall quality index Q. g is the g-index on the log relative hazard (linear predictor) scale. L is -2 log likelihood with beta=0. The "corrected" slope can be thought of as a shrinkage factor that takes overfitting into account. See predab.resample for the list of resampling methods.
matrix with rows corresponding to Dxy, Slope, D, U, and Q, and columns for the original index, resample estimates, indexes applied to the whole or omitted sample using the model derived from the resample, average optimism, corrected index, and number of successful resamples. The values corresponding to the Dxy row are equal to 2(C - 0.5), where C is the C-index or concordance probability. If the user is correlating the linear predictor (predicted log hazard) with survival time, Dxy is automatically negated.

prints a summary, and optionally statistics for each re-fit (if pr=TRUE)
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
validate, predab.resample, fastbw, rms, rms.trans, calibrate, rcorr.cens, cph, survival-internal, gIndex, concordancefit
require(survival)
n <- 1000
set.seed(731)
age <- 50 + 12*rnorm(n)
label(age) <- "Age"
sex <- factor(sample(c('Male','Female'), n, TRUE))
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50)+.8*(sex=='Female'))
dt <- -log(runif(n))/h
e <- ifelse(dt <= cens,1,0)
dt <- pmin(dt, cens)
units(dt) <- "Year"
S <- Surv(dt,e)

f <- cph(S ~ age*sex, x=TRUE, y=TRUE)

# Validate full model fit
validate(f, B=10)   # normally B=150

# Validate a model with stratification.  Dxy is the only
# discrimination measure for such models, but Dxy requires
# one to choose a single time at which to predict S(t|X)
f <- cph(S ~ rcs(age)*strat(sex),
         x=TRUE, y=TRUE, surv=TRUE, time.inc=2)
validate(f, u=2, B=10)   # normally B=150; note u=time.inc
The validate function, when used on an object created by lrm or orm, does resampling validation of a logistic regression model, with or without backward step-down variable deletion. It provides the bias-corrected Somers' Dxy rank correlation, the R-squared index, the intercept and slope of an overall logistic calibration equation, the maximum absolute difference in predicted and calibrated probabilities Emax, the discrimination index D [(model L.R. chi-square - 1)/n], the unreliability index U = (difference in -2 log likelihood between the un-calibrated X*beta and X*beta with overall intercept and slope calibrated to the test sample)/n, the overall quality index (logarithmic probability score) Q, and the Brier or quadratic probability score B (the last 3 are not computed for ordinal models), the g-index, and gp, the g-index on the probability scale. The corrected slope can be thought of as a shrinkage factor that takes overfitting into account. For orm fits, a subset of the above indexes is provided, Spearman's rho is substituted for Dxy, and a new index is reported: pdm, the mean absolute difference between 0.5 and the predicted probability that Y is greater than or equal to the marginal median of Y.
# fit <- lrm(formula=response ~ terms, x=TRUE, y=TRUE)  # or orm
## S3 method for class 'lrm'
validate(fit, method="boot", B=40,
         bw=FALSE, rule="aic", type="residual", sls=0.05, aics=0,
         force=NULL, estimates=TRUE,
         pr=FALSE, kint, Dxy.method=if(k==1) 'somers2' else 'lrm',
         emax.lim=c(0,1), ...)

## S3 method for class 'orm'
validate(fit, method="boot", B=40, bw=FALSE, rule="aic",
         type="residual", sls=.05, aics=0, force=NULL, estimates=TRUE,
         pr=FALSE, ...)
fit: a fit derived by lrm or orm; the options x=TRUE and y=TRUE must have been specified
method, B, bw, rule, type, sls, aics, force, estimates, pr: see validate and predab.resample
kint: in the case of an ordinal model, specifies which intercept to validate; default is the middle intercept
Dxy.method: method for computing Dxy, 'somers2' or 'lrm' (the default depends on the number of intercepts; see the usage above)
emax.lim: range of predicted probabilities over which to compute the maximum error; default is the entire range
...: other arguments to pass to predab.resample
If the original fit was created using penalized maximum likelihood estimation, the same penalty.matrix used with the original fit is used during validation.
a matrix with rows corresponding to Dxy, R2, Intercept, Slope, Emax, D, U, Q, B, g, and gp, and columns for the original index, resample estimates, indexes applied to the whole or omitted sample using the model derived from the resample, average optimism, corrected index, and number of successful re-samples. For validate.orm not all columns are provided; Spearman's rho is returned instead of Dxy, and pdm is reported.
prints a summary, and optionally statistics for each re-fit
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
Miller ME, Hui SL, Tierney WM (1991): Validation techniques for logistic regression models. Stat in Med 10:1213–1226.
Harrell FE, Lee KL (1985): A comparison of the discrimination of discriminant analysis and logistic regression under multivariate normality. In Biostatistics: Statistics in Biomedical, Public Health, and Environmental Sciences. The Bernard G. Greenberg Volume, ed. PK Sen. New York: North-Holland, p. 333–343.
predab.resample, fastbw, lrm, rms, rms.trans, calibrate, somers2, cr.setup, gIndex, orm
n <- 1000   # define sample size
age            <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
cholesterol    <- rnorm(n, 200, 25)
sex            <- factor(sample(c('female','male'), n, TRUE))

# Specify population model for log odds that Y=1
L <- .4*(sex=='male') + .045*(age-50) +
     (log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)

f <- lrm(y ~ sex*rcs(cholesterol)+pol(age,2)+blood.pressure,
         x=TRUE, y=TRUE)

# Validate full model fit
validate(f, B=10)           # normally B=300
validate(f, B=10, group=y)
# two-sample validation: make resamples have same numbers of
# successes and failures as original sample

# Validate stepwise model with typical (not so good) stopping rule
validate(f, B=10, bw=TRUE, rule="p", sls=.1, type="individual")

## Not run:
# Fit a continuation ratio model and validate it for the predicted
# probability that y=0
u <- cr.setup(y)
Y <- u$y
cohort <- u$cohort
attach(mydataframe[u$subs,])
f <- lrm(Y ~ cohort+rcs(age,4)*sex, penalty=list(interaction=2))
validate(f, cluster=u$subs, subset=cohort=='all')
# see predab.resample for cluster and subset
## End(Not run)
The validate function, when used on an object created by ols, does resampling validation of a multiple linear regression model, with or without backward step-down variable deletion. It uses resampling to estimate the optimism in various measures of predictive accuracy, which include R-squared, MSE (mean squared error with a denominator of n), the g-index, and the intercept and slope of an overall calibration of predicted against observed responses. The "corrected" slope can be thought of as a shrinkage factor that takes overfitting into account. validate.ols can also be used when a model for a continuous response is going to be applied to a binary response. A Somers' Dxy for this case is computed for each resample by dichotomizing y. This can be used to obtain an ordinary receiver operating characteristic curve area using the formula ROC area = 0.5(Dxy + 1). The Nagelkerke-Maddala R-squared index for the dichotomized y is also given. See predab.resample for the list of resampling methods.

The LaTeX needspace package must be in effect to use the latex method.
# fit <- fitting.function(formula=response ~ terms, x=TRUE, y=TRUE)
## S3 method for class 'ols'
validate(fit, method="boot", B=40,
         bw=FALSE, rule="aic", type="residual", sls=0.05, aics=0,
         force=NULL, estimates=TRUE, pr=FALSE,
         u=NULL, rel=">", tolerance=1e-7, ...)
fit: a fit derived by ols; the options x=TRUE and y=TRUE must have been specified
method, B, bw, rule, type, sls, aics, force, estimates, pr: see validate and predab.resample
u: if specified, the cutoff at which predictions are dichotomized for computing Dxy (see rel)
rel: relationship for dichotomizing predicted y; the default is ">"
tolerance: tolerance for singularity, passed along to the least-squares fitting routine
...: other arguments to pass to predab.resample
matrix with rows corresponding to R-square, MSE, g, intercept, and slope, and optionally Dxy and the dichotomized-response R-square, and columns for the original index, resample estimates, indexes applied to the whole or omitted sample using the model derived from the resample, average optimism, corrected index, and number of successful resamples.
prints a summary, and optionally statistics for each re-fit
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
ols, predab.resample, fastbw, rms, rms.trans, calibrate, gIndex
set.seed(1)
x1 <- runif(200)
x2 <- sample(0:3, 200, TRUE)
x3 <- rnorm(200)
distance <- (x1 + x2/3 + rnorm(200))^2

f <- ols(sqrt(distance) ~ rcs(x1,4) + scored(x2) + x3, x=TRUE, y=TRUE)

# Validate full model fit (from all observations) but for x1 < .75
validate(f, B=20, subset=x1 < .75)   # normally B=300

# Validate stepwise model with typical (not so good) stopping rule
validate(f, B=20, bw=TRUE, rule="p", sls=.1, type="individual")
Uses xval-fold cross-validation of a sequence of trees to derive estimates of the mean squared error and Somers' Dxy rank correlation between predicted and observed responses. In the case of a binary response variable, the mean squared error is the Brier accuracy score. For survival trees, Dxy is negated so that larger is better. There are print and plot methods for objects created by validate.rpart.
# f <- rpart(formula=y ~ x1 + x2 + ...)
## S3 method for class 'rpart'
validate(fit, method, B, bw, rule, type, sls, aics,
         force, estimates, pr=TRUE,
         k, rand, xval=10, FUN, ...)

## S3 method for class 'validate.rpart'
print(x, ...)

## S3 method for class 'validate.rpart'
plot(x, what=c("mse","dxy"), legendloc=locator, ...)
fit: an object created by rpart (with model=TRUE)
method, B, bw, rule, type, sls, aics, force, estimates: are present only for consistency with the generic validate; they are ignored
x: the result of validate.rpart
k: a sequence of cost/complexity values; by default these are obtained from the original fit
rand: a random sample (usually omitted)
xval: number of splits
FUN: the name of a function which produces a sequence of trees, such as prune
...: additional arguments to FUN
pr: set to FALSE to suppress printing of intermediate results
what: a vector of things to plot; by default, 2 plots will be done, one for mse and one for dxy
legendloc: a function that is evaluated with a single argument equal to 1 to generate legend coordinates (locator by default), or a 2-vector of coordinates
a list of class "validate.rpart" with components named k, size, dxy.app, dxy.val, mse.app, mse.val, binary, and xval. size is the number of nodes, dxy refers to Somers' D, mse refers to mean squared error of prediction, app means apparent accuracy on training samples, val means validated accuracy on test samples, and binary is a logical variable indicating whether or not the response variable was binary (a logical or 0/1 variable is binary). size will not be present if the user specifies k.
prints if pr=TRUE
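For instance, one might tabulate apparent vs. cross-validated accuracy by complexity (a sketch using the component names listed above; f is an rpart fit made with model=TRUE, as in the example below):

v <- validate(f)
cbind(k = v$k, size = v$size,
      mse.app = v$mse.app, mse.val = v$mse.val,
      dxy.app = v$dxy.app, dxy.val = v$dxy.val)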
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
rpart, somers2, dxy.cens, locator, legend
## Not run:
n <- 100
set.seed(1)
x1 <- runif(n)
x2 <- runif(n)
x3 <- runif(n)
y  <- 1*(x1+x2+rnorm(n) > 1)
table(y)
require(rpart)
f <- rpart(y ~ x1 + x2 + x3, model=TRUE)
v <- validate(f)
v   # note the poor validation
par(mfrow=c(1,2))
plot(v, legendloc=c(.2,.5))
par(mfrow=c(1,1))
## End(Not run)
The validate function, when used on an object created by Rq, does resampling validation of a quantile regression model, with or without backward step-down variable deletion. It uses resampling to estimate the optimism in various measures of predictive accuracy, which include mean absolute prediction error (MAD), Spearman's rho, the g-index, and the intercept and slope of an overall calibration of predicted against observed responses. The "corrected" slope can be thought of as a shrinkage factor that takes overfitting into account. validate.Rq can also be used when a model for a continuous response is going to be applied to a binary response. A Somers' Dxy for this case is computed for each resample by dichotomizing y. This can be used to obtain an ordinary receiver operating characteristic curve area using the formula ROC area = 0.5(Dxy + 1). See predab.resample for the list of resampling methods.

The LaTeX needspace package must be in effect to use the latex method.
# fit <- fitting.function(formula=response ~ terms, x=TRUE, y=TRUE)
## S3 method for class 'Rq'
validate(fit, method="boot", B=40,
         bw=FALSE, rule="aic", type="residual", sls=0.05, aics=0,
         force=NULL, estimates=TRUE, pr=FALSE,
         u=NULL, rel=">", tolerance=1e-7, ...)
fit: a fit derived by Rq; the options x=TRUE and y=TRUE must have been specified
method, B, bw, rule, type, sls, aics, force, estimates, pr: see validate and predab.resample
u: if specified, the cutoff at which predictions are dichotomized for computing Dxy (see rel)
rel: relationship for dichotomizing predicted y; the default is ">"
tolerance: ignored
...: other arguments to pass to predab.resample
matrix with rows corresponding to the various indexes, and optionally Dxy, and columns for the original index, resample estimates, indexes applied to the whole or omitted sample using the model derived from the resample, average optimism, corrected index, and number of successful resamples.
prints a summary, and optionally statistics for each re-fit
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
Rq, predab.resample, fastbw, rms, rms.trans, gIndex
set.seed(1)
x1 <- runif(200)
x2 <- sample(0:3, 200, TRUE)
x3 <- rnorm(200)
distance <- (x1 + x2/3 + rnorm(200))^2

f <- Rq(sqrt(distance) ~ rcs(x1,4) + scored(x2) + x3, x=TRUE, y=TRUE)

# Validate full model fit (from all observations) but for x1 < .75
validate(f, B=20, subset=x1 < .75)   # normally B=300

# Validate stepwise model with typical (not so good) stopping rule
validate(f, B=20, bw=TRUE, rule="p", sls=.1, type="individual")
Computes variance inflation factors from the covariance matrix of parameter estimates, using the method of Davis et al. (1986), which is based on the correlation matrix from the information matrix.
vif(fit)
fit: an object created by one of the rms fitting functions (e.g., lrm, ols, cph, or psm)
vector of vifs
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
Davis CE, Hyde JE, Bangdiwala SI, Nelson JJ: An example of dependencies among variables in a conditional logistic regression. In Modern Statistical Methods in Chronic Disease Epidemiology, Eds SH Moolgavkar and RL Prentice, pp. 140–147. New York: Wiley; 1986.
rmsMisc (for num.intercepts)
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + .1*rnorm(100)
y  <- sample(0:1, 100, TRUE)
f  <- lrm(y ~ x1 + x2)
vif(f)
Creates a list with a component for each factor in the model. The names of the components are the factor names. Each component contains the observation identifiers of all observations that are "overly influential" with respect to that factor, meaning that for at least one beta coefficient associated with that factor, |dfbetas| exceeds a given cutoff. The default cutoff is .2. The fit must come from a function that has resid(fit, type="dfbetas") defined.

show.influence, written by Jens Oehlschlaegel-Akiyoshi, applies the result of which.influence to a data frame, usually the one used to fit the model, to report the results.
which.influence(fit, cutoff=.2)

show.influence(object, dframe, report=NULL, sig=NULL, id=NULL)
fit: fit object
object: the result of which.influence
dframe: data frame containing observations pertinent to the model fit
cutoff: cutoff value
report: other columns of the data frame to report besides those corresponding to predictors that are influential for some observations
sig: if given, runs the results through signif with sig digits
id: a character vector that labels rows of dframe; the default is the row names
show.influence returns a marked dataframe with the first column being a count of influence values.
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
Jens Oehlschlaegel-Akiyoshi
Center for Psychotherapy Research
Christian-Belser-Strasse 79a
D-70597 Stuttgart Germany
[email protected]
residuals.lrm, residuals.cph, residuals.ols, rms, lrm, ols, cph
# Print observations in data frame that are influential,
# separately for each factor in the model
x1 <- 1:20
x2 <- abs(x1-10)
x3 <- factor(rep(0:2, length.out=20))
y  <- c(rep(0:1, 8), 1, 1, 1, 1)
f  <- lrm(y ~ rcs(x1,3) + x2 + x3, x=TRUE, y=TRUE)
w  <- which.influence(f, .55)
nam <- names(w)
d   <- data.frame(x1,x2,x3,y)
for(i in 1:length(nam)) {
  print(paste("Influential observations for effect of", nam[i]), quote=FALSE)
  print(d[w[[i]],])
}

show.influence(w, d)   # better way to show results
Produce Design Matrices for Contrasts
Xcontrast(fit, a, b=NULL, a2=NULL, b2=NULL, ycut=NULL,
          weights="equal", expand=TRUE, Zmatrix=TRUE)
fit: an rms or rmsb fit object, not necessarily complete
a, b, a2, b2, ycut, weights, expand: see contrast.rms
Zmatrix: set to FALSE for a partial PO model in which you do not want to include the Z matrix in the returned contrast matrix
This is a simpler version of contrast.rms that creates design matrices, or differences of them, and does not require the fit object to be complete (i.e., to have coefficients). It is used for the pcontrast option in rmsb::blrm().
numeric matrix
Frank Harrell
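A minimal sketch of obtaining a contrast design matrix from a complete fit (the simulated variables here are illustrative, not from the package documentation):

set.seed(1)
age <- rnorm(100, 50, 10)
sex <- factor(sample(c('female','male'), 100, TRUE))
y   <- ifelse(runif(100) < plogis(.02 * (age - 50)), 1, 0)
f   <- lrm(y ~ rcs(age, 4) + sex, x=TRUE, y=TRUE)
# Design-matrix difference corresponding to a 40 -> 60 change in age
X <- Xcontrast(f, a=list(age=60, sex='female'),
                  b=list(age=40, sex='female'))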