Title: | Regression Modeling Strategies |
---|---|
Description: | Regression modeling, testing, estimation, validation, graphics, prediction, and typesetting by storing enhanced model design attributes in the fit. 'rms' is a collection of functions that assist with and streamline modeling. It also contains functions for binary and ordinal logistic regression models, ordinal models for continuous Y with a variety of distribution families, and the Buckley-James multiple regression model for right-censored responses. It implements penalized maximum likelihood estimation for logistic and ordinary linear models. 'rms' works with almost any regression model, but it was especially written to work with binary or ordinal regression models, Cox regression, accelerated failure time models, ordinary linear models, the Buckley-James model, generalized least squares for serially or spatially correlated observations, generalized linear models, and quantile regression. |
Authors: | Frank E Harrell Jr [aut, cre] |
Maintainer: | Frank E Harrell Jr <[email protected]> |
License: | GPL (>= 2) |
Version: | 7.0-0 |
Built: | 2025-01-17 15:17:58 UTC |
Source: | CRAN |
Subset Method for Ocens Objects
## S3 method for class 'Ocens'
x[rows = 1:d[1], cols = 1:d[2], ...]
x | an Ocens object |
rows | logical or integer vector |
cols | logical or integer vector |
... | ignored |
Subsets an Ocens object, preserving its special attributes. Attributes are not updated to reflect the subset; such updating should be implemented in the future.
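As a minimal sketch of the behavior (this assumes the rms Ocens(y, y2) constructor, with y2=Inf denoting right-censoring; the data values are hypothetical):

Y <- Ocens(c(1, 2, 3), c(1, 2, Inf))   # third observation right-censored
Y[1:2, ]   # subsetting rows returns a new Ocens object with attributes kept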
a new Ocens object
Frank Harrell
The anova
function automatically tests most meaningful hypotheses
in a design. For example, suppose that age and cholesterol are
predictors, and that a general interaction is modeled using a restricted
spline surface. anova prints Wald statistics (F statistics for an ols fit) for testing linearity of age, linearity of
cholesterol, age effect (age + age by cholesterol interaction),
cholesterol effect (cholesterol + age by cholesterol interaction),
linearity of the age by cholesterol interaction (i.e., adequacy of the
simple age * cholesterol 1 d.f. product), linearity of the interaction
in age alone, and linearity of the interaction in cholesterol
alone. Joint tests of all interaction terms in the model and all
nonlinear terms in the model are also performed. For any multiple
d.f. effects for continuous variables that were not modeled through
rcs
, pol
, lsp
, etc., tests of linearity will be
omitted. This applies to matrix predictors produced by e.g.
poly
or ns
.
For lrm, orm, cph, psm
and Glm
fits, the better likelihood
ratio chi-square tests may be obtained by specifying test='LR'
.
Fits must use x=TRUE, y=TRUE
to run LR tests. The tests are run
fairly efficiently by subsetting the design matrix rather than
recreating it.
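As a brief sketch (the variable names are hypothetical), only a fit that stored its design matrix and response can be used for LR tests:

f <- lrm(y ~ rcs(age, 4) + sex, x=TRUE, y=TRUE)  # store design matrix, response
anova(f)              # Wald tests
anova(f, test='LR')   # likelihood ratio tests, reusing the stored design matrix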
print.anova.rms
is the printing
method. plot.anova.rms draws dot charts depicting the importance of variables in the model, as measured by Wald or LR chi-square, chi-square minus d.f., AIC, P-values, partial R^2, R^2 for the whole model after deleting the effects in question, or proportion of overall model R^2 that is due to each predictor.
latex.anova.rms
is the latex
method. It
substitutes Greek/math symbols in column headings, uses boldface for
TOTAL
lines, and constructs a caption. Then it passes the result
to latex.default
for conversion to LaTeX.
When the anova table was converted to account for missing data
imputation by processMI
, a separate function prmiInfo
can
be used to print information related to imputation adjustments.
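A minimal sketch of that workflow, assuming imputation was done with the Hmisc aregImpute function and that fit.mult.impute was run with lrt=TRUE (data and variable names are hypothetical):

a  <- aregImpute(~ y + x1 + x2, data=d, n.impute=5)          # multiple imputation
f  <- fit.mult.impute(y ~ x1 + x2, lrm, a, data=d, lrt=TRUE)
an <- processMI(f, which='anova')   # anova table adjusted for imputation
an
prmiInfo(an)                        # details of the imputation adjustments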
For Bayesian models such as blrm
, anova
computes relative
explained variation indexes (REV) based on approximate Wald statistics.
This uses the variance-covariance matrix of all of the posterior draws,
and the individual draws of betas, plus an overall summary from the
posterior mode/mean/median beta. Wald chi-squares assuming multivariate
normality of betas are computed just as with frequentist models, and for
each draw (or for the summary) the ratio of the partial Wald chi-square
to the total Wald statistic for the model is computed as REV.
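A sketch, assuming the rmsb package (which provides blrm) is installed and that y, age, and sex exist:

require(rmsb)
bf <- blrm(y ~ rcs(age, 4) + sex)   # Bayesian proportional odds model
anova(bf)   # REV for each predictor, with highest posterior density intervals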
The print
method calls latex
or html
methods
depending on options(prType=)
. For
latex
a table
environment is not used and an ordinary
tabular
is produced. When using html with Quarto or RMarkdown,
results='asis'
need not be written in the chunk header.
html.anova.rms
just calls latex.anova.rms
.
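For example, to have the print method produce html, as when rendering a Quarto or RMarkdown document:

options(prType='html')
anova(f)                  # print method dispatches to html.anova.rms
options(prType='plain')   # restore plain-text printing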
## S3 method for class 'rms'
anova(object, ..., main.effect=FALSE, tol=.Machine$double.eps,
      test=c('F','Chisq','LR'), india=TRUE, indnl=TRUE, ss=TRUE,
      vnames=c('names','labels'),
      posterior.summary=c('mean', 'median', 'mode'),
      ns=500, cint=0.95, fitargs=NULL)

## S3 method for class 'anova.rms'
print(x, which=c('none','subscripts','names','dots'),
      table.env=FALSE, ...)

## S3 method for class 'anova.rms'
plot(x, what=c("chisqminusdf","chisq","aic","P","partial R2",
               "remaining R2", "proportion R2", "proportion chisq"),
     xlab=NULL, pch=16, rm.totals=TRUE, rm.ia=FALSE, rm.other=NULL,
     newnames, sort=c("descending","ascending","none"),
     margin=c('chisq','P'), pl=TRUE, trans=NULL, ntrans=40,
     height=NULL, width=NULL, ...)

## S3 method for class 'anova.rms'
latex(object, title, dec.chisq=2, dec.F=2, dec.ss=NA, dec.ms=NA,
      dec.P=4, dec.REV=3, table.env=TRUE, caption=NULL, fontsize=1,
      params, ...)

## S3 method for class 'anova.rms'
html(object, ...)
object | an rms fit object (for latex, the result of anova) |
... | If omitted, all variables are tested, yielding tests for individual factors and for pooled effects. Specify a subset of the variables to obtain tests for only those factors, with a pooled test for the combined effects of all factors listed. Names may be abbreviated; for example, specify anova(fit, age, cholesterol) to jointly test age and cholesterol (see the Examples). For plot, these can be optional graphical parameters passed to dotchart2. |
main.effect | Set to TRUE to print the (usually meaningless) main-effect tests even when a factor is involved in an interaction |
tol | singularity criterion for use in matrix inversion |
test | For an ols fit, set test='Chisq' to use Wald chi-square tests rather than F tests. For lrm, orm, cph, psm, and Glm fits, set test='LR' for likelihood ratio tests. |
india | set to FALSE to exclude individual tests of interaction from the table |
indnl | set to FALSE to exclude individual tests of nonlinearity from the table |
ss | For an ols fit, set ss=FALSE to suppress printing partial sums of squares and mean squares |
vnames | set to 'labels' to use variable labels rather than variable names in the table |
posterior.summary | specifies whether the posterior mode/mean/median beta are to be used as a measure of central tendency of the posterior distribution, for use in relative explained variation from Bayesian models |
ns | number of random samples from the posterior draws to use for REV highest posterior density intervals |
cint | HPD interval probability |
fitargs | a list of extra arguments to be passed to the fitter for LR tests |
x | for print and plot, the result of anova |
which | If which is not 'none' (the default), print.anova.rms adds a rightmost column listing the parameters being tested in each row, as coefficient subscripts, coefficient names, or dots |
what | what type of statistic to plot; the default is the Wald or LR chi-square minus its d.f. |
xlab | x-axis label; the default is constructed according to what |
pch | character for plotting dots in dot charts. Default is 16 (solid dot). |
rm.totals | set to FALSE to keep the TOTAL rows in the chart |
rm.ia | set to TRUE to omit interaction effects from the chart |
rm.other | a list of other predictor names to omit from the chart |
newnames | a list of substitute predictor names to use, after omitting any |
sort | default is to sort bars in descending order of the summary statistic. Available options: 'ascending', 'descending', 'none'. |
margin | set to a vector of character strings to write text for selected statistics in the right margin of the dot chart. The character strings can be any combination of 'chisq', 'd.f.', 'P', 'partial R2', 'proportion R2', and 'proportion chisq'. |
pl | set to FALSE to suppress plotting and only return the vector of statistics |
trans | set to a function to apply that transformation to the statistics being plotted, and to truncate negative values at zero. A good choice is trans=sqrt. |
ntrans | n argument passed to pretty for axis tick labeling when trans is in effect |
height, width | height and width of plotly charts, in pixels |
title | title to pass to latex; the default is derived from the name of the fit object |
dec.chisq | number of places to the right of the decimal point for typesetting chi-square values |
dec.F | digits to the right of the decimal point for F statistics |
dec.ss | digits to the right for sums of squares (default is NA) |
dec.ms | digits to the right for mean squares (default is NA) |
dec.P | digits to the right for P-values |
dec.REV | digits to the right for REV |
table.env | see latex |
caption | caption for table if table.env=TRUE |
fontsize | font size for html output; default is 1 (for 1em) |
params | used internally when called through print. |
If the statistics being plotted with plot.anova.rms
are few in
number and one of them is negative or zero, plot.anova.rms
will quit because of an error in dotchart2
.
The latex
method requires LaTeX packages relsize
and
needspace
.
anova.rms
returns a matrix of class anova.rms
containing factors
as rows and chi-square, d.f., and P-values as columns (or d.f., partial SS, MS, F, and P for ols fits). An attribute vinfo provides a list of the variables involved in each row and the type of test done.
plot.anova.rms
invisibly returns the vector of quantities
plotted. This vector has a names attribute describing the terms for
which the statistics in the vector are calculated.
print
prints, latex
creates a
file with a name of the form "title.tex"
(see the title
argument above).
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
prmiInfo
,
rms
, rmsMisc
, lrtest
,
rms.trans
, summary.rms
, plot.Predict
,
ggplot.Predict
, solvet
,
locator
,
dotchart2
, latex
,
xYplot
, anova.lm
,
contrast.rms
, pantext
require(ggplot2)
n <- 1000    # define sample size
set.seed(17) # so can reproduce the results
treat        <- factor(sample(c('a','b','c'), n, TRUE))
num.diseases <- sample(0:4, n, TRUE)
age          <- rnorm(n, 50, 10)
cholesterol  <- rnorm(n, 200, 25)
weight       <- rnorm(n, 150, 20)
sex          <- factor(sample(c('female','male'), n, TRUE))
label(age)          <- 'Age'   # label is in Hmisc
label(num.diseases) <- 'Number of Comorbid Diseases'
label(cholesterol)  <- 'Total Cholesterol'
label(weight)       <- 'Weight, lbs.'
label(sex)          <- 'Sex'
units(cholesterol)  <- 'mg/dl' # uses units.default in Hmisc

# Specify population model for log odds that Y=1
L <- .1*(num.diseases-2) + .045*(age-50) +
     (log(cholesterol - 10)-5.2)*(-2*(treat=='a') +
     3.5*(treat=='b') + 2*(treat=='c'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)

fit <- lrm(y ~ treat + scored(num.diseases) + rcs(age) +
           log(cholesterol+10) + treat:log(cholesterol+10),
           x=TRUE, y=TRUE)   # x, y needed for test='LR'
a <- anova(fit)                      # Test all factors
b <- anova(fit, treat, cholesterol)  # Test these 2 by themselves
                                     # to get their pooled effects
a
b
a2 <- anova(fit, test='LR')
b2 <- anova(fit, treat, cholesterol, test='LR')
a2
b2

# Add a new line to the plot with combined effects
s <- rbind(a2, 'treat+cholesterol'=b2['TOTAL',])
class(s) <- 'anova.rms'
plot(s, margin=c('chisq', 'proportion chisq'))

g  <- lrm(y ~ treat*rcs(age))
dd <- datadist(treat, num.diseases, age, cholesterol)
options(datadist='dd')
p  <- Predict(g, age, treat="b")
s  <- anova(g)
tx <- paste(capture.output(s), collapse='\n')
ggplot(p) + annotate('text', x=27, y=3.2, family='mono',
                     label=tx, hjust=0, vjust=1, size=1.5)
plot(s, margin=c('chisq', 'proportion chisq'))
# new plot - dot chart of chisq-d.f. with 2 other stats in right margin
# latex(s)   # nice printout - creates anova.g.tex
options(datadist=NULL)

# Simulate data from a given model, and display exactly which
# hypotheses are being tested
set.seed(123)
age   <- rnorm(500, 50, 15)
treat <- factor(sample(c('a','b','c'), 500, TRUE))
bp    <- rnorm(500, 120, 10)
y <- ifelse(treat=='a', (age-50)*.05, abs(age-50)*.08) + 3*(treat=='c') +
     pmax(bp, 100)*.09 + rnorm(500)
f <- ols(y ~ treat*lsp(age,50) + rcs(bp,4))
print(names(coef(f)), quote=FALSE)
specs(f)
anova(f)
an <- anova(f)
options(digits=3)
print(an, 'subscripts')
print(an, 'dots')

an <- anova(f, test='Chisq', ss=FALSE)
# plot(0:1)   # make some plot
# tab <- pantext(an, 1.2, .6, lattice=FALSE, fontfamily='Helvetica')
#               # create function to write table; usually omit fontfamily
# tab()       # execute it; could do tab(cex=.65)
plot(an)      # new plot - dot chart of chisq-d.f.
# Specify plot(an, trans=sqrt) to use a square root scale for this plot
# latex(an)   # nice printout - creates anova.f.tex

## Example to save partial R^2 for all predictors, along with overall
## R^2, from two separate fits, and to combine them with ggplot2
require(ggplot2)
set.seed(1)
n  <- 100
x1 <- runif(n)
x2 <- runif(n)
y  <- (x1-.5)^2 + x2 + runif(n)
group <- c(rep('a', n/2), rep('b', n/2))
A <- NULL
for(g in c('a','b')) {
  f <- ols(y ~ pol(x1,2) + pol(x2,2) + pol(x1,2) %ia% pol(x2,2),
           subset=group==g)
  a <- plot(anova(f), what='partial R2', pl=FALSE, rm.totals=FALSE,
            sort='none')
  a <- a[-grep('NONLINEAR', names(a))]
  d <- data.frame(group=g, Variable=factor(names(a), names(a)),
                  partialR2=unname(a))
  A <- rbind(A, d)
}
ggplot(A, aes(x=partialR2, y=Variable)) + geom_point() +
  facet_wrap(~ group) + xlab(ex <- expression(partial~R^2)) +
  scale_y_discrete(limits=rev)
ggplot(A, aes(x=partialR2, y=Variable, color=group)) + geom_point() +
  xlab(ex <- expression(partial~R^2)) + scale_y_discrete(limits=rev)

# Suppose that a researcher wants to make a big deal about a variable
# because it has the highest adjusted chi-square.  We use the
# bootstrap to derive 0.95 confidence intervals for the ranks of all
# the effects in the model.  We use the plot method for anova, with
# pl=FALSE to suppress actual plotting of chi-square - d.f. for each
# bootstrap repetition.
# It is important to tell plot.anova.rms not to sort the results, or
# every bootstrap replication would have ranks of 1,2,3,... for the stats.
n <- 300
set.seed(1)
d <- data.frame(x1=runif(n),  x2=runif(n),  x3=runif(n),
                x4=runif(n),  x5=runif(n),  x6=runif(n),
                x7=runif(n),  x8=runif(n),  x9=runif(n),
                x10=runif(n), x11=runif(n), x12=runif(n))
d$y <- with(d, 1*x1 + 2*x2 + 3*x3 + 4*x4 + 5*x5 + 6*x6 + 7*x7 +
               8*x8 + 9*x9 + 10*x10 + 11*x11 + 12*x12 + 9*rnorm(n))
f <- ols(y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12, data=d)
B <- 20   # actually use B=1000
ranks <- matrix(NA, nrow=B, ncol=12)
rankvars <- function(fit) rank(plot(anova(fit), sort='none', pl=FALSE))
Rank <- rankvars(f)
for(i in 1:B) {
  j <- sample(1:n, n, TRUE)
  bootfit <- update(f, data=d, subset=j)
  ranks[i,] <- rankvars(bootfit)
}
lim <- t(apply(ranks, 2, quantile, probs=c(.025,.975)))
predictor <- factor(names(Rank), names(Rank))
w <- data.frame(predictor, Rank, lower=lim[,1], upper=lim[,2])
ggplot(w, aes(x=predictor, y=Rank)) + geom_point() + coord_flip() +
  scale_y_continuous(breaks=1:12) +
  geom_errorbar(aes(ymin=lim[,1], ymax=lim[,2]), width=0)
Converts an 'Ocens' object to a data frame so that subsetting will preserve all needed attributes
## S3 method for class 'Ocens'
as.data.frame(x, row.names = NULL, optional = FALSE, ...)
x | an Ocens object |
row.names | optional vector of row names |
optional | set to TRUE if needed |
... | ignored |
a data frame containing a 2-column integer matrix, with the special attributes preserved
Frank Harrell
bj
fits the Buckley-James distribution-free least squares multiple
regression model to a possibly right-censored response variable.
This model reduces to ordinary least squares if
there is no censoring. By default, model fitting is done after
taking logs of the response variable.
bj
uses the rms
class
for automatic anova
, fastbw
, validate
, Function
, nomogram
,
summary
, plot
, bootcov
, and other functions. The bootcov
function may be worth using with bj
fits, as the properties of the
Buckley-James covariance matrix estimator are not fully known for
strange censoring patterns.
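A short sketch of that combination, with hypothetical variables; as elsewhere, the fit must store the design matrix and response:

require(survival)
f <- bj(Surv(ftime, event) ~ rcs(age, 3) + sex, x=TRUE, y=TRUE)
g <- bootcov(f, B=200)   # replace covariance matrix with bootstrap estimate
anova(g)                 # Wald tests using the bootstrap covariance matrix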
For the print
method, format of output is controlled by the
user previously running options(prType="lang")
where
lang
is "plain"
(the default), "latex"
, or
"html"
. When using html with Quarto or RMarkdown,
results='asis'
need not be written in the chunk header.
The residuals.bj
function exists mainly to compute
residuals and to censor them (i.e., return them as
Surv
objects) just as the original
failure time variable was censored. These residuals are useful for
checking to see if the model also satisfies certain distributional assumptions.
To get these residuals, the fit must have specified y=TRUE
.
The bjplot
function is a special plotting function for objects
created by bj
with x=TRUE, y=TRUE
in effect. It produces three
scatterplots for every covariate in the model: the first plots the
original situation, where censored data are distinguished from
non-censored data by a different plotting symbol. In the second plot,
called a renovated plot, vertical lines show how censored data were
changed by the procedure, and the third is equal to the second, but
without vertical lines. Imputed data are again distinguished from the
non-censored by a different symbol.
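For example (a sketch using hypothetical variables):

require(survival)
f <- bj(Surv(ftime, event) ~ age + bp, x=TRUE, y=TRUE)
bjplot(f)            # original, renovated, and no-line plots for each covariate
bjplot(f, which=1)   # restrict to the first design-matrix column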
The validate
method for bj
validates the Somers' Dxy
rank
correlation between predicted and observed responses, accounting for censoring.
The primary fitting function for bj
is bj.fit
, which does not
allow missing data and expects a full design matrix as input.
bj(formula, data=environment(formula), subset, na.action=na.delete,
   link="log", control, method='fit', x=FALSE, y=FALSE, time.inc)

## S3 method for class 'bj'
print(x, digits=4, long=FALSE, coefs=TRUE,
      title="Buckley-James Censored Data Regression", ...)

## S3 method for class 'bj'
residuals(object, type=c("censored","censored.normalized"), ...)

bjplot(fit, which=1:dim(X)[[2]])

## S3 method for class 'bj'
validate(fit, method="boot", B=40, bw=FALSE, rule="aic",
         type="residual", sls=.05, aics=0, force=NULL, estimates=TRUE,
         pr=FALSE, tol=1e-7, rel.tolerance=1e-3, maxiter=15, ...)

bj.fit(x, y, control)
formula | an S statistical model formula. Interactions up to third order are supported. The left hand side must be a Surv object. |
data, subset, na.action | the usual statistical model fitting arguments |
fit | a fit created by bj. For bjplot and validate the fit must have specified x=TRUE, y=TRUE. |
x | a design matrix with or without a first column of ones, to pass to bj.fit. For print, an object created by bj. |
y | a Surv object to pass to bj.fit |
link | set to, for example, "identity" to fit the untransformed response; the default "log" models the log of the response |
control | a list containing any or all of the iteration-control components, e.g. the maximum number of iterations allowed and the convergence criterion |
method | set to "model.frame" or "model.matrix" to skip fitting and return one of those objects instead |
time.inc | setting for default time spacing. Default is 30 if the time variable has units="Day", 1 otherwise. |
digits | number of significant digits to print if not 4. |
long | set to TRUE to print the correlation matrix of parameter estimates |
coefs | specify coefs=FALSE to suppress printing the table of model coefficients and standard errors; specify coefs=n to print only the first n regression coefficients |
title | a character string title to be passed to prModFit |
object | the result of bj |
type | type of residual desired. Default is censored unnormalized residuals, defined as link(Y) - linear.predictors, where the link function was usually the log function. You can specify type="censored.normalized" to divide the residuals by the estimate of sigma. |
which | vector of integers or character strings naming elements of the design matrix (the names of the original predictors if they entered the model linearly) for which to have bjplot make plots |
B, bw, rule, sls, aics, force, estimates, pr, tol, rel.tolerance, maxiter | see predab.resample |
... | ignored for print |
The program implements the algorithm as described in the original article by Buckley & James. Also, we have used the original Buckley & James prescription for computing the variance/covariance estimator. This is based on non-censored observations only and does not have any theoretical justification, but has been shown in simulation studies to behave well. Our experience confirms this view. Convergence is rather slow with this method, so you may want to increase the number of iterations. Our experience shows that often, in particular with high censoring, 100 iterations is not too many. Sometimes the method will not converge, but will instead enter a loop of repeating values (this is due to the discrete nature of the Kaplan-Meier estimator and usually happens with small sample sizes). The program will look for such a loop and return the average betas. It will also issue a warning message and give the size of the cycle (usually less than 6).
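A sketch of raising the iteration limit through the control list; iter.max and trace are assumed component names, not confirmed here:

require(survival)
f <- bj(Surv(ftime, event) ~ age + bp, x=TRUE, y=TRUE,
        control=list(iter.max=100,   # assumed name for the iteration limit
                     trace=TRUE))    # assumed name for iteration monitoring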
bj
returns a fit object with similar information to what survreg
,
psm
, cph
would store as well as what rms stores, plus the units and time.inc settings.
residuals.bj
returns a Surv
object. One of the components of the
fit
object produced by bj
(and bj.fit
) is a vector called
stats
which contains the following named elements: "Obs", "Events", "d.f.", "error d.f.", "sigma", "g". Here sigma is the estimate of the residual standard deviation and g is the g-index. If the link function is "log", the g-index on the anti-log scale is also returned as gr.
Janez Stare
Department of Biomedical Informatics
Ljubljana University
Ljubljana, Slovenia
[email protected]
Harald Heinzl
Department of Medical Computer Sciences
Vienna University
Vienna, Austria
[email protected]
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
Buckley JJ, James IR. Linear regression with censored data. Biometrika 1979; 66:429–36.
Miller RG, Halpern J. Regression with censored data. Biometrika 1982; 69: 521–31.
James IR, Smith PJ. Consistency results for linear regression with censored data. Ann Statist 1984; 12: 590–600.
Lai TL, Ying Z. Large sample theory of a modified Buckley-James estimator for regression analysis with censored data. Ann Statist 1991; 19: 1370–402.
Hillis SL. Residual plots for the censored data linear regression model. Stat in Med 1995; 14: 2023–2036.
Jin Z, Lin DY, Ying Z. On least-squares regression with censored data. Biometrika 2006; 93:147–161.
rms
, psm
, survreg
,
cph
, Surv
,
na.delete
,
na.detail.response
, datadist
,
rcorr.cens
, GiniMd
,
prModFit
, dxy.cens
require(survival)
suppressWarnings(RNGversion("3.5.0"))
set.seed(1)
ftime  <- 10*rexp(200)
stroke <- ifelse(ftime > 10, 0, 1)
ftime  <- pmin(ftime, 10)
units(ftime) <- "Month"
age <- rnorm(200, 70, 10)
hospital <- factor(sample(c('a','b'), 200, TRUE))
dd <- datadist(age, hospital)
options(datadist="dd")

# Prior to rms 6.0 and R 4.0 the following worked with 5 knots
f <- bj(Surv(ftime, stroke) ~ rcs(age,3) + hospital, x=TRUE, y=TRUE)
# add link="identity" to use a censored normal regression model instead
# of a lognormal one
anova(f)
fastbw(f)
validate(f, B=15)
plot(Predict(f, age, hospital))  # needs datadist since no explicit age,hosp.
coef(f)   # look at regression coefficients
coef(psm(Surv(ftime, stroke) ~ rcs(age,3) + hospital, dist='lognormal'))
          # compare with coefficients from likelihood-based
          # log-normal regression model
          # use dist='gau' not under R
r <- resid(f, 'censored.normalized')
survplot(npsurv(r ~ 1), conf='none')
          # plot Kaplan-Meier estimate of survival function
          # of standardized residuals
survplot(npsurv(r ~ cut2(age, g=2)), conf='none')
          # may desire both strata to be n(0,1)
options(datadist=NULL)
This function constructs an object resembling one produced by the
boot
package's boot
function, and runs that package's
boot.ci
function to compute BCa and percentile confidence limits.
bootBCa
can provide separate confidence limits for a vector of
statistics when estimate
has length greater than 1. In that
case, estimates
must have the same number of columns as
estimate
has values.
bootBCa(estimate, estimates, type=c('percentile','bca','basic'), n, seed, conf.int = 0.95)
estimate |
original whole-sample estimate |
estimates |
vector of bootstrap estimates |
type |
type of confidence interval, defaulting to nonparametric percentile |
n |
original number of observations |
seed |
the value of .Random.seed that was in effect before the bootstrap estimates were computed |
conf.int |
confidence level |
a 2-vector if estimate
is of length 1, otherwise a matrix
with 2 rows and number of columns equal to the length of
estimate
You can use if(!exists('.Random.seed')) runif(1)
before running
your bootstrap to make sure that .Random.seed
will be available
to bootBCa
.
Frank Harrell
## Not run:
x1 <- runif(100); x2 <- runif(100); y <- sample(0:1, 100, TRUE)
f <- lrm(y ~ x1 + x2, x=TRUE, y=TRUE)
seed <- .Random.seed
b <- bootcov(f)
# Get estimated log odds at x1=.4, x2=.6
X <- cbind(c(1,1), x1=c(.4,2), x2=c(.6,3))
est  <- X %*% coef(b)
ests <- t(X %*% t(b$boot.Coef))
bootBCa(est, ests, n=100, seed=seed)
bootBCa(est, ests, type='bca',   n=100, seed=seed)
bootBCa(est, ests, type='basic', n=100, seed=seed)
## End(Not run)
bootcov
computes a bootstrap estimate of the covariance matrix for a set
of regression coefficients from ols
, lrm
, cph
,
psm
, Rq
, and any
other fit where x=TRUE, y=TRUE
was used to store the data used in making
the original regression fit and where an appropriate fitter
function
is provided here. The estimates obtained are not conditional on
the design matrix, but are instead unconditional estimates. For
small sample sizes, this will make a difference as the unconditional
variance estimates are larger. This function will also obtain
bootstrap estimates corrected for cluster sampling (intra-cluster
correlations) when a "working independence" model was used to fit
data which were correlated within clusters. This is done by substituting
cluster sampling with replacement for the usual simple sampling with
replacement. bootcov
has an option (coef.reps
) that causes all
of the regression coefficient estimates from all of the bootstrap
re-samples to be saved, facilitating computation of nonparametric
bootstrap confidence limits and plotting of the distributions of the
coefficient estimates (using histograms and kernel smoothing estimates).
The loglik
option facilitates the calculation of simultaneous
confidence regions from quantities of interest that are functions of
the regression coefficients, using the method of Tibshirani (1996). With Tibshirani's method, one computes the objective criterion (-2 log likelihood evaluated at the bootstrap estimate of beta but with respect to the original design matrix and response vector) for the original fit as well as for all of the bootstrap fits. The confidence set of the regression coefficients is the set of all coefficients that are associated with objective function values that are less than or equal to, say, the 0.95 quantile of the vector of B + 1 objective function values. For the coefficients satisfying this condition,
predicted values are computed at a user-specified design matrix X
,
and minima and maxima of these predicted values (over the qualifying
bootstrap repetitions) are computed to derive the final simultaneous
confidence band.
The bootplot
function takes the output of bootcov
and
either plots a histogram and kernel density
estimate of specified regression coefficients (or linear combinations
of them through the use of a specified design matrix X
), or a
qqnorm
plot of the quantities of interest to check for normality of
the maximum likelihood estimates. bootplot
draws vertical lines at
specified quantiles of the bootstrap distribution, and returns these
quantiles for possible printing by the user. Bootstrap estimates may
optionally be transformed by a user-specified function fun
before
plotting.
The confplot
function also uses the output of bootcov
but to
compute and optionally plot nonparametric bootstrap pointwise confidence
limits or (by default) Tibshirani (1996) simultaneous confidence sets.
A design matrix must be specified to allow confplot
to compute
quantities of interest such as predicted values across a range
of values or differences in predicted values (plots of effects of
changing one or more predictor variable values).
bootplot
and confplot
are actually generic functions, with
the particular functions bootplot.bootcov
and confplot.bootcov
automatically invoked for bootcov
objects.
A service function called histdensity
is also provided (for use with
bootplot
). It runs hist and density on the same plot, using twice the default number of classes for hist and 1.5 times the default width used by density.
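histdensity can also be called directly; a small self-contained sketch with simulated data:

set.seed(1)
histdensity(rnorm(500), xlab='Simulated bootstrap statistic')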
A comprehensive example demonstrates the use of all of the functions.
When bootstrapping an ordinal model for a numeric Y (when ytarget is not specified), some original distinct Y values may not be sampled, so there will be fewer intercepts in the model. bootcov
linearly interpolates and extrapolates to fill in the missing intercepts so that the intercepts are aligned over bootstrap samples. Also see the Hmisc
ordGroupBoot
function.
bootcov(fit, cluster, B=200, fitter, coef.reps=TRUE, loglik=FALSE,
        pr=FALSE, group=NULL, stat=NULL, seed=sample(10000, 1),
        ytarget=NULL, ...)

bootplot(obj, which=1 : ncol(Coef), X, conf.int=c(.9,.95,.99),
         what=c('density', 'qqnorm', 'box'),
         fun=function(x) x, labels., ...)

confplot(obj, X, against, method=c('simultaneous','pointwise'),
         conf.int=0.95, fun=function(x) x, add=FALSE, lty.conf=2, ...)

histdensity(y, xlab, nclass, width, mult.width=1, ...)
fit | a fit object containing components x and y (the stored design matrix and response) |
obj | an object created by bootcov |
X | a design matrix specified to bootplot or confplot |
y | a vector to pass to histdensity |
cluster | a variable indicating groupings. Unique values identify possibly-correlated clusters of observations. |
B | number of bootstrap repetitions. Default is 200. |
fitter | the name of a function with arguments x and y that fits bootstrap samples; the default is determined from the class of fit |
coef.reps | set to FALSE to not save all of the bootstrap regression coefficient estimates |
loglik | set to TRUE to store the -2 log likelihood for each bootstrap fit, evaluated with respect to the original design matrix and response (see the discussion of Tibshirani's method above) |
pr | set to TRUE to print the progress of the bootstrap repetitions |
group | a grouping variable used to stratify the sample upon bootstrapping. This allows one to handle k-sample problems, i.e., each bootstrap sample will be forced to select the same number of observations from each level of group as the number appearing in the original dataset. You may specify both group and cluster. |
stat | a single character string specifying the name of a stats element produced by the fitting function to save over the bootstrap repetitions, e.g. 'C' (see the Examples) |
seed | random number seed for set.seed; defaults to a random integer |
ytarget | when using bootcov on an ordinal model fit, set ytarget to a single response value to retain only the intercept corresponding to it across bootstrap fits; ytarget=NA uses the median (see the Examples) |
which | one or more integers specifying which regression coefficients to plot for bootplot |
conf.int | a vector (for bootplot, default is c(.9, .95, .99)) or a scalar (for confplot, default is 0.95) confidence level |
what | for bootplot, the type of plot: kernel density estimates, qqnorm plots, or box plots |
fun | for bootplot or confplot, a function used to transform the quantities of interest before plotting or computing limits, e.g. fun=exp to obtain anti-logs |
labels. | a vector of labels for labeling the axes in plots produced by bootplot |
... | for bootcov, extra arguments to pass to the fitter; otherwise, parameters passed to plotting functions |
against | for confplot, the variable to plot on the x-axis |
method | specifies whether confplot computes 'simultaneous' (the default) or 'pointwise' confidence limits |
add | set to TRUE to add to an existing plot |
lty.conf | line type for plotting confidence bands in confplot. Default is 2 (dashed). |
xlab | label for x-axis for histdensity |
nclass | passed to hist if present |
width | passed to density if present |
mult.width | multiplier by which to adjust the default width passed to density. Default is 1. |
If the fit has a scale parameter (e.g., a fit from psm
), the logs of the individual bootstrap scale estimates are added to the vector of parameter estimates, and a column and row for the log scale are
added to the new covariance matrix (the old covariance matrix also
has this row and column).
For Rq
fits, the tau
, method
, and hs
arguments are taken from the original fit.
a new fit object with class of the original object and with the element
orig.var
added. orig.var
is
the covariance matrix of the original fit. Also, the original var
component is replaced with the new bootstrap estimates. The component
boot.coef
is also added. This contains the mean bootstrap estimates
of regression coefficients (with a log scale element added if
applicable). boot.Coef
is added if coef.reps=TRUE
.
boot.loglik
is added if loglik=TRUE
. If stat
is
specified an additional vector boot.stats
will be contained in
the returned object. B
contains the number of successfully fitted
bootstrap resamples. A component
clusterInfo
is added to contain elements name
and n
holding the name of the cluster
variable and the number of clusters.
bootplot
returns a vector (possibly a matrix) of quantities of interest and
the requested quantiles of them. confplot
returns three vectors:
fitted
, lower
, and upper
.
bootcov
prints the progress of the resampling if pr=TRUE.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
Bill Pikounis
Biometrics Research Department
Merck Research Laboratories
https://billpikounis.com/wpb/
Feng Z, McLerran D, Grizzle J (1996): A comparison of statistical methods for clustered data analysis with Gaussian error. Stat in Med 15:1793–1806.
Tibshirani R, Knight K (1996): Model search and inference by bootstrap
"bumping". Department of Statistics, University of Toronto. Technical
report available from
http://www-stat.stanford.edu/~tibs/.
Presented at the Joint Statistical Meetings,
Chicago, August 1996.
ordGroupBoot
,
robcov
, sample
, rms
,
lm.fit
, lrm.fit
, orm.fit
,
survival-internal
,
predab.resample
, rmsMisc
,
Predict
, gendata
,
contrast.rms
, Predict
, setPb
,
multiwayvcov::cluster.boot
set.seed(191)
x <- exp(rnorm(200))
logit <- 1 + x/2
y <- ifelse(runif(200) <= plogis(logit), 1, 0)
f <- lrm(y ~ pol(x,2), x=TRUE, y=TRUE)
g <- bootcov(f, B=50, pr=TRUE, seed=3)
anova(g)    # using bootstrap covariance estimates
fastbw(g)   # using bootstrap covariance estimates
beta <- g$boot.Coef[,1]
hist(beta, nclass=15)   # look at normality of parameter estimates
qqnorm(beta)
# bootplot would be better than these last two commands

# A dataset contains a variable number of observations per subject,
# and all observations are laid out in separate rows. The responses
# represent whether or not a given segment of the coronary arteries
# is occluded. Segments of arteries may not operate independently
# in the same patient. We assume a "working independence model" to
# get estimates of the coefficients, i.e., that estimates assuming
# independence are reasonably efficient. The job is then to get
# unbiased estimates of variances and covariances of these estimates.

set.seed(2)
n.subjects <- 30
ages  <- rnorm(n.subjects, 50, 15)
sexes <- factor(sample(c('female','male'), n.subjects, TRUE))
logit <- (ages-50)/5
prob  <- plogis(logit)   # true prob not related to sex
id  <- sample(1:n.subjects, 300, TRUE)  # subjects sampled multiple times
table(table(id))                        # frequencies of number of obs/subject
age <- ages[id]
sex <- sexes[id]
# In truth, observations within subject are independent:
y <- ifelse(runif(300) <= prob[id], 1, 0)
f <- lrm(y ~ lsp(age,50)*sex, x=TRUE, y=TRUE)
g <- bootcov(f, id, B=50, seed=3)   # usually do B=200 or more
diag(g$var)/diag(f$var)
# add ,group=w to re-sample from within each level of w
anova(g)    # cluster-adjusted Wald statistics
# fastbw(g) # cluster-adjusted backward elimination
plot(Predict(g, age=30:70, sex='female'))  # cluster-adjusted confidence bands

# Get design effects based on inflation of the variances when compared
# with bootstrap estimates which ignore clustering
g2 <- bootcov(f, B=50, seed=3)
diag(g$var)/diag(g2$var)

# Get design effects based on pooled tests of factors in model
anova(g2)[,1] / anova(g)[,1]

# Simulate binary data where there is a strong
# age x sex interaction with linear age effects
# for both sexes, but where not knowing that
# we fit a quadratic model.  Use the bootstrap
# to get bootstrap distributions of various
# effects, and to get pointwise and simultaneous
# confidence limits

set.seed(71)
n   <- 500
age <- rnorm(n, 50, 10)
sex <- factor(sample(c('female','male'), n, rep=TRUE))
L   <- ifelse(sex=='male', 0, .1*(age-50))
y   <- ifelse(runif(n) <= plogis(L), 1, 0)

f <- lrm(y ~ sex*pol(age,2), x=TRUE, y=TRUE)
b <- bootcov(f, B=50, loglik=TRUE, pr=TRUE, seed=3)   # better: B=500

par(mfrow=c(2,3))
# Assess normality of regression estimates
bootplot(b, which=1:6, what='qq')
# They appear somewhat non-normal

# Plot histograms and estimated densities for 6 coefficients
w <- bootplot(b, which=1:6)
# Print bootstrap quantiles
w$quantiles
# Show box plots for bootstrap reps for all coefficients
bootplot(b, what='box')

# Estimate regression function for females for a sequence of ages
ages <- seq(25, 75, length=100)
label(ages) <- 'Age'

# Plot fitted function and pointwise normal-theory confidence bands
par(mfrow=c(1,1))
p <- Predict(f, age=ages, sex='female')
plot(p)
# Save curve coordinates for later automatic
# labeling using labcurve in the Hmisc library
curves <- vector('list', 8)
curves[[1]] <- with(p, list(x=age, y=lower))
curves[[2]] <- with(p, list(x=age, y=upper))

# Add pointwise normal-distribution confidence
# bands using unconditional variance-covariance
# matrix from the bootstrap reps
p <- Predict(b, age=ages, sex='female')
curves[[3]] <- with(p, list(x=age, y=lower))
curves[[4]] <- with(p, list(x=age, y=upper))

dframe <- expand.grid(sex='female', age=ages)
X <- predict(f, dframe, type='x')   # Full design matrix

# Add pointwise bootstrap nonparametric confidence limits
p <- confplot(b, X=X, against=ages, method='pointwise',
              add=TRUE, lty.conf=4)
curves[[5]] <- list(x=ages, y=p$lower)
curves[[6]] <- list(x=ages, y=p$upper)

# Add simultaneous bootstrap confidence band
p <- confplot(b, X=X, against=ages, add=TRUE, lty.conf=5)
curves[[7]] <- list(x=ages, y=p$lower)
curves[[8]] <- list(x=ages, y=p$upper)
lab <- c('a','a','b','b','c','c','d','d')
labcurve(curves, lab, pl=TRUE)

# Now get bootstrap simultaneous confidence set for
# female:male odds ratios for a variety of ages
dframe <- expand.grid(age=ages, sex=c('female','male'))
X <- predict(f, dframe, type='x')   # design matrix
f.minus.m <- X[1:100,] - X[101:200,]
# First 100 rows are for females.  By subtracting
# design matrices we are able to get Xf*Beta - Xm*Beta
# = (Xf - Xm)*Beta

confplot(b, X=f.minus.m, against=ages, method='pointwise',
         ylab='F:M Log Odds Ratio')
confplot(b, X=f.minus.m, against=ages, lty.conf=3, add=TRUE)

# contrast.rms makes it easier to compute the design matrix for use
# in bootstrapping contrasts:
f.minus.m <- contrast(f, list(sex='female', age=ages),
                         list(sex='male',   age=ages))$X
confplot(b, X=f.minus.m)

# For a quadratic binary logistic regression model use bootstrap
# bumping to estimate coefficients under a monotonicity constraint
set.seed(177)
n <- 400
x <- runif(n)
logit <- 3*(x^2-1)
y <- rbinom(n, size=1, prob=plogis(logit))
f <- lrm(y ~ pol(x,2), x=TRUE, y=TRUE)
k <- coef(f)
k
vertex <- -k[2]/(2*k[3])
vertex
# Outside [0,1] so fit satisfies monotonicity constraint within
# x in [0,1], i.e., original fit is the constrained MLE

g <- bootcov(f, B=50, coef.reps=TRUE, loglik=TRUE, seed=3)
bootcoef <- g$boot.Coef   # B x 3 matrix of bootstrap coefficients
vertex <- -bootcoef[,2]/(2*bootcoef[,3])
table(cut2(vertex, c(0,1)))
mono <- !(vertex >= 0 & vertex <= 1)
mean(mono)            # estimate of Prob{monotonicity in [0,1]}
var(bootcoef)         # var-cov matrix for unconstrained estimates
var(bootcoef[mono,])  # for constrained estimates

# Find second-best vector of coefficient estimates, i.e., best
# from among bootstrap estimates
g$boot.Coef[order(g$boot.loglik[-length(g$boot.loglik)])[1],]
# Note closeness to MLE

## Not run:
# Get the bootstrap distribution of the difference in two ROC areas for
# two binary logistic models fitted on the same dataset.  This analysis
# does not adjust for the bias ROC area (C-index) due to overfitting.
# The same random number seed is used in two runs to enforce pairing.
set.seed(17)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- sample(0:1, 100, TRUE)
f <- lrm(y ~ x1,      x=TRUE, y=TRUE)
g <- lrm(y ~ x1 + x2, x=TRUE, y=TRUE)
f <- bootcov(f, stat='C', seed=4)
g <- bootcov(g, stat='C', seed=4)
dif <- g$boot.stats - f$boot.stats
hist(dif)
quantile(dif, c(.025,.25,.5,.75,.975))
# Compute a z-test statistic.  Note that comparing ROC areas is far less
# powerful than likelihood or Brier score-based methods
z <- (g$stats['C'] - f$stats['C'])/sd(dif)
names(z) <- NULL
c(z=z, P=2*pnorm(-abs(z)))

# For an ordinal y with some distinct values of y not very popular, let
# bootcov use linear extrapolation to fill in intercepts for
# non-sampled levels
f <- orm(y ~ x1 + x2, x=TRUE, y=TRUE)
bootcov(f, B=200)
# Instead of filling in missing intercepts, perform minimum binning so
# that there is a 0.9999 probability that all distinct Y values will be
# represented in bootstrap samples
y <- ordGroupBoot(y)
f <- orm(y ~ x1 + x2, x=TRUE, y=TRUE)
bootcov(f, B=200)
# Instead just keep one intercept for all bootstrap fits - the intercept
# that pertains to y=10
bootcov(f, B=200, ytarget=10)   # use ytarget=NA for the median
## End(Not run)
Uses lattice graphics and the output from Predict
to plot image,
contour, or perspective plots showing the simultaneous effects of two
continuous predictor variables. Unless formula is provided, the x-axis is constructed from the first variable listed in the call to Predict and the y-axis variable comes from the second.
The perimeter
function is used to generate the boundary of data
to plot when a 3-d plot is made. It finds the area where there are
sufficient data to generate believable interaction fits.
bplot(x, formula, lfun=lattice::levelplot, xlab, ylab, zlab,
      adj.subtitle=!info$ref.zero, cex.adj=.75, cex.lab=1,
      perim, showperim=FALSE, zlim=range(yhat, na.rm=TRUE),
      scales=list(arrows=FALSE), xlabrot, ylabrot, zlabrot=90, ...)

perimeter(x, y, xinc=diff(range(x))/10, n=10, lowess.=TRUE)
x | for bplot, an object created by Predict; for perimeter, the first variable of a pair of continuous predictors |
formula | an optional formula of the form yhat ~ x1*x2, where x1 and x2 are two of the predictors varied in the call to Predict, e.g. bplot(p, plogis(yhat) ~ age*cholesterol) to plot on the probability scale |
lfun | a high-level lattice plotting function that takes formulas of the form z ~ x*y. The default is levelplot (an image plot); contourplot and wireframe are other common choices. |
xlab | Character string label for the x-axis. Default is constructed by Predict. |
ylab | Character string label for the y-axis. Default is constructed by Predict. |
zlab | Character string label for the z-axis (the predicted values) |
adj.subtitle | Set to FALSE to suppress subtitling the graph with the list of settings of non-graphed adjustment variables |
cex.adj | cex parameter for the adjustment-settings subtitle. Default is 0.75. |
cex.lab | cex parameter for axis labels. Default is 1. |
perim | names a matrix created by perimeter; predictions outside the perimeter (where data are too sparse) are set to NA |
showperim | set to TRUE to show the perimeter on the plot |
zlim | Controls the range for plotting on the z-axis. Computed by default from the range of predicted values. |
scales | see wireframe |
xlabrot | rotation angle for the x-axis label. Default is 30 for wireframe. |
ylabrot | rotation angle for the y-axis label. Default is -40 for wireframe. |
zlabrot | rotation angle for the z-axis label for wireframe. Default is 90. |
... | other arguments to pass to the lattice function |
y | second variable of the pair for perimeter |
xinc | increment in x over which the density of y is examined |
n | within intervals of x, the number of smallest and largest y values retained when forming the perimeter |
lowess. | set to FALSE to suppress lowess smoothing of the perimeter |
perimeter
is a kind of generalization of datadist
for 2
continuous variables. First, the n
smallest and largest x
values are determined. These form the lowest and highest possible
x
s to display. Then x
is grouped into intervals bounded
by these two numbers, with the interval widths defined by xinc
.
Within each interval, y is sorted and the nth smallest and largest y are taken as the interval containing sufficient data
density to plot interaction surfaces. The interval is ignored when
there are insufficient y
values. When the data are being
readied for persp
, bplot
uses the approx
function to do
linear interpolation of the y
-boundaries as a function of the
x
values actually used in forming the grid (the values of the
first variable specified to Predict
). To make the perimeter smooth,
specify lowess.=TRUE
to perimeter
.
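A minimal sketch (assuming continuous vectors age and cholesterol and a fit, as in the Examples below) of computing and using the boundary:

bounds <- perimeter(age, cholesterol, n=10, lowess.=TRUE)
plot(age, cholesterol)   # show the bivariate data density
lines(bounds)            # dispatches to lines.perimeter to draw the outline
# suppress estimates outside the perimeter when plotting predictions
bplot(Predict(fit, age, cholesterol), perim=bounds)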
perimeter
returns a matrix of class perimeter
. This
outline can be conveniently plotted by lines.perimeter
.
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
datadist
, Predict
,
rms
, rmsMisc
, levelplot
,
contourplot
, wireframe
n <- 1000 # define sample size set.seed(17) # so can reproduce the results age <- rnorm(n, 50, 10) blood.pressure <- rnorm(n, 120, 15) cholesterol <- rnorm(n, 200, 25) sex <- factor(sample(c('female','male'), n,TRUE)) label(age) <- 'Age' # label is in Hmisc label(cholesterol) <- 'Total Cholesterol' label(blood.pressure) <- 'Systolic Blood Pressure' label(sex) <- 'Sex' units(cholesterol) <- 'mg/dl' # uses units.default in Hmisc units(blood.pressure) <- 'mmHg' # Specify population model for log odds that Y=1 L <- .4*(sex=='male') + .045*(age-50) + (log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male')) # Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)] y <- ifelse(runif(n) < plogis(L), 1, 0) ddist <- datadist(age, blood.pressure, cholesterol, sex) options(datadist='ddist') fit <- lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)), x=TRUE, y=TRUE) p <- Predict(fit, age, cholesterol, sex, np=50) # vary sex last require(lattice) bplot(p) # image plot for age, cholesterol with color # coming from yhat; use default ranges for # both continuous predictors; two panels (for sex) bplot(p, lfun=wireframe) # same as bplot(p,,wireframe) # View from different angle, change y label orientation accordingly # Default is z=40, x=-60 bplot(p,, wireframe, screen=list(z=40, x=-75), ylabrot=-25) bplot(p,, contourplot) # contour plot bounds <- perimeter(age, cholesterol, lowess=TRUE) plot(age, cholesterol) # show bivariate data density and perimeter lines(bounds[,c('x','ymin')]); lines(bounds[,c('x','ymax')]) p <- Predict(fit, age, cholesterol) # use only one sex bplot(p, perim=bounds) # draws image() plot # don't show estimates where data are sparse # doesn't make sense here since vars don't interact bplot(p, plogis(yhat) ~ age*cholesterol) # Probability scale options(datadist=NULL)
Uses bootstrapping or cross-validation to get bias-corrected (overfitting-
corrected) estimates of predicted vs. observed values based on
subsetting predictions into intervals (for survival models) or on
nonparametric smoothers (for other models). There are calibration
functions for Cox (cph
), parametric survival models (psm
),
binary and ordinal logistic models (lrm
) and ordinary least
squares (ols
). For survival models,
"predicted" means predicted survival probability at a single
time point, and "observed" refers to the corresponding Kaplan-Meier
survival estimate, stratifying on intervals of predicted survival, or,
if the polspline
package is installed, the predicted survival
probability as a function of transformed predicted survival probability
using the flexible hazard regression approach (see the val.surv
function for details). For logistic and linear models, a nonparametric
calibration curve is estimated over a sequence of predicted values. The
fit must have specified x=TRUE, y=TRUE
. The print
and
plot
methods for lrm
and ols
models (which use
calibrate.default
) print the mean absolute error in predictions,
the mean squared error, and the 0.9 quantile of the absolute error.
Here, error refers to the difference between the predicted values and
the corresponding bias-corrected calibrated values.
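As a minimal sketch (assuming a hypothetical binary logistic fit on y, x1, x2), these error summaries appear in the printed output:

f <- lrm(y ~ x1 + x2, x=TRUE, y=TRUE)   # x=TRUE, y=TRUE are required
cal <- calibrate(f, B=200)              # bootstrap overfitting correction
print(cal)  # mean |error|, mean squared error, 0.9 quantile of |error|
plot(cal)   # apparent and bias-corrected calibration curves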
Below, the second, third, and fourth invocations of calibrate
are, respectively, for ols
and lrm
, cph
, and
psm
. The first and second plot
invocations are, respectively, for lrm
and ols
fits and for all other fits.
calibrate(fit, ...) ## Default S3 method: calibrate(fit, predy, method=c("boot","crossvalidation",".632","randomization"), B=40, bw=FALSE, rule=c("aic","p"), type=c("residual","individual"), sls=.05, aics=0, force=NULL, estimates=TRUE, pr=FALSE, kint, smoother="lowess", digits=NULL, ...) ## S3 method for class 'cph' calibrate(fit, cmethod=c('hare', 'KM'), method="boot", u, m=150, pred, cuts, B=40, bw=FALSE, rule="aic", type="residual", sls=0.05, aics=0, force=NULL, estimates=TRUE, pr=FALSE, what="observed-predicted", tol=1e-12, maxdim=5, ...) ## S3 method for class 'psm' calibrate(fit, cmethod=c('hare', 'KM'), method="boot", u, m=150, pred, cuts, B=40, bw=FALSE,rule="aic", type="residual", sls=.05, aics=0, force=NULL, estimates=TRUE, pr=FALSE, what="observed-predicted", tol=1e-12, maxiter=15, rel.tolerance=1e-5, maxdim=5, ...) ## S3 method for class 'calibrate' print(x, B=Inf, ...) ## S3 method for class 'calibrate.default' print(x, B=Inf, ...) ## S3 method for class 'calibrate' plot(x, xlab, ylab, subtitles=TRUE, conf.int=TRUE, cex.subtitles=.75, riskdist=TRUE, add=FALSE, scat1d.opts=list(nhistSpike=200), par.corrected=NULL, ...) ## S3 method for class 'calibrate.default' plot(x, xlab, ylab, xlim, ylim, legend=TRUE, subtitles=TRUE, cex.subtitles=.75, riskdist=TRUE, scat1d.opts=list(nhistSpike=200), ...)
fit |
a fit from |
x |
an object created by |
method , B , bw , rule , type , sls , aics , force , estimates
|
see |
cmethod |
method for validating survival predictions using
right-censored data. The default is |
u |
the time point for which to validate predictions for survival
models. For |
m |
group predicted |
pred |
vector of predicted survival probabilities at which to evaluate the
calibration curve. By default, the low and high prediction values
from |
cuts |
actual cut points for predicted survival probabilities. You may
specify only one of |
pr |
set to |
what |
The default is |
tol |
criterion for matrix singularity (default is |
maxdim |
see |
maxiter |
for |
rel.tolerance |
parameter passed to
|
predy |
a scalar or vector of predicted values to calibrate (for |
kint |
For an ordinal logistic model the default predicted
probability that |
smoother |
a function in two variables which produces |
digits |
If specified, predicted values are rounded to
|
... |
other arguments to pass to |
xlab |
defaults to "Predicted x-units Survival" or to a suitable label for other models |
ylab |
defaults to "Fraction Surviving x-units" or to a suitable label for other models |
xlim , ylim
|
2-vectors specifying x- and y-axis limits, if not using defaults |
subtitles |
set to |
conf.int |
set to |
cex.subtitles |
character size for plotting subtitles |
riskdist |
set to |
add |
set to |
scat1d.opts |
a list specifying options to send to |
par.corrected |
a list specifying graphics parameters |
legend |
set to |
If the fit was created using penalized maximum likelihood estimation,
the same penalty
and penalty.scale
parameters are used during
validation.
matrix specifying mean predicted survival in each interval, the
corresponding estimated bias-corrected Kaplan-Meier estimates,
number of subjects, and other statistics. For linear and logistic models,
the matrix instead has rows corresponding to the prediction points, and
the vector of predicted values being validated is returned as an attribute.
The returned object has class "calibrate"
or
"calibrate.default"
.
plot.calibrate.default
invisibly returns the vector of estimated
prediction errors corresponding to the dataset used to fit the model.
As a side effect, calibrate prints, and stores an object pred.obs
or .orig.cal
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
validate
, predab.resample
,
groupkm
, errbar
,
scat1d
, cph
, psm
,
lowess
,fit.mult.impute
,
processMI
require(survival) set.seed(1) n <- 200 d.time <- rexp(n) x1 <- runif(n) x2 <- factor(sample(c('a', 'b', 'c'), n, TRUE)) f <- cph(Surv(d.time) ~ pol(x1,2) * x2, x=TRUE, y=TRUE, surv=TRUE, time.inc=1.5) # or f <- psm(S ~ ...) pa <- requireNamespace('polspline') if(pa) { cal <- calibrate(f, u=1.5, B=20) # cmethod='hare' plot(cal) } cal <- calibrate(f, u=1.5, cmethod='KM', m=50, B=20) # usually B=200 or 300 plot(cal, add=pa) set.seed(1) y <- sample(0:2, n, TRUE) x1 <- runif(n) x2 <- runif(n) x3 <- runif(n) x4 <- runif(n) f <- lrm(y ~ x1 + x2 + x3 * x4, x=TRUE, y=TRUE) cal <- calibrate(f, kint=2, predy=seq(.2, .8, length=60), group=y) # group= does k-sample validation: make resamples have same # numbers of subjects in each level of y as original sample plot(cal) # See the example for the validate function for a method of validating # continuation ratio ordinal logistic models. You can do the same # thing for calibrate
This function computes one or more contrasts of the estimated
regression coefficients in a fit from one of the functions in rms,
along with standard errors, confidence limits, t or Z statistics, and P-values.
General contrasts are handled by obtaining the design matrix for two
sets of predictor settings (a
, b
) and subtracting the
corresponding rows of the two design matrices to obtain a new contrast
design matrix for testing the a
- b
differences. This allows for
quite general contrasts (e.g., estimated differences in means between
a 30 year old female and a 40 year old male).
This can also be used
to obtain a series of contrasts in the presence of interactions (e.g.,
female:male log odds ratios for several ages when the model contains
age by sex interaction). Another use of contrast
is to obtain
center-weighted (Type III test) and subject-weighted (Type II test)
estimates in a model containing treatment by center interactions. For
the latter case, you can specify type="average"
and an optional
weights
vector to average the within-center treatment contrasts.
The design contrast matrix computed by contrast.rms
can be used
by other functions.
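For example, here is a sketch (assuming a logistic fit f and a corresponding bootcov result b, as in the bootcov examples) of reusing the contrast design matrix:

ages <- 30:50
k <- contrast(f, list(sex='female', age=ages),
                 list(sex='male',   age=ages))
# row i of k$X is the female:male contrast design at ages[i]
confplot(b, X=k$X, against=ages, method='pointwise',
         ylab='F:M Log Odds Ratio')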
When the model was fitted by a Bayesian function such as blrm
,
highest posterior density intervals for contrasts are computed instead, along with the
posterior probability that the contrast is positive.
posterior.summary
specifies whether posterior mean/median/mode is
to be used for contrast point estimates.
contrast.rms
also allows one to specify four settings to
contrast, yielding contrasts that are double differences - the
difference between the first two settings (a
- b
) and the
last two (a2
- b2
). This allows assessment of interactions.
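A sketch of such a double difference (assuming a hypothetical fit f containing a treat by sex interaction); the result is the interaction contrast, (drug - placebo in females) minus (drug - placebo in males):

contrast(f, list(treat='drug',    sex='female'),   # a
            list(treat='placebo', sex='female'),   # b
            list(treat='drug',    sex='male'),     # a2
            list(treat='placebo', sex='male'))     # b2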
If usebootcoef=TRUE
, the fit was run through bootcov
, and
conf.type="individual"
, the confidence intervals are bootstrap
nonparametric percentile confidence intervals, basic bootstrap, or BCa
intervals, obtained on contrasts evaluated on all bootstrap samples.
By omitting the b
argument, contrast
can be used to obtain
an average or weighted average of a series of predicted values, along
with a confidence interval for this average. This can be useful for
"unconditioning" on one of the predictors (see the next to last
example).
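A sketch of such an average (assuming a hypothetical fit with predictors age, sex, and race), using observed race frequencies as weights:

# mean response for a 50 year old male, averaged over the observed
# race distribution, with a confidence interval for the average
contrast(f, list(age=50, sex='male', race=levels(race)),
         type='average', weights=table(race))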
Specifying type="joint"
, and specifying at least as many contrasts
as needed to span the space of a complex test, one can make
multiple degree of freedom tests flexibly and simply. Redundant
contrasts will be ignored in the joint test. See the examples below.
These include an example of an "incomplete interaction test" involving
only two of three levels of a categorical variable (the test also tests
the main effect).
When more than one contrast is computed, the list created by
contrast.rms
is suitable for plotting (with error bars or bands)
with xYplot
or Dotplot
(see the last example before the
type="joint"
examples).
When fit
is the result of a Bayesian model fit and fun
is
specified, contrast.rms
operates altogether differently. a
and b
must both be specified and a2, b2
not specified.
fun
is evaluated on the estimates
separately on a
and b
and the subtraction is deferred. So
even in the absence of interactions, when fun
is nonlinear, the
settings of factors (predictors) will not cancel out and estimates of
differences will be covariate-specific (unless there are no covariates
in the model besides the one being varied to get from a
to b
).
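A sketch (assuming a Bayesian proportional odds fit g from blrm, as in the Examples below): fun is applied separately to the a and b settings for each posterior draw, and the subtraction happens afterwards:

M <- Mean(g)                               # function of the linear predictor
contrast(g, list(x=1), list(x=0), fun=M)   # HPD interval for difference in means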
Note that the use of offsets to compute profile likelihood confidence intervals prevents this function from working with certain models that use offsets for other purposes, e.g., Poisson models with offsets to account for population size.
contrast(fit, ...) ## S3 method for class 'rms' contrast(fit, a, b, a2, b2, ycut=NULL, cnames=NULL, fun=NULL, funint=TRUE, type=c("individual", "average", "joint"), conf.type=c("individual","simultaneous","profile"), usebootcoef=TRUE, boot.type=c("percentile","bca","basic"), posterior.summary=c('mean', 'median', 'mode'), weights="equal", conf.int=0.95, tol=1e-7, expand=TRUE, se_factor=4, plot_profile=FALSE, ...) ## S3 method for class 'contrast.rms' print(x, X=FALSE, fun=function(u)u, jointonly=FALSE, prob=0.95, ...)
fit |
a fit of class |
a |
a list containing settings for all predictors that you do not wish to
set to default (adjust-to) values. Usually you will specify two
variables in this list, one set to a constant and one to a sequence of
values, to obtain contrasts for the sequence of values of an
interacting factor. The |
b |
another list that generates the same number of observations as |
a2 |
an optional third list of settings of predictors |
b2 |
an optional fourth list of settings of predictors. Mandatory
if |
ycut |
used if the fit is a constrained partial proportional odds
model, to specify the single value or vector of values
(corresponding to the multiple contrasts) of the response
variable to use in forming contrasts. When there is
non-proportional odds, odds ratios will vary over levels of the
response variable. When there are multiple contrasts and only
one value is given for |
cnames |
vector of character strings naming the contrasts when
|
fun |
a function to evaluate on the linear predictor for each of
|
type |
set |
conf.type |
The default type of confidence interval computed for a given
individual (1 d.f.) contrast is a pointwise confidence interval. Set
|
usebootcoef |
If |
boot.type |
set to |
posterior.summary |
By default the posterior mean is used.
Specify |
weights |
a numeric vector, used when |
conf.int |
confidence level for confidence intervals for the contrasts (HPD interval probability for Bayesian analyses) |
tol |
tolerance for |
expand |
set to |
se_factor |
multiplier for a contrast's standard error used for root finding of the profile likelihood confidence limits when |
plot_profile |
when |
... |
passed to |
x |
result of |
X |
set |
funint |
set to |
jointonly |
set to |
prob |
highest posterior density interval probability when the fit
was Bayesian and |
a list of class "contrast.rms"
containing the elements
Contrast, SE, Z, var, df.residual, Lower, Upper, Pvalue, X, cnames, and redundant, which denote the contrast
estimates, standard errors, Z or t-statistics, variance matrix,
residual degrees of freedom (this is NULL
if the model was not
ols
), lower and upper confidence limits, 2-sided P-value, design
matrix, contrast names (or NULL
), and a logical vector denoting
which contrasts are redundant with the other contrasts. If there are
any redundant contrasts, an asterisk is printed at the start of the
corresponding lines when the results of contrast
are printed. The object also contains ctype
, indicating what method was
used for computing confidence intervals.
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
[email protected]
Predict
, gendata
, bootcov
,
summary.rms
, anova.rms
,
require(ggplot2) set.seed(1) age <- rnorm(200,40,12) sex <- factor(sample(c('female','male'),200,TRUE)) logit <- (sex=='male') + (age-40)/5 y <- ifelse(runif(200) <= plogis(logit), 1, 0) f <- lrm(y ~ pol(age,2)*sex) anova(f) # Compare a 30 year old female to a 40 year old male # (with or without age x sex interaction in the model) contrast(f, list(sex='female', age=30), list(sex='male', age=40)) # Test for interaction between age and sex, duplicating anova contrast(f, list(sex='female', age=30), list(sex='male', age=30), list(sex='female', age=c(40,50)), list(sex='male', age=c(40,50)), type='joint') # Duplicate overall sex effect in anova with 3 d.f. contrast(f, list(sex='female', age=c(30,40,50)), list(sex='male', age=c(30,40,50)), type='joint') # For females get an array of odds ratios against age=40 k <- contrast(f, list(sex='female', age=30:50), list(sex='female', age=40)) print(k, fun=exp) # Plot odds ratios with pointwise 0.95 confidence bands using log scale k <- as.data.frame(k[c('Contrast','Lower','Upper')]) ggplot(k, aes(x=30:50, y=exp(Contrast))) + geom_line() + geom_ribbon(aes(ymin=exp(Lower), ymax=exp(Upper)), alpha=0.15, linetype=0) + scale_y_continuous(trans='log10', n.breaks=10, minor_breaks=c(seq(0.1, 1, by=.1), seq(1, 10, by=.5))) + xlab('Age') + ylab('OR against age 40') # For an ordinal model with 3 variables (x1 is quadratic, x2 & x3 linear) # Get a 1 d.f. likelihood ratio (LR) test for x1=1 vs x1=0.25 # For the other variables get contrasts and LR tests that are the # ordinary ones for their original coefficients. # Get 0.95 profile likelihood confidence intervals for the x1 contrast # and for the x2 and x3 coefficients set.seed(7) x1 <- runif(50) x2 <- runif(50) x3 <- runif(50) dd <- datadist(x1, x2, x3); options(datadist='dd') y <- x1 + runif(50) # need x=TRUE,y=TRUE for profile likelihood f <- orm(y ~ pol(x1, 2) + x2 + x3, x=TRUE, y=TRUE) a <- list(x1=c( 1,0,0), x2=c(0,1,0), x3=c(0,0,1)) b <- list(x1=c(0.25,0,0), x2=c(0,0,0), x3=c(0,0,0)) k <- contrast(f, a, b, expand=FALSE) # Wald intervals and tests k; k$X[1,] summary(f, x1=c(.25, 1), x2=0:1, x3=0:1) # Wald intervals anova(f, test='LR') # LR tests contrast(f, a, b, expand=FALSE, conf.type='profile', plot_profile=TRUE) options(datadist=NULL) # For a model containing two treatments, centers, and treatment # x center interaction, get 0.95 confidence intervals separately # by center center <- factor(sample(letters[1 : 8], 500, TRUE)) treat <- factor(sample(c('a','b'), 500, TRUE)) y <- 8*(treat == 'b') + rnorm(500, 100, 20) f <- ols(y ~ treat*center) lc <- levels(center) contrast(f, list(treat='b', center=lc), list(treat='a', center=lc)) # Get 'Type III' contrast: average b - a treatment effect over # centers, weighting centers equally (which is almost always # an unreasonable thing to do) contrast(f, list(treat='b', center=lc), list(treat='a', center=lc), type='average') # Get 'Type II' contrast, weighting centers by the number of # subjects per center. Print the design contrast matrix used. 
k <- contrast(f, list(treat='b', center=lc), list(treat='a', center=lc), type='average', weights=table(center)) print(k, X=TRUE) # Note: If other variables had interacted with either treat # or center, we may want to list settings for these variables # inside the list()'s, so as to not use default settings # For a 4-treatment study, get all comparisons with treatment 'a' treat <- factor(sample(c('a','b','c','d'), 500, TRUE)) y <- 8*(treat == 'b') + rnorm(500, 100, 20) dd <- datadist(treat, center); options(datadist='dd') f <- ols(y ~ treat*center) lt <- levels(treat) contrast(f, list(treat=lt[-1]), list(treat=lt[ 1]), cnames=paste(lt[-1], lt[1], sep=':'), conf.int=1 - .05 / 3) # Compare each treatment with average of all others for(i in 1 : length(lt)) { cat('Comparing with', lt[i], '\n\n') print(contrast(f, list(treat=lt[-i]), list(treat=lt[ i]), type='average')) } options(datadist=NULL) # Six ways to get the same thing, for a variable that # appears linearly in a model and does not interact with # any other variables. We estimate the change in y per # unit change in a predictor x1. Methods 4, 5 also # provide confidence limits. Method 6 computes nonparametric # bootstrap confidence limits. Methods 2-6 can work # for models that are nonlinear or non-additive in x1. # For that case more care is needed in choice of settings # for x1 and the variables that interact with x1. ## Not run: coef(fit)['x1'] # method 1 diff(predict(fit, gendata(x1=c(0,1)))) # method 2 g <- Function(fit) # method 3 g(x1=1) - g(x1=0) summary(fit, x1=c(0,1)) # method 4 k <- contrast(fit, list(x1=1), list(x1=0)) # method 5 print(k, X=TRUE) fit <- update(fit, x=TRUE, y=TRUE) # method 6 b <- bootcov(fit, B=500) contrast(fit, list(x1=1), list(x1=0)) # In a model containing age, race, and sex, # compute an estimate of the mean response for a # 50 year old male, averaged over the races using # observed frequencies for the races as weights f <- ols(y ~ age + race + sex) contrast(f, list(age=50, sex='male', race=levels(race)), type='average', weights=table(race)) # For a Bayesian model get the highest posterior interval for the # difference in two nonlinear functions of predicted values # Start with the mean from a proportional odds model g <- blrm(y ~ x) M <- Mean(g) contrast(g, list(x=1), list(x=0), fun=M) # For the median we have to make sure that contrast can pass the # per-posterior-draw vector of intercepts through qu <- Quantile(g) med <- function(lp, intercepts) qu(0.5, lp, intercepts=intercepts) contrast(g, list(x=1), list(x=0), fun=med) ## End(Not run) # Plot the treatment effect (drug - placebo) as a function of age # and sex in a model in which age nonlinearly interacts with treatment # for females only set.seed(1) n <- 800 treat <- factor(sample(c('drug','placebo'), n,TRUE)) sex <- factor(sample(c('female','male'), n,TRUE)) age <- rnorm(n, 50, 10) y <- .05*age + (sex=='female')*(treat=='drug')*.05*abs(age-50) + rnorm(n) f <- ols(y ~ rcs(age,4)*treat*sex) d <- datadist(age, treat, sex); options(datadist='d') # show separate estimates by treatment and sex require(ggplot2) ggplot(Predict(f, age, treat, sex='female')) ggplot(Predict(f, age, treat, sex='male')) ages <- seq(35,65,by=5); sexes <- c('female','male') w <- contrast(f, list(treat='drug', age=ages, sex=sexes), list(treat='placebo', age=ages, sex=sexes)) # add conf.type="simultaneous" to adjust for having done 14 contrasts xYplot(Cbind(Contrast, Lower, Upper) ~ age | sex, data=w, ylab='Drug - Placebo') w <- 
as.data.frame(w[c('age','sex','Contrast','Lower','Upper')]) ggplot(w, aes(x=age, y=Contrast)) + geom_point() + facet_grid(sex ~ .) + geom_errorbar(aes(ymin=Lower, ymax=Upper), width=0) ggplot(w, aes(x=age, y=Contrast)) + geom_line() + facet_grid(sex ~ .) + geom_ribbon(aes(ymin=Lower, ymax=Upper), alpha=0.15, linetype=0) xYplot(Cbind(Contrast, Lower, Upper) ~ age, groups=sex, data=w, ylab='Drug - Placebo', method='alt bars') options(datadist=NULL) # Examples of type='joint' contrast tests set.seed(1) x1 <- rnorm(100) x2 <- factor(sample(c('a','b','c'), 100, TRUE)) dd <- datadist(x1, x2); options(datadist='dd') y <- x1 + (x2=='b') + rnorm(100) # First replicate a test statistic from anova() f <- ols(y ~ x2) anova(f) contrast(f, list(x2=c('b','c')), list(x2='a'), type='joint') # Repeat with a redundancy; compare a vs b, a vs c, b vs c contrast(f, list(x2=c('a','a','b')), list(x2=c('b','c','c')), type='joint') # Get a test of association of a continuous predictor with y # First assume linearity, then cubic f <- lrm(y>0 ~ x1 + x2) anova(f) contrast(f, list(x1=1), list(x1=0), type='joint') # a minimum set of contrasts xs <- seq(-2, 2, length=20) contrast(f, list(x1=0), list(x1=xs), type='joint') # All contrasts were redundant except for the first, because of # linearity assumption f <- lrm(y>0 ~ pol(x1,3) + x2, x=TRUE, y=TRUE) anova(f) anova(f, test='LR') # discrepancy with Wald statistics points out a problem w/them contrast(f, list(x1=0), list(x1=xs), type='joint') print(contrast(f, list(x1=0), list(x1=xs), type='joint'), jointonly=TRUE) # All contrasts were redundant except for the first 3, because of # cubic regression assumption # These Wald tests and intervals are not very accurate. Although joint # testing is not implemented in contrast(), individual profile likelihood # confidence intervals and associated likelihood ratio tests are helpful: # contrast(f, list(x1=0), list(x1=xs), conf.type='profile', plot_profile=TRUE) # Now do something that is difficult to do without cryptic contrast # matrix operations: Allow each of the three x2 groups to have a different # shape for the x1 effect where x1 is quadratic. Test whether there is # a difference in mean levels of y for x2='b' vs. 'c' or whether # the shape or slope of x1 is different between x2='b' and x2='c' regardless # of how they differ when x2='a'. In other words, test whether the mean # response differs between group b and c at any value of x1. # This is a 3 d.f. test (intercept, linear, quadratic effects) and is # a better approach than subsetting the data to remove x2='a' then # fitting a simpler model, as it uses a better estimate of sigma from # all the data. f <- ols(y ~ pol(x1,2) * x2) anova(f) contrast(f, list(x1=xs, x2='b'), list(x1=xs, x2='c'), type='joint') # Note: If using a spline fit, there should be at least one value of # x1 between any two knots and beyond the outer knots. options(datadist=NULL)
Modification of Therneau's coxph
function to fit the Cox model and
its extension, the Andersen-Gill model. The latter allows for interval
time-dependent covariables, time-dependent strata, and repeated events.
The Survival
method for an object created by cph
returns an S
function for computing estimates of the survival function.
The Quantile
method for cph
returns an S function for computing
quantiles of survival time (median, by default).
The Mean
method returns a function for computing the mean survival
time. This function issues a warning if the last follow-up time is uncensored,
unless a restricted mean is explicitly requested.
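A minimal sketch (assuming a stratified cph fit as in the Examples, stored with surv=TRUE):

f   <- cph(S ~ rcs(age,4) + strat(sex), surv=TRUE, x=TRUE, y=TRUE)
srv <- Survival(f)                  # g(times, lp, stratum, type)
med <- Quantile(f)                  # h(q, lp, stratum, type)
srv(5, lp=0, stratum='sex=Male')    # estimated P(T > 5) in this stratum
med(0.5, lp=0, stratum='sex=Male')  # median survival time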
cph(formula = formula(data), data=environment(formula), weights, subset, na.action=na.delete, method=c("efron","breslow","exact","model.frame","model.matrix"), singular.ok=FALSE, robust=FALSE, model=FALSE, x=FALSE, y=FALSE, se.fit=FALSE, linear.predictors=TRUE, residuals=TRUE, nonames=FALSE, eps=1e-4, init, iter.max=10, tol=1e-9, surv=FALSE, time.inc, type=NULL, vartype=NULL, debug=FALSE, ...) ## S3 method for class 'cph' Survival(object, ...) # Evaluate result as g(times, lp, stratum=1, type=c("step","polygon")) ## S3 method for class 'cph' Quantile(object, ...) # Evaluate like h(q, lp, stratum=1, type=c("step","polygon")) ## S3 method for class 'cph' Mean(object, method=c("exact","approximate"), type=c("step","polygon"), n=75, tmax, ...) # E.g. m(lp, stratum=1, type=c("step","polygon"), tmax, ...)
formula |
an S formula object with a |
object |
an object created by |
data |
name of an S data frame containing all needed variables. Omit this to use a data frame already in the S “search list”. |
weights |
case weights |
subset |
an expression defining a subset of the observations to use in the fit. The default
is to use all observations. Specify for example |
na.action |
specifies an S function to handle missing data. The default is the function |
method |
for For |
singular.ok |
If |
robust |
if |
model |
default is |
x |
default is |
y |
default is |
se.fit |
default is |
linear.predictors |
set to |
residuals |
set to |
nonames |
set to |
eps |
convergence criterion - change in log likelihood. |
init |
vector of initial parameter estimates. Defaults to all zeros.
Special residuals can be obtained by setting some elements of |
iter.max |
maximum number of iterations to allow. Set to |
tol |
tolerance for declaring singularity for matrix inversion (available only when survival5 or later package is in effect) |
surv |
set to |
time.inc |
time increment used in deriving |
type |
(for For |
vartype |
see |
debug |
set to |
... |
other arguments passed to |
times |
a scalar or vector of times at which to evaluate the survival estimates |
lp |
a scalar or vector of linear predictors (including the centering constant) at which to evaluate the survival estimates |
stratum |
a scalar stratum number or name (e.g., |
q |
a scalar quantile or a vector of quantiles to compute |
n |
the number of points at which to evaluate the mean survival time, for
|
tmax |
For |
If there is any strata by covariable interaction in the model such that
the mean X beta varies greatly over strata, method="approximate"
may
not yield very accurate estimates of the mean in Mean.cph
.
For method="approximate"
if you ask for an estimate of the mean for
a linear predictor value that was outside the range of linear predictors
stored with the fit, the mean for that observation will be NA
.
For Survival
, Quantile
, or Mean
, an S function is returned. Otherwise,
in addition to what is listed below, formula/design information and
the components
maxtime, time.inc, units, model, x, y, se.fit
are stored, the last 5
depending on the settings of options by the same names.
The vectors or matrix stored if y=TRUE
or x=TRUE
have rows deleted according to subset
and
to missing data, and have names or row names that come from the
data frame used as input data.
n |
table with one row per stratum containing number of censored and uncensored observations |
coef |
vector of regression coefficients |
stats |
vector containing the named elements |
var |
variance/covariance matrix of coefficients |
linear.predictors |
values of predicted X beta for observations used in fit, normalized to have overall mean zero, then having any offsets added |
resid |
martingale residuals |
loglik |
log likelihood at initial and final parameter values |
score |
value of score statistic at initial values of parameters |
times |
lists of times (if |
surv |
lists of underlying survival probability estimates |
std.err |
lists of standard errors of estimate log-log survival |
surv.summary |
a 3 dimensional array if |
center |
centering constant, equal to overall mean of X beta. |
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
coxph
, survival-internal
,
Surv
, residuals.cph
,
cox.zph
, survfit.cph
,
survest.cph
, survfit.coxph
,
survplot
, datadist
,
rms
, rms.trans
, anova.rms
,
summary.rms
, Predict
,
fastbw
, validate
, calibrate
,
plot.Predict
, ggplot.Predict
,
specs.rms
, lrm
, which.influence
,
na.delete
,
na.detail.response
, print.cph
,
latex.cph
, vif
, ie.setup
,
GiniMd
, dxy.cens
,
concordance
# Simulate data from a population model in which the log hazard # function is linear in age and there is no age x sex interaction require(survival) require(ggplot2) n <- 1000 set.seed(731) age <- 50 + 12*rnorm(n) label(age) <- "Age" sex <- factor(sample(c('Male','Female'), n, rep=TRUE, prob=c(.6, .4))) cens <- 15*runif(n) h <- .02*exp(.04*(age-50)+.8*(sex=='Female')) dt <- -log(runif(n))/h label(dt) <- 'Follow-up Time' e <- ifelse(dt <= cens,1,0) dt <- pmin(dt, cens) units(dt) <- "Year" dd <- datadist(age, sex) options(datadist='dd') S <- Surv(dt,e) f <- cph(S ~ rcs(age,4) + sex, x=TRUE, y=TRUE) cox.zph(f, "rank") # tests of PH anova(f) ggplot(Predict(f, age, sex)) # plot age effect, 2 curves for 2 sexes survplot(f, sex) # time on x-axis, curves for x2 res <- resid(f, "scaledsch") time <- as.numeric(dimnames(res)[[1]]) z <- loess(res[,4] ~ time, span=0.50) # residuals for sex plot(time, fitted(z)) lines(supsmu(time, res[,4]),lty=2) plot(cox.zph(f,"identity")) #Easier approach for last few lines # latex(f) f <- cph(S ~ age + strat(sex), surv=TRUE) g <- Survival(f) # g is a function g(seq(.1,1,by=.1), stratum="sex=Male", type="poly") #could use stratum=2 med <- Quantile(f) plot(Predict(f, age, fun=function(x) med(lp=x))) #plot median survival # Fit a model that is quadratic in age, interacting with sex as strata # Compare standard errors of linear predictor values with those from # coxph # Use more stringent convergence criteria to match with coxph f <- cph(S ~ pol(age,2)*strat(sex), x=TRUE, eps=1e-9, iter.max=20) coef(f) se <- predict(f, se.fit=TRUE)$se.fit require(lattice) xyplot(se ~ age | sex, main='From cph') a <- c(30,50,70) comb <- data.frame(age=rep(a, each=2), sex=rep(levels(sex), 3)) p <- predict(f, comb, se.fit=TRUE) comb$yhat <- p$linear.predictors comb$se <- p$se.fit z <- qnorm(.975) comb$lower <- p$linear.predictors - z*p$se.fit comb$upper <- p$linear.predictors + z*p$se.fit comb age2 <- age^2 f2 <- coxph(S ~ (age + age2)*strata(sex)) coef(f2) se <- predict(f2, se.fit=TRUE)$se.fit xyplot(se ~ age | sex, main='From coxph') comb <- data.frame(age=rep(a, each=2), age2=rep(a, each=2)^2, sex=rep(levels(sex), 3)) p <- predict(f2, newdata=comb, se.fit=TRUE) comb$yhat <- p$fit comb$se <- p$se.fit comb$lower <- p$fit - z*p$se.fit comb$upper <- p$fit + z*p$se.fit comb # g <- cph(Surv(hospital.charges) ~ age, surv=TRUE) # Cox model very useful for analyzing highly skewed data, censored or not # m <- Mean(g) # m(0) # Predicted mean charge for reference age #Fit a time-dependent covariable representing the instantaneous effect #of an intervening non-fatal event rm(age) set.seed(121) dframe <- data.frame(failure.time=1:10, event=rep(0:1,5), ie.time=c(NA,1.5,2.5,NA,3,4,NA,5,5,5), age=sample(40:80,10,rep=TRUE)) z <- ie.setup(dframe$failure.time, dframe$event, dframe$ie.time) S <- z$S ie.status <- z$ie.status attach(dframe[z$subs,]) # replicates all variables f <- cph(S ~ age + ie.status, x=TRUE, y=TRUE) #Must use x=TRUE,y=TRUE to get survival curves with time-dep. 
covariables #Get estimated survival curve for a 50-year old who has an intervening #non-fatal event at 5 days new <- data.frame(S=Surv(c(0,5), c(5,999), c(FALSE,FALSE)), age=rep(50,2), ie.status=c(0,1)) g <- survfit(f, new) plot(c(0,g$time), c(1,g$surv[,2]), type='s', xlab='Days', ylab='Survival Prob.') # Not certain about what columns represent in g$surv for survival5 # but appears to be for different ie.status #or: #g <- survest(f, new) #plot(g$time, g$surv, type='s', xlab='Days', ylab='Survival Prob.') #Compare with estimates when there is no intervening event new2 <- data.frame(S=Surv(c(0,5), c(5, 999), c(FALSE,FALSE)), age=rep(50,2), ie.status=c(0,0)) g2 <- survfit(f, new2) lines(c(0,g2$time), c(1,g2$surv[,2]), type='s', lty=2) #or: #g2 <- survest(f, new2) #lines(g2$time, g2$surv, type='s', lty=2) detach("dframe[z$subs, ]") options(datadist=NULL)
Creates several new variables which help set up a dataset with an
ordinal response variable for use in fitting a forward continuation
ratio (CR) model. The CR model can be fitted with binary logistic
regression if each input observation is replicated the proper
number of times according to the y
value, a new binary y
is computed that has at most one y=1
per subject,
and if a
cohort
variable is used to define the current
qualifying condition for a cohort of subjects, e.g., y >= 2.
cr.setup
creates the needed auxiliary variables. See
predab.resample
and validate.lrm
for information about
validating CR models (e.g., using the bootstrap to sample with
replacement from the original subjects instead of the records used in
the fit, validating the model separately for user-specified values of
cohort
).
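A minimal sketch (with y an ordinal response and age a covariate, hypothetical names) of the replication cr.setup performs:

u   <- cr.setup(y)          # expand the data for a forward CR model
Y   <- u$y                  # new binary response, at most one 1 per subject
coh <- u$cohort             # current qualifying condition for each record
age <- age[u$subs]          # replicate covariates the same way y was
f   <- lrm(Y ~ coh + age)   # fit the CR model by binary logistic regression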
cr.setup(y)
y |
a character, numeric, |
a list with components y, cohort, subs, reps
. y
is a new binary
variable that is to be used in the binary logistic fit. cohort
is
a factor
vector specifying which cohort condition currently applies.
subs
is a vector of subscripts that can be used to replicate other
variables the same way y
was replicated. reps
specifies how many
times each original observation was replicated. y, cohort, subs
are
all the same length and are longer than the original y
vector.
reps
is the same length as the original y
vector.
The subs
vector is suitable for passing to validate.lrm
or calibrate
,
which pass this vector under the name cluster
on to predab.resample
so that bootstrapping can be
done by sampling with replacement from the original subjects rather than
from the individual records created by cr.setup
.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
Berridge DM, Whitehead J: Analysis of failure time data with ordinal categories of response. Stat in Med 10:1703–1710, 1991.
y <- c(NA, 10, 21, 32, 32) cr.setup(y) set.seed(171) y <- sample(0:2, 100, rep=TRUE) sex <- sample(c("f","m"),100,rep=TRUE) sex <- factor(sex) table(sex, y) options(digits=5) tapply(y==0, sex, mean) tapply(y==1, sex, mean) tapply(y==2, sex, mean) cohort <- y>=1 tapply(y[cohort]==1, sex[cohort], mean) u <- cr.setup(y) Y <- u$y cohort <- u$cohort sex <- sex[u$subs] lrm(Y ~ cohort + sex) f <- lrm(Y ~ cohort*sex) # saturated model - has to fit all data cells f #Prob(y=0|female): # plogis(-.50078) #Prob(y=0|male): # plogis(-.50078+.11301) #Prob(y=1|y>=1, female): plogis(-.50078+.31845) #Prob(y=1|y>=1, male): plogis(-.50078+.31845+.11301-.07379) combinations <- expand.grid(cohort=levels(cohort), sex=levels(sex)) combinations p <- predict(f, combinations, type="fitted") p p0 <- p[c(1,3)] p1 <- p[c(2,4)] p1.unconditional <- (1 - p0) *p1 p1.unconditional p2.unconditional <- 1 - p0 - p1.unconditional p2.unconditional ## Not run: dd <- datadist(inputdata) # do this on non-replicated data options(datadist='dd') pain.severity <- inputdata$pain.severity u <- cr.setup(pain.severity) # inputdata frame has age, sex with pain.severity attach(inputdata[u$subs,]) # replicate age, sex # If age, sex already available, could do age <- age[u$subs] etc., or # age <- rep(age, u$reps), etc. y <- u$y cohort <- u$cohort dd <- datadist(dd, cohort) # add to dd f <- lrm(y ~ cohort + age*sex) # ordinary cont. ratio model g <- lrm(y ~ cohort*sex + age, x=TRUE,y=TRUE) # allow unequal slopes for # sex across cutoffs cal <- calibrate(g, cluster=u$subs, subset=cohort=='all') # subs makes bootstrap sample the correct units, subset causes # Predicted Prob(pain.severity=0) to be checked for calibration ## End(Not run)
For a given set of variables or a data frame, determines summaries
of variables for effect and plotting ranges, values to adjust to,
and overall ranges
for Predict
, plot.Predict
, ggplot.Predict
,
summary.rms
, survplot
, and nomogram.rms
.
If datadist
is called before
a model fit and the resulting object pointed to with options(datadist="name")
,
the data characteristics will be stored with the fit by Design()
, so
that later predictions and summaries of the fit will not need to access
the original data used in the fit. Alternatively, you can specify the
values for each variable in the model when using these functions, or
specify the values of some of them and let the functions look up the
remainder (say, adjustment levels) from an object created by datadist
.
The best method is probably to run datadist
once before any models are
fitted, storing the distribution summaries for all potential variables.
Adjustment values are 0
for binary variables, the most frequent
category (or optionally the first category level)
for categorical (factor
) variables, the middle level for
ordered factor
variables, and medians for continuous variables.
See descriptions of q.display
and q.effect
for how display and
effect ranges are chosen for continuous variables.
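A minimal sketch of these rules (hypothetical variables; the adjustment
values appear in the 'Adjust to' row of the stored limits):

set.seed(1)
age    <- rnorm(100, 50, 10)                         # continuous: median
sex    <- factor(sample(c('f','f','m'), 100, TRUE))  # factor: modal level
smoker <- sample(0:1, 100, TRUE)                     # binary: 0
d <- datadist(age, sex, smoker)
d$limits   # 'Adjust to' row holds the adjustment values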
datadist(..., data, q.display, q.effect=c(0.25, 0.75),
         adjto.cat=c('mode','first'), n.unique=10)

## S3 method for class 'datadist'
print(x, ...)

# options(datadist="dd")
# used by summary, plot, survplot, sometimes predict
# For dd substitute the name of the result of datadist
... |
a list of variable names, separated by commas, a single data frame, or
a fit with |
data |
a data frame or a search position. If |
q.display |
set of two quantiles for computing the range of continuous variables
to use in displaying regression relationships. Defaults are
|
q.effect |
set of two quantiles for computing the range of continuous variables to use in estimating regression effects. Defaults are c(.25,.75), which yields inter-quartile-range odds ratios, etc. |
adjto.cat |
default is |
n.unique |
variables having |
x |
result of |
For categorical variables, the 7 limits are set to character strings
(factors) which correspond to
c(NA,adjto.level,NA,1,k,1,k)
, where k
is the number of levels.
For ordered variables with numeric levels, the limits are set to
c(L,M,H,L,H,L,H)
, where L
is the lowest level, M
is the middle
level, and H
is the highest level.
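For instance, the stored limits can be inspected directly (a small sketch
with a hypothetical 3-level factor):

set.seed(1)
x <- factor(sample(c('a','b','c'), 50, TRUE))
d <- datadist(x)
d$limits   # effect limits are NA for the factor; prediction limits span its levels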
a list of class "datadist"
with the following components
limits |
a |
values |
a named list, with one vector of unique values for each numeric
variable having no more than |
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
rms
, rms.trans
, describe
, Predict
, summary.rms
## Not run:
d <- datadist(data=1)         # use all variables in search pos. 1
d <- datadist(x1, x2, x3)
page(d)                       # if your options(pager) leaves up a pop-up
                              # window, this is a useful guide in analyses
d <- datadist(data=2)         # all variables in search pos. 2
d <- datadist(data=my.data.frame)
d <- datadist(my.data.frame)  # same as previous. Run for all potential vars.
d <- datadist(x2, x3, data=my.data.frame)    # combine variables
d <- datadist(x2, x3, q.effect=c(.1,.9), q.display=c(0,1))
# uses inter-decile range odds ratios,
# total range of variables for regression function plots
d <- datadist(d, z)           # add a new variable to an existing datadist
options(datadist="d")         # often a good idea, to store info with fit
f <- ols(y ~ x1*x2*x3)

options(datadist=NULL)        # default at start of session
f <- ols(y ~ x1*x2)
d <- datadist(f)              # info not stored in `f'
d$limits["Adjust to","x1"] <- .5   # reset adjustment level to .5
options(datadist="d")

f <- lrm(y ~ x1*x2, data=mydata)
d <- datadist(f, data=mydata)
options(datadist="d")

f <- lrm(y ~ x1*x2)   # datadist not used - specify all values for
summary(f, x1=c(200,500,800), x2=c(1,3,5))    # obtaining predictions
plot(Predict(f, x1=200:800, x2=3))   # or ggplot()

# Change reference value to get a relative odds plot for a logistic model
d$limits$age[2] <- 30   # make 30 the reference value for age
# Could also do: d$limits["Adjust to","age"] <- 30
fit <- update(fit)      # make new reference value take effect
plot(Predict(fit, age, ref.zero=TRUE, fun=exp),
     ylab='Age=x:Age=30 Odds Ratio')   # or ggplot()
## End(Not run)
For an orm
object generates a function for computing the
estimates of the function Prob(Y>=y) given one or more values of the
linear predictor using the reference (median) intercept. This
function can optionally be evaluated at only a set of user-specified
y
values, otherwise a right-step function is returned. There
is a plot method for plotting the step functions, and if more than one
linear predictor was evaluated multiple step functions are drawn.
ExProb
is especially useful for nomogram
.
Optionally a normal approximation for a confidence
interval for exceedance probabilities will be computed using the delta
method, if
conf.int > 0
is specified to the function generated from calling
ExProb
. In that case, a "lims"
attribute is included
in the result computed by the derived cumulative probability function.
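A minimal sketch of requesting these confidence limits from the derived
function (simulated data; see also the Examples section below):

set.seed(1)
x <- runif(100)
y <- x + runif(100)
f  <- orm(y ~ x)
ex <- ExProb(f)
w  <- ex(predict(f, data.frame(x=.5)), conf.int=0.95)
attr(w, 'lims')   # delta-method confidence limits for Prob(Y>=y)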
ExProb(object, ...)

## S3 method for class 'orm'
ExProb(object, codes = FALSE, ...)

## S3 method for class 'ExProb'
plot(x, ..., data=NULL, xlim=NULL, xlab=x$yname,
     ylab=expression(Prob(Y>=y)), col=par('col'), col.vert='gray85',
     pch=20, pch.data=21, lwd=par('lwd'), lwd.data=lwd, lty.data=2,
     key=TRUE)
object |
a fit object from |
codes |
if |
... |
ignored for |
data |
Specify |
x |
an object created by running the function created by |
xlim |
limits for x-axis; default is range of observed |
xlab |
x-axis label |
ylab |
y-axis label |
col |
color for horizontal lines and points |
col.vert |
color for vertical discontinuities |
pch |
plotting symbol for predicted curves |
lwd |
line width for predicted curves |
pch.data , lwd.data , lty.data
|
plotting parameters for data |
key |
set to |
ExProb
returns an R function. Running the function returns an
object of class "ExProb"
.
Frank Harrell and Shengxin Tu
set.seed(1)
x1 <- runif(200)
yvar <- x1 + runif(200)
f <- orm(yvar ~ x1)
d <- ExProb(f)
lp <- predict(f, newdata=data.frame(x1=c(.2,.8)))
w <- d(lp)
s1 <- abs(x1 - .2) < .1
s2 <- abs(x1 - .8) < .1
plot(w, data=data.frame(x1=c(rep(.2, sum(s1)), rep(.8, sum(s2))),
                        yvar=c(yvar[s1], yvar[s2])))

qu <- Quantile(f)
abline(h=c(.1,.5), col='gray80')
abline(v=qu(.5, lp), col='gray80')
abline(v=qu(.9, lp), col='green')
Performs a slightly inefficient but numerically stable version of fast
backward elimination on factors, using a method based on Lawless and Singhal
(1978).
This method uses the fitted complete model and computes approximate Wald
statistics by computing conditional (restricted) maximum likelihood estimates
assuming multivariate normality of estimates.
fastbw
deletes factors, not columns of the design matrix. Factors requiring multiple d.f. will be retained or dropped as a group.
The function prints the deletion statistics for each variable in
turn, and prints approximate parameter estimates for the model after
deleting variables. The approximation is better when the number of
factors deleted is not large. For ols
, the approximation is exact for
regression coefficients, and standard errors are only off by a factor
equal to the ratio of the mean squared error estimate for the reduced
model to the original mean squared error estimate for the full model.
If the fit was from ols
, fastbw
will compute the usual R^2
statistic for each model.
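A minimal sketch of typical use (simulated data; variable names are
hypothetical):

set.seed(1)
x1 <- rnorm(100); x2 <- rnorm(100); x3 <- rnorm(100)
y  <- x1 + rnorm(100)
f  <- ols(y ~ x1 + x2 + x3)
fastbw(f)                       # AIC-based stopping rule (the default)
fastbw(f, rule='p', sls=0.1)    # p-value rule: drop terms with P >= 0.1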
fastbw(fit, rule=c("aic", "p"),
       type=c("residual", "individual", "total"),
       sls=.05, aics=0, eps=.Machine$double.eps, k.aic=2, force=NULL)

## S3 method for class 'fastbw'
print(x, digits=4, estimates=TRUE, ...)
fit |
fit object with |
rule |
Stopping rule. Defaults to |
type |
Type of statistic on which to base the stopping rule. Default is
|
sls |
Significance level for staying in a model if |
aics |
For |
eps |
Singularity criterion, default is |
k.aic |
multiplier to compute AIC, default is 2. To use BIC, set |
force |
a vector of integers specifying parameters forced to be in the model, not counting intercept(s) |
x |
result of |
digits |
number of significant digits to print |
estimates |
set to |
... |
ignored |
a list with an attribute kept
if bw=TRUE
, and the
following components:
result |
matrix of statistics with rows in order of deletion. |
names.kept |
names of factors kept in final model. |
factors.kept |
the subscripts of factors kept in the final model |
factors.deleted |
opposite of |
parms.kept |
column numbers in design matrix corresponding to parameters kept in the final model. |
parms.deleted |
opposite of |
coefficients |
vector of approximate coefficients of reduced model. |
var |
approximate covariance matrix for reduced model. |
Coefficients |
matrix of coefficients of all models. Rows correspond to the successive models examined and columns correspond to the coefficients in the full model. For variables not in a particular sub-model (row), the coefficients are zero. |
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
Lawless, J. F. and Singhal, K. (1978): Efficient screening of nonnormal regression models. Biometrics 34:318–327.
rms
, ols
, lrm
,
cph
, psm
, validate
,
solvet
, rmsMisc
## Not run:
fastbw(fit, optional.arguments)    # print results
z <- fastbw(fit, optional.args)    # typically used in simulations
lm.fit(X[,z$parms.kept], Y)        # least squares fit of reduced model
## End(Not run)
Function
is a class of functions for creating other S functions.
Function.rms
is the method for creating S functions to compute
X beta, based on a model fitted with rms
in effect.
Like latexrms
, Function.rms
simplifies restricted cubic
spline functions and factors out terms in second-order interactions.
Function.rms
will not work for models that have third-order
interactions involving restricted cubic splines.
Function.cph
is a particular method for handling fits from
cph
, for which an intercept (the negative of the centering
constant) is added to
the model. sascode
is a function that takes an S function such
as one created by Function
and does most of the editing
to turn the function definition into
a fragment of SAS code for computing X beta from the fitted model, along
with assignment statements that initialize predictors to reference
values.
perlcode
similarly creates Perl code to evaluate a fitted
regression model.
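A brief sketch of the workflow (simulated data; the Examples section below
gives fuller detail):

set.seed(1)
x1 <- runif(50); x2 <- runif(50)
y  <- x1 + 2*x2 + rnorm(50)
dd <- datadist(x1, x2); options(datadist='dd')
f <- ols(y ~ x1 + x2)
g <- Function(f)         # R function computing X beta
g(x1=.5, x2=.5)          # evaluate the fit at specific predictor values
sascode(g)               # writes a SAS code fragment to standard output
cat(perlcode(g), '\n')   # Perl code returned as a character string
options(datadist=NULL)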
## S3 method for class 'rms'
Function(object, intercept=NULL,
         digits=max(8, .Options$digits),
         posterior.summary=c('mean', 'median', 'mode'), ...)

## S3 method for class 'cph'
Function(object, intercept=-object$center, ...)

# Use result as fun(predictor1=value1, predictor2=value2, ...)

sascode(object, file='', append=FALSE)
perlcode(object)
object |
a fit created with |
intercept |
an intercept value to use (not allowed to be specified to |
digits |
number of significant digits to use for coefficients and knot locations |
posterior.summary |
if using a Bayesian model fit such as from
|
file |
name of a file in which to write the SAS code. Default is to write to standard output. |
append |
set to |
... |
arguments to pass to |
Function
returns an R function that can be invoked in any
usual context. The function has one argument per predictor variable,
and the default values of the predictors are set to adjust-to
values
(see datadist
). Multiple predicted X beta values may be calculated
by specifying vectors as arguments to the created function.
All non-scalar argument values must have the same length.
perlcode
returns a character string with embedded newline characters.
Frank Harrell, Jeremy Stephens, and Thomas Dupont
Department of Biostatistics
Vanderbilt University
[email protected]
latexrms
, transcan
,
predict.rms
, rms
, rms.trans
suppressWarnings(RNGversion("3.5.0"))
set.seed(1331)
x1 <- exp(rnorm(100))
x2 <- factor(sample(c('a','b'), 100, rep=TRUE))
dd <- datadist(x1, x2)
options(datadist='dd')
y <- log(x1)^2 + log(x1)*(x2=='b') + rnorm(100)/4
f <- ols(y ~ pol(log(x1),2)*x2)
f$coef
g <- Function(f, digits=5)
g
sascode(g)
cat(perlcode(g), '\n')
g()
g(x1=c(2,3), x2='b')   # could omit x2 since b is default category
predict(f, expand.grid(x1=c(2,3), x2='b'))
g8 <- Function(f)      # default is 8 sig. digits
g8(x1=c(2,3), x2='b')
options(datadist=NULL)

## Not run:
require(survival)
# Make self-contained functions for computing survival probabilities
# using a log-normal regression
f <- psm(Surv(d.time, death) ~ rcs(age,4)*sex, dist='gaussian')
g <- Function(f)
surv <- Survival(f)
# Compute 2 and 5-year survival estimates for 50 year old male
surv(c(2,5), g(age=50, sex='male'))
## End(Not run)
If nobs
is not specified, allows user to specify predictor settings
by e.g. age=50, sex="male"
, and any omitted predictors are set to
reference values (default=median for continuous variables, first level
for categorical ones - see datadist
). If any predictor has more than one
value given, expand.grid
is called to generate all possible combinations
of values, unless expand=FALSE
. If nobs
is given, a data
frame is first generated which has
nobs
of adjust-to values duplicated. Then an editor window is opened
which allows the user to subset the variable names down to ones which she
intends to vary (this streamlines the data.ed
step). Then, if any
predictors kept are discrete and viewvals=TRUE
, a window (using page
)
is opened defining the possible values of this subset, to facilitate
data editing. Then the data.ed
function is invoked to allow interactive
overriding of predictor settings in the nobs
rows. The subset of
variables are combined with the other predictors which were not
displayed with data.ed
, and a final full data frame is returned.
gendata
is most useful for creating a newdata
data frame to pass
to predict
.
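A small sketch (simulated data; fuller examples appear in the Examples
section):

set.seed(1)
age <- rnorm(100, 50, 10)
sex <- factor(sample(c('female','male'), 100, TRUE))
y   <- sample(0:1, 100, TRUE)
dd  <- datadist(age, sex); options(datadist='dd')
f <- lrm(y ~ age + sex)
d <- gendata(f, age=c(40, 60))   # sex is set to its adjust-to value
predict(f, d, type='fitted')
options(datadist=NULL)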
gendata(fit, ..., nobs, viewvals=FALSE, expand=TRUE, factors)
fit |
a fit object created with |
... |
predictor settings, if |
nobs |
number of observations to create if doing it interactively using X-windows |
viewvals |
if |
expand |
set to |
factors |
a list containing predictor settings with their names. This is an
alternative to specifying the variables separately in .... Unlike the
usage of ..., variables getting default ranges in |
if you have a variable in ...
that is named n, no, nob, or nobs, add nobs=FALSE
to the invocation to prevent that variable
from being misrecognized as nobs
a data frame with all predictors, and an attribute names.subset
if
nobs
is specified. This attribute contains the vector of variable
names for predictors which were passed to de
and hence were
allowed to vary. If neither nobs
nor any predictor settings were
given, returns a data frame with adjust-to values.
optionally writes to the terminal, opens X-windows, and generates a
temporary file using sink
.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
predict.rms
, survest.cph
,
survest.psm
, rmsMisc
,
expand.grid
, de
, page
,
print.datadist
, Predict
set.seed(1)
age  <- rnorm(200, 50, 10)
sex  <- factor(sample(c('female','male'), 200, TRUE))
race <- factor(sample(c('a','b','c','d'), 200, TRUE))
y    <- sample(0:1, 200, TRUE)
dd <- datadist(age, sex, race)
options(datadist="dd")
f <- lrm(y ~ age*sex + race)
gendata(f)
gendata(f, age=50)
d <- gendata(f, age=50, sex="female")    # leave race=reference category
d <- gendata(f, age=c(50,60), race=c("b","a"))   # 4 obs.
d$Predicted <- predict(f, d, type="fitted")
d   # Predicted column prints at the far right
options(datadist=NULL)

## Not run:
d <- gendata(f, nobs=5, view=TRUE)   # 5 interactively defined obs.
d[,attr(d,"names.subset")]           # print variables which varied
predict(f, d)
## End(Not run)
Uses ggplot2
graphics to plot the effect of one or two predictors
on the linear predictor or X beta scale, or on some transformation of
that scale. The first argument specifies the result of the
Predict
function. The predictor is always plotted in its
original coding.
If rdata
is given, a spike histogram is drawn showing
the location/density of data values for the x-axis variable. If
there is a
groups
(superposition) variable that generated separate
curves, the data density specific to each class of points is shown.
This assumes that the second variable was a factor variable. The histograms
are drawn by histSpikeg
.
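For instance, a minimal sketch of adding the spike histogram (simulated
data; rdata may also be an Hmisc llist, as in the Examples section):

require(ggplot2)
set.seed(1)
age <- rnorm(200, 50, 10)
sex <- factor(sample(c('female','male'), 200, TRUE))
y   <- ifelse(runif(200) < plogis(.03*(age-50)), 1, 0)
dd  <- datadist(age, sex); options(datadist='dd')
f <- lrm(y ~ age + sex)
ggplot(Predict(f, age, sex), rdata=data.frame(age, sex))
options(datadist=NULL)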
To plot effects instead of estimates (e.g., treatment differences as a
function of interacting factors) see contrast.rms
and
summary.rms
.
## S3 method for class 'Predict'
ggplot(data, mapping, formula=NULL, groups=NULL,
       aestype=c('color', 'linetype'),
       conf=c('fill', 'lines'),
       conflinetype=1,
       varypred=FALSE,
       sepdiscrete=c('no', 'list', 'vertical', 'horizontal'),
       subset, xlim., ylim., xlab, ylab,
       colorscale=function(...) scale_color_manual(...,
         values=c("#000000", "#E69F00", "#56B4E9", "#009E73",
                  "#F0E442", "#0072B2", "#D55E00", "#CC79A7")),
       colfill='black',
       rdata=NULL, anova=NULL, pval=FALSE, size.anova=4,
       adj.subtitle, size.adj=2.5, perim=NULL, nlevels=3,
       flipxdiscrete=TRUE,
       legend.position='right', legend.label=NULL,
       vnames=c('labels','names'), abbrev=FALSE, minlength=6,
       layout=NULL, addlayer,
       histSpike.opts=list(frac=function(f) 0.01 +
         0.02 * sqrt(f - 1)/sqrt(max(f, 2) - 1), side=1, nint=100),
       type=NULL, ggexpr=FALSE, height=NULL, width=NULL, ...,
       environment)
data |
a data frame created by |
mapping |
kept because of |
formula |
a |
groups |
an optional character string containing the
name of one of the variables in |
aestype |
a string vector of aesthetic names corresponding to
variables in the |
conf |
specify |
conflinetype |
specify an alternative |
varypred |
set to |
sepdiscrete |
set to something other than |
subset |
a subsetting expression for restricting the rows of
|
xlim. |
This parameter is seldom used, as limits are usually controlled with
|
ylim. |
Range for plotting on response variable axis. Computed by default.
Usually specified using its legal definition |
xlab |
Label for |
ylab |
Label for |
colorscale |
a |
colfill |
a single character string or number specifying the fill color
to use for |
rdata |
a data frame containing the original raw data on which the
regression model were based, or at least containing the |
anova |
an object returned by |
pval |
specify |
size.anova |
character size for the test statistic printed on the panel, mm |
adj.subtitle |
Set to |
size.adj |
Size of adjustment settings in subtitles in mm. Default is 2.5. |
perim |
|
nlevels |
when |
flipxdiscrete |
see |
legend.position |
|
legend.label |
if omitted, group variable labels will be used for
label the legend. Specify |
vnames |
applies to the case where multiple plots are produced
separately by predictor. Set to |
abbrev |
set to true to abbreviate levels of predictors that are
categorical to a minimum length of |
minlength |
see |
layout |
for multi-panel plots a 2-vector specifying the number of rows and number of columns. If omitted will be computed from the number of panels to make as square as possible. |
addlayer |
a |
histSpike.opts |
a list containing named elements that specifies
parameters to |
type |
a value ( |
ggexpr |
set to |
height , width
|
used if |
... |
ignored |
environment |
ignored; used to satisfy rules because of the generic ggplot |
an object of class "ggplot2"
ready for printing. For the
case where predictors were not specified to Predict
,
sepdiscrete=TRUE
, and there were both continuous and discrete
predictors in the model, a list of two graphics objects is returned.
If plotting the effects of all predictors you can reorder the
panels using for example p <- Predict(fit); p$.predictor. <-
factor(p$.predictor., v)
where v
is a vector of predictor
names specified in the desired order.
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
Fox J, Hong J (2009): Effect displays in R for multinomial and proportional-odds logit models: Extensions to the effects package. J Stat Software 32 No. 1.
Predict
, rbind.Predict
,
datadist
, predictrms
, anova.rms
,
contrast.rms
, summary.rms
,
rms
, rmsMisc
, plot.Predict
,
labcurve
, histSpikeg
,
ggplot
, Overview
require(ggplot2)
n <- 350     # define sample size
set.seed(17) # so can reproduce the results
age            <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
cholesterol    <- rnorm(n, 200, 25)
sex            <- factor(sample(c('female','male'), n, TRUE))
label(age)            <- 'Age'    # label is in Hmisc
label(cholesterol)    <- 'Total Cholesterol'
label(blood.pressure) <- 'Systolic Blood Pressure'
label(sex)            <- 'Sex'
units(cholesterol)    <- 'mg/dl'  # uses units.default in Hmisc
units(blood.pressure) <- 'mmHg'

# Specify population model for log odds that Y=1
L <- .4*(sex=='male') + .045*(age-50) +
     (log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male')) +
     .01 * (blood.pressure - 120)
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)

ddist <- datadist(age, blood.pressure, cholesterol, sex)
options(datadist='ddist')

fit <- lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)),
           x=TRUE, y=TRUE)
an <- anova(fit)
# Plot effects in two vertical sub-panels with continuous predictors on top
# ggplot(Predict(fit), sepdiscrete='vertical')
# Plot effects of all 4 predictors with test statistics from anova, and P
ggplot(Predict(fit), anova=an, pval=TRUE)
# ggplot(Predict(fit), rdata=llist(blood.pressure, age))
# spike histogram plot for two of the predictors

# p <- Predict(fit, name=c('age','cholesterol'))   # Make 2 plots
# ggplot(p)

# p <- Predict(fit, age=seq(20,80,length=100), sex, conf.int=FALSE)
#                        # Plot relationship between age and log
#                        # odds, separate curve for each sex,
# ggplot(p, subset=sex=='female' | age > 30)
# No confidence interval, suppress estimates for males <= 30

# p <- Predict(fit, age, sex)
# ggplot(p, rdata=llist(age,sex))
# rdata= allows rug plots (1-dimensional scatterplots)
# on each sex's curve, with sex-specific density of age
# If data were in data frame could have used that

# p <- Predict(fit, age=seq(20,80,length=100), sex='male', fun=plogis)
# works if datadist not used
# ggplot(p, ylab=expression(hat(P)))
# plot predicted probability in place of log odds

# per <- function(x, y) x >= 30
# ggplot(p, perim=per)
# suppress output for age < 30 but leave scale alone

# Do ggplot2 faceting a few different ways
p <- Predict(fit, age, sex, blood.pressure=c(120,140,160),
             cholesterol=c(180,200,215))
# ggplot(p)
ggplot(p, cholesterol ~ blood.pressure)
# ggplot(p, ~ cholesterol + blood.pressure)
# color for sex, line type for blood.pressure:
ggplot(p, groups=c('sex', 'blood.pressure'))
# Add legend.position='top' to allow wider plot
# Map blood.pressure to line thickness instead of line type:
# ggplot(p, groups=c('sex', 'blood.pressure'), aestype=c('color', 'size'))

# Plot the age effect as an odds ratio
# comparing the age shown on the x-axis to age=30 years

# ddist$limits$age[2] <- 30   # make 30 the reference value for age
# Could also do: ddist$limits["Adjust to","age"] <- 30
# fit <- update(fit)          # make new reference value take effect
# p <- Predict(fit, age, ref.zero=TRUE, fun=exp)
# ggplot(p, ylab='Age=x:Age=30 Odds Ratio',
#        addlayer=geom_hline(yintercept=1, col=gray(.8)) +
#                 geom_vline(xintercept=30, col=gray(.8)) +
#                 scale_y_continuous(trans='log',
#                                    breaks=c(.5, 1, 2, 4, 8)))

# Compute predictions for three predictors, with superpositioning or
# conditioning on sex, combined into one graph
p1 <- Predict(fit, age, sex)
p2 <- Predict(fit, cholesterol, sex)
p3 <- Predict(fit, blood.pressure, sex)
p <- rbind(age=p1, cholesterol=p2, blood.pressure=p3)
ggplot(p, groups='sex', varypred=TRUE, adj.subtitle=FALSE)
# ggplot(p, groups='sex', varypred=TRUE,
#        adj.subtitle=FALSE, sepdiscrete='vert')

## Not run:
# For males at the median blood pressure and cholesterol, plot 3 types
# of confidence intervals for the probability on one plot, for varying age
ages <- seq(20, 80, length=100)
p1 <- Predict(fit, age=ages, sex='male', fun=plogis)   # standard pointwise
p2 <- Predict(fit, age=ages, sex='male', fun=plogis,
              conf.type='simultaneous')                # simultaneous
p3 <- Predict(fit, age=c(60,65,70), sex='male', fun=plogis,
              conf.type='simultaneous')                # simultaneous 3 pts
# The previous only adjusts for a multiplicity of 3 points instead of 100
f <- update(fit, x=TRUE, y=TRUE)
g <- bootcov(f, B=500, coef.reps=TRUE)
p4 <- Predict(g, age=ages, sex='male', fun=plogis)     # bootstrap percentile
p <- rbind(Pointwise=p1, 'Simultaneous 100 ages'=p2,
           'Simultaneous 3 ages'=p3, 'Bootstrap nonparametric'=p4)
# as.data.frame so will call built-in ggplot
ggplot(as.data.frame(p), aes(x=age, y=yhat)) + geom_line() +
  geom_ribbon(data=p, aes(ymin=lower, ymax=upper), alpha=0.2, linetype=0) +
  facet_wrap(~ .set., ncol=2)

# Plots for a parametric survival model
n <- 1000
set.seed(731)
age <- 50 + 12*rnorm(n)
label(age) <- "Age"
sex <- factor(sample(c('Male','Female'), n, rep=TRUE, prob=c(.6, .4)))
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50)+.8*(sex=='Female'))
t <- -log(runif(n))/h
label(t) <- 'Follow-up Time'
e <- ifelse(t<=cens, 1, 0)
t <- pmin(t, cens)
units(t) <- "Year"
ddist <- datadist(age, sex)
require(survival)
Srv <- Surv(t,e)
# Fit log-normal survival model and plot median survival time vs. age
f <- psm(Srv ~ rcs(age), dist='lognormal')
med <- Quantile(f)   # Creates function to compute quantiles
                     # (median by default)
p <- Predict(f, age, fun=function(x) med(lp=x))
ggplot(p, ylab="Median Survival Time")
# Note: confidence intervals from this method are approximate since
# they don't take into account estimation of scale parameter

# Fit an ols model to log(y) and plot the relationship between x1
# and the predicted mean(y) on the original scale without assuming
# normality of residuals; use the smearing estimator
# See help file for rbind.Predict for a method of showing two
# types of confidence intervals simultaneously.
# Add raw data scatterplot to graph
set.seed(1)
x1 <- runif(300)
x2 <- runif(300)
ddist <- datadist(x1, x2); options(datadist='ddist')
y <- exp(x1 + x2 - 1 + rnorm(300))
f <- ols(log(y) ~ pol(x1,2) + x2)
r <- resid(f)
smean <- function(yhat) smearingEst(yhat, exp, res, statistic='mean')
formals(smean) <- list(yhat=numeric(0), res=r[! is.na(r)])
# smean$res <- r[! is.na(r)]   # define default res argument to function
ggplot(Predict(f, x1, fun=smean), ylab='Predicted Mean on y-scale',
       addlayer=geom_point(aes(x=x1, y=y), data.frame(x1, y)))
# Had ggplot not added a subtitle (i.e., if x2 were not present), you
# could have done ggplot(Predict(), ylab=...) + geom_point(...)
## End(Not run)

# Make an 'interaction plot', forcing the x-axis variable to be
# plotted at integer values but labeled with category levels
n <- 100
set.seed(1)
gender <- c(rep('male', n), rep('female', n))
m <- sample(c('a','b'), 2*n, TRUE)
d <- datadist(gender, m); options(datadist='d')
anxiety <- runif(2*n) + .2*(gender=='female') + .4*(gender=='female' & m=='b')
tapply(anxiety, llist(gender,m), mean)
f <- ols(anxiety ~ gender*m)
p <- Predict(f, gender, m)
# ggplot(p)   # horizontal dot chart; usually preferred for categorical predictors
# ggplot(p, flipxdiscrete=FALSE)   # back to vertical
ggplot(p, groups='gender')
ggplot(p, ~ m, groups=FALSE, flipxdiscrete=FALSE)
options(datadist=NULL)

## Not run:
# Example in which separate curves are shown for 4 income values
# For each curve the estimated percentage of voters voting for
# the democratic party is plotted against the percent of voters
# who graduated from college. Data are county-level percents.
incomes <- seq(22900, 32800, length=4)
# equally spaced to outer quintiles
p <- Predict(f, college, income=incomes, conf.int=FALSE)
ggplot(p, xlim=c(0,35), ylim=c(30,55))

# Erase end portions of each curve where there are fewer than 10 counties
# having percent of college graduates to the left of the x-coordinate being
# plotted, for the subset of counties having median family income within
# 1650 of the target income for the curve
show.pts <- function(college.pts, income.pt) {
  s <- abs(income - income.pt) < 1650   # assumes income known to top frame
  x <- college[s]
  x <- sort(x[!is.na(x)])
  n <- length(x)
  low <- x[10]; high <- x[n-9]
  college.pts >= low & college.pts <= high
}
ggplot(p, xlim=c(0,35), ylim=c(30,55), perim=show.pts)

# Rename variables for better plotting of a long list of predictors
f <- ...
p <- Predict(f)
re <- c(trt='treatment', diabet='diabetes', sbp='systolic blood pressure')
for(n in names(re)) {
  names(p)[names(p)==n] <- re[n]
  p$.predictor.[p$.predictor.==n] <- re[n]
}
ggplot(p)
## End(Not run)
gIndex computes the total g-index for a model based on
the vector of linear predictors, and the partial g-index for
each predictor in a model. The latter is computed by summing all the
terms involving each variable, weighted by their regression
coefficients, then computing Gini's mean difference on this sum. For
example, a regression model having age and sex and age*sex on the
right-hand side, with corresponding regression coefficients b1, b2, b3,
will have the g-index for age computed from Gini's mean
difference on the product age*(b1 + b3*w), where w
is an indicator set to one for observations with sex not equal
to the reference value. When there are nonlinear terms associated
with a predictor, these terms will also be combined.
A print
method is defined, and there is a plot
method for displaying
g-indexes using a dot chart.
These functions use Hmisc::GiniMd
.
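Gini's mean difference is the mean absolute difference over all pairs of
distinct observations, so a quick numerical check is possible (a minimal
sketch):

require(Hmisc)
x <- c(1, 2, 3, 4)
GiniMd(x)   # about 1.6667
sum(abs(outer(x, x, '-'))) / (length(x) * (length(x) - 1))   # same quantity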
gIndex(object, partials=TRUE, type=c('ccterms', 'cterms', 'terms'),
       lplabel=if(length(object$scale) && is.character(object$scale))
         object$scale[1] else 'X*Beta',
       fun, funlabel=if(missing(fun)) character(0) else
         deparse(substitute(fun)),
       postfun=if(length(object$scale)==2) exp else NULL,
       postlabel=if(length(postfun))
         ifelse(missing(postfun),
                if((length(object$scale) > 1) &&
                   is.character(object$scale)) object$scale[2]
                else 'Anti-log',
                deparse(substitute(postfun))) else character(0),
       ...)

## S3 method for class 'gIndex'
print(x, digits=4, abbrev=FALSE, vnames=c("names","labels"), ...)

## S3 method for class 'gIndex'
plot(x, what=c('pre', 'post'), xlab=NULL, pch=16, rm.totals=FALSE,
     sort=c('descending', 'ascending', 'none'), ...)
object |
result of an |
partials |
set to |
type |
defaults to |
lplabel |
a replacement for default values such as
|
fun |
an optional function to transform the linear predictors
before computing the total (only) |
funlabel |
a character string label for |
postfun |
a function to transform |
postlabel |
a label for |
... |
For |
x |
an object created by |
digits |
causes rounding to the |
abbrev |
set to |
vnames |
set to |
what |
set to |
xlab |
|
pch |
plotting character for point |
rm.totals |
set to |
sort |
specifies how to sort predictors by |
For stratification factors in a Cox proportional hazards model, there is
no contribution of variation towards computing a partial g-index
except from terms that interact with the stratification variable.
gIndex
returns a matrix of class "gIndex"
with auxiliary
information stored as attributes, such as variable labels.
GiniMd
returns a scalar.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
David HA (1968): Gini's mean difference rediscovered. Biometrika 55:573–575.
set.seed(1)
n <- 40
x <- 1:n
w <- factor(sample(c('a','b'), n, TRUE))
u <- factor(sample(c('A','B'), n, TRUE))
y <- .01*x + .2*(w=='b') + .3*(u=='B') + .2*(w=='b' & u=='B') + rnorm(n)/5
dd <- datadist(x, w, u); options(datadist='dd')
f <- ols(y ~ x*w*u, x=TRUE, y=TRUE)
f
anova(f)
z <- list()
for(type in c('terms','cterms','ccterms')) {
  zc <- predict(f, type=type)
  cat('type:', type, '\n')
  print(zc)
  z[[type]] <- zc
}
zc <- z$cterms
GiniMd(zc[, 1])
GiniMd(zc[, 2])
GiniMd(zc[, 3])
GiniMd(f$linear.predictors)
g <- gIndex(f)
g
g['Total',]
gIndex(f, partials=FALSE)
gIndex(f, type='cterms')
gIndex(f, type='terms')

y <- y > .8
f <- lrm(y ~ x * w * u, x=TRUE, y=TRUE, reltol=1e-5)
gIndex(f, fun=plogis, funlabel='Prob[y=1]')

# Manual calculation of combined main effect + interaction effort of
# sex in a 2x2 design with treatments A B, sexes F M,
# model -.1 + .3*(treat=='B') + .5*(sex=='M') + .4*(treat=='B' & sex=='M')
set.seed(1)
X <- expand.grid(treat=c('A','B'), sex=c('F', 'M'))
a <- 3; b <- 7; c <- 13; d <- 5
X <- rbind(X[rep(1, a),], X[rep(2, b),], X[rep(3, c),], X[rep(4, d),])
y <- with(X, -.1 + .3*(treat=='B') + .5*(sex=='M') + .4*(treat=='B' & sex=='M'))
f <- ols(y ~ treat*sex, data=X, x=TRUE)
gIndex(f, type='cterms')
k <- coef(f)
b1 <- k[2]; b2 <- k[3]; b3 <- k[4]
n <- nrow(X)
( (a+b)*c*abs(b2) + (a+b)*d*abs(b2+b3) + c*d*abs(b3) ) / (n*(n-1)/2)

# Manual calculation for combined age effect in a model with sex,
# age, and age*sex interaction
a <- 13; b <- 7
sex <- c(rep('female',a), rep('male',b))
agef <- round(runif(a, 20, 30))
agem <- round(runif(b, 20, 40))
age <- c(agef, agem)
y <- (sex=='male') + age/10 - (sex=='male')*age/20
f <- ols(y ~ sex*age, x=TRUE)
f
gIndex(f, type='cterms')
k <- coef(f)
b1 <- k[2]; b2 <- k[3]; b3 <- k[4]
n <- a + b
sp <- function(w, z=w) sum(outer(w, z, function(u, v) abs(u-v)))
( abs(b2)*sp(agef) + abs(b2+b3)*sp(agem) +
  2*sp(b2*agef, (b2+b3)*agem) ) / (n*(n-1))
( abs(b2)*GiniMd(agef)*a*(a-1) + abs(b2+b3)*GiniMd(agem)*b*(b-1) +
  2*sp(b2*agef, (b2+b3)*agem) ) / (n*(n-1))

## Not run:
# Compare partial and total g-indexes over many random fits
plot(NA, NA, xlim=c(0,3), ylim=c(0,3), xlab='Global',
     ylab='x1 (black) x2 (red) x3 (green) x4 (blue)')
abline(a=0, b=1, col=gray(.9))
big <- integer(3)
n <- 50   # try with n=7 - see lots of exceptions esp. for interacting var
for(i in 1:100) {
  x1 <- runif(n)
  x2 <- runif(n)
  x3 <- runif(n)
  x4 <- runif(n)
  y  <- x1 + x2 + x3 + x4 + 2*runif(n)
  f <- ols(y ~ x1*x2+x3+x4, x=TRUE)
  # f <- ols(y ~ x1+x2+x3+x4, x=TRUE)   # also try this
  w <- gIndex(f)[,1]
  gt <- w['Total']
  points(gt, w['x1, x2'])
  points(gt, w['x3'], col='green')
  points(gt, w['x4'], col='blue')
  big[1] <- big[1] + (w['x1, x2'] > gt)
  big[2] <- big[2] + (w['x3'] > gt)
  big[3] <- big[3] + (w['x4'] > gt)
}
print(big)
## End(Not run)

options(datadist=NULL)
This function saves rms
attributes with the fit object so that
anova.rms
, Predict
, etc. can be used just as with ols
and other fits. No validate
or calibrate
methods exist for
Glm
though.
Glm(
  formula,
  family = gaussian,
  data = environment(formula),
  weights,
  subset,
  na.action = na.delete,
  start = NULL,
  offset = NULL,
  control = glm.control(...),
  model = TRUE,
  method = "glm.fit",
  x = FALSE,
  y = TRUE,
  contrasts = NULL,
  ...
)
formula , family , data , weights , subset , na.action , start , offset , control , model , method , x , y , contrasts
|
see |
... |
ignored |
For the print
method, format of output is controlled by the user
previously running options(prType="lang")
where lang
is
"plain"
(the default), "latex"
, or "html"
.
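For example:

options(prType='latex')   # subsequent print() calls emit LaTeX
options(prType='html')    # or HTML, e.g. inside Quarto / R Markdown
options(prType='plain')   # restore the default plain-text output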
a fit object like that produced by stats::glm()
but with
rms
attributes and a class
of "rms"
, "Glm"
,
"glm"
, and "lm"
. The g
element of the fit object is
the g-index.
stats::glm()
,Hmisc::GiniMd()
, prModFit()
, stats::residuals.glm
## Dobson (1990) Page 93: Randomized Controlled Trial :
counts    <- c(18,17,15,20,10,20,25,13,12)
outcome   <- gl(3,1,9)
treatment <- gl(3,3)
f <- glm(counts ~ outcome + treatment, family=poisson())
f
anova(f)
summary(f)
f <- Glm(counts ~ outcome + treatment, family=poisson())
# could have had rcs( ) etc. if there were continuous predictors
f
anova(f)
summary(f, outcome=c('1','2','3'), treatment=c('1','2','3'))
This function fits a linear model using generalized least
squares. The errors are allowed to be correlated and/or have unequal
variances. Gls
is a slightly enhanced version of the
Pinheiro and Bates gls
function in the nlme
package to
make it easy to use with the rms package and to implement cluster
bootstrapping (primarily for nonparametric estimates of the
variance-covariance matrix of the parameter estimates and for
nonparametric confidence limits of correlation parameters).
For the print
method, format of output is controlled by the
user previously running options(prType="lang")
where
lang
is "plain"
(the default), "latex"
, or
"html"
. When using html with Quarto or RMarkdown,
results='asis'
need not be written in the chunk header.
Gls(model, data, correlation, weights, subset, method, na.action=na.omit,
    control, verbose, B=0, dupCluster=FALSE, pr=FALSE, x=FALSE)

## S3 method for class 'Gls'
print(x, digits=4, coefs=TRUE, title, ...)
model |
a two-sided linear formula object describing the
model, with the response on the left of a |
data |
an optional data frame containing the variables named in
|
correlation |
an optional |
weights |
an optional |
subset |
an optional expression indicating which subset of the rows of
|
method |
a character string. If |
na.action |
a function that indicates what should happen when the
data contain |
control |
a list of control values for the estimation algorithm to
replace the default values returned by the function |
verbose |
an optional logical value. If |
B |
number of bootstrap resamples to fit and store, default is none |
dupCluster |
set to |
pr |
set to |
x |
for |
digits |
number of significant digits to print |
coefs |
specify |
title |
a character string title to be passed to |
... |
ignored |
The na.delete
function will not work with
Gls
due to some nuance in the model.frame.default
function. This probably relates to na.delete
storing extra
information in the "na.action"
attribute of the returned data
frame.
an object of classes Gls
, rms
, and gls
representing the linear model
fit. Generic functions such as print
, plot
,
ggplot
, and summary
have methods to show the results of
the fit. See
glsObject
for the components of the fit. The functions
resid
, coef
, and fitted
can be used to extract
some of its components. Gls
returns the following components
not returned by gls
: Design
, assign
,
formula
(see arguments), B
(see
arguments), bootCoef
(matrix of B
bootstrapped
coefficients), boot.Corr
(vector of bootstrapped correlation
parameters), Nboot
(vector of total sample size used in each
bootstrap; may vary if there are unbalanced clusters), and var
(sample variance-covariance matrix of bootstrapped coefficients). The
g-index is also stored in the returned object under the name
"g"
.
Jose Pinheiro, Douglas Bates [email protected], Saikat DebRoy, Deepayan Sarkar, R-core [email protected], Frank Harrell [email protected], Patrick Aboyoun
Pinheiro J, Bates D (2000): Mixed effects models in S and S-Plus. New York: Springer-Verlag.
gls
glsControl
, glsObject
,
varFunc
, corClasses
,
varClasses
, GiniMd
,
prModFit
, logLik.Gls
## Not run:
require(ggplot2)
ns  <- 20  # no. subjects
nt  <- 10  # no. time points/subject
B   <- 10  # no. bootstrap resamples
           # usually do 100 for variances, 1000 for nonparametric CLs
rho <- .5  # AR(1) correlation parameter
V <- matrix(0, nrow=nt, ncol=nt)
V <- rho^abs(row(V)-col(V))   # per-subject correlation/covariance matrix

d <- expand.grid(tim=1:nt, id=1:ns)
d$trt <- factor(ifelse(d$id <= ns/2, 'a', 'b'))
true.beta <- c(Intercept=0, tim=.1, 'tim^2'=0, 'trt=b'=1)
d$ey <- true.beta['Intercept'] + true.beta['tim']*d$tim +
        true.beta['tim^2']*(d$tim^2) + true.beta['trt=b']*(d$trt=='b')
set.seed(13)
library(MASS)   # needed for mvrnorm
d$y <- d$ey + as.vector(t(mvrnorm(n=ns, mu=rep(0,nt), Sigma=V)))

dd <- datadist(d)
options(datadist='dd')
f <- Gls(y ~ pol(tim,2) + trt, correlation=corCAR1(form= ~tim | id),
         data=d, B=B)
f
AIC(f)
f$var       # bootstrap variances
f$varBeta   # original variances
summary(f)
anova(f)
ggplot(Predict(f, tim, trt))
# v <- Variogram(f, form=~tim|id, data=d)
nlme:::summary.gls(f)$tTable   # print matrix of estimates etc.
options(datadist=NULL)
## End(Not run)
Function to divide x (e.g. age, or predicted survival at time u created by survest) into g quantile groups, get Kaplan-Meier estimates at time u (a scalar), and to return a matrix with columns x=mean x in quantile, n=number of subjects, events=no. events, and KM=K-M survival at time u, std.err = s.e. of -log K-M. Confidence intervals are based on -log S(t). Instead of supplying g, the user can supply the minimum number of subjects to have in the quantile group (m, default=50). If cuts is given (e.g. cuts=c(0,.1,.2,...,.9,1)), it overrides m and g. Calls Therneau's survfitKM in the survival package to get Kaplan-Meier estimates and standard errors.
groupkm(x, Srv, m=50, g, cuts, u, pl=FALSE, loglog=FALSE, conf.int=.95,
        xlab, ylab, lty=1, add=FALSE, cex.subtitle=.7, ...)
x |
variable to stratify |
Srv |
a |
m |
desired minimum number of observations in a group |
g |
number of quantile groups |
cuts |
actual cuts in |
u |
time for which to estimate survival |
pl |
TRUE to plot results |
loglog |
set to |
conf.int |
defaults to |
xlab |
if |
ylab |
if |
lty |
line type for primary line connecting estimates |
add |
set to |
cex.subtitle |
character size for subtitle. Default is |
... |
plotting parameters to pass to the plot and errbar functions |
matrix with columns named x
(mean predictor value in interval), n
(sample size
in interval), events
(number of events in interval), KM
(Kaplan-Meier
estimate), std.err
(standard error of -log KM
)
survfit
, errbar
,
cut2
, Surv
,
units
require(survival)
n <- 1000
set.seed(731)
age <- 50 + 12*rnorm(n)
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50))
d.time <- -log(runif(n))/h
label(d.time) <- 'Follow-up Time'
e <- ifelse(d.time <= cens, 1, 0)
d.time <- pmin(d.time, cens)
units(d.time) <- "Year"
groupkm(age, Surv(d.time, e), g=10, u=5, pl=TRUE)
# Plot 5-year K-M survival estimates and 0.95 confidence bars by
# decile of age.  If g=10 is omitted, groups will have >= 50 obs./group.
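Hedged variants of the call above (not in the original example) showing that cuts overrides m and g, and that m alone controls the minimum group size:

groupkm(age, Surv(d.time, e), cuts=c(0, 40, 50, 60, 120), u=5)  # fixed cut points
groupkm(age, Surv(d.time, e), m=100, u=5)   # >= 100 subjects per group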
The hazard.ratio.plot function repeatedly estimates Cox regression coefficients and confidence limits within time intervals. The log hazard ratios are plotted against the mean failure/censoring time within the interval. Unless times is specified, the number of time intervals will be max(round(d/e), 2), where d is the total number of events in the sample. Efron's likelihood is used for estimating Cox regression coefficients (using coxph.fit). In the case of tied failure times, some intervals may have a point in common.
hazard.ratio.plot(x, Srv, which, times=, e=30, subset, conf.int=.95,
                  legendloc=NULL, smooth=TRUE, pr=FALSE, pl=TRUE, add=FALSE,
                  ylim, cex=.5, xlab="t", ylab, antilog=FALSE, ...)
x |
a vector or matrix of predictors |
Srv |
a |
which |
a vector of column numbers of |
times |
optional vector of time interval endpoints.
Example: |
e |
number of events per time interval if times not given |
subset |
vector used for subsetting the entire analysis,
e.g. |
conf.int |
confidence interval coverage |
legendloc |
location for legend. Omit to use mouse, |
smooth |
also plot the super-smoothed version of the log hazard ratios |
pr |
defaults to |
pl |
defaults to |
add |
add this plot to an already existing plot |
ylim |
vector of |
cex |
character size for legend information, default is 0.5 |
xlab |
label for |
ylab |
label for |
antilog |
default is |
... |
optional graphical parameters |
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
cox.zph
, residuals.cph
,
survival-internal
, cph
,
coxph
, Surv
require(survival)
n <- 500
set.seed(1)
age <- 50 + 12*rnorm(n)
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50))
d.time <- -log(runif(n))/h
label(d.time) <- 'Follow-up Time'
e <- ifelse(d.time <= cens, 1, 0)
d.time <- pmin(d.time, cens)
units(d.time) <- "Year"
hazard.ratio.plot(age, Surv(d.time,e), e=20, legendloc='ll')
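A hedged variant of the call above (not in the original example), specifying interval endpoints directly and plotting hazard ratios rather than log hazard ratios:

hazard.ratio.plot(age, Surv(d.time, e), times=c(1, 2, 4, 8), antilog=TRUE)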
Creates several new variables which help set up a dataset for modeling with cph or coxph when there is a single binary time-dependent covariable which turns on at a given time, and stays on. This is typical when analyzing the impact of an intervening event. ie.setup creates a Surv object using the start time, stop time format. It also creates a binary indicator for the intervening event, and a variable called subs that is useful when attach-ing a data frame. subs has observation numbers duplicated for subjects having an intervening event, so those subjects' baseline covariables (that are not time-dependent) can be duplicated correctly.
ie.setup(failure.time, event, ie.time, break.ties=FALSE)
failure.time |
a numeric variable containing the event or censoring times for the terminating event |
event |
a binary (0/1) variable specifying whether observations had the terminating event (event=1) or were censored (event=0) |
ie.time |
intervening event times. For subjects having no intervening events, the corresponding values of ie.time must be NA. |
break.ties |
Occasionally intervening events are recorded as happening at exactly
the same time as the termination of follow-up for some subjects.
The |
a list with components S, ie.status, subs, reps
. S
is a
Surv
object containing start and stop times for intervals of observation,
along with event indicators. ie.status
is one if the intervening
event has occurred at the start of the interval, zero otherwise.
subs
is a vector of subscripts that can be used to replicate other
variables the same way S
was replicated. reps
specifies how many
times each original observation was replicated. S, ie.status, subs
are
all the same length (at least the number of rows for S
is) and are longer than the original failure.time
vector.
reps
is the same length as the original failure.time
vector.
The subs
vector is suitable for passing to validate.lrm
or calibrate
,
which pass this vector under the name cluster
on to predab.resample
so that bootstrapping can be
done by sampling with replacement from the original subjects rather than
from the individual records created by ie.setup
.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
cph
, coxph
,
Surv
, cr.setup
,
predab.resample
failure.time <- c(1,  2,   3)
event        <- c(1,  1,   0)
ie.time      <- c(NA, 1.5, 2.5)
z <- ie.setup(failure.time, event, ie.time)
S <- z$S
S
ie.status <- z$ie.status
ie.status
z$subs
z$reps
## Not run:
attach(input.data.frame[z$subs,])   # replicates all variables
f <- cph(S ~ age + sex + ie.status)
# Instead of duplicating rows of data frame, could do this:
attach(input.data.frame)
z <- ie.setup(failure.time, event, ie.time)
s <- z$subs
age <- age[s]
sex <- sex[s]
f <- cph(S ~ age + sex + ie.status)
## End(Not run)
Checks the impact of the proportional odds assumption by comparing predicted cell probabilities from a PO model with those from a multinomial or partial proportional odds logistic model that relaxes assumptions. For a given model formula, fits the model with lrm and with nnet::multinom, VGAM::vglm, or both, and obtains predicted cell probabilities for the PO and relaxed models on the newdata data frame. A print method formats the output.
impactPO(formula, relax = if (missing(nonpo)) "multinomial" else "both",
         nonpo, newdata, data = environment(formula), minfreq = 15, B = 0, ...)
formula |
a model formula. To work properly with |
relax |
defaults to |
nonpo |
a formula with no left hand side variable, specifying the variable or variables for which PO is not assumed. Specifying |
newdata |
a data frame or data table with one row per covariate setting for which predictions are to be made |
data |
data frame containing variables to fit; default is the frame in which |
minfreq |
minimum sample size to allow for the least frequent category of the dependent variable. If the observed minimum frequency is less than this, the |
B |
number of bootstrap resamples to do to get confidence intervals for differences in predicted probabilities for relaxed methods vs. PO model fits. Default is not to run the bootstrap. When running the bootstrap make sure that all model variables are explicitly in |
... |
other parameters to pass to |
Since partial proportional odds models and especially multinomial logistic models can have many parameters, it is not feasible to use this model comparison approach when the number of levels of the dependent variable Y is large. By default, the function will use Hmisc::combine.levels()
to combine consecutive levels if the lowest frequency category of Y has fewer than minfreq
observations.
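A rough, illustrative sketch of the level-pooling idea (assumes the Hmisc package is available; impactPO's actual criterion is the minfreq frequency threshold described above, and the data here are made up):

require(Hmisc)
y <- factor(c(rep('a', 100), rep('b', 80), rep('c', 4), rep('d', 5), rep('e', 60)))
table(y)                               # levels 'c' and 'd' are sparse
table(combine.levels(y, minlev=0.05))  # sparse levels are pooled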
an impactPO
object which is a list with elements estimates
, stats
, mad
, newdata
, nboot
, and boot
. estimates
is a data frame containing the variables and values in newdata
in a tall and thin format with additional variable method
("PO", "Multinomial", "PPO"), y
(current level of the dependent variable), and Probability
(predicted cell probability for covariate values and value of y
in the current row). stats
is a data frame containing Deviance
the model deviance, d.f.
the total number of parameters counting intercepts, AIC
, p
the number of regression coefficients, LR chi^2
the likelihood ratio chi-square statistic for testing the predictors, LR - p
a chance-corrected LR chi-square, LR chi^2 test for PO
the likelihood ratio chi-square test statistic for testing the PO assumption (by comparing -2 log likelihood for a relaxed model to that of a fully PO model), d.f.
the degrees of freedom for this test, Pr(>chi^2)
the P-value for this test, MCS R2
the Maddala-Cox-Snell R2 using the actual sample size, MCS R2 adj
(MCS R2
adjusted for estimating p
regression coefficients by subtracting p
from LR
), McFadden R2
, McFadden R2 adj
(an AIC-like adjustment proposed by McFadden without full justification), Mean |difference| from PO
the overall mean absolute difference between predicted probabilities over all categories of Y and over all covariate settings. mad
contains newdata
and separately by rows in newdata
the mean absolute difference (over Y categories) between estimated probabilities by the indicated relaxed model and those from the PO model. nboot
is the number of successful bootstrap repetitions, and boot
is a 4-way array with dimensions represented by the nboot
resamples, the number of rows in newdata
, the number of outcome levels, and elements for PPO
and multinomial
. For the modifications of the Maddala-Cox-Snell indexes see Hmisc::R2Measures
.
Frank Harrell [email protected]
nnet::multinom()
, VGAM::vglm()
, lrm()
, Hmisc::propsPO()
, Hmisc::R2Measures()
, Hmisc::combine.levels()
## Not run:
set.seed(1)
age <- rnorm(500, 50, 10)
sex <- sample(c('female', 'male'), 500, TRUE)
y   <- sample(0:4, 500, TRUE)
d   <- expand.grid(age=50, sex=c('female', 'male'))
w   <- impactPO(y ~ age + sex, nonpo = ~ sex, newdata=d)
w
# Note that PO model is a better model than multinomial (lower AIC)
# since multinomial model's improvement in fit is low in comparison
# with number of additional parameters estimated.  Same for PO model
# in comparison with partial PO model.

# Reverse levels of y so stacked bars have higher y located higher
revo <- function(z) {
  z <- as.factor(z)
  factor(z, levels=rev(levels(as.factor(z))))
}
require(ggplot2)
ggplot(w$estimates, aes(x=method, y=Probability, fill=revo(y))) +
  facet_wrap(~ sex) + geom_col() +
  xlab('') + guides(fill=guide_legend(title=''))

# Now vary 2 predictors
d <- expand.grid(sex=c('female', 'male'), age=c(40, 60))
w <- impactPO(y ~ age + sex, nonpo = ~ sex, newdata=d)
w
ggplot(w$estimates, aes(x=method, y=Probability, fill=revo(y))) +
  facet_grid(age ~ sex) + geom_col() +
  xlab('') + guides(fill=guide_legend(title=''))
## End(Not run)
Surv
and ggplot
are imported from, respectively, the
survival
and ggplot2
packages and are exported from
rms
so that the user does not have to attach these packages to do
simple things.
Surv(time, time2, event,
     type = c("right", "left", "interval", "counting", "interval2", "mstate"),
     origin = 0)

ggplot(data = NULL, mapping = aes(), ..., environment = parent.frame())
time , time2 , event , type , origin
|
see |
data , mapping , ... , environment
|
see |
see documentation in the original packages
## Not run:
f <- psm(Surv(dtime, death) ~ x1 + x2 + sex + race, dist='gau')
ggplot(Predict(f))
## End(Not run)
Processes three types of information matrices: ones produced by the SparseM package for the orm function in rms version 6.9-0 and earlier, by the Matrix package for version 7.0-0 of rms, or plain matrices. For Matrix, the input information matrix is a list with three elements: a, containing in two columns the diagonal and superdiagonal for intercepts; b, a square matrix for the covariates; and ab, for intercepts x covariates. If nothing else is specified, the assembled information matrix is returned for Matrix, or the original info otherwise. If np=TRUE, the number of parameters in the model (number of rows and columns in the whole information matrix) is returned. If i is given, the i elements of the inverse of info are returned, using efficient calculation to avoid inverting the whole matrix. Otherwise, if invert=TRUE or B is given without i, the efficiently (if Matrix or SparseM) inverted matrix is returned, or the matrix product of the inverse and B. If both i and B are given, what is returned is the i portion of the inverse of the information matrix, matrix-multiplied by B. This is done inside solve().
infoMxop(info, i, invert = !missing(i) || !missing(B), B, np = FALSE,
         tol = .Machine$double.eps, abort = TRUE)
info |
an information matrix object |
i |
integer vector specifying elements returned from the inverse. You can also specify |
invert |
set to |
B |
multiplier matrix |
np |
set to |
tol |
tolerance for matrix inversion singularity |
abort |
set to |
When only variance-covariance matrix elements corresponding to the non-intercepts are desired, specify i='x' or i=(k + 1) : nv, where k is the number of intercepts and nv is the number of intercepts and slopes combined. infoMxop computes the needed covariance matrix very quickly in this case.
When inverting info
, if info
has a 'scale'
attribute with elements mean
and sd
, the scaling is reversed after inverting info
.
a single integer or a matrix
Frank Harrell
## Not run:
f <- orm(y ~ x)
infoMxop(f$info.matrix)          # assembles 3 pieces
v <- f$info.matrix
infoMxop(v, i=c(2,4))            # returns a submatrix of the inverse of v
infoMxop(f$info.matrix, i='x')   # sub-covariance matrix for just the betas
## End(Not run)
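A small runnable sketch using a plain (non-sparse) information matrix, per the behaviors described above (the numeric values are arbitrary):

info <- matrix(c(4, 1, 1, 3), nrow=2)
infoMxop(info, np=TRUE)        # number of parameters: 2
infoMxop(info, invert=TRUE)    # full inverse, same as solve(info)
infoMxop(info, i=1)            # the [1,1] portion of the inverse only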
Creates a file containing a LaTeX representation of the fitted model.
## S3 method for class 'cph'
latex(object, title, file='', append=FALSE, surv=TRUE, maxt=FALSE, which=NULL,
      varnames, columns=65, inline=FALSE, before=if(inline)"" else "& &",
      after="", dec=3, pretrans=TRUE, caption, digits=.Options$digits,
      size="", ...)   # for cph fit

## S3 method for class 'lrm'
latex(object, title, file, append, which, varnames, columns, inline, before,
      after, pretrans, caption, digits=.Options$digits, size="", ...)  # for lrm fit

## S3 method for class 'ols'
latex(object, title, file, append, which, varnames, columns, inline, before,
      after, pretrans, caption, digits=.Options$digits, size="", ...)  # ols fit

## S3 method for class 'orm'
latex(object, title, file, append, which, varnames, columns, inline, before,
      after, pretrans, caption, digits=.Options$digits, size="",
      intercepts=nrp < 10, ...)   # for orm fit

## S3 method for class 'pphsm'
latex(object, title, file, append, which=NULL, varnames, columns, inline,
      before, after, pretrans, caption, digits=.Options$digits, size="", ...)
      # pphsm fit

## S3 method for class 'psm'
latex(object, title, file, append, which=NULL, varnames, columns, inline,
      before, after, pretrans, caption, digits=.Options$digits, size="", ...)
      # psm fit
object |
a fit object created by a |
title |
ignored |
file , append
|
see |
surv |
if |
maxt |
if the maximum follow-up time in the data ( |
which , varnames , columns , inline , before , dec , pretrans
|
see
|
after |
if not an empty string, added to end of markup if
|
caption |
a character string specifying a title for the equation to be centered and typeset in bold face. Default is no title. |
digits |
see latexrms |
size |
a LaTeX size to use, without the slash. Default is the prevailing size |
intercepts |
for |
... |
ignored |
the name of the created file, with class c("latex","file"). This object works with latex viewing and printing commands in Hmisc. If file='' and options(prType=x) is in effect, where x is "html", "markdown", or "md", the result is run through knitr::asis_output so that it will be rendered correctly no matter which options are in effect in the chunk header.
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
latexrms
, rcspline.restate
,
latex
## Not run:
require(survival)
units(ftime) <- "Day"
f <- cph(Surv(ftime, death) ~ rcs(age) + sex, surv=TRUE, time.inc=60)
w <- latex(f, file='f.tex')   # interprets fitted model and makes table of S0(t)
                              # for t=0,60,120,180,...
w          # displays image, if viewer installed and file given above
latex(f)   # send LaTeX code to the console for knitr
options(prType='html')
latex(f)   # for use with knitr and R Markdown/Quarto using MathJax
## End(Not run)
Creates a file containing a LaTeX representation of the fitted model. For model-specific typesetting there is latex.lrm, latex.cph, latex.psm, and latex.ols. latex.cph has some arguments that are specific to cph models. latexrms is the core function which is called internally by these methods (latex.cph, latex.ols, etc.). html and R Markdown-compatible markup (using MathJax) are written if options(prType='html').
latexrms(object, file='', append=FALSE, which=1:p, varnames, columns=65,
         prefix=NULL, inline=FALSE, before=if(inline)"" else "& &", after="",
         intercept, pretrans=TRUE, digits=.Options$digits, size="")
object |
a fit object created by a fitting function in the |
file |
name of |
append |
whether or not to append to an existing file |
which |
a vector of subscripts (corresponding to |
varnames |
variable names to substitute for non-interactions. Order must correspond
to |
columns |
maximum number of columns of printing characters to allow before outputting a LaTeX newline command |
prefix |
if given, a LaTeX \lefteqn command of the form |
inline |
Set to |
before |
a character string to place before each line of output. Use the default
for a LaTeX |
after |
a character string to place after the output if |
intercept |
a special intercept value to include that is not part of the standard
model parameters (e.g., centering constant in Cox model). Only allowed
in the |
pretrans |
if any spline or polynomial-expanded variables are themselves
transformed, a table of pre-transformations will be formed unless
|
digits |
number of digits of precision to use in formatting coefficients and other numbers |
size |
a LaTeX font size to use for the output, without the slash. Default is current size. |
latexrms
returns a character vector if file=''
,
otherwise writes the output to file
. For particular model
fits, the latex
method returns the result of running
knitr::asis_output
on the LaTeX or HTML code if file=''
,
options(prType)
was set but not to 'plain'
, and if
knitr
is currently running. This causes correct output to be
rendered whether or not results='asis'
appeared in the R
Markdown or Quarto chunk header.
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
## Not run:
f <- lrm(death ~ rcs(age) + sex)
w <- latex(f, file='f.tex')
w          # displays, using e.g. xdvi
latex(f)   # send LaTeX code to console, as for knitr
options(prType='html')
latex(f)   # emit html and latex for knitr html and html notebooks
## End(Not run)
Fit binary and proportional odds ordinal
logistic regression models using maximum likelihood estimation or
penalized maximum likelihood estimation. See cr.setup
for how to
fit forward continuation ratio models with lrm
. The fitting function used by lrm
is lrm.fit
,
for which details and comparisons of its various optimization methods may be found here.
For the print
method, format of output is controlled by the
user previously running options(prType="lang")
where
lang
is "plain"
(the default), "latex"
, or
"html"
. When using html with Quarto or RMarkdown,
results='asis'
need not be written in the chunk header.
lrm(formula, data=environment(formula), subset, na.action=na.delete,
    method="lrm.fit", model=FALSE, x=FALSE, y=FALSE, linear.predictors=TRUE,
    se.fit=FALSE, penalty=0, penalty.matrix, var.penalty, weights,
    normwt=FALSE, scale, ...)

## S3 method for class 'lrm'
print(x, digits=4, r2=c(0,2,4), coefs=TRUE, pg=FALSE,
      intercepts=x$non.slopes < 10,
      title='Logistic Regression Model', ...)
formula |
a formula object. An |
data |
data frame to use. Default is the current frame. |
subset |
logical expression or vector of subscripts defining a subset of observations to analyze |
na.action |
function to handle |
method |
name of fitting function. Only allowable choice at present is |
model |
causes the model frame to be returned in the fit object |
x |
causes the expanded design matrix (with missings excluded)
to be returned under the name |
y |
causes the response variable (with missings excluded) to be returned
under the name |
linear.predictors |
causes the predicted X beta (with missings excluded) to be returned
under the name |
se.fit |
causes the standard errors of the fitted values to be returned under
the name |
penalty |
The penalty factor subtracted from the log likelihood is
|
penalty.matrix |
specifies the symmetric penalty matrix for non-intercept terms.
The default matrix for continuous predictors has
the variance of the columns of the design matrix in its diagonal elements
so that the penalty to the log likelihood is unitless. For main effects
for categorical predictors with |
var.penalty |
deprecated and ignored |
weights |
a vector (same length as |
normwt |
set to |
scale |
deprecated; see |
... |
arguments that are passed to |
digits |
number of significant digits to use |
r2 |
vector of integers specifying which R^2 measures to print,
with 0 for Nagelkerke R^2 and 1:4 corresponding to the 4 measures
computed by |
coefs |
specify |
pg |
set to |
intercepts |
controls printing of intercepts. By default they are only printed if there aren't more than 10 of them. |
title |
a character string title to be passed to |
The returned fit object of lrm
contains the following components
in addition to the ones mentioned under the optional arguments.
call |
calling expression |
freq |
table of frequencies for |
stats |
vector with the following elements: number of observations used in the
fit, maximum absolute value of first
derivative of log likelihood, model likelihood ratio
|
fail |
set to |
coefficients |
estimated parameters |
var |
estimated variance-covariance matrix (inverse of information matrix).
If |
effective.df.diagonal |
is returned if |
u |
vector of first derivatives of log-likelihood |
deviance |
-2 log likelihoods (counting penalty components). When an offset variable is present, three deviances are computed: for intercept(s) only, for intercepts+offset, and for intercepts+offset+predictors. When there is no offset variable, the vector contains deviances for the intercept(s)-only model and the model with intercept(s) and predictors. |
est |
vector of column numbers of |
non.slopes |
number of intercepts in model |
penalty |
see above |
penalty.matrix |
the penalty matrix actually used in the estimation |
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
Le Cessie S, Van Houwelingen JC: Ridge estimators in logistic regression. Applied Statistics 41:191–201, 1992.
Verweij PJM, Van Houwelingen JC: Penalized likelihood in Cox regression. Stat in Med 13:2427–2436, 1994.
Gray RJ: Flexible methods for analyzing survival data using splines, with applications to breast cancer prognosis. JASA 87:942–951, 1992.
Shao J: Linear model selection by cross-validation. JASA 88:486–494, 1993.
Verweij PJM, Van Houwelingen JC: Crossvalidation in survival analysis. Stat in Med 12:2305–2314, 1993.
Harrell FE: Model uncertainty, penalization, and parsimony. ISCB Presentation on UVa Web page, 1998.
lrm.fit
, predict.lrm
,
rms.trans
, rms
, glm
,
latex.lrm
,
residuals.lrm
, na.delete
,
na.detail.response
,
pentrace
, rmsMisc
, vif
,
cr.setup
, predab.resample
,
validate.lrm
, calibrate
,
Mean.lrm
, gIndex
, prModFit
#Fit a logistic model containing predictors age, blood.pressure, sex
#and cholesterol, with age fitted with a smooth 5-knot restricted cubic
#spline function and a different shape of the age relationship for males
#and females.  As an intermediate step, predict mean cholesterol from
#age using a proportional odds ordinal logistic model
#
require(ggplot2)
n <- 1000    # define sample size
set.seed(17) # so can reproduce the results
age            <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
cholesterol    <- rnorm(n, 200, 25)
sex            <- factor(sample(c('female','male'), n,TRUE))
label(age)            <- 'Age'   # label is in Hmisc
label(cholesterol)    <- 'Total Cholesterol'
label(blood.pressure) <- 'Systolic Blood Pressure'
label(sex)            <- 'Sex'
units(cholesterol)    <- 'mg/dl' # uses units.default in Hmisc
units(blood.pressure) <- 'mmHg'

# Group cholesterol unnecessarily into 40-tiles
ch <- cut2(cholesterol, g=40, levels.mean=TRUE) # use mean values in intervals
table(ch)
f <- lrm(ch ~ age)
# options(prType='latex')
print(f)      # write latex code to console if prType='latex' is in effect
m <- Mean(f)  # see help file for Mean.lrm
d <- data.frame(age=seq(0,90,by=10))
m(predict(f, d))
# Repeat using ols
f <- ols(cholesterol ~ age)
predict(f, d)

# Specify population model for log odds that Y=1
L <- .4*(sex=='male') + .045*(age-50) +
     (log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)
cholesterol[1:3] <- NA   # 3 missings, at random

ddist <- datadist(age, blood.pressure, cholesterol, sex)
options(datadist='ddist')

fit <- lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)),
           x=TRUE, y=TRUE)
# x=TRUE, y=TRUE allows use of resid(), which.influence below
# could define d <- datadist(fit) after lrm(), but data distribution
# summary would not be stored with fit, so later uses of Predict
# or summary.rms would require access to the original dataset or
# d or specifying all variable values to summary, Predict, nomogram
anova(fit)
p <- Predict(fit, age, sex)
ggplot(p)   # or plot()
ggplot(Predict(fit, age=20:70, sex="male"))   # need if datadist not used
print(cbind(resid(fit,"dfbetas"), resid(fit,"dffits"))[1:20,])
which.influence(fit, .3)
# latex(fit)   # print nice statement of fitted model
#
# Repeat this fit using penalized MLE, penalizing complex terms
# (for nonlinear or interaction effects)
#
fitp <- update(fit, penalty=list(simple=0,nonlinear=10), x=TRUE, y=TRUE)
effective.df(fitp)
# or lrm(y ~ ..., penalty=...)

# Get fits for a variety of penalties and assess predictive accuracy
# in a new data set.  Program efficiently so that complex design
# matrices are only created once.

set.seed(201)
x1 <- rnorm(500)
x2 <- rnorm(500)
x3 <- sample(0:1,500,rep=TRUE)
L  <- x1+abs(x2)+x3
y  <- ifelse(runif(500)<=plogis(L), 1, 0)
new.data <- data.frame(x1,x2,x3,y)[301:500,]
#
for(penlty in seq(0,.15,by=.005)) {
  if(penlty==0) {
    f <- lrm(y ~ rcs(x1,4)+rcs(x2,6)*x3, subset=1:300, x=TRUE, y=TRUE)
    # True model is linear in x1 and has no interaction
    X <- f$x   # saves time for future runs - don't have to use rcs etc.
    Y <- f$y   # this also deletes rows with NAs (if there were any)
    penalty.matrix <- diag(diag(var(X)))
    Xnew <- predict(f, new.data, type="x")  # expand design matrix for new data
    Ynew <- new.data$y
  } else f <- lrm.fit(X,Y, penalty.matrix=penlty*penalty.matrix)
  #
  cat("\nPenalty :",penlty,"\n")
  pred.logit <- f$coef[1] + (Xnew %*% f$coef[-1])
  pred <- plogis(pred.logit)
  C.index <- somers2(pred, Ynew)["C"]
  Brier   <- mean((pred-Ynew)^2)
  Deviance<- -2*sum( Ynew*log(pred) + (1-Ynew)*log(1-pred) )
  cat("ROC area:",format(C.index)," Brier score:",format(Brier),
      " -2 Log L:",format(Deviance),"\n")
}
#penalty=0.045 gave lowest -2 Log L, Brier, ROC in test sample for S+
#
#Use bootstrap validation to estimate predictive accuracy of
#logistic models with various penalties
#To see how noisy cross-validation estimates can be, change the
#validate(f, ...) to validate(f, method="cross", B=10) for example.
#You will see tremendous variation in accuracy with minute changes in
#the penalty.  This comes from the error inherent in using 10-fold
#cross validation but also because we are not fixing the splits.
#20-fold cross validation was even worse for some
#indexes because of the small test sample size.  Stability would be
#obtained by using the same sample splits for all penalty values
#(see above), but then we wouldn't be sure that the choice of the
#best penalty is not specific to how the sample was split.  This
#problem is addressed in the last example.
#
penalties <- seq(0,.7,length=3) # really use by=.02
index <- matrix(NA, nrow=length(penalties), ncol=11,
          dimnames=list(format(penalties),
            c("Dxy","R2","Intercept","Slope","Emax","D","U","Q","B","g","gp")))
i <- 0
for(penlty in penalties) {
  cat(penlty, "")
  i <- i+1
  if(penlty==0) {
    f <- lrm(y ~ rcs(x1,4)+rcs(x2,6)*x3, x=TRUE, y=TRUE) # fit whole sample
    X <- f$x
    Y <- f$y
    penalty.matrix <- diag(diag(var(X)))   # save time - only do once
  } else
    f <- lrm(Y ~ X, penalty=penlty,
             penalty.matrix=penalty.matrix, x=TRUE,y=TRUE)
  val <- validate(f, method="boot", B=20)  # use larger B in practice
  index[i,] <- val[,"index.corrected"]
}
par(mfrow=c(3,3))
for(i in 1:9) {
  plot(penalties, index[,i],
       xlab="Penalty", ylab=dimnames(index)[[2]][i])
  lines(lowess(penalties, index[,i]))
}
options(datadist=NULL)

# Example of weighted analysis
x <- 1:5
y <- c(0,1,0,1,0)
reps <- c(1,2,3,2,1)
lrm(y ~ x, weights=reps)
x <- rep(x, reps)
y <- rep(y, reps)
lrm(y ~ x)   # same as above

#
#Study performance of a modified AIC which uses the effective d.f.
#See Verweij and Van Houwelingen (1994) Eq. (6).  Here AIC=chisq-2*df.
#Also try as effective d.f. equation (4) of the previous reference.
#Also study performance of Shao's cross-validation technique (which was
#designed to pick the "right" set of variables, and uses a much smaller
#training sample than most methods).  Compare cross-validated deviance
#vs. penalty to the gold standard accuracy on a 7500 observation dataset.
#Note that if you only want to get AIC or Schwarz Bayesian information
#criterion, all you need is to invoke the pentrace function.
#NOTE: the effective.df( ) function is used in practice
#
## Not run:
for(seed in c(339,777,22,111,3)) {
  # study performance for several datasets
  set.seed(seed)
  n <- 175; p <- 8
  X <- matrix(rnorm(n*p), ncol=p)          # p normal(0,1) predictors
  Coef <- c(-.1,.2,-.3,.4,-.5,.6,-.65,.7)  # true population coefficients
  L <- X %*% Coef                          # intercept is zero
  Y <- ifelse(runif(n)<=plogis(L), 1, 0)
  pm <- diag(diag(var(X)))
  #Generate a large validation sample to use as a gold standard
  n.val <- 7500
  X.val <- matrix(rnorm(n.val*p), ncol=p)
  L.val <- X.val %*% Coef
  Y.val <- ifelse(runif(n.val)<=plogis(L.val), 1, 0)
  #
  Penalty <- seq(0,30,by=1)
  reps <- length(Penalty)
  effective.df <- effective.df2 <- aic <- aic2 <-
    deviance.val <- Lpenalty <- single(reps)
  n.t <- round(n^.75)
  ncv <- c(10,20,30,40)   # try various no. of reps in cross-val.
  deviance <- matrix(NA,nrow=reps,ncol=length(ncv))
  #If model were complex, could have started things off by getting X, Y
  #penalty.matrix from an initial lrm fit to save time
  #
  for(i in 1:reps) {
    pen <- Penalty[i]
    cat(format(pen),"")
    f.full <- lrm.fit(X, Y, penalty.matrix=pen*pm)
    Lpenalty[i] <- pen* t(f.full$coef[-1]) %*% pm %*% f.full$coef[-1]
    f.full.nopenalty <- lrm.fit(X, Y, initial=f.full$coef, maxit=1)
    info.matrix.unpenalized <- solve(f.full.nopenalty$var)
    effective.df[i] <- sum(diag(info.matrix.unpenalized %*% f.full$var)) - 1
    lrchisq <- f.full.nopenalty$stats["Model L.R."]
    # lrm does all this penalty adjustment automatically (for var, d.f.,
    # chi-square)
    aic[i] <- lrchisq - 2*effective.df[i]
    #
    pred <- plogis(f.full$linear.predictors)
    score.matrix <- cbind(1,X) * (Y - pred)
    sum.u.uprime <- t(score.matrix) %*% score.matrix
    effective.df2[i] <- sum(diag(f.full$var %*% sum.u.uprime))
    aic2[i] <- lrchisq - 2*effective.df2[i]
    #
    #Shao suggested averaging 2*n cross-validations, but let's do only 40
    #and stop along the way to see if fewer is OK
    dev <- 0
    for(j in 1:max(ncv)) {
      s <- sample(1:n, n.t)
      cof <- lrm.fit(X[s,],Y[s], penalty.matrix=pen*pm)$coef
      pred <- cof[1] + (X[-s,] %*% cof[-1])
      dev <- dev -2*sum(Y[-s]*pred + log(1-plogis(pred)))
      for(k in 1:length(ncv)) if(j==ncv[k]) deviance[i,k] <- dev/j
    }
    #
    pred.val <- f.full$coef[1] + (X.val %*% f.full$coef[-1])
    prob.val <- plogis(pred.val)
    deviance.val[i] <- -2*sum(Y.val*pred.val + log(1-prob.val))
  }
  postscript(hor=TRUE)   # along with graphics.off() below, allow plots
  par(mfrow=c(2,4))      # to be printed as they are finished
  plot(Penalty, effective.df, type="l")
  lines(Penalty, effective.df2, lty=2)
  plot(Penalty, Lpenalty, type="l")
  title("Penalty on -2 log L")
  plot(Penalty, aic, type="l")
  lines(Penalty, aic2, lty=2)
  for(k in 1:length(ncv)) {
    plot(Penalty, deviance[,k], ylab="deviance")
    title(paste(ncv[k],"reps"))
    lines(supsmu(Penalty, deviance[,k]))
  }
  plot(Penalty, deviance.val, type="l")
  title("Gold Standard (n=7500)")
  title(sub=format(seed),adj=1,cex=.5)
  graphics.off()
}
## End(Not run)
#The results showed that to obtain a clear picture of the penalty-
#accuracy relationship one needs 30 or 40 reps in the cross-validation.
#For 4 of 5 samples, though, the super smoother was able to detect
#an accurate penalty giving the best (lowest) deviance using 10-fold
#cross-validation.  Cross-validation would have worked better had
#the same splits been used for all penalties.
#The AIC methods worked just as well and are much quicker to compute.
#The first AIC based on the effective d.f. in Gray's Eq. 2.9
#(Verweij and Van Houwelingen (1994) Eq. 5 (note typo)) worked best.
Logistic Model Fitter
lrm.fit(x, y, offset = 0, initial,
        opt_method = c("NR", "nlminb", "LM", "glm.fit", "nlm", "BFGS",
                       "L-BFGS-B", "CG", "Nelder-Mead"),
        maxit = 50, reltol = 1e-10,
        abstol  = if (opt_method %in% c("NR", "LM")) 1e+10 else 0,
        gradtol = if (opt_method %in% c("NR", "LM")) 0.001 else 1e-05,
        factr = 1e+07, eps = 5e-04, minstepsize = 0.01, trace = 0,
        tol = .Machine$double.eps, penalty.matrix = NULL, weights = NULL,
        normwt = FALSE, transx = FALSE, compstats = TRUE, inclpen = TRUE,
        initglm = FALSE, y.precision = 7)
x |
design matrix with no column for an intercept. If a vector, it is transformed to a one-column matrix. |
y |
response vector, numeric, categorical, or character. For ordinal regression, the order of categories comes from |
offset |
optional numeric vector containing an offset on the logit scale |
initial |
vector of initial parameter estimates, beginning with the intercepts |
opt_method |
optimization method, with possible values
|
maxit |
maximum number of iterations allowed, which means different things for different |
reltol |
used by |
abstol |
used by |
gradtol |
used by |
factr |
see |
eps |
difference in -2 log likelihood for declaring convergence with |
minstepsize |
used with |
trace |
set to a positive integer to trace the iterative process. Some optimization methods distinguish |
tol |
QR singularity criterion for |
penalty.matrix |
a self-contained ready-to-use penalty matrix - see |
weights |
a vector (same length as |
normwt |
set to |
transx |
set to |
compstats |
set to |
inclpen |
set to |
initglm |
set to |
y.precision |
when |
Fits a binary or proportional odds ordinal logistic model for a given design matrix and response vector with no missing values in either. Ordinary or quadratic penalized maximum likelihood estimation is used.

lrm.fit implements a large number of optimization algorithms, the default being Newton-Raphson with step-halving. For binary logistic regression without penalization, the iteratively reweighted least squares method in stats::glm.fit() is an option. The -2 log likelihood, gradient, and Hessian (negative information) matrix are computed in Fortran for speed. Optionally, the x matrix is mean-centered and QR-factored to help in optimization when there are strong collinearities. Parameter estimates and the covariance matrix are adjusted to the original x scale after fitting. More detail and comparisons of the various optimization methods may be found here. For ordinal regression with a large number of intercepts (distinct y values less one) you may want to use opt_method='BFGS', which does away with the need to compute the Hessian. This will be helpful if statistical tests and confidence intervals are not being computed, or when only likelihood ratio tests are done.

When using Newton-Raphson or Levenberg-Marquardt optimization, sparse Hessian/information/variance-covariance matrices are used throughout. For nlminb the Hessian has to be expanded into full non-sparse form, so nlminb will not be very efficient for a large number of intercepts.

When there is complete separation (Hauck-Donner condition), i.e., the MLE of a coefficient is ±∞, and y is binary and there is no penalty, glm.fit may not converge because it does not have a convergence parameter for the deviance. Setting trace=1 will reveal that the -2LL is approaching zero but doesn't get there, relatively speaking. In such cases the default of NR with eps=5e-4 or using nlminb with its default of abstol=0.001 works well.
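A hedged sketch of complete separation with tiny made-up data, following the recommendation above (illustrative only, not from the original documentation):

## Not run:
x <- c(-2, -1, -0.5, 0.5, 1, 2)   # x perfectly separates y, so the
y <- c( 0,  0,  0,   1,   1, 1)   # MLE of the slope is +infinity
f <- lrm.fit(x, y, opt_method='nlminb', trace=1)
f$fail
## End(Not run)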
a list with the following elements:
call
: the R call to lrm.fit
freq
: vector of y
frequencies
ymedian
: median of original y
values if y
is numeric, otherwise the median of the integer-recorded version of y
yunique
: vector of distinct original y
values, subject to rounding
sumty
: vector of weighted y
frequencies
stats
: vector with a large number of indexes and model parameters (NULL
if compstats=FALSE
):
Obs
: number of observations
Max Deriv
: maximum absolute gradient
Model L.R.
: overall model LR chi-square statistic
d.f.
: degrees of freedom (number of non-intercepts)
P
: p-value for the overall Model L.R.
and d.f.
C
: concordance probability between predicted probability and y
Dxy
: Somers' Dxy rank correlation between predicted probability and y
, = 2(C - 0.5)
Gamma
:
Tau-a
:
R2
: documented here; the first element, with the plain 'R2'
name is Nagelkerke's
Brier
: Brier score. For ordinal models this is computed with respect to the median intercept.
g
: g-index (Gini's mean difference of linear predictors)
gr
: g-index on the odds ratio scale
gp
: g-index on the probability scale
fail
: TRUE
if any matrix inversion or failure to converge occurred, FALSE
otherwise
coefficients
:
info.matrix
: a list of 3 elements a, b, and ab, with a being a k x 2 matrix for the k intercepts, b being p x p for the p predictors, and ab being k x p. See infoMxop() for easy ways of operating on these 3 elements.
u
: gradient vector
iter
: number of iterations required. For some optimization methods this is a vector.
deviance
: vector of deviances: intercepts-only, intercepts + offset (if offset
is present), final model (if x
is used)
non.slopes
: number of intercepts in the model
linear.predictors
: vector of linear predictors at the median intercept
penalty.matrix
: penalty matrix or NULL
weights
: weights
or NULL
xbar
: vector of column means of x
, or NULL
if transx=FALSE
xtrans
: input value of transx
R
: R matrix from QR to be used to rotate parameters back to original scale in the future
Ri
: inverse of R
opt_method
: input value
Frank Harrell [email protected]
lrm()
, stats::glm()
, cr.setup()
, gIndex()
, stats::optim()
, stats::nlminb()
, stats::nlm()
,stats::glm.fit()
, recode2integer()
, Hmisc::qrxcenter()
, infoMxop()
## Not run:
# Fit an additive logistic model containing numeric predictors age,
# blood.pressure, and sex, assumed to be already properly coded and
# transformed
fit <- lrm.fit(cbind(age, blood.pressure, sex=='male'), death)
## End(Not run)
Update Model LR Statistics After Multiple Imputation
LRupdate(fit, anova)
fit |
an |
anova |
the result of |
For fits from lrm, orm, cph, or psm
that were created using fit.mult.impute
with lrt=TRUE
or equivalent options, and for which anova
was obtained using processMI(fit, 'anova')
to compute imputation-adjusted LR statistics, LRupdate
uses the last line of the anova
result (containing the overall model LR chi-square) to update Model L.R.
in the fit's stats
component, and to adjust any of the new R-square measures in stats
.
For models using Nagelkerke's R-squared, it is set to NA
as it would need to be recomputed with a new intercept-only log-likelihood, which is not computed by anova
. For ols
models, R-squared is left alone as it is sample-size-independent and print.ols
prints the correct adjusted R-squared due to fit.mult.impute
correcting the residual d.f. in stacked fits.
new fit object like fit
but with the substitutions made
Frank Harrell
processMI.fit.mult.impute()
, Hmisc::R2Measures()
## Not run:
a <- aregImpute(~ y + x1 + x2, n.impute=30, data=d)
f <- fit.mult.impute(y ~ x1 + x2, lrm, a, data=d, lrt=TRUE)
a <- processMI(f, 'anova')
f <- LRupdate(f, a)
print(f, r2=1:4)   # print all imputation-corrected R2 measures
## End(Not run)
This function inverts or partially inverts a matrix using pivoting (the sweep operator). It is useful for sequential model-building.
matinv(a, which, negate=TRUE, eps=1e-12)
a |
square matrix to invert or partially invert. May have been inverted or partially inverted previously by matinv, in which case its "swept" attribute is updated. Will un-invert if already inverted. |
which |
vector of column/row numbers in a to invert. Default is all, for total inverse. |
negate |
So that the algorithm can keep track of which pivots have been swept as well as roundoff errors, it actually returns the negative of the inverse or partial inverse. By default, these elements are negated to give the usual expected result. Set negate=FALSE if you will be passing the result right back into matinv; otherwise, negate the submatrix before sending it back to matinv.
eps |
singularity criterion |
a square matrix, with attributes "rank" and "swept".
Clarke MRB (1982). Algorithm AS 178: The Gauss-Jordan sweep operator with detection of collinearity. Appl Statist 31:166–9.
Ridout MS, Cobb JM (1986). Algorithm AS R78: A remark on algorithm AS 178: The Gauss-Jordan sweep operator with detection of collinearity. Appl Statist 38:420–2.
a <- diag(1:3)
a.inv1 <- matinv(a, 1, negate=FALSE)          # invert with respect to a[1,1]
a.inv1
a.inv <- -matinv(a.inv1, 2:3, negate=FALSE)   # finish the job
a.inv
solve(a)
Draws a partial nomogram that can be used to manually obtain predicted
values from a regression model that was fitted with rms
.
The nomogram does not have lines representing sums, but it has a reference
line for reading scoring points (default range 0–100). Once the reader
manually totals the points, the predicted values can be read at the bottom.
Non-monotonic transformations of continuous variables are handled (scales
wrap around), as
are transformations which have flat sections (tick marks are labeled
with ranges). If interactions are in the model, one variable
is picked as the “axis variable”, and separate axes are constructed for
each level of the interacting factors (preference is given automatically
to using any discrete factors to construct separate axes) and
levels of factors which are indirectly related to interacting
factors (see DETAILS). Thus the nomogram is designed so that only
one axis is actually read for each variable, since the variable
combinations are disjoint. For
categorical interacting factors, the default is to construct axes for
all levels.
The user may specify
coordinates of each predictor to label on its axis, or use default values.
If a factor interacts with other factors, settings for one or more of
the interacting factors may be specified separately (this is mandatory
for continuous variables). Optional confidence intervals will be
drawn for individual scores as well as for the linear predictor.
If more than one confidence level is chosen, multiple levels may be
displayed using different colors or gray scales. Functions of the
linear predictors may be added to the nomogram.
The datadist
object that was in effect when the model
was fit is used to specify the limits of the axis for continuous
predictors when the user does not specify tick mark locations in the
nomogram
call.
print.nomogram
prints axis information stored in an object returned
by nomogram
. This is useful in producing tables of point assignments
by levels of predictors. It also prints how many linear predictor
units there are per point and the number of points per unit change in
the linear predictor.
legend.nomabbrev
draws legends describing abbreviations used for
labeling tick marks for levels of categorical predictors.
nomogram(fit, ..., adj.to, lp=TRUE, lp.at=NULL, fun=NULL,
         fun.at=NULL, fun.lp.at=NULL, funlabel="Predicted Value",
         interact=NULL, kint=NULL, conf.int=FALSE,
         conf.lp=c("representative", "all", "none"),
         est.all=TRUE, posterior.summary=c('mean', 'median', 'mode'),
         abbrev=FALSE, minlength=4, maxscale=100, nint=10,
         vnames=c("labels","names"),
         varname.label=TRUE, varname.label.sep="=",
         omit=NULL, verbose=FALSE)

## S3 method for class 'nomogram'
print(x, dec=0, ...)

## S3 method for class 'nomogram'
plot(x, lplabel="Linear Predictor", fun.side,
     col.conf=c(1, 0.3), conf.space=c(.08,.2),
     label.every=1, force.label=FALSE,
     xfrac=.35, cex.axis=.85, cex.var=1,
     col.grid=NULL, varname.label=TRUE, varname.label.sep="=",
     ia.space=.7, tck=NA, tcl=-0.25, lmgp=.4, naxes,
     points.label='Points', total.points.label='Total Points',
     total.sep.page=FALSE, total.fun, cap.labels=FALSE, ...)

legend.nomabbrev(object, which, x, y, ncol=3, ...)
fit |
a regression model fit that was created with |
... |
settings of variables to use in constructing axes. If |
adj.to |
If you didn't define |
lp |
Set to |
lp.at |
If |
fun |
an optional function to transform the linear predictors, and to plot
on another axis. If more than one transformation is plotted, put
them in a list, e.g. |
fun.at |
function values to label on axis. Default |
fun.lp.at |
If you want to
evaluate one of the functions at a different set of linear predictor
values than may have been used in constructing the linear predictor axis,
specify a vector or list of vectors
of linear predictor values at which to evaluate the function. This is
especially useful for discrete functions. The presence of this attribute
also does away with the need for |
funlabel |
label for |
interact |
When a continuous variable interacts with a discrete one, axes are
constructed so that the continuous variable moves within the axis, and
separate axes represent levels of interacting factors. For interactions
between two continuous variables, all but the axis variable must have
discrete levels defined in |
kint |
for models such as the ordinal models with multiple intercepts,
specifies which one to use in evaluating the linear predictor.
Default is to use |
conf.int |
confidence levels to display for each scoring. Default is |
conf.lp |
default is |
est.all |
To plot axes for only the subset of variables named in |
posterior.summary |
when operating on a Bayesian model such as a
result of |
abbrev |
Set to |
minlength |
applies if |
maxscale |
default maximum point score is 100 |
nint |
number of intervals to label for axes representing continuous variables.
See |
vnames |
By default, variable labels are used to label axes. Set
|
omit |
vector of character strings containing names of variables for which to suppress drawing axes. Default is to show all variables. |
verbose |
set to |
x |
an object created by |
dec |
number of digits to the right of the decimal point, for rounding
point scores in |
lplabel |
label for linear predictor axis. Default is |
fun.side |
a vector or list of vectors of |
col.conf |
colors corresponding to |
conf.space |
a 2-element vector with the vertical range within which to draw confidence bars, in units of 1=spacing between main bars. Four heights are used within this range (8 for the linear predictor if more than 16 unique values were evaluated), cycling them among separate confidence intervals to reduce overlapping. |
label.every |
Specify |
force.label |
set to |
xfrac |
fraction of horizontal plot to set aside for axis titles |
cex.axis |
character size for tick mark labels |
cex.var |
character size for axis titles (variable names) |
col.grid |
If left unspecified, no vertical reference lines are drawn. Specify a
vector of length one (to use the same color for both minor and major
reference lines) or two (corresponding to the color for the major and
minor divisions, respectively) containing colors, to cause vertical reference
lines to the top points scale to be drawn. For R, a good choice is
|
varname.label |
In constructing axis titles for interactions, the default is to add
|
varname.label.sep |
If |
ia.space |
When multiple axes are drawn for levels of interacting factors, the default is to group combinations related to a main effect. This is done by spacing the axes for the second to last of these within a group only 0.7 (by default) of the way down as compared with normal space of 1 unit.
tck |
see |
tcl |
length of tick marks in nomogram |
lmgp |
spacing between numeric axis labels and axis (see |
naxes |
maximum number of axes to allow on one plot. If the nomogram requires more than one “page”, the “Points” axis will be repeated at the top of each page when necessary. |
points.label |
a character string giving the axis label for the points scale |
total.points.label |
a character string giving the axis label for the total points scale |
total.sep.page |
set to |
total.fun |
a user-provided function that will be executed before the total points
axis is drawn. Default is not to execute a function. This is useful e.g.
when |
cap.labels |
logical: should the factor labels have their first letter capitalized? |
object |
the result returned from |
which |
a character string giving the name of a variable for which to draw a legend with abbreviations of factor levels |
y |
y-coordinate to pass to the |
ncol |
the number of columns to form in drawing the legend. |
A variable is considered to be discrete if it is categorical or ordered
or if datadist
stored values
for it (meaning it
had <11
unique values).
A variable is said to be indirectly related to another variable if
the two are related by some interaction. For example, if a model
has variables a, b, c, d, and the interactions are a:c and c:d,
variable d is indirectly related to variable a. The complete list
of variables related to a is c, d. If an axis is made for variable a,
several axes will actually be drawn, one for each combination of c
and d specified in interact
.
Note that with a caliper, it is easy to continually add point scores for individual predictors, and then to place the caliper on the upper “Points” axis (with extrapolation if needed). Then transfer these points to the “Total Points” axis. In this way, points can be added without writing them down.
Confidence limits for an individual predictor score are really confidence
limits for the entire linear predictor, with other predictors set to
adjustment values. If lp = TRUE
, all confidence bars for all linear
predictor values evaluated are drawn. The extent to which multiple
confidence bars of differing widths appear at the same linear predictor
value means that precision depended on how the linear predictor was
arrived at (e.g., a certain value may be realized from a setting of
a certain predictor that was associated with a large standard error
on the regression coefficients for that predictor).
On occasion, you may want to reverse the regression coefficients of a model
to make the “points” scales reverse direction. For parametric survival
models, which are stated in terms of increasing regression effects meaning
longer survival (the opposite of a Cox model), just do something like
fit$coefficients <- -fit$coefficients
before invoking nomogram
,
and if you add function axes, negate the function
arguments. For the Cox model, you also need to negate fit$center
.
If you omit lp.at
, also negate fit$linear.predictors
.
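A minimal sketch of that reversal for a hypothetical parametric survival fit f with no added function axes:

f$coefficients      <- -f$coefficients
f$linear.predictors <- -f$linear.predictors   # needed because lp.at is omitted
# For a Cox model, also: f$center <- -f$center
plot(nomogram(f))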
a list of class "nomogram"
that contains information used in plotting
the axes. If you specified abbrev = TRUE
, a list called abbrev
is also
returned that gives the abbreviations used for tick mark labels, if any.
This list is useful for
making legends and is used by legend.nomabbrev
(see the last example).
The returned list also has components called total.points
, lp
,
and the function axis names. These components have components
x
(at
argument vector given to axis
), y
(pos
for axis
),
and x.real
, the x-coordinates appearing on tick mark labels.
An often useful result is stored in the list of data for each axis variable,
namely the exact number of points that correspond to each tick mark on
that variable's axis.
Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]
Banks J: Nomograms. Encyclopedia of Statistical Sciences, Vol 6. Editors: S Kotz and NL Johnson. New York: Wiley; 1985.
Lubsen J, Pool J, van der Does E: A practical device for the application of a diagnostic or prognostic function. Meth. Inform. Med. 17:127–129; 1978.
Wikipedia: Nomogram, https://en.wikipedia.org/wiki/Nomogram.
rms
, plot.Predict
,
ggplot.Predict
, plot.summary.rms
,
axis
, pretty
, approx
,
latexrms
, rmsMisc
n <- 1000    # define sample size
set.seed(17) # so can reproduce the results
d <- data.frame(age            = rnorm(n, 50, 10),
                blood.pressure = rnorm(n, 120, 15),
                cholesterol    = rnorm(n, 200, 25),
                sex            = factor(sample(c('female','male'), n, TRUE)))

# Specify population model for log odds that Y=1
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
d <- upData(d,
  L = .4*(sex=='male') + .045*(age-50) +
      (log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male')),
  y = ifelse(runif(n) < plogis(L), 1, 0))

ddist <- datadist(d); options(datadist='ddist')

f <- lrm(y ~ lsp(age,50) + sex * rcs(cholesterol, 4) + blood.pressure,
         data=d)
nom <- nomogram(f, fun=function(x) 1/(1+exp(-x)),  # or fun=plogis
                fun.at=c(.001,.01,.05,seq(.1,.9,by=.1),.95,.99,.999),
                funlabel="Risk of Death")
# Instead of fun.at, could have specified fun.lp.at=logit of
# sequence above - faster and slightly more accurate
plot(nom, xfrac=.45)
print(nom)
nom <- nomogram(f, age=seq(10,90,by=10))
plot(nom, xfrac=.45)
g <- lrm(y ~ sex + rcs(age, 3) * rcs(cholesterol, 3), data=d)
nom <- nomogram(g, interact=list(age=c(20,40,60)),
                conf.int=c(.7,.9,.95))
plot(nom, col.conf=c(1,.5,.2), naxes=7)

require(survival)
w <- upData(d,
            cens   = 15 * runif(n),
            h      = .02 * exp(.04 * (age - 50) + .8 * (sex == 'Female')),
            d.time = -log(runif(n)) / h,
            death  = ifelse(d.time <= cens, 1, 0),
            d.time = pmin(d.time, cens))
f <- psm(Surv(d.time,death) ~ sex * age, data=w, dist='lognormal')
med  <- Quantile(f)
surv <- Survival(f)  # This would also work if f was from cph
plot(nomogram(f, fun=function(x) med(lp=x),
              funlabel="Median Survival Time"))
nom <- nomogram(f, fun=list(function(x) surv(3, x),
                            function(x) surv(6, x)),
                funlabel=c("3-Month Survival Probability",
                           "6-month Survival Probability"))
plot(nom, xfrac=.7)

## Not run:
nom <- nomogram(fit.with.categorical.predictors, abbrev=TRUE, minlength=1)
nom$x1$points  # print points assigned to each level of x1 for its axis
# Add legend for abbreviations for category levels
abb <- attr(nom, 'info')$abbrev$treatment
legend(locator(1), abb$full, pch=paste(abb$abbrev, collapse=''),
       ncol=2, bty='n')  # this only works for 1-letter abbreviations
# Or use the legend.nomabbrev function:
legend.nomabbrev(nom, 'treatment', locator(1), ncol=2, bty='n')
## End(Not run)

# Make a nomogram with axes predicting probabilities Y>=j for all j=1-3
# in an ordinal logistic model, where Y=0,1,2,3
w <- upData(w, Y = ifelse(y==0, 0, sample(1:3, length(y), TRUE)))
g <- lrm(Y ~ age + rcs(cholesterol,4) * sex, data=w)
fun2 <- function(x) plogis(x - g$coef[1] + g$coef[2])
fun3 <- function(x) plogis(x - g$coef[1] + g$coef[3])
f <- Newlabels(g, c(age='Age in Years'))
# see Design.Misc, which also has Newlevels to change
# labels for levels of categorical variables
g <- nomogram(f, fun=list('Prob Y>=1'=plogis, 'Prob Y>=2'=fun2,
                          'Prob Y=3'=fun3),
              fun.at=c(.01,.05,seq(.1,.9,by=.1),.95,.99))
plot(g, lmgp=.2, cex.axis=.6)
options(datadist=NULL)
Computes an estimate of a survival curve for censored data
using either the Kaplan-Meier or the Fleming-Harrington method
or computes the predicted survivor function.
For competing risks data it computes the cumulative incidence curve.
This calls the survival
package's survfit.formula
function. Attributes of the event time variable are saved (label and
units of measurement).
For competing risks the second argument for Surv
should be the
event state variable, and it should be a factor variable with the first
factor level denoting right-censored observations.
npsurv(formula, data=environment(formula), subset, weights, na.action=na.delete, ...)
formula |
a formula object, which must have a |
data , subset , weights , na.action
|
see |
... |
see |
see survfit.formula
for details
an object of class "npsurv"
and "survfit"
.
See survfit.object
for details. Methods defined for survfit
objects are print
, summary
, plot
, lines
, and
points
.
Thomas Lumley [email protected] and Terry Therneau
survfit.cph
for survival curves from Cox models.
print
,
plot
,
lines
,
coxph
,
strata
,
survplot
require(survival)
# fit a Kaplan-Meier and plot it
fit <- npsurv(Surv(time, status) ~ x, data = aml)
plot(fit, lty = 2:3)
legend(100, .8, c("Maintained", "Nonmaintained"), lty = 2:3)

# Here is the data set from Turnbull
# There are no interval censored subjects, only left-censored (status=3),
# right-censored (status 0) and observed events (status 1)
#
#                    Time
#                    1   2   3   4
# Type of observation
#   death           12   6   2   3
#   losses           3   2   0   3
#   late entry       2   4   2   5
#
tdata <- data.frame(time   = c(1,1,1,2,2,2,3,3,3,4,4,4),
                    status = rep(c(1,0,2), 4),
                    n      = c(12,3,2,6,2,4,2,0,2,3,3,5))
fit <- npsurv(Surv(time, time, status, type='interval') ~ 1,
              data=tdata, weights=n)

# Time to progression/death for patients with monoclonal gammopathy
# Competing risk curves (cumulative incidence)
# status variable must be a factor with first level denoting right censoring
m <- upData(mgus1, stop = stop / 365.25, units=c(stop='years'),
            labels=c(stop='Follow-up Time'), subset=start == 0)
f <- npsurv(Surv(stop, event) ~ 1, data=m)

# CI curves are always plotted from 0 upwards, rather than 1 down
plot(f, fun='event', xmax=20, mark.time=FALSE, col=2:3,
     xlab="Years post diagnosis of MGUS")
text(10, .4,  "Competing Risk: death",       col=3)
text(16, .15, "Competing Risk: progression", col=2)

# Use survplot for enhanced displays of cumulative incidence curves for
# competing risks
survplot(f, state='pcm', n.risk=TRUE, xlim=c(0, 20), ylim=c(0, .5), col=2)
survplot(f, state='death', add=TRUE, col=3)
f <- npsurv(Surv(stop, event) ~ sex, data=m)
survplot(f, state='death', n.risk=TRUE, conf='diffbands')
Creates a 2-column integer matrix that handles left-, right-, and interval-censored ordinal or continuous values for use in [rmsb::blrm()] and in the future [orm()]. A pair of values '[a, b]' represents an interval-censored value known to be in the interval '[a, b]' inclusive of 'a' and 'b'. It is assumed that all distinct values are observed as uncensored for at least one observation. When both input variables are 'factor's it is assumed that the one with the higher number of levels is the one that correctly specifies the order of levels, and that the other variable does not contain any additional levels. If the variables are not 'factor's it is assumed their original values provide the orderings. Since all values that form the left or right endpoints of an interval-censored value must be represented in the data, a left-censored point is coded as 'a=1' and a right-censored point is coded as 'b' equal to the maximum observed value. If the maximum observed value is not really the maximum possible value, everything still works except that predictions involving values above the highest observed value cannot be made. As with most censored-data methods, the modeling functions assume that censoring is independent of the response variable values that would have been measured had censoring not occurred.
Ocens(a, b = a, precision = 7)
a |
vector representing a 'factor', numeric, or alphabetically ordered character strings |
b |
like 'a'. If omitted, it copies 'a', representing nothing but uncensored values |
precision |
when 'a' and 'b' are numeric, values may need to be rounded to avoid unpredictable behavior with |
a 2-column integer matrix of class '"Ocens"' with an attribute 'levels' (ordered). When the original variables were 'factor's, these are the factor levels; otherwise they are the numerically or alphabetically sorted distinct values (over 'a' and 'b' combined). When the variables are not factors and are numeric, another attribute 'median' is also returned: the median of the uncensored values. When the variables are factor or character, the median of the integer versions of the variables for uncensored observations is returned as attribute 'mid'. A final attribute 'freq' is the vector of frequencies of occurrences of all uncensored values; 'freq' aligns with 'levels'.
Frank Harrell
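A small illustrative sketch of the coding, with hypothetical values in which the third observation is interval-censored in [2, 4] and the rest are uncensored:

require(rms)
a <- c(1, 2, 2, 3, 4)
b <- c(1, 2, 4, 3, 4)     # a == b means uncensored
Y <- Ocens(a, b)
Y                         # 2-column integer matrix of class 'Ocens'
attr(Y, 'levels')         # sorted distinct values over a and b combined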
Fits the usual weighted or unweighted linear regression model using the
same fitting routines used by lm
, but also storing the variance-covariance
matrix var
and using traditional dummy-variable coding for categorical
factors.
Also fits unweighted models using penalized least squares, with the same
penalization options as in the lrm
function. For penalized estimation,
there is a fitter function call lm.pfit
.
ols(formula, data=environment(formula), weights, subset,
    na.action=na.delete, method="qr", model=FALSE, x=FALSE, y=FALSE,
    se.fit=FALSE, linear.predictors=TRUE,
    penalty=0, penalty.matrix, tol=.Machine$double.eps, sigma,
    var.penalty=c('simple','sandwich'), ...)
formula |
an S formula object, e.g.
|
data |
name of an S data frame containing all needed variables. Omit this to use a data frame already in the S “search list”. |
weights |
an optional vector of weights to be used in the fitting
process. If specified, weighted least squares is used with
weights |
subset |
an expression defining a subset of the observations to use in the fit. The default
is to use all observations. Specify for example |
na.action |
specifies an S function to handle missing data. The default is the function |
method |
specifies a particular fitting method, or |
model |
default is |
x |
default is |
y |
default is |
se.fit |
default is |
linear.predictors |
set to |
penalty |
see |
penalty.matrix |
see |
tol |
tolerance for information matrix singularity |
sigma |
If |
var.penalty |
the type of variance-covariance matrix to be stored in the |
... |
For penalized estimation, the penalty factor on the log likelihood is
$-0.5 \beta' P \beta / \sigma^2$, where $P$ is defined above.
The penalized maximum likelihood estimate (penalized least squares
or ridge estimate) of $\beta$ is $(X'X + P)^{-1} X'Y$.
The maximum likelihood estimate of $\sigma^2$ is
$(sse + \beta' P \beta) / n$, where
sse
is the sum of squared errors (residuals).
The effective.df.diagonal
vector is the
diagonal of the matrix $X'X (X'X + P)^{-1}$.
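To make these formulas concrete, here is a small self-contained sketch that computes them directly (hypothetical data and penalty matrix; this does not call ols):

set.seed(5)
X <- cbind(1, rnorm(50))                   # design matrix with intercept
y <- X %*% c(1, 2) + rnorm(50)
P <- diag(c(0, 3))                         # penalty matrix; intercept unpenalized
beta   <- solve(crossprod(X) + P, crossprod(X, y))   # (X'X + P)^{-1} X'Y
sse    <- sum((y - X %*% beta)^2)
sigma2 <- (sse + t(beta) %*% P %*% beta) / nrow(X)   # penalized MLE of sigma^2
edf    <- sum(diag(crossprod(X) %*% solve(crossprod(X) + P))) - 1
# edf: sum of the effective-d.f. diagonal minus one for the intercept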
the same objects returned from lm
(unless penalty
or penalty.matrix
are given - then an
abbreviated list is returned since lm.pfit
is used as a fitter)
plus the design attributes
(see rms
).
Predicted values are always returned, in the element linear.predictors
.
The vectors or matrix stored if y=TRUE
or x=TRUE
have rows deleted according to subset
and
to missing data, and have names or row names that come from the
data frame used as input data. If penalty
or penalty.matrix
is given,
the var
matrix
returned is an improved variance-covariance matrix
for the penalized regression coefficient estimates. If
var.penalty="sandwich"
(not the default, as limited simulation
studies have found it provides variance estimates that are too low) it
is defined as $\sigma^2 (X'X + P)^{-1} X'X (X'X + P)^{-1}$, where $P$ is
penalty factors * penalty.matrix
, with a column and row of zeros added for the
intercept. When var.penalty="simple"
(the default), var
is $\sigma^2 (X'X + P)^{-1}$.
The returned list has a vector
stats
with named elements
n, Model L.R., d.f., R2, g, Sigma
. Model L.R.
is the model
likelihood ratio statistic, and
R2
is $R^2$. For penalized estimation,
d.f.
is the
effective degrees of freedom, which is the sum of the elements of another
vector returned, effective.df.diagonal
, minus one for the
intercept.
g
is the $g$-index.
Sigma
is the penalized maximum likelihood estimate (see below).
Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
rms
, rms.trans
, anova.rms
,
summary.rms
, predict.rms
,
fastbw
, validate
, calibrate
,
Predict
, specs.rms
, cph
,
lrm
, which.influence
, lm
,
summary.lm
, print.ols
,
residuals.ols
, latex.ols
,
na.delete
, na.detail.response
,
datadist
, pentrace
, vif
,
abs.error.pred
set.seed(1)
x1 <- runif(200)
x2 <- sample(0:3, 200, TRUE)
distance <- (x1 + x2/3 + rnorm(200))^2
d <- datadist(x1, x2)
options(datadist="d")   # No d -> no summary, plot without giving all details

f <- ols(sqrt(distance) ~ rcs(x1,4) + scored(x2), x=TRUE)
# could use d <- datadist(f); options(datadist="d") at this point,
# but predictor summaries would not be stored in the fit object for
# use with Predict, summary.rms.  In that case, the original
# dataset or d would need to be accessed later, or all variable values
# would have to be specified to summary, plot
anova(f)
which.influence(f)
summary(f)
summary.lm(f)    # will only work if penalty and penalty.matrix not used

# Fit a complex model and approximate it with a simple one
x1 <- runif(200)
x2 <- runif(200)
x3 <- runif(200)
x4 <- runif(200)
y <- x1 + x2 + rnorm(200)
f <- ols(y ~ rcs(x1,4) + x2 + x3 + x4)
pred <- fitted(f)   # or predict(f) or f$linear.predictors
f2 <- ols(pred ~ rcs(x1,4) + x2 + x3 + x4, sigma=1)
# sigma=1 prevents numerical problems resulting from R2=1
fastbw(f2, aics=100000)
# This will find the best 1-variable model, best 2-variable model, etc.
# in predicting the predicted values from the original model
options(datadist=NULL)
Fits ordinal cumulative probability models for continuous or ordinal
response variables, efficiently allowing for a large number of
intercepts by capitalizing on the information matrix being sparse.
Five different distribution functions are implemented, with the
default being the logistic (i.e., the proportional odds
model). The ordinal cumulative probability models are stated in terms
of exceedance probabilities ($\Pr(Y \ge y | X)$) so that as with
OLS larger predicted values are associated with larger
Y
. This is
important to note for the asymmetric distributions given by the
log-log and complementary log-log families, for which negating the
linear predictor does not result in $\Pr(Y < y | X)$. The
family
argument is defined in orm.fit
. The model
assumes that the inverse of the assumed cumulative distribution
function, when applied to one minus the true cumulative distribution function
and plotted on the $y$-axis (with the original $y$ on the
$x$-axis) yields parallel curves (though not necessarily linear).
This can be checked by plotting the inverse cumulative probability
function of one minus the empirical distribution function, stratified
by
X
, and assessing parallelism. Note that parametric
regression models make the much stronger assumption of linearity of
such inverse functions.
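A hedged sketch of that graphical check for the logistic family, using qlogis as the inverse distribution function and a hypothetical binary stratification variable group:

set.seed(3)
group <- rep(0:1, each = 200)
y     <- rnorm(400, mean = group)                   # hypothetical data
ys    <- sort(unique(y))
for(g in 0:1) {
  Fn <- ecdf(y[group == g])
  z  <- qlogis(pmin(pmax(1 - Fn(ys), .001), .999))  # avoid +/- Inf at extremes
  if(g == 0) plot(ys, z, type='l', xlab='y', ylab='logit(1 - F(y))')
  else lines(ys, z, lty=2)
}
# roughly parallel curves support the proportional odds assumption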
For the print
method, format of output is controlled by the
user previously running options(prType="lang")
where
lang
is "plain"
(the default), "latex"
, or
"html"
. When using html with Quarto or RMarkdown,
results='asis'
need not be written in the chunk header.
Quantile.orm
creates an R function that computes an estimate of
a given quantile for a given value of the linear predictor (which was
assumed to use thefirst intercept). It uses a linear
interpolation method by default, but you can override that to use a
discrete method by specifying method="discrete"
when calling
the function generated by Quantile
.
Optionally a normal approximation for a confidence
interval for quantiles will be computed using the delta method, if
conf.int > 0
is specified to the function generated from calling
Quantile
and you specify X
. In that case, a
"lims"
attribute is included
in the result computed by the derived quantile function.
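A brief sketch, assuming the generated function takes the desired quantile and linear predictor as its first two arguments (data hypothetical):

require(rms)
set.seed(4)
x <- runif(300)
y <- x + runif(300)                 # continuous response
f  <- orm(y ~ x, x=TRUE, y=TRUE)
qu <- Quantile(f)
qu(0.5, lp=0)                       # estimated median at linear predictor 0
qu(0.9, lp=0, method='discrete')    # discrete method instead of interpolation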
orm(formula, data=environment(formula), subset, na.action=na.delete,
    method="orm.fit",
    family=c("logistic", "probit", "loglog", "cloglog", "cauchit"),
    model=FALSE, x=FALSE, y=FALSE, linear.predictors=TRUE, se.fit=FALSE,
    penalty=0, penalty.matrix, var.penalty=c('simple','sandwich'),
    scale=FALSE, maxit=30, weights, normwt=FALSE, ...)

## S3 method for class 'orm'
print(x, digits=4, r2=c(0,2,4), coefs=TRUE, pg=FALSE,
      intercepts=x$non.slopes < 10, title, ...)

## S3 method for class 'orm'
Quantile(object, codes=FALSE, ...)
formula |
a formula object. An |
data |
data frame to use. Default is the current frame. |
subset |
logical expression or vector of subscripts defining a subset of observations to analyze |
na.action |
function to handle |
method |
name of fitting function. Only allowable choice at present is |
family |
character value specifying the distribution family, which is one of the following:
|
model |
causes the model frame to be returned in the fit object |
x |
causes the expanded design matrix (with missings excluded)
to be returned under the name |
y |
causes the response variable (with missings excluded) to be returned
under the name |
linear.predictors |
causes the predicted X beta (with missings excluded) to be returned
under the name |
se.fit |
causes the standard errors of the fitted values (on the linear predictor
scale) to be returned under the name |
penalty |
see |
penalty.matrix |
see |
var.penalty |
see |
scale |
set to |
maxit |
maximum number of iterations to allow in |
weights |
a vector (same length as |
normwt |
set to |
... |
arguments that are passed to |
digits |
number of significant digits to use |
r2 |
vector of integers specifying which R^2 measures to print,
with 0 for Nagelkerke R^2 and 1:4 corresponding to the 4 measures
computed by |
pg |
set to |
coefs |
specify |
intercepts |
By default, intercepts are only printed if there are
fewer than 10 of them. Otherwise this is controlled by specifying
|
title |
a character string title to be passed to |
object |
an object created by |
codes |
if |
The returned fit object of orm
contains the following components
in addition to the ones mentioned under the optional arguments.
call |
calling expression |
freq |
table of frequencies for |
stats |
vector with the following elements: number of observations used in the
fit, number of unique |