Package 'rms'

Title: Regression Modeling Strategies
Description: Regression modeling, testing, estimation, validation, graphics, prediction, and typesetting by storing enhanced model design attributes in the fit. 'rms' is a collection of functions that assist with and streamline modeling. It also contains functions for binary and ordinal logistic regression models, ordinal models for continuous Y with a variety of distribution families, and the Buckley-James multiple regression model for right-censored responses, and implements penalized maximum likelihood estimation for logistic and ordinary linear models. 'rms' works with almost any regression model, but it was especially written to work with binary or ordinal regression models, Cox regression, accelerated failure time models, ordinary linear models, the Buckley-James model, generalized least squares for serially or spatially correlated observations, generalized linear models, and quantile regression.
Authors: Frank E Harrell Jr <[email protected]>
Maintainer: Frank E Harrell Jr <[email protected]>
License: GPL (>= 2)
Version: 6.8-1
Built: 2024-05-28 03:36:25 UTC
Source: CRAN

Help Index


Analysis of Variance (Wald, LR, and F Statistics)

Description

The anova function automatically tests most meaningful hypotheses in a design. For example, suppose that age and cholesterol are predictors, and that a general interaction is modeled using a restricted spline surface. anova prints Wald statistics (FF statistics for an ols fit) for testing linearity of age, linearity of cholesterol, age effect (age + age by cholesterol interaction), cholesterol effect (cholesterol + age by cholesterol interaction), linearity of the age by cholesterol interaction (i.e., adequacy of the simple age * cholesterol 1 d.f. product), linearity of the interaction in age alone, and linearity of the interaction in cholesterol alone. Joint tests of all interaction terms in the model and all nonlinear terms in the model are also performed. For any multiple d.f. effects for continuous variables that were not modeled through rcs, pol, lsp, etc., tests of linearity will be omitted. This applies to matrix predictors produced by e.g. poly or ns.

For lrm, orm, cph, psm and Glm fits, the better likelihood ratio chi-square tests may be obtained by specifying test='LR'. Fits must use x=TRUE, y=TRUE to run LR tests. The tests are run fairly efficiently by subsetting the design matrix rather than recreating it.

print.anova.rms is the printing method. plot.anova.rms draws dot charts depicting the importance of variables in the model, as measured by Wald or LR χ2\chi^2, χ2\chi^2 minus d.f., AIC, PP-values, partial R2R^2, R2R^2 for the whole model after deleting the effects in question, or proportion of overall model R2R^2 that is due to each predictor. latex.anova.rms is the latex method. It substitutes Greek/math symbols in column headings, uses boldface for TOTAL lines, and constructs a caption. Then it passes the result to latex.default for conversion to LaTeX.

When the anova table was converted to account for missing data imputation by processMI, a separate function prmiInfo can be used to print information related to imputation adjustments.

For Bayesian models such as blrm, anova computes relative explained variation indexes (REV) based on approximate Wald statistics. This uses the variance-covariance matrix of all of the posterior draws, and the individual draws of betas, plus an overall summary from the posterior mode/mean/median beta. Wald chi-squares assuming multivariate normality of betas are computed just as with frequentist models, and for each draw (or for the summary) the ratio of the partial Wald chi-square to the total Wald statistic for the model is computed as REV.

The print method calls latex or html methods depending on options(prType=). For latex a table environment is not used and an ordinary tabular is produced. When using html with Quarto or RMarkdown, results='asis' need not be written in the chunk header.

html.anova.rms just calls latex.anova.rms.

Usage

## S3 method for class 'rms'
anova(object, ..., main.effect=FALSE, tol=1e-9,
      test=c('F','Chisq','LR'), india=TRUE, indnl=TRUE, ss=TRUE,
      vnames=c('names','labels'),
      posterior.summary=c('mean', 'median', 'mode'), ns=500, cint=0.95)

## S3 method for class 'anova.rms'
print(x,
      which=c('none','subscripts','names','dots'),
      table.env=FALSE, ...)

## S3 method for class 'anova.rms'
plot(x,
     what=c("chisqminusdf","chisq","aic","P","partial R2","remaining R2",
            "proportion R2", "proportion chisq"),
     xlab=NULL, pch=16,
     rm.totals=TRUE, rm.ia=FALSE, rm.other=NULL, newnames,
     sort=c("descending","ascending","none"), margin=c('chisq','P'),
     pl=TRUE, trans=NULL, ntrans=40, height=NULL, width=NULL, ...)

## S3 method for class 'anova.rms'
latex(object, title, dec.chisq=2,
      dec.F=2, dec.ss=NA, dec.ms=NA, dec.P=4, dec.REV=3,
      table.env=TRUE,
      caption=NULL, fontsize=1, params, ...)

## S3 method for class 'anova.rms'
html(object, ...)

Arguments

object

a rms fit object. object must allow vcov to return the variance-covariance matrix. For latex is the result of anova.

...

If omitted, all variables are tested, yielding tests for individual factors and for pooled effects. Specify a subset of the variables to obtain tests for only those factors, with a pooled tests for the combined effects of all factors listed. Names may be abbreviated. For example, specify anova(fit,age,cholesterol) to get a Wald statistic for testing the joint importance of age, cholesterol, and any factor interacting with them. Add test='LR' to get a likelihood ratio chi-square test instead.

Can be optional graphical parameters to send to dotchart2, or other parameters to send to latex.default. Ignored for print.

For html.anova.rms the arguments are passed to latex.anova.rms.

main.effect

Set to TRUE to print the (usually meaningless) main effect tests even when the factor is involved in an interaction. The default is FALSE, to print only the effect of the main effect combined with all interactions involving that factor.

tol

singularity criterion for use in matrix inversion

test

For an ols fit, set test="Chisq" to use Wald χ2\chi^2 tests rather than F-tests. For lrm, orm, cph, psm and Glm fits set test='LR' to get likelihood ratio χ2\chi^2 tests. This requires specifying x=TRUE, y=TRUE when fitting the model.

india

set to FALSE to exclude individual tests of interaction from the table

indnl

set to FALSE to exclude individual tests of nonlinearity from the table

ss

For an ols fit, set ss=FALSE to suppress printing partial sums of squares, mean squares, and the Error SS and MS.

vnames

set to 'labels' to use variable labels rather than variable names in the output

posterior.summary

specifies whether the posterior mode/mean/median beta are to be used as a measure of central tendence of the posterior distribution, for use in relative explained variation from Bayesian models

ns

number of random samples from the posterior draws to use for REV highest posterior density intervals

cint

HPD interval probability

x

for print,plot,text is the result of anova.

which

If which is not "none" (the default), print.anova.rms will add to the rightmost column of the output the list of parameters being tested by the hypothesis being tested in the current row. Specifying which="subscripts" causes the subscripts of the regression coefficients being tested to be printed (with a subscript of one for the first non-intercept term). which="names" prints the names of the terms being tested, and which="dots" prints dots for terms being tested and blanks for those just being adjusted for.

what

what type of statistic to plot. The default is the χ2\chi^2 statistic for each factor (adding in the effect of higher-ordered factors containing that factor) minus its degrees of freedom. The R2 choices for what only apply to ols models.

xlab

x-axis label, default is constructed according to what. plotmath symbols are used for R, by default.

pch

character for plotting dots in dot charts. Default is 16 (solid dot).

rm.totals

set to FALSE to keep total χ2\chi^2s (overall, nonlinear, interaction totals) in the chart.

rm.ia

set to TRUE to omit any effect that has "*" in its name

rm.other

a list of other predictor names to omit from the chart

newnames

a list of substitute predictor names to use, after omitting any.

sort

default is to sort bars in descending order of the summary statistic. Available options: 'ascending', 'descending', 'none'.

margin

set to a vector of character strings to write text for selected statistics in the right margin of the dot chart. The character strings can be any combination of "chisq", "d.f.", "P", "partial R2", "proportion R2", and "proportion chisq". Default is to not draw any statistics in the margin. When plotly is in effect, margin values are instead displayed as hover text.

pl

set to FALSE to suppress plotting. This is useful when you only wish to analyze the vector of statistics returned.

trans

set to a function to apply that transformation to the statistics being plotted, and to truncate negative values at zero. A good choice is trans=sqrt.

ntrans

n argument to pretty, specifying the number of values for which to place tick marks. This should be larger than usual because of nonlinear scaling, to provide a sufficient number of tick marks on the left (stretched) part of the chi-square scale.

height, width

height and width of plotly plots drawn using dotchartp, in pixels. Ignored for ordinary plots. Defaults to minimum of 400 and 100 + 25 times the number of test statistics displayed.

title

title to pass to latex, default is name of fit object passed to anova prefixed with "anova.". For Windows, the default is "ano" followed by the first 5 letters of the name of the fit object.

dec.chisq

number of places to the right of the decimal place for typesetting χ2\chi^2 values (default is 2). Use zero for integer, NA for floating point.

dec.F

digits to the right for FF statistics (default is 2)

dec.ss

digits to the right for sums of squares (default is NA, indicating floating point)

dec.ms

digits to the right for mean squares (default is NA)

dec.P

digits to the right for PP-values

dec.REV

digits to the right for REV

table.env

see latex

caption

caption for table if table.env is TRUE. Default is constructed from the response variable.

fontsize

font size for html output; default is 1 for 1em

params

used internally when called through print.

Details

If the statistics being plotted with plot.anova.rms are few in number and one of them is negative or zero, plot.anova.rms will quit because of an error in dotchart2.

The latex method requires LaTeX packages relsize and needspace.

Value

anova.rms returns a matrix of class anova.rms containing factors as rows and χ2\chi^2, d.f., and PP-values as columns (or d.f., partial SS,MS,F,PSS, MS, F, P). An attribute vinfo provides list of variables involved in each row and the type of test done. plot.anova.rms invisibly returns the vector of quantities plotted. This vector has a names attribute describing the terms for which the statistics in the vector are calculated.

Side Effects

print prints, latex creates a file with a name of the form "title.tex" (see the title argument above).

Author(s)

Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]

See Also

prmiInfo, rms, rmsMisc, lrtest, rms.trans, summary.rms, plot.Predict, ggplot.Predict, solvet, locator, dotchart2, latex, xYplot, anova.lm, contrast.rms, pantext

Examples

require(ggplot2)
n <- 1000    # define sample size
set.seed(17) # so can reproduce the results
treat <- factor(sample(c('a','b','c'), n,TRUE))
num.diseases <- sample(0:4, n,TRUE)
age <- rnorm(n, 50, 10)
cholesterol <- rnorm(n, 200, 25)
weight <- rnorm(n, 150, 20)
sex <- factor(sample(c('female','male'), n,TRUE))
label(age) <- 'Age'      # label is in Hmisc
label(num.diseases) <- 'Number of Comorbid Diseases'
label(cholesterol) <- 'Total Cholesterol'
label(weight) <- 'Weight, lbs.'
label(sex) <- 'Sex'
units(cholesterol) <- 'mg/dl'   # uses units.default in Hmisc


# Specify population model for log odds that Y=1
L <- .1*(num.diseases-2) + .045*(age-50) +
     (log(cholesterol - 10)-5.2)*(-2*(treat=='a') +
     3.5*(treat=='b')+2*(treat=='c'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)


fit <- lrm(y ~ treat + scored(num.diseases) + rcs(age) +
               log(cholesterol+10) + treat:log(cholesterol+10),
           x=TRUE, y=TRUE)   # x, y needed for test='LR'
a <- anova(fit)                       # Test all factors
b <- anova(fit, treat, cholesterol)   # Test these 2 by themselves
                                      # to get their pooled effects
a
b
a2 <- anova(fit, test='LR')
b2 <- anova(fit, treat, cholesterol, test='LR')
a2
b2

# Add a new line to the plot with combined effects
s <- rbind(a2, 'treat+cholesterol'=b2['TOTAL',])

class(s) <- 'anova.rms'
plot(s, margin=c('chisq', 'proportion chisq'))

g <- lrm(y ~ treat*rcs(age))
dd <- datadist(treat, num.diseases, age, cholesterol)
options(datadist='dd')
p <- Predict(g, age, treat="b")
s <- anova(g)
tx <- paste(capture.output(s), collapse='\n')
ggplot(p) + annotate('text', x=27, y=3.2, family='mono', label=tx,
                      hjust=0, vjust=1, size=1.5)

plot(s, margin=c('chisq', 'proportion chisq'))
# new plot - dot chart of chisq-d.f. with 2 other stats in right margin
# latex(s)                       # nice printout - creates anova.g.tex
options(datadist=NULL)


# Simulate data with from a given model, and display exactly which
# hypotheses are being tested


set.seed(123)
age <- rnorm(500, 50, 15)
treat <- factor(sample(c('a','b','c'), 500, TRUE))
bp  <- rnorm(500, 120, 10)
y   <- ifelse(treat=='a', (age-50)*.05, abs(age-50)*.08) + 3*(treat=='c') +
       pmax(bp, 100)*.09 + rnorm(500)
f   <- ols(y ~ treat*lsp(age,50) + rcs(bp,4))
print(names(coef(f)), quote=FALSE)
specs(f)
anova(f)
an <- anova(f)
options(digits=3)
print(an, 'subscripts')
print(an, 'dots')


an <- anova(f, test='Chisq', ss=FALSE)
# plot(0:1)                        # make some plot
# tab <- pantext(an, 1.2, .6, lattice=FALSE, fontfamily='Helvetica')
# create function to write table; usually omit fontfamily
# tab()                            # execute it; could do tab(cex=.65)
plot(an)                         # new plot - dot chart of chisq-d.f.
# Specify plot(an, trans=sqrt) to use a square root scale for this plot
# latex(an)                      # nice printout - creates anova.f.tex


## Example to save partial R^2 for all predictors, along with overall
## R^2, from two separate fits, and to combine them with ggplot2

require(ggplot2)
set.seed(1)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
y  <- (x1-.5)^2 + x2 + runif(n)
group <- c(rep('a', n/2), rep('b', n/2))
A <- NULL
for(g in c('a','b')) {
    f <- ols(y ~ pol(x1,2) + pol(x2,2) + pol(x1,2) %ia% pol(x2,2),
             subset=group==g)
    a <- plot(anova(f),
              what='partial R2', pl=FALSE, rm.totals=FALSE, sort='none')
    a <- a[-grep('NONLINEAR', names(a))]
    d <- data.frame(group=g, Variable=factor(names(a), names(a)),
                    partialR2=unname(a))
    A <- rbind(A, d)
  }
ggplot(A, aes(x=partialR2, y=Variable)) + geom_point() +
       facet_wrap(~ group) + xlab(ex <- expression(partial~R^2)) +
       scale_y_discrete(limits=rev)
ggplot(A, aes(x=partialR2, y=Variable, color=group)) + geom_point() +
       xlab(ex <- expression(partial~R^2)) +
       scale_y_discrete(limits=rev)

# Suppose that a researcher wants to make a big deal about a variable
# because it has the highest adjusted chi-square.  We use the
# bootstrap to derive 0.95 confidence intervals for the ranks of all
# the effects in the model.  We use the plot method for anova, with
# pl=FALSE to suppress actual plotting of chi-square - d.f. for each
# bootstrap repetition.
# It is important to tell plot.anova.rms not to sort the results, or
# every bootstrap replication would have ranks of 1,2,3,... for the stats.

n <- 300
set.seed(1)
d <- data.frame(x1=runif(n), x2=runif(n),  x3=runif(n),
   x4=runif(n), x5=runif(n), x6=runif(n),  x7=runif(n),
   x8=runif(n), x9=runif(n), x10=runif(n), x11=runif(n),
   x12=runif(n))
d$y <- with(d, 1*x1 + 2*x2 + 3*x3 +  4*x4  + 5*x5 + 6*x6 +
               7*x7 + 8*x8 + 9*x9 + 10*x10 + 11*x11 +
              12*x12 + 9*rnorm(n))

f <- ols(y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12, data=d)
B <- 20   # actually use B=1000
ranks <- matrix(NA, nrow=B, ncol=12)
rankvars <- function(fit)
  rank(plot(anova(fit), sort='none', pl=FALSE))
Rank <- rankvars(f)
for(i in 1:B) {
  j <- sample(1:n, n, TRUE)
  bootfit <- update(f, data=d, subset=j)
  ranks[i,] <- rankvars(bootfit)
  }
lim <- t(apply(ranks, 2, quantile, probs=c(.025,.975)))
predictor <- factor(names(Rank), names(Rank))
w <- data.frame(predictor, Rank, lower=lim[,1], upper=lim[,2])
ggplot(w, aes(x=predictor, y=Rank)) + geom_point() + coord_flip() +
  scale_y_continuous(breaks=1:12) +
  geom_errorbar(aes(ymin=lim[,1], ymax=lim[,2]), width=0)

Buckley-James Multiple Regression Model

Description

bj fits the Buckley-James distribution-free least squares multiple regression model to a possibly right-censored response variable. This model reduces to ordinary least squares if there is no censoring. By default, model fitting is done after taking logs of the response variable. bj uses the rms class for automatic anova, fastbw, validate, Function, nomogram, summary, plot, bootcov, and other functions. The bootcov function may be worth using with bj fits, as the properties of the Buckley-James covariance matrix estimator are not fully known for strange censoring patterns.

For the print method, format of output is controlled by the user previously running options(prType="lang") where lang is "plain" (the default), "latex", or "html". When using html with Quarto or RMarkdown, results='asis' need not be written in the chunk header.

The residuals.bj function exists mainly to compute residuals and to censor them (i.e., return them as Surv objects) just as the original failure time variable was censored. These residuals are useful for checking to see if the model also satisfies certain distributional assumptions. To get these residuals, the fit must have specified y=TRUE.

The bjplot function is a special plotting function for objects created by bj with x=TRUE, y=TRUE in effect. It produces three scatterplots for every covariate in the model: the first plots the original situation, where censored data are distingushed from non-censored data by a different plotting symbol. In the second plot, called a renovated plot, vertical lines show how censored data were changed by the procedure, and the third is equal to the second, but without vertical lines. Imputed data are again distinguished from the non-censored by a different symbol.

The validate method for bj validates the Somers' Dxy rank correlation between predicted and observed responses, accounting for censoring.

The primary fitting function for bj is bj.fit, which does not allow missing data and expects a full design matrix as input.

Usage

bj(formula, data=environment(formula), subset, na.action=na.delete,
   link="log", control, method='fit', x=FALSE, y=FALSE, 
   time.inc)

## S3 method for class 'bj'
print(x, digits=4, long=FALSE, coefs=TRUE, 
title="Buckley-James Censored Data Regression", ...)

## S3 method for class 'bj'
residuals(object, type=c("censored","censored.normalized"),...)

bjplot(fit, which=1:dim(X)[[2]])

## S3 method for class 'bj'
validate(fit, method="boot", B=40,
         bw=FALSE,rule="aic",type="residual",sls=.05,aics=0,
         force=NULL, estimates=TRUE, pr=FALSE,
		 tol=1e-7, rel.tolerance=1e-3, maxiter=15, ...)

bj.fit(x, y, control)

Arguments

formula

an S statistical model formula. Interactions up to third order are supported. The left hand side must be a Surv object.

data, subset, na.action

the usual statistical model fitting arguments

fit

a fit created by bj, required for all functions except bj.

x

a design matrix with or without a first column of ones, to pass to bj.fit. All models will have an intercept. For print.bj is a result of bj. For bj, set x=TRUE to include the design matrix in the fit object.

y

a Surv object to pass to bj.fit as the two-column response variable. Only right censoring is allowed, and there need not be any censoring. For bj, set y to TRUE to include the two-column response matrix, with the event/censoring indicator in the second column. The first column will be transformed according to link, and depending on na.action, rows with missing data in the predictors or the response will be deleted.

link

set to, for example, "log" (the default) to model the log of the response, or "identity" to model the untransformed response.

control

a list containing any or all of the following components: iter.max (maximum number of iterations allowed, default is 20), eps (convergence criterion: concergence is assumed when the ratio of sum of squared errors from one iteration to the next is between 1-eps and 1+eps), trace (set to TRUE to monitor iterations), tol (matrix singularity criterion, default is 1e-7), and 'max.cycle' (in case of nonconvergence the program looks for a cycle that repeats itself, default is 30).

method

set to "model.frame" or "model.matrix" to return one of those objects rather than the model fit.

time.inc

setting for default time spacing. Default is 30 if time variable has units="Day", 1 otherwise, unless maximum follow-up time <1< 1. Then max time/10 is used as time.inc. If time.inc is not given and max time/default time.inc is >25> 25, time.inc is increased.

digits

number of significant digits to print if not 4.

long

set to TRUE to print the correlation matrix for parameter estimates

coefs

specify coefs=FALSE to suppress printing the table of model coefficients, standard errors, etc. Specify coefs=n to print only the first n regression coefficients in the model.

title

a character string title to be passed to prModFit

object

the result of bj

type

type of residual desired. Default is censored unnormalized residuals, defined as link(Y) - linear.predictors, where the link function was usually the log function. You can specify type="censored.normalized" to divide the residuals by the estimate of sigma.

which

vector of integers or character strings naming elements of the design matrix (the names of the original predictors if they entered the model linearly) for which to have bjplot make plots of only the variables listed in which (names or numbers).

B, bw, rule, sls, aics, force, estimates, pr, tol, rel.tolerance, maxiter

see predab.resample

...

ignored for print; passed through to predab.resample for validate

Details

The program implements the algorithm as described in the original article by Buckley & James. Also, we have used the original Buckley & James prescription for computing variance/covariance estimator. This is based on non-censored observations only and does not have any theoretical justification, but has been shown in simulation studies to behave well. Our experience confirms this view. Convergence is rather slow with this method, so you may want to increase the number of iterations. Our experience shows that often, in particular with high censoring, 100 iterations is not too many. Sometimes the method will not converge, but will instead enter a loop of repeating values (this is due to the discrete nature of Kaplan and Meier estimator and usually happens with small sample sizes). The program will look for such a loop and return the average betas. It will also issue a warning message and give the size of the cycle (usually less than 6).

Value

bj returns a fit object with similar information to what survreg, psm, cph would store as well as what rms stores and units and time.inc. residuals.bj returns a Surv object. One of the components of the fit object produced by bj (and bj.fit) is a vector called stats which contains the following names elements: "Obs", "Events", "d.f.","error d.f.","sigma","g". Here sigma is the estimate of the residual standard deviation. g is the gg-index. If the link function is "log", the gg-index on the anti-log scale is also returned as gr.

Author(s)

Janez Stare
Department of Biomedical Informatics
Ljubljana University
Ljubljana, Slovenia
[email protected]

Harald Heinzl
Department of Medical Computer Sciences
Vienna University
Vienna, Austria
[email protected]

Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]

References

Buckley JJ, James IR. Linear regression with censored data. Biometrika 1979; 66:429–36.

Miller RG, Halpern J. Regression with censored data. Biometrika 1982; 69: 521–31.

James IR, Smith PJ. Consistency results for linear regression with censored data. Ann Statist 1984; 12: 590–600.

Lai TL, Ying Z. Large sample theory of a modified Buckley-James estimator for regression analysis with censored data. Ann Statist 1991; 19: 1370–402.

Hillis SL. Residual plots for the censored data linear regression model. Stat in Med 1995; 14: 2023–2036.

Jin Z, Lin DY, Ying Z. On least-squares regression with censored data. Biometrika 2006; 93:147–161.

See Also

rms, psm, survreg, cph, Surv, na.delete, na.detail.response, datadist, rcorr.cens, GiniMd, prModFit, dxy.cens

Examples

require(survival)
suppressWarnings(RNGversion("3.5.0"))
set.seed(1)
ftime  <- 10*rexp(200)
stroke <- ifelse(ftime > 10, 0, 1)
ftime  <- pmin(ftime, 10)
units(ftime) <- "Month"
age <- rnorm(200, 70, 10)
hospital <- factor(sample(c('a','b'),200,TRUE))
dd <- datadist(age, hospital)
options(datadist="dd")

# Prior to rms 6.0 and R 4.0 the following worked with 5 knots
f <- bj(Surv(ftime, stroke) ~ rcs(age,3) + hospital, x=TRUE, y=TRUE)
# add link="identity" to use a censored normal regression model instead
# of a lognormal one
anova(f)
fastbw(f)
validate(f, B=15)
plot(Predict(f, age, hospital))
# needs datadist since no explicit age,hosp.
coef(f)               # look at regression coefficients
coef(psm(Surv(ftime, stroke) ~ rcs(age,3) + hospital, dist='lognormal'))
                      # compare with coefficients from likelihood-based
                      # log-normal regression model
                      # use dist='gau' not under R 


r <- resid(f, 'censored.normalized')
survplot(npsurv(r ~ 1), conf='none') 
                      # plot Kaplan-Meier estimate of 
                      # survival function of standardized residuals
survplot(npsurv(r ~ cut2(age, g=2)), conf='none')  
                      # may desire both strata to be n(0,1)
options(datadist=NULL)

BCa Bootstrap on Existing Bootstrap Replicates

Description

This functions constructs an object resembling one produced by the boot package's boot function, and runs that package's boot.ci function to compute BCa and percentile confidence limits. bootBCa can provide separate confidence limits for a vector of statistics when estimate has length greater than 1. In that case, estimates must have the same number of columns as estimate has values.

Usage

bootBCa(estimate, estimates, type=c('percentile','bca','basic'),
               n, seed, conf.int = 0.95)

Arguments

estimate

original whole-sample estimate

estimates

vector of bootstrap estimates

type

type of confidence interval, defaulting to nonparametric percentile

n

original number of observations

seed

.Random.seem in effect before bootstrap estimates were run

conf.int

confidence level

Value

a 2-vector if estimate is of length 1, otherwise a matrix with 2 rows and number of columns equal to the length of estimate

Note

You can use if(!exists('.Random.seed')) runif(1) before running your bootstrap to make sure that .Random.seed will be available to bootBCa.

Author(s)

Frank Harrell

See Also

boot.ci

Examples

## Not run: 
x1 <- runif(100); x2 <- runif(100); y <- sample(0:1, 100, TRUE)
f <- lrm(y ~ x1 + x2, x=TRUE, y=TRUE)
seed <- .Random.seed
b <- bootcov(f)
# Get estimated log odds at x1=.4, x2=.6
X <- cbind(c(1,1), x1=c(.4,2), x2=c(.6,3))
est <- X 
ests <- t(X 
bootBCa(est, ests, n=100, seed=seed)
bootBCa(est, ests, type='bca', n=100, seed=seed)
bootBCa(est, ests, type='basic', n=100, seed=seed)

## End(Not run)

Bootstrap Covariance and Distribution for Regression Coefficients

Description

bootcov computes a bootstrap estimate of the covariance matrix for a set of regression coefficients from ols, lrm, cph, psm, Rq, and any other fit where x=TRUE, y=TRUE was used to store the data used in making the original regression fit and where an appropriate fitter function is provided here. The estimates obtained are not conditional on the design matrix, but are instead unconditional estimates. For small sample sizes, this will make a difference as the unconditional variance estimates are larger. This function will also obtain bootstrap estimates corrected for cluster sampling (intra-cluster correlations) when a "working independence" model was used to fit data which were correlated within clusters. This is done by substituting cluster sampling with replacement for the usual simple sampling with replacement. bootcov has an option (coef.reps) that causes all of the regression coefficient estimates from all of the bootstrap re-samples to be saved, facilitating computation of nonparametric bootstrap confidence limits and plotting of the distributions of the coefficient estimates (using histograms and kernel smoothing estimates).

The loglik option facilitates the calculation of simultaneous confidence regions from quantities of interest that are functions of the regression coefficients, using the method of Tibshirani(1996). With Tibshirani's method, one computes the objective criterion (-2 log likelihood evaluated at the bootstrap estimate of β\beta but with respect to the original design matrix and response vector) for the original fit as well as for all of the bootstrap fits. The confidence set of the regression coefficients is the set of all coefficients that are associated with objective function values that are less than or equal to say the 0.95 quantile of the vector of B + 1 objective function values. For the coefficients satisfying this condition, predicted values are computed at a user-specified design matrix X, and minima and maxima of these predicted values (over the qualifying bootstrap repetitions) are computed to derive the final simultaneous confidence band.

The bootplot function takes the output of bootcov and either plots a histogram and kernel density estimate of specified regression coefficients (or linear combinations of them through the use of a specified design matrix X), or a qqnorm plot of the quantities of interest to check for normality of the maximum likelihood estimates. bootplot draws vertical lines at specified quantiles of the bootstrap distribution, and returns these quantiles for possible printing by the user. Bootstrap estimates may optionally be transformed by a user-specified function fun before plotting.

The confplot function also uses the output of bootcov but to compute and optionally plot nonparametric bootstrap pointwise confidence limits or (by default) Tibshirani (1996) simultaneous confidence sets. A design matrix must be specified to allow confplot to compute quantities of interest such as predicted values across a range of values or differences in predicted values (plots of effects of changing one or more predictor variable values).

bootplot and confplot are actually generic functions, with the particular functions bootplot.bootcov and confplot.bootcov automatically invoked for bootcov objects.

A service function called histdensity is also provided (for use with bootplot). It runs hist and density on the same plot, using twice the number of classes than the default for hist, and 1.5 times the width than the default used by density.

A comprehensive example demonstrates the use of all of the functions.

Usage

bootcov(fit, cluster, B=200, fitter, 
        coef.reps=TRUE, loglik=FALSE,
        pr=FALSE, maxit=15, eps=0.0001, group=NULL, stat=NULL,
        seed=sample(10000, 1))


bootplot(obj, which=1 : ncol(Coef), X,
         conf.int=c(.9,.95,.99),
         what=c('density', 'qqnorm', 'box'),
         fun=function(x) x, labels., ...)


confplot(obj, X, against, 
         method=c('simultaneous','pointwise'),
         conf.int=0.95, fun=function(x)x,
         add=FALSE, lty.conf=2, ...)


histdensity(y, xlab, nclass, width, mult.width=1, ...)

Arguments

fit

a fit object containing components x and y. For fits from cph, the "strata" attribute of the x component is used to obtain the vector of stratum codes.

obj

an object created by bootcov with coef.reps=TRUE.

X

a design matrix specified to confplot. See predict.rms or contrast.rms. For bootplot, X is optional.

y

a vector to pass to histdensity. NAs are ignored.

cluster

a variable indicating groupings. cluster may be any type of vector (factor, character, integer). Unique values of cluster indicate possibly correlated groupings of observations. Note the data used in the fit and stored in fit$x and fit$y may have had observations containing missing values deleted. It is assumed that if there were any NAs, an naresid function exists for the class of fit. This function restores NAs so that the rows of the design matrix coincide with cluster.

B

number of bootstrap repetitions. Default is 200.

fitter

the name of a function with arguments (x,y) that will fit bootstrap samples. Default is taken from the class of fit if it is ols, lrm, cph, psm, Rq.

coef.reps

set to TRUE if you want to store a matrix of all bootstrap regression coefficient estimates in the returned component boot.Coef.

loglik

set to TRUE to store -2 log likelihoods for each bootstrap model, evaluated against the original x and y data. The default is to do this when coef.reps is specified as TRUE. The use of loglik=TRUE assumes that an oos.loglik method exists for the type of model being analyzed, to calculate out-of-sample -2 log likelihoods (see rmsMisc). After the B -2 log likelihoods (stored in the element named boot.loglik in the returned fit object), the B+1 element is the -2 log likelihood for the original model fit.

pr

set to TRUE to print the current sample number to monitor progress.

maxit

maximum number of iterations, to pass to fitter

eps

argument to pass to various fitters

group

a grouping variable used to stratify the sample upon bootstrapping. This allows one to handle k-sample problems, i.e., each bootstrap sample will be forced to select the same number of observations from each level of group as the number appearing in the original dataset. You may specify both group and cluster.

stat

a single character string specifying the name of a stats element produced by the fitting function to save over the bootstrap repetitions. The vector of saved statistics will be in the boot.stats part of the list returned by bootcov.

seed

random number seed for set.seed, defaults to a random integer between 1 and 10000; user should specify a constant for reproducibility

which

one or more integers specifying which regression coefficients to plot for bootplot

conf.int

a vector (for bootplot, default is c(.9,.95,.99)) or scalar (for confplot, default is .95) confidence level.

what

for bootplot, specifies whether a density or a q-q plot is made, a ggplot2 is used to produce a box plot of all coefficients over the bootstrap reps

fun

for bootplot or confplot specifies a function used to translate the quantities of interest before analysis. A common choice is fun=exp to compute anti-logs, e.g., odds ratios.

labels.

a vector of labels for labeling the axes in plots produced by bootplot. Default is row names of X if there are any, or sequential integers.

...

For bootplot these are optional arguments passed to histdensity. Also may be optional arguments passed to plot by confplot or optional arguments passed to hist from histdensity, such as xlim and breaks. The argument probability=TRUE is always passed to hist.

against

For confplot, specifying against causes a plot to be made (or added to). The against variable is associated with rows of X and is used as the x-coordinates.

method

specifies whether "pointwise" or "simultaneous" confidence regions are derived by confplot. The default is simultaneous.

add

set to TRUE to add to an existing plot, for confplot

lty.conf

line type for plotting confidence bands in confplot. Default is 2 for dotted lines.

xlab

label for x-axis for histdensity. Default is label attribute or argument name if there is no label.

nclass

passed to hist if present

width

passed to density if present

mult.width

multiplier by which to adjust the default width passed to density. Default is 1.

Details

If the fit has a scale parameter (e.g., a fit from psm), the log of the individual bootstrap scale estimates are added to the vector of parameter estimates and and column and row for the log scale are added to the new covariance matrix (the old covariance matrix also has this row and column).

For Rq fits, the tau, method, and hs arguments are taken from the original fit.

Value

a new fit object with class of the original object and with the element orig.var added. orig.var is the covariance matrix of the original fit. Also, the original var component is replaced with the new bootstrap estimates. The component boot.coef is also added. This contains the mean bootstrap estimates of regression coefficients (with a log scale element added if applicable). boot.Coef is added if coef.reps=TRUE. boot.loglik is added if loglik=TRUE. If stat is specified an additional vector boot.stats will be contained in the returned object. B contains the number of successfully fitted bootstrap resamples. A component clusterInfo is added to contain elements name and n holding the name of the cluster variable and the number of clusters.

bootplot returns a (possible matrix) of quantities of interest and the requested quantiles of them. confplot returns three vectors: fitted, lower, and upper.

Side Effects

bootcov prints if pr=TRUE

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]

Bill Pikounis
Biometrics Research Department
Merck Research Laboratories
https://billpikounis.com/wpb/

References

Feng Z, McLerran D, Grizzle J (1996): A comparison of statistical methods for clustered data analysis with Gaussian error. Stat in Med 15:1793–1806.

Tibshirani R, Knight K (1996): Model search and inference by bootstrap "bumping". Department of Statistics, University of Toronto. Technical report available from
http://www-stat.stanford.edu/~tibs/. Presented at the Joint Statistical Meetings, Chicago, August 1996.

See Also

robcov, sample, rms, lm.fit, lrm.fit, survival-internal, predab.resample, rmsMisc, Predict, gendata, contrast.rms, Predict, setPb, multiwayvcov::cluster.boot

Examples

set.seed(191)
x <- exp(rnorm(200))
logit <- 1 + x/2
y <- ifelse(runif(200) <= plogis(logit), 1, 0)
f <- lrm(y ~ pol(x,2), x=TRUE, y=TRUE)
g <- bootcov(f, B=50, pr=TRUE, seed=3)
anova(g)    # using bootstrap covariance estimates
fastbw(g)   # using bootstrap covariance estimates
beta <- g$boot.Coef[,1]
hist(beta, nclass=15)     #look at normality of parameter estimates
qqnorm(beta)
# bootplot would be better than these last two commands


# A dataset contains a variable number of observations per subject,
# and all observations are laid out in separate rows. The responses
# represent whether or not a given segment of the coronary arteries
# is occluded. Segments of arteries may not operate independently
# in the same patient.  We assume a "working independence model" to
# get estimates of the coefficients, i.e., that estimates assuming
# independence are reasonably efficient.  The job is then to get
# unbiased estimates of variances and covariances of these estimates.


set.seed(2)
n.subjects <- 30
ages <- rnorm(n.subjects, 50, 15)
sexes  <- factor(sample(c('female','male'), n.subjects, TRUE))
logit <- (ages-50)/5
prob <- plogis(logit)  # true prob not related to sex
id <- sample(1:n.subjects, 300, TRUE) # subjects sampled multiple times
table(table(id))  # frequencies of number of obs/subject
age <- ages[id]
sex <- sexes[id]
# In truth, observations within subject are independent:
y   <- ifelse(runif(300) <= prob[id], 1, 0)
f <- lrm(y ~ lsp(age,50)*sex, x=TRUE, y=TRUE)
g <- bootcov(f, id, B=50, seed=3)  # usually do B=200 or more
diag(g$var)/diag(f$var)
# add ,group=w to re-sample from within each level of w
anova(g)            # cluster-adjusted Wald statistics
# fastbw(g)         # cluster-adjusted backward elimination
plot(Predict(g, age=30:70, sex='female'))  # cluster-adjusted confidence bands


# Get design effects based on inflation of the variances when compared
# with bootstrap estimates which ignore clustering
g2 <- bootcov(f, B=50, seed=3)
diag(g$var)/diag(g2$var)


# Get design effects based on pooled tests of factors in model
anova(g2)[,1] / anova(g)[,1]


# Simulate binary data where there is a strong 
# age x sex interaction with linear age effects 
# for both sexes, but where not knowing that
# we fit a quadratic model.  Use the bootstrap
# to get bootstrap distributions of various
# effects, and to get pointwise and simultaneous
# confidence limits


set.seed(71)
n   <- 500
age <- rnorm(n, 50, 10)
sex <- factor(sample(c('female','male'), n, rep=TRUE))
L   <- ifelse(sex=='male', 0, .1*(age-50))
y   <- ifelse(runif(n)<=plogis(L), 1, 0)


f <- lrm(y ~ sex*pol(age,2), x=TRUE, y=TRUE)
b <- bootcov(f, B=50, loglik=TRUE, pr=TRUE, seed=3)   # better: B=500


par(mfrow=c(2,3))
# Assess normality of regression estimates
bootplot(b, which=1:6, what='qq')
# They appear somewhat non-normal


# Plot histograms and estimated densities 
# for 6 coefficients
w <- bootplot(b, which=1:6)
# Print bootstrap quantiles
w$quantiles

# Show box plots for bootstrap reps for all coefficients
bootplot(b, what='box')


# Estimate regression function for females
# for a sequence of ages
ages <- seq(25, 75, length=100)
label(ages) <- 'Age'


# Plot fitted function and pointwise normal-
# theory confidence bands
par(mfrow=c(1,1))
p <- Predict(f, age=ages, sex='female')
plot(p)
# Save curve coordinates for later automatic
# labeling using labcurve in the Hmisc library
curves <- vector('list',8)
curves[[1]] <- with(p, list(x=age, y=lower))
curves[[2]] <- with(p, list(x=age, y=upper))


# Add pointwise normal-distribution confidence 
# bands using unconditional variance-covariance
# matrix from the 500 bootstrap reps
p <- Predict(b, age=ages, sex='female')
curves[[3]] <- with(p, list(x=age, y=lower))
curves[[4]] <- with(p, list(x=age, y=upper))


dframe <- expand.grid(sex='female', age=ages)
X <- predict(f, dframe, type='x')  # Full design matrix


# Add pointwise bootstrap nonparametric 
# confidence limits
p <- confplot(b, X=X, against=ages, method='pointwise',
              add=TRUE, lty.conf=4)
curves[[5]] <- list(x=ages, y=p$lower)
curves[[6]] <- list(x=ages, y=p$upper)


# Add simultaneous bootstrap confidence band
p <- confplot(b, X=X, against=ages, add=TRUE, lty.conf=5)
curves[[7]] <- list(x=ages, y=p$lower)
curves[[8]] <- list(x=ages, y=p$upper)
lab <- c('a','a','b','b','c','c','d','d')
labcurve(curves, lab, pl=TRUE)


# Now get bootstrap simultaneous confidence set for
# female:male odds ratios for a variety of ages


dframe <- expand.grid(age=ages, sex=c('female','male'))
X <- predict(f, dframe, type='x')  # design matrix
f.minus.m <- X[1:100,] - X[101:200,]
# First 100 rows are for females.  By subtracting
# design matrices are able to get Xf*Beta - Xm*Beta
# = (Xf - Xm)*Beta


confplot(b, X=f.minus.m, against=ages,
         method='pointwise', ylab='F:M Log Odds Ratio')
confplot(b, X=f.minus.m, against=ages,
         lty.conf=3, add=TRUE)


# contrast.rms makes it easier to compute the design matrix for use
# in bootstrapping contrasts:


f.minus.m <- contrast(f, list(sex='female',age=ages),
                         list(sex='male',  age=ages))$X
confplot(b, X=f.minus.m)


# For a quadratic binary logistic regression model use bootstrap
# bumping to estimate coefficients under a monotonicity constraint
set.seed(177)
n <- 400
x <- runif(n)
logit <- 3*(x^2-1)
y <- rbinom(n, size=1, prob=plogis(logit))
f <- lrm(y ~ pol(x,2), x=TRUE, y=TRUE)
k <- coef(f)
k
vertex <- -k[2]/(2*k[3])
vertex


# Outside [0,1] so fit satisfies monotonicity constraint within
# x in [0,1], i.e., original fit is the constrained MLE


g <- bootcov(f, B=50, coef.reps=TRUE, loglik=TRUE, seed=3)
bootcoef <- g$boot.Coef    # 100x3 matrix
vertex <- -bootcoef[,2]/(2*bootcoef[,3])
table(cut2(vertex, c(0,1)))
mono <- !(vertex >= 0 & vertex <= 1)
mean(mono)    # estimate of Prob{monotonicity in [0,1]}


var(bootcoef)   # var-cov matrix for unconstrained estimates
var(bootcoef[mono,])   # for constrained estimates


# Find second-best vector of coefficient estimates, i.e., best
# from among bootstrap estimates
g$boot.Coef[order(g$boot.loglik[-length(g$boot.loglik)])[1],]
# Note closeness to MLE

## Not run: 
# Get the bootstrap distribution of the difference in two ROC areas for
# two binary logistic models fitted on the same dataset.  This analysis
# does not adjust for the bias ROC area (C-index) due to overfitting.
# The same random number seed is used in two runs to enforce pairing.

set.seed(17)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- sample(0:1, 100, TRUE)
f <- lrm(y ~ x1, x=TRUE, y=TRUE)
g <- lrm(y ~ x1 + x2, x=TRUE, y=TRUE)
f <- bootcov(f, stat='C', seed=4)
g <- bootcov(g, stat='C', seed=4)
dif <- g$boot.stats - f$boot.stats
hist(dif)
quantile(dif, c(.025,.25,.5,.75,.975))
# Compute a z-test statistic.  Note that comparing ROC areas is far less
# powerful than likelihood or Brier score-based methods
z <- (g$stats['C'] - f$stats['C'])/sd(dif)
names(z) <- NULL
c(z=z, P=2*pnorm(-abs(z)))

## End(Not run)

3-D Plots Showing Effects of Two Continuous Predictors in a Regression Model Fit

Description

Uses lattice graphics and the output from Predict to plot image, contour, or perspective plots showing the simultaneous effects of two continuous predictor variables. Unless formula is provided, the xx-axis is constructed from the first variable listed in the call to Predict and the yy-axis variable comes from the second.

The perimeter function is used to generate the boundary of data to plot when a 3-d plot is made. It finds the area where there are sufficient data to generate believable interaction fits.

Usage

bplot(x, formula, lfun=lattice::levelplot, xlab, ylab, zlab,
      adj.subtitle=!info$ref.zero, cex.adj=.75, cex.lab=1,
      perim, showperim=FALSE,
      zlim=range(yhat, na.rm=TRUE), scales=list(arrows=FALSE),
      xlabrot, ylabrot, zlabrot=90, ...)

perimeter(x, y, xinc=diff(range(x))/10, n=10, lowess.=TRUE)

Arguments

x

for bplot, an object created by Predict for which two or more numeric predictors varied. For perim is the first variable of a pair of predictors forming a 3-d plot.

formula

a formula of the form f(yhat) ~ x*y optionally followed by |a*b*c which are 1-3 paneling variables that were specified to Predict. f can represent any R function of a vector that produces a vector. If the left hand side of the formula is omitted, yhat will be inserted. If formula is omitted, it will be inferred from the first two variables that varied in the call to Predict.

lfun

a high-level lattice plotting function that takes formulas of the form z ~ x*y. The default is an image plot (levelplot). Other common choices are wireframe for perspective plot or contourplot for a contour plot.

xlab

Character string label for xx-axis. Default is given by Predict.

ylab

Character string abel for yy-axis

zlab

Character string zz-axis label for perspective (wireframe) plots. Default comes from Predict. zlab will often be specified if fun was specified to Predict.

adj.subtitle

Set to FALSE to suppress subtitling the graph with the list of settings of non-graphed adjustment values. Default is TRUE if there are non-plotted adjustment variables and ref.zero was not used.

cex.adj

cex parameter for size of adjustment settings in subtitles. Default is 0.75

cex.lab

cex parameter for axis labels. Default is 1.

perim

names a matrix created by perimeter when used for 3-d plots of two continuous predictors. When the combination of variables is outside the range in perim, that section of the plot is suppressed. If perim is omitted, 3-d plotting will use the marginal distributions of the two predictors to determine the plotting region, when the grid is not specified explicitly in variables. When instead a series of curves is being plotted, perim specifies a function having two arguments. The first is the vector of values of the first variable that is about to be plotted on the xx-axis. The second argument is the single value of the variable representing different curves, for the current curve being plotted. The function's returned value must be a logical vector whose length is the same as that of the first argument, with values TRUE if the corresponding point should be plotted for the current curve, FALSE otherwise. See one of the latter examples.

showperim

set to TRUE if perim is specified and you want to show the actual perimeter used.

zlim

Controls the range for plotting in the zz-axis if there is one. Computed by default.

scales

see wireframe

xlabrot

rotation angle for the x-axis. Default is 30 for wireframe and 0 otherwise.

ylabrot

rotation angle for the y-axis. Default is -40 for wireframe, 90 for contourplot or levelplot, and 0 otherwise.

zlabrot

rotation angle for z-axis rotation for wireframe plots

...

other arguments to pass to the lattice function

y

second variable of the pair for perim. If omitted, x is assumed to be a list with both x and y components.

xinc

increment in x over which to examine the density of y in perimeter

n

within intervals of x for perimeter, takes the informative range of y to be the nnth smallest to the nnth largest values of y. If there aren't at least 2nn y values in the x interval, no y ranges are used for that interval.

lowess.

set to FALSE to not have lowess smooth the data perimeters

Details

perimeter is a kind of generalization of datadist for 2 continuous variables. First, the n smallest and largest x values are determined. These form the lowest and highest possible xs to display. Then x is grouped into intervals bounded by these two numbers, with the interval widths defined by xinc. Within each interval, y is sorted and the nnth smallest and largest y are taken as the interval containing sufficient data density to plot interaction surfaces. The interval is ignored when there are insufficient y values. When the data are being readied for persp, bplot uses the approx function to do linear interpolation of the y-boundaries as a function of the x values actually used in forming the grid (the values of the first variable specified to Predict). To make the perimeter smooth, specify lowess.=TRUE to perimeter.

Value

perimeter returns a matrix of class perimeter. This outline can be conveniently plotted by lines.perimeter.

Author(s)

Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]

See Also

datadist, Predict, rms, rmsMisc, levelplot, contourplot, wireframe

Examples

n <- 1000    # define sample size
set.seed(17) # so can reproduce the results
age            <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
cholesterol    <- rnorm(n, 200, 25)
sex            <- factor(sample(c('female','male'), n,TRUE))
label(age)            <- 'Age'      # label is in Hmisc
label(cholesterol)    <- 'Total Cholesterol'
label(blood.pressure) <- 'Systolic Blood Pressure'
label(sex)            <- 'Sex'
units(cholesterol)    <- 'mg/dl'   # uses units.default in Hmisc
units(blood.pressure) <- 'mmHg'

# Specify population model for log odds that Y=1
L <- .4*(sex=='male') + .045*(age-50) +
  (log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)

ddist <- datadist(age, blood.pressure, cholesterol, sex)
options(datadist='ddist')

fit <- lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)),
               x=TRUE, y=TRUE)
p <- Predict(fit, age, cholesterol, sex, np=50) # vary sex last
require(lattice)
bplot(p)                 # image plot for age, cholesterol with color
                         # coming from yhat; use default ranges for
                         # both continuous predictors; two panels (for sex)
bplot(p, lfun=wireframe) # same as bplot(p,,wireframe)
# View from different angle, change y label orientation accordingly
# Default is z=40, x=-60
bplot(p,, wireframe, screen=list(z=40, x=-75), ylabrot=-25)
bplot(p,, contourplot)   # contour plot
bounds  <- perimeter(age, cholesterol, lowess=TRUE)
plot(age, cholesterol)     # show bivariate data density and perimeter
lines(bounds[,c('x','ymin')]); lines(bounds[,c('x','ymax')])
p <- Predict(fit, age, cholesterol)  # use only one sex
bplot(p, perim=bounds)   # draws image() plot
                         # don't show estimates where data are sparse
                         # doesn't make sense here since vars don't interact
bplot(p, plogis(yhat) ~ age*cholesterol) # Probability scale
options(datadist=NULL)

Resampling Model Calibration

Description

Uses bootstrapping or cross-validation to get bias-corrected (overfitting- corrected) estimates of predicted vs. observed values based on subsetting predictions into intervals (for survival models) or on nonparametric smoothers (for other models). There are calibration functions for Cox (cph), parametric survival models (psm), binary and ordinal logistic models (lrm) and ordinary least squares (ols). For survival models, "predicted" means predicted survival probability at a single time point, and "observed" refers to the corresponding Kaplan-Meier survival estimate, stratifying on intervals of predicted survival, or, if the polspline package is installed, the predicted survival probability as a function of transformed predicted survival probability using the flexible hazard regression approach (see the val.surv function for details). For logistic and linear models, a nonparametric calibration curve is estimated over a sequence of predicted values. The fit must have specified x=TRUE, y=TRUE. The print and plot methods for lrm and ols models (which use calibrate.default) print the mean absolute error in predictions, the mean squared error, and the 0.9 quantile of the absolute error. Here, error refers to the difference between the predicted values and the corresponding bias-corrected calibrated values.

Below, the second, third, and fourth invocations of calibrate are, respectively, for ols and lrm, cph, and psm. The first and second plot invocation are respectively for lrm and ols fits or all other fits.

Usage

calibrate(fit, ...)
## Default S3 method:
calibrate(fit, predy, 
  method=c("boot","crossvalidation",".632","randomization"),
  B=40, bw=FALSE, rule=c("aic","p"),
  type=c("residual","individual"),
  sls=.05, aics=0, force=NULL, estimates=TRUE, pr=FALSE, kint,
  smoother="lowess", digits=NULL, ...) 
## S3 method for class 'cph'
calibrate(fit, cmethod=c('hare', 'KM'),
  method="boot", u, m=150, pred, cuts, B=40, 
  bw=FALSE, rule="aic", type="residual", sls=0.05, aics=0, force=NULL,
  estimates=TRUE,
  pr=FALSE, what="observed-predicted", tol=1e-12, maxdim=5, ...)
## S3 method for class 'psm'
calibrate(fit, cmethod=c('hare', 'KM'),
  method="boot", u, m=150, pred, cuts, B=40,
  bw=FALSE,rule="aic",
  type="residual", sls=.05, aics=0, force=NULL, estimates=TRUE,
  pr=FALSE, what="observed-predicted", tol=1e-12, maxiter=15, 
  rel.tolerance=1e-5, maxdim=5, ...)

## S3 method for class 'calibrate'
print(x, B=Inf, ...)
## S3 method for class 'calibrate.default'
print(x, B=Inf, ...)

## S3 method for class 'calibrate'
plot(x, xlab, ylab, subtitles=TRUE, conf.int=TRUE,
 cex.subtitles=.75, riskdist=TRUE, add=FALSE,
 scat1d.opts=list(nhistSpike=200), par.corrected=NULL, ...)

## S3 method for class 'calibrate.default'
plot(x, xlab, ylab, xlim, ylim,
  legend=TRUE, subtitles=TRUE, cex.subtitles=.75, riskdist=TRUE,
  scat1d.opts=list(nhistSpike=200), ...)

Arguments

fit

a fit from ols, lrm, cph or psm

x

an object created by calibrate

method, B, bw, rule, type, sls, aics, force, estimates

see validate. For print.calibrate, B is an upper limit on the number of resamples for which information is printed about which variables were selected in each model re-fit. Specify zero to suppress printing. Default is to print all re-samples.

cmethod

method for validating survival predictions using right-censored data. The default is cmethod='hare' to use the hare function in the polspline package. Specify cmethod='KM' to use less precision stratified Kaplan-Meier estimates. If the polspline package is not available, the procedure reverts to cmethod='KM'.

u

the time point for which to validate predictions for survival models. For cph fits, you must have specified surv=TRUE, time.inc=u, where u is the constant specifying the time to predict.

m

group predicted u-time units survival into intervals containing m subjects on the average (for survival models only)

pred

vector of predicted survival probabilities at which to evaluate the calibration curve. By default, the low and high prediction values from datadist are used, which for large sample size is the 10th smallest to the 10th largest predicted probability.

cuts

actual cut points for predicted survival probabilities. You may specify only one of m and cuts (for survival models only)

pr

set to TRUE to print intermediate results for each re-sample

what

The default is "observed-predicted", meaning to estimate optimism in this difference. This is preferred as it accounts for skewed distributions of predicted probabilities in outer intervals. You can also specify "observed". This argument applies to survival models only.

tol

criterion for matrix singularity (default is 1e-12)

maxdim

see hare

maxiter

for psm, this is passed to survreg.control (default is 15 iterations)

rel.tolerance

parameter passed to survreg.control for psm (default is 1e-5).

predy

a scalar or vector of predicted values to calibrate (for lrm, ols). Default is 50 equally spaced points between the 5th smallest and the 5th largest predicted values. For lrm the predicted values are probabilities (see kint).

kint

For an ordinal logistic model the default predicted probability that YY\geq the middle level. Specify kint to specify the intercept to use, e.g., kint=2 means to calibrate Prob(Yb)Prob(Y\geq b), where bb is the second level of YY.

smoother

a function in two variables which produces xx- and yy-coordinates by smoothing the input y. The default is to use lowess(x, y, iter=0).

digits

If specified, predicted values are rounded to digits digits before passing to the smoother. Occasionally, large predicted values on the logit scale will lead to predicted probabilities very near 1 that should be treated as 1, and the round function will fix that. Applies to calibrate.default.

...

other arguments to pass to predab.resample, such as group, cluster, and subset. Also, other arguments for plot.

xlab

defaults to "Predicted x-units Survival" or to a suitable label for other models

ylab

defaults to "Fraction Surviving x-units" or to a suitable label for other models

xlim, ylim

2-vectors specifying x- and y-axis limits, if not using defaults

subtitles

set to FALSE to suppress subtitles in plot describing method and for lrm and ols the mean absolute error and original sample size

conf.int

set to FALSE to suppress plotting 0.95 confidence intervals for Kaplan-Meier estimates

cex.subtitles

character size for plotting subtitles

riskdist

set to FALSE to suppress the distribution of predicted risks (survival probabilities) from being plotted

add

set to TRUE to add the calibration plot to an existing plot

scat1d.opts

a list specifying options to send to scat1d if riskdist=TRUE. See scat1d.

par.corrected

a list specifying graphics parameters col, lty, lwd, pch to be used in drawing overfitting-corrected estimates. Default is col="blue", lty=1, lwd=1, pch=4.

legend

set to FALSE to suppress legends (for lrm, ols only) on the calibration plot, or specify a list with elements x and y containing the coordinates of the upper left corner of the legend. By default, a legend will be drawn in the lower right 1/16th of the plot.

Details

If the fit was created using penalized maximum likelihood estimation, the same penalty and penalty.scale parameters are used during validation.

Value

matrix specifying mean predicted survival in each interval, the corresponding estimated bias-corrected Kaplan-Meier estimates, number of subjects, and other statistics. For linear and logistic models, the matrix instead has rows corresponding to the prediction points, and the vector of predicted values being validated is returned as an attribute. The returned object has class "calibrate" or "calibrate.default". plot.calibrate.default invisibly returns the vector of estimated prediction errors corresponding to the dataset used to fit the model.

Side Effects

prints, and stores an object pred.obs or .orig.cal

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]

See Also

validate, predab.resample, groupkm, errbar, scat1d, cph, psm, lowess,fit.mult.impute, processMI

Examples

require(survival)
set.seed(1)
n <- 200
d.time <- rexp(n)
x1 <- runif(n)
x2 <- factor(sample(c('a', 'b', 'c'), n, TRUE))
f <- cph(Surv(d.time) ~ pol(x1,2) * x2, x=TRUE, y=TRUE, surv=TRUE, time.inc=1.5)
#or f <- psm(S ~ \dots)
pa <- requireNamespace('polspline')
if(pa) {
 cal <- calibrate(f, u=1.5, B=20)  # cmethod='hare'
 plot(cal)
}
cal <- calibrate(f, u=1.5, cmethod='KM', m=50, B=20)  # usually B=200 or 300
plot(cal, add=pa)

set.seed(1)
y <- sample(0:2, n, TRUE)
x1 <- runif(n)
x2 <- runif(n)
x3 <- runif(n)
x4 <- runif(n)
f <- lrm(y ~ x1 + x2 + x3 * x4, x=TRUE, y=TRUE)
cal <- calibrate(f, kint=2, predy=seq(.2, .8, length=60), 
                 group=y)
# group= does k-sample validation: make resamples have same 
# numbers of subjects in each level of y as original sample

plot(cal)
#See the example for the validate function for a method of validating
#continuation ratio ordinal logistic models.  You can do the same
#thing for calibrate

General Contrasts of Regression Coefficients

Description

This function computes one or more contrasts of the estimated regression coefficients in a fit from one of the functions in rms, along with standard errors, confidence limits, t or Z statistics, P-values. General contrasts are handled by obtaining the design matrix for two sets of predictor settings (a, b) and subtracting the corresponding rows of the two design matrics to obtain a new contrast design matrix for testing the a - b differences. This allows for quite general contrasts (e.g., estimated differences in means between a 30 year old female and a 40 year old male). This can also be used to obtain a series of contrasts in the presence of interactions (e.g., female:male log odds ratios for several ages when the model contains age by sex interaction). Another use of contrast is to obtain center-weighted (Type III test) and subject-weighted (Type II test) estimates in a model containing treatment by center interactions. For the latter case, you can specify type="average" and an optional weights vector to average the within-center treatment contrasts. The design contrast matrix computed by contrast.rms can be used by other functions.

When the model was fitted by a Bayesian function such as blrm, highest posterior density intervals for contrasts are computed instead, along with the posterior probability that the contrast is positive. posterior.summary specifies whether posterior mean/median/mode is to be used for contrast point estimates.

contrast.rms also allows one to specify four settings to contrast, yielding contrasts that are double differences - the difference between the first two settings (a - b) and the last two (a2 - b2). This allows assessment of interactions.

If usebootcoef=TRUE, the fit was run through bootcov, and conf.type="individual", the confidence intervals are bootstrap nonparametric percentile confidence intervals, basic bootstrap, or BCa intervals, obtained on contrasts evaluated on all bootstrap samples.

By omitting the b argument, contrast can be used to obtain an average or weighted average of a series of predicted values, along with a confidence interval for this average. This can be useful for "unconditioning" on one of the predictors (see the next to last example).

Specifying type="joint", and specifying at least as many contrasts as needed to span the space of a complex test, one can make multiple degree of freedom tests flexibly and simply. Redundant contrasts will be ignored in the joint test. See the examples below. These include an example of an "incomplete interaction test" involving only two of three levels of a categorical variable (the test also tests the main effect).

When more than one contrast is computed, the list created by contrast.rms is suitable for plotting (with error bars or bands) with xYplot or Dotplot (see the last example before the type="joint" examples).

When fit is the result of a Bayesian model fit and fun is specified, contrast.rms operates altogether differently. a and b must both be specified and a2, b2 not specified. fun is evaluated on the estimates separately on a and b and the subtraction is deferred. So even in the absence of interactions, when fun is nonlinear, the settings of factors (predictors) will not cancel out and estimates of differences will be covariate-specific (unless there are no covariates in the model besides the one being varied to get from a to b).

Usage

contrast(fit, ...)
## S3 method for class 'rms'
contrast(fit, a, b, a2, b2, ycut=NULL, cnames=NULL,
         fun=NULL, funint=TRUE,
         type=c("individual", "average", "joint"),
         conf.type=c("individual","simultaneous"), usebootcoef=TRUE,
         boot.type=c("percentile","bca","basic"),
         posterior.summary=c('mean', 'median', 'mode'),
         weights="equal", conf.int=0.95, tol=1e-7, expand=TRUE, ...)

## S3 method for class 'contrast.rms'
print(x, X=FALSE,
       fun=function(u)u, jointonly=FALSE, prob=0.95, ...)

Arguments

fit

a fit of class "rms"

a

a list containing settings for all predictors that you do not wish to set to default (adjust-to) values. Usually you will specify two variables in this list, one set to a constant and one to a sequence of values, to obtain contrasts for the sequence of values of an interacting factor. The gendata function will generate the necessary combinations and default values for unspecified predictors, depending on the expand argument.

b

another list that generates the same number of observations as a, unless one of the two lists generates only one observation. In that case, the design matrix generated from the shorter list will have its rows replicated so that the contrasts assess several differences against the one set of predictor values. This is useful for comparing multiple treatments with control, for example. If b is missing, the design matrix generated from a is analyzed alone.

a2

an optional third list of settings of predictors

b2

an optional fourth list of settings of predictors. Mandatory if a2 is given.

ycut

used of the fit is a constrained partial proportional odds model fit, to specify the single value or vector of values (corresponding to the multiple contrasts) of the response variable to use in forming contrasts. When there is non-proportional odds, odds ratios will vary over levels of the response variable. When there are multiple contrasts and only one value is given for ycut, that value will be propagated to all contrasts. To show the effect of non-proportional odds, let ycut vary.

cnames

vector of character strings naming the contrasts when type!="average". Usually cnames is not necessary as contrast.rms tries to name the contrasts by examining which predictors are varying consistently in the two lists. cnames will be needed when you contrast "non-comparable" settings, e.g., you compare list(treat="drug", age=c(20,30)) with list(treat="placebo"), age=c(40,50))

fun

a function to evaluate on the linear predictor for each of a and b. Applies to Bayesian model fits. Also, a function to transform the contrast, SE, and lower and upper confidence limits before printing. For example, specify fun=exp to anti-log them for logistic models.

type

set type="average" to average the individual contrasts (e.g., to obtain a Type II or III contrast). Set type="joint" to jointly test all non-redundant contrasts with a multiple degree of freedom test and no averaging.

conf.type

The default type of confidence interval computed for a given individual (1 d.f.) contrast is a pointwise confidence interval. Set conf.type="simultaneous" to use the multcomp package's glht and confint functions to compute confidence intervals with simultaneous (family-wise) coverage, thus adjusting for multiple comparisons. Note that individual P-values are not adjusted for multiplicity.

usebootcoef

If fit was the result of bootcov but you want to use the bootstrap covariance matrix instead of the nonparametric percentile, basic, or BCa method for confidence intervals (which uses all the bootstrap coefficients), specify usebootcoef=FALSE.

boot.type

set to 'bca' to compute BCa confidence limits or 'basic' to use the basic bootstrap. The default is to compute percentile intervals

posterior.summary

By default the posterior mean is used. Specify posterior.summary='median' to instead use the posterior median and likewise posterior.summary='mode'. Unlike other functions, contrast.rms does not default to 'mode' because point estimates come from contrasts and not the original model coefficients point estimates.

weights

a numeric vector, used when type="average", to obtain weighted contrasts

conf.int

confidence level for confidence intervals for the contrasts (HPD interval probability for Bayesian analyses)

tol

tolerance for qr function for determining which contrasts are redundant, and for inverting the covariance matrix involved in a joint test

expand

set to FALSE to have gendata not generate all possible combinations of predictor settings. This is useful when getting contrasts over irregular predictor settings.

...

passed to print for main output. A useful thing to pass is digits=4.

x

result of contrast

X

set X=TRUE to print design matrix used in computing the contrasts (or the average contrast)

funint

set to FALSE if fun is not a function such as the result of Mean, Quantile, or ExProb that contains an intercepts argument

jointonly

set to FALSE to omit printing of individual contrasts

prob

highest posterior density interval probability when the fit was Bayesian and fun was specified to contrast.rms

Value

a list of class "contrast.rms" containing the elements Contrast, SE, Z, var, df.residual Lower, Upper, Pvalue, X, cnames, redundant, which denote the contrast estimates, standard errors, Z or t-statistics, variance matrix, residual degrees of freedom (this is NULL if the model was not ols), lower and upper confidence limits, 2-sided P-value, design matrix, contrast names (or NULL), and a logical vector denoting which contrasts are redundant with the other contrasts. If there are any redundant contrasts, when the results of contrast are printed, and asterisk is printed at the start of the corresponding lines. The object also contains ctype indicating what method was used for compute confidence intervals.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
[email protected]

See Also

Predict, gendata, bootcov, summary.rms, anova.rms,

Examples

require(ggplot2)
set.seed(1)
age <- rnorm(200,40,12)
sex <- factor(sample(c('female','male'),200,TRUE))
logit <- (sex=='male') + (age-40)/5
y <- ifelse(runif(200) <= plogis(logit), 1, 0)
f <- lrm(y ~ pol(age,2)*sex)
anova(f)
# Compare a 30 year old female to a 40 year old male
# (with or without age x sex interaction in the model)
contrast(f, list(sex='female', age=30), list(sex='male', age=40))
# Test for interaction between age and sex, duplicating anova
contrast(f, list(sex='female', age=30),
            list(sex='male',   age=30),
            list(sex='female', age=c(40,50)),
            list(sex='male',   age=c(40,50)), type='joint')
# Duplicate overall sex effect in anova with 3 d.f.
contrast(f, list(sex='female', age=c(30,40,50)),
            list(sex='male',   age=c(30,40,50)), type='joint')
# For females get an array of odds ratios against age=40
k <- contrast(f, list(sex='female', age=30:50),
                 list(sex='female', age=40))
print(k, fun=exp)
# Plot odds ratios with pointwise 0.95 confidence bands using log scale
k <- as.data.frame(k[c('Contrast','Lower','Upper')])
ggplot(k, aes(x=30:50, y=exp(Contrast))) + geom_line() +
   geom_ribbon(aes(ymin=exp(Lower), ymax=exp(Upper)),
               alpha=0.15, linetype=0) +
   scale_y_continuous(trans='log10', n.breaks=10,
               minor_breaks=c(seq(0.1, 1, by=.1), seq(1, 10, by=.5))) +
  xlab('Age') + ylab('OR against age 40')

# For a model containing two treatments, centers, and treatment
# x center interaction, get 0.95 confidence intervals separately
# by center
center <- factor(sample(letters[1 : 8], 500, TRUE))
treat  <- factor(sample(c('a','b'), 500, TRUE))
y      <- 8*(treat == 'b') + rnorm(500, 100, 20)
f <- ols(y ~ treat*center)


lc <- levels(center)
contrast(f, list(treat='b', center=lc),
            list(treat='a', center=lc))


# Get 'Type III' contrast: average b - a treatment effect over
# centers, weighting centers equally (which is almost always
# an unreasonable thing to do)
contrast(f, list(treat='b', center=lc),
            list(treat='a', center=lc),
         type='average')


# Get 'Type II' contrast, weighting centers by the number of
# subjects per center.  Print the design contrast matrix used.
k <- contrast(f, list(treat='b', center=lc),
                 list(treat='a', center=lc),
              type='average', weights=table(center))
print(k, X=TRUE)
# Note: If other variables had interacted with either treat 
# or center, we may want to list settings for these variables
# inside the list()'s, so as to not use default settings


# For a 4-treatment study, get all comparisons with treatment 'a'
treat  <- factor(sample(c('a','b','c','d'),  500, TRUE))
y      <- 8*(treat == 'b') + rnorm(500, 100, 20)
dd     <- datadist(treat, center); options(datadist='dd')
f <- ols(y ~ treat*center)
lt <- levels(treat)
contrast(f, list(treat=lt[-1]),
            list(treat=lt[ 1]),
         cnames=paste(lt[-1], lt[1], sep=':'), conf.int=1 - .05 / 3)


# Compare each treatment with average of all others
for(i in 1 : length(lt)) {
  cat('Comparing with', lt[i], '\n\n')
  print(contrast(f, list(treat=lt[-i]),
                    list(treat=lt[ i]), type='average'))
}
options(datadist=NULL)

# Six ways to get the same thing, for a variable that
# appears linearly in a model and does not interact with
# any other variables.  We estimate the change in y per
# unit change in a predictor x1.  Methods 4, 5 also
# provide confidence limits.  Method 6 computes nonparametric
# bootstrap confidence limits.  Methods 2-6 can work
# for models that are nonlinear or non-additive in x1.
# For that case more care is needed in choice of settings
# for x1 and the variables that interact with x1.


## Not run: 
coef(fit)['x1']                            # method 1
diff(predict(fit, gendata(x1=c(0,1))))     # method 2
g <- Function(fit)                         # method 3
g(x1=1) - g(x1=0)
summary(fit, x1=c(0,1))                    # method 4
k <- contrast(fit, list(x1=1), list(x1=0)) # method 5
print(k, X=TRUE)
fit <- update(fit, x=TRUE, y=TRUE)         # method 6
b <- bootcov(fit, B=500)
contrast(fit, list(x1=1), list(x1=0))


# In a model containing age, race, and sex,
# compute an estimate of the mean response for a
# 50 year old male, averaged over the races using
# observed frequencies for the races as weights


f <- ols(y ~ age + race + sex)
contrast(f, list(age=50, sex='male', race=levels(race)),
         type='average', weights=table(race))

# For a Bayesian model get the highest posterior interval for the
# difference in two nonlinear functions of predicted values
# Start with the mean from a proportional odds model
g <- blrm(y ~ x)
M <- Mean(g)
contrast(g, list(x=1), list(x=0), fun=M)

# For the median we have to make sure that contrast can pass the
# per-posterior-draw vector of intercepts through
qu <- Quantile(g)
med <- function(lp, intercepts) qu(0.5, lp, intercepts=intercepts)
contrast(g, list(x=1), list(x=0), fun=med)

## End(Not run)


# Plot the treatment effect (drug - placebo) as a function of age
# and sex in a model in which age nonlinearly interacts with treatment
# for females only

set.seed(1)
n <- 800
treat <- factor(sample(c('drug','placebo'), n,TRUE))
sex   <- factor(sample(c('female','male'),  n,TRUE))
age   <- rnorm(n, 50, 10)
y     <- .05*age + (sex=='female')*(treat=='drug')*.05*abs(age-50) + rnorm(n)
f     <- ols(y ~ rcs(age,4)*treat*sex)
d     <- datadist(age, treat, sex); options(datadist='d')

# show separate estimates by treatment and sex

require(ggplot2)
ggplot(Predict(f, age, treat, sex='female'))
ggplot(Predict(f, age, treat, sex='male'))
ages  <- seq(35,65,by=5); sexes <- c('female','male')
w     <- contrast(f, list(treat='drug',    age=ages, sex=sexes),
                     list(treat='placebo', age=ages, sex=sexes))
# add conf.type="simultaneous" to adjust for having done 14 contrasts
xYplot(Cbind(Contrast, Lower, Upper) ~ age | sex, data=w,
       ylab='Drug - Placebo')
w <- as.data.frame(w[c('age','sex','Contrast','Lower','Upper')])
ggplot(w, aes(x=age, y=Contrast)) + geom_point() + facet_grid(sex ~ .) +
   geom_errorbar(aes(ymin=Lower, ymax=Upper), width=0)
ggplot(w, aes(x=age, y=Contrast)) + geom_line() + facet_grid(sex ~ .) +
   geom_ribbon(aes(ymin=Lower, ymax=Upper), width=0, alpha=0.15, linetype=0)
xYplot(Cbind(Contrast, Lower, Upper) ~ age, groups=sex, data=w,
       ylab='Drug - Placebo', method='alt bars')
options(datadist=NULL)


# Examples of type='joint' contrast tests

set.seed(1)
x1 <- rnorm(100)
x2 <- factor(sample(c('a','b','c'), 100, TRUE))
dd <- datadist(x1, x2); options(datadist='dd')
y  <- x1 + (x2=='b') + rnorm(100)

# First replicate a test statistic from anova()

f <- ols(y ~ x2)
anova(f)
contrast(f, list(x2=c('b','c')), list(x2='a'), type='joint')

# Repeat with a redundancy; compare a vs b, a vs c, b vs c

contrast(f, list(x2=c('a','a','b')), list(x2=c('b','c','c')), type='joint')

# Get a test of association of a continuous predictor with y
# First assume linearity, then cubic

f <- lrm(y>0 ~ x1 + x2)
anova(f)
contrast(f, list(x1=1), list(x1=0), type='joint')  # a minimum set of contrasts
xs <- seq(-2, 2, length=20)
contrast(f, list(x1=0), list(x1=xs), type='joint')

# All contrasts were redundant except for the first, because of
# linearity assumption

f <- lrm(y>0 ~ pol(x1,3) + x2)
anova(f)
contrast(f, list(x1=0), list(x1=xs), type='joint')
print(contrast(f, list(x1=0), list(x1=xs), type='joint'), jointonly=TRUE)

# All contrasts were redundant except for the first 3, because of
# cubic regression assumption

# Now do something that is difficult to do without cryptic contrast
# matrix operations: Allow each of the three x2 groups to have a different
# shape for the x1 effect where x1 is quadratic.  Test whether there is
# a difference in mean levels of y for x2='b' vs. 'c' or whether
# the shape or slope of x1 is different between x2='b' and x2='c' regardless
# of how they differ when x2='a'.  In other words, test whether the mean
# response differs between group b and c at any value of x1.
# This is a 3 d.f. test (intercept, linear, quadratic effects) and is
# a better approach than subsetting the data to remove x2='a' then
# fitting a simpler model, as it uses a better estimate of sigma from
# all the data.

f <- ols(y ~ pol(x1,2) * x2)
anova(f)
contrast(f, list(x1=xs, x2='b'),
            list(x1=xs, x2='c'), type='joint')

# Note: If using a spline fit, there should be at least one value of
# x1 between any two knots and beyond the outer knots.
options(datadist=NULL)

Cox Proportional Hazards Model and Extensions

Description

Modification of Therneau's coxph function to fit the Cox model and its extension, the Andersen-Gill model. The latter allows for interval time-dependent covariables, time-dependent strata, and repeated events. The Survival method for an object created by cph returns an S function for computing estimates of the survival function. The Quantile method for cph returns an S function for computing quantiles of survival time (median, by default). The Mean method returns a function for computing the mean survival time. This function issues a warning if the last follow-up time is uncensored, unless a restricted mean is explicitly requested.

Usage

cph(formula = formula(data), data=environment(formula),
    weights, subset, na.action=na.delete, 
    method=c("efron","breslow","exact","model.frame","model.matrix"), 
    singular.ok=FALSE, robust=FALSE,
    model=FALSE, x=FALSE, y=FALSE, se.fit=FALSE,
    linear.predictors=TRUE, residuals=TRUE, nonames=FALSE,
    eps=1e-4, init, iter.max=10, tol=1e-9, surv=FALSE, time.inc,
    type=NULL, vartype=NULL, debug=FALSE, ...)

## S3 method for class 'cph'
Survival(object, ...)
# Evaluate result as g(times, lp, stratum=1, type=c("step","polygon"))

## S3 method for class 'cph'
Quantile(object, ...)
# Evaluate like h(q, lp, stratum=1, type=c("step","polygon"))

## S3 method for class 'cph'
Mean(object, method=c("exact","approximate"), type=c("step","polygon"),
          n=75, tmax, ...)
# E.g. m(lp, stratum=1, type=c("step","polygon"), tmax, \dots)

Arguments

formula

an S formula object with a Surv object on the left-hand side. The terms can specify any S model formula with up to third-order interactions. The strat function may appear in the terms, as a main effect or an interacting factor. To stratify on both race and sex, you would include both terms strat(race) and strat(sex). Stratification factors may interact with non-stratification factors; not all stratification terms need interact with the same modeled factors.

object

an object created by cph with surv=TRUE

data

name of an S data frame containing all needed variables. Omit this to use a data frame already in the S “search list”.

weights

case weights

subset

an expression defining a subset of the observations to use in the fit. The default is to use all observations. Specify for example age>50 & sex="male" or c(1:100,200:300) respectively to use the observations satisfying a logical expression or those having row numbers in the given vector.

na.action

specifies an S function to handle missing data. The default is the function na.delete, which causes observations with any variable missing to be deleted. The main difference between na.delete and the S-supplied function na.omit is that na.delete makes a list of the number of observations that are missing on each variable in the model. The na.action is usally specified by e.g. options(na.action="na.delete").

method

for cph, specifies a particular fitting method, "model.frame" instead to return the model frame of the predictor and response variables satisfying any subset or missing value checks, or "model.matrix" to return the expanded design matrix. The default is "efron", to use Efron's likelihood for fitting the model.

For Mean.cph, method is "exact" to use numerical integration of the survival function at any linear predictor value to obtain a mean survival time. Specify method="approximate" to use an approximate method that is slower when Mean.cph is executing but then is essentially instant thereafter. For the approximate method, the area is computed for n points equally spaced between the min and max observed linear predictor values. This calculation is done separately for each stratum. Then the n pairs (X beta, area) are saved in the generated S function, and when this function is evaluated, the approx function is used to evaluate the mean for any given linear predictor values, using linear interpolation over the n X beta values.

singular.ok

If TRUE, the program will automatically skip over columns of the X matrix that are linear combinations of earlier columns. In this case the coefficients for such columns will be NA, and the variance matrix will contain zeros. For ancillary calculations, such as the linear predictor, the missing coefficients are treated as zeros. The singularities will prevent many of the features of the rms library from working.

robust

if TRUE a robust variance estimate is returned. Default is TRUE if the model includes a cluster() operative, FALSE otherwise.

model

default is FALSE(false). Set to TRUE to return the model frame as element model of the fit object.

x

default is FALSE. Set to TRUE to return the expanded design matrix as element x (without intercept indicators) of the returned fit object.

y

default is FALSE. Set to TRUE to return the vector of response values (Surv object) as element y of the fit.

se.fit

default is FALSE. Set to TRUE to compute the estimated standard errors of the estimate of X beta and store them in element se.fit of the fit. The predictors are first centered to their means before computing the standard errors.

linear.predictors

set to FALSE to omit linear.predictors vector from fit

residuals

set to FALSE to omit residuals vector from fit

nonames

set to TRUE to not set names attribute for linear.predictors, residuals, se.fit, and rows of design matrix

eps

convergence criterion - change in log likelihood.

init

vector of initial parameter estimates. Defaults to all zeros. Special residuals can be obtained by setting some elements of init to MLEs and others to zero and specifying iter.max=1.

iter.max

maximum number of iterations to allow. Set to 0 to obtain certain null-model residuals.

tol

tolerance for declaring singularity for matrix inversion (available only when survival5 or later package is in effect)

surv

set to TRUE to compute underlying survival estimates for each stratum, and to store these along with standard errors of log Lambda(t), maxtime (maximum observed survival or censoring time), and surv.summary in the returned object. Set surv="summary" to only compute and store surv.summary, not survival estimates at each unique uncensored failure time. If you specify x=TRUE and y=TRUE, you can obtain predicted survival later, with accurate confidence intervals for any set of predictor values. The standard error information stored as a result of surv=TRUE are only accurate at the mean of all predictors. If the model has no covariables, these are of course OK. The main reason for using surv is to greatly speed up the computation of predicted survival probabilities as a function of the covariables, when accurate confidence intervals are not needed.

time.inc

time increment used in deriving surv.summary. Survival, number at risk, and standard error will be stored for t=0, time.inc, 2 time.inc, ..., maxtime, where maxtime is the maximum survival time over all strata. time.inc is also used in constructing the time axis in the survplot function (see below). The default value for time.inc is 30 if units(ftime) = "Day" or no units attribute has been attached to the survival time variable. If units(ftime) is a word other than "Day", the default for time.inc is 1 when it is omitted, unless maxtime<1, then maxtime/10 is used as time.inc. If time.inc is not given and maxtime/ default time.inc > 25, time.inc is increased.

type

(for cph) applies if surv is TRUE or "summary". If type is omitted, the method consistent with method is used. See survfit.coxph (under survfit) or survfit.cph for details and for the definitions of values of type

For Survival, Quantile, Mean set to "polygon" to use linear interpolation instead of the usual step function. For Mean, the default of step will yield the sample mean in the case of no censoring and no covariables, if type="kaplan-meier" was specified to cph. For method="exact", the value of type is passed to the generated function, and it can be overridden when that function is actually invoked. For method="approximate", Mean.cph generates the function different ways according to type, and this cannot be changed when the function is actually invoked.

vartype

see survfit.coxph

debug

set to TRUE to print debugging information related to model matrix construction. You can also use options(debug=TRUE).

...

other arguments passed to coxph.fit from cph. Ignored by other functions.

times

a scalar or vector of times at which to evaluate the survival estimates

lp

a scalar or vector of linear predictors (including the centering constant) at which to evaluate the survival estimates

stratum

a scalar stratum number or name (e.g., "sex=male") to use in getting survival probabilities

q

a scalar quantile or a vector of quantiles to compute

n

the number of points at which to evaluate the mean survival time, for method="approximate" in Mean.cph.

tmax

For Mean.cph, the default is to compute the overall mean (and produce a warning message if there is censoring at the end of follow-up). To compute a restricted mean life length, specify the truncation point as tmax. For method="exact", tmax is passed to the generated function and it may be overridden when that function is invoked. For method="approximate", tmax must be specified at the time that Mean.cph is run.

Details

If there is any strata by covariable interaction in the model such that the mean X beta varies greatly over strata, method="approximate" may not yield very accurate estimates of the mean in Mean.cph.

For method="approximate" if you ask for an estimate of the mean for a linear predictor value that was outside the range of linear predictors stored with the fit, the mean for that observation will be NA.

Value

For Survival, Quantile, or Mean, an S function is returned. Otherwise, in addition to what is listed below, formula/design information and the components maxtime, time.inc, units, model, x, y, se.fit are stored, the last 5 depending on the settings of options by the same names. The vectors or matrix stored if y=TRUE or x=TRUE have rows deleted according to subset and to missing data, and have names or row names that come from the data frame used as input data.

n

table with one row per stratum containing number of censored and uncensored observations

coef

vector of regression coefficients

stats

vector containing the named elements Obs, Events, Model L.R., d.f., P, Score, Score P, R2, Somers' Dxy, g-index, and gr, the g-index on the hazard ratio scale. R2 is the Nagelkerke R-squared, with division by the maximum attainable R-squared.

var

variance/covariance matrix of coefficients

linear.predictors

values of predicted X beta for observations used in fit, normalized to have overall mean zero, then having any offsets added

resid

martingale residuals

loglik

log likelihood at initial and final parameter values

score

value of score statistic at initial values of parameters

times

lists of times (if surv="T")

surv

lists of underlying survival probability estimates

std.err

lists of standard errors of estimate log-log survival

surv.summary

a 3 dimensional array if surv=TRUE. The first dimension is time ranging from 0 to maxtime by time.inc. The second dimension refers to strata. The third dimension contains the time-oriented matrix with Survival, n.risk (number of subjects at risk), and std.err (standard error of log-log survival).

center

centering constant, equal to overall mean of X beta.

Author(s)

Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]

See Also

coxph, survival-internal, Surv, residuals.cph, cox.zph, survfit.cph, survest.cph, survfit.coxph, survplot, datadist, rms, rms.trans, anova.rms, summary.rms, Predict, fastbw, validate, calibrate, plot.Predict, ggplot.Predict, specs.rms, lrm, which.influence, na.delete, na.detail.response, print.cph, latex.cph, vif, ie.setup, GiniMd, dxy.cens, concordance

Examples

# Simulate data from a population model in which the log hazard
# function is linear in age and there is no age x sex interaction

require(survival)
require(ggplot2)
n <- 1000
set.seed(731)
age <- 50 + 12*rnorm(n)
label(age) <- "Age"
sex <- factor(sample(c('Male','Female'), n, 
              rep=TRUE, prob=c(.6, .4)))
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50)+.8*(sex=='Female'))
dt <- -log(runif(n))/h
label(dt) <- 'Follow-up Time'
e <- ifelse(dt <= cens,1,0)
dt <- pmin(dt, cens)
units(dt) <- "Year"
dd <- datadist(age, sex)
options(datadist='dd')
S <- Surv(dt,e)

f <- cph(S ~ rcs(age,4) + sex, x=TRUE, y=TRUE)
cox.zph(f, "rank")             # tests of PH
anova(f)
ggplot(Predict(f, age, sex)) # plot age effect, 2 curves for 2 sexes
survplot(f, sex)             # time on x-axis, curves for x2
res <- resid(f, "scaledsch")
time <- as.numeric(dimnames(res)[[1]])
z <- loess(res[,4] ~ time, span=0.50)   # residuals for sex
plot(time, fitted(z))
lines(supsmu(time, res[,4]),lty=2)
plot(cox.zph(f,"identity"))    #Easier approach for last few lines
# latex(f)


f <- cph(S ~ age + strat(sex), surv=TRUE)
g <- Survival(f)   # g is a function
g(seq(.1,1,by=.1), stratum="sex=Male", type="poly") #could use stratum=2
med <- Quantile(f)
plot(Predict(f, age, fun=function(x) med(lp=x)))  #plot median survival

# Fit a model that is quadratic in age, interacting with sex as strata
# Compare standard errors of linear predictor values with those from
# coxph
# Use more stringent convergence criteria to match with coxph

f <- cph(S ~ pol(age,2)*strat(sex), x=TRUE, eps=1e-9, iter.max=20)
coef(f)
se <- predict(f, se.fit=TRUE)$se.fit
require(lattice)
xyplot(se ~ age | sex, main='From cph')
a <- c(30,50,70)
comb <- data.frame(age=rep(a, each=2),
                   sex=rep(levels(sex), 3))

p <- predict(f, comb, se.fit=TRUE)
comb$yhat  <- p$linear.predictors
comb$se    <- p$se.fit
z <- qnorm(.975)
comb$lower <- p$linear.predictors - z*p$se.fit
comb$upper <- p$linear.predictors + z*p$se.fit
comb

age2 <- age^2
f2 <- coxph(S ~ (age + age2)*strata(sex))
coef(f2)
se <- predict(f2, se.fit=TRUE)$se.fit
xyplot(se ~ age | sex, main='From coxph')
comb <- data.frame(age=rep(a, each=2), age2=rep(a, each=2)^2,
                   sex=rep(levels(sex), 3))
p <- predict(f2, newdata=comb, se.fit=TRUE)
comb$yhat <- p$fit
comb$se   <- p$se.fit
comb$lower <- p$fit - z*p$se.fit
comb$upper <- p$fit + z*p$se.fit
comb


# g <- cph(Surv(hospital.charges) ~ age, surv=TRUE)
# Cox model very useful for analyzing highly skewed data, censored or not
# m <- Mean(g)
# m(0)                           # Predicted mean charge for reference age


#Fit a time-dependent covariable representing the instantaneous effect
#of an intervening non-fatal event
rm(age)
set.seed(121)
dframe <- data.frame(failure.time=1:10, event=rep(0:1,5),
                     ie.time=c(NA,1.5,2.5,NA,3,4,NA,5,5,5), 
                     age=sample(40:80,10,rep=TRUE))
z <- ie.setup(dframe$failure.time, dframe$event, dframe$ie.time)
S <- z$S
ie.status <- z$ie.status
attach(dframe[z$subs,])    # replicates all variables

f <- cph(S ~ age + ie.status, x=TRUE, y=TRUE)  
#Must use x=TRUE,y=TRUE to get survival curves with time-dep. covariables


#Get estimated survival curve for a 50-year old who has an intervening
#non-fatal event at 5 days
new <- data.frame(S=Surv(c(0,5), c(5,999), c(FALSE,FALSE)), age=rep(50,2),
                  ie.status=c(0,1))
g <- survfit(f, new)
plot(c(0,g$time), c(1,g$surv[,2]), type='s', 
     xlab='Days', ylab='Survival Prob.')
# Not certain about what columns represent in g$surv for survival5
# but appears to be for different ie.status
#or:
#g <- survest(f, new)
#plot(g$time, g$surv, type='s', xlab='Days', ylab='Survival Prob.')


#Compare with estimates when there is no intervening event
new2 <- data.frame(S=Surv(c(0,5), c(5, 999), c(FALSE,FALSE)), age=rep(50,2),
                   ie.status=c(0,0))
g2 <- survfit(f, new2)
lines(c(0,g2$time), c(1,g2$surv[,2]), type='s', lty=2)
#or:
#g2 <- survest(f, new2)
#lines(g2$time, g2$surv, type='s', lty=2)
detach("dframe[z$subs, ]")
options(datadist=NULL)

Continuation Ratio Ordinal Logistic Setup

Description

Creates several new variables which help set up a dataset with an ordinal response variable yy for use in fitting a forward continuation ratio (CR) model. The CR model can be fitted with binary logistic regression if each input observation is replicated the proper number of times according to the yy value, a new binary yy is computed that has at most one y=1y=1 per subject, and if a cohort variable is used to define the current qualifying condition for a cohort of subjects, e.g., y2y\geq 2. cr.setup creates the needed auxilliary variables. See predab.resample and validate.lrm for information about validating CR models (e.g., using the bootstrap to sample with replacement from the original subjects instead of the records used in the fit, validating the model separately for user-specified values of cohort).

Usage

cr.setup(y)

Arguments

y

a character, numeric, category, or factor vector containing values of the response variable. For category or factor variables, the levels of the variable are assumed to be listed in an ordinal way.

Value

a list with components y, cohort, subs, reps. y is a new binary variable that is to be used in the binary logistic fit. cohort is a factor vector specifying which cohort condition currently applies. subs is a vector of subscripts that can be used to replicate other variables the same way y was replicated. reps specifies how many times each original observation was replicated. y, cohort, subs are all the same length and are longer than the original y vector. reps is the same length as the original y vector. The subs vector is suitable for passing to validate.lrm or calibrate, which pass this vector under the name cluster on to predab.resample so that bootstrapping can be done by sampling with replacement from the original subjects rather than from the individual records created by cr.setup.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]

References

Berridge DM, Whitehead J: Analysis of failure time data with ordinal categories of response. Stat in Med 10:1703–1710, 1991.

See Also

lrm, glm, predab.resample

Examples

y <- c(NA, 10, 21, 32, 32)
cr.setup(y)


set.seed(171)
y <- sample(0:2, 100, rep=TRUE)
sex <- sample(c("f","m"),100,rep=TRUE)
sex <- factor(sex)
table(sex, y)
options(digits=5)
tapply(y==0, sex, mean)
tapply(y==1, sex, mean)
tapply(y==2, sex, mean)
cohort <- y>=1
tapply(y[cohort]==1, sex[cohort], mean)

u <- cr.setup(y)
Y <- u$y
cohort <- u$cohort
sex <- sex[u$subs]

lrm(Y ~ cohort + sex)
 
f <- lrm(Y ~ cohort*sex)   # saturated model - has to fit all data cells
f

#Prob(y=0|female):
# plogis(-.50078)
#Prob(y=0|male):
# plogis(-.50078+.11301)
#Prob(y=1|y>=1, female):
plogis(-.50078+.31845)
#Prob(y=1|y>=1, male):
plogis(-.50078+.31845+.11301-.07379)

combinations <- expand.grid(cohort=levels(cohort), sex=levels(sex))
combinations
p <- predict(f, combinations, type="fitted")
p
p0 <- p[c(1,3)]
p1 <- p[c(2,4)]
p1.unconditional <- (1 - p0) *p1
p1.unconditional
p2.unconditional <- 1 - p0 - p1.unconditional
p2.unconditional


## Not run: 
dd <- datadist(inputdata)   # do this on non-replicated data
options(datadist='dd')
pain.severity <- inputdata$pain.severity
u <- cr.setup(pain.severity)
# inputdata frame has age, sex with pain.severity
attach(inputdata[u$subs,])  # replicate age, sex
# If age, sex already available, could do age <- age[u$subs] etc., or
# age <- rep(age, u$reps), etc.
y      <- u$y
cohort <- u$cohort
dd     <- datadist(dd, cohort)       # add to dd
f <- lrm(y ~ cohort + age*sex)       # ordinary cont. ratio model
g <- lrm(y ~ cohort*sex + age, x=TRUE,y=TRUE) # allow unequal slopes for
                                     # sex across cutoffs
cal <- calibrate(g, cluster=u$subs, subset=cohort=='all')  
# subs makes bootstrap sample the correct units, subset causes
# Predicted Prob(pain.severity=0) to be checked for calibration

## End(Not run)

Distribution Summaries for Predictor Variables

Description

For a given set of variables or a data frame, determines summaries of variables for effect and plotting ranges, values to adjust to, and overall ranges for Predict, plot.Predict, ggplot.Predict, summary.rms, survplot, and nomogram.rms. If datadist is called before a model fit and the resulting object pointed to with options(datadist="name"), the data characteristics will be stored with the fit by Design(), so that later predictions and summaries of the fit will not need to access the original data used in the fit. Alternatively, you can specify the values for each variable in the model when using these 3 functions, or specify the values of some of them and let the functions look up the remainder (of say adjustmemt levels) from an object created by datadist. The best method is probably to run datadist once before any models are fitted, storing the distribution summaries for all potential variables. Adjustment values are 0 for binary variables, the most frequent category (or optionally the first category level) for categorical (factor) variables, the middle level for ordered factor variables, and medians for continuous variables. See descriptions of q.display and q.effect for how display and effect ranges are chosen for continuous variables.

Usage

datadist(..., data, q.display, q.effect=c(0.25, 0.75),
         adjto.cat=c('mode','first'), n.unique=10)

## S3 method for class 'datadist'
print(x, ...)
# options(datadist="dd")
# used by summary, plot, survplot, sometimes predict
# For dd substitute the name of the result of datadist

Arguments

...

a list of variable names, separated by commas, a single data frame, or a fit with Design information. The first element in this list may also be an object created by an earlier call to datadist; then the later variables are added to this datadist object. For a fit object, the variables named in the fit are retrieved from the active data frame or from the location pointed to by data=frame number or data="data frame name". For print, is ignored.

data

a data frame or a search position. If data is a search position, it is assumed that a data frame is attached in that position, and all its variables are used. If you specify both individual variables in ... and data, the two sets of variables are combined. Unless the first argument is a fit object, data must be an integer.

q.display

set of two quantiles for computing the range of continuous variables to use in displaying regression relationships. Defaults are qq and 1q1-q, where q=10/max(n,200)q=10/max(n,200), and nn is the number of non-missing observations. Thus for n<200n<200, the .05 and .95 quantiles are used. For n200n\geq 200, the 10th10^{th} smallest and 10th10^{th} largest values are used. If you specify q.display, those quantiles are used whether or not n<200n<200.

q.effect

set of two quantiles for computing the range of continuous variables to use in estimating regression effects. Defaults are c(.25,.75), which yields inter-quartile-range odds ratios, etc.

adjto.cat

default is "mode", indicating that the modal (most frequent) category for categorical (factor) variables is the adjust-to setting. Specify "first" to use the first level of factor variables as the adjustment values. In the case of many levels having the maximum frequency, the first such level is used for "mode".

n.unique

variables having n.unique or fewer unique values are considered to be discrete variables in that their unique values are stored in the values list. This will affect how functions such as nomogram.Design determine whether variables are discrete or not.

x

result of datadist

Details

For categorical variables, the 7 limits are set to character strings (factors) which correspond to c(NA,adjto.level,NA,1,k,1,k), where k is the number of levels. For ordered variables with numeric levels, the limits are set to c(L,M,H,L,H,L,H), where L is the lowest level, M is the middle level, and H is the highest level.

Value

a list of class "datadist" with the following components

limits

a 7×k7 \times k vector, where kk is the number of variables. The 7 rows correspond to the low value for estimating the effect of the variable, the value to adjust the variable to when examining other variables, the high value for effect, low value for displaying the variable, the high value for displaying it, and the overall lowest and highest values.

values

a named list, with one vector of unique values for each numeric variable having no more than n.unique unique values

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]

See Also

rms, rms.trans, describe, Predict, summary.rms

Examples

## Not run: 
d <- datadist(data=1)         # use all variables in search pos. 1
d <- datadist(x1, x2, x3)
page(d)                       # if your options(pager) leaves up a pop-up
                              # window, this is a useful guide in analyses
d <- datadist(data=2)         # all variables in search pos. 2
d <- datadist(data=my.data.frame)
d <- datadist(my.data.frame)  # same as previous.  Run for all potential vars.
d <- datadist(x2, x3, data=my.data.frame)   # combine variables
d <- datadist(x2, x3, q.effect=c(.1,.9), q.display=c(0,1))
# uses inter-decile range odds ratios,
# total range of variables for regression function plots
d <- datadist(d, z)           # add a new variable to an existing datadist
options(datadist="d")         #often a good idea, to store info with fit
f <- ols(y ~ x1*x2*x3)


options(datadist=NULL)        #default at start of session
f <- ols(y ~ x1*x2)
d <- datadist(f)              #info not stored in `f'
d$limits["Adjust to","x1"] <- .5   #reset adjustment level to .5
options(datadist="d")


f <- lrm(y ~ x1*x2, data=mydata)
d <- datadist(f, data=mydata)
options(datadist="d")


f <- lrm(y ~ x1*x2)           #datadist not used - specify all values for
summary(f, x1=c(200,500,800), x2=c(1,3,5))         # obtaining predictions
plot(Predict(f, x1=200:800, x2=3))  # or ggplot()


# Change reference value to get a relative odds plot for a logistic model
d$limits$age[2] <- 30    # make 30 the reference value for age
# Could also do: d$limits["Adjust to","age"] <- 30
fit <- update(fit)   # make new reference value take effect
plot(Predict(fit, age, ref.zero=TRUE, fun=exp),
     ylab='Age=x:Age=30 Odds Ratio')   # or ggplot()

## End(Not run)

Function Generator For Exceedance Probabilities

Description

For an orm object generates a function for computing the estimates of the function Prob(Y>=y) given one or more values of the linear predictor using the reference (median) intercept. This function can optionally be evaluated at only a set of user-specified y values, otherwise a right-step function is returned. There is a plot method for plotting the step functions, and if more than one linear predictor was evaluated multiple step functions are drawn. ExProb is especially useful for nomogram.

Optionally a normal approximation for a confidence interval for exceedance probabilities will be computed using the delta method, if conf.int > 0 is specified to the function generated from calling ExProb. In that case, a "lims" attribute is included in the result computed by the derived cumulative probability function.

Usage

ExProb(object, ...)

## S3 method for class 'orm'
ExProb(object, codes = FALSE, ...)

## S3 method for class 'ExProb'
plot(x, ..., data=NULL,
                      xlim=NULL, xlab=x$yname, ylab=expression(Prob(Y>=y)),
                      col=par('col'), col.vert='gray85', pch=20,
                      pch.data=21, lwd=par('lwd'), lwd.data=lwd,
                      lty.data=2, key=TRUE)

Arguments

object

a fit object from orm

codes

if TRUE, ExProb use the integer codes 1,2,,k1,2,\ldots,k for the kk-level response instead of its original unique values

...

ignored for ExProb. Passed to plot for plot.ExProb

data

Specify data if you want to add stratified empirical probabilities to the graph. If data is a numeric vector, it is assumed that no groups are present. Otherwise data must be a list or data frame where the first variable is the grouping variable (corresponding to what made the linear predictor vary) and the second variable is the data vector for the y variable. The rows of data should be sorted to be in order of the linear predictor argument.

x

an object created by running the function created by ExProb

xlim

limits for x-axis; default is range of observed y

xlab

x-axis label

ylab

y-axis label

col

color for horizontal lines and points

col.vert

color for vertical discontinuities

pch

plotting symbol for predicted curves

lwd

line width for predicted curves

pch.data, lwd.data, lty.data

plotting parameters for data

key

set to FALSE to suppress key in plot if data is given

Value

ExProb returns an R function. Running the function returns an object of class "ExProb".

Author(s)

Frank Harrell and Shengxin Tu

See Also

orm, Quantile.orm

Examples

set.seed(1)
x1 <- runif(200)
yvar <- x1 + runif(200)
f <- orm(yvar ~ x1)
d <- ExProb(f)
lp <- predict(f, newdata=data.frame(x1=c(.2,.8)))
w <- d(lp)
s1 <- abs(x1 - .2) < .1
s2 <- abs(x1 - .8) < .1
plot(w, data=data.frame(x1=c(rep(.2, sum(s1)), rep(.8, sum(s2))),
                        yvar=c(yvar[s1], yvar[s2])))

qu <- Quantile(f)
abline(h=c(.1,.5), col='gray80')
abline(v=qu(.5, lp), col='gray80')
abline(v=qu(.9, lp), col='green')

Fast Backward Variable Selection

Description

Performs a slightly inefficient but numerically stable version of fast backward elimination on factors, using a method based on Lawless and Singhal (1978). This method uses the fitted complete model and computes approximate Wald statistics by computing conditional (restricted) maximum likelihood estimates assuming multivariate normality of estimates. fastbw deletes factors, not columns of the design matrix. Factors requiring multiple d.f. will be retained or dropped as a group. The function prints the deletion statistics for each variable in turn, and prints approximate parameter estimates for the model after deleting variables. The approximation is better when the number of factors deleted is not large. For ols, the approximation is exact for regression coefficients, and standard errors are only off by a factor equal to the ratio of the mean squared error estimate for the reduced model to the original mean squared error estimate for the full model.

If the fit was from ols, fastbw will compute the usual R2R^2 statistic for each model.

Usage

fastbw(fit, rule=c("aic", "p"),
       type=c("residual", "individual", "total"),  sls=.05, aics=0, eps=1e-9,
       k.aic=2, force=NULL)

## S3 method for class 'fastbw'
print(x, digits=4, estimates=TRUE, ...)

Arguments

fit

fit object with Varcov(fit) defined (e.g., from ols, lrm, cph, psm, glmD)

rule

Stopping rule. Defaults to "aic" for Akaike's information criterion. Use rule="p" to use PP-values

type

Type of statistic on which to base the stopping rule. Default is "residual" for the pooled residual chi-square. Use type="individual" to use Wald chi-square of individual factors.

sls

Significance level for staying in a model if rule="p". Default is .05.

aics

For rule="aic", variables are deleted until the chi-square - k.aic times d.f. would rise above aics. Default aics is zero to use the ordinary AIC. Set aics to say 10000 to see all variables deleted in order of descending importance.

eps

Singularity criterion, default is 1E-9.

k.aic

multiplier to compute AIC, default is 2. To use BIC, set k.aic equal to log(n)\log(n), where nn is the effective sample size (number of events for survival models).

force

a vector of integers specifying parameters forced to be in the model, not counting intercept(s)

x

result of fastbw

digits

number of significant digits to print

estimates

set to FALSE to suppress printing table of approximate coefficients, SEs, etc., after variable deletions

...

ignored

Value

a list with an attribute kept if bw=TRUE, and the following components:

result

matrix of statistics with rows in order of deletion.

names.kept

names of factors kept in final model.

factors.kept

the subscripts of factors kept in the final model

factors.deleted

opposite of factors.kept.

parms.kept

column numbers in design matrix corresponding to parameters kept in the final model.

parms.deleted

opposite of parms.kept.

coefficients

vector of approximate coefficients of reduced model.

var

approximate covariance matrix for reduced model.

Coefficients

matrix of coefficients of all models. Rows correspond to the successive models examined and columns correspond to the coefficients in the full model. For variables not in a particular sub-model (row), the coefficients are zero.

Author(s)

Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]

References

Lawless, J. F. and Singhal, K. (1978): Efficient screening of nonnormal regression models. Biometrics 34:318–327.

See Also

rms, ols, lrm, cph, psm, validate, solvet, rmsMisc

Examples

## Not run: 
fastbw(fit, optional.arguments)     # print results
z <- fastbw(fit, optional.args)     # typically used in simulations
lm.fit(X[,z$parms.kept], Y)         # least squares fit of reduced model

## End(Not run)

Compose an S Function to Compute X beta from a Fit

Description

Function is a class of functions for creating other S functions. Function.rms is the method for creating S functions to compute X beta, based on a model fitted with rms in effect. Like latexrms, Function.rms simplifies restricted cubic spline functions and factors out terms in second-order interactions. Function.rms will not work for models that have third-order interactions involving restricted cubic splines. Function.cph is a particular method for handling fits from cph, for which an intercept (the negative of the centering constant) is added to the model. sascode is a function that takes an S function such as one created by Function and does most of the editing to turn the function definition into a fragment of SAS code for computing X beta from the fitted model, along with assignment statements that initialize predictors to reference values. perlcode similarly creates Perl code to evaluate a fitted regression model.

Usage

## S3 method for class 'rms'
Function(object, intercept=NULL,
 digits=max(8, .Options$digits), posterior.summary=c('mean', 'median', 'mode'), ...)
## S3 method for class 'cph'
Function(object, intercept=-object$center, ...)

# Use result as fun(predictor1=value1, predictor2=value2, \dots)

sascode(object, file='', append=FALSE)

perlcode(object)

Arguments

object

a fit created with rms in effect

intercept

an intercept value to use (not allowed to be specified to Function.cph). The intercept is usually retrieved from the regression coefficients automatically.

digits

number of significant digits to use for coefficients and knot locations

posterior.summary

if using a Bayesian model fit such as from blrm, specifies whether to use posterior mode/mean/median parameter estimates in generating the function

file

name of a file in which to write the SAS code. Default is to write to standard output.

append

set to TRUE to have sascode append code to an existing file named file.

...

arguments to pass to Function.rms from Function.cph

Value

Function returns an S-Plus function that can be invoked in any usual context. The function has one argument per predictor variable, and the default values of the predictors are set to adjust-to values (see datadist). Multiple predicted X beta values may be calculated by specifying vectors as arguments to the created function. All non-scalar argument values must have the same length. perlcode returns a character string with embedded newline characters.

Author(s)

Frank Harrell, Jeremy Stephens, and Thomas Dupont
Department of Biostatistics
Vanderbilt University
[email protected]

See Also

latexrms, transcan, predict.rms, rms, rms.trans

Examples

suppressWarnings(RNGversion("3.5.0"))
set.seed(1331)
x1 <- exp(rnorm(100))
x2 <- factor(sample(c('a','b'),100,rep=TRUE))
dd <- datadist(x1, x2)
options(datadist='dd')
y  <- log(x1)^2+log(x1)*(x2=='b')+rnorm(100)/4
f  <- ols(y ~ pol(log(x1),2)*x2)
f$coef
g  <- Function(f, digits=5)
g
sascode(g)
cat(perlcode(g), '\n')
g()
g(x1=c(2,3), x2='b')   #could omit x2 since b is default category
predict(f, expand.grid(x1=c(2,3),x2='b'))
g8 <- Function(f)   # default is 8 sig. digits
g8(x1=c(2,3), x2='b')
options(datadist=NULL)


## Not run: 
require(survival)
# Make self-contained functions for computing survival probabilities
# using a log-normal regression
f <- psm(Surv(d.time, death) ~ rcs(age,4)*sex, dist='gaussian')
g <- Function(f)
surv <- Survival(f)
# Compute 2 and 5-year survival estimates for 50 year old male
surv(c(2,5), g(age=50, sex='male'))

## End(Not run)

Generate Data Frame with Predictor Combinations

Description

If nobs is not specified, allows user to specify predictor settings by e.g. age=50, sex="male", and any omitted predictors are set to reference values (default=median for continuous variables, first level for categorical ones - see datadist). If any predictor has more than one value given, expand.grid is called to generate all possible combinations of values, unless expand=FALSE. If nobs is given, a data frame is first generated which has nobs of adjust-to values duplicated. Then an editor window is opened which allows the user to subset the variable names down to ones which she intends to vary (this streamlines the data.ed step). Then, if any predictors kept are discrete and viewvals=TRUE, a window (using page) is opened defining the possible values of this subset, to facilitate data editing. Then the data.ed function is invoked to allow interactive overriding of predictor settings in the nobs rows. The subset of variables are combined with the other predictors which were not displayed with data.ed, and a final full data frame is returned. gendata is most useful for creating a newdata data frame to pass to predict.

Usage

gendata(fit, ..., nobs, viewvals=FALSE, expand=TRUE, factors)

Arguments

fit

a fit object created with rms in effect

...

predictor settings, if nobs is not given.

nobs

number of observations to create if doing it interactively using X-windows

viewvals

if nobs is given, set viewvals=TRUE to open a window displaying the possible value of categorical predictors

expand

set to FALSE to prevent expand.grid from being called, and to instead just convert to a data frame.

factors

a list containing predictor settings with their names. This is an alternative to specifying the variables separately in .... Unlike the usage of ..., variables getting default ranges in factors should have NA as their value.

Details

if you have a variable in ... that is named n, no, nob, nob, add nobs=FALSE to the invocation to prevent that variable from being misrecognized as nobs

Value

a data frame with all predictors, and an attribute names.subset if nobs is specified. This attribute contains the vector of variable names for predictors which were passed to de and hence were allowed to vary. If neither nobs nor any predictor settings were given, returns a data frame with adjust-to values.

Side Effects

optionally writes to the terminal, opens X-windows, and generates a temporary file using sink.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]

See Also

predict.rms, survest.cph, survest.psm, rmsMisc, expand.grid, de, page, print.datadist, Predict

Examples

set.seed(1)
age <- rnorm(200, 50, 10)
sex <- factor(sample(c('female','male'),200,TRUE))
race <- factor(sample(c('a','b','c','d'),200,TRUE))
y <- sample(0:1, 200, TRUE)
dd <- datadist(age,sex,race)
options(datadist="dd")
f <- lrm(y ~ age*sex + race)
gendata(f)
gendata(f, age=50)
d <- gendata(f, age=50, sex="female")  # leave race=reference category
d <- gendata(f, age=c(50,60), race=c("b","a"))  # 4 obs.
d$Predicted <- predict(f, d, type="fitted")
d      # Predicted column prints at the far right
options(datadist=NULL)
## Not run: 
d <- gendata(f, nobs=5, view=TRUE)        # 5 interactively defined obs.
d[,attr(d,"names.subset")]             # print variables which varied
predict(f, d)

## End(Not run)

Plot Effects of Variables Estimated by a Regression Model Fit Using ggplot2

Description

Uses ggplot2 graphics to plot the effect of one or two predictors on the linear predictor or X beta scale, or on some transformation of that scale. The first argument specifies the result of the Predict function. The predictor is always plotted in its original coding.

If rdata is given, a spike histogram is drawn showing the location/density of data values for the xx-axis variable. If there is a groups (superposition) variable that generated separate curves, the data density specific to each class of points is shown. This assumes that the second variable was a factor variable. The histograms are drawn by histSpikeg.

To plot effects instead of estimates (e.g., treatment differences as a function of interacting factors) see contrast.rms and summary.rms.

Usage

## S3 method for class 'Predict'
ggplot(data, mapping, formula=NULL, groups=NULL,
     aestype=c('color', 'linetype'),
     conf=c('fill', 'lines'),
     conflinetype=1,
     varypred=FALSE, sepdiscrete=c('no', 'list', 'vertical', 'horizontal'),
     subset, xlim., ylim., xlab, ylab, 
     colorscale=function(...) scale_color_manual(...,
       values=c("#000000", "#E69F00", "#56B4E9",
                "#009E73","#F0E442", "#0072B2", "#D55E00", "#CC79A7")),
     colfill='black',
     rdata=NULL, anova=NULL, pval=FALSE, size.anova=4,
     adj.subtitle, size.adj=2.5, perim=NULL, nlevels=3,
     flipxdiscrete=TRUE,
     legend.position='right', legend.label=NULL,
     vnames=c('labels','names'), abbrev=FALSE, minlength=6,
     layout=NULL, addlayer,
     histSpike.opts=list(frac=function(f) 0.01 + 
         0.02 * sqrt(f - 1)/sqrt(max(f, 2) - 1), side=1, nint=100),
     type=NULL, ggexpr=FALSE, height=NULL, width=NULL, ..., environment)

Arguments

data

a data frame created by Predict

mapping

kept because of ggplot generic setup. If specified it will be assumed to be formula.

formula

a ggplot faceting formula of the form vertical variables ~ horizontal variables, with variables separated by * if there is more than one variable on a side. If omitted, the formula will be built using assumptions on the list of variables that varied in the Predict call. When plotting multiple panels (for separate predictors), formula may be specified but by default no formula is constructed.

groups

an optional character string containing the name of one of the variables in data that is to be used as a grouping (superpositioning) variable. Set groups=FALSE to suppress superpositioning. By default, the second varying variable is used for superpositioning groups. You can also specify a length 2 string vector of variable names specifying two dimensions of superpositioning, identified by different aesthetics corresponding to the aestype argument. When plotting effects of more than one predictor, groups is a character string that specifies a single variable name in data that can be used to form panels. Only applies if using rbind to combine several Predict results. If there is more than one groups variable, confidence bands are suppressed because ggplot2:geom_ribbon does not handle the aesthetics correctly.

aestype

a string vector of aesthetic names corresponding to variables in the groups vector. Default is to use, in order, color, and linetype. Other permissible values are size, shape.

conf

specify conf="line" to show confidence bands with lines instead of filled ribbons, the default

conflinetype

specify an alternative linetype for confidence intervals if conf="line"

varypred

set to TRUE if data is the result of passing multiple Predict results, that represent different predictors, to rbind.Predict. This will cause the .set. variable created by rbind to be copied to the .predictor. variable.

sepdiscrete

set to something other than "no" to create separate graphics for continuous and discrete predictors. For discrete predictors, horizontal dot charts are produced. This allows use of the ggplot2 facet_wrap function to make better use of space. If sepdiscrete="list", a list of two grid graphics objects is returned if both types of predictors are present (otherwise one object for the type that existed in the model). Set sepdiscrete="vertical" to put the two types of plots into one graphical object with continuous predictors on top and given a fraction of space relative to the number of continuous vs. number of discrete variables. Set sepdiscrete="horizontal" to get a horizontal arrangements with continuous variables on the left.

subset

a subsetting expression for restricting the rows of data that are used in plotting. For example, predictions may have been requested for males and females but one wants to plot only females.

xlim.

This parameter is seldom used, as limits are usually controlled with Predict. Usually given as its legal abbreviation xlim. One reason to use xlim is to plot a factor variable on the x-axis that was created with the cut2 function with the levels.mean option, with val.lev=TRUE specified to plot.Predict. In this case you may want the axis to have the range of the original variable values given to cut2 rather than the range of the means within quantile groups.

ylim.

Range for plotting on response variable axis. Computed by default. Usually specified using its legal definition ylim.

xlab

Label for x-axis. Default is one given to asis, rcs, etc., which may have been the "label" attribute of the variable.

ylab

Label for y-axis. If fun is not given, default is "log Odds" for lrm, "log Relative Hazard" for cph, name of the response variable for ols, TRUE or log(TRUE) for psm, or "X * Beta" otherwise. Specify ylab=NULL to omit y-axis labels.

colorscale

a ggplot2 discrete scale function, e.g. function(...) scale_color_brewer(..., palette='Set1', type='qual'). The default is the colorblind-friendly palette including black in http://www.cookbook-r.com/Graphs/Colors_(ggplot2). If you get an error "insufficient values in manual scale", which occurs when there are more than 8 groups, just specify colorscale=function(...){} to use default colors.

colfill

a single character string or number specifying the fill color to use for geom_ribbon for shaded confidence bands. Alpha transparency of 0.2 is applied to any color specified.

rdata

a data frame containing the original raw data on which the regression model were based, or at least containing the xx-axis and grouping variable. If rdata is present and contains the needed variables, the original data are added to the graph in the form of a spike histogram using histSpikeg in the Hmisc package.

anova

an object returned by anova.rms. If anova is specified, the overall test of association for predictor plotted is added as text to each panel, located at the spot at which the panel is most empty unless there is significant empty space at the top or bottom of the panel; these areas are given preference.

pval

specify pval=TRUE for anova to include not only the test statistic but also the P-value

size.anova

character size for the test statistic printed on the panel, mm

adj.subtitle

Set to FALSE to suppress subtitling the graph with the list of settings of non-graphed adjustment values. Subtitles appear as captions with ggplot2 using labs(caption=).

size.adj

Size of adjustment settings in subtitles in mm. Default is 2.5.

perim

perim specifies a function having two arguments. The first is the vector of values of the first variable that is about to be plotted on the x-axis. The second argument is the single value of the variable representing different curves, for the current curve being plotted. The function's returned value must be a logical vector whose length is the same as that of the first argument, with values TRUE if the corresponding point should be plotted for the current curve, FALSE otherwise. See one of the latter examples. perim only applies if predictors were specified to Predict.

nlevels

when groups and formula are not specified, if any panel variable has nlevels or fewer values, that variable is converted to a groups (superpositioning) variable. Set nlevels=0 to prevent this behavior. For other situations, a non-numeric x-axis variable with nlevels or fewer unique values will cause a horizontal dot plot to be drawn instead of an x-y plot unless flipxdiscrete=FALSE.

flipxdiscrete

see nlevels

legend.position

"right" (the default for single-panel plots), "left", "bottom", "top", a two-element numeric vector, or "none" to suppress. For multi-panel plots the default is "top", and a legend only appears for the first (top left) panel.

legend.label

if omitted, group variable labels will be used for label the legend. Specify legend.label=FALSE to suppress using a legend name, or a character string or expression to specify the label. Can be a vector is there is more than one grouping variable.

vnames

applies to the case where multiple plots are produced separately by predictor. Set to 'names' to use variable names instead of labels for these small plots.

abbrev

set to true to abbreviate levels of predictors that are categorical to a minimum length of minlength

minlength

see abbrev

layout

for multi-panel plots a 2-vector specifying the number of rows and number of columns. If omitted will be computed from the number of panels to make as square as possible.

addlayer

a ggplot2 expression consisting of one or more layers to add to the current plot

histSpike.opts

a list containing named elements that specifies parameters to histSpikeg when rdata is given. The col parameter is usually derived from other plotting information and not specified by the user.

type

a value ("l","p","b") to override default choices related to showing or connecting points. Especially useful for discrete x coordinate variables.

ggexpr

set to TRUE to have the function return the character string(s) constructed to invoke ggplot without executing the commands

height, width

used if plotly is in effect, to specify the plotly image in pixels. Default is to let plotly size the image.

...

ignored

environment

ignored; used to satisfy rules because of the generic ggplot

Value

an object of class "ggplot2" ready for printing. For the case where predictors were not specified to Predict, sepdiscrete=TRUE, and there were both continuous and discrete predictors in the model, a list of two graphics objects is returned.

Note

If plotting the effects of all predictors you can reorder the panels using for example p <- Predict(fit); p$.predictor. <- factor(p$.predictor., v) where v is a vector of predictor names specified in the desired order.

Author(s)

Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]

References

Fox J, Hong J (2009): Effect displays in R for multinomial and proportional-odds logit models: Extensions to the effects package. J Stat Software 32 No. 1.

See Also

Predict, rbind.Predict, datadist, predictrms, anova.rms, contrast.rms, summary.rms, rms, rmsMisc, plot.Predict, labcurve, histSpikeg, ggplot, Overview

Examples

require(ggplot2)
n <- 350     # define sample size
set.seed(17) # so can reproduce the results
age            <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
cholesterol    <- rnorm(n, 200, 25)
sex            <- factor(sample(c('female','male'), n,TRUE))
label(age)            <- 'Age'      # label is in Hmisc
label(cholesterol)    <- 'Total Cholesterol'
label(blood.pressure) <- 'Systolic Blood Pressure'
label(sex)            <- 'Sex'
units(cholesterol)    <- 'mg/dl'   # uses units.default in Hmisc
units(blood.pressure) <- 'mmHg'

# Specify population model for log odds that Y=1
L <- .4*(sex=='male') + .045*(age-50) +
  (log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male')) +
  .01 * (blood.pressure - 120)
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)

ddist <- datadist(age, blood.pressure, cholesterol, sex)
options(datadist='ddist')

fit <- lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)),
               x=TRUE, y=TRUE)
an <- anova(fit)
# Plot effects in two vertical sub-panels with continuous predictors on top
# ggplot(Predict(fit), sepdiscrete='vertical')
# Plot effects of all 4 predictors with test statistics from anova, and P
ggplot(Predict(fit), anova=an, pval=TRUE)
# ggplot(Predict(fit), rdata=llist(blood.pressure, age))
# spike histogram plot for two of the predictors

# p <- Predict(fit, name=c('age','cholesterol'))   # Make 2 plots
# ggplot(p)

# p <- Predict(fit, age=seq(20,80,length=100), sex, conf.int=FALSE)
#                        # Plot relationship between age and log
                         # odds, separate curve for each sex,
# ggplot(p, subset=sex=='female' | age > 30)
# No confidence interval, suppress estimates for males <= 30

# p <- Predict(fit, age, sex)
# ggplot(p, rdata=llist(age,sex))
                         # rdata= allows rug plots (1-dimensional scatterplots)
                         # on each sex's curve, with sex-
                         # specific density of age
                         # If data were in data frame could have used that
# p <- Predict(fit, age=seq(20,80,length=100), sex='male', fun=plogis)
                         # works if datadist not used
# ggplot(p, ylab=expression(hat(P)))
                         # plot predicted probability in place of log odds
# per <- function(x, y) x >= 30
# ggplot(p, perim=per)       # suppress output for age < 30 but leave scale alone

# Do ggplot2 faceting a few different ways
p <- Predict(fit, age, sex, blood.pressure=c(120,140,160),
             cholesterol=c(180,200,215))
# ggplot(p)
ggplot(p, cholesterol ~ blood.pressure)
# ggplot(p, ~ cholesterol + blood.pressure)
# color for sex, line type for blood.pressure:
ggplot(p, groups=c('sex', 'blood.pressure'))
# Add legend.position='top' to allow wider plot
# Map blood.pressure to line thickness instead of line type:
# ggplot(p, groups=c('sex', 'blood.pressure'), aestype=c('color', 'size'))

# Plot the age effect as an odds ratio
# comparing the age shown on the x-axis to age=30 years

# ddist$limits$age[2] <- 30    # make 30 the reference value for age
# Could also do: ddist$limits["Adjust to","age"] <- 30
# fit <- update(fit)   # make new reference value take effect
# p <- Predict(fit, age, ref.zero=TRUE, fun=exp)
# ggplot(p, ylab='Age=x:Age=30 Odds Ratio',
#        addlayer=geom_hline(yintercept=1, col=gray(.8)) +
#                 geom_vline(xintercept=30, col=gray(.8)) +
#                 scale_y_continuous(trans='log',
#                       breaks=c(.5, 1, 2, 4, 8))))

# Compute predictions for three predictors, with superpositioning or
# conditioning on sex, combined into one graph

p1 <- Predict(fit, age, sex)
p2 <- Predict(fit, cholesterol, sex)
p3 <- Predict(fit, blood.pressure, sex)
p <- rbind(age=p1, cholesterol=p2, blood.pressure=p3)
ggplot(p, groups='sex', varypred=TRUE, adj.subtitle=FALSE)
# ggplot(p, groups='sex', varypred=TRUE, adj.subtitle=FALSE, sepdiscrete='vert')

## Not run: 
# For males at the median blood pressure and cholesterol, plot 3 types
# of confidence intervals for the probability on one plot, for varying age
ages <- seq(20, 80, length=100)
p1 <- Predict(fit, age=ages, sex='male', fun=plogis)  # standard pointwise
p2 <- Predict(fit, age=ages, sex='male', fun=plogis,
              conf.type='simultaneous')               # simultaneous
p3 <- Predict(fit, age=c(60,65,70), sex='male', fun=plogis,
              conf.type='simultaneous')               # simultaneous 3 pts
# The previous only adjusts for a multiplicity of 3 points instead of 100
f <- update(fit, x=TRUE, y=TRUE)
g <- bootcov(f, B=500, coef.reps=TRUE)
p4 <- Predict(g, age=ages, sex='male', fun=plogis)    # bootstrap percentile
p <- rbind(Pointwise=p1, 'Simultaneous 100 ages'=p2,
           'Simultaneous     3 ages'=p3, 'Bootstrap nonparametric'=p4)
# as.data.frame so will call built-in ggplot
ggplot(as.data.frame(p), aes(x=age, y=yhat)) + geom_line() +
 geom_ribbon(data=p, aes(ymin=lower, ymax=upper), alpha=0.2, linetype=0)+
 facet_wrap(~ .set., ncol=2)

# Plots for a parametric survival model
n <- 1000
set.seed(731)
age <- 50 + 12*rnorm(n)
label(age) <- "Age"
sex <- factor(sample(c('Male','Female'), n, 
              rep=TRUE, prob=c(.6, .4)))
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50)+.8*(sex=='Female'))
t <- -log(runif(n))/h
label(t) <- 'Follow-up Time'
e <- ifelse(t<=cens,1,0)
t <- pmin(t, cens)
units(t) <- "Year"
ddist <- datadist(age, sex)
require(survival)
Srv <- Surv(t,e)

# Fit log-normal survival model and plot median survival time vs. age
f <- psm(Srv ~ rcs(age), dist='lognormal')
med <- Quantile(f)       # Creates function to compute quantiles
                         # (median by default)
p <- Predict(f, age, fun=function(x) med(lp=x))
ggplot(p, ylab="Median Survival Time")
# Note: confidence intervals from this method are approximate since
# they don't take into account estimation of scale parameter


# Fit an ols model to log(y) and plot the relationship between x1
# and the predicted mean(y) on the original scale without assuming
# normality of residuals; use the smearing estimator
# See help file for rbind.Predict for a method of showing two
# types of confidence intervals simultaneously.
# Add raw data scatterplot to graph
set.seed(1)
x1 <- runif(300)
x2 <- runif(300)
ddist <- datadist(x1, x2); options(datadist='ddist')
y  <- exp(x1 + x2 - 1 + rnorm(300))
f <- ols(log(y) ~ pol(x1,2) + x2)
r <- resid(f)
smean <- function(yhat)smearingEst(yhat, exp, res, statistic='mean')
formals(smean) <- list(yhat=numeric(0), res=r[! is.na(r)])
#smean$res <- r[! is.na(r)]   # define default res argument to function
ggplot(Predict(f, x1, fun=smean), ylab='Predicted Mean on y-scale', 
   addlayer=geom_point(aes(x=x1, y=y), data.frame(x1, y)))
# Had ggplot not added a subtitle (i.e., if x2 were not present), you
# could have done ggplot(Predict(), ylab=...) + geom_point(...) 

## End(Not run)

# Make an 'interaction plot', forcing the x-axis variable to be
# plotted at integer values but labeled with category levels
n <- 100
set.seed(1)
gender <- c(rep('male', n), rep('female',n))
m <- sample(c('a','b'), 2*n, TRUE)
d <-  datadist(gender, m); options(datadist='d')
anxiety <- runif(2*n) + .2*(gender=='female') + .4*(gender=='female' & m=='b')
tapply(anxiety, llist(gender,m), mean)
f <- ols(anxiety ~ gender*m)
p <- Predict(f, gender, m)
# ggplot(p)     # horizontal dot chart; usually preferred for categorical predictors
# ggplot(p, flipxdiscrete=FALSE)  # back to vertical
ggplot(p, groups='gender')
ggplot(p, ~ m, groups=FALSE, flipxdiscrete=FALSE)

options(datadist=NULL)

## Not run: 
# Example in which separate curves are shown for 4 income values
# For each curve the estimated percentage of voters voting for
# the democratic party is plotted against the percent of voters
# who graduated from college.  Data are county-level percents.

incomes <- seq(22900, 32800, length=4)  
# equally spaced to outer quintiles
p <- Predict(f, college, income=incomes, conf.int=FALSE)
ggplot(p, xlim=c(0,35), ylim=c(30,55))

# Erase end portions of each curve where there are fewer than 10 counties having
# percent of college graduates to the left of the x-coordinate being plotted,
# for the subset of counties having median family income with 1650
# of the target income for the curve

show.pts <- function(college.pts, income.pt) {
  s <- abs(income - income.pt) < 1650  #assumes income known to top frame
  x <- college[s]
  x <- sort(x[!is.na(x)])
  n <- length(x)
  low <- x[10]; high <- x[n-9]
  college.pts >= low & college.pts <= high
}

ggplot(p, xlim=c(0,35), ylim=c(30,55), perim=show.pts)

# Rename variables for better plotting of a long list of predictors
f <- ...
p <- Predict(f)
re <- c(trt='treatment', diabet='diabetes', sbp='systolic blood pressure')

for(n in names(re)) {
  names(p)[names(p)==n] <- re[n]
  p$.predictor.[p$.predictor.==n] <- re[n]
  }
ggplot(p)

## End(Not run)

Calculate Total and Partial g-indexes for an rms Fit

Description

gIndex computes the total gg-index for a model based on the vector of linear predictors, and the partial gg-index for each predictor in a model. The latter is computed by summing all the terms involving each variable, weighted by their regression coefficients, then computing Gini's mean difference on this sum. For example, a regression model having age and sex and age*sex on the right hand side, with corresponding regression coefficients b1,b2,b3b_{1}, b_{2}, b_{3} will have the gg-index for age computed from Gini's mean difference on the product of age ×(b1+b3w)\times (b_{1} + b_{3}w) where ww is an indicator set to one for observations with sex not equal to the reference value. When there are nonlinear terms associated with a predictor, these terms will also be combined.

A print method is defined, and there is a plot method for displaying gg-indexes using a dot chart.

These functions use Hmisc::GiniMd.

Usage

gIndex(object, partials=TRUE, type=c('ccterms', 'cterms', 'terms'),
           lplabel=if(length(object$scale) && is.character(object$scale))
           object$scale[1] else 'X*Beta',
           fun, funlabel=if(missing(fun)) character(0) else
           deparse(substitute(fun)),
           postfun=if(length(object$scale)==2) exp else NULL,
           postlabel=if(length(postfun))
           ifelse(missing(postfun),
                  if((length(object$scale) > 1) &&
                     is.character(object$scale)) object$scale[2] else
                     'Anti-log',
                     deparse(substitute(postfun))) else character(0),
           ...)

## S3 method for class 'gIndex'
print(x, digits=4, abbrev=FALSE,
 vnames=c("names","labels"), ...)

## S3 method for class 'gIndex'
plot(x, what=c('pre', 'post'),
 xlab=NULL, pch=16, rm.totals=FALSE,
sort=c('descending', 'ascending', 'none'), ...)

Arguments

object

result of an rms fitting function

partials

set to FALSE to suppress computation of partial ggs

type

defaults to 'ccterms' which causes partial discrimination indexes to be computed after maximally combining all related main effects and interactions. The is usually the only way that makes sense when considering partial linear predictors. Specify type='cterms' to only combine a main effect with interactions containing it, not also with other main effects connected through interactions. Use type='terms' to separate interactions into their own effects.

lplabel

a replacement for default values such as "X*Beta" or "log odds"/

fun

an optional function to transform the linear predictors before computing the total (only) gg. When this is present, a new component gtrans is added to the attributes of the object resulting from gIndex.

funlabel

a character string label for fun, otherwise taken from the function name itself

postfun

a function to transform gg such as exp (anti-log), which is the default for certain models such as the logistic and Cox models

postlabel

a label for postfun

...

For gIndex, passed to predict.rms. Ignored for print. Passed to dotchart2 for plot.

x

an object created by gIndex (for print or plot)

digits

causes rounding to the digits decimal place

abbrev

set to TRUE to abbreviate labels if vname="labels"

vnames

set to "labels" to print predictor labels instead of names

what

set to "post" to plot the transformed gg-index if there is one (e.g., ratio scale)

xlab

xx-axis label; constructed by default

pch

plotting character for point

rm.totals

set to TRUE to remove the total gg-index when plotting

sort

specifies how to sort predictors by gg-index; default is in descending order going down the dot chart

Details

For stratification factors in a Cox proportional hazards model, there is no contribution of variation towards computing a partial gg except from terms that interact with the stratification variable.

Value

gIndex returns a matrix of class "gIndex" with auxiliary information stored as attributes, such as variable labels. GiniMd returns a scalar.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]

References

David HA (1968): Gini's mean difference rediscovered. Biometrika 55:573–575.

See Also

predict.rms,GiniMd

Examples

set.seed(1)
n <- 40
x <- 1:n
w <- factor(sample(c('a','b'), n, TRUE))
u <- factor(sample(c('A','B'), n, TRUE))
y <- .01*x + .2*(w=='b') + .3*(u=='B') + .2*(w=='b' & u=='B') + rnorm(n)/5
dd <- datadist(x,w,u); options(datadist='dd')
f <- ols(y ~ x*w*u, x=TRUE, y=TRUE)
f
anova(f)
z <- list()
for(type in c('terms','cterms','ccterms'))
  {
    zc <- predict(f, type=type)
    cat('type:', type, '\n')
    print(zc)
    z[[type]] <- zc
  }

zc <- z$cterms
GiniMd(zc[, 1])
GiniMd(zc[, 2])
GiniMd(zc[, 3])
GiniMd(f$linear.predictors)
g <- gIndex(f)
g
g['Total',]
gIndex(f, partials=FALSE)
gIndex(f, type='cterms')
gIndex(f, type='terms')

y <- y > .8
f <- lrm(y ~ x * w * u, x=TRUE, y=TRUE)
gIndex(f, fun=plogis, funlabel='Prob[y=1]')

# Manual calculation of combined main effect + interaction effort of
# sex in a 2x2 design with treatments A B, sexes F M,
# model -.1 + .3*(treat=='B') + .5*(sex=='M') + .4*(treat=='B' & sex=='M')

set.seed(1)
X <- expand.grid(treat=c('A','B'), sex=c('F', 'M'))
a <- 3; b <- 7; c <- 13; d <- 5
X <- rbind(X[rep(1, a),], X[rep(2, b),], X[rep(3, c),], X[rep(4, d),])
y <- with(X, -.1 + .3*(treat=='B') + .5*(sex=='M') + .4*(treat=='B' & sex=='M')) 
f <- ols(y ~ treat*sex, data=X, x=TRUE)
gIndex(f, type='cterms')
k <- coef(f)
b1 <- k[2]; b2 <- k[3]; b3 <- k[4]
n <- nrow(X)
( (a+b)*c*abs(b2) + (a+b)*d*abs(b2+b3) + c*d*abs(b3))/(n*(n-1)/2 )

# Manual calculation for combined age effect in a model with sex,
# age, and age*sex interaction

a <- 13; b <- 7
sex <- c(rep('female',a), rep('male',b))
agef <- round(runif(a, 20, 30))
agem <- round(runif(b, 20, 40))
age  <- c(agef, agem)
y <- (sex=='male') + age/10 - (sex=='male')*age/20
f <- ols(y ~ sex*age, x=TRUE)
f
gIndex(f, type='cterms')
k <- coef(f)
b1 <- k[2]; b2 <- k[3]; b3 <- k[4]
n <- a + b
sp <- function(w, z=w) sum(outer(w, z, function(u, v) abs(u-v)))

( abs(b2)*sp(agef) + abs(b2+b3)*sp(agem) + 2*sp(b2*agef, (b2+b3)*agem) ) / (n*(n-1))

( abs(b2)*GiniMd(agef)*a*(a-1) + abs(b2+b3)*GiniMd(agem)*b*(b-1) +
  2*sp(b2*agef, (b2+b3)*agem) ) / (n*(n-1))

## Not run: 
# Compare partial and total g-indexes over many random fits
plot(NA, NA, xlim=c(0,3), ylim=c(0,3), xlab='Global',
     ylab='x1 (black)  x2 (red)  x3 (green)  x4 (blue)')
abline(a=0, b=1, col=gray(.9))
big <- integer(3)
n <- 50   # try with n=7 - see lots of exceptions esp. for interacting var
for(i in 1:100) {
   x1 <- runif(n)
   x2 <- runif(n)
   x3 <- runif(n)
   x4 <- runif(n)
   y  <- x1 + x2 + x3 + x4 + 2*runif(n)
   f <- ols(y ~ x1*x2+x3+x4, x=TRUE)
   # f <- ols(y ~ x1+x2+x3+x4, x=TRUE)   # also try this
   w <- gIndex(f)[,1]
   gt <- w['Total']
   points(gt, w['x1, x2'])
   points(gt, w['x3'], col='green')
   points(gt, w['x4'], col='blue')
   big[1] <- big[1] + (w['x1, x2'] > gt)
   big[2] <- big[2] + (w['x3'] > gt)
   big[3] <- big[3] + (w['x4'] > gt)
   }
print(big)

## End(Not run)

options(datadist=NULL)

rms Version of glm

Description

This function saves rms attributes with the fit object so that anova.rms, Predict, etc. can be used just as with ols and other fits. No validate or calibrate methods exist for Glm though.

Usage

Glm(
  formula,
  family = gaussian,
  data = environment(formula),
  weights,
  subset,
  na.action = na.delete,
  start = NULL,
  offset = NULL,
  control = glm.control(...),
  model = TRUE,
  method = "glm.fit",
  x = FALSE,
  y = TRUE,
  contrasts = NULL,
  ...
)

Arguments

formula, family, data, weights, subset, na.action, start, offset, control, model, method, x, y, contrasts

see stats::glm(); for print x is the result of Glm

...

ignored

Details

For the print method, format of output is controlled by the user previously running options(prType="lang") where lang is "plain" (the default), "latex", or "html".

Value

a fit object like that produced by stats::glm() but with rms attributes and a class of "rms", "Glm", "glm", and "lm". The g element of the fit object is the gg-index.

See Also

stats::glm(),Hmisc::GiniMd(), prModFit(), stats::residuals.glm

Examples

## Dobson (1990) Page 93: Randomized Controlled Trial :
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
f <- glm(counts ~ outcome + treatment, family=poisson())
f
anova(f)
summary(f)
f <- Glm(counts ~ outcome + treatment, family=poisson())
# could have had rcs( ) etc. if there were continuous predictors
f
anova(f)
summary(f, outcome=c('1','2','3'), treatment=c('1','2','3'))

Fit Linear Model Using Generalized Least Squares

Description

This function fits a linear model using generalized least squares. The errors are allowed to be correlated and/or have unequal variances. Gls is a slightly enhanced version of the Pinheiro and Bates gls function in the nlme package to make it easy to use with the rms package and to implement cluster bootstrapping (primarily for nonparametric estimates of the variance-covariance matrix of the parameter estimates and for nonparametric confidence limits of correlation parameters).

For the print method, format of output is controlled by the user previously running options(prType="lang") where lang is "plain" (the default), "latex", or "html". When using html with Quarto or RMarkdown, results='asis' need not be written in the chunk header.

Usage

Gls(model, data, correlation, weights, subset, method, na.action=na.omit,
    control, verbose, B=0, dupCluster=FALSE, pr=FALSE, x=FALSE)

## S3 method for class 'Gls'
print(x, digits=4, coefs=TRUE, title, ...)

Arguments

model

a two-sided linear formula object describing the model, with the response on the left of a ~ operator and the terms, separated by + operators, on the right.

data

an optional data frame containing the variables named in model, correlation, weights, and subset. By default the variables are taken from the environment from which gls is called.

correlation

an optional corStruct object describing the within-group correlation structure. See the documentation of corClasses for a description of the available corStruct classes. If a grouping variable is to be used, it must be specified in the form argument to the corStruct constructor. Defaults to NULL, corresponding to uncorrelated errors.

weights

an optional varFunc object or one-sided formula describing the within-group heteroscedasticity structure. If given as a formula, it is used as the argument to varFixed, corresponding to fixed variance weights. See the documentation on varClasses for a description of the available varFunc classes. Defaults to NULL, corresponding to homoscesdatic errors.

subset

an optional expression indicating which subset of the rows of data should be used in the fit. This can be a logical vector, or a numeric vector indicating which observation numbers are to be included, or a character vector of the row names to be included. All observations are included by default.

method

a character string. If "REML" the model is fit by maximizing the restricted log-likelihood. If "ML" the log-likelihood is maximized. Defaults to "REML".

na.action

a function that indicates what should happen when the data contain NAs. The default action (na.omit) results in deletion of observations having any of the variables of interest missing.

control

a list of control values for the estimation algorithm to replace the default values returned by the function glsControl. Defaults to an empty list.

verbose

an optional logical value. If TRUE information on the evolution of the iterative algorithm is printed. Default is FALSE.

B

number of bootstrap resamples to fit and store, default is none

dupCluster

set to TRUE to have Gls when bootstrapping to consider multiply-sampled clusters as if they were one large cluster when fitting using the gls algorithm

pr

set to TRUE to show progress of bootstrap resampling

x

for Gls set to TRUE to store the design matrix in the fit object; otherwise the result of Gls

digits

number of significant digits to print

coefs

specify coefs=FALSE to suppress printing the table of model coefficients, standard errors, etc. Specify coefs=n to print only the first n regression coefficients in the model.

title

a character string title to be passed to prModFit

...

ignored

Details

The na.delete function will not work with Gls due to some nuance in the model.frame.default function. This probably relates to na.delete storing extra information in the "na.action" attribute of the returned data frame.

Value

an object of classes Gls, rms, and gls representing the linear model fit. Generic functions such as print, plot, ggplot, and summary have methods to show the results of the fit. See glsObject for the components of the fit. The functions resid, coef, and fitted can be used to extract some of its components. Gls returns the following components not returned by gls: Design, assign, formula (see arguments), B (see arguments), bootCoef (matrix of B bootstrapped coefficients), boot.Corr (vector of bootstrapped correlation parameters), Nboot (vector of total sample size used in each bootstrap (may vary if have unbalanced clusters), and var (sample variance-covariance matrix of bootstrapped coefficients). The gg-index is also stored in the returned object under the name "g".

Author(s)

Jose Pinheiro, Douglas Bates [email protected], Saikat DebRoy, Deepayan Sarkar, R-core [email protected], Frank Harrell [email protected], Patrick Aboyoun

References

Pinheiro J, Bates D (2000): Mixed effects models in S and S-Plus. New York: Springer-Verlag.

See Also

gls glsControl, glsObject, varFunc, corClasses, varClasses, GiniMd, prModFit, logLik.Gls

Examples

## Not run: 
require(ggplot2)
ns  <- 20  # no. subjects
nt  <- 10  # no. time points/subject
B   <- 10  # no. bootstrap resamples
           # usually do 100 for variances, 1000 for nonparametric CLs
rho <- .5  # AR(1) correlation parameter
V <- matrix(0, nrow=nt, ncol=nt)
V <- rho^abs(row(V)-col(V))   # per-subject correlation/covariance matrix

d <- expand.grid(tim=1:nt, id=1:ns)
d$trt <- factor(ifelse(d$id <= ns/2, 'a', 'b'))
true.beta <- c(Intercept=0,tim=.1,'tim^2'=0,'trt=b'=1)
d$ey  <- true.beta['Intercept'] + true.beta['tim']*d$tim +
  true.beta['tim^2']*(d$tim^2) +  true.beta['trt=b']*(d$trt=='b')
set.seed(13)
library(MASS)   # needed for mvrnorm
d$y <- d$ey + as.vector(t(mvrnorm(n=ns, mu=rep(0,nt), Sigma=V)))

dd <- datadist(d); options(datadist='dd')
f <- Gls(y ~ pol(tim,2) + trt, correlation=corCAR1(form= ~tim | id),
         data=d, B=B)
f
AIC(f)
f$var      # bootstrap variances
f$varBeta  # original variances
summary(f)
anova(f)
ggplot(Predict(f, tim, trt))
# v <- Variogram(f, form=~tim|id, data=d)
nlme:::summary.gls(f)$tTable   # print matrix of estimates etc.

options(datadist=NULL)

## End(Not run)

Kaplan-Meier Estimates vs. a Continuous Variable

Description

Function to divide x (e.g. age, or predicted survival at time u created by survest) into g quantile groups, get Kaplan-Meier estimates at time u (a scaler), and to return a matrix with columns x=mean x in quantile, n=number of subjects, events=no. events, and KM=K-M survival at time u, std.err = s.e. of -log K-M. Confidence intervals are based on -log S(t). Instead of supplying g, the user can supply the minimum number of subjects to have in the quantile group (m, default=50). If cuts is given (e.g. cuts=c(0,.1,.2,...,.9,.1)), it overrides m and g. Calls Therneau's survfitKM in the survival package to get Kaplan-Meiers estimates and standard errors.

Usage

groupkm(x, Srv, m=50, g, cuts, u, 
        pl=FALSE, loglog=FALSE, conf.int=.95, xlab, ylab,
        lty=1, add=FALSE, cex.subtitle=.7, ...)

Arguments

x

variable to stratify

Srv

a Surv object - n x 2 matrix containing survival time and event/censoring 1/0 indicator. Units of measurement come from the "units" attribute of the survival time variable. "Day" is the default.

m

desired minimum number of observations in a group

g

number of quantile groups

cuts

actual cuts in x, e.g. c(0,1,2) to use [0,1), [1,2].

u

time for which to estimate survival

pl

TRUE to plot results

loglog

set to TRUE to plot log(-log(survival)) instead of survival

conf.int

defaults to .95 for 0.95 confidence bars. Set to FALSE to suppress bars.

xlab

if pl=TRUE, is x-axis label. Default is label(x) or name of calling argument

ylab

if pl=TRUE, is y-axis label. Default is constructed from u and time units attribute.

lty

line time for primary line connecting estimates

add

set to TRUE if adding to an existing plot

cex.subtitle

character size for subtitle. Default is .7. Use FALSE to suppress subtitle.

...

plotting parameters to pass to the plot and errbar functions

Value

matrix with columns named x (mean predictor value in interval), n (sample size in interval), events (number of events in interval), KM (Kaplan-Meier estimate), std.err (standard error of -log KM)

See Also

survfit, errbar, cut2, Surv, units

Examples

require(survival)
n <- 1000
set.seed(731)
age <- 50 + 12*rnorm(n)
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50))
d.time <- -log(runif(n))/h
label(d.time) <- 'Follow-up Time'
e <- ifelse(d.time <= cens,1,0)
d.time <- pmin(d.time, cens)
units(d.time) <- "Year"
groupkm(age, Surv(d.time, e), g=10, u=5, pl=TRUE)
#Plot 5-year K-M survival estimates and 0.95 confidence bars by 
#decile of age.  If omit g=10, will have >= 50 obs./group.

Hazard Ratio Plot

Description

The hazard.ratio.plot function repeatedly estimates Cox regression coefficients and confidence limits within time intervals. The log hazard ratios are plotted against the mean failure/censoring time within the interval. Unless times is specified, the number of time intervals will be max(round(d/e),2)\max(round(d/e),2), where dd is the total number of events in the sample. Efron's likelihood is used for estimating Cox regression coefficients (using coxph.fit). In the case of tied failure times, some intervals may have a point in common.

Usage

hazard.ratio.plot(x, Srv, which, times=, e=30, subset,
                  conf.int=.95, legendloc=NULL, smooth=TRUE, pr=FALSE, pl=TRUE,
                  add=FALSE, ylim, cex=.5, xlab="t", ylab, antilog=FALSE, ...)

Arguments

x

a vector or matrix of predictors

Srv

a Surv object

which

a vector of column numbers of x for which to estimate hazard ratios across time and make plots. The default is to do so for all predictors. Whenever one predictor is displayed, all other predictors in the x matrix are adjusted for (with a separate adjustment form for each time interval).

times

optional vector of time interval endpoints. Example: times=c(1,2,3) uses intervals [0,1), [1,2), [2,3), [3+). If times is omitted, uses intervals containing e events

e

number of events per time interval if times not given

subset

vector used for subsetting the entire analysis, e.g. subset=sex=="female"

conf.int

confidence interval coverage

legendloc

location for legend. Omit to use mouse, "none" for none, "ll" for lower left of graph, or actual x and y coordinates (e.g. c(2,3))

smooth

also plot the super–smoothed version of the log hazard ratios

pr

defaults to FALSE to suppress printing of individual Cox fits

pl

defaults to TRUE to plot results

add

add this plot to an already existing plot

ylim

vector of y-axis limits. Default is computed to include confidence bands.

cex

character size for legend information, default is 0.5

xlab

label for x-axis, default is "t"

ylab

label for y-axis, default is "Log Hazard Ratio" or "Hazard Ratio", depending on antilog.

antilog

default is FALSE. Set to TRUE to plot anti-log, i.e., hazard ratio.

...

optional graphical parameters

Author(s)

Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]

See Also

cox.zph, residuals.cph, survival-internal, cph, coxph, Surv

Examples

require(survival)
n <- 500
set.seed(1)
age <- 50 + 12*rnorm(n)
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50))
d.time <- -log(runif(n))/h
label(d.time) <- 'Follow-up Time'
e <- ifelse(d.time <= cens,1,0)
d.time <- pmin(d.time, cens)
units(d.time) <- "Year"
hazard.ratio.plot(age, Surv(d.time,e), e=20, legendloc='ll')

Intervening Event Setup

Description

Creates several new variables which help set up a dataset for modeling with cph or coxph when there is a single binary time-dependent covariable which turns on at a given time, and stays on. This is typical when analyzing the impact of an intervening event. ie.setup creates a Surv object using the start time, stop time format. It also creates a binary indicator for the intervening event, and a variable called subs that is useful when attach-ing a dataframe. subs has observation numbers duplicated for subjects having an intervening event, so those subject's baseline covariables (that are not time-dependent) can be duplicated correctly.

Usage

ie.setup(failure.time, event, ie.time, break.ties=FALSE)

Arguments

failure.time

a numeric variable containing the event or censoring times for the terminating event

event

a binary (0/1) variable specifying whether observations had the terminating event (event=1) or were censored (event=0)

ie.time

intervening event times. For subjects having no intervening events, the corresponding values of ie.time must be NA.

break.ties

Occasionally intervening events are recorded as happening at exactly the same time as the termination of follow-up for some subjects. The Surv and Surv functions will not allow this. To randomly break the ties by subtracting a random number from such tied intervening event times, specify break.ties=TRUE. The random number is uniform between zero and the minimum difference between any two untied failure.times.

Value

a list with components S, ie.status, subs, reps. S is a Surv object containing start and stop times for intervals of observation, along with event indicators. ie.status is one if the intervening event has occurred at the start of the interval, zero otherwise. subs is a vector of subscripts that can be used to replicate other variables the same way S was replicated. reps specifies how many times each original observation was replicated. S, ie.status, subs are all the same length (at least the number of rows for S is) and are longer than the original failure.time vector. reps is the same length as the original failure.time vector. The subs vector is suitable for passing to validate.lrm or calibrate, which pass this vector under the name cluster on to predab.resample so that bootstrapping can be done by sampling with replacement from the original subjects rather than from the individual records created by ie.setup.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]

See Also

cph, coxph, Surv, cr.setup, predab.resample

Examples

failure.time <- c(1 ,   2,   3)
event        <- c(1 ,   1,   0)
ie.time      <- c(NA, 1.5, 2.5)

z <- ie.setup(failure.time, event, ie.time)
S <- z$S
S
ie.status <- z$ie.status
ie.status
z$subs
z$reps
## Not run: 
attach(input.data.frame[z$subs,])   #replicates all variables
f <- cph(S ~ age + sex + ie.status)
# Instead of duplicating rows of data frame, could do this:
attach(input.data.frame)
z <- ie.setup(failure.time, event, ie.time)
s <- z$subs
age <- age[s]
sex <- sex[s]
f <- cph(S ~ age + sex + ie.status)

## End(Not run)

Impact of Proportional Odds Assumpton

Description

Checks the impact of the proportional odds assumption by comparing predicted cell probabilities from a PO model with those from a multinomial or partial proportional odds logistic model that relax assumptions. For a given model formula, fits the model with both lrm and either nnet::multinom or VGAM::vglm or both, and obtains predicted cell probabilities for the PO and relaxed models on the newdata data frame. A print method formats the output.

Usage

impactPO(
  formula,
  relax = if (missing(nonpo)) "multinomial" else "both",
  nonpo,
  newdata,
  data = environment(formula),
  minfreq = 15,
  B = 0,
  ...
)

Arguments

formula

a model formula. To work properly with multinom or vglm the terms should have completely specified knot locations if a spline function is being used.

relax

defaults to "both" if nonpo is given, resulting in fitting two relaxed models. Set relax to "multinomial" or "ppo" to fit only one relaxed model. The multinomial model does not assume PO for any predictor.

nonpo

a formula with no left hand side variable, specifying the variable or variables for which PO is not assumed. Specifying nonpo results in a relaxed fit that is a partial PO model fitted with VGAM::vglm.

newdata

a data frame or data table with one row per covariate setting for which predictions are to be made

data

data frame containing variables to fit; default is the frame in which formula is found

minfreq

minimum sample size to allow for the least frequent category of the dependent variable. If the observed minimum frequency is less than this, the combine.levels() function will be called to combine enough consecutive levels so that this minimum frequency is achieved.

B

number of bootstrap resamples to do to get confidence intervals for differences in predicted probabilities for relaxed methods vs. PO model fits. Default is not to run the bootstrap. When running the bootstrap make sure that all model variables are explicitly in ⁠data=⁠ so that selection of random subsets of data will call along the correct rows for all predictors.

...

other parameters to pass to lrm and multinom

Details

Since partial proportional odds models and especially multinomial logistic models can have many parameters, it is not feasible to use this model comparison approach when the number of levels of the dependent variable Y is large. By default, the function will use combine.levels() to combine consecutive levels if the lowest frequency category of Y has fewer than minfreq observations.

Value

an impactPO object which is a list with elements estimates, stats, mad, newdata, nboot, and boot. estimates is a data frame containing the variables and values in newdata in a tall and thin format with additional variable method ("PO", "Multinomial", "PPO"), y (current level of the dependent variable), and Probability (predicted cell probability for covariate values and value of y in the current row). stats is a data frame containing Deviance the model deviance, d.f. the total number of parameters counting intercepts, AIC, p the number of regression coefficients, ⁠LR chi^2⁠ the likelihood ratio chi-square statistic for testing the predictors, LR - p a chance-corrected LR chi-square, ⁠LR chi^2 test for PO⁠ the likelihood ratio chi-square test statistic for testing the PO assumption (by comparing -2 log likelihood for a relaxed model to that of a fully PO model), d.f. the degrees of freedom for this test, ⁠ Pr(>chi^2)⁠ the P-value for this test, ⁠MCS R2⁠ the Maddala-Cox-Snell R2 using the actual sample size, ⁠MCS R2 adj⁠ (⁠MCS R2⁠ adjusted for estimating p regression coefficients by subtracting p from LR), ⁠McFadden R2⁠, ⁠McFadden R2 adj⁠ (an AIC-like adjustment proposed by McFadden without full justification), ⁠Mean |difference} from PO⁠ the overall mean absolute difference between predicted probabilities over all categories of Y and over all covariate settings. mad contains newdata and separately by rows in newdata the mean absolute difference (over Y categories) between estimated probabilities by the indicated relaxed model and those from the PO model. nboot is the number of successful bootstrap repetitions, and boot is a 4-way array with dimensions represented by the nboot resamples, the number of rows in newdata, the number of outcome levels, and elements for PPO and multinomial. For the modifications of the Maddala-Cox-Snell indexes see Hmisc::R2Measures.

Author(s)

Frank Harrell [email protected]

References

Adjusted R-square note

See Also

nnet::multinom(), VGAM::vglm(), lrm(), Hmisc::propsPO(), Hmisc::R2Measures(), Hmisc::combine.levels()

Examples

## Not run: 
set.seed(1)
age <- rnorm(500, 50, 10)
sex <- sample(c('female', 'male'), 500, TRUE)
y   <- sample(0:4, 500, TRUE)
d   <- expand.grid(age=50, sex=c('female', 'male'))
w   <- impactPO(y ~ age + sex, nonpo = ~ sex, newdata=d)
w
# Note that PO model is a better model than multinomial (lower AIC)
# since multinomial model's improvement in fit is low in comparison
# with number of additional parameters estimated.  Same for PO model
# in comparison with partial PO model.

# Reverse levels of y so stacked bars have higher y located higher
revo <- function(z) {
  z <- as.factor(z)
  factor(z, levels=rev(levels(as.factor(z))))
}

require(ggplot2)
ggplot(w$estimates, aes(x=method, y=Probability, fill=revo(y))) +
  facet_wrap(~ sex) + geom_col() +
  xlab('') + guides(fill=guide_legend(title=''))

# Now vary 2 predictors

d <- expand.grid(sex=c('female', 'male'), age=c(40, 60))
w <- impactPO(y ~ age + sex, nonpo = ~ sex, newdata=d)
w
ggplot(w$estimates, aes(x=method, y=Probability, fill=revo(y))) +
  facet_grid(age ~ sex) + geom_col() +
 xlab('') + guides(fill=guide_legend(title=''))

## End(Not run)

Exported Functions That Were Imported From Other Packages

Description

Surv and ggplot are imported from, respectively, the survival and ggplot2 packages and are exported from rms so that the user does not have to attach these packages to do simple things.

Usage

Surv(time, time2, event,
     type = c("right", "left", "interval", "counting", "interval2", "mstate"),
     origin = 0)

ggplot(data = NULL, mapping = aes(), ..., environment =  parent.frame())

Arguments

time, time2, event, type, origin

see Surv

data, mapping, ..., environment

see ggplot

Value

see documentation in the original packages

See Also

Surv, ggplot

Examples

## Not run: 
f <- psm(Surv(dtime, death) ~ x1 + x2 + sex + race, dist='gau')
ggplot(Predict(f))

## End(Not run)

LaTeX Representation of a Fitted Cox Model

Description

Creates a file containing a LaTeX representation of the fitted model.

Usage

## S3 method for class 'cph'
latex(object, title,
      file='', 
      append=FALSE, surv=TRUE, maxt=FALSE, which=NULL, varnames, columns=65, 
      inline=FALSE, before=if(inline)"" else "& &", after="", dec=3,
      pretrans=TRUE, caption, digits=.Options$digits, size="",
      ...) # for cph fit

## S3 method for class 'lrm'
latex(object, title, file, append, which, varnames,
columns, inline, before, after, pretrans, caption,
digits=.Options$digits, size="", ...) # for lrm fit

## S3 method for class 'ols'
latex(object, title, file, append, which, varnames,
columns, inline, before, after, pretrans, caption,
digits=.Options$digits, size="", ...) # ols fit

## S3 method for class 'orm'
latex(object, title, file, append, which, varnames,
columns, inline, before, after, pretrans, caption,
digits=.Options$digits, size="", intercepts=nrp < 10, ...) # for orm fit

## S3 method for class 'pphsm'
latex(object, title, file, append, which=NULL, varnames,
columns, inline, before, after, pretrans, caption,
digits=.Options$digits, size="", ...) # pphsm fit

## S3 method for class 'psm'
latex(object, title, file, append, which=NULL, varnames,
columns, inline, before, after, pretrans, caption,
digits=.Options$digits, size="", ...) # psm fit

Arguments

object

a fit object created by a rms fitting function.

title

ignored

file, append

see latex.default. Defaults to the console. When using html/markdown, file is ignored.

surv

if surv=TRUE was specified to cph, the underlying survival probabilities from object$surv.summary will be placed in a table unless surv=FALSE.

maxt

if the maximum follow-up time in the data (object$maxtime) exceeds the last entry in object$surv.summary, underlying survival estimates at object$maxtime will be added to the table if maxt=TRUE.

which, varnames, columns, inline, before, dec, pretrans

see latex.default

after

if not an empty string, added to end of markup if inline=TRUE

caption

a character string specifying a title for the equation to be centered and typeset in bold face. Default is no title.

digits

see latexrms

size

a LaTeX size to use, without the slash. Default is the prevailing size

intercepts

for orm fits. Default is to print intercepts if they are fewer than 10 in number. Set to TRUE or FALSE to force.

...

ignored

Value

the name of the created file, with class c("latex","file"). This object works with latex viewing and printing commands in Hmisc. If file='' and options(prType=x is in effect, where x is "html", "markdown" or "md", the result is run through knitr::asis_output so that it will be rendered correctly no matter which options are in effect in the chunk header.

Author(s)

Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]

See Also

latexrms, rcspline.restate, latex

Examples

## Not run: 
require(survival)
units(ftime) <- "Day"
f <- cph(Surv(ftime, death) ~ rcs(age)+sex, surv=TRUE, time.inc=60)
w <- latex(f, file='f.tex')  #Interprets fitted model and makes table of S0(t)
               #for t=0,60,120,180,...
w              #displays image, if viewer installed and file given above
latex(f)   # send LaTeX code to the console for knitr
options(prType='html')
latex(f)       # for use with knitr and R Markdown/Quarto using MathJax

## End(Not run)

LaTeX Representation of a Fitted Model

Description

Creates a file containing a LaTeX representation of the fitted model. For model-specific typesetting there is latex.lrm, latex.cph, latex.psm and latex.ols. latex.cph has some arguments that are specific to cph models. latexrms is the core function which is called internally by latexrms (which is called by latex.cph, latex.ols, etc.). html and R Markdown-compatible markup (using MathJax) are written if options(prType='html').

Usage

latexrms(object,
 file='',
 append=FALSE, which=1:p, varnames, columns=65, prefix=NULL, inline=FALSE,
 before=if(inline)"" else "& &", after="", intercept, pretrans=TRUE,
 digits=.Options$digits, size="")

Arguments

object

a fit object created by a fitting function in the rms series

file

name of .tex file to create, default is to write to console. file is ignored when options(prType='html'.

append

whether or not to append to an existing file

which

a vector of subcripts (corresponding to object$Design$name) specifying a submodel to print. Default is to describe the whole model. which can also be a vector of character strings specifying the factor names to print. Enough of each string is needed to ensure a unique match. Names for interaction effects are of the form "age * sex". For any interaction effect for which you do not request main effects, the main effects will be added to which. When which is given, the model structural statement is not included. In this case, intercepts are not included either.

varnames

variable names to substitute for non-interactions. Order must correspond to object$Design$name and interactions must be omitted. Default is object$Design$name[object$Design$assume.code!=9]. varnames can contain any LaTeX commands such as subscripts and "\\\\frac" (all "\" must be quadrupled.) Any "/" must be preceeded by "\\" (2, not 4 backslashes). Elements of varnames for interactions are ignored; they can be set to any value.

columns

maximum number of columns of printing characters to allow before outputting a LaTeX newline command

prefix

if given, a LaTeX \lefteqn command of the form \lefteqn{prefix =} \\ will be inserted to print a left-hand-side of the equation.

inline

Set to TRUE to create text for insertion in an in-line equation. This text contains only the expansion of X beta, and is not surrounded by "$".

before

a character string to place before each line of output. Use the default for a LaTeX eqnarray environment. For inline=TRUE, the before string, if not an empty string, will be placed once before the entire markup.

after

a character string to place after the output if inline=TRUE

intercept

a special intercept value to include that is not part of the standard model parameters (e.g., centering constant in Cox model). Only allowed in the latexrms rendition.

pretrans

if any spline or polynomial-expanded variables are themselves transformed, a table of pre-transformations will be formed unless pretrans=FALSE.

digits

number of digits of precision to use in formatting coefficients and other numbers

size

a LaTeX font size to use for the output, without the slash. Default is current size.

Value

latexrms returns a character vector if file='', otherwise writes the output to file. For particular model fits, the latex method returns the result of running knitr::asis_output on the LaTeX or HTML code if file='', options(prType) was set but not to 'plain', and if knitr is currently running. This causes correct output to be rendered whether or not results='asis' appeared in the R Markdown or Quarto chunk header.

Author(s)

Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]

See Also

latex, rcspline.restate, rms

Examples

## Not run: 
f <- lrm(death ~ rcs(age)+sex)
w <- latex(f, file='f.tex')
w     # displays, using e.g. xdvi
latex(f)    # send LaTeX code to console, as for knitr
options(prType='html')
latex(f)    # emit html and latex for knitr html and html notebooks

## End(Not run)

Logistic Regression Model

Description

Fit binary and proportional odds ordinal logistic regression models using maximum likelihood estimation or penalized maximum likelihood estimation. See cr.setup for how to fit forward continuation ratio models with lrm.

For the print method, format of output is controlled by the user previously running options(prType="lang") where lang is "plain" (the default), "latex", or "html". When using html with Quarto or RMarkdown, results='asis' need not be written in the chunk header.

Usage

lrm(formula, data=environment(formula),
    subset, na.action=na.delete, method="lrm.fit",
    model=FALSE, x=FALSE, y=FALSE, linear.predictors=TRUE, se.fit=FALSE, 
    penalty=0, penalty.matrix, tol=1e-7, 
    strata.penalty=0, var.penalty=c('simple','sandwich'),
    weights, normwt, scale=FALSE, ...)

## S3 method for class 'lrm'
print(x, digits=4, r2=c(0,2,4), strata.coefs=FALSE,
    coefs=TRUE, pg=FALSE, title='Logistic Regression Model', ...)

Arguments

formula

a formula object. An offset term can be included. The offset causes fitting of a model such as logit(Y=1)=Xβ+Wlogit(Y=1) = X\beta + W, where WW is the offset variable having no estimated coefficient. The response variable can be any data type; lrm converts it in alphabetic or numeric order to an S factor variable and recodes it 0,1,2,... internally.

data

data frame to use. Default is the current frame.

subset

logical expression or vector of subscripts defining a subset of observations to analyze

na.action

function to handle NAs in the data. Default is na.delete, which deletes any observation having response or predictor missing, while preserving the attributes of the predictors and maintaining frequencies of deletions due to each variable in the model. This is usually specified using options(na.action="na.delete").

method

name of fitting function. Only allowable choice at present is lrm.fit.

model

causes the model frame to be returned in the fit object

x

causes the expanded design matrix (with missings excluded) to be returned under the name x. For print, an object created by lrm.

y

causes the response variable (with missings excluded) to be returned under the name y.

linear.predictors

causes the predicted X beta (with missings excluded) to be returned under the name linear.predictors. When the response variable has more than two levels, the first intercept is used.

se.fit

causes the standard errors of the fitted values to be returned under the name se.fit.

penalty

The penalty factor subtracted from the log likelihood is 0.5βPβ0.5 \beta' P \beta, where β\beta is the vector of regression coefficients other than intercept(s), and PP is penalty factors * penalty.matrix and penalty.matrix is defined below. The default is penalty=0 implying that ordinary unpenalized maximum likelihood estimation is used. If penalty is a scalar, it is assumed to be a penalty factor that applies to all non-intercept parameters in the model. Alternatively, specify a list to penalize different types of model terms by differing amounts. The elements in this list are named simple, nonlinear, interaction and nonlinear.interaction. If you omit elements on the right of this series, values are inherited from elements on the left. Examples: penalty=list(simple=5, nonlinear=10) uses a penalty factor of 10 for nonlinear or interaction terms. penalty=list(simple=0, nonlinear=2, nonlinear.interaction=4) does not penalize linear main effects, uses a penalty factor of 2 for nonlinear or interaction effects (that are not both), and 4 for nonlinear interaction effects.

penalty.matrix

specifies the symmetric penalty matrix for non-intercept terms. The default matrix for continuous predictors has the variance of the columns of the design matrix in its diagonal elements so that the penalty to the log likelhood is unitless. For main effects for categorical predictors with cc categories, the rows and columns of the matrix contain a c1×c1c-1 \times c-1 sub-matrix that is used to compute the sum of squares about the mean of the cc parameter values (setting the parameter to zero for the reference cell) as the penalty component for that predictor. This makes the penalty independent of the choice of the reference cell. If you specify penalty.matrix, you may set the rows and columns for certain parameters to zero so as to not penalize those parameters. Depending on penalty, some elements of penalty.matrix may be overridden automatically by setting them to zero. The penalty matrix that is used in the actual fit is penalty×diag(pf)×penalty.matrix×diag(pf)penalty \times diag(pf) \times penalty.matrix \times diag(pf), where pfpf is the vector of square roots of penalty factors computed from penalty by Penalty.setup in rmsMisc. If you specify penalty.matrix you must specify a nonzero value of penalty or no penalization will be done.

tol

singularity criterion (see lrm.fit)

strata.penalty

scalar penalty factor for the stratification factor, for the experimental strat variable

var.penalty

the type of variance-covariance matrix to be stored in the var component of the fit when penalization is used. The default is the inverse of the penalized information matrix. Specify var.penalty="sandwich" to use the sandwich estimator (see below under var), which limited simulation studies have shown yields variances estimates that are too low.

weights

a vector (same length as y) of possibly fractional case weights

normwt

set to TRUE to scale weights so they sum to the length of y; useful for sample surveys as opposed to the default of frequency weighting

scale

set to TRUE to subtract means and divide by standard deviations of columns of the design matrix before fitting, and to back-solve for the un-normalized covariance matrix and regression coefficients. This can sometimes make the model converge for very large sample sizes where for example spline or polynomial component variables create scaling problems leading to loss of precision when accumulating sums of squares and crossproducts.

...

arguments that are passed to lrm.fit, or from print, to prModFit

digits

number of significant digits to use

r2

vector of integers specifying which R^2 measures to print, with 0 for Nagelkerke R^2 and 1:4 corresponding to the 4 measures computed by R2Measures. Default is to print Nagelkerke (labeled R2) and second and fourth R2Measures which are the measures adjusted for the number of predictors, first for the raw sample size then for the effective sample size, which here is from the formula for the approximate variance of a log odds ratio in a proportional odds model.

strata.coefs

set to TRUE to print the (experimental) strata coefficients

coefs

specify coefs=FALSE to suppress printing the table of model coefficients, standard errors, etc. Specify coefs=n to print only the first n regression coefficients in the model.

pg

set to TRUE to print g-indexes

title

a character string title to be passed to prModFit

Value

The returned fit object of lrm contains the following components in addition to the ones mentioned under the optional arguments.

call

calling expression

freq

table of frequencies for Y in order of increasing Y

stats

vector with the following elements: number of observations used in the fit, maximum absolute value of first derivative of log likelihood, model likelihood ratio χ2\chi^2, d.f., PP-value, cc index (area under ROC curve), Somers' DxyD_{xy}, Goodman-Kruskal γ\gamma, Kendall's τa\tau_a rank correlations between predicted probabilities and observed response, the Nagelkerke R2R^2 index, the Brier score computed with respect to Y>Y > its lowest level, the gg-index, grgr (the gg-index on the odds ratio scale), and gpgp (the gg-index on the probability scale using the same cutoff used for the Brier score). Probabilities are rounded to the nearest 0.0002 in the computations or rank correlation indexes. In the case of penalized estimation, the "Model L.R." is computed without the penalty factor, and "d.f." is the effective d.f. from Gray's (1992) Equation 2.9. The PP-value uses this corrected model L.R. χ2\chi^2 and corrected d.f. The score chi-square statistic uses first derivatives which contain penalty components.

fail

set to TRUE if convergence failed (and maxiter>1)

coefficients

estimated parameters

var

estimated variance-covariance matrix (inverse of information matrix). If penalty>0, var is either the inverse of the penalized information matrix (the default, if var.penalty="simple") or the sandwich-type variance - covariance matrix estimate (Gray Eq. 2.6) if var.penalty="sandwich". For the latter case the simple information-matrix - based variance matrix is returned under the name var.from.info.matrix.

effective.df.diagonal

is returned if penalty>0. It is the vector whose sum is the effective d.f. of the model (counting intercept terms).

u

vector of first derivatives of log-likelihood

deviance

-2 log likelihoods (counting penalty components) When an offset variable is present, three deviances are computed: for intercept(s) only, for intercepts+offset, and for intercepts+offset+predictors. When there is no offset variable, the vector contains deviances for the intercept(s)-only model and the model with intercept(s) and predictors.

est

vector of column numbers of X fitted (intercepts are not counted)

non.slopes

number of intercepts in model

penalty

see above

penalty.matrix

the penalty matrix actually used in the estimation

Author(s)

Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]

References

Le Cessie S, Van Houwelingen JC: Ridge estimators in logistic regression. Applied Statistics 41:191–201, 1992.

Verweij PJM, Van Houwelingen JC: Penalized likelihood in Cox regression. Stat in Med 13:2427–2436, 1994.

Gray RJ: Flexible methods for analyzing survival data using splines, with applications to breast cancer prognosis. JASA 87:942–951, 1992.

Shao J: Linear model selection by cross-validation. JASA 88:486–494, 1993.

Verweij PJM, Van Houwelingen JC: Crossvalidation in survival analysis. Stat in Med 12:2305–2314, 1993.

Harrell FE: Model uncertainty, penalization, and parsimony. ISCB Presentation on UVa Web page, 1998.

See Also

lrm.fit, predict.lrm, rms.trans, rms, glm, latex.lrm, residuals.lrm, na.delete, na.detail.response, pentrace, rmsMisc, vif, cr.setup, predab.resample, validate.lrm, calibrate, Mean.lrm, gIndex, prModFit

Examples

#Fit a logistic model containing predictors age, blood.pressure, sex
#and cholesterol, with age fitted with a smooth 5-knot restricted cubic 
#spline function and a different shape of the age relationship for males 
#and females.  As an intermediate step, predict mean cholesterol from
#age using a proportional odds ordinal logistic model
#
require(ggplot2)
n <- 1000    # define sample size
set.seed(17) # so can reproduce the results
age            <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
cholesterol    <- rnorm(n, 200, 25)
sex            <- factor(sample(c('female','male'), n,TRUE))
label(age)            <- 'Age'      # label is in Hmisc
label(cholesterol)    <- 'Total Cholesterol'
label(blood.pressure) <- 'Systolic Blood Pressure'
label(sex)            <- 'Sex'
units(cholesterol)    <- 'mg/dl'   # uses units.default in Hmisc
units(blood.pressure) <- 'mmHg'

#To use prop. odds model, avoid using a huge number of intercepts by
#grouping cholesterol into 40-tiles
ch <- cut2(cholesterol, g=40, levels.mean=TRUE) # use mean values in intervals
table(ch)
f <- lrm(ch ~ age)
options(prType='latex')
print(f, coefs=4)  # write latex code to console
m <- Mean(f)    # see help file for Mean.lrm
d <- data.frame(age=seq(0,90,by=10))
m(predict(f, d))
# Repeat using ols
f <- ols(cholesterol ~ age)
predict(f, d)

# Specify population model for log odds that Y=1
L <- .4*(sex=='male') + .045*(age-50) +
     (log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)
cholesterol[1:3] <- NA   # 3 missings, at random

ddist <- datadist(age, blood.pressure, cholesterol, sex)
options(datadist='ddist')

fit <- lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)),
               x=TRUE, y=TRUE)
#      x=TRUE, y=TRUE allows use of resid(), which.influence below
#      could define d <- datadist(fit) after lrm(), but data distribution
#      summary would not be stored with fit, so later uses of Predict
#      or summary.rms would require access to the original dataset or
#      d or specifying all variable values to summary, Predict, nomogram
anova(fit)
p <- Predict(fit, age, sex)
ggplot(p)   # or plot()
ggplot(Predict(fit, age=20:70, sex="male"))   # need if datadist not used
print(cbind(resid(fit,"dfbetas"), resid(fit,"dffits"))[1:20,])
which.influence(fit, .3)
# latex(fit)                       #print nice statement of fitted model
#
#Repeat this fit using penalized MLE, penalizing complex terms
#(for nonlinear or interaction effects)
#
fitp <- update(fit, penalty=list(simple=0,nonlinear=10), x=TRUE, y=TRUE)
effective.df(fitp)
# or lrm(y ~ \dots, penalty=\dots)


#Get fits for a variety of penalties and assess predictive accuracy 
#in a new data set.  Program efficiently so that complex design 
#matrices are only created once.


set.seed(201)
x1 <- rnorm(500)
x2 <- rnorm(500)
x3 <- sample(0:1,500,rep=TRUE)
L  <- x1+abs(x2)+x3
y  <- ifelse(runif(500)<=plogis(L), 1, 0)
new.data <- data.frame(x1,x2,x3,y)[301:500,]
#
for(penlty in seq(0,.15,by=.005)) {
  if(penlty==0) {
    f <- lrm(y ~ rcs(x1,4)+rcs(x2,6)*x3, subset=1:300, x=TRUE, y=TRUE)
    # True model is linear in x1 and has no interaction
    X <- f$x    # saves time for future runs - don't have to use rcs etc.
    Y <- f$y    # this also deletes rows with NAs (if there were any)
    penalty.matrix <- diag(diag(var(X)))
    Xnew <- predict(f, new.data, type="x")  
    # expand design matrix for new data
    Ynew <- new.data$y
  } else f <- lrm.fit(X,Y, penalty.matrix=penlty*penalty.matrix)
#
  cat("\nPenalty :",penlty,"\n")
  pred.logit <- f$coef[1] + (Xnew %*% f$coef[-1])
  pred <- plogis(pred.logit)
  C.index <- somers2(pred, Ynew)["C"]
  Brier   <- mean((pred-Ynew)^2)
  Deviance<- -2*sum( Ynew*log(pred) + (1-Ynew)*log(1-pred) )
  cat("ROC area:",format(C.index),"   Brier score:",format(Brier),
      "   -2 Log L:",format(Deviance),"\n")
}
#penalty=0.045 gave lowest -2 Log L, Brier, ROC in test sample for S+
#
#Use bootstrap validation to estimate predictive accuracy of
#logistic models with various penalties
#To see how noisy cross-validation estimates can be, change the
#validate(f, \dots) to validate(f, method="cross", B=10) for example.
#You will see tremendous variation in accuracy with minute changes in
#the penalty.  This comes from the error inherent in using 10-fold
#cross validation but also because we are not fixing the splits.  
#20-fold cross validation was even worse for some
#indexes because of the small test sample size.  Stability would be
#obtained by using the same sample splits for all penalty values 
#(see above), but then we wouldn't be sure that the choice of the 
#best penalty is not specific to how the sample was split.  This
#problem is addressed in the last example.
#
penalties <- seq(0,.7,length=3)   # really use by=.02
index <- matrix(NA, nrow=length(penalties), ncol=11,
	        dimnames=list(format(penalties),
          c("Dxy","R2","Intercept","Slope","Emax","D","U","Q","B","g","gp")))
i <- 0
for(penlty in penalties)
{
  cat(penlty, "")
  i <- i+1
  if(penlty==0)
    {
    f <- lrm(y ~ rcs(x1,4)+rcs(x2,6)*x3, x=TRUE, y=TRUE)  # fit whole sample
    X <- f$x
    Y <- f$y
    penalty.matrix <- diag(diag(var(X)))   # save time - only do once
    }
  else
   f <- lrm(Y ~ X, penalty=penlty,
            penalty.matrix=penalty.matrix, x=TRUE,y=TRUE)
  val <- validate(f, method="boot", B=20)  # use larger B in practice
  index[i,] <- val[,"index.corrected"]
}
par(mfrow=c(3,3))
for(i in 1:9)
{
  plot(penalties, index[,i], 
       xlab="Penalty", ylab=dimnames(index)[[2]][i])
  lines(lowess(penalties, index[,i]))
}
options(datadist=NULL)

# Example of weighted analysis
x    <- 1:5
y    <- c(0,1,0,1,0)
reps <- c(1,2,3,2,1)
lrm(y ~ x, weights=reps)
x <- rep(x, reps)
y <- rep(y, reps)
lrm(y ~ x)   # same as above

#
#Study performance of a modified AIC which uses the effective d.f.
#See Verweij and Van Houwelingen (1994) Eq. (6).  Here AIC=chisq-2*df.
#Also try as effective d.f. equation (4) of the previous reference.
#Also study performance of Shao's cross-validation technique (which was
#designed to pick the "right" set of variables, and uses a much smaller
#training sample than most methods).  Compare cross-validated deviance
#vs. penalty to the gold standard accuracy on a 7500 observation dataset.
#Note that if you only want to get AIC or Schwarz Bayesian information
#criterion, all you need is to invoke the pentrace function.
#NOTE: the effective.df( ) function is used in practice
#
## Not run: 
for(seed in c(339,777,22,111,3)){ 
# study performance for several datasets
  set.seed(seed)
  n <- 175; p <- 8
  X <- matrix(rnorm(n*p), ncol=p) # p normal(0,1) predictors
  Coef <- c(-.1,.2,-.3,.4,-.5,.6,-.65,.7)  # true population coefficients
  L <- X %*% Coef                 # intercept is zero
  Y <- ifelse(runif(n)<=plogis(L), 1, 0)
  pm <- diag(diag(var(X)))
  #Generate a large validation sample to use as a gold standard
  n.val <- 7500
  X.val <- matrix(rnorm(n.val*p), ncol=p)
  L.val <- X.val %*% Coef
  Y.val <- ifelse(runif(n.val)<=plogis(L.val), 1, 0)
  #
  Penalty <- seq(0,30,by=1)
  reps <- length(Penalty)
  effective.df <- effective.df2 <- aic <- aic2 <- deviance.val <- 
    Lpenalty <- single(reps)
  n.t <- round(n^.75)
  ncv <- c(10,20,30,40)     # try various no. of reps in cross-val.
  deviance <- matrix(NA,nrow=reps,ncol=length(ncv))
  #If model were complex, could have started things off by getting X, Y
  #penalty.matrix from an initial lrm fit to save time
  #
  for(i in 1:reps) {
    pen <- Penalty[i]
    cat(format(pen),"")
    f.full <- lrm.fit(X, Y, penalty.matrix=pen*pm)
    Lpenalty[i] <- pen* t(f.full$coef[-1]) %*% pm %*% f.full$coef[-1]
    f.full.nopenalty <- lrm.fit(X, Y, initial=f.full$coef, maxit=1)
    info.matrix.unpenalized <- solve(f.full.nopenalty$var)
    effective.df[i] <- sum(diag(info.matrix.unpenalized %*% f.full$var)) - 1
    lrchisq <- f.full.nopenalty$stats["Model L.R."]
    # lrm does all this penalty adjustment automatically (for var, d.f.,
    # chi-square)
    aic[i] <- lrchisq - 2*effective.df[i]
    #
    pred <- plogis(f.full$linear.predictors)
    score.matrix <- cbind(1,X) * (Y - pred)
    sum.u.uprime <- t(score.matrix) %*% score.matrix
    effective.df2[i] <- sum(diag(f.full$var %*% sum.u.uprime))
    aic2[i] <- lrchisq - 2*effective.df2[i]
    #
    #Shao suggested averaging 2*n cross-validations, but let's do only 40
    #and stop along the way to see if fewer is OK
    dev <- 0
    for(j in 1:max(ncv)) {
      s    <- sample(1:n, n.t)
      cof  <- lrm.fit(X[s,],Y[s], 
                      penalty.matrix=pen*pm)$coef
      pred <- cof[1] + (X[-s,] %*% cof[-1])
      dev <- dev -2*sum(Y[-s]*pred + log(1-plogis(pred)))
      for(k in 1:length(ncv)) if(j==ncv[k]) deviance[i,k] <- dev/j
    }
    #
    pred.val <- f.full$coef[1] + (X.val %*% f.full$coef[-1])
    prob.val <- plogis(pred.val)
    deviance.val[i] <- -2*sum(Y.val*pred.val + log(1-prob.val))
  }
  postscript(hor=TRUE)   # along with graphics.off() below, allow plots
  par(mfrow=c(2,4))   # to be printed as they are finished
  plot(Penalty, effective.df, type="l")
  lines(Penalty, effective.df2, lty=2)
  plot(Penalty, Lpenalty, type="l")
  title("Penalty on -2 log L")
  plot(Penalty, aic, type="l")
  lines(Penalty, aic2, lty=2)
  for(k in 1:length(ncv)) {
    plot(Penalty, deviance[,k], ylab="deviance")
    title(paste(ncv[k],"reps"))
    lines(supsmu(Penalty, deviance[,k]))
  }
  plot(Penalty, deviance.val, type="l")
  title("Gold Standard (n=7500)")
  title(sub=format(seed),adj=1,cex=.5)
  graphics.off()
}

## End(Not run)
#The results showed that to obtain a clear picture of the penalty-
#accuracy relationship one needs 30 or 40 reps in the cross-validation.
#For 4 of 5 samples, though, the super smoother was able to detect
#an accurate penalty giving the best (lowest) deviance using 10-fold
#cross-validation.  Cross-validation would have worked better had
#the same splits been used for all penalties.
#The AIC methods worked just as well and are much quicker to compute.
#The first AIC based on the effective d.f. in Gray's Eq. 2.9
#(Verweij and Van Houwelingen (1994) Eq. 5 (note typo)) worked best.

Logistic Model Fitter

Description

Fits a binary or ordinal logistic model for a given design matrix and response vector with no missing values in either. Ordinary or penalized maximum likelihood estimation is used.

Usage

lrm.fit(x, y, offset=0, initial, est, maxit=12, eps=.025,
        tol=1e-7, trace=FALSE, penalty.matrix=NULL, weights=NULL,
        normwt=FALSE, scale=FALSE)

Arguments

x

design matrix with no column for an intercept

y

response vector, numeric, categorical, or character

offset

optional numeric vector containing an offset on the logit scale

initial

vector of initial parameter estimates, beginning with the intercept

est

indexes of x to fit in the model (default is all columns of x). Specifying est=c(1,2,5) causes columns 1,2, and 5 to have parameters estimated. The score vector u and covariance matrix var can be used to obtain score statistics for other columns

maxit

maximum no. iterations (default=12). Specifying maxit=1 causes logist to compute statistics at initial estimates.

eps

difference in 2log-2 log likelihood for declaring convergence. Default is .025. If the 2log-2 log likelihood gets worse by eps/10 while the maximum absolute first derivative of 2log-2 log likelihood is below 1e-9, convergence is still declared. This handles the case where the initial estimates are MLEs, to prevent endless step-halving.

tol

Singularity criterion. Default is 1e-7

trace

set to TRUE to print -2 log likelihood, step-halving fraction, change in -2 log likelihood, maximum absolute value of first derivative, and vector of first derivatives at each iteration.

penalty.matrix

a self-contained ready-to-use penalty matrix - see lrm

weights

a vector (same length as y) of possibly fractional case weights

normwt

set to TRUE to scale weights so they sum to the length of y; useful for sample surveys as opposed to the default of frequency weighting

scale

set to TRUE to subtract column means and divide by column standard deviations of x before fitting, and to back-solve for the un-normalized covariance matrix and regresion coefficients. This can sometimes make the model converge for very large sample sizes where for example spline or polynomial component variables create scaling problems leading to loss of precision when accumulating sums of squares and crossproducts.

Value

a list with the following components:

call

calling expression

freq

table of frequencies for y in order of increasing y

stats

vector with the following elements: number of observations used in the fit, maximum absolute value of first derivative of log likelihood, model likelihood ratio chi-square, d.f., P-value, cc index (area under ROC curve), Somers' DxyD_{xy}, Goodman-Kruskal γ\gamma, and Kendall's τa\tau_a rank correlations between predicted probabilities and observed response, the Nagelkerke R2R^2 index, 4 indexes computed by R2Measures, the Brier probability score with respect to computing the probability that y>y > the mid level less one, the gg-index, grgr (the gg-index on the odds ratio scale), and gpgp (the gg-index on the probability scale using the same cutoff used for the Brier score). Probabilities are rounded to the nearest 0.002 in the computations or rank correlation indexes. When penalty.matrix is present, the χ2\chi^2, d.f., and P-value are not corrected for the effective d.f.

fail

set to TRUE if convergence failed (and maxit>1)

coefficients

estimated parameters

var

estimated variance-covariance matrix (inverse of information matrix). Note that in the case of penalized estimation, var is not the improved sandwich-type estimator (which lrm does compute).

u

vector of first derivatives of log-likelihood

deviance

-2 log likelihoods. When an offset variable is present, three deviances are computed: for intercept(s) only, for intercepts+offset, and for intercepts+offset+predictors. When there is no offset variable, the vector contains deviances for the intercept(s)-only model and the model with intercept(s) and predictors.

est

vector of column numbers of X fitted (intercepts are not counted)

non.slopes

number of intercepts in model

penalty.matrix

see above

Author(s)

Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]

See Also

lrm, glm, matinv, solvet, cr.setup, gIndex

Examples

#Fit an additive logistic model containing numeric predictors age, 
#blood.pressure, and sex, assumed to be already properly coded and 
#transformed
#
# fit <- lrm.fit(cbind(age,blood.pressure,sex), death)

lrm.fit.bare

Description

Bare Bones Logistic Regression Fit

Usage

lrm.fit.bare(x, y, maxit = 12, eps = 0.025, tol = 1e-07)

Arguments

x

a vector of matrix of covariate values

y

a numeric or factor vector representing the dependent variable

maxit

maximum number of iteractions

eps

stopping criterion (change in -2 log likelihood)

tol

matrix inversion tolerance for singularities

Details

This is a stripped down version of the lrm.fit() function that computes only the regression coefficients, variance-covariance-matrix, and log likelihood (for null and fitted model) and does not compute any model fit indexes etc. This is for speed in simulations or with bootstrapping. Missing data are not allowed. The function handles binary and ordinal logistic regression (proportional odds model).

Value

a list with elements coefficients, var, fail, freq, deviance

Author(s)

Frank Harrell


LRupdate

Description

Update Model LR Statistics After Multiple Imputation

Usage

LRupdate(fit, anova)

Arguments

fit

an rms fit object

anova

the result of processMI(..., 'anova')

Details

For fits from ⁠orm, lrm, orm, cph, psm⁠ that were created using fit.mult.impute with lrt=TRUE or equivalent options and for which anova was obtained using processMI(fit, 'anova') to compute imputation-adjusted LR statistics. LRupdate uses the last line of the anova result (containing the overall model LR chi-square) to update ⁠Model L.R.⁠ in the fit stats component, and to adjust any of the new R-square measures in stats.

For models using Nagelkerke's R-squared, these are set to NA as they would need to be recomputed with a new intercept-only log-likelihood, which is not computed by anova. For ols models, R-squared is left alone as it is sample-size-independent and print.ols prints the correct adjusted R-squared due to fit.mult.impute correcting the residual d.f. in stacked fits.

Value

new fit object like fit but with the substitutions made

Author(s)

Frank Harrell

See Also

processMI.fit.mult.impute(), Hmisc::R2Measures()

Examples

## Not run: 
a <- aregImpute(~ y + x1 + x2, n.impute=30, data=d)
f <- fit.mult.impute(y ~ x1 + x2, lrm, a, data=d, lrt=TRUE)
a <- processMI(f, 'anova')
f <- LRupdate(f, a)
print(f, r2=1:4)   # print all imputation-corrected R2 measures

## End(Not run)

Total and Partial Matrix Inversion using Gauss-Jordan Sweep Operator

Description

This function inverts or partially inverts a matrix using pivoting (the sweep operator). It is useful for sequential model-building.

Usage

matinv(a, which, negate=TRUE, eps=1e-12)

Arguments

a

square matrix to invert or partially invert. May have been inverted or partially inverted previously by matinv, in which case its "swept" attribute is updated. Will un-invert if already inverted.

which

vector of column/row numbers in a to invert. Default is all, for total inverse.

negate

So that the algorithm can keep track of which pivots have been swept as well as roundoff errors, it actually returns the negative of the inverse or partial inverse. By default, these elements are negated to give the usual expected result. Set negate=FALSE if you will be passing the result right back into matinv, otherwise, negate the submatrix before sending back to matinv.

eps

singularity criterion

Value

a square matrix, with attributes "rank" and "swept".

References

Clarke MRB (1982). Algorithm AS 178: The Gauss-Jordan sweep operator with detection of collinearity. Appl Statist 31:166–9.

Ridout MS, Cobb JM (1986). Algorithm AS R78 : A remark on algorithm AS 178: The Gauss-Jordan sweep operator with detection of collinearity. Appl Statist 38:420–2.

See Also

lrm, solve

Examples

a      <- diag(1:3)
a.inv1 <- matinv(a, 1, negate=FALSE)	     #Invert with respect to a[1,1]
a.inv1
a.inv  <- -matinv(a.inv1, 2:3, negate=FALSE) #Finish the job
a.inv
solve(a)

Draw a Nomogram Representing a Regression Fit

Description

Draws a partial nomogram that can be used to manually obtain predicted values from a regression model that was fitted with rms. The nomogram does not have lines representing sums, but it has a reference line for reading scoring points (default range 0–100). Once the reader manually totals the points, the predicted values can be read at the bottom. Non-monotonic transformations of continuous variables are handled (scales wrap around), as are transformations which have flat sections (tick marks are labeled with ranges). If interactions are in the model, one variable is picked as the “axis variable”, and separate axes are constructed for each level of the interacting factors (preference is given automatically to using any discrete factors to construct separate axes) and levels of factors which are indirectly related to interacting factors (see DETAILS). Thus the nomogram is designed so that only one axis is actually read for each variable, since the variable combinations are disjoint. For categorical interacting factors, the default is to construct axes for all levels. The user may specify coordinates of each predictor to label on its axis, or use default values. If a factor interacts with other factors, settings for one or more of the interacting factors may be specified separately (this is mandatory for continuous variables). Optional confidence intervals will be drawn for individual scores as well as for the linear predictor. If more than one confidence level is chosen, multiple levels may be displayed using different colors or gray scales. Functions of the linear predictors may be added to the nomogram.

The datadist object that was in effect when the model was fit is used to specify the limits of the axis for continuous predictors when the user does not specify tick mark locations in the nomogram call.

print.nomogram prints axis information stored in an object returned by nomogram. This is useful in producing tables of point assignments by levels of predictors. It also prints how many linear predictor units there are per point and the number of points per unit change in the linear predictor.

legend.nomabbrev draws legends describing abbreviations used for labeling tick marks for levels of categorical predictors.

Usage

nomogram(fit, ..., adj.to, lp=TRUE, lp.at=NULL,
         fun=NULL, fun.at=NULL, fun.lp.at=NULL, funlabel="Predicted Value",
         interact=NULL, kint=NULL,  conf.int=FALSE, 
         conf.lp=c("representative", "all", "none"),
         est.all=TRUE, posterior.summary=c('mean', 'median', 'mode'),
         abbrev=FALSE, minlength=4, maxscale=100, nint=10, 
         vnames=c("labels","names"),
         varname.label=TRUE, varname.label.sep="=",
         omit=NULL, verbose=FALSE)

## S3 method for class 'nomogram'
print(x, dec=0, ...)

## S3 method for class 'nomogram'
plot(x, lplabel="Linear Predictor", fun.side,
 col.conf=c(1, 0.3),
 conf.space=c(.08,.2), label.every=1, force.label=FALSE, 
 xfrac=.35, cex.axis=.85, cex.var=1, col.grid=NULL,
 varname.label=TRUE, varname.label.sep="=", ia.space=.7, 
 tck=NA, tcl=-0.25, lmgp=.4, naxes,
 points.label='Points', total.points.label='Total Points',
 total.sep.page=FALSE, total.fun, cap.labels=FALSE, ...)

legend.nomabbrev(object, which, x, y, ncol=3, ...)

Arguments

fit

a regression model fit that was created with rms, and (usually) with options(datadist = "object.name") in effect.

...

settings of variables to use in constructing axes. If datadist was in effect, the default is to use pretty(total range, nint) for continuous variables, and the class levels for discrete ones. For legend.nomabbrev, ... specifies optional parameters to pass to legend. Common ones are bty = "n" to suppress drawing the box. You may want to specify a non-proportionally spaced font (e.g., courier) number if abbreviations are more than one letter long. This will make the abbreviation definitions line up (e.g., specify font = 2, the default for courier). Ignored for print and plot.

adj.to

If you didn't define datadist for all predictors, you will have to define adjustment settings for the undefined ones, e.g. adj.to= list(age = 50, sex = "female").

lp

Set to FALSE to suppress creation of an axis for scoring XβX\beta.

lp.at

If lp=TRUE, lp.at may specify a vector of settings of XβX\beta. Default is to use pretty(range of linear predictors, nint).

fun

an optional function to transform the linear predictors, and to plot on another axis. If more than one transformation is plotted, put them in a list, e.g. list(function(x) x/2, function(x) 2*x). Any function values equal to NA will be ignored.

fun.at

function values to label on axis. Default fun evaluated at lp.at. If more than one fun was specified, using a vector for fun.at will cause all functions to be evaluated at the same argument values. To use different values, specify a list of vectors for fun.at, with elements corresponding to the different functions (lists of vectors also applies to fun.lp.at and fun.side).

fun.lp.at

If you want to evaluate one of the functions at a different set of linear predictor values than may have been used in constructing the linear predictor axis, specify a vector or list of vectors of linear predictor values at which to evaluate the function. This is especially useful for discrete functions. The presence of this attribute also does away with the need for nomogram to compute numerical approximations of the inverse of the function. It also allows the user-supplied function to return factor objects, which is useful when e.g. a single tick mark position actually represents a range. If the fun.lp.at parameter is present, the fun.at vector for that function is ignored.

funlabel

label for fun axis. If more than one function was given but funlabel is of length one, it will be duplicated as needed. If fun is a list of functions for which you specified names (see the final example below), these names will be used as labels.

interact

When a continuous variable interacts with a discrete one, axes are constructed so that the continuous variable moves within the axis, and separate axes represent levels of interacting factors. For interactions between two continuous variables, all but the axis variable must have discrete levels defined in interact. For discrete interacting factors, you may specify levels to use in constructing the multiple axes. For continuous interacting factors, you must do this. Examples: interact = list(age = seq(10,70,by=10), treat = c("A","B","D")).

kint

for models such as the ordinal models with multiple intercepts, specifies which one to use in evaluating the linear predictor. Default is to use fit$interceptRef if it exists, or 1.

conf.int

confidence levels to display for each scoring. Default is FALSE to display no confidence limits. Setting conf.int to TRUE is the same as setting it to c(0.7, 0.9), with the line segment between the 0.7 and 0.9 levels shaded using gray scale.

conf.lp

default is "representative" to group all linear predictors evaluated into deciles, and to show, for the linear predictor confidence intervals, only the mean linear predictor within the deciles along with the median standard error within the deciles. Set conf.lp = "none" to suppress confidence limits for the linear predictors, and to "all" to show all confidence limits.

est.all

To plot axes for only the subset of variables named in ..., set est.all = FALSE. Note: This option only works when zero has a special meaning for the variables that are omitted from the graph.

posterior.summary

when operating on a Bayesian model such as a result of blrm specifies whether to use posterior mean (default) vs. posterior mode/median of parameter values in constructing the nomogram

abbrev

Set to TRUE to use the abbreviate function to abbreviate levels of categorical factors, both for labeling tick marks and for axis titles. If you only want to abbreviate certain predictor variables, set abbrev to a vector of character strings containing their names.

minlength

applies if abbrev = TRUE. Is the minimum abbreviation length passed to the abbreviate function. If you set minlength = 1, the letters of the alphabet are used to label tick marks for categorical predictors, and all letters are drawn no matter how close together they are. For labeling axes (interaction settings), minlength = 1 causes minlength = 4 to be used.

maxscale

default maximum point score is 100

nint

number of intervals to label for axes representing continuous variables. See pretty.

vnames

By default, variable labels are used to label axes. Set vnames = "names" to instead use variable names.

omit

vector of character strings containing names of variables for which to suppress drawing axes. Default is to show all variables.

verbose

set to TRUE to get printed output detailing how tick marks are chosen and labeled for function axes. This is useful in seeing how certain linear predictor values cannot be solved for using inverse linear interpolation on the (requested linear predictor values, function values at these lp values). When this happens you will see NAs in the verbose output, and the corresponding tick marks will not appear in the nomogram.

x

an object created by nomogram, or the x coordinate for a legend

dec

number of digits to the right of the decimal point, for rounding point scores in print.nomogram. Default is to round to the nearest whole number of points.

lplabel

label for linear predictor axis. Default is "Linear Predictor".

fun.side

a vector or list of vectors of side parameters for the axis function for labeling function values. Values may be 1 to position a tick mark label below the axis (the default), or 3 for above the axis. If for example an axis has 5 tick mark labels and the second and third will run into each other, specify fun.side=c(1,1,3,1,1) (assuming only one function is specified as fun).

col.conf

colors corresponding to conf.int.

conf.space

a 2-element vector with the vertical range within which to draw confidence bars, in units of 1=spacing between main bars. Four heights are used within this range (8 for the linear predictor if more than 16 unique values were evaluated), cycling them among separate confidence intervals to reduce overlapping.

label.every

Specify label.every = i to label on every ith tick mark.

force.label

set to TRUE to force every tick mark intended to be labeled to have a label plotted (whether the labels run into each other or not)

xfrac

fraction of horizontal plot to set aside for axis titles

cex.axis

character size for tick mark labels

cex.var

character size for axis titles (variable names)

col.grid

If left unspecified, no vertical reference lines are drawn. Specify a vector of length one (to use the same color for both minor and major reference lines) or two (corresponding to the color for the major and minor divisions, respectively) containing colors, to cause vertical reference lines to the top points scale to be drawn. For R, a good choice is col.grid = gray(c(0.8, 0.95)).

varname.label

In constructing axis titles for interactions, the default is to add (interacting.varname = level) on the right. Specify varname.label = FALSE to instead use "(level)".

varname.label.sep

If varname.label = TRUE, you can change the separator to something other than = by specifying this parameter.

ia.space

When multiple axes are draw for levels of interacting factors, the default is to group combinations related to a main effect. This is done by spacing the axes for the second to last of these within a group only 0.7 (by default) of the way down as compared with normal space of 1 unit.

tck

see tck under par

tcl

length of tick marks in nomogram

lmgp

spacing between numeric axis labels and axis (see par for mgp)

naxes

maximum number of axes to allow on one plot. If the nomogram requires more than one “page”, the “Points” axis will be repeated at the top of each page when necessary.

points.label

a character string giving the axis label for the points scale

total.points.label

a character string giving the axis label for the total points scale

total.sep.page

set to TRUE to force the total points and later axes to be placed on a separate page

total.fun

a user-provided function that will be executed before the total points axis is drawn. Default is not to execute a function. This is useful e.g. when total.sep.page = TRUE and you wish to use locator to find the coordinates for positioning an abbreviation legend before it's too late and a new page is started (i.e., total.fun = function() print(locator(1))).

cap.labels

logical: should the factor labels have their first letter capitalized?

object

the result returned from nomogram

which

a character string giving the name of a variable for which to draw a legend with abbreviations of factor levels

y

y-coordinate to pass to the legend function. This is the upper left corner of the legend box. You can omit y if x is a list with named elements x and y. To use the mouse to locate the legend, specify locator(1) for x. For print, x is the result of nomogram.

ncol

the number of columns to form in drawing the legend.

Details

A variable is considered to be discrete if it is categorical or ordered or if datadist stored values for it (meaning it had <11 unique values). A variable is said to be indirectly related to another variable if the two are related by some interaction. For example, if a model has variables a, b, c, d, and the interactions are a:c and c:d, variable d is indirectly related to variable a. The complete list of variables related to a is c, d. If an axis is made for variable a, several axes will actually be drawn, one for each combination of c and d specified in interact.

Note that with a caliper, it is easy to continually add point scores for individual predictors, and then to place the caliper on the upper “Points” axis (with extrapolation if needed). Then transfer these points to the “Total Points” axis. In this way, points can be added without writing them down.

Confidence limits for an individual predictor score are really confidence limits for the entire linear predictor, with other predictors set to adjustment values. If lp = TRUE, all confidence bars for all linear predictor values evaluated are drawn. The extent to which multiple confidence bars of differing widths appear at the same linear predictor value means that precision depended on how the linear predictor was arrived at (e.g., a certain value may be realized from a setting of a certain predictor that was associated with a large standard error on the regression coefficients for that predictor).

On occasion, you may want to reverse the regression coefficients of a model to make the “points” scales reverse direction. For parametric survival models, which are stated in terms of increasing regression effects meaning longer survival (the opposite of a Cox model), just do something like fit$coefficients <- -fit$coefficients before invoking nomogram, and if you add function axes, negate the function arguments. For the Cox model, you also need to negate fit$center. If you omit lp.at, also negate fit$linear.predictors.

Value

a list of class "nomogram" that contains information used in plotting the axes. If you specified abbrev = TRUE, a list called abbrev is also returned that gives the abbreviations used for tick mark labels, if any. This list is useful for making legends and is used by legend.nomabbrev (see the last example). The returned list also has components called total.points, lp, and the function axis names. These components have components x (at argument vector given to axis), y (pos for axis), and x.real, the x-coordinates appearing on tick mark labels. An often useful result is stored in the list of data for each axis variable, namely the exact number of points that correspond to each tick mark on that variable's axis.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]

References

Banks J: Nomograms. Encylopedia of Statistical Sciences, Vol 6. Editors: S Kotz and NL Johnson. New York: Wiley; 1985.

Lubsen J, Pool J, van der Does, E: A practical device for the application of a diagnostic or prognostic function. Meth. Inform. Med. 17:127–129; 1978.

Wikipedia: Nomogram, https://en.wikipedia.org/wiki/Nomogram.

See Also

rms, plot.Predict, ggplot.Predict, plot.summary.rms, axis, pretty, approx, latexrms, rmsMisc

Examples

n <- 1000    # define sample size
set.seed(17) # so can reproduce the results
d <- data.frame(age = rnorm(n, 50, 10),
                blood.pressure = rnorm(n, 120, 15),
                cholesterol = rnorm(n, 200, 25),
                sex = factor(sample(c('female','male'), n,TRUE)))

# Specify population model for log odds that Y=1
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
d <- upData(d,
  L = .4*(sex=='male') + .045*(age-50) +
       (log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male')),
  y = ifelse(runif(n) < plogis(L), 1, 0))

ddist <- datadist(d); options(datadist='ddist')


f <- lrm(y ~ lsp(age,50) + sex * rcs(cholesterol, 4) + blood.pressure,
         data=d)
nom <- nomogram(f, fun=function(x)1/(1+exp(-x)),  # or fun=plogis
    fun.at=c(.001,.01,.05,seq(.1,.9,by=.1),.95,.99,.999),
    funlabel="Risk of Death")
#Instead of fun.at, could have specified fun.lp.at=logit of
#sequence above - faster and slightly more accurate
plot(nom, xfrac=.45)
print(nom)
nom <- nomogram(f, age=seq(10,90,by=10))
plot(nom, xfrac=.45)
g <- lrm(y ~ sex + rcs(age, 3) * rcs(cholesterol, 3), data=d)
nom <- nomogram(g, interact=list(age=c(20,40,60)), 
                conf.int=c(.7,.9,.95))
plot(nom, col.conf=c(1,.5,.2), naxes=7)

require(survival)
w <- upData(d,
            cens = 15 * runif(n),
            h = .02 * exp(.04 * (age - 50) + .8 * (sex == 'Female')),
            d.time = -log(runif(n)) / h,
            death = ifelse(d.time <= cens, 1, 0),
            d.time = pmin(d.time, cens))


f <- psm(Surv(d.time,death) ~ sex * age, data=w, dist='lognormal')
med  <- Quantile(f)
surv <- Survival(f)  # This would also work if f was from cph
plot(nomogram(f, fun=function(x) med(lp=x), funlabel="Median Survival Time"))
nom <- nomogram(f, fun=list(function(x) surv(3, x),
                            function(x) surv(6, x)),
            funlabel=c("3-Month Survival Probability", 
                       "6-month Survival Probability"))
plot(nom, xfrac=.7)

## Not run: 
nom <- nomogram(fit.with.categorical.predictors, abbrev=TRUE, minlength=1)
nom$x1$points   # print points assigned to each level of x1 for its axis
#Add legend for abbreviations for category levels
abb <- attr(nom, 'info')$abbrev$treatment
legend(locator(1), abb$full, pch=paste(abb$abbrev,collapse=''), 
       ncol=2, bty='n')  # this only works for 1-letter abbreviations
#Or use the legend.nomabbrev function:
legend.nomabbrev(nom, 'treatment', locator(1), ncol=2, bty='n')

## End(Not run)


#Make a nomogram with axes predicting probabilities Y>=j for all j=1-3
#in an ordinal logistic model, where Y=0,1,2,3
w <- upData(w, Y = ifelse(y==0, 0, sample(1:3, length(y), TRUE)))
g <- lrm(Y ~ age+rcs(cholesterol,4) * sex, data=w)
fun2 <- function(x) plogis(x-g$coef[1]+g$coef[2])
fun3 <- function(x) plogis(x-g$coef[1]+g$coef[3])
f <- Newlabels(g, c(age='Age in Years'))  
#see Design.Misc, which also has Newlevels to change 
#labels for levels of categorical variables
g <- nomogram(f, fun=list('Prob Y>=1'=plogis, 'Prob Y>=2'=fun2, 
                     'Prob Y=3'=fun3), 
         fun.at=c(.01,.05,seq(.1,.9,by=.1),.95,.99))
plot(g, lmgp=.2, cex.axis=.6)
options(datadist=NULL)

Nonparametric Survival Estimates for Censored Data

Description

Computes an estimate of a survival curve for censored data using either the Kaplan-Meier or the Fleming-Harrington method or computes the predicted survivor function. For competing risks data it computes the cumulative incidence curve. This calls the survival package's survfit.formula function. Attributes of the event time variable are saved (label and units of measurement).

For competing risks the second argument for Surv should be the event state variable, and it should be a factor variable with the first factor level denoting right-censored observations.

Usage

npsurv(formula, data=environment(formula),
              subset, weights, na.action=na.delete, ...)

Arguments

formula

a formula object, which must have a Surv object as the response on the left of the ~ operator and, if desired, terms separated by + operators on the right. One of the terms may be a strata object. For a single survival curve the right hand side should be ~ 1.

data, subset, weights, na.action

see survfit.formula

...

see survfit.formula

Details

see survfit.formula for details

Value

an object of class "npsurv" and "survfit". See survfit.object for details. Methods defined for survfit objects are print, summary, plot,lines, and points.

Author(s)

Thomas Lumley [email protected] and Terry Therneau

See Also

survfit.cph for survival curves from Cox models. print, plot, lines, coxph, strata, survplot

Examples

require(survival)
# fit a Kaplan-Meier and plot it
fit <- npsurv(Surv(time, status) ~ x, data = aml)
plot(fit, lty = 2:3)
legend(100, .8, c("Maintained", "Nonmaintained"), lty = 2:3)

# Here is the data set from Turnbull
#  There are no interval censored subjects, only left-censored (status=3),
#  right-censored (status 0) and observed events (status 1)
#
#                             Time
#                         1    2   3   4
# Type of observation
#           death        12    6   2   3
#          losses         3    2   0   3
#      late entry         2    4   2   5
#
tdata <- data.frame(time   = c(1,1,1,2,2,2,3,3,3,4,4,4),
                    status = rep(c(1,0,2),4),
                    n      = c(12,3,2,6,2,4,2,0,2,3,3,5))
fit  <- npsurv(Surv(time, time, status, type='interval') ~ 1,
               data=tdata, weights=n)

#
# Time to progression/death for patients with monoclonal gammopathy
# Competing risk curves (cumulative incidence)
# status variable must be a factor with first level denoting right censoring
m <- upData(mgus1, stop = stop / 365.25, units=c(stop='years'),
            labels=c(stop='Follow-up Time'), subset=start == 0)
f <- npsurv(Surv(stop, event) ~ 1, data=m)

# CI curves are always plotted from 0 upwards, rather than 1 down
plot(f, fun='event', xmax=20, mark.time=FALSE,
     col=2:3, xlab="Years post diagnosis of MGUS")
text(10, .4, "Competing Risk: death", col=3)
text(16, .15,"Competing Risk: progression", col=2)

# Use survplot for enhanced displays of cumulative incidence curves for
# competing risks

survplot(f, state='pcm', n.risk=TRUE, xlim=c(0, 20), ylim=c(0, .5), col=2)
survplot(f, state='death', add=TRUE, col=3)

f <- npsurv(Surv(stop, event) ~ sex, data=m)
survplot(f, state='death', n.risk=TRUE, conf='diffbands')

Linear Model Estimation Using Ordinary Least Squares

Description

Fits the usual weighted or unweighted linear regression model using the same fitting routines used by lm, but also storing the variance-covariance matrix var and using traditional dummy-variable coding for categorical factors. Also fits unweighted models using penalized least squares, with the same penalization options as in the lrm function. For penalized estimation, there is a fitter function call lm.pfit.

Usage

ols(formula, data=environment(formula), weights, subset, na.action=na.delete, 
    method="qr", model=FALSE,
    x=FALSE, y=FALSE, se.fit=FALSE, linear.predictors=TRUE,
    penalty=0, penalty.matrix, tol=1e-7, sigma,
    var.penalty=c('simple','sandwich'), ...)

Arguments

formula

an S formula object, e.g.
Y ~ rcs(x1,5)*lsp(x2,c(10,20))

data

name of an S data frame containing all needed variables. Omit this to use a data frame already in the S “search list”.

weights

an optional vector of weights to be used in the fitting process. If specified, weighted least squares is used with weights weights (that is, minimizing sum(we2)sum(w*e^2)); otherwise ordinary least squares is used.

subset

an expression defining a subset of the observations to use in the fit. The default is to use all observations. Specify for example age>50 & sex="male" or c(1:100,200:300) respectively to use the observations satisfying a logical expression or those having row numbers in the given vector.

na.action

specifies an S function to handle missing data. The default is the function na.delete, which causes observations with any variable missing to be deleted. The main difference between na.delete and the S-supplied function na.omit is that na.delete makes a list of the number of observations that are missing on each variable in the model. The na.action is usally specified by e.g. options(na.action="na.delete").

method

specifies a particular fitting method, or "model.frame" instead to return the model frame of the predictor and response variables satisfying any subset or missing value checks.

model

default is FALSE. Set to TRUE to return the model frame as element model of the fit object.

x

default is FALSE. Set to TRUE to return the expanded design matrix as element x (without intercept indicators) of the returned fit object. Set both x=TRUE if you are going to use the residuals function later to return anything other than ordinary residuals.

y

default is FALSE. Set to TRUE to return the vector of response values as element y of the fit.

se.fit

default is FALSE. Set to TRUE to compute the estimated standard errors of the estimate of XβX\beta and store them in element se.fit of the fit.

linear.predictors

set to FALSE to cause predicted values not to be stored

penalty

see lrm

penalty.matrix

see lrm

tol

tolerance for information matrix singularity

sigma

If sigma is given, it is taken as the actual root mean squared error parameter for the model. Otherwise sigma is estimated from the data using the usual formulas (except for penalized models). It is often convenient to specify sigma=1 for models with no error, when using fastbw to find an approximate model that predicts predicted values from the full model with a given accuracy.

var.penalty

the type of variance-covariance matrix to be stored in the var component of the fit when penalization is used. The default is the inverse of the penalized information matrix. Specify var.penalty="sandwich" to use the sandwich estimator (see below under var), which limited simulation studies have shown yields variances estimates that are too low.

...

arguments to pass to lm.wfit or lm.fit

Details

For penalized estimation, the penalty factor on the log likelihood is 0.5βPβ/σ2-0.5 \beta' P \beta / \sigma^2, where PP is defined above. The penalized maximum likelihood estimate (penalized least squares or ridge estimate) of β\beta is (XX+P)1XY(X'X + P)^{-1} X'Y. The maximum likelihood estimate of σ2\sigma^2 is (sse+βPβ)/n(sse + \beta' P \beta) / n, where sse is the sum of squared errors (residuals). The effective.df.diagonal vector is the diagonal of the matrix XX/(sse/n)σ2(XX+P)1X'X/(sse/n) \sigma^{2} (X'X + P)^{-1}.

Value

the same objects returned from lm (unless penalty or penalty.matrix are given - then an abbreviated list is returned since lm.pfit is used as a fitter) plus the design attributes (see rms). Predicted values are always returned, in the element linear.predictors. The vectors or matrix stored if y=TRUE or x=TRUE have rows deleted according to subset and to missing data, and have names or row names that come from the data frame used as input data. If penalty or penalty.matrix is given, the var matrix returned is an improved variance-covariance matrix for the penalized regression coefficient estimates. If var.penalty="sandwich" (not the default, as limited simulation studies have found it provides variance estimates that are too low) it is defined as σ2(XX+P)1XX(XX+P)1\sigma^{2} (X'X + P)^{-1} X'X (X'X + P)^{-1}, where PP is penalty factors * penalty.matrix, with a column and row of zeros added for the intercept. When var.penalty="simple" (the default), var is σ2(XX+P)1\sigma^{2} (X'X + P)^{-1}. The returned list has a vector stats with named elements n, Model L.R., d.f., R2, g, Sigma. Model L.R. is the model likelihood ratio χ2\chi^2 statistic, and R2 is R2R^2. For penalized estimation, d.f. is the effective degrees of freedom, which is the sum of the elements of another vector returned, effective.df.diagonal, minus one for the intercept. g is the gg-index. Sigma is the penalized maximum likelihood estimate (see below).

Author(s)

Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]

See Also

rms, rms.trans, anova.rms, summary.rms, predict.rms, fastbw, validate, calibrate, Predict, specs.rms, cph, lrm, which.influence, lm, summary.lm, print.ols, residuals.ols, latex.ols, na.delete, na.detail.response, datadist, pentrace, vif, abs.error.pred

Examples

set.seed(1)
x1 <- runif(200)
x2 <- sample(0:3, 200, TRUE)
distance <- (x1 + x2/3 + rnorm(200))^2
d <- datadist(x1,x2)
options(datadist="d")   # No d -> no summary, plot without giving all details


f <- ols(sqrt(distance) ~ rcs(x1,4) + scored(x2), x=TRUE)
# could use d <- datadist(f); options(datadist="d") at this point,
# but predictor summaries would not be stored in the fit object for
# use with Predict, summary.rms.  In that case, the original
# dataset or d would need to be accessed later, or all variable values
# would have to be specified to summary, plot
anova(f)
which.influence(f)
summary(f)
summary.lm(f)    # will only work if penalty and penalty.matrix not used


# Fit a complex model and approximate it with a simple one
x1 <- runif(200)
x2 <- runif(200)
x3 <- runif(200)
x4 <- runif(200)
y <- x1 + x2 + rnorm(200)
f    <- ols(y ~ rcs(x1,4) + x2 + x3 + x4)
pred <- fitted(f)   # or predict(f) or f$linear.predictors
f2   <- ols(pred ~ rcs(x1,4) + x2 + x3 + x4, sigma=1)
# sigma=1 prevents numerical problems resulting from R2=1
fastbw(f2, aics=100000)
# This will find the best 1-variable model, best 2-variable model, etc.
# in predicting the predicted values from the original model
options(datadist=NULL)

Ordinal Regression Model

Description

Fits ordinal cumulative probability models for continuous or ordinal response variables, efficiently allowing for a large number of intercepts by capitalizing on the information matrix being sparse. Five different distribution functions are implemented, with the default being the logistic (i.e., the proportional odds model). The ordinal cumulative probability models are stated in terms of exceedance probabilities (Prob[YyX]Prob[Y \ge y | X]) so that as with OLS larger predicted values are associated with larger Y. This is important to note for the asymmetric distributions given by the log-log and complementary log-log families, for which negating the linear predictor does not result in Prob[Y<yX]Prob[Y < y | X]. The family argument is defined in orm.fit. The model assumes that the inverse of the assumed cumulative distribution function, when applied to one minus the true cumulative distribution function and plotted on the yy-axis (with the original yy on the xx-axis) yields parallel curves (though not necessarily linear). This can be checked by plotting the inverse cumulative probability function of one minus the empirical distribution function, stratified by X, and assessing parallelism. Note that parametric regression models make the much stronger assumption of linearity of such inverse functions.

For the print method, format of output is controlled by the user previously running options(prType="lang") where lang is "plain" (the default), "latex", or "html". When using html with Quarto or RMarkdown, results='asis' need not be written in the chunk header.

Quantile.orm creates an R function that computes an estimate of a given quantile for a given value of the linear predictor (which was assumed to use thefirst intercept). It uses a linear interpolation method by default, but you can override that to use a discrete method by specifying method="discrete" when calling the function generated by Quantile. Optionally a normal approximation for a confidence interval for quantiles will be computed using the delta method, if conf.int > 0 is specified to the function generated from calling Quantile and you specify X. In that case, a "lims" attribute is included in the result computed by the derived quantile function.

Usage

orm(formula, data=environment(formula),
    subset, na.action=na.delete, method="orm.fit",
    model=FALSE, x=FALSE, y=FALSE, linear.predictors=TRUE, se.fit=FALSE, 
    penalty=0, penalty.matrix, tol=1e-7, eps=0.005, 
    var.penalty=c('simple','sandwich'), scale=FALSE, ...)

## S3 method for class 'orm'
print(x, digits=4, r2=c(0,2,4), coefs=TRUE, pg=FALSE,
    intercepts=x$non.slopes < 10, title, ...)

## S3 method for class 'orm'
Quantile(object, codes=FALSE, ...)

Arguments

formula

a formula object. An offset term can be included. The offset causes fitting of a model such as logit(Y=1)=Xβ+Wlogit(Y=1) = X\beta + W, where WW is the offset variable having no estimated coefficient. The response variable can be any data type; orm converts it in alphabetic or numeric order to a factor variable and recodes it 1,2,... internally.

data

data frame to use. Default is the current frame.

subset

logical expression or vector of subscripts defining a subset of observations to analyze

na.action

function to handle NAs in the data. Default is na.delete, which deletes any observation having response or predictor missing, while preserving the attributes of the predictors and maintaining frequencies of deletions due to each variable in the model. This is usually specified using options(na.action="na.delete").

method

name of fitting function. Only allowable choice at present is orm.fit.

model

causes the model frame to be returned in the fit object

x

causes the expanded design matrix (with missings excluded) to be returned under the name x. For print, an object created by orm.

y

causes the response variable (with missings excluded) to be returned under the name y.

linear.predictors

causes the predicted X beta (with missings excluded) to be returned under the name linear.predictors. The first intercept is used.

se.fit

causes the standard errors of the fitted values (on the linear predictor scale) to be returned under the name se.fit. The middle intercept is used.

penalty

see lrm

penalty.matrix

see lrm

tol

singularity criterion (see orm.fit)

eps

difference in 2log-2 log likelihood for declaring convergence

var.penalty

see lrm

scale

set to TRUE to subtract column means and divide by column standard deviations of the design matrix before fitting, and to back-solve for the un-normalized covariance matrix and regression coefficients. This can sometimes make the model converge for very large sample sizes where for example spline or polynomial component variables create scaling problems leading to loss of precision when accumulating sums of squares and crossproducts.

...

arguments that are passed to orm.fit, or from print, to prModFit. Ignored for Quantile. One of the most important arguments is family.

digits

number of significant digits to use

r2

vector of integers specifying which R^2 measures to print, with 0 for Nagelkerke R^2 and 1:4 corresponding to the 4 measures computed by R2Measures. Default is to print Nagelkerke (labeled R2) and second and fourth R2Measures which are the measures adjusted for the number of predictors, first for the raw sample size then for the effective sample size, which here is from the formula for the approximate variance of a log odds ratio in a proportional odds model.

pg

set to TRUE to print g-indexes

coefs

specify coefs=FALSE to suppress printing the table of model coefficients, standard errors, etc. Specify coefs=n to print only the first n regression coefficients in the model.

intercepts

By default, intercepts are only printed if there are fewer than 10 of them. Otherwise this is controlled by specifying intercepts=FALSE or TRUE.

title

a character string title to be passed to prModFit. Default is constructed from the name of the distribution family.

object

an object created by orm

codes

if TRUE, uses the integer codes 1,2,,k1,2,\ldots,k for the kk-level response in computing the predicted quantile

Value

The returned fit object of orm contains the following components in addition to the ones mentioned under the optional arguments.

call

calling expression

freq

table of frequencies for Y in order of increasing Y

stats

vector with the following elements: number of observations used in the fit, number of unique Y values, median Y from among the observations used int he fit, maximum absolute value of first derivative of log likelihood, model likelihood ratio χ2\chi^2, d.f., PP-value, score χ2\chi^2 statistic (if no initial values given), PP-value, Spearman's ρ\rho rank correlation between the linear predictor and Y, the Nagelkerke R2R^2 index, R2R^2 indexes computed by R2Measures, the gg-index, grgr (the gg-index on the odds ratio scale), and pdmpdm (the mean absolute difference between 0.5 and the predicted probability that YY\geq the marginal median). In the case of penalized estimation, the "Model L.R." is computed without the penalty factor, and "d.f." is the effective d.f. from Gray's (1992) Equation 2.9. The PP-value uses this corrected model L.R. χ2\chi^2 and corrected d.f. The score chi-square statistic uses first derivatives which contain penalty components.

fail

set to TRUE if convergence failed (and maxiter>1) or if a singular information matrix is encountered

coefficients

estimated parameters

var

estimated variance-covariance matrix (inverse of information matrix) for the middle intercept and regression coefficients. See lrm for details if penalization is used.

effective.df.diagonal

see lrm

family

the character string for family. If family was a user-customized list, it must have had an element named name, which is taken as the return value for family here.

trans

a list of functions for the choice of family, with elements cumprob (the cumulative probability distribution function), inverse (inverse of cumprob), deriv (first derivative of cumprob), and deriv2 (second derivative of cumprob)

deviance

-2 log likelihoods (counting penalty components) When an offset variable is present, three deviances are computed: for intercept(s) only, for intercepts+offset, and for intercepts+offset+predictors. When there is no offset variable, the vector contains deviances for the intercept(s)-only model and the model with intercept(s) and predictors.

non.slopes

number of intercepts in model

interceptRef

the index of the middle (median) intercept used in computing the linear predictor and var

penalty

see lrm

penalty.matrix

the penalty matrix actually used in the estimation

info.matrix

a sparse matrix representation of type matrix.csr from the SparseM package. This allows the full information matrix with all intercepts to be stored efficiently, and matrix operations using the Cholesky decomposition to be fast. link{vcov.orm} uses this information to compute the covariance matrix for intercepts other than the middle one.

Author(s)

Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]
For the Quantile function:
Qi Liu and Shengxin Tu
Department of Biostatistics, Vanderbilt University

References

Sall J: A monotone regression smoother based on ordinal cumulative logistic regression, 1991.

Le Cessie S, Van Houwelingen JC: Ridge estimators in logistic regression. Applied Statistics 41:191–201, 1992.

Verweij PJM, Van Houwelingen JC: Penalized likelihood in Cox regression. Stat in Med 13:2427–2436, 1994.

Gray RJ: Flexible methods for analyzing survival data using splines, with applications to breast cancer prognosis. JASA 87:942–951, 1992.

Shao J: Linear model selection by cross-validation. JASA 88:486–494, 1993.

Verweij PJM, Van Houwelingen JC: Crossvalidation in survival analysis. Stat in Med 12:2305–2314, 1993.

Harrell FE: Model uncertainty, penalization, and parsimony. Available from https://hbiostat.org/talks/iscb98.pdf.

See Also

orm.fit, predict.orm, solve, rms.trans, rms, polr, latex.orm, vcov.orm, num.intercepts, residuals.orm, na.delete, na.detail.response, pentrace, rmsMisc, vif, predab.resample, validate.orm, calibrate, Mean.orm, gIndex, prModFit

Examples

require(ggplot2)
set.seed(1)
n <- 100
y <- round(runif(n), 2)
x1 <- sample(c(-1,0,1), n, TRUE)
x2 <- sample(c(-1,0,1), n, TRUE)
f <- lrm(y ~ x1 + x2, eps=1e-5)
g <- orm(y ~ x1 + x2, eps=1e-5)
max(abs(coef(g) - coef(f)))
w <- vcov(g, intercepts='all') / vcov(f) - 1
max(abs(w))

set.seed(1)
n <- 300
x1 <- c(rep(0,150), rep(1,150))
y <- rnorm(n) + 3*x1
g <- orm(y ~ x1)
g
k <- coef(g)
i <- num.intercepts(g)
h <- orm(y ~ x1, family=probit)
ll <- orm(y ~ x1, family=loglog)
cll <- orm(y ~ x1, family=cloglog)
cau <- orm(y ~ x1, family=cauchit)
x <- 1:i
z <- list(logistic=list(x=x, y=coef(g)[1:i]),
          probit  =list(x=x, y=coef(h)[1:i]),
          loglog  =list(x=x, y=coef(ll)[1:i]),
          cloglog =list(x=x, y=coef(cll)[1:i]))
labcurve(z, pl=TRUE, col=1:4, ylab='Intercept')

tapply(y, x1, mean)
m <- Mean(g)
m(w <- k[1] + k['x1']*c(0,1))
mh <- Mean(h)
wh <- coef(h)[1] + coef(h)['x1']*c(0,1)
mh(wh)

qu <- Quantile(g)
# Compare model estimated and empirical quantiles
cq <- function(y) {
   cat(qu(.1, w), tapply(y, x1, quantile, probs=.1), '\n')
   cat(qu(.5, w), tapply(y, x1, quantile, probs=.5), '\n')
   cat(qu(.9, w), tapply(y, x1, quantile, probs=.9), '\n')
   }
cq(y)

# Try on log-normal model
g <- orm(exp(y) ~ x1)
g
k <- coef(g)
plot(k[1:i])
m <- Mean(g)
m(w <- k[1] + k['x1']*c(0,1))
tapply(exp(y), x1, mean)

qu <- Quantile(g)
cq(exp(y))

# Compare predicted mean with ols for a continuous x
set.seed(3)
n <- 200
x1 <- rnorm(n)
y <- x1 + rnorm(n)
dd <- datadist(x1); options(datadist='dd')
f <- ols(y ~ x1)
g <- orm(y ~ x1, family=probit)
h <- orm(y ~ x1, family=logistic)
w <- orm(y ~ x1, family=cloglog)
mg <- Mean(g); mh <- Mean(h); mw <- Mean(w)
r <- rbind(ols      = Predict(f, conf.int=FALSE),
           probit   = Predict(g, conf.int=FALSE, fun=mg),
           logistic = Predict(h, conf.int=FALSE, fun=mh),
           cloglog  = Predict(w, conf.int=FALSE, fun=mw))
plot(r, groups='.set.')

# Compare predicted 0.8 quantile with quantile regression
qu <- Quantile(g)
qu80 <- function(lp) qu(.8, lp)
f <- Rq(y ~ x1, tau=.8)
r <- rbind(probit   = Predict(g, conf.int=FALSE, fun=qu80),
           quantreg = Predict(f, conf.int=FALSE))
plot(r, groups='.set.')

# Verify transformation invariance of ordinal regression
ga <- orm(exp(y) ~ x1, family=probit)
qua <- Quantile(ga)
qua80 <- function(lp) log(qua(.8, lp))
r <- rbind(logprobit = Predict(ga, conf.int=FALSE, fun=qua80),
           probit    = Predict(g,  conf.int=FALSE, fun=qu80))
plot(r, groups='.set.')

# Try the same with quantile regression.  Need to transform x1
fa <- Rq(exp(y) ~ rcs(x1,5), tau=.8)
r <- rbind(qr    = Predict(f, conf.int=FALSE),
           logqr = Predict(fa, conf.int=FALSE, fun=log))
plot(r, groups='.set.')

# Make a plot of Pr(Y >= y) vs. a continuous covariate for 3 levels
# of y and also against a binary covariate
set.seed(1)
n <- 1000
age <- rnorm(n, 50, 15)
sex <- sample(c('m', 'f'), 1000, TRUE)
Y   <- runif(n)
dd  <- datadist(age, sex); options(datadist='dd')
f <- orm(Y ~ age + sex)
# Use ExProb function to derive an R function to compute
# P(Y >= y | X)
ex  <- ExProb(f)
ex1 <- function(x) ex(x, y=0.25)
ex2 <- function(x) ex(x, y=0.5)
ex3 <- function(x) ex(x, y=0.75)
p1  <- Predict(f, age, sex, fun=ex1)
p2  <- Predict(f, age, sex, fun=ex2)
p3  <- Predict(f, age, sex, fun=ex3)
p   <- rbind('P(Y >= 0.25)' = p1,
             'P(Y >= 0.5)'  = p2,
             'P(Y >= 0.75)' = p3)
ggplot(p)

# Make plot with two curves (by sex) with y on the x-axis, and
# estimated P(Y >= y | sex, age=median) on the y-axis
ys <- seq(min(Y), max(Y), length=100)
g <- function(sx) as.vector(ex(y=ys, Predict(f, sex=sx)$yhat)$prob)

d  <- rbind(data.frame(sex='m', y=ys, p=g('m')),
            data.frame(sex='f', y=ys, p=g('f')))
ggplot(d, aes(x=y, y=p, color=sex)) + geom_line() +
  ylab(expression(P(Y >= y))) +
  guides(color=guide_legend(title='Sex')) +
  theme(legend.position='bottom')

options(datadist=NULL)
## Not run: 
## Simulate power and type I error for orm logistic and probit regression
## for likelihood ratio, Wald, and score chi-square tests, and compare
## with t-test
require(rms)
set.seed(5)
nsim <- 2000
r <- NULL
for(beta in c(0, .4)) {
  for(n in c(10, 50, 300)) {
    cat('beta=', beta, '  n=', n, '\n\n')
    plogistic <- pprobit <- plogistics <- pprobits <- plogisticw <-
      pprobitw <- ptt <- numeric(nsim)
    x <- c(rep(0, n/2), rep(1, n/2))
    pb <- setPb(nsim, every=25, label=paste('beta=', beta, '  n=', n))
    for(j in 1:nsim) {
      pb(j)
      y <- beta*x + rnorm(n)
      tt <- t.test(y ~ x)
      ptt[j] <- tt$p.value
      f <- orm(y ~ x)
      plogistic[j]  <- f$stats['P']
      plogistics[j] <- f$stats['Score P']
      plogisticw[j] <- 1 - pchisq(coef(f)['x']^2 / vcov(f)[2,2], 1)
      f <- orm(y ~ x, family=probit)
      pprobit[j]  <- f$stats['P']
      pprobits[j] <- f$stats['Score P']
      pprobitw[j] <- 1 - pchisq(coef(f)['x']^2 / vcov(f)[2,2], 1)
    }
    if(beta == 0) plot(ecdf(plogistic))
    r <- rbind(r, data.frame(beta         = beta, n=n,
                             ttest        = mean(ptt        < 0.05),
                             logisticlr   = mean(plogistic  < 0.05),
                             logisticscore= mean(plogistics < 0.05),
                             logisticwald = mean(plogisticw < 0.05),
                             probit       = mean(pprobit    < 0.05),
                             probitscore  = mean(pprobits   < 0.05),
                             probitwald   = mean(pprobitw   < 0.05)))
  }
}
print(r)
#  beta   n  ttest logisticlr logisticscore logisticwald probit probitscore probitwald
#1  0.0  10 0.0435     0.1060        0.0655        0.043 0.0920      0.0920     0.0820
#2  0.0  50 0.0515     0.0635        0.0615        0.060 0.0620      0.0620     0.0620
#3  0.0 300 0.0595     0.0595        0.0590        0.059 0.0605      0.0605     0.0605
#4  0.4  10 0.0755     0.1595        0.1070        0.074 0.1430      0.1430     0.1285
#5  0.4  50 0.2950     0.2960        0.2935        0.288 0.3120      0.3120     0.3120
#6  0.4 300 0.9240     0.9215        0.9205        0.920 0.9230      0.9230     0.9230

## End(Not run)

Ordinal Regression Model Fitter

Description

Fits ordinal cumulative probability models for continuous or ordinal response variables, efficiently allowing for a large number of intercepts by capitalizing on the information matrix being sparse. Five different distribution functions are implemented, with the default being the logistic (yielding the proportional odds model). Penalized estimation will be implemented in the future. Weights are not implemented. The optimization method is Newton-Raphson with step-halving. Execution time is linear in the number of intercepts.

Usage

orm.fit(x=NULL, y, family='logistic',
        offset=0., initial,  maxit=12L, eps=.005, tol=1e-7, trace=FALSE,
        penalty.matrix=NULL, scale=FALSE, y.precision = 7)

Arguments

x

design matrix with no column for an intercept

y

response vector, numeric, factor, or character. The ordering of levels is assumed from factor(y).

family

the distribution family, corresponding to logistic (the default), Gaussian, Cauchy, Gumbel maximum (exp(exp(x))exp(-exp(-x)); extreme value type I), and Gumbel minimum (1exp(exp(x))1-exp(-exp(x))) distributions. These are the cumulative distribution functions assumed for Prob[YyX]Prob[Y \ge y | X]. The family argument can be an unquoted or a quoted string, e.g. family=loglog or family="loglog". To use a built-in family, the string must be one of the following corresponding to the previous list: logistic, probit, loglog, cloglog, cauchit. The user can also provide her own customized family by setting family to a list with elements cumprob, inverse, deriv, deriv2; see the body of orm.fit for examples. An additional element, name must be given, which is a character string used to name the family for print and latex.

offset

optional numeric vector containing an offset on the logit scale

initial

vector of initial parameter estimates, beginning with the intercepts. If initial is not specified, the function computes the overall score χ2\chi^2 test for the global null hypothesis of no regression.

maxit

maximum no. iterations (default=12).

eps

difference in 2log-2 log likelihood for declaring convergence. Default is .005. If the 2log-2 log likelihood gets worse by eps/10 while the maximum absolute first derivative of

-2 log

likelihood is below 1E-9, convergence is still declared. This handles the case where the initial estimates are MLEs, to prevent endless step-halving.

tol

Singularity criterion. Default is 1e-7

trace

set to TRUE to print -2 log likelihood, step-halving fraction, change in -2 log likelihood, and maximum absolute value of first derivative at each iteration.

penalty.matrix

a self-contained ready-to-use penalty matrix - seelrm

scale

set to TRUE to subtract column means and divide by column standard deviations of x before fitting, and to back-solve for the un-normalized covariance matrix and regression coefficients. This can sometimes make the model converge for very large sample sizes where for example spline or polynomial component variables create scaling problems leading to loss of precision when accumulating sums of squares and crossproducts.

y.precision

When ‘y’ is numeric, values may need to be rounded to avoid unpredictable behavior with unique() with floating-point numbers. Default is to 7 decimal places.

Value

a list with the following components:

call

calling expression

freq

table of frequencies for y in order of increasing y

yunique

vector of sorted unique values of y

stats

vector with the following elements: number of observations used in the fit, number of unique y values, median y from among the observations used in the fit, maximum absolute value of first derivative of log likelihood, model likelihood ratio chi-square, d.f., P-value, score chi-square and its P-value, Spearman's ρ\rho rank correlation between linear predictor and y, the Nagelkerke R2R^2 index, the gg-index, grgr (the gg-index on the ratio scale), and pdmpdm (the mean absolute difference between 0.5 and the estimated probability that yy\geq the marginal median). When penalty.matrix is present, the χ2\chi^2, d.f., and P-value are not corrected for the effective d.f.

fail

set to TRUE if convergence failed (and maxit>1)

coefficients

estimated parameters

var

estimated variance-covariance matrix (inverse of information matrix). Note that in the case of penalized estimation, var is not the improved sandwich-type estimator (which lrm does compute). The only intercept parameter included in the stored object is the middle intercept.

family, trans

see orm

deviance

-2 log likelihoods. When an offset variable is present, three deviances are computed: for intercept(s) only, for intercepts+offset, and for intercepts+offset+predictors. When there is no offset variable, the vector contains deviances for the intercept(s)-only model and the model with intercept(s) and predictors.

non.slopes

number of intercepts in model

interceptRef

the index of the middle (median) intercept used in computing the linear predictor and var

linear.predictors

the linear predictor using the first intercept

penalty.matrix

see above

info.matrix

see orm

Author(s)

Frank Harrell
Department of Biostatistics, Vanderbilt University
[email protected]

See Also

orm, lrm, glm, gIndex, solve

Examples

#Fit an additive logistic model containing numeric predictors age, 
#blood.pressure, and sex, assumed to be already properly coded and 
#transformed
#
# fit <- orm.fit(cbind(age,blood.pressure,sex), death)

Trace AIC and BIC vs. Penalty

Description

For an ordinary unpenalized fit from lrm or ols and for a vector or list of penalties, fits a series of logistic or linear models using penalized maximum likelihood estimation, and saves the effective degrees of freedom, Akaike Information Criterion (AICAIC), Schwarz Bayesian Information Criterion (BICBIC), and Hurvich and Tsai's corrected AICAIC (AICcAIC_c). Optionally pentrace can use the nlminb function to solve for the optimum penalty factor or combination of factors penalizing different kinds of terms in the model. The effective.df function prints the original and effective degrees of freedom for a penalized fit or for an unpenalized fit and the best penalization determined from a previous invocation of pentrace if method="grid" (the default). The effective d.f. is computed separately for each class of terms in the model (e.g., interaction, nonlinear). A plot method exists to plot the results, and a print method exists to print the most pertinent components. Both AICAIC and BICBIC may be plotted if there is only one penalty factor type specified in penalty. Otherwise, the first two types of penalty factors are plotted, showing only the AICAIC.

Usage

pentrace(fit, penalty, penalty.matrix, 
         method=c('grid','optimize'),
         which=c('aic.c','aic','bic'), target.df=NULL,
         fitter, pr=FALSE, tol=1e-7,
         keep.coef=FALSE, complex.more=TRUE, verbose=FALSE, maxit=12,
         subset, noaddzero=FALSE)

effective.df(fit, object)

## S3 method for class 'pentrace'
print(x, ...)

## S3 method for class 'pentrace'
plot(x, method=c('points','image'), 
     which=c('effective.df','aic','aic.c','bic'), pch=2, add=FALSE, 
     ylim, ...)

Arguments

fit

a result from lrm or ols with x=TRUE, y=TRUE and without using penalty or penalty.matrix (or optionally using penalization in the case of effective.df)

penalty

can be a vector or a list. If it is a vector, all types of terms in the model will be penalized by the same amount, specified by elements in penalty, with a penalty of zero automatically added. penalty can also be a list in the format documented in the lrm function, except that elements of the list can be vectors. The expand.grid function is invoked by pentrace to generate all possible combinations of penalties. For example, specifying penalty=list(simple=1:2, nonlinear=1:3) will generate 6 combinations to try, so that the analyst can attempt to determine whether penalizing more complex terms in the model more than the linear or categorical variable terms will be beneficial. If complex.more=TRUE, it is assumed that the variables given in penalty are listed in order from less complex to more complex. With method="optimize" penalty specifies an initial guess for the penalty or penalties. If all term types are to be equally penalized, penalty should be a single number, otherwise it should be a list containing single numbers as elements, e.g., penalty=list(simple=1, nonlinear=2). Experience has shown that the optimization algorithm is more likely to find a reasonable solution when the starting value specified in penalty is too large rather than too small.

object

an object returned by pentrace. For effective.df, object can be omitted if the fit was penalized.

penalty.matrix

see lrm

method

The default is method="grid" to print various indexes for all combinations of penalty parameters given by the user. Specify method="optimize" to have pentrace use nlminb to solve for the combination of penalty parameters that gives the maximum value of the objective named in which, or, if target.df is given, to find the combination that yields target.df effective total degrees of freedom for the model. When target.df is specified, method is set to "optimize" automatically. For plot.pentrace this parameter applies only if more than one penalty term-type was used. The default is to use open triangles whose sizes are proportional to the ranks of the AICs, plotting the first two penalty factors respectively on the x and y axes. Use method="image" to plot an image plot.

which

the objective to maximize for either method. Default is "aic.c" (corrected AIC). For plot.pentrace, which is a vector of names of criteria to show; default is to plot all 4 types, with effective d.f. in its own separate plot

target.df

applies only to method="optimize". See method. target.df makes sense mainly when a single type of penalty factor is specified.

fitter

a fitting function. Default is lrm.fit (lm.pfit is always used for ols).

pr

set to TRUE to print intermediate results

tol

tolerance for declaring a matrix singular (see lrm.fit, solvet)

keep.coef

set to TRUE to store matrix of regression coefficients for all the fits (corresponding to increasing values of penalty) in object Coefficients in the returned list. Rows correspond to penalties, columns to regression parameters.

complex.more

By default if penalty is a list, combinations of penalties for which complex terms are penalized less than less complex terms will be dropped after expand.grid is invoked. Set complex.more=FALSE to allow more complex terms to be penalized less. Currently this option is ignored for method="optimize".

verbose

set to TRUE to print number of intercepts and sum of effective degrees of freedom

maxit

maximum number of iterations to allow in a model fit (default=12). This is passed to the appropriate fitter function with the correct argument name. Increase maxit if you had to when fitting the original unpenalized model.

subset

a logical or integer vector specifying rows of the design and response matrices to subset in fitting models. This is most useful for bootstrapping pentrace to see if the best penalty can be estimated with little error so that variation due to selecting the optimal penalty can be safely ignored when bootstrapping standard errors of regression coefficients and measures of predictive accuracy. See an example below.

noaddzero

set to TRUE to not add an unpenalized model to the list of models to fit

x

a result from pentrace

pch

used for method="points"

add

set to TRUE to add to an existing plot. In that case, the effective d.f. plot is not re-drawn, but the AIC/BIC plot is added to.

ylim

2-vector of y-axis limits for plots other than effective d.f.

...

other arguments passed to plot, lines, or image

Value

a list of class "pentrace" with elements penalty, df, objective, fit, var.adj, diag, results.all, and optionally Coefficients. The first 6 elements correspond to the fit that had the best objective as named in the which argument, from the sequence of fits tried. Here fit is the fit object from fitter which was a penalized fit, diag is the diagonal of the matrix used to compute the effective d.f., and var.adj is Gray (1992) Equation 2.9, which is an improved covariance matrix for the penalized beta. results.all is a data frame whose first few variables are the components of penalty and whose other columns are df, aic, bic, aic.c. results.all thus contains a summary of results for all fits attempted. When method="optimize", only two components are returned: penalty and objective, and the object does not have a class.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
[email protected]

References

Gray RJ: Flexible methods for analyzing survival data using splines, with applications to breast cancer prognosis. JASA 87:942–951, 1992.

Hurvich CM, Tsai, CL: Regression and time series model selection in small samples. Biometrika 76:297–307, 1989.

See Also

lrm, ols, solvet, rmsMisc, image

Examples

n <- 1000    # define sample size
set.seed(17) # so can reproduce the results
age            <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
cholesterol    <- rnorm(n, 200, 25)
sex            <- factor(sample(c('female','male'), n,TRUE))
# Specify population model for log odds that Y=1
L <- .4*(sex=='male') + .045*(age-50) +
  (log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)


f <- lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)),
         x=TRUE, y=TRUE)
p <- pentrace(f, seq(.2,1,by=.05))
plot(p)
p$diag      # may learn something about fractional effective d.f. 
            # for each original parameter
pentrace(f, list(simple=c(0,.2,.4), nonlinear=c(0,.2,.4,.8,1)))


# Bootstrap pentrace 5 times, making a plot of corrected AIC plot with 5 reps
n <- nrow(f$x)
plot(pentrace(f, seq(.2,1,by=.05)), which='aic.c', 
     col=1, ylim=c(30,120)) #original in black
for(j in 1:5)
  plot(pentrace(f, seq(.2,1,by=.05), subset=sample(n,n,TRUE)), 
       which='aic.c', col=j+1, add=TRUE)


# Find penalty giving optimum corrected AIC.  Initial guess is 1.0
# Not implemented yet
# pentrace(f, 1, method='optimize')


# Find penalty reducing total regression d.f. effectively to 5
# pentrace(f, 1, target.df=5)


# Re-fit with penalty giving best aic.c without differential penalization
f <- update(f, penalty=p$penalty)
effective.df(f)

plot.contrast.rms

Description

Plot Bayesian Contrast Posterior Densities

Usage

## S3 method for class 'contrast.rms'
plot(
  x,
  bivar = FALSE,
  bivarmethod = c("ellipse", "kernel"),
  prob = 0.95,
  which = c("both", "diff", "ind"),
  nrow = NULL,
  ncol = NULL,
  ...
)

Arguments

x

the result of contrast.rms

bivar

set to TRUE to plot 2-d posterior density contour

bivarmethod

see rmsb::pdensityContour()

prob

posterior coverage probability for HPD interval or 2-d contour

which

applies when plotting the result of contrast(..., fun=), defaulting to showing the posterior density of both estimates plus their difference. Set to "ind" to only show the two individual densities or "diff" to only show the posterior density for the differences.

nrow

for ggplot2::facet_wrap()

ncol

likewise

...

unused

Details

If there are exactly two contrasts and bivar=TRUE plots an elliptical or kernal (based on bivarmethod posterior density contour with probability prob). Otherwise plots a series of posterior densities of contrasts along with HPD intervals, posterior means, and medians. When the result being plotted comes from contrast with ⁠fun=⁠ specified, both the two individual estimates and their difference are plotted.

Value

ggplot2 object

Author(s)

Frank Harrell


Plot Effects of Variables Estimated by a Regression Model Fit

Description

Uses lattice graphics to plot the effect of one or two predictors on the linear predictor or X beta scale, or on some transformation of that scale. The first argument specifies the result of the Predict function. The predictor is always plotted in its original coding. plot.Predict uses the xYplot function unless formula is omitted and the x-axis variable is a factor, in which case it reverses the x- and y-axes and uses the Dotplot function.

If data is given, a rug plot is drawn showing the location/density of data values for the xx-axis variable. If there is a groups (superposition) variable that generated separate curves, the data density specific to each class of points is shown. This assumes that the second variable was a factor variable. The rug plots are drawn by scat1d. When the same predictor is used on all xx-axes, and multiple panels are drawn, you can use subdata to specify an expression to subset according to other criteria in addition.

To plot effects instead of estimates (e.g., treatment differences as a function of interacting factors) see contrast.rms and summary.rms.

pantext creates a lattice panel function for including text such as that produced by print.anova.rms inside a panel or in a base graphic.

Usage

## S3 method for class 'Predict'
plot(x, formula, groups=NULL,
     cond=NULL, varypred=FALSE, subset,
     xlim, ylim, xlab, ylab, 
     data=NULL, subdata, anova=NULL, pval=FALSE, cex.anova=.85,
     col.fill=gray(seq(.825, .55, length=5)),
     adj.subtitle, cex.adj, cex.axis, perim=NULL, digits=4, nlevels=3,
     nlines=FALSE, addpanel, scat1d.opts=list(frac=0.025, lwd=0.3),
     type=NULL, yscale=NULL, scaletrans=function(z) z, ...)

pantext(object, x, y, cex=.5, adj=0, fontfamily="Courier", lattice=TRUE)

Arguments

x

a data frame created by Predict, or for pantext the x-coordinate for text

formula

the right hand side of a lattice formula reference variables in data frame x. You may not specify formula if you varied multiple predictors separately when calling Predict. Otherwise, when formula is not given, plot.Predict constructs one from information in x.

groups

an optional name of one of the variables in x that is to be used as a grouping (superpositioning) variable. Note that groups does not contain the groups data as is customary in lattice; it is only a single character string specifying the name of the grouping variable.

cond

when plotting effects of different predictors, cond is a character string that specifies a single variable name in x that can be used to form panels. Only applies if using rbind to combine several Predict results.

varypred

set to TRUE if x is the result of passing multiple Predict results, that represent different predictors, to rbind.Predict. This will cause the .set. variable created by rbind to be copied to the .predictor. variable.

subset

a subsetting expression for restricting the rows of x that are used in plotting. For example, predictions may have been requested for males and females but one wants to plot only females.

xlim

This parameter is seldom used, as limits are usually controlled with Predict. One reason to use xlim is to plot a factor variable on the x-axis that was created with the cut2 function with the levels.mean option, with val.lev=TRUE specified to plot.Predict. In this case you may want the axis to have the range of the original variable values given to cut2 rather than the range of the means within quantile groups.

ylim

Range for plotting on response variable axis. Computed by default.

xlab

Label for x-axis. Default is one given to asis, rcs, etc., which may have been the "label" attribute of the variable.

ylab

Label for y-axis. If fun is not given, default is "log Odds" for lrm, "log Relative Hazard" for cph, name of the response variable for ols, TRUE or log(TRUE) for psm, or "X * Beta" otherwise.

data

a data frame containing the original raw data on which the regression model were based, or at least containing the xx-axis and grouping variable. If data is present and contains the needed variables, the original data are added to the graph in the form of a rug plot using scat1d.

subdata

if data is specified, an expression to be evaluated in the data environment that evaluates to a logical vector specifying which observations in data to keep. This will be intersected with the criterion for the groups variable. Example: if conditioning on two paneling variables using |a*b you can specify subdata=b==levels(b)[which.packet()[2]], where the 2 comes from the fact that b was listed second after the vertical bar (this assumes b is a factor in data. Another example: subdata=sex==c('male','female')[current.row()].

anova

an object returned by anova.rms. If anova is specified, the overall test of association for predictor plotted is added as text to each panel, located at the spot at which the panel is most empty unless there is significant empty space at the top or bottom of the panel; these areas are given preference.

pval

specify pval=TRUE for anova to include not only the test statistic but also the P-value

cex.anova

character size for the test statistic printed on the panel

col.fill

a vector of colors used to fill confidence bands for successive superposed groups. Default is inceasingly dark gray scale.

adj.subtitle

Set to FALSE to suppress subtitling the graph with the list of settings of non-graphed adjustment values.

cex.adj

cex parameter for size of adjustment settings in subtitles. Default is 0.75 times par("cex").

cex.axis