Title: | Modelling Dispersion in GLM |
---|---|
Description: | Functions for estimating Gaussian dispersion regression models (Aitkin, 1987 <doi:10.2307/2347792>), overdispersed binomial logit models (Williams, 1987 <doi:10.2307/2347977>), and overdispersed Poisson log-linear models (Breslow, 1984 <doi:10.2307/2347661>), using a quasi-likelihood approach. |
Authors: | Luca Scrucca [aut, cre] |
Maintainer: | Luca Scrucca <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.2 |
Built: | 2024-11-09 06:25:49 UTC |
Source: | CRAN |
This function estimates overdispersed binomial logit models using the approach discussed by Williams (1982).
glm.binomial.disp(object, maxit = 30, verbose = TRUE)
glm.binomial.disp(object, maxit = 30, verbose = TRUE)
object |
an object of class |
maxit |
integer giving the maximal number of iterations for the model fitting procedure. |
verbose |
logical, if |
Extra-binomial variation in logistic linear models is discussed, among others, in Collett (1991). Williams (1982) proposed a quasi-likelihood approach for handling overdispersion in logistic regression models.
Suppose we observe the number of successes in
trials, for
, such that
Under this model, each of the binomial observations has a different probability of success
, where
is a random draw from a Beta distribution. Thus,
Assuming and
, the Beta density is zero at the extreme values of zero and one, and thus
. From this, the unconditional mean and variance can be calculated:
so unless or
, the unconditional variance of
is larger than binomial variance.
Identical expressions for the mean and variance of can be obtained if we assume that the
counts on the i-th unit are dependent, with the same correlation
. In this case,
.
The method proposed by Williams uses an iterative algorithm for estimating the dispersion parameter and hence the necessary weights
(for details see Williams, 1982).
The function returns an object of class "glm"
with the usual information and the added components:
dispersion |
the estimated dispersion parameter. |
disp.weights |
the final weights used to fit the model. |
Based on a similar procedure available in Arc (Cook and Weisberg, http://www.stat.umn.edu/arc)
Collett, D. (1991), Modelling Binary Data, London: Chapman and Hall.
Williams, D. A. (1982), Extra-binomial variation in logistic linear models, Applied Statistics, 31, 144–148.
lm
, glm
, lm.disp
, glm.poisson.disp
data(orobanche) mod <- glm(cbind(germinated, seeds-germinated) ~ host*variety, data = orobanche, family = binomial(logit)) summary(mod) mod.disp <- glm.binomial.disp(mod) summary(mod.disp) mod.disp$dispersion
data(orobanche) mod <- glm(cbind(germinated, seeds-germinated) ~ host*variety, data = orobanche, family = binomial(logit)) summary(mod) mod.disp <- glm.binomial.disp(mod) summary(mod.disp) mod.disp$dispersion
This function estimates overdispersed Poisson log-linear models using the approach discussed by Breslow N.E. (1984).
glm.poisson.disp(object, maxit = 30, verbose = TRUE)
glm.poisson.disp(object, maxit = 30, verbose = TRUE)
object |
an object of class |
maxit |
integer giving the maximal number of iterations for the model fitting procedure. |
verbose |
logical, if |
Breslow (1984) proposed an iterative algorithm for fitting overdispersed Poisson log-linear models. The method is similar to that proposed by Williams (1982) for handling overdispersion in logistic regression models (see glm.binomial.disp
).
Suppose we observe independent responses such that
for .
The response variable
may be an event counts variable observed over a period of time (or in the space) of length
, whereas
is the rate parameter.
Then,
where is an offset and
expresses the dependence of the Poisson rate parameter on a set of, say
, predictors. If the periods of time are all of the same length, we can set
for all
so the offset is zero.
The Poisson distribution has , but it may happen that the actual variance exceeds the nominal variance under the assumed probability model.
Suppose that is a random variable distributed according to
where and
. Thus, it can be shown that the unconditional mean and variance of
are given by
and
Hence, for we have overdispersion. It is interesting to note that the same mean and variance arise also if we assume a negative binomial distribution for the response variable.
The method proposed by Breslow uses an iterative algorithm for estimating the dispersion parameter and hence the necessary weights
(for details see Breslow, 1984).
The function returns an object of class "glm"
with the usual information and the added components:
dispersion |
the estimated dispersion parameter. |
disp.weights |
the final weights used to fit the model. |
Based on a similar procedure available in Arc (Cook and Weisberg, http://www.stat.umn.edu/arc)
Breslow, N.E. (1984), Extra-Poisson variation in log-linear models, Applied Statistics, 33, 38–44.
lm
, glm
, lm.disp
, glm.binomial.disp
## Salmonella TA98 data data(salmonellaTA98) salmonellaTA98 <- within(salmonellaTA98, logx10 <- log(x+10)) mod <- glm(y ~ logx10 + x, data = salmonellaTA98, family = poisson(log)) summary(mod) mod.disp <- glm.poisson.disp(mod) summary(mod.disp) mod.disp$dispersion # compute predictions on a grid of x-values... x0 <- with(salmonellaTA98, seq(min(x), max(x), length=50)) eta0 <- predict(mod, newdata = data.frame(logx10 = log(x0+10), x = x0), se=TRUE) eta0.disp <- predict(mod.disp, newdata = data.frame(logx10 = log(x0+10), x = x0), se=TRUE) # ... and plot the mean functions with variability bands plot(y ~ x, data = salmonellaTA98) lines(x0, exp(eta0$fit)) lines(x0, exp(eta0$fit+2*eta0$se), lty=2) lines(x0, exp(eta0$fit-2*eta0$se), lty=2) lines(x0, exp(eta0.disp$fit), col=3) lines(x0, exp(eta0.disp$fit+2*eta0.disp$se), lty=2, col=3) lines(x0, exp(eta0.disp$fit-2*eta0.disp$se), lty=2, col=3) ## Holford's data data(holford) mod <- glm(incid ~ offset(log(pop)) + Age + Cohort, data = holford, family = poisson(log)) summary(mod) mod.disp <- glm.poisson.disp(mod) summary(mod.disp) mod.disp$dispersion
## Salmonella TA98 data data(salmonellaTA98) salmonellaTA98 <- within(salmonellaTA98, logx10 <- log(x+10)) mod <- glm(y ~ logx10 + x, data = salmonellaTA98, family = poisson(log)) summary(mod) mod.disp <- glm.poisson.disp(mod) summary(mod.disp) mod.disp$dispersion # compute predictions on a grid of x-values... x0 <- with(salmonellaTA98, seq(min(x), max(x), length=50)) eta0 <- predict(mod, newdata = data.frame(logx10 = log(x0+10), x = x0), se=TRUE) eta0.disp <- predict(mod.disp, newdata = data.frame(logx10 = log(x0+10), x = x0), se=TRUE) # ... and plot the mean functions with variability bands plot(y ~ x, data = salmonellaTA98) lines(x0, exp(eta0$fit)) lines(x0, exp(eta0$fit+2*eta0$se), lty=2) lines(x0, exp(eta0$fit-2*eta0$se), lty=2) lines(x0, exp(eta0.disp$fit), col=3) lines(x0, exp(eta0.disp$fit+2*eta0.disp$se), lty=2, col=3) lines(x0, exp(eta0.disp$fit-2*eta0.disp$se), lty=2, col=3) ## Holford's data data(holford) mod <- glm(incid ~ offset(log(pop)) + Age + Cohort, data = holford, family = poisson(log)) summary(mod) mod.disp <- glm.poisson.disp(mod) summary(mod.disp) mod.disp$dispersion
Holford's data on prostatic cancer deaths and mid-period population denominators for non-whites in the US by age and calendar period. Thirteen birth cohorts from 1855-59 through to 1915-19 are represented in at least one of seven 5-year age groups (50-54 through to 80-84) and one of the seven 5-year calendar periods (1935-39 through to 1965-69) for which data are provided.
data(minitab)
data(minitab)
This data frame contains the following columns:
number ofd prostatic cancer deaths.
mid-period population counts.
age groups.
calendar periods.
cohorts.
Holford, T.R. (1983) The estimation of age, period and cohort effects for vital rates. Biometrics, 39, 311–324.
Breslow, N.E. (1984), Extra-Poisson variation in log-linear models, Applied Statistics, 33, 38–44.
This function estimates Gaussian dispersion regression models.
lm.disp(formula, var.formula, data = list(), maxit = 30, epsilon = glm.control()$epsilon, subset, na.action = na.omit, contrasts = NULL, offset = NULL)
lm.disp(formula, var.formula, data = list(), maxit = 30, epsilon = glm.control()$epsilon, subset, na.action = na.omit, contrasts = NULL, offset = NULL)
formula |
a symbolic description of the mean function of the model to be fit. For the details of model formula specification see |
var.formula |
a symbolic description of the variance function of the model to be fit. This must be a one-sided formula; if omitted the same terms used for the mean function are used. For the details of model formula specification see |
data |
an optional data frame containing the variables in the model. By default the variables are taken from |
maxit |
integer giving the maximal number of iterations for the model fitting procedure. |
epsilon |
tolerance value for checking convergence. See |
subset |
an optional vector specifying a subset of observations to be used in the fitting process. |
na.action |
a function which indicates what should happen when the data contain |
contrasts |
an optional list as described in the |
offset |
this can be used to specify an a priori known component to be included in the linear predictor during fitting. An |
Gaussian dispersion models allow to model variance heterogeneity in Gaussian regression analysis using a log-linear model for the variance.
Suppose a response is modelled as a function of a set of
predictors
through the linear model
where under homogeneity.
Variance heterogeneity is modelled as
where may contain some or all the variables in
and other variables not included in
;
is however assumed to contain a constant term.
The full model can be expressed as
and it is fitted by maximum likelihood following the algorithm described in Aitkin (1987).
lm.dispmod()
returns an object of class "dispmod"
.
The summary
method can be used to obtain and print a summary of the results.
An object of class "dispmod"
is a list containing the following components:
call |
the matched call. |
mean |
an object of class |
var |
an object of class |
initial.deviance |
the value of the deviance at the beginning of the iterative procedure, i.e. assuming constant variance. |
deviance |
the value of the deviance at the end of the iterative procedure. |
Based on a similar procedure available in Arc (Cook and Weisberg, http://www.stat.umn.edu/arc)
Aitkin, M. (1987), Modelling variance heterogeneity in normal regression models using GLIM, Applied Statistics, 36, 332–339.
lm
, glm
, glm.binomial.disp
, glm.poisson.disp
, ncvTest
.
data(minitab) minitab <- within(minitab, y <- V^(1/3) ) mod <- lm(y ~ H + D, data = minitab) summary(mod) mod.disp1 <- lm.disp(y ~ H + D, data = minitab) summary(mod.disp1) mod.disp2 <- lm.disp(y ~ H + D, ~ H, data = minitab) summary(mod.disp2) # Likelihood ratio test deviances <- c(mod.disp1$initial.deviance, mod.disp2$deviance, mod.disp1$deviance) lrt <- c(NA, abs(diff(deviances))) cbind(deviances, lrt, p.value = 1-pchisq(lrt, 1)) # quadratic dispersion model on D (as discussed by Aitkin) mod.disp4 <- lm.disp(y ~ H + D, ~ D + I(D^2), data = minitab) summary(mod.disp4) r <- mod$residuals phi.est <- mod.disp4$var$fitted.values plot(minitab$D, log(r^2)) lines(minitab$D, log(phi.est))
data(minitab) minitab <- within(minitab, y <- V^(1/3) ) mod <- lm(y ~ H + D, data = minitab) summary(mod) mod.disp1 <- lm.disp(y ~ H + D, data = minitab) summary(mod.disp1) mod.disp2 <- lm.disp(y ~ H + D, ~ H, data = minitab) summary(mod.disp2) # Likelihood ratio test deviances <- c(mod.disp1$initial.deviance, mod.disp2$deviance, mod.disp1$deviance) lrt <- c(NA, abs(diff(deviances))) cbind(deviances, lrt, p.value = 1-pchisq(lrt, 1)) # quadratic dispersion model on D (as discussed by Aitkin) mod.disp4 <- lm.disp(y ~ H + D, ~ D + I(D^2), data = minitab) summary(mod.disp4) r <- mod$residuals phi.est <- mod.disp4$var$fitted.values plot(minitab$D, log(r^2)) lines(minitab$D, log(phi.est))
Data on 31 black cherry trees sampled from the Allegheny Natinoal Forest, Pennsylvania.
data(minitab)
data(minitab)
This data frame contains the following columns:
diameter 4.5 feet of the ground, inches
height of the tree, feet
marketable volume of wood, cubic feet
Ryan, T.A., Joiner, B.L. and Ryan, B.F. (1976) Minitab Student Handbook. N. Scituate, MA: Duxbury.
Cook, R.D. and Weisberg, S. (1982) Residuals and Influence in Regression, New York: Chapman and Hall, p. 66.
Orobanche, commonly known as broomrape, is a genus of parasitic plants with chlorophyll that grow on the roots of flowering plants. Batches of seeds of two varieties of the plant were were brushed onto a plate of diluted extract of bean or cucumber, and the number germinating were recorded.
data(orobanche)
data(orobanche)
This data frame contains the following columns:
Number germinated
Number of seeds
Slide number
Host type
Variety name
Crowder, M.J. (1978) Beta-binomial anova for proportions. Applied Statistics, 27, 34–37.
Collett, D. (1991) Modelling Binary Data, London: Chapman and Hall, Chapter 6.
Data on Ames Salmonella reverse mutagenicity assay.
data(salmonellaTA98)
data(salmonellaTA98)
This data frame contains the following columns:
dose levels of quinoline
numbers of revertant colonies of TA98 Salmonella observed on each of three replicate plates testes at each of six dose levels of quinolinediameter 4.5 feet of the ground, inches
Margolin, B.J., Kaplan, N. and Zeiger, E. (1981) Statistical analysis of the Ames Salmonella/microsome test, Proc. Natl. Acad. Sci. USA, 76, 3779–3783.
Breslow, N.E. (1984), Extra-Poisson variation in log-linear models, Applied Statistics, 33, 38–44.