Title: | Regression Model Diagnostics for Survey Data |
---|---|
Description: | Diagnostics for fixed effects linear and general linear regression models fitted with survey data. Extensions of standard diagnostics to complex survey data are included: standardized residuals, leverages, Cook's D, dfbetas, dffits, condition indexes, and variance inflation factors as found in Li and Valliant (Surv. Meth., 2009, 35(1), pp. 15-24; Jnl. of Off. Stat., 2011, 27(1), pp. 99-119; Jnl. of Off. Stat., 2015, 31(1), pp. 61-75); Liao and Valliant (Surv. Meth., 2012, 38(1), pp. 53-62; Surv. Meth., 2012, 38(2), pp. 189-202). Variance inflation factors and condition indexes are also computed for some general linear models as described in Liao (U. Maryland thesis, 2010). |
Authors: | Richard Valliant [aut, cre] |
Maintainer: | Richard Valliant <[email protected]> |
License: | GPL-3 |
Version: | 0.7 |
Built: | 2024-12-25 07:13:14 UTC |
Source: | CRAN |
Demographic and dietary intake variables from a U.S. national household survey
data(nhanes2007)
data(nhanes2007)
A data frame with 4,329 person-level observations on the following 26 variables measuring 24-hour dietary recall. See https://wwwn.cdc.gov/nchs/nhanes/2013-2014/DR2IFF_H.htm for more details about the variables.
SEQN
Identification variable
SDMVSTRA
Stratum
SDMVPSU
Primary sampling unit, numbered within each stratum (1,2)
WTDRD1
Dietary day 1 sample weight
GENDER
Gender (0 = female; 1 = male)
RIDAGEYR
Age in years at the time of the screening interview; reported for survey participants between the ages of 1 and 79 years of age. All responses of participants aged 80 years and older are coded as 80.
RIDRETH1
Race/Hispanic origin (1 = Mexican American; 2 = Other Hispanic; 3 = Non-Hispanic White; 4 = Non-Hispanic Black; 5 = Other Race including multiracial)
BMXWT
Body weight (kg)
BMXBMI
Body mass Index ((weight in kg) / (height in meters)**2)
DIET
On any diet (0 = No; 1 = Yes)
CALDIET
On a low-calorie diet (0 = No; 1 = Yes)
FATDIET
On a low-fat diet (0 = No; 1 = Yes)
CARBDIET
On a low-carbohydrate diet (0 = No; 1 = Yes)
DR1DRSTZ
Dietary recall status that indicates quality and completeness of survey participant's response to dietary recall section. (1 = Reliable and met the minimum criteria; 2 = Not reliable or not met the minimum criteria; 4 = Reported consuming breast-milk (infants and children only))
DR1TKCAL
Energy (kcal)
DR1TPROT
Protein (gm)
DR1TCARB
Carbohydrate (gm)
DR1TSUGR
Total sugars (gm)
DR1TFIBE
Dietary fiber (gm)
DR1TTFAT
Total fat (gm)
DR1TSFAT
Total saturated fatty acids (gm)
DR1TMFAT
Total monounsaturated fatty acids (gm)
DR1TPFAT
Total polyunsaturated fatty acids (gm)
DR1TCAFF
Caffeine (mg)
DR1TALCO
Alcohol (gm)
DR1_320Z
Total plain water drank yesterday (gm)
The National Health and Nutrition Examination Survey (NHANES) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The survey is unique in that it combines interviews and physical examinations. The nhis2007
data set contains observations for 4,329 persons collected in 2007-2008.
National Health and Nutrition Examination Survey of 2007-2008 conducted by the U.S. National Center for Health Statistics. https://www.cdc.gov/nchs/nhanes.htm
data(nhanes2007) str(nhanes2007) summary(nhanes2007)
data(nhanes2007) str(nhanes2007) summary(nhanes2007)
Compute condition indexes and variance decompositions for diagnosing collinearity in fixed effects, general linear regression models fitted with data collected from one- and two-stage complex survey designs.
svycollinear(mobj, X, w, sc=TRUE, rnd=3, fuzz=0.05)
svycollinear(mobj, X, w, sc=TRUE, rnd=3, fuzz=0.05)
mobj |
model object produced by |
X |
|
w |
|
sc |
|
rnd |
Round the output to |
fuzz |
Replace any variance decomposition proportions that are less than |
svycollinear
computes condition indexes and variance decomposition proportions to use for diagnosing collinearity in a general linear model fitted from complex survey data as discussed in Liao (2010, ch. 5) and Liao and Valliant (2012). All measures are based on where
is the diagonal matrix of survey weights,
is a diagonal matrix of estimated parameters from the particular type of GLM, and X is the
matrix of covariates. In a full-rank model with p covariates, there are p condition indexes, defined as the ratio of the maximum eigenvalue of
to each of the p eigenvalues. If
sc=TRUE
, before computing condition indexes, as recommended by Belsley (1991), the columns are normalized by their individual Euclidean norms, , so that each column has unit length. The columns are not centered around their means because that can obscure near-dependencies between the intercept and other covariates (Belsley 1984).
Variance decompositions are for the variance of each estimated regression coefficient and are based on a singular value decomposition of the variance formula. For linear models, the decomposition is for the sandwich variance estimator, which has both a model-based and design-based interpretation. In the case of nonlinear GLMs (i.e., family
is not gaussian
), the variance is the approximate model variance. Proportions of the model variance, , associated with each column of
are displayed in an output matrix described below.
data frame,
. The first column gives the condition indexes of
. Values of 10 or more are usually considered to potentially signal collinearity of two or more columns of
. The remaining columns give the proportions (within columns) of variance of each estimated regression coefficient associated with a singular value decomposition into p terms. Columns
will each approximately sum to 1. When
family=gaussian
, some ‘proportions’ can be negative or greater than 1 due to the nature of the variance decomposition (see Liao and Valliant, 2012). For other families the proportions will be in [0,1]. If two proportions in a given row of are relatively large and its associated condition index in that row in the first column of
is also large, then near dependencies between the covariates associated with those elements are influencing the regression coefficient estimates.
Richard Valliant
Belsley, D.A., Kuh, E. and Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley-Interscience.
Belsley, D.A. (1984). Demeaning conditioning diagnostics through centering. The American Statistician, 38(2), 73-77.
Belsley, D.A. (1991). Conditioning Diagnostics, Collinearity, and Weak Data in Regression. New York: John Wiley & Sons, Inc.
Liao, D. (2010). Collinearity Diagnostics for Complex Survey Data. PhD thesis, University of Maryland. http://hdl.handle.net/1903/10881.
Liao, D, and Valliant, R. (2012). Condition indexes and variance decompositions for diagnosing collinearity in linear model analysis of survey data. Survey Methodology, 38, 189-202.
Lumley, T. (2010). Complex Surveys. New York: John Wiley & Sons.
Lumley, T. (2023). survey: analysis of complex survey samples. R package version 4.2.
require(survey) # example from svyglm help page data(api) dstrat <- svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc) # linear model m1 <- svyglm(api00 ~ ell + meals + mobility, design=dstrat) X.model <- model.matrix(~ ell + meals + mobility, data = apistrat) # send model object from svyglm svycollinear(mobj=m1, X=X.model, w=apistrat$pw, sc=TRUE, rnd=3, fuzz= 0.05) # logistic model data(nhanes2007) nhanes2007$obese <- nhanes2007$BMXBMI >= 30 nhanes.dsgn <- svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTDRD1, nest=TRUE, data=nhanes2007) m2 <- svyglm(obese ~ RIDAGEYR + as.factor(RIDRETH1) + DR1TKCAL + DR1TTFAT + DR1TMFAT, design=nhanes.dsgn, family=quasibinomial()) X.model <- model.matrix(~ RIDAGEYR + as.factor(RIDRETH1) + DR1TKCAL + DR1TTFAT + DR1TMFAT, data = data.frame(nhanes2007)) svycollinear(mobj=m2, X=X.model, w=nhanes2007$WTDRD1, sc=TRUE, rnd=2, fuzz=0.05)
require(survey) # example from svyglm help page data(api) dstrat <- svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc) # linear model m1 <- svyglm(api00 ~ ell + meals + mobility, design=dstrat) X.model <- model.matrix(~ ell + meals + mobility, data = apistrat) # send model object from svyglm svycollinear(mobj=m1, X=X.model, w=apistrat$pw, sc=TRUE, rnd=3, fuzz= 0.05) # logistic model data(nhanes2007) nhanes2007$obese <- nhanes2007$BMXBMI >= 30 nhanes.dsgn <- svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTDRD1, nest=TRUE, data=nhanes2007) m2 <- svyglm(obese ~ RIDAGEYR + as.factor(RIDRETH1) + DR1TKCAL + DR1TTFAT + DR1TMFAT, design=nhanes.dsgn, family=quasibinomial()) X.model <- model.matrix(~ RIDAGEYR + as.factor(RIDRETH1) + DR1TKCAL + DR1TTFAT + DR1TMFAT, data = data.frame(nhanes2007)) svycollinear(mobj=m2, X=X.model, w=nhanes2007$WTDRD1, sc=TRUE, rnd=2, fuzz=0.05)
Compute a modified Cook's D for fixed effects, linear regression models fitted with data collected from one- and two-stage complex survey designs.
svyCooksD(mobj, stvar=NULL, clvar=NULL, doplot=FALSE)
svyCooksD(mobj, stvar=NULL, clvar=NULL, doplot=FALSE)
mobj |
model object produced by |
stvar |
name of the stratification variable in the |
clvar |
name of the cluster variable in the |
doplot |
if |
svyCooksD
computes the modified Cook's D (m-cook; see Atkinson (1982) and Li & Valliant (2011, 2015)) which measures the effect on the vector of parameter estimates of deleting single observations when fitting a fixed effects regression model to complex survey data. The function svystdres
is called for some of the calculations. Values of m-cook are considered large if they are greater than 2 or 3. The R package MASS
must also be loaded before calling svyCooksD
. The output is a vector of the m-cook values and a scatterplot of them versus the sequence number of the sample element used in fitting the model. By default, svyglm
uses only complete cases (i.e., ones for which the dependent variable and all independent variables are non-missing) to fit the model. The rows of the data frame used in fitting the model can be retrieved from the svyglm
object via as.numeric(names(mobj$y))
. The data for those rows is in mobj$data
.
Numeric vector whose names are the rows of the data frame in the svydesign
object that were used in fitting the model
Richard Valliant
Atkinson, A.C. (1982). Regression diagnostics, transformations and constructed variables (with discussion). Journal of the Royal Statistical Society, Series B, Methodological, 44, 1-36.
Cook, R.D. (1977). Detection of Influential Observation in Linear Regression. Technometrics, 19, 15-18.
Cook, R.D. and Weisberg, S. (1982). Residuals and Influence in Regression. London:Chapman & Hall Ltd.
Li, J., and Valliant, R. (2011). Linear regression diagnostics for unclustered survey data. Journal of Official Statistics, 27, 99-119.
Li, J., and Valliant, R. (2015). Linear regression diagnostics in cluster samples. Journal of Official Statistics, 31, 61-75.
Lumley, T. (2010). Complex Surveys. New York: John Wiley & Sons.
Lumley, T. (2023). survey: analysis of complex survey samples. R package version 4.2.
svydfbetas
, svydffits
, svystdres
require(MASS) # to get ginv require(survey) data(api) # unstratified design single stage design d0 <- svydesign(id=~1,strata=NULL, weights=~pw, data=apistrat) m0 <- svyglm(api00 ~ ell + meals + mobility, design=d0) mcook <- svyCooksD(m0, doplot=TRUE) # stratified clustered design require(NHANES) data(NHANESraw) dnhanes <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTINT2YR, nest=TRUE, data=NHANESraw) m2 <- svyglm(BPDiaAve ~ as.factor(Race1) + BMI + AlcoholYear, design = dnhanes) mcook <- svyCooksD(mobj=m2, stvar="SDMVSTRA", clvar="SDMVPSU", doplot=TRUE)
require(MASS) # to get ginv require(survey) data(api) # unstratified design single stage design d0 <- svydesign(id=~1,strata=NULL, weights=~pw, data=apistrat) m0 <- svyglm(api00 ~ ell + meals + mobility, design=d0) mcook <- svyCooksD(m0, doplot=TRUE) # stratified clustered design require(NHANES) data(NHANESraw) dnhanes <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTINT2YR, nest=TRUE, data=NHANESraw) m2 <- svyglm(BPDiaAve ~ as.factor(Race1) + BMI + AlcoholYear, design = dnhanes) mcook <- svyCooksD(mobj=m2, stvar="SDMVSTRA", clvar="SDMVPSU", doplot=TRUE)
Compute the dfbetas measure of the effect of extreme observations on parameter estimates for fixed effects, linear regression models fitted with data collected from one- and two-stage complex survey designs.
svydfbetas(mobj, stvar=NULL, clvar=NULL, z=3)
svydfbetas(mobj, stvar=NULL, clvar=NULL, z=3)
mobj |
model object produced by |
stvar |
name of the stratification variable in the |
clvar |
name of the cluster variable in the |
z |
numerator of cutoff for measuring whether an observation has an extreme effect on its own predicted value; default is 3 but can be adjusted to control how many observations are flagged for inspection |
svydfbetas
computes the values of dfbetas for each observation and parameter estimate, i.e., the amount that a parameter estimate changes when the unit is deleted from the sample. The model object must be created by svyglm
in the R survey
package. The output is a vector of the dfbeta and standardized dfbetas values. By default, svyglm
uses only complete cases (i.e., ones for which the dependent variable and all independent variables are non-missing) to fit the model. The rows of the data frame used in fitting the model can be retrieved from the svyglm
object via as.numeric(names(mobj$y))
. The data for those rows is in mobj$data
.
List object with values:
Dfbeta |
Numeric vector of unstandardized dfbeta values whose names are the rows of the data frame in the |
Dfbetas |
Numeric vector of standardized dfbetas values whose names are the rows of the data frame in the |
cutoff |
Value used for gauging whether a value of dffits is large. For a single-stage sample, |
Richard Valliant
Li, J., and Valliant, R. (2011). Linear regression diagnostics for unclustered survey data. Journal of Official Statistics, 27, 99-119.
Li, J., and Valliant, R. (2015). Linear regression diagnostics in cluster samples. Journal of Official Statistics, 31, 61-75.
Lumley, T. (2010). Complex Surveys. New York: John Wiley & Sons.
Lumley, T. (2023). survey: analysis of complex survey samples. R package version 4.2.
require(survey) data(api) # unstratified design single stage design d0 <- svydesign(id=~1,strata=NULL, weights=~pw, data=apistrat) m0 <- svyglm(api00 ~ ell + meals + mobility, design=d0) svydfbetas(mobj=m0) # stratified cluster require(NHANES) data(NHANESraw) dnhanes <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTINT2YR, nest=TRUE, data=NHANESraw) m2 <- svyglm(BPDiaAve ~ as.factor(Race1) + BMI + AlcoholYear, design = dnhanes) yy <- svydfbetas(mobj=m2, stvar= "SDMVSTRA", clvar="SDMVPSU") apply(abs(yy$Dfbetas) > yy$cutoff,1, sum)
require(survey) data(api) # unstratified design single stage design d0 <- svydesign(id=~1,strata=NULL, weights=~pw, data=apistrat) m0 <- svyglm(api00 ~ ell + meals + mobility, design=d0) svydfbetas(mobj=m0) # stratified cluster require(NHANES) data(NHANESraw) dnhanes <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTINT2YR, nest=TRUE, data=NHANESraw) m2 <- svyglm(BPDiaAve ~ as.factor(Race1) + BMI + AlcoholYear, design = dnhanes) yy <- svydfbetas(mobj=m2, stvar= "SDMVSTRA", clvar="SDMVPSU") apply(abs(yy$Dfbetas) > yy$cutoff,1, sum)
Compute the dffits measure of the effect of extreme observations on predicted values for fixed effects, linear regression models fitted with data collected from one- and two-stage complex survey designs.
svydffits(mobj, stvar=NULL, clvar=NULL, z=3)
svydffits(mobj, stvar=NULL, clvar=NULL, z=3)
mobj |
model object produced by |
stvar |
name of the stratification variable in the |
clvar |
name of the cluster variable in the |
z |
numerator of cutoff for measuring whether an observation has an extreme effect on its own predicted value; default is 3 but can be adjusted to control how many observations are flagged for inspection |
svydffits
computes the value of dffits for each observation, i.e., the amount that a unit's predicted value changes when the unit is deleted from the sample. The model object must be created by svyglm
in the R survey
package. The output is a vector of the dffit and standardized dffits values. By default, svyglm
uses only complete cases (i.e., ones for which the dependent variable and all independent variables are non-missing) to fit the model. The rows of the data frame used in fitting the model can be retrieved from the svyglm
object via as.numeric(names(mobj$y))
. The data for those rows is in mobj$data
.
List object with values:
Dffit |
Numeric vector of unstandardized dffit values whose names are the rows of the data frame in the |
Dffits |
Numeric vector of standardized dffits values whose names are the rows of the data frame in the |
cutoff |
Value used for gauging whether a value of dffits is large. For a single-stage sample, |
Richard Valliant
Li, J., and Valliant, R. (2011). Linear regression diagnostics for unclustered survey data. Journal of Official Statistics, 27, 99-119.
Li, J., and Valliant, R. (2015). Linear regression diagnostics in cluster samples. Journal of Official Statistics, 31, 61-75.
Lumley, T. (2010). Complex Surveys. New York: John Wiley & Sons.
Lumley, T. (2023). survey: analysis of complex survey samples. R package version 4.2.
require(survey) data(api) # unstratified design single stage design d0 <- svydesign(id=~1,strata=NULL, weights=~pw, data=apistrat) m0 <- svyglm(api00 ~ ell + meals + mobility, design=d0) yy <- svydffits(mobj=m0) yy$cutoff sum(abs(yy$Dffits) > yy$cutoff) require(NHANES) data(NHANESraw) dnhanes <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTINT2YR, nest=TRUE, data=NHANESraw) m2 <- svyglm(BPDiaAve ~ as.factor(Race1) + BMI + AlcoholYear, design = dnhanes) yy <- svydffits(mobj=m2, stvar= "SDMVSTRA", clvar="SDMVPSU", z=4) sum(abs(yy$Dffits) > yy$cutoff)
require(survey) data(api) # unstratified design single stage design d0 <- svydesign(id=~1,strata=NULL, weights=~pw, data=apistrat) m0 <- svyglm(api00 ~ ell + meals + mobility, design=d0) yy <- svydffits(mobj=m0) yy$cutoff sum(abs(yy$Dffits) > yy$cutoff) require(NHANES) data(NHANESraw) dnhanes <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTINT2YR, nest=TRUE, data=NHANESraw) m2 <- svyglm(BPDiaAve ~ as.factor(Race1) + BMI + AlcoholYear, design = dnhanes) yy <- svydffits(mobj=m2, stvar= "SDMVSTRA", clvar="SDMVPSU", z=4) sum(abs(yy$Dffits) > yy$cutoff)
Compute leverages for fixed effects, linear regression models fitted from complex survey data.
svyhat(mobj, doplot=FALSE)
svyhat(mobj, doplot=FALSE)
mobj |
model object produced by |
doplot |
if |
svyhat
computes the leverages from a model fitted with complex survey data. The model object mobj
must be created by svyglm
in the R survey
package. The output is a vector of the leverages and a scatterplot of them versus the sequence number of the sample element used in fitting the model. By default, svyglm
uses only complete cases (i.e., ones for which the dependent variable and all independent variables are non-missing) to fit the model. The rows of the data frame used in fitting the model can be retrieved from the svyglm
object via as.numeric(names(mobj$y))
. The data for those rows is in mobj$data
.
Numeric vector whose names are the rows of the data frame in the svydesign
object that were used in fitting the model.
Richard Valliant
Belsley, D.A., Kuh, E. and Welsch, R. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: John Wiley & Sons, Inc.
Li, J., and Valliant, R. (2009). Survey weighted hat matrix and leverages. Survey Methodology, 35, 15-24.
Lumley, T. (2010). Complex Surveys. New York: John Wiley & Sons.
Lumley, T. (2023). survey: analysis of complex survey samples. R package version 4.2.
require(survey) data(api) dstrat <- svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat) m1 <- svyglm(api00 ~ ell + meals + mobility, design=dstrat) h <- svyhat(mobj = m1, doplot=TRUE) 100*sum(h > 3*mean(h))/length(h) # percentage of leverages > 3*mean require(NHANES) data(NHANESraw) dnhanes <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTINT2YR, nest=TRUE, data=NHANESraw) m1 <- svyglm(BPDiaAve ~ as.factor(Race1) + BMI + AlcoholYear, design = dnhanes) h <- svyhat(mobj = m1, doplot=TRUE)
require(survey) data(api) dstrat <- svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat) m1 <- svyglm(api00 ~ ell + meals + mobility, design=dstrat) h <- svyhat(mobj = m1, doplot=TRUE) 100*sum(h > 3*mean(h))/length(h) # percentage of leverages > 3*mean require(NHANES) data(NHANESraw) dnhanes <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTINT2YR, nest=TRUE, data=NHANESraw) m1 <- svyglm(BPDiaAve ~ as.factor(Race1) + BMI + AlcoholYear, design = dnhanes) h <- svyhat(mobj = m1, doplot=TRUE)
Compute standardized residuals for fixed effects, linear regression models fitted with data collected from one- and two-stage complex survey designs.
svystdres(mobj, stvar=NULL, clvar=NULL, doplot=FALSE)
svystdres(mobj, stvar=NULL, clvar=NULL, doplot=FALSE)
mobj |
model object produced by |
stvar |
name of the stratification variable in the |
clvar |
name of the cluster variable in the |
doplot |
if |
svystdres
computes the standardized residuals, i.e., the residuals divided by an estimate of the model standard deviation of the residuals. Residuals are used from a model object created by svyglm
in the R survey
package. The output is a vector of the standardized residuals and a scatterplot of them versus the sequence number of the sample element used in fitting the model. By default, svyglm
uses only complete cases (i.e., ones for which the dependent variable and all independent variables are non-missing) to fit the model. The rows of the data frame used in fitting the model can be retrieved from the svyglm
object via as.numeric(names(mobj$y))
. The data for those rows is in mobj$data
.
List object with values:
stdresids |
Numeric vector whose names are the rows of the data frame in the |
n |
number of sample clusters |
mbar |
average number of non-missing, sample elements per cluster |
rtsighat |
estimate of the square root of the model variance of the residuals, |
rhohat |
estimate of the intracluster correlation of the residuals, |
Richard Valliant
Li, J., and Valliant, R. (2011). Linear regression diagnostics for unclustered survey data. Journal of Official Statistics, 27, 99-119.
Li, J., and Valliant, R. (2015). Linear regression diagnostics in cluster samples. Journal of Official Statistics, 31, 61-75.
Lumley, T. (2010). Complex Surveys. New York: John Wiley & Sons.
Lumley, T. (2023). survey: analysis of complex survey samples. R package version 4.2.
require(survey) data(api) # unstratified design single stage design d0 <- svydesign(id=~1,strata=NULL, weights=~pw, data=apistrat) m0 <- svyglm(api00 ~ ell + meals + mobility, design=d0) svystdres(mobj=m0, stvar=NULL, clvar=NULL) # stratified cluster design require(NHANES) data(NHANESraw) dnhanes <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTINT2YR, nest=TRUE, data=NHANESraw) m1 <- svyglm(BPDiaAve ~ as.factor(Race1) + BMI + AlcoholYear, design = dnhanes) svystdres(mobj=m1, stvar= "SDMVSTRA", clvar="SDMVPSU")
require(survey) data(api) # unstratified design single stage design d0 <- svydesign(id=~1,strata=NULL, weights=~pw, data=apistrat) m0 <- svyglm(api00 ~ ell + meals + mobility, design=d0) svystdres(mobj=m0, stvar=NULL, clvar=NULL) # stratified cluster design require(NHANES) data(NHANESraw) dnhanes <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTINT2YR, nest=TRUE, data=NHANESraw) m1 <- svyglm(BPDiaAve ~ as.factor(Race1) + BMI + AlcoholYear, design = dnhanes) svystdres(mobj=m1, stvar= "SDMVSTRA", clvar="SDMVPSU")
Compute a VIF for fixed effects, general linear regression models fitted with data collected from one- and two-stage complex survey designs.
svyvif(mobj, X, w, stvar=NULL, clvar=NULL)
svyvif(mobj, X, w, stvar=NULL, clvar=NULL)
mobj |
model object produced by |
X |
|
w |
|
stvar |
field in |
clvar |
field in |
svyvif
computes variance inflation factors (VIFs) appropriate for linear models and some general linear models (GLMs) fitted from complex survey data (see Liao 2010 and Liao & Valliant 2012). A VIF measures the inflation of a slope estimate caused by nonorthogonality of the predictors over and above what the variance would be with orthogonality (Theil 1971; Belsley, Kuh, and Welsch 1980). A VIF may also be thought of as the amount that the variance of an estimated coefficient for a predictor x is inflated in a model that includes all x's compared to a model that includes only the single x. Another alternative is to use as a comparison a model that includes an intercept and the single x. Both of these VIFs are in the output.
The standard, non-survey data VIF equals where
is the multiple correlation of the
column of
X
regressed on the remaining columns. The complex sample value of the VIF for a linear model consists of the standard VIF multiplied by two adjustments denoted in the output as zeta
and either varrho.m
or varrho
. The VIF for a GLM is similar (Liao 2010, chap. 5; Liao & Valliant 2024). There is no widely agreed-upon cutoff value for identifying high values of a VIF, although 10 is a common suggestion.
A list with two components:
Intercept adjusted
data frame with columns:
svy.vif.m:
complex sample VIF where the reference model includes an intercept and a single x
reg.vif.m:
standard VIF, , that omits the factors,
zeta
and varrho.m
; is an R-square, corrected for the mean, from a weighted least squares regression of the
x on the other x's in the regression
zeta:
1st multiplicative adjustment to reg.vif.m
varrho.m:
2nd multiplicative adjustment to reg.vif.m
zeta.x.varrho.m:
product of the two adjustments to reg.vif.m
Rsq.m:
R-square, corrected for the mean, in the regression of the x on the other x's, including an intercept
No intercept
data frame with columns:
svy.vif:
complex sample VIF where the reference model includes a single x and excludes an intercept; this VIF is analogous to the one included in standard packages that provide VIFs for linear regressions
reg.vif:
standard VIF, , that omits the factors,
zeta
and varrho
; is an R-square, not corrected for the mean, from a weighted least squares regression of the
x on the other x's in the regression
zeta:
1st multiplicative adjustment to reg.vif
varrho:
2nd multiplicative adjustment to reg.vif
zeta.x.varrho:
product of the two adjustments to reg.vif
Rsq:
R-square, not corrected for the mean, in the regression of the x on the other x's, including an intercept
Richard Valliant
Belsley, D.A., Kuh, E. and Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley-Interscience.
Liao, D. (2010). Collinearity Diagnostics for Complex Survey Data. PhD thesis, University of Maryland. http://hdl.handle.net/1903/10881.
Liao, D, and Valliant, R. (2012). Variance inflation factors in the analysis of complex survey data. Survey Methodology, 38, 53-62.
Liao, D, and Valliant, R. (2024). Collinearity Diagnostics in Generalized Linear Models Fitted with Survey Data. submitted.
Theil, H. (1971). Principles of Econometrics. New York: John Wiley & Sons, Inc.
Lumley, T. (2010). Complex Surveys. New York: John Wiley & Sons.
Lumley, T. (2023). survey: analysis of complex survey samples. R package version 4.4.
require(survey) data(nhanes2007) X1 <- nhanes2007[order(nhanes2007$SDMVSTRA, nhanes2007$SDMVPSU),] # eliminate cases with missing values delete <- which(complete.cases(X1)==FALSE) X2 <- X1[-delete,] X2$obese <- X2$BMXBMI >= 30 nhanes.dsgn <- svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTDRD1, nest=TRUE, data=X2) # linear model m1 <- svyglm(BMXWT ~ RIDAGEYR + as.factor(RIDRETH1) + DR1TKCAL + DR1TTFAT + DR1TMFAT, design=nhanes.dsgn) summary(m1) # construct X matrix using model.matrix from stats package X3 <- model.matrix(~ RIDAGEYR + as.factor(RIDRETH1) + DR1TKCAL + DR1TTFAT + DR1TMFAT, data = data.frame(X2)) # remove col of 1's for intercept with X3[,-1] svyvif(mobj=m1, X=X3[,-1], w = X2$WTDRD1, stvar="SDMVSTRA", clvar="SDMVPSU") # Logistic model m2 <- svyglm(obese ~ RIDAGEYR + as.factor(RIDRETH1) + DR1TKCAL + DR1TTFAT + DR1TMFAT, design=nhanes.dsgn, family="quasibinomial") summary(m2) svyvif(mobj=m2, X=X3[,-1], w = X2$WTDRD1, stvar = "SDMVSTRA", clvar = "SDMVPSU")
require(survey) data(nhanes2007) X1 <- nhanes2007[order(nhanes2007$SDMVSTRA, nhanes2007$SDMVPSU),] # eliminate cases with missing values delete <- which(complete.cases(X1)==FALSE) X2 <- X1[-delete,] X2$obese <- X2$BMXBMI >= 30 nhanes.dsgn <- svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTDRD1, nest=TRUE, data=X2) # linear model m1 <- svyglm(BMXWT ~ RIDAGEYR + as.factor(RIDRETH1) + DR1TKCAL + DR1TTFAT + DR1TMFAT, design=nhanes.dsgn) summary(m1) # construct X matrix using model.matrix from stats package X3 <- model.matrix(~ RIDAGEYR + as.factor(RIDRETH1) + DR1TKCAL + DR1TTFAT + DR1TMFAT, data = data.frame(X2)) # remove col of 1's for intercept with X3[,-1] svyvif(mobj=m1, X=X3[,-1], w = X2$WTDRD1, stvar="SDMVSTRA", clvar="SDMVPSU") # Logistic model m2 <- svyglm(obese ~ RIDAGEYR + as.factor(RIDRETH1) + DR1TKCAL + DR1TTFAT + DR1TMFAT, design=nhanes.dsgn, family="quasibinomial") summary(m2) svyvif(mobj=m2, X=X3[,-1], w = X2$WTDRD1, stvar = "SDMVSTRA", clvar = "SDMVPSU")
Compute a covariance matrix using residuals from a fixed effects, general linear regression model fitted with data collected from one- and two-stage complex survey designs.
Vmat(mobj, stvar = NULL, clvar = NULL)
Vmat(mobj, stvar = NULL, clvar = NULL)
mobj |
model object produced by |
stvar |
field in |
clvar |
field in |
Vmat
computes a covariance matrix among the residuals returned from svyglm
in the survey
package. Vmat
is called by svyvif
when computing variance inflation factors. The matrix that is computed by Vmat
is appropriate under these model assumptions: (1) in single-stage, unclustered sampling, units are assumed to be uncorrelated but can have different model variances, (2) in single-stage, stratified sampling, units are assumed to be uncorrelated within strata and between strata but can have different model variances; (3) in unstratified, clustered samples, units in different clusters are assumed to be uncorrelated but units within clusters are correlated; (3) in stratified, clustered samples, units in different strata or clusters are assumed to be uncorrelated but units within clusters are correlated.
matrix where
is the number of cases used in the linear regression model
Richard Valliant
Liao, D, and Valliant, R. (2012). Variance inflation factors in the analysis of complex survey data. Survey Methodology, 38, 53-62.
Lumley, T. (2010). Complex Surveys. New York: John Wiley & Sons.
Lumley, T. (2023). survey: analysis of complex survey samples. R package version 4.2.
require(Matrix) require(survey) data(nhanes2007) black <- nhanes2007$RIDRETH1 == 4 X <- nhanes2007 X <- cbind(X, black) X1 <- X[order(X$SDMVSTRA, X$SDMVPSU),] # unstratified, unclustered design nhanes.dsgn <- svydesign(ids = 1:nrow(X1), strata = NULL, weights = ~WTDRD1, data=X1) m1 <- svyglm(BMXWT ~ RIDAGEYR + as.factor(black) + DR1TKCAL, design=nhanes.dsgn) summary(m1) V <- Vmat(mobj = m1, stvar = NULL, clvar = NULL) # stratified, clustered design nhanes.dsgn <- svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTDRD1, nest=TRUE, data=X1) m1 <- svyglm(BMXWT ~ RIDAGEYR + as.factor(black) + DR1TKCAL, design=nhanes.dsgn) summary(m1) V <- Vmat(mobj = m1, stvar = "SDMVSTRA", clvar = "SDMVPSU")
require(Matrix) require(survey) data(nhanes2007) black <- nhanes2007$RIDRETH1 == 4 X <- nhanes2007 X <- cbind(X, black) X1 <- X[order(X$SDMVSTRA, X$SDMVPSU),] # unstratified, unclustered design nhanes.dsgn <- svydesign(ids = 1:nrow(X1), strata = NULL, weights = ~WTDRD1, data=X1) m1 <- svyglm(BMXWT ~ RIDAGEYR + as.factor(black) + DR1TKCAL, design=nhanes.dsgn) summary(m1) V <- Vmat(mobj = m1, stvar = NULL, clvar = NULL) # stratified, clustered design nhanes.dsgn <- svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTDRD1, nest=TRUE, data=X1) m1 <- svyglm(BMXWT ~ RIDAGEYR + as.factor(black) + DR1TKCAL, design=nhanes.dsgn) summary(m1) V <- Vmat(mobj = m1, stvar = "SDMVSTRA", clvar = "SDMVPSU")