Title: | Methods for Analyzing Multiple Response Categorical Variables (MRCVs) |
---|---|
Description: | Provides functions for analyzing the association between one single response categorical variable (SRCV) and one multiple response categorical variable (MRCV), or between two or three MRCVs. A modified Pearson chi-square statistic can be used to test for marginal independence for the one or two MRCV case, or a more general loglinear modeling approach can be used to examine various other structures of association for the two or three MRCV case. Bootstrap- and asymptotic-based standardized residuals and model-predicted odds ratios are available, in addition to other descriptive information. Statisical methods implemented are described in Bilder et al. (2000) <doi:10.1080/03610910008813665>, Bilder and Loughin (2004) <doi:10.1111/j.0006-341X.2004.00147.x>, Bilder and Loughin (2007) <doi:10.1080/03610920600974419>, and Koziol and Bilder (2014) <https://journal.r-project.org/articles/RJ-2014-014/>. |
Authors: | Natalie Koziol [aut], Chris Bilder [aut, cre] |
Maintainer: | Chris Bilder <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.4-0 |
Built: | 2024-12-22 06:20:55 UTC |
Source: | CRAN |
Provides functions for analyzing the association between one single response categorical variable (SRCV) and one multiple response categorical variable (MRCV), or between two or three MRCVs. A modified Pearson chi-square statistic can be used to test for marginal independence for the one or two MRCV case, or a more general loglinear modeling approach can be used to examine various other structures of association for the two or three MRCV case. Bootstrap- and asymptotic-based standardized residuals and model-predicted odds ratios are available, in addition to other descriptive information.
Package: | MRCV |
Version: | 0.4-0 |
Date: | 2024-10-18 |
Depends: | R (>= 4.4.0) |
Imports: | tables |
Suggests: | geepack |
LazyData: | TRUE |
License: | GPL (>= 3) |
Notation:
For the two or three MRCV case, define row variable, W, column variable, Y, and strata variable, Z, as MRCVs with binary items (i.e., categories) Wi for i = 1, ..., I, Yj for j = 1, ..., J, and Zk for k = 1, ..., K, respectively. Define a marginal count as the number of subjects who responded (Wi = a, Yj = b, Zk = c) for a, b, and c belonging to the set {0, 1}. For the one MRCV case, let W be an SRCV such that I = 1 and 'a' corresponds to one of r levels of W, and let Y be the MRCV as previously defined.
Format of Data Frame:
Many of the functions require a data frame containing the raw data structured such that the n rows correspond to the individual item response vectors, and the columns correspond to the items, W1, ..., WI, Y1, ..., YJ, and Z1, ..., ZK (in this order). Some of the functions use a summary version of the raw data frame (converted automatically without need for user action) formatted to have rx2J rows and 4 columns generically named W
, Y
, yj
, and count
(one MRCV case), 2Ix2J rows and 5 columns named W
, Y
, wi
, yj
, and count
(two MRCV case), or 2Ix2Jx2K rows and 7 columns named W
, Y
, Z
, wi
, yj
, zk
, and count
(three MRCV case). The column named count
contains the marginal counts defined above.
Descriptive Functions:
Users can call the item.response.table
function to obtain a cross-tabulation of the positive and negative responses for each combination of items, or the marginal.table
function to obtain a cross-tabulation of only the positive responses.
Functions to Test for Marginal Independence:
Methods proposed by Agresti and Liu (1999), Bilder and Loughin (2004), Bilder, Loughin, and Nettleton (2000), and Thomas and Decady (2004) are implemented using the MI.test
function. This function calculates a modified Pearson chi-square statistic that can be used to test for multiple marginal independence (MMI; one MRCV case) or simultaneous pairwise marginal independence (SPMI; two MRCV case). MMI is a test of whether the SRCV, W, is marginally independent of each Yj, where the modified statistic is the sum of the J Pearson statistics used to test for independence of each (W, Yj) pair. SPMI is a test of whether each Wi is pairwise independent of each Yj, where the modified statistic is the sum of the IxJ Pearson statistics used to test for independence of each (Wi, Yj) pair. The asymptotic distribution of the modified statistics is a linear combination of independent chi-square(1) random variables, so traditional methods for analyzing the association between categorical variables W and Y are inappropriate. The MI.test
function offers three sets of testing methods: a nonparametric bootstrap approach, a Rao-Scott second-order adjustment, and a Bonferroni adjustment, that can be used in conjunction with the modified statistic to construct an appropriate test for independence.
Functions for Performing Regression Modeling:
Regression modeling methods described by Bilder and Loughin (2007) are implemented using genloglin
and methods summary.genloglin
, residuals.genloglin
, anova.genloglin
, and predict.genloglin
. The genloglin
function provides parameter estimates and Rao-Scott adjusted standard errors for models involving two or three MRCVs. The anova.genloglin
function offers second-order Rao-Scott and bootstrap adjusted model comparison and goodness-of-fit (Pearson and LRT) statistics. The residuals.genloglin
and predict.genloglin
functions provide bootstrap- and asymptotic-based standardized Pearson residuals and model-based odds ratios, respectively.
General Notes:
Rao-Scott adjustments may not be feasible when the total number of MRCV items is large. In this case, an error message will be returned describing a memory allocation issue.
Natalie Koziol, Chris Bilder
Maintainer: Chris Bilder <[email protected]>
Agresti, A. and Liu, I.-M. (1999) Modeling a categorical variable allowing arbitrarily many category choices. Biometrics, 55, 936–943.
Bilder, C. and Loughin, T. (2004) Testing for marginal independence between two categorical variables with multiple responses. Biometrics, 36, 433–451.
Bilder, C. and Loughin, T. (2007) Modeling association between two or more categorical variables that allow for multiple category choices. Communications in Statistics–Theory and Methods, 36, 433–451.
Bilder, C., Loughin, T., and Nettleton, D. (2000) Multiple marginal independence testing for pick any/c variables. Communications in Statistics–Theory and Methods, 29, 1285–1316.
Koziol, N. and Bilder, C. (2014) MRCV: A package for analyzing categorical variables with multiple response options. The R Journal, 6, 144–150.
Thomas, D. and Decady, Y. (2004) Testing for association using multiple response survey data: Approximate procedures based on the Rao-Scott approach. International Journal of Testing, 4, 43–59.
The anova.genloglin
method function offers second-order Rao-Scott and bootstrap adjusted model comparison and goodness-of-fit (Pearson and LRT) statistics appropriate for evaluating models estimated by the genloglin
function.
## S3 method for class 'genloglin' anova(object, model.HA = "saturated", type = "all", gof = TRUE, print.status = TRUE, ...)
## S3 method for class 'genloglin' anova(object, model.HA = "saturated", type = "all", gof = TRUE, print.status = TRUE, ...)
object |
An object of class |
model.HA |
For the two MRCV case, a character string specifying one of the following models to be compared to the null model (where the null model should be nested within the alternative model): |
type |
A character string specifying one of the following approaches for performing adjusted model comparison tests: |
gof |
A logical value indicating whether goodness-of-fit statistics should be calculated in addition to model comparison statistics. For |
print.status |
A logical value indicating whether bootstrap progress updates should be provided. |
... |
Additional arguments passed to or from other methods. |
The Rao-Scott approach applies a second-order adjustment to the model comparison statistic and its sampling distribution. Formulas are provided in Appendix A of Bilder and Loughin (2007).
The bootstrap approach empirically estimates the sampling distribution of the model comparison statistic. Gange's (1995) method for generating correlated binary data is used for taking resamples under the null hypothesis. Bootstrap results are available only when boot = TRUE
in the call to the genloglin
function.
— A list containing at least the following objects: original.arg
and test.statistics
.
original.arg
is a list containing the following objects:
model
: The original model specified in the call to the genloglin
function.
model.HA
: The alternative model specified for the model.HA
argument.
gof
: The original value supplied to the gof
argument.
test.statistics
is a list containing at least the following objects:
Pearson.chisq
: The Pearson model comparison statistic calculated using the observed data.
lrt
: The LRT model comparison statistic calculated using the observed data.
If gof = TRUE
, test.statistics
additionally contains
Pearson.chisq.gof
: The Pearson goodness-of-fit statistic calculated using the observed data.
lrt.gof
: The LRT goodness-of-fit statistic calculated using the observed data.
— For type = "boot"
, the primary list additionally includes boot.results
, a list containing at least the following objects:
B.use
: The number of bootstrap resamples used.
B.discard
: The number of bootstrap resamples discarded due to having at least one item with all positive or negative responses.
p.chisq.boot
: The bootstrap p-value for the Pearson model comparison test.
p.lrt.boot
: The bootstrap p-value for the LRT model comparison test.
If gof = TRUE
, boot.results
additionally contains
p.chisq.gof.boot
: The bootstrap p-value for the Pearson goodness-of-fit test.
p.lrt.gof.boot
: The bootstrap p-value for the LRT goodness-of-fit test.
— For type = "rs2"
, the primary list additionally includes rs.results
, a list that includes at least Pearson.chisq.rs
and lrt.rs
.
Pearson.chisq.rs
is a list containing the following objects:
Pearson.chisq.rs
: The Rao-Scott second-order adjusted Pearson model comparison statistic.
df
: The Rao-Scott second-order adjusted degrees of freedom for the model comparison statistic.
p.value
: The p-value for the Rao-Scott second-order adjusted Pearson model comparison test.
lrt.rs
is a list containing the following objects:
lrt.rs
: The Rao-Scott second-order adjusted LRT model comparison statistic.
df
: Same as df
given above.
p.value
: The p-value for the Rao-Scott second-order adjusted LRT model comparison test.
If gof = TRUE
, rs.results
additionally includes Pearson.chisq.gof.rs
and lrt.gof.rs
.
Pearson.chisq.gof.rs
is a list containing the following objects:
Pearson.chisq.gof.rs
: The Rao-Scott second-order adjusted Pearson goodness-of-fit statistic.
df
: Same as df
given above.
p.value
: The p-value for the Rao-Scott second-order adjusted Pearson goodness-of-fit test.
lrt.gof.rs
is a list containing the following objects:
lrt.gof.rs
: The Rao-Scott second-order adjusted LRT goodness-of-fit statistic.
df
: Same as df
given above.
p.value
: The p-value for the Rao-Scott second-order adjusted LRT goodness-of-fit test.
— For type = "all"
, the original list includes the boot.results
and rs.results
output.
Bilder, C. and Loughin, T. (2007) Modeling association between two or more categorical variables that allow for multiple category choices. Communications in Statistics–Theory and Methods, 36, 433–451.
Gange, S. (1995) Generating multivariate categorical variates using the iterative proportional fitting algorithm. The American Statistician, 49, 134–138.
## For examples see help(genloglin).
## For examples see help(genloglin).
Responses for one SRCV and one MRCV from a survey of Kansas farmers. This data was first examined by Loughin and Scherer (1998) and subsequently examined by papers such as Agresti and Liu (1999) and Bilder, Loughin, and Nettleton (2000).
farmer1
farmer1
The data frame contains the following 6 columns:
Column 1, labeled Ed
, corresponds to the respondent's highest attained level of education. A total of r = 5 levels of education are represented in the data.
1
: High school
2
: Vocational school
3
: 2-Year college
4
: 4-Year college
5
: Other
Columns 2-6 correspond to the response categories for the question "What are your primary sources of veterinary information?" Binary responses (1 = Yes, 0 = No) are provided for each category.
Y1
: Professional consultant
Y2
: Veterinarian
Y3
: State or local extension service
Y4
: Magazines
Y5
: Feed companies and representatives
Loughin, T. M. and Scherer, P. N. (1998) Testing for association in contingency tables with multiple column responses. Biometrics, 54, 630–637.
Agresti, A. and Liu, I.-M. (1999) Modeling a categorical variable allowing arbitrarily many category choices. Biometrics, 55, 936–943.
Bilder, C., Loughin, T., and Nettleton, D. (2000) Multiple marginal independence testing for pick any/c variables. Communications in Statistics–Theory and Methods, 29, 1285–1316.
Loughin, T. M. and Scherer, P. N. (1998) Testing for association in contingency tables with multiple column responses. Biometrics, 54, 630–637.
Responses for two MRCVs from a survey of Kansas farmers. This data was first examined by Bilder and Loughin (2007).
farmer2
farmer2
The data frame contains the following 7 columns:
Columns 1-3 correspond to the response categories for the question "Which of the following do you test your swine waste for?" Binary responses (1 = Yes, 0 = No) are provided for each category.
W1
: Nitrogen
W2
: Phosphorous
W3
: Salt
Columns 4-7 correspond to the response categories for the question "What swine waste disposal methods do you use?" Binary responses (1 = Yes, 0 = No) are provided for each category.
Y1
: Lagoon
Y2
: Pit
Y3
: Natural drainage
Y4
: Holding tank
Bilder, C. and Loughin, T. (2007) Modeling association between two or more categorical variables that allow for multiple category choices. Communications in Statistics–Theory and Methods, 36, 433–451.
Responses for three MRCVs from a survey of Kansas farmers. This data was first examined by Bilder and Loughin (2007).
farmer3
farmer3
The data frame contains the following 12 columns:
Columns 1-3 correspond to the response categories for the question "Which of the following do you test your swine waste for?" Binary responses (1 = Yes, 0 = No) are provided for each category.
W1
: Nitrogen
W2
: Phosphorous
W3
: Salt
Columns 4-7 correspond to the response categories for the question "What swine waste disposal methods do you use?" Binary responses (1 = Yes, 0 = No) are provided for each category.
Y1
: Lagoon
Y2
: Pit
Y3
: Natural drainage
Y4
: Holding tank
Columns 8-12 correspond to the response categories for the question "What are your primary sources of veterinary information?" Binary responses (1 = Yes, 0 = No) are provided for each category.
Z1
: Professional consultant
Z2
: Veterinarian
Z3
: State or local extension service
Z4
: Magazines
Z5
: Feed companies and representatives
Bilder, C. and Loughin, T. (2007) Modeling association between two or more categorical variables that allow for multiple category choices. Communications in Statistics–Theory and Methods, 36, 433–451.
The genloglin
function uses a generalized loglinear modeling approach to estimate the association among two or three MRCVs. Standard errors are adjusted using a second-order Rao-Scott approach.
genloglin(data, I, J, K = NULL, model, add.constant = 0.5, boot = TRUE, B = 1999, B.max = B, print.status = TRUE)
genloglin(data, I, J, K = NULL, model, add.constant = 0.5, boot = TRUE, B = 1999, B.max = B, print.status = TRUE)
data |
A data frame containing the raw data where rows correspond to the individual item response vectors, and columns correspond to the binary items, W1, ..., WI, Y1, ..., YJ, and Z1, ..., ZK (in this order). |
I |
The number of items corresponding to row variable W. |
J |
The number of items corresponding to column variable Y. |
K |
The number of items corresponding to strata variable Z. |
model |
For the two MRCV case, a character string specifying one of the following models: |
add.constant |
A positive constant to be added to all zero marginal cell counts. |
boot |
A logical value indicating whether bootstrap resamples should be taken. |
B |
The desired number of bootstrap resamples. |
B.max |
The maximum number of bootstrap resamples. Resamples for which at least one item has all positive or negative responses are thrown out; |
print.status |
A logical value indicating whether progress updates should be provided. When |
The genloglin
function first converts the raw data into a form that can be used for estimation. For the two MRCV case, the reformatted data frame contains 2Ix2J rows and 5 columns generically named W
, Y
, wi
, yj
, and count
. For the three MRCV case, the reformatted data frame contains 2Ix2Jx2K rows and 7 columns generically named W
, Y
, Z
, wi
, yj
, zk
, and count
. Then, the model of interest is estimated by calling the glm
function where the family
argument is specified as poisson
. For all predictor variables, the first level is the reference group (i.e., 1 is the reference for variables W
, Y
, and Z
, and 0 is the reference for variables wi
, yj
, and zj
).
The boot
argument must equal TRUE
in order to obtain bootstrap results with the genloglin
method functions.
— genloglin
returns an object of class 'genloglin'
. The object is a list containing at least the following objects: original.arg
, mod.fit
, sum.fit
, and rs.results
.
original.arg
is a list containing the following objects:
data
: The original data frame supplied to the data
argument.
I
: The original value supplied to the I
argument.
J
: The original value supplied to the J
argument.
K
: The original value supplied to the K
argument.
nvars
: The number of MRCVs.
model
: The original value supplied to the model
argument.
add.constant
: The original value supplied to the add.constant
argument.
boot
: The original value supplied to the boot
argument.
mod.fit
is a list containing the same objects returned by glm
with a few modifications as described in summary.genloglin
.
sum.fit
is a list containing the same objects returned by the summary
method for class "glm"
with a few modifications as described in summary.genloglin
.
rs.results
is a list containing the following objects (see Appendix A of Bilder and Loughin, 2007, for more detail):
cov.mu
: The covariance matrix for the estimated cell counts.
E
: The covariance matrix for the residuals.
gamma
: Eigenvalues used in computing second-order Rao-Scott adjusted statistics.
— For boot = TRUE
, the primary list additionally includes boot.results
, a list containing the following objects:
B.use
: The number of bootstrap resamples used.
B.discard
: The number of bootstrap resamples discarded due to having at least one item with all positive or negative responses.
model.data.star
: For the two MRCV case, a numeric matrix containing 2Ix2J rows and B.use
+4 columns, where the first 4 columns correspond to the model variables W
, Y
, wi
, and yj
, and the last B.use
columns correspond to the observed counts for each resample. For the three MRCV case, a numeric matrix containing 2Ix2Jx2K rows and B.use+6
columns, where the first 6 columns correspond to the model variables W
, Y
, Z
, wi
, yj
, and zk
, and the last B.use
columns correspond to the observed counts for each resample.
mod.fit.star
: For the two MRCV case, a numeric matrix containing B.use
rows and 2Ix2J +1 columns, where the first 2Ix2J columns correspond to the model-predicted counts for each resample, and the last column corresponds to the residual deviance for each resample. For the three MRCV case, a numeric matrix containing B.use
rows and 2Ix2Jx2K+1 columns, where the first 2Ix2Jx2K columns correspond to the model-predicted counts for each resample, and the last column corresponds to the residual deviance for each resample.
chisq.star
: A numeric vector of length B.use
containing the Pearson statistics (comparing model
to the saturated model) calculated for each resample.
lrt.star
: A numeric vector of length B.use
containing the LRT statistics calculated for each resample.
residual.star
: A numeric matrix with 2Ix2J rows (or 2Ix2Jx2K rows for the three MRCV case) and B.use
columns containing the residuals calculated for each resample.
Bilder, C. and Loughin, T. (2007) Modeling association between two or more categorical variables that allow for multiple category choices. Communications in Statistics–Theory and Methods, 36, 433–451.
The genloglin
methods summary.genloglin
, residuals.genloglin
, anova.genloglin
, and predict.genloglin
, and the corresponding generic functions summary
, residuals
, anova
, and predict
.
The glm
function for fitting generalized linear models.
The MI.test
function for testing for MMI (one MRCV case) or SPMI (two MRCV case).
# Estimate the y-main effects model for 2 MRCVs mod.fit <- genloglin(data = farmer2, I = 3, J = 4, model = "y.main", boot = FALSE) # Summarize model fit information summary(mod.fit) # Examine standardized Pearson residuals residuals(mod.fit) # Compare the y-main effects model to the saturated model anova(mod.fit, model.HA = "saturated", type = "rs2") # Obtain observed and model-predicted odds ratios predict(mod.fit) # Estimate a model that is not one of the named models # Note that this was the final model chosen by Bilder and Loughin (2007) mod.fit.other <- genloglin(data = farmer2, I = 3, J = 4, model = count ~ -1 + W:Y + wi%in%W:Y + yj%in%W:Y + wi:yj + wi:yj%in%Y + wi:yj%in%W3:Y1, boot = FALSE) # Compare this model to the y-main effects model anova(mod.fit, model.HA = count ~ -1 + W:Y + wi%in%W:Y + yj%in%W:Y + wi:yj + wi:yj%in%Y + wi:yj%in%W3:Y1, type = "rs2", gof = TRUE) # To obtain bootstrap results from the method functions the genloglin() boot # argument must be specified as TRUE (the default) # A small B is used for demonstration purposes; normally, a larger B should be used mod.fit <- genloglin(data = farmer2, I = 3, J = 4, model = "y.main", boot = TRUE, B = 99) residuals(mod.fit) anova(mod.fit, model.HA = "saturated", type = "all") predict(mod.fit) # Estimate a model for 3 MRCVs mod.fit.three <- genloglin(data = farmer3, I = 3, J = 4, K = 5, model = count ~ -1 + W:Y:Z + wi%in%W:Y:Z + yj%in%W:Y:Z + zk%in%W:Y:Z + wi:yj + wi:yj%in%Y + wi:yj%in%W + wi:yj%in%Y:W + yj:zk + yj:zk%in%Z + yj:zk%in%Y + yj:zk%in%Z:Y, boot = TRUE, B = 99) residuals(mod.fit.three) anova(mod.fit.three, model.HA = "saturated", type = "all") predict(mod.fit.three, pair = "WY")
# Estimate the y-main effects model for 2 MRCVs mod.fit <- genloglin(data = farmer2, I = 3, J = 4, model = "y.main", boot = FALSE) # Summarize model fit information summary(mod.fit) # Examine standardized Pearson residuals residuals(mod.fit) # Compare the y-main effects model to the saturated model anova(mod.fit, model.HA = "saturated", type = "rs2") # Obtain observed and model-predicted odds ratios predict(mod.fit) # Estimate a model that is not one of the named models # Note that this was the final model chosen by Bilder and Loughin (2007) mod.fit.other <- genloglin(data = farmer2, I = 3, J = 4, model = count ~ -1 + W:Y + wi%in%W:Y + yj%in%W:Y + wi:yj + wi:yj%in%Y + wi:yj%in%W3:Y1, boot = FALSE) # Compare this model to the y-main effects model anova(mod.fit, model.HA = count ~ -1 + W:Y + wi%in%W:Y + yj%in%W:Y + wi:yj + wi:yj%in%Y + wi:yj%in%W3:Y1, type = "rs2", gof = TRUE) # To obtain bootstrap results from the method functions the genloglin() boot # argument must be specified as TRUE (the default) # A small B is used for demonstration purposes; normally, a larger B should be used mod.fit <- genloglin(data = farmer2, I = 3, J = 4, model = "y.main", boot = TRUE, B = 99) residuals(mod.fit) anova(mod.fit, model.HA = "saturated", type = "all") predict(mod.fit) # Estimate a model for 3 MRCVs mod.fit.three <- genloglin(data = farmer3, I = 3, J = 4, K = 5, model = count ~ -1 + W:Y:Z + wi%in%W:Y:Z + yj%in%W:Y:Z + zk%in%W:Y:Z + wi:yj + wi:yj%in%Y + wi:yj%in%W + wi:yj%in%Y:W + yj:zk + yj:zk%in%Z + yj:zk%in%Y + yj:zk%in%Z:Y, boot = TRUE, B = 99) residuals(mod.fit.three) anova(mod.fit.three, model.HA = "saturated", type = "all") predict(mod.fit.three, pair = "WY")
The item.response.table
function is used to summarize data arising from one, two, or three MRCVs. For the one and two MRCV cases, a cross-tabulation of the positive and negative responses for each (Wi, Yj) pair is presented as a table or data frame (where Wi = W for the one MRCV case). For the three MRCV case, a cross-tabulation of the positive and negative responses for each (Wi, Yj) pair is presented conditional on the response for each Zk.
item.response.table(data, I, J, K = NULL, create.dataframe = FALSE)
item.response.table(data, I, J, K = NULL, create.dataframe = FALSE)
data |
A data frame containing the raw data where rows correspond to the individual item response vectors, and columns correspond to the items W1, ..., WI, Y1, ..., YJ, and Z1, ..., ZK (in this order). |
I |
The number of items corresponding to row variable W. I = 1 for the one MRCV case. |
J |
The number of items corresponding to column variable Y. |
K |
The number of items corresponding to strata variable Z. |
create.dataframe |
A logical value indicating whether the results should be presented as a data frame instead of a table. |
For create.dataframe = FALSE
, item.response.table
uses the tables:tabular()
function to produce tables of marginal counts.
For create.dataframe = TRUE
, item.response.table
returns the same information as above but presents it as a data frame. For the one MRCV case, the data frame contains rx2J rows and 4 columns generically named W
, Y
, yj
, and count
. For the two MRCV case, the data frame contains 2Ix2J rows and 5 columns named W
, Y
, wi
, yj
, and count
. For the three MRCV case, the data frame contains 2Ix2Jx2K rows and 7 columns named W
, Y
, Z
, wi
, yj
, zk
, and count
.
The marginal.table
function for creating a marginal table that summarizes only the positive responses for each pair.
# Create an item response table for 1 SRCV and 1 MRCV farmer.irtable.one <- item.response.table(data = farmer1, I = 1, J = 5) farmer.irtable.one # Create an item response data frame for 1 SRCV and 1 MRCV farmer.irdataframe.one <- item.response.table(data = farmer1, I = 1, J = 5, create.dataframe = TRUE) farmer.irdataframe.one # Create an item response table for 2 MRCVs farmer.irtable.two <- item.response.table(data = farmer2, I = 3, J = 4) farmer.irtable.two # Create an item response table for 3 MRCVs farmer.irtable.three <- item.response.table(data = farmer3, I = 3, J = 4, K = 5) farmer.irtable.three
# Create an item response table for 1 SRCV and 1 MRCV farmer.irtable.one <- item.response.table(data = farmer1, I = 1, J = 5) farmer.irtable.one # Create an item response data frame for 1 SRCV and 1 MRCV farmer.irdataframe.one <- item.response.table(data = farmer1, I = 1, J = 5, create.dataframe = TRUE) farmer.irdataframe.one # Create an item response table for 2 MRCVs farmer.irtable.two <- item.response.table(data = farmer2, I = 3, J = 4) farmer.irtable.two # Create an item response table for 3 MRCVs farmer.irtable.three <- item.response.table(data = farmer3, I = 3, J = 4, K = 5) farmer.irtable.three
The marginal.table
function is used to summarize only the positive responses (joint positive responses for multiple MRCVs) for data arising from one, two, or three MRCVs. This function essentially provides a subset of counts from the item.response.table
function.
marginal.table(data, I, J, K = NULL)
marginal.table(data, I, J, K = NULL)
data |
A data frame containing the raw data where rows correspond to the individual item response vectors, and columns correspond to the items W1, ..., WI, Y1, ..., YJ, and Z1, ..., ZK (in this order). |
I |
The number of items corresponding to row variable W. I = 1 for the one MRCV case. |
J |
The number of items corresponding to column variable Y. |
K |
The number of items corresponding to strata variable Z. |
The marginal.table
function uses the tables:tabular()
function to produce a table containing positive counts.
The item.response.table
function for creating an item response table or data frame that summarizes both the positive and negative responses for each (Wi, Yj) pair (conditional on the response for each Zk in the case of three MRCVs).
# Create a marginal table for 1 MRCV farmer.mtable.one <- marginal.table(data = farmer1, I = 1, J = 5) farmer.mtable.one # Create a marginal table for 2 MRCVs farmer.mtable.two <- marginal.table(data = farmer2, I = 3, J = 4) farmer.mtable.two # Create a marginal table for 3 MRCVs farmer.mtable.three <- marginal.table(data = farmer3, I = 3, J = 4, K = 5) farmer.mtable.three
# Create a marginal table for 1 MRCV farmer.mtable.one <- marginal.table(data = farmer1, I = 1, J = 5) farmer.mtable.one # Create a marginal table for 2 MRCVs farmer.mtable.two <- marginal.table(data = farmer2, I = 3, J = 4) farmer.mtable.two # Create a marginal table for 3 MRCVs farmer.mtable.three <- marginal.table(data = farmer3, I = 3, J = 4, K = 5) farmer.mtable.three
The MI.test
function offers three approaches for testing multiple marginal independence (MMI) between one SRCV and one MRCV, or simultaneous pairwise marginal independence (SPMI) between two MRCVs.
MI.test(data, I, J, type = "all", B = 1999, B.max = B, summary.data = FALSE, add.constant = 0.5, plot.hist = FALSE, print.status = TRUE) MI.stat(data, I, J, summary.data = FALSE, add.constant = 0.5)
MI.test(data, I, J, type = "all", B = 1999, B.max = B, summary.data = FALSE, add.constant = 0.5, plot.hist = FALSE, print.status = TRUE) MI.stat(data, I, J, summary.data = FALSE, add.constant = 0.5)
data |
For For |
I |
The number of items corresponding to row variable W. I = 1 for the one MRCV case. |
J |
The number of items corresponding to column variable Y. |
type |
A character string specifying one of the following approaches for testing for MI: |
B |
The desired number of bootstrap resamples. |
B.max |
The maximum number of bootstrap resamples. A resample is thrown out if at least one of the J (one MRCV case) or IxJ (two MRCV case) contingency tables does not have the correct dimension; |
summary.data |
A logical value indicating whether |
add.constant |
A positive constant to be added to all zero marginal cell counts. |
plot.hist |
A logical value indicating whether plots of the emprical bootstrap sampling distributions should be provided. |
print.status |
A logical value indicating whether bootstrap progress updates should be provided. |
The MI.test
function calls MI.stat
to calculate a modified Pearson statistic (see Bilder, Loughin, and Nettleton (2000) and Bilder and Loughin (2004)), and then performs the testing of MMI or SPMI. Three sets of testing methods are implemented:
The nonparametric bootstrap resamples under the null hypothesis by independently sampling the W and Y vectors with replacement from the observed data. Fixed row counts (i.e., fixed counts for each level of the SRCV) are maintained for the one MRCV case. A modified Pearson statistic is calculated for each resample. In addition, bootstrap p-value combination methods are available to take advantage of the modified Pearson statistic's decomposition into J (one MRCV case) or IxJ (two MRCV case) individual Pearson statistics. The minimum or the product of p-values is the combination for each resample.
The Rao-Scott approach applies a second-order adjustment to the modified Pearson statistic and its sampling distribution. Formulas are provided in Appendix A of Bilder, Loughin, and Nettleton (2000) and Bilder and Loughin (2004). Note that this test can be conservative at times.
The Bonferroni adjustment multiplies each p-value (using a standard chi-square approximation) from the individual Pearson statistics by J (one MRCV case) or IxJ (two MRCV case). If a resulting p-value is greater than 1 after the multiplication, it is set to a value of 1. The overall p-value for the test then is the minimum of these adjusted p-values. Note that the Bonferroni adjustment tends to produce an overly conservative test when the number of individual Pearson statistics is large.
Agresti and Liu (1999) discuss a marginal logit model approach that uses generalized estimation equations (GEE) to test for MMI. As shown in the example given below, this approach can be performed via functions available from the geepack package. However, Bilder, Loughin, and Nettleton (2000) caution that the Wald test produced by this approach does not hold the correct size, particularly when the sample size is not large and marginal probabilities are small.
— MI.test
returns a list containing at least general
, a list containing the following objects:
data
: The original data frame supplied to the data
argument.
I
: The original value supplied to the I
argument.
J
: The original value supplied to the J
argument.
summary.data
: The original value supplied to the summary.data
argument.
X.sq.S
: The modified Pearson statistic; NA if at least one of the J (one MRCV case) or IxJ (two MRCV case) contingency tables does not have the correct dimension.
X.sq.S.ij
: A matrix containing the individual Pearson statistics.
— For type = "boot"
, the primary list additionally includes boot
, a list containing the following objects:
B.use
: The number of bootstrap resamples used.
B.discard
: The number of bootstrap resamples discarded due to having at least one contingency table with incorrect dimension.
p.value.boot
: The bootstrap p-value for the test of MMI or SPMI.
p.combo.min.boot
: The bootstrap p-value for the minimum p-value combination method.
p.combo.prod.boot
: The bootstrap p-value for the product p-value combination method.
X.sq.S.star
: A numeric vector containing the modified Pearson statistics calculated for each resample.
X.sq.S.ij.star
: A matrix containing the individual Pearson statistics calculated for each resample.
p.combo.min.star
: A numeric vector containing the minimum p-value calculated for each resample.
p.combo.prod.star
: A numeric vector containing the product p-value calculated for each resample.
— For type = "rs2"
, the primary list additionally includes rs2
, a list containing the following objects:
X.sq.S.rs2
: The Rao-Scott second-order adjusted Pearson statistic.
df.rs2
: The degrees of freedom for testing the second-order Rao-Scott adjusted Pearson statistic.
p.value.rs2
: The p-value based on the Rao-Scott second-order adjustment.
— For type = "bon"
, the primary list additionally includes bon
, a list containing the following objects:
p.value.bon
: The Bonferroni adjusted p-value for the test of MMI or SPMI.
X.sq.S.ij.p.bon
: A matrix containing the Bonferroni adjusted p-values for the individual Pearson statistics.
— For type = "all"
, the list includes all of the above objects.
— MI.stat
returns a list containing the following objects:
X.sq.S
: Defined above.
X.sq.S.ij
: Defined above.
valid.margins
: The number of contingency tables with correct dimension.
Agresti, A. and Liu, I.-M. (1999) Modeling a categorical variable allowing arbitrarily many category choices. Biometrics, 55, 936–943.
Bilder, C. and Loughin, T. (2004) Testing for marginal independence between two categorical variables with multiple responses. Biometrics, 36, 433–451.
Bilder, C., Loughin, T., and Nettleton, D. (2000) Multiple marginal independence testing for pick any/c variables. Communications in Statistics–Theory and Methods, 29, 1285–1316.
The genloglin
function offers a generalized loglinear modeling approach for testing the relationship among two or three MRCVs.
# Test for MMI using the second-order Rao-Scott adjustment test.mmi.rs2 <- MI.test(data = farmer1, I = 1, J = 5, type = "rs2") test.mmi.rs2 # Test for MMI using all three approaches # A small B is used for demonstration purposes; normally, a larger B should be used test.mmi.all <- MI.test(data = farmer1, I = 1, J = 5, type = "all", B = 99, plot.hist = TRUE) test.mmi.all # Use MI.test() with summary data # Convert raw data file to summary file for this example farmer1.irdframe <- item.response.table(data = farmer1, I = 1, J = 5, create.dataframe = TRUE) # Test for MMI using the Bonferroni adjustment test.mmi.bon <- MI.test(data = farmer1.irdframe, I = 1, J = 5, type = "bon", summary.data = TRUE) test.mmi.bon # Test for SPMI using the second-order Rao-Scott adjustment test.spmi.rs2 <- MI.test(data = farmer2, I = 3, J = 4, type = "rs2") test.spmi.rs2 # Test for MMI using the marginal logit model approach library(geepack) n<-nrow(farmer1) farmer1.id<-cbind(case=1:n, farmer1) # Reshape raw data into long format as required by geeglm() function # Assumes 3:ncol(farmer1.id) corresponds to MRCV items farmer1.gee<-reshape(data = farmer1.id, varying = names(farmer1.id)[3:ncol(farmer1.id)], v.names = "response", timevar = "item", idvar = "case", direction = "long") row.names(farmer1.gee)<-NULL farmer1.gee[,2:3]<-lapply(farmer1.gee[,2:3], factor) # Data frame must be ordered by case farmer1.gee<-farmer1.gee[order(farmer1.gee$case),] head(farmer1.gee) tail(farmer1.gee) mod.fit.H0<-geeglm(formula = response ~ item, family = binomial(link = logit), data = farmer1.gee, na.action = na.omit, id = case, corstr = "unstructured") mod.fit.HA<-geeglm(formula = response ~ Ed*item, family = binomial(link = logit), data = farmer1.gee, na.action = na.omit, id = case, corstr = "unstructured") # Compute Wald test anova(mod.fit.HA, mod.fit.H0)
# Test for MMI using the second-order Rao-Scott adjustment test.mmi.rs2 <- MI.test(data = farmer1, I = 1, J = 5, type = "rs2") test.mmi.rs2 # Test for MMI using all three approaches # A small B is used for demonstration purposes; normally, a larger B should be used test.mmi.all <- MI.test(data = farmer1, I = 1, J = 5, type = "all", B = 99, plot.hist = TRUE) test.mmi.all # Use MI.test() with summary data # Convert raw data file to summary file for this example farmer1.irdframe <- item.response.table(data = farmer1, I = 1, J = 5, create.dataframe = TRUE) # Test for MMI using the Bonferroni adjustment test.mmi.bon <- MI.test(data = farmer1.irdframe, I = 1, J = 5, type = "bon", summary.data = TRUE) test.mmi.bon # Test for SPMI using the second-order Rao-Scott adjustment test.spmi.rs2 <- MI.test(data = farmer2, I = 3, J = 4, type = "rs2") test.spmi.rs2 # Test for MMI using the marginal logit model approach library(geepack) n<-nrow(farmer1) farmer1.id<-cbind(case=1:n, farmer1) # Reshape raw data into long format as required by geeglm() function # Assumes 3:ncol(farmer1.id) corresponds to MRCV items farmer1.gee<-reshape(data = farmer1.id, varying = names(farmer1.id)[3:ncol(farmer1.id)], v.names = "response", timevar = "item", idvar = "case", direction = "long") row.names(farmer1.gee)<-NULL farmer1.gee[,2:3]<-lapply(farmer1.gee[,2:3], factor) # Data frame must be ordered by case farmer1.gee<-farmer1.gee[order(farmer1.gee$case),] head(farmer1.gee) tail(farmer1.gee) mod.fit.H0<-geeglm(formula = response ~ item, family = binomial(link = logit), data = farmer1.gee, na.action = na.omit, id = case, corstr = "unstructured") mod.fit.HA<-geeglm(formula = response ~ Ed*item, family = binomial(link = logit), data = farmer1.gee, na.action = na.omit, id = case, corstr = "unstructured") # Compute Wald test anova(mod.fit.HA, mod.fit.H0)
The predict.genloglin
method function calculates observed and model-predicted odds ratios and their confidence intervals using results from genloglin
. It offers an asymptotic normal approximation for estimating the confidence intervals for the observed and model-predicted odds ratios, and a bootstrap approach for estimating the confidence intervals for the model-predicted odds ratios.
## S3 method for class 'genloglin' predict(object, alpha = 0.05, pair = "WY", print.status = TRUE, ...)
## S3 method for class 'genloglin' predict(object, alpha = 0.05, pair = "WY", print.status = TRUE, ...)
object |
An object of class |
alpha |
The desired alpha level. The |
pair |
For the case of three MRCVs, a character string specifying the pair of items for which odds ratios will be calculated: |
print.status |
A logical value indicating whether bootstrap progress updates should be provided. |
... |
Additional arguments passed to or from other methods. |
Wald confidence intervals are estimated for both model-based (see Appendix A of Bilder and Loughin, 2007) and observed (see Agresti, 2013, p. 70) odds ratios.
A bootstrap method is also available which provides bias-corrected accelerated (BCa) confidence intervals for the model-predicted odds ratios. See Efron (1987) for more information about BCa intervals. The predict.genloglin
function uses a jackknife approximation for estimating the empirical influence values.
The bootstrap confidence intervals are available only when boot = TRUE
in the original call to the genloglin
function.
— A list containing at least original.arg
, OR.obs
, and OR.model.asymp
.
original.arg
is a list containing the following objects:
data
: The original data frame supplied to the data
argument.
I
: The original value supplied to the I
argument.
J
: The original value supplied to the J
argument.
K
: The original value supplied to the K
argument.
nvars
: The number of MRCVs.
alpha
: The original value supplied to the alpha
argument.
OR.obs
is a numeric matrix. For the two MRCV case, the matrix contains IxJ rows corresponding to the IxJ possible pairs (Wi, Yj) and 3 columns, where column 1 corresponds to the observed odds ratio for (Wi, Yj) and columns 2 and 3 correspond to the estimated lower and upper confidence bounds, respectively. For the three MRCV case, the matrix contains 2xIxJxK rows corresponding to all possible combinations of pair
conditional on the response for each item of the 3rd MRCV, and 3 columns as described for the 2 MRCV case.
OR.model.asymp
is a numeric matrix similar to OR.obs
but where column 1 corresponds to the model-predicted odds ratios and columns 2 and 3 correspond to the estimated lower and upper confidence bounds, respectively, using an asymptotic normal approximation.
— For boot = TRUE
in the call to the genloglin
function, the primary list additionally includes boot.results
, a list containing the following objects:
B.use
: The number of bootstrap resamples used.
B.discard
: The number of bootstrap resamples discarded due to having at least one item with all positive or negative responses.
OR.model.BCa
: A numeric matrix similar to OR.obs
but where column 1 corresponds to the model-predicted odds ratios and columns 2 and 3 correspond to the estimated lower and upper confidence bounds, respectively, of the BCa intervals.
Agresti, A. (2013) Categorical data analysis (3rd ed.). Hoboken, New Jersey: John Wiley & Sons.
Bilder, C. and Loughin, T. (2007) Modeling association between two or more categorical variables that allow for multiple category choices. Communications in Statistics–Theory and Methods, 36, 433–451.
Efron, B. (1987) Better bootstrap confidence intervals. Journal of the American Statistical Association, 82, 171–185.
## For examples see help(genloglin).
## For examples see help(genloglin).
Method functions that control the printed display of MRCV regression modeling objects.
## S3 method for class 'genloglin' print(x, digits = max(3, getOption("digits") - 3), ...) ## S3 method for class 'summary.genloglin' print(x, digits = max(3, getOption("digits") - 3), symbolic.cor = x$symbolic.cor, signif.stars = getOption("show.signif.stars"), ...) ## S3 method for class 'anova.genloglin' print(x, ...) ## S3 method for class 'predict.genloglin' print(x, ...)
## S3 method for class 'genloglin' print(x, digits = max(3, getOption("digits") - 3), ...) ## S3 method for class 'summary.genloglin' print(x, digits = max(3, getOption("digits") - 3), symbolic.cor = x$symbolic.cor, signif.stars = getOption("show.signif.stars"), ...) ## S3 method for class 'anova.genloglin' print(x, ...) ## S3 method for class 'predict.genloglin' print(x, ...)
x |
An object of class |
digits |
Minimum number of digits; see |
symbolic.cor |
A logical value indicating whether correlations should be printed in symbolic form; see the |
signif.stars |
A logical value indicating whether significance stars should be printed; see the |
... |
Additional arguments passed to or from other methods. |
The print.genloglin
function is based on the print
method for class "glm"
.
The print.summary.genloglin
function is based on the print
method for class "summary.glm"
.
A print out of selected results.
The print.MMI
method function controls the printed display of objects of class MMI
.
## S3 method for class 'MMI' print(x, ...)
## S3 method for class 'MMI' print(x, ...)
x |
An object of class |
... |
Additional arguments passed to or from other methods. |
A print out of selected results from MI.test
.
The print.SPMI
method function controls the printed display of objects of class SPMI
.
## S3 method for class 'SPMI' print(x, ...)
## S3 method for class 'SPMI' print(x, ...)
x |
An object of class |
... |
Additional arguments passed to or from other methods. |
A print out of selected results from MI.test
.
The residuals.genloglin
method function calculates standardized Pearson residuals for the model specified in the genloglin
function. It offers an asymptotic approximation and a bootstrap approximation for estimating the variance of the residuals.
## S3 method for class 'genloglin' residuals(object, ...)
## S3 method for class 'genloglin' residuals(object, ...)
object |
An object of class |
... |
Additional arguments passed to or from other methods. |
The bootstrap results are only available when boot = TRUE
in the call to the genloglin
function.
The residuals.genloglin
function uses tables:tabular()
to display the results for the two MRCV case.
See Bilder and Loughin (2007) for additional details about calculating the residuals.
— A list containing at least std.pearson.res.asymp.var
. For the two MRCV case, the object is a 2Ix2J table of class 'tabular'
containing the standardized Pearson residuals based on the estimated asymptotic variance. For the three MRCV case, the object is a data frame containing the 2Ix2Jx2K residuals.
— For boot = TRUE
in the call to the genloglin
function, the list additionally includes:
B.use
: The number of bootstrap resamples used.
B.discard
: The number of bootstrap resamples discarded due to having at least one item with all positive or negative responses.
std.pearson.res.boot.var
: For the two MRCV case, a 2Ix2J table of class 'tabular'
containing the standardized Pearson residuals based on the bootstrap variance. For the three MRCV case, a data frame containing the 2Ix2Jx2K residuals.
Bilder, C. and Loughin, T. (2007) Modeling association between two or more categorical variables that allow for multiple category choices. Communications in Statistics–Theory and Methods, 36, 433–451.
## For examples see help(genloglin).
## For examples see help(genloglin).
The sealion data frame is a portion of the data analyzed by Riemer, Wright, and Brown (2011). The data represents the types of prey found in Steller sea lion scats found at the mouth of the Columbia river in August 2004 and 2007. The goal was to determine if the diet habits of the sea lions had changed over time.
sealion
sealion
The data frame contains the following 12 columns:
Column 1, labeled Date
, corresponds to the date of data collection. Two collection dates, 8/10/2004 and 8/7/2007, are included in the data frame.
Columns 2-12 correspond to the types of prey found in Steller sea lion scats. Binary responses (1 = Present in scat, 0 = Not present in scat) are provided for each category.
prey1
: Unidentified fish
prey2
: Pacific lamprey (Lampetra tridentata)
prey3
: Starry flounder (Platichthys stellatus)
prey4
: Pacific sardine (Sardinops sagax)
prey5
: Pacific herring (Clupea pallasii)
prey6
: Unidentified clupeid (family Clupeidae)
prey7
: Unidentified skate (family Rajidae)
prey8
: Northern anchovy (Engraulis mordax)
prey9
: Pacific salmon (Oncorhynchus spp.)
prey10
: Pacific staghorn sculpin (Leptocottus armatus)
prey11
: Pacific hake (Merluccius productus)
Riemer, S. D., Wright, B. E., and Brown, R. F. (2011) Food habits of Steller sea lions (Eumetopias jubatus) off Oregon and northern California, 1986-2007. Fishery Bulletin, 109, 369–381.
The summary.genloglin
function summarizes model fit information provided by the genloglin
function.
## S3 method for class 'genloglin' summary(object, ...)
## S3 method for class 'genloglin' summary(object, ...)
object |
An object of class |
... |
Additional arguments passed to or from other methods. |
The summary.genloglin
function is based on the summary
method for class "glm"
with a few modifications. The coefficients
object contains Rao-Scott second-order adjusted standard errors, z-values, and p-values. The cov.unscaled
object contains the Rao-Scott second-order adjusted covariance matrix of the estimated coefficients.
The deviance information printed by summary.genloglin
should not be used to conduct traditional model comparison tests. The anova.genloglin
function offers adjusted tests.
The summary.genloglin
function returns the same list returned by the summary
method for class "glm"
with the exception of AIC.
## For examples see help(genloglin).
## For examples see help(genloglin).