Title: | Categorical Data Analysis Program Package |
---|---|
Description: | Categorical data analysis by AIC. The methodology is described in Sakamoto (1992) <ISBN 978-0-7923-1429-5>. |
Authors: | The Institute of Statistical Mathematics |
Maintainer: | Masami Saga <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.3.7 |
Built: | 2024-12-03 06:32:03 UTC |
Source: | CRAN |
R functions for categorical data analysis
This package provides functions for analyzing multivariate data. The dependence of the distribution of a specified variable (the response variable) on other variables (the explanatory variables) is derived and evaluated by AIC (Akaike Information Criterion).
Functions catdap1 and catdap1c are for the analysis of categorical data. Every variable is specified as the response variable in turn, and the goodness of the other variables as explanatory variables for it is evaluated by AIC.
Function catdap2 can be applied to data in which categorical and numerical variables are mixed. With one variable specified as the response variable, the dependence of its distribution on sets of other variables is investigated. If the response variable is categorical, the contingency table analysis method is employed. If the response variable is numerical, it is categorized by pooling, which reduces the problem to the categorical response variable case. This method eventually finds the dependency of the histogram of the numerical response variable on sets of explanatory variables.
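The reduction of a numerical response to the categorical case can be pictured with base R alone. The block below is a rough sketch that uses fixed, equal-width bins chosen by hand (an assumption made purely for illustration); catdap2 itself determines the categorization through its pool and accuracy arguments, and its result may differ.

```r
# Sketch only: hand-picked equal-width bins, not catdap2's own pooling.
data(iris)

# Hypothetical bin boundaries covering the observed range of Petal.Width.
breaks <- seq(0, 2.5, by = 0.5)

# Categorize the numerical response variable.
petal_width_cat <- cut(iris$Petal.Width, breaks = breaks)

# The problem is now an ordinary two-way contingency table:
# rows = pooled response categories, columns = explanatory variable.
tab <- table(Petal.Width = petal_width_cat, Species = iris$Species)
print(tab)
```

Each column of the table is a histogram of the response for one category of the explanatory variable, which is the kind of dependency the method evaluates.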
The Fortran source code for the above functions was published in Sakamoto, Ishiguro and Kitagawa (1983) and in Frontiers of Time Series Modeling 3: Modeling Seasonality & Periodicity, ISM (2002), respectively.
Y.Sakamoto and H.Akaike (1978) Analysis of Cross-Classified Data by AIC. Ann. Inst. Statist. Math., 30, pp.185-197.
K.Katsura and Y.Sakamoto (1980) Computer Science Monograph, No.14, CATDAP, A Categorical Data Analysis Program Package. The Institute of Statistical Mathematics.
Y.Sakamoto, M.Ishiguro and G.Kitagawa (1983) Information Statistics. Kyoritsu Shuppan Co., Ltd., Tokyo. (in Japanese)
Y.Sakamoto (1985) Model Analysis of Categorical Data. Kyoritsu Shuppan Co., Ltd., Tokyo. (in Japanese)
Y.Sakamoto (1985) Categorical Data Analysis by AIC. Kluwer Academic Publishers.
An AIC-based Tool for Data Visualization (2015), NTT DATA Mathematical Systems Inc. (in Japanese)
Create bar plots for output two-way tables of catdap1() or catdap2().
Barplot2WayTable(x, exvar = NULL, gray.shade = FALSE)
x |
an output object of |
exvar |
names of the explanatory variables. Default is all variables
except |
gray.shade |
A logical value indicating whether the gamma-corrected grey
palette should be used. If |
For continuous variables, we assume that
are boundary values
of
bins. Output value ranges
are
defined as follows :
# catdap1c (Titanic data)
resvar <- "Survived"
z1 <- catdap1c(Titanic, resvar)
Barplot2WayTable(z1)

# catdap2 (Edgar Anderson's Iris Data)
# "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
data(iris)
resvar <- "Petal.Width"
z2 <- catdap2(iris, c(0, 0, 0, -7, 2), resvar, c(0.1, 0.1, 0.1, 0.1, 0))
exvar <- c("Sepal.Length", "Petal.Length")
Barplot2WayTable(z2, exvar)
Calculates the degree of association between all the possible pairs of categorical variables.
catdap1(cdata, response.names = NULL, plot = 1, gray.shade = FALSE, ask = TRUE)
catdap1c(ctable, response.names = NULL, plot = 1, gray.shade = FALSE, ask = TRUE)
cdata |
categorical data matrix with variable names on the first row. |
ctable |
cross-tabulation table with a list of variable names. |
response.names |
variable names of response variables. If |
plot |
split directions for each level of the mosaic:
|
gray.shade |
A logical value indicating whether the gamma-corrected grey
palette should be used. If |
ask |
logical; if |
This function is an R-function style clone of Sakamoto's CATDAP-01 program for categorical data analysis. CATDAP-01 calculates the degree of association between all the possible pairs of categorical variables.
The degree of association is evaluated by AIC value. See help(catdap2) for details about AIC.
catdap2 should be used when the best subset and categorization of explanatory variables are sought. With catdap2, continuous variables can also be used as explanatory variables.
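The pairwise evaluation can be sketched in base R without the package. The block below is an illustration only: it assumes that the AIC value reported for a pair is the difference between the AIC of the model in which the response depends on the explanatory variable and that of the null model, as in Sakamoto and Akaike (1978). The helper name `aic_diff` is hypothetical and is not part of catdap.

```r
# Hypothetical helper (not part of catdap): AIC difference between the model
# "response depends on the explanatory variable" and the null model.
# `tab` is a two-way table with the response variable in rows.
aic_diff <- function(tab) {
  counts <- unclass(tab)                                    # plain numeric matrix
  n <- sum(counts)
  expected <- outer(rowSums(counts), colSums(counts)) / n   # independence fit
  nz <- counts > 0                                          # treat 0 * log(0) as 0
  -2 * sum(counts[nz] * log(counts[nz] / expected[nz])) +
    2 * (nrow(counts) - 1) * (ncol(counts) - 1)
}

data(Titanic)
vars <- names(dimnames(Titanic))
response <- "Survived"
explanatory <- setdiff(vars, response)

# One two-way margin per explanatory variable, response in rows.
aic <- sapply(explanatory, function(v) {
  tab <- margin.table(Titanic, match(c(response, v), vars))
  aic_diff(tab)
})

# Explanatory variables in ascending order of AIC (smaller is better).
print(sort(aic))
```

Looping the same computation over every choice of response variable gives the "all possible pairs" view described above.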
tway.table |
two-way tables and ratio. |
total |
total number of data with corresponding code of variables. |
aic |
AIC's of explanatory variables for each response variable. |
aic.order |
list of explanatory variable numbers arranged in ascending order of AIC. |
Y.Sakamoto and H.Akaike (1978) Analysis of Cross-Classified Data by AIC. Ann. Inst. Statist. Math., 30, pp.185-197.
K.Katsura and Y.Sakamoto (1980) Computer Science Monograph, No.14, CATDAP, A Categorical Data Analysis Program Package. The Institute of Statistical Mathematics.
Y.Sakamoto, M.Ishiguro and G.Kitagawa (1983) Information Statistics. Kyoritsu Shuppan Co., Ltd., Tokyo. (in Japanese)
Y.Sakamoto (1985) Categorical Data Analysis by AIC. Kluwer Academic Publishers.
## example 1 (The Japanese National Character)
data(JNcharacter)
response <- c("born.again", "difficult", "pleasure", "women.job", "money")
catdap1(JNcharacter, response)

# or, simply
data(JNcharacter)
catdap1(JNcharacter)

## example 2 (Titanic data)
# A data set with 2201 observations on 4 variables (Class, Sex, Age and Survived)

# cross-tabulating data
catdap1c(Titanic, "Survived")

# individual data
x <- data.frame(Titanic)
y <- data.matrix(x)
n <- dim(y)[1]
nc <- dim(y)[2]
z <- array(, dim = c(nc-1, sum(y[, 5])))
k <- 1
for (i in 1:n)
  if (y[i, nc] != 0) {
    np <- y[i, nc]
    for (j in 1:(nc-1))
      z[j, k:(k+np-1)] <- dimnames(Titanic)[[j]][[y[i, j]]]
    k <- k + np
  }
data <- data.frame(aperm(array(z, dim = c(4,2201)), c(2,1)), stringsAsFactors = TRUE)
names(data) <- names(dimnames(Titanic))
catdap1(data, "Survived")
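The expansion of the cross-tabulated Titanic data into individual records in example 2 can also be written more compactly. The following is an alternative base-R sketch, not the package authors' code: it repeats each cell of the table as many times as its frequency.

```r
data(Titanic)

# One row per cell of the 4-way table, with the cell count in `Freq`.
counts <- as.data.frame(Titanic)

# Repeat each row index `Freq` times to obtain one row per passenger.
rows <- rep(seq_len(nrow(counts)), counts$Freq)
individual <- counts[rows, names(dimnames(Titanic))]
rownames(individual) <- NULL

str(individual)
# With catdap loaded, `individual` could then be passed on, e.g.
# catdap1(individual, "Survived")
```

Because as.data.frame() keeps the factor levels of the table, cross-tabulating `individual` again reproduces the original counts.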
Search for the best single explanatory variable and detect the best subset of explanatory variables.
catdap2(data, pool = NULL, response.name, accuracy = NULL, nvar = NULL, additional.output = NULL, missingmark = NULL, pa1 = 1, pa2 = 4, pa3 = 10, print.level = 0, plot = 1, gray.shade = FALSE)
data |
data matrix with variable names on the first row. |
pool |
the ways of pooling to categorize each variable must be specified by integer parameters:
|
response.name |
variable name of the response variable. |
accuracy |
minimum width for the discretization for each variable. |
nvar |
number of variables to be retained for the analysis of
multidimensional tables. Default is the number of variables in |
additional.output |
list of sets of explanatory variable names for additional output. |
missingmark |
positive number for handling missing value. See 'Details'. |
pa1 , pa2 , pa3
|
control parameters for the size of the working area. If an error message is output, change the value of the indicated parameter as the message suggests. |
print.level |
this argument determines the level of output printing. The
default value of ' |
plot |
split directions of the mosaic plot for single explanatory models and minimum AIC model:
|
gray.shade |
A logical value indicating whether the gamma-corrected grey
palette should be used. If |
This function is an R-function style clone of Sakamoto's CATDAP-02 program for categorical data analysis. CATDAP-02 can be used to search for the best subset of explanatory variables, that is, the subset carrying the most effective information on a specified response variable. Continuous variables can also be used as explanatory variables; in that case CATDAP-02 searches for an optimal categorization of the continuous values.
The basic statistic adopted is obtained by applying AIC to the models. Let $I_0$ denote the response variable and $I$ a candidate explanatory variable, with cell frequencies $n(i_0)$ and $n(i)$, respectively. The cross frequency is denoted by $n(i_0, i)$. To measure the strength of dependence of a specific set of response variables $I_0$ on the explanatory variable $I$, we use the following statistic:
$$\mathrm{AIC}(I_0; I) = -2 \sum_{i_0, i} n(i_0, i) \ln \frac{n(i_0, i)}{n(i)} + 2 C_I (C_{I_0} - 1), \qquad (1)$$
where $C_{I_0}$ and $C_I$ denote the total number of categories of the corresponding sets of variables, respectively.
The selection of the best subset of explanatory variables is realized by searching for the $I$ that gives the minimum $\mathrm{AIC}(I_0; I)$.
In the case of $I = \phi$ (no explanatory variable), formula (1) reduces to
$$\mathrm{AIC}(I_0; \phi) = -2 \sum_{i_0} n(i_0) \ln \frac{n(i_0)}{n} + 2 (C_{I_0} - 1).$$
Here it is assumed that $C_{\phi} = 1$ and $n(\phi) = n$.
Sakamoto's original CATDAP outputs $\mathrm{AIC}(I_0; I) - \mathrm{AIC}(I_0; \phi)$ as the AIC value instead of $\mathrm{AIC}(I_0; I)$. In this way, a positive value of AIC indicates that the variable $I$ is judged to be useless as an explanatory variable for $I_0$.
On the other hand, this policy makes it impossible to compare the goodness of the CATDAP model with that of other models, logit models for example.
For the convenience of users, the present "R version CATDAP" provides not only $\mathrm{AIC}(I_0; I) - \mathrm{AIC}(I_0; \phi)$ but also $\mathrm{AIC}(I_0; \phi)$. The latter value is given as base_AIC in the output.
Users can recover $\mathrm{AIC}(I_0; I)$ by adding AIC and base_AIC.
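The relationship between the reported AIC, base_AIC and the full model AIC can be checked numerically. The sketch below is a base-R illustration under stated assumptions: the statistic is taken to be $-2 \sum n(i_0, i) \ln\{n(i_0, i)/n(i)\} + 2 C_I (C_{I_0} - 1)$ for the model and $-2 \sum n(i_0) \ln\{n(i_0)/n\} + 2 (C_{I_0} - 1)$ for the null model, the counts are invented, and the function name `catdap_aic` is hypothetical (it is not exported by the package).

```r
# Hypothetical counts: rows = response categories, columns = explanatory categories.
tab <- matrix(c(30, 10, 20, 40), nrow = 2,
              dimnames = list(response = c("A", "B"),
                              explanatory = c("x1", "x2")))

# Hypothetical helper, not part of catdap.
catdap_aic <- function(tab) {
  n_i  <- colSums(tab)              # n(i): explanatory marginal
  n_i0 <- rowSums(tab)              # n(i0): response marginal
  n    <- sum(tab)
  c_i  <- ncol(tab)                 # number of explanatory categories
  c_i0 <- nrow(tab)                 # number of response categories
  cond <- sweep(tab, 2, n_i, "/")   # n(i0, i) / n(i)
  nz   <- tab > 0                   # treat 0 * log(0) as 0
  full <- -2 * sum(tab[nz] * log(cond[nz])) + 2 * c_i * (c_i0 - 1)
  pos  <- n_i0 > 0
  base <- -2 * sum(n_i0[pos] * log(n_i0[pos] / n)) + 2 * (c_i0 - 1)
  # `aic` is the difference that CATDAP reports; `base_aic` is the null model.
  list(aic = full - base, base_aic = base, full_aic = full)
}

res <- catdap_aic(tab)
print(res)
```

A negative `aic` here means the explanatory variable improves on the null model, and `res$aic + res$base_aic` gives back the full model AIC that can be compared with other model families.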
missingmark enables missing value handling. When a positive value, say $x$, is set here, any value $y$ greater than or equal to $x$ is treated as a missing value. If $x \le y < 2x$, $y$ is treated as a missing value of the 1st type. If $2x \le y < 3x$, $y$ is treated as a missing value of the 2nd type, and so on. Generally speaking, any $y$ such that $kx \le y < (k+1)x$ is treated as the $k$-th type missing value. Users are referred to the reference for the technical details of the missing value handling procedure.
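The typing rule for missing values can be mimicked in a few lines of base R. This is only a sketch of the rule as read from the description above (values from one to just under two times the mark are type 1, from two to just under three times the mark are type 2, and so on); the function name `missing_type` is hypothetical and the package's internal handling may differ in detail.

```r
# Hypothetical helper, not part of catdap.
# Returns 0 for ordinary values and k for a k-th type missing value.
missing_type <- function(y, mark) {
  stopifnot(is.numeric(y), is.numeric(mark), length(mark) == 1, mark > 0)
  ifelse(y >= mark, floor(y / mark), 0)
}

# Example with mark = 300, the value used for MissingHealthData.
values <- c(120, 299.9, 300, 599, 600, 950)
types <- missing_type(values, mark = 300)
print(data.frame(value = values, missing_type = types))
```

Coding different kinds of missingness as different multiples of the mark keeps them distinguishable in a purely numeric column.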
For continuous variables, we assume that
are boundary values
of
bins. Output value ranges
are
defined as follows :
Specifically, for continuous response variable ,
where and
are the minimum and the
maximum of variable V, respectively, and
.
tway.table |
two-way tables. |
total |
total number of data with corresponding code of variables. |
interval |
class interval for continuous and discrete explanatory variables. |
base.aic |
base_AIC. |
aic |
AIC's of single explanatory variables. |
aic.order |
list of explanatory variable numbers arranged in ascending order of AIC. |
nsub |
number of subsets of explanatory variables. |
subset |
list of subsets of explanatory variables in ascending order of AIC with the following components:
|
ctable |
list of contingency table constructed by the best subset and additional
subsets if any variables is specified by
|
missing |
number of types of the missing values for each variable. |
K.Katsura and Y.Sakamoto (1980) Computer Science Monograph, No.14, CATDAP, A Categorical Data Analysis Program Package. The Institute of Statistical Mathematics.
Y.Sakamoto (1985) Model Analysis of Categorical Data. Kyoritsu Shuppan Co., Ltd., Tokyo. (in Japanese)
Y.Sakamoto (1985) Categorical Data Analysis by AIC. Kluwer Academic Publishers.
An AIC-based Tool for Data Visualization (2015), NTT DATA Mathematical Systems Inc. (in Japanese)
# Example 1 (medical data "HealthData")
# as additional output, contingency tables for explanatory variable sets
# c("aortic.wav","min.press") and c("ecg","age") are obtained.
data(HealthData)
catdap2(HealthData, c(2, 2, 2, 0, 0, 0, 0, 2), "symptoms",
        c(0., 0., 0., 1., 1., 1., 0.1, 0.), ,
        list(c("aortic.wav", "min.press"), c("ecg", "age")))

# Example 2 (Edgar Anderson's Iris Data)
# continuous response variable handling and the usage of Barplot2WayTable
# function to visualize the result in the shape of a stacked histogram.
data(iris)
resvar <- "Petal.Width"
z <- catdap2(iris, c(0, 0, 0, -7, 2), resvar, c(0.1, 0.1, 0.1, 0.1, 0))
z
exvar <- c("Sepal.Length", "Petal.Length")
Barplot2WayTable(z, exvar)

# Example 3 (in the case of a large number of variables)
data(HelloGoodbye)
pool <- rep(2, 56)

## using the default values of parameters pa1, pa2, pa3
## catdap2(HelloGoodbye, pool, "Isay", nvar = 10, print.level = 1, plot = 0)
## Error : Working area for contingency table is too short, try pa1 = 12.

### According to the error message, set the parameter pa1 at 12, then ..
catdap2(HelloGoodbye, pool, "Isay", nvar = 10, pa1 = 12, print.level = 1, plot = 0)

# Example 4 (HealthData with missing values)
data(MissingHealthData)
catdap2(MissingHealthData, c(2, 2, 2, 0, 0, 0, 0, 2), "symptoms",
        c(0., 0., 0., 1., 1., 1., 0.1, 0.), missingmark = 300)
Medical data containing both continuous and categorical explanatory variables.
data(HealthData)
A data frame with 52 observations on the following 8 variables.
Part of the source data was recoded according to an input example of the original program CATDAP-02. In addition, we converted 1 to 'A' and 2 to 'B' in the symptoms data, and converted cholesterol values less than 198 to 'low' and all others to 'high'.
[, 1] | opthalmo. | 1, 2 | |
[, 2] | ecg | 1, 2 | |
[, 3] | symptoms | A, B | |
[, 4] | age | 49-59 | |
[, 5] | max.press | 98-216 | |
[, 6] | min.press | 56-120 | |
[, 7] | aortic.wav | 6.3-10.2 | |
[, 8] | cholesterol | low, high |
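The recoding of symptoms and cholesterol described for this data set corresponds to a simple transformation. The sketch below applies it to a few invented raw values (they are not taken from the original data) to show where the cut-off at 198 falls.

```r
# Hypothetical raw values, invented for illustration only.
raw <- data.frame(symptoms    = c(1, 2, 2, 1),
                  cholesterol = c(180, 198, 250, 197))

recoded <- data.frame(
  # 1 becomes "A" and 2 becomes "B"
  symptoms = factor(ifelse(raw$symptoms == 1, "A", "B"),
                    levels = c("A", "B")),
  # values below 198 become "low", all others "high"
  cholesterol = factor(ifelse(raw$cholesterol < 198, "low", "high"),
                       levels = c("low", "high"))
)
print(recoded)
```

Note that a cholesterol value of exactly 198 falls into 'high', since only values strictly less than 198 are coded 'low'.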
K.Katsura and Y.Sakamoto (1980) Computer Science Monograph, No.14, CATDAP, A Categorical Data Analysis Program Package, DATA No.2. The Institute of Statistical Mathematics.
Y.Sakamoto (1985) Categorical Data Analysis by AIC, p. 74. Kluwer Academic Publishers.
Real data contributed by an anonymous organization. We borrowed the wording of a famous song to hide the true nature of the data.
data(HelloGoodbye)
A data frame with 13954 observations (rows) and 56 variables (columns).
An anonymous organization.
A part of the Survey on the Japanese National Character.
data(JNcharacter)
A data frame with 85 observations on the following 10 variables.
Part of the source data was deleted and recoded according to an input example of the original program CATDAP-01.
[, 1] | sex | 1, 2 | |
[, 2] | age | 1, 2, 3, 4 | |
[, 3] | pol.party | 1, 2, 3, 4 | |
[, 4] | education | 1, 2, 3 | |
[, 5] | occupation | 1, 2 | |
[, 6] | born.again | 1, 2 | |
[, 7] | difficult | 1, 2 | |
[, 8] | pleasure | 1, 2 | |
[, 9] | women.job | 1, 2, 3 | |
[, 10] | money | 1, 2, 3 |
K.Katsura and Y.Sakamoto (1980) Computer Science Monograph, No.14, CATDAP, A Categorical Data Analysis Program Package, DATA No.1. The Institute of Statistical Mathematics.
Y.Sakamoto, M.Ishiguro and G.Kitagawa (1983) Information Statistics, III-2, DATA No. 9, Kyoritsu Shuppan Co., Ltd., Tokyo. (in Japanese)
Medical data containing both categorical and continuous variables; the latter include two variables with missing values.
data(MissingHealthData)
A data frame with 52 observations on the following 8 variables.
Part of the source data was recoded according to an input example of the original program CATDAP-02. In addition, we converted 1 to 'A' and 2 to 'B' in the symptoms data, and converted cholesterol values less than 198 to 'low' and all others to 'high'.
[, 1] | opthalmo. | 1, 2 | |
[, 2] | ecg | 1, 2 | |
[, 3] | symptoms | A, B | |
[, 4] | age | 49-59 | |
[, 5] | max.press | 98-216, 300 (missing value) | |
[, 6] | min.press | 56-120, 300 (missing value) | |
[, 7] | aortic.wav | 6.3-10.2 | |
[, 8] | cholesterol | low, high |
K.Katsura and Y.Sakamoto (1980) Computer Science Monograph, No.14, CATDAP, A Categorical Data Analysis Program Package, DATA No.2. The Institute of Statistical Mathematics.
Y.Sakamoto (1985) Categorical Data Analysis by AIC, p. 74. Kluwer Academic Publishers.