Title: | Measure Dependence Between Categorical and Continuous Variables |
---|---|
Description: | Semi-distance and mean-variance (MV) index are proposed to measure the dependence between a categorical random variable and a continuous variable. Test of independence and feature screening for classification problems can be implemented via the two dependence measures. For the details of the methods, see Zhong et al. (2023) <doi:10.1080/01621459.2023.2284988>; Cui and Zhong (2019) <doi:10.1016/j.csda.2019.05.004>; Cui, Li and Zhong (2015) <doi:10.1080/01621459.2014.920256>. |
Authors: | Wei Zhong [aut], Zhuoxi Li [aut, cre, cph], Wenwen Guo [aut], Hengjian Cui [aut], Runze Li [aut] |
Maintainer: | Zhuoxi Li <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.0 |
Built: | 2024-12-16 06:39:37 UTC |
Source: | CRAN |
Implement the mutual information independence test (MINT) (Berrett and Samworth, 2019), with a modification in estimating the mutual information (MI) between a categorical random variable and a continuous variable. The modification is based on the idea of Ross (2014).
MINTsemiperm() implements the permutation independence test via mutual information, but the parameter k should be pre-specified.
MINTsemiauto() automatically selects an appropriate k based on a data-driven procedure, and conducts MINTsemiperm() with the k chosen.
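To make the role of k concrete, the following base-R sketch estimates the mutual information between continuous data and a categorical variable in the spirit of Ross (2014). It is an illustration only and not the package's implementation: the function name `ross_mi_sketch`, the default `k = 3`, and the exact neighbour-counting convention are assumptions.

```r
# Hypothetical helper (not part of the package): Ross-style k-nearest-neighbour
# estimate of the mutual information between continuous X and categorical y.
ross_mi_sketch <- function(X, y, k = 3) {
  X <- as.matrix(X)
  y <- droplevels(as.factor(y))
  n <- nrow(X)
  # Assumption: every class has more than k observations.
  stopifnot(length(y) == n, all(table(y) > k))

  D <- as.matrix(dist(X))                         # pairwise Euclidean distances
  n_class <- as.numeric(table(y))[as.integer(y)]  # class size for each point
  m <- numeric(n)
  for (i in seq_len(n)) {
    same <- setdiff(which(y == y[i]), i)  # other points in the same class
    d_k <- sort(D[i, same])[k]            # distance to the k-th same-class neighbour
    m[i] <- sum(D[i, -i] <= d_k)          # points of any class within that distance
  }
  digamma(n) - mean(digamma(n_class)) + digamma(k) - mean(digamma(m))
}

set.seed(1)
y <- factor(rep(1:2, each = 100))
x_dep <- rnorm(200) + 10 * as.integer(y)  # class determines the location of x
x_ind <- rnorm(200)                       # x unrelated to the class
mi_dep <- ross_mi_sketch(x_dep, y)
mi_ind <- ross_mi_sketch(x_ind, y)
print(c(dependent = mi_dep, independent = mi_ind))
```

A small k reacts to local structure but is noisier, while a large k smooths the estimate, which is why a data-driven choice of k can be useful.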
MINTsemiperm(X, y, k, B = 1000)
MINTsemiauto(X, y, kmax, B1 = 1000, B2 = 1000)
X |
Data of multivariate continuous variables, which should be an n-by-p matrix. |
y |
Data of categorical variables, which should be a factor of length n. |
k |
Number of nearest neighbors. See References for details. |
B , B1 , B2
|
Number of permutations to use. Defaults to 1000. |
kmax |
Maximum k in the automatic search for the optimal k. |
A list with class "indtest" containing the following components:
method: name of the test;
name_data: names of X and y;
n: sample size of the data;
num_perm: number of replications in permutation test;
stat: test statistic;
pvalue: computed p-value.
For MINTsemiauto(), the list also contains:
kmax: maximum k in the automatic search for optimal k;
kopt: optimal k chosen.
Berrett, Thomas B., and Richard J. Samworth. "Nonparametric independence testing via mutual information." Biometrika 106, no. 3 (2019): 547-566.
Ross, Brian C. "Mutual information between discrete and continuous data sets." PloS one 9, no. 2 (2014): e87357.
X <- mtcars[, c("mpg", "disp", "drat", "wt")]
y <- factor(mtcars[, "am"])
MINTsemiperm(X, y, 5)
MINTsemiauto(X, y, kmax = 32)
Compute the mean-variance (MV) index statistic, which measures the dependence between a univariate continuous variable and a categorical variable. See Cui, Li and Zhong (2015) and Cui and Zhong (2019) for details.
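As a rough illustration of what such a statistic looks like, the base-R sketch below uses the form commonly given for the sample MV index, \(\frac{1}{n}\sum_{r}\sum_{j}\hat{p}_r\,[\hat{F}_r(x_j)-\hat{F}(x_j)]^2\), where \(\hat{F}\) is the overall empirical CDF and \(\hat{F}_r\) the empirical CDF within class r. The function name `mv_sketch` is hypothetical, and the package's own mv() may differ in scaling or implementation details.

```r
# Hypothetical helper (not the package's mv()): sample MV index computed from
# empirical distribution functions.
mv_sketch <- function(x, y) {
  y <- droplevels(as.factor(y))
  n <- length(x)
  stopifnot(length(y) == n)
  F_all <- ecdf(x)(x)                # overall empirical CDF at each x_j
  total <- 0
  for (r in levels(y)) {
    in_class <- y == r
    p_r <- mean(in_class)            # class proportion
    F_r <- ecdf(x[in_class])(x)      # within-class empirical CDF at each x_j
    total <- total + p_r * sum((F_r - F_all)^2)
  }
  total / n
}

# Functionally dependent toy data: each class takes a single value of x.
x <- rep(c(0.3, 0.2, -0.1), each = 10)
y <- factor(rep(1:3, each = 10))
mv_dep <- mv_sketch(x, y)

# Balanced toy data: x has the same distribution in every class.
x_bal <- rep(c(1, 2), times = 15)
mv_bal <- mv_sketch(x_bal, y)
print(c(dependent = mv_dep, balanced = mv_bal))
```

The statistic is zero when every within-class empirical CDF coincides with the overall one, and grows as the classes separate.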
mv(x, y, return_mat = FALSE)
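x |
Data of univariate continuous variables, which should be a vector of length n. |
y |
Data of categorical variables, which should be a factor of length n. |
return_mat |
A boolean. If TRUE, a list containing the statistic together with the matrices mat_x is returned instead of the statistic alone. Defaults to FALSE. |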
The value of the corresponding sample statistic.
If the argument return_mat of mv() is set as TRUE, a list with the following elements will be returned:
mv: the MV index statistic;
mat_x: the matrices of the distances of the indicator for x <= x_i.
mv_test() for implementing independence test via MV index;
mv_sis() for implementing feature screening via MV index.
x <- mtcars[, "mpg"]
y <- factor(mtcars[, "am"])
print(mv(x, y))

# Man-made independent data -------------------------------------------------
n <- 30; R <- 5; prob <- rep(1/R, R)
x <- rnorm(n)
y <- factor(sample(1:R, size = n, replace = TRUE, prob = prob), levels = 1:R)
print(mv(x, y))

# Man-made functionally dependent data --------------------------------------
n <- 30; R <- 3
x <- rep(0, n)
x[1:10] <- 0.3; x[11:20] <- 0.2; x[21:30] <- -0.1
y <- factor(rep(1:3, each = 10))
print(mv(x, y))
Implement the feature screening for the classification problem via MV index.
mv_sis(X, y, d = NULL, parallel = FALSE)
X |
Data of multivariate covariates, which should be an n-by-p matrix. |
y |
Data of categorical response, which should be a factor of length n. |
d |
An integer specifying how many features should be kept after
screening. Defaults to NULL. |
parallel |
A boolean indicating whether to calculate parallelly via
|
A list of the objects about the implemented feature screening:
measurement: sample MV index calculated for each single covariate;
selected: indices or names (if available as colnames of X) of covariates that are selected after feature screening;
ordering: order of the calculated measurements of each single covariate. The first one is the largest, and the last is the smallest.
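The screening logic can be sketched in base R as follows: compute a marginal dependence measure for every covariate, rank the covariates in decreasing order, and keep the top d. The names `marginal_mv` and `screen_sketch` are hypothetical, and the marginal measure is a simplified MV-type statistic rather than the package's own code.

```r
# Hypothetical marginal measure: sample MV-type index from empirical CDFs.
marginal_mv <- function(x, y) {
  y <- droplevels(as.factor(y))
  F_all <- ecdf(x)(x)
  total <- 0
  for (r in levels(y)) {
    in_class <- y == r
    total <- total + mean(in_class) * sum((ecdf(x[in_class])(x) - F_all)^2)
  }
  total / length(x)
}

# Hypothetical screening routine: rank covariates by the marginal measure
# and keep the d covariates with the largest values.
screen_sketch <- function(X, y, d) {
  X <- as.data.frame(X)
  measurement <- vapply(X, marginal_mv, numeric(1), y = y)
  ordering <- order(measurement, decreasing = TRUE)
  list(
    measurement = measurement,
    selected = names(X)[ordering[seq_len(d)]],
    ordering = ordering
  )
}

# Toy data: column "a" is determined by the class, column "b" is balanced
# across classes and therefore carries no information about y.
y <- factor(rep(1:3, each = 10))
X <- data.frame(
  a = rep(c(0.3, 0.2, -0.1), each = 10),
  b = rep(c(1, 2), times = 15)
)
res <- screen_sketch(X, y, d = 1)
print(res)
```

Because each covariate is scored on its own, the cost grows linearly in the number of covariates, which is what makes this kind of screening practical in high dimensions.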
X <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
y <- factor(mtcars[, "am"])
mv_sis(X, y, d = 4)
Implement the MV independence test via permutation test, or via the asymptotic approximation.
mv_test(x, y, test_type = "perm", num_perm = 10000)
x |
Data of univariate continuous variables, which should be a vector of length n. |
y |
Data of categorical variables, which should be a factor of length n. |
test_type |
Type of the test: "perm" (the default) for the permutation test, or "asym" for the asymptotic approximation. See the Reference for details. |
num_perm |
The number of replications in permutation test. |
A list with class "indtest" containing the following components:
method: name of the test;
name_data: names of x and y;
n: sample size of the data;
num_perm: number of replications in permutation test;
stat: test statistic;
pvalue: computed p-value. (Note: the asymptotic test cannot return a p-value, only the critical values crit_vals for the 90%, 95% and 99% confidence levels.)
x <- mtcars[, "mpg"]
y <- factor(mtcars[, "am"])
test <- mv_test(x, y)
print(test)
test_asym <- mv_test(x, y, test_type = "asym")
print(test_asym)

# Man-made independent data -------------------------------------------------
n <- 30; R <- 5; prob <- rep(1/R, R)
x <- rnorm(n)
y <- factor(sample(1:R, size = n, replace = TRUE, prob = prob), levels = 1:R)
test <- mv_test(x, y)
print(test)
test_asym <- mv_test(x, y, test_type = "asym")
print(test_asym)

# Man-made functionally dependent data --------------------------------------
n <- 30; R <- 3
x <- rep(0, n)
x[1:10] <- 0.3; x[11:20] <- 0.2; x[21:30] <- -0.1
y <- factor(rep(1:3, each = 10))
test <- mv_test(x, y)
print(test)
test_asym <- mv_test(x, y, test_type = "asym")
print(test_asym)
Print an object of class "indtest" with a simple print method.
## S3 method for class 'indtest'
print(x, digits = getOption("digits"), ...)
x |
An object of class "indtest". |
digits |
Minimal number of significant digits. |
... |
Further arguments passed to or from other methods. |
None
# Man-made functionally dependent data --------------------------------------
n <- 30; R <- 3
x <- rep(0, n)
x[1:10] <- 0.3; x[11:20] <- 0.2; x[21:30] <- -0.1
y <- factor(rep(1:3, each = 10))
test <- mv_test(x, y)
print(test)
test_asym <- mv_test(x, y, test_type = "asym")
print(test_asym)
Implement the (grouped) feature screening for the classification problem via semi-distance correlation.
sd_sis(X, y, group_info = NULL, d = NULL, parallel = FALSE)
X |
Data of multivariate covariates, which should be an n-by-p matrix. |
y |
Data of categorical response, which should be a factor of length n. |
group_info |
A list specifying the group information, with elements
being sets of indices of covariates in the same group.
Defaults to NULL. The names of the list can help recognize the group.
|
d |
An integer specifying at least how many (single) features should
be kept after screening. Defaults to NULL. |
parallel |
A boolean indicating whether to calculate parallelly via
|
A list of the objects about the implemented feature screening:
group_info: group information;
measurement: sample semi-distance correlations calculated for the groups specified in group_info;
selected: indices/names of (single) covariates that are selected after feature screening;
ordering: order of the calculated measurements of the groups specified in group_info. The first one is the largest, and the last is the smallest.
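One plausible reading of "at least d (single) features" in grouped screening is: walk through the groups from the largest measurement to the smallest and keep whole groups until at least d covariates have been collected. The base-R sketch below illustrates that reading only. The function `select_groups_sketch` and the measurement values are made up for illustration and are not the package's internals.

```r
# Hypothetical selection rule: add whole groups in decreasing order of their
# measurement until at least d single covariates have been kept.
select_groups_sketch <- function(measurement, group_info, d) {
  ordering <- order(measurement, decreasing = TRUE)
  selected <- character(0)
  for (g in ordering) {
    if (length(selected) >= d) break
    selected <- union(selected, group_info[[g]])
  }
  list(ordering = ordering, selected = selected)
}

group_info <- list(
  mpg_drat = c("mpg", "drat"),
  disp_hp  = c("disp", "hp"),
  wt_qsec  = c("wt", "qsec")
)
# Made-up group-level measurements, for illustration only.
measurement <- c(mpg_drat = 0.9, disp_hp = 0.2, wt_qsec = 0.5)

res <- select_groups_sketch(measurement, group_info, d = 3)
print(res$selected)
```

Under this rule the number of selected covariates can exceed d, because groups are never split.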
sdcor() for calculating the sample semi-distance correlation.
X <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
y <- factor(mtcars[, "am"])
sd_sis(X, y, d = 4)

# Suppose we have prior information for the group structure as
# ("mpg", "drat"), ("disp", "hp") and ("wt", "qsec")
group_info <- list(
  mpg_drat = c("mpg", "drat"),
  disp_hp = c("disp", "hp"),
  wt_qsec = c("wt", "qsec")
)
sd_sis(X, y, group_info, d = 4)
Implement the semi-distance independence test via permutation test, or via the asymptotic approximation when the dimensionality of continuous variables is high.
sd_test(X, y, test_type = "perm", num_perm = 10000)
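X |
Data of multivariate continuous variables, which should be an n-by-p matrix. |
y |
Data of categorical variables, which should be a factor of length n. |
test_type |
Type of the test: "perm" (the default) for the permutation test, or "asym" for the asymptotic approximation. See the Reference for details. |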
num_perm |
The number of replications in permutation test. Defaults to 10000. See Details and Reference. |
The semi-distance independence test statistic is
\[T_n = n \cdot \widetilde{\mathrm{SDcov}}_n(\mathbf{X}, y),\]
where \(\widetilde{\mathrm{SDcov}}_n(\mathbf{X}, y)\) can be computed by sdcov(X, y, type = "U").
For the permutation test (test_type = "perm"), a total of \(K\) permutation replications are conducted, and the argument num_perm specifies \(K\). The p-value of the permutation test is computed by
\[\text{p-value} = \frac{\sum_{k=1}^{K} I\left(T_n^{*(k)} \ge T_n\right) + 1}{K + 1},\]
where \(T_n\) is the semi-distance test statistic and \(T_n^{*(k)}\) is the test statistic computed on the \(k\)-th permutation sample.
When the dimension of the continuous variables is high, the asymptotic approximation approach can be applied (test_type = "asym"), which is computationally faster since no permutation is needed.
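The permutation procedure can be sketched in base R. The sketch assumes the "estimated U-statistic" form \(\frac{1}{n(n-1)}\sum_{i \ne j}\|\mathbf{X}_i-\mathbf{X}_j\|\,(1-\sum_r I(Y_i=r)I(Y_j=r)/\tilde{p}_r)\) with \(\tilde{p}_r=(n_r-1)/(n-1)\), a test statistic equal to n times that quantity, and the common (count + 1)/(K + 1) p-value convention. The names `sdcov_u_sketch` and `perm_test_sketch` are hypothetical, and this is not the package's implementation.

```r
# Hypothetical helper: "estimated U-statistic" of semi-distance covariance.
# Assumption: every class has at least two observations.
sdcov_u_sketch <- function(X, y) {
  X <- as.matrix(X)
  y <- droplevels(as.factor(y))
  n <- nrow(X)
  D <- as.matrix(dist(X))                            # ||X_i - X_j||
  p_tilde <- (as.numeric(table(y)) - 1) / (n - 1)    # adjusted class proportions
  same <- outer(as.integer(y), as.integer(y), "==")  # I(Y_i = Y_j)
  W <- 1 - same / p_tilde[as.integer(y)]             # row i is divided by p_tilde of Y_i
  diag(W) <- 0                                       # only pairs with i != j contribute
  sum(D * W) / (n * (n - 1))
}

# Hypothetical permutation test: permute y, recompute the statistic K times.
perm_test_sketch <- function(X, y, K = 199) {
  n <- nrow(as.matrix(X))
  t_obs <- n * sdcov_u_sketch(X, y)
  t_perm <- replicate(K, n * sdcov_u_sketch(X, sample(y)))
  list(stat = t_obs, pvalue = (sum(t_perm >= t_obs) + 1) / (K + 1))
}

# Functionally dependent toy data: the class determines which coordinate is 1.
set.seed(1)
n <- 30; p <- 3
X <- matrix(0, n, p)
X[1:10, 1] <- 1; X[11:20, 2] <- 1; X[21:30, 3] <- 1
y <- factor(rep(1:3, each = 10))
res <- perm_test_sketch(X, y, K = 199)
print(res)
```

Adding one to both the count and the denominator keeps the p-value strictly positive, so its smallest attainable value is 1/(K + 1).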
A list with class "indtest" containing the following components:
method: name of the test;
name_data: names of X and y;
n: sample size of the data;
test_type: type of the test;
num_perm: number of replications in permutation test, if test_type = "perm";
stat: test statistic;
pvalue: computed p-value.
sdcov() for computing the statistic of semi-distance covariance.
X <- mtcars[, c("mpg", "disp", "drat", "wt")]
y <- factor(mtcars[, "am"])
test <- sd_test(X, y)
print(test)

# Man-made independent data -------------------------------------------------
n <- 30; R <- 5; p <- 3; prob <- rep(1/R, R)
X <- matrix(rnorm(n*p), n, p)
y <- factor(sample(1:R, size = n, replace = TRUE, prob = prob), levels = 1:R)
test <- sd_test(X, y)
print(test)

# Man-made functionally dependent data --------------------------------------
n <- 30; R <- 3; p <- 3
X <- matrix(0, n, p)
X[1:10, 1] <- 1; X[11:20, 2] <- 1; X[21:30, 3] <- 1
y <- factor(rep(1:3, each = 10))
test <- sd_test(X, y)
print(test)

# Man-made high-dimensionally independent data ------------------------------
n <- 30; R <- 3; p <- 100
X <- matrix(rnorm(n*p), n, p)
y <- factor(rep(1:3, each = 10))
test <- sd_test(X, y)
print(test)
test <- sd_test(X, y, test_type = "asym")
print(test)

# Man-made high-dimensionally dependent data --------------------------------
n <- 30; R <- 3; p <- 100
X <- matrix(0, n, p)
X[1:10, 1] <- 1; X[11:20, 2] <- 1; X[21:30, 3] <- 1
y <- factor(rep(1:3, each = 10))
test <- sd_test(X, y)
print(test)
test <- sd_test(X, y, test_type = "asym")
print(test)
Compute the statistics (or sample estimates) of semi-distance covariance and correlation. The semi-distance correlation is a standardized version of semi-distance covariance, and it can measure the dependence between a multivariate continuous variable and a categorical variable. See Details for the definition of semi-distance covariance and semi-distance correlation.
sdcov(X, y, type = "V", return_mat = FALSE)
sdcor(X, y)
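X |
Data of multivariate continuous variables, which should be an n-by-p matrix. |
y |
Data of categorical variables, which should be a factor of length n. |
type |
Type of statistic: "V" (the default) for the V-statistic, or "U" for the "estimated U-statistic". See Details. |
return_mat |
A boolean. If TRUE, a list containing the statistic together with the matrices mat_x and mat_y is returned instead of the statistic alone. Defaults to FALSE. |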
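For \(\mathbf{X} \in \mathbb{R}^{p}\) and \(Y \in \{1, 2, \ldots, R\}\), the (population-level) semi-distance covariance is defined as
\[\mathrm{SDcov}(\mathbf{X}, Y) = \mathrm{E}\left[\|\mathbf{X} - \widetilde{\mathbf{X}}\| \left(1 - \sum_{r=1}^{R} I(Y = r, \widetilde{Y} = r) / p_r\right)\right],\]
where \(p_r = P(Y = r)\) and \((\widetilde{\mathbf{X}}, \widetilde{Y})\) is an iid copy of \((\mathbf{X}, Y)\).
The (population-level) semi-distance correlation is defined as
\[\mathrm{SDcor}(\mathbf{X}, Y) = \frac{\mathrm{SDcov}(\mathbf{X}, Y)}{\mathrm{dvar}(\mathbf{X}) \sqrt{R - 1}},\]
where \(\mathrm{dvar}(\mathbf{X})\) is the distance variance (Szekely, Rizzo, and Bakirov 2007) of \(\mathbf{X}\).
With \(n\) observations \(\{(\mathbf{X}_i, Y_i)\}_{i=1}^{n}\), sdcov() and sdcor() can compute the sample estimates for the semi-distance covariance and correlation.
If type = "V", the semi-distance covariance statistic is computed as a V-statistic, which takes a very similar form as the energy-based statistic with double centering, and is always non-negative. Specifically,
\[\mathrm{SDcov}_n(\mathbf{X}, y) = \frac{1}{n^2} \sum_{k=1}^{n} \sum_{l=1}^{n} A_{kl} B_{kl},\]
where
\[A_{kl} = a_{kl} - \bar{a}_{k\cdot} - \bar{a}_{\cdot l} + \bar{a}_{\cdot\cdot}\]
is the double centering (Szekely, Rizzo, and Bakirov 2007) of \(a_{kl} = \|\mathbf{X}_k - \mathbf{X}_l\|\), and
\[B_{kl} = 1 - \sum_{r=1}^{R} I(Y_k = r) I(Y_l = r) / \hat{p}_r\]
with \(\hat{p}_r = n_r / n = n^{-1} \sum_{i=1}^{n} I(Y_i = r)\).
The semi-distance correlation statistic is
\[\mathrm{SDcor}_n(\mathbf{X}, y) = \frac{\mathrm{SDcov}_n(\mathbf{X}, y)}{\mathrm{dvar}_n(\mathbf{X}) \sqrt{R - 1}},\]
where \(\mathrm{dvar}_n(\mathbf{X})\) is the V-statistic of distance variance of \(\mathbf{X}\).
If type = "U", then the semi-distance covariance statistic is computed as an “estimated U-statistic”, which is utilized in the independence test statistic and is not necessarily non-negative. Specifically,
\[\widetilde{\mathrm{SDcov}}_n(\mathbf{X}, y) = \frac{1}{n(n-1)} \sum_{i \ne j} \|\mathbf{X}_i - \mathbf{X}_j\| \left(1 - \sum_{r=1}^{R} I(Y_i = r) I(Y_j = r) / \tilde{p}_r\right),\]
where \(\tilde{p}_r = (n_r - 1) / (n - 1)\). Note that the test statistic of the semi-distance independence test is
\[T_n = n \cdot \widetilde{\mathrm{SDcov}}_n(\mathbf{X}, y).\]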
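The V-statistic version can be written in a few lines of base R. The sketch assumes the double-centering form \(\frac{1}{n^2}\sum_{k,l} A_{kl}B_{kl}\), with \(A\) the double-centered distance matrix of the rows of X and \(B_{kl} = 1 - I(Y_k = Y_l)/\hat{p}_{Y_k}\), and normalizes by the sample distance variance times \(\sqrt{R-1}\). The function name `sd_v_sketch` is hypothetical, and the code is an illustration rather than the package's implementation.

```r
# Hypothetical helper: V-statistic versions of semi-distance covariance and
# correlation, computed with double centering.
sd_v_sketch <- function(X, y) {
  X <- as.matrix(X)
  y <- droplevels(as.factor(y))
  n <- nrow(X)
  stopifnot(length(y) == n)

  a <- as.matrix(dist(X))                            # a_kl = ||X_k - X_l||
  # Double centering: subtract row means and column means, add the grand mean.
  A <- sweep(sweep(a, 1, rowMeans(a)), 2, colMeans(a)) + mean(a)

  p_hat <- as.numeric(table(y)) / n                  # class proportions
  same <- outer(as.integer(y), as.integer(y), "==")  # I(Y_k = Y_l)
  B <- 1 - same / p_hat[as.integer(y)]               # row k is divided by p_hat of Y_k

  sdcov <- sum(A * B) / n^2
  dvar <- sqrt(sum(A * A) / n^2)                     # sample distance variance
  list(sdcov = sdcov, sdcor = sdcov / (dvar * sqrt(nlevels(y) - 1)))
}

# Functionally dependent toy data: the class determines which coordinate is 1.
n <- 30; p <- 3
X <- matrix(0, n, p)
X[1:10, 1] <- 1; X[11:20, 2] <- 1; X[21:30, 3] <- 1
y <- factor(rep(1:3, each = 10))
res <- sd_v_sketch(X, y)
print(res)
```

In this toy example the centered distance matrix is proportional to the centered class-agreement matrix, so the normalized statistic reaches its upper bound of 1.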
The value of the corresponding sample statistic.
If the argument return_mat of sdcov() is set as TRUE, a list with the following elements will be returned:
sdcov: the semi-distance covariance statistic;
mat_x, mat_y: the matrices of the distances of X and the divergences of y, respectively.
sd_test() for implementing independence test via semi-distance covariance;
sd_sis() for implementing groupwise feature screening via semi-distance correlation.
X <- mtcars[, c("mpg", "disp", "drat", "wt")]
y <- factor(mtcars[, "am"])
print(sdcov(X, y))
print(sdcor(X, y))

# Man-made independent data -------------------------------------------------
n <- 30; R <- 5; p <- 3; prob <- rep(1/R, R)
X <- matrix(rnorm(n*p), n, p)
y <- factor(sample(1:R, size = n, replace = TRUE, prob = prob), levels = 1:R)
print(sdcov(X, y))
print(sdcor(X, y))

# Man-made functionally dependent data --------------------------------------
n <- 30; R <- 3; p <- 3
X <- matrix(0, n, p)
X[1:10, 1] <- 1; X[11:20, 2] <- 1; X[21:30, 3] <- 1
y <- factor(rep(1:3, each = 10))
print(sdcov(X, y))
print(sdcor(X, y))
Categorical data with n observations and R levels can typically be represented in two forms in R: a factor of length n, or an n-by-R indicator matrix with elements being 0 or 1. This function switches a categorical object from one form to the other.
switch_cat_repr(obj)
obj |
an object representing categorical data, either a factor or an indicator matrix with each row representing an observation. |
The categorical object in the other form.
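The two representations and the conversion between them can be illustrated in base R. The helper names `factor_to_indicator` and `indicator_to_factor` are hypothetical and only mimic the behaviour described above. They are not the package's switch_cat_repr().

```r
# Hypothetical helper: factor of length n -> n-by-R 0/1 indicator matrix.
factor_to_indicator <- function(f) {
  f <- as.factor(f)
  m <- outer(as.integer(f), seq_len(nlevels(f)), "==") * 1
  colnames(m) <- levels(f)
  m
}

# Hypothetical helper: n-by-R 0/1 indicator matrix -> factor of length n.
# Assumption: each row contains exactly one 1.
indicator_to_factor <- function(m) {
  lev <- colnames(m)
  if (is.null(lev)) lev <- as.character(seq_len(ncol(m)))
  factor(lev[max.col(m, ties.method = "first")], levels = lev)
}

f <- factor(mtcars$am)          # levels "0" and "1"
m <- factor_to_indicator(f)     # 32-by-2 indicator matrix
back <- indicator_to_factor(m)  # round trip back to a factor
print(head(m))
print(head(back))
```

The factor form is compact, while the indicator form is convenient for matrix computations such as class-wise sums.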
For a design matrix \(\mathbf{X}\), estimate the trace of its covariance matrix \(\Sigma = \mathrm{cov}(\mathbf{X})\), and the trace of the square of the covariance matrix, \(\Sigma^2\).
tr_estimate(X)
X |
The design matrix. |
A list with elements:
tr_S: estimate for trace of \(\Sigma\);
tr_S2: estimate for trace of \(\Sigma^2\).
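For orientation, the naive plug-in versions of these two quantities can be computed directly from the sample covariance matrix. The sketch below is only that plug-in baseline, with the hypothetical name `tr_naive_sketch`. The package's tr_estimate() may use different (for example bias-corrected) estimators, and the plug-in estimate of the second trace is known to be biased when the dimension is large relative to the sample size.

```r
# Hypothetical helper: naive plug-in estimates based on the sample covariance.
tr_naive_sketch <- function(X) {
  S <- cov(as.matrix(X))   # p-by-p sample covariance matrix
  list(
    tr_S  = sum(diag(S)),  # trace of S
    tr_S2 = sum(S^2)       # trace of S %*% S, since S is symmetric
  )
}

X <- mtcars[, c("mpg", "disp", "drat", "wt")]
res <- tr_naive_sketch(X)
print(res)
```

The trace of the covariance matrix is simply the sum of the column variances, which gives a quick sanity check for any estimator of it.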