Title: Stochastic Complexity-Based Conditional Independence Test for Discrete Data
Description: An efficient implementation of SCCI using 'Rcpp'. SCCI is short for the Stochastic Complexity-based Conditional Independence criterium (Marx and Vreeken, 2019). SCCI is an asymptotically unbiased and L2 consistent estimator of (conditional) mutual information for discrete data.
Authors: Alexander Marx [aut, cre], Jilles Vreeken [aut]
Maintainer: Alexander Marx <[email protected]>
License: GPL (>= 2)
Version: 1.2
Built: 2024-11-28 06:53:14 UTC
Source: CRAN
Calculates the Shannon entropy of a discrete random variable X conditioned on a discrete (possibly multivariate) random variable Y.
conditionalShannonEntropy(x, y)
x: A discrete vector.
y: A data frame.
```r
set.seed(1)
x = round(runif(1000, min=0, max=5))
Y = data.frame(round(runif(1000, min=0, max=5)),
               round(runif(1000, min=0, max=5)))
conditionalShannonEntropy(x=x, y=Y) ## 2.411972
```
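Assuming conditionalShannonEntropy returns the plug-in (maximum-likelihood) estimate in bits, the value above can be cross-checked in base R via the chain rule H(X | Y) = H(X, Y) - H(Y). The entropy_bits helper below is purely illustrative and not part of the package.

```r
## Hedged cross-check: plug-in conditional entropy via H(X | Y) = H(X, Y) - H(Y), in bits.
set.seed(1)
x <- round(runif(1000, min = 0, max = 5))
Y <- data.frame(round(runif(1000, min = 0, max = 5)),
                round(runif(1000, min = 0, max = 5)))

entropy_bits <- function(v) {          # throwaway helper, not a package function
  p <- table(v) / length(v)
  -sum(p * log2(p))
}

y_joint  <- interaction(Y, drop = TRUE)           # treat Y as one joint variable
xy_joint <- interaction(x, y_joint, drop = TRUE)  # joint variable (X, Y)
entropy_bits(xy_joint) - entropy_bits(y_joint)    # should be close to 2.411972
```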
Calculates the Stochastic Complexity of a discrete random variable X conditioned on a discrete (possibly multivariate) random variable Y. Variants for both factorized NML (fNML, Silander et al. 2008) and quotient NML (qNML, Silander et al. 2018) are included.
conditionalStochasticComplexity(x, y, score="fNML")
x: A discrete vector.
y: A discrete vector or a data frame containing multiple discrete vectors to condition X on.
score: Default: fNML, optionally qNML can be passed.
Tomi Silander, Janne Leppä-aho, Elias Jääsaari, Teemu Roos; Quotient Normalized Maximum Likelihood Criterion for Learning Bayesian Network Structures, Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR, 2018
Tomi Silander, Teemu Roos, Petri Kontkanen and Petri Myllymäki; Factorized Normalized Maximum Likelihood Criterion for Learning Bayesian Network Structures, Proceedings of the 4th European Workshop on Probabilistic Graphical Models, 2008
```r
set.seed(1)
x = round(runif(1000, min=0, max=5))
Y = data.frame(round(runif(1000, min=0, max=5)),
               round(runif(1000, min=0, max=5)))
conditionalStochasticComplexity(x=x, y=Y, score="fNML") ## 2779.477
```
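Under fNML, the conditional stochastic complexity factorizes over the configurations of Y: for each group it adds the maximum-likelihood code length of X within that group plus a multinomial regret term (Silander et al. 2008). The sketch below, which uses the package's regret() (documented further below), illustrates that decomposition; whether it matches conditionalStochasticComplexity exactly depends on implementation details such as how the domain size of X is determined, so treat it as an approximation.

```r
## Illustrative fNML decomposition (an assumption about the score's structure,
## not the package's internal code): per-group ML code length plus regret.
set.seed(1)
x <- round(runif(1000, min = 0, max = 5))
Y <- data.frame(round(runif(1000, min = 0, max = 5)),
                round(runif(1000, min = 0, max = 5)))

k <- length(unique(x))                           # global domain size of X
groups <- split(x, interaction(Y, drop = TRUE))  # one group per configuration of Y
sum(vapply(groups, function(g) {
  p <- table(g) / length(g)                      # empirical distribution of X in the group
  length(g) * (-sum(p * log2(p))) + regret(length(g), k)
}, numeric(1)))
## roughly comparable to conditionalStochasticComplexity(x = x, y = Y, score = "fNML")
```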
This is an adapted version of SCCI for which the output can be interpreted as a p-value. For this, we adapted SCCI such that for a score of 0 (X is independent of Y given S) it gives a p-value greater than 0.01, and for a score greater than 0 (X is not independent of Y given S) it gives a p-value smaller than or equal to 0.01. Note that we only transform the output of SCCI and do not obtain a real p-value. In essence, the artificial p-value is defined as follows: let v be the output of SCCI divided by the number of samples n; the p-value is anchored at the 0.01 threshold, with scores indicating independence mapped above it and larger values of v mapped increasingly far below it. The p-values are restricted to lie between 0 and 1.
Unlike SCCI, pSCCI is currently only instantiated with fNML.
pSCCI can be used directly in the PC algorithm developed by Spirtes et al. (2000), which was implemented in the 'pcalg' R package by Kalisch et al. (2012), as shown in the example below.
pSCCI(x, y, S, suffStat)
x: Position of the x variable in the data matrix (integer).
y: Position of the y variable in the data matrix (integer).
S: Vector of the positions of zero or more conditioning variables (integer).
suffStat: A list containing the data matrix, e.g. list(dm = data_matrix) as in the example below. This format was adapted such that pSCCI can be used in the PC algorithm and other algorithms from the 'pcalg' package.
Markus Kalisch, Martin Mächler, Diego Colombo, Marloes H. Maathuis, Peter Bühlmann; Causal inference using graphical models with the R package pcalg, Journal of Statistical Software, 2012
Alexander Marx and Jilles Vreeken; Testing Conditional Independence on Discrete Data using Stochastic Complexity, Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR, 2019
Peter Spirtes, Clark N. Glymour, Richard Scheines, David Heckerman, Christopher Meek, Gregory Cooper and Thomas Richardson; Causation, Prediction, and Search, MIT press, 2000
```r
set.seed(1)
x = round(runif(1000, min=0, max=5))
y = round(runif(1000, min=0, max=5))
Z = data.frame(round(runif(1000, min=0, max=5)),
               round(runif(1000, min=0, max=5)))

## create data matrix
data_matrix = as.matrix(data.frame(x, y, S1=Z[,1], S2=Z[,2]))
suffStat = list(dm=data_matrix)
pSCCI(x=1, y=2, S=c(3,4), suffStat=suffStat) ## 0.01000001

### Using SCCI within the PC algorithm
if (require(pcalg)) {
  ## Load data
  data(gmD)
  V <- colnames(gmD$x)
  ## define sufficient statistics
  suffStat <- list(dm = gmD$x, nlev = c(3,2,3,4,2), adaptDF = FALSE)
  ## estimate CPDAG
  pc.D <- pc(suffStat,
             ## independence test: SCCI using fNML
             indepTest = pSCCI,
             alpha = 0.01, labels = V, verbose = TRUE)
}
if (require(pcalg) & require(Rgraphviz)) {
  ## show estimated CPDAG
  par(mfrow = c(1,2))
  plot(pc.D, main = "Estimated CPDAG")
  plot(gmD$g, main = "True DAG")
}
```
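For intuition, the sketch below contrasts a clearly dependent pair with an independent one. The data and the expected return values are hypothetical (not taken from the package documentation); under the mapping described above, the dependent pair should yield a value at or below 0.01 and the independent pair a value above 0.01.

```r
## Hypothetical illustration of the 0.01 decision threshold; exact outputs may differ.
set.seed(1)
x <- round(runif(1000, min = 0, max = 5))
y <- x                                    # y is a copy of x, hence strongly dependent
z <- round(runif(1000, min = 0, max = 5)) # unrelated variable
suffStat <- list(dm = as.matrix(data.frame(x, y, z)))
pSCCI(x = 1, y = 2, S = 3, suffStat = suffStat)  # expected: <= 0.01 (dependence)
pSCCI(x = 1, y = 3, S = 2, suffStat = suffStat)  # expected: >  0.01 (independence)
```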
Calculates the multinomial regret term for a discrete random variable with domain size k and sample size n (see Silander et al. 2018). Note that we use the logarithm to base 2 to calculate the result. To compare the results to Silander et al. (2018), multiply the result by log(2), i.e. convert from bits to nats.
regret(n,k)
n: Integer (sample size).
k: Integer (domain size).
Tomi Silander, Janne Leppä-aho, Elias Jääsaari, Teemu Roos; Quotient Normalized Maximum Likelihood Criterion for Learning Bayesian Network Structures, Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR, 2018
```r
regret(50, 10)          ## 19.1
regret(50, 10) * log(2) ## 13.24 (see Silander et al. 2018)
```
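The reported values can be reproduced from first principles: the multinomial regret has an exact linear-time recurrence (Kontkanen and Myllymäki, 2007), which regret is assumed to implement up to the base-2 logarithm. The stand-alone function below is illustrative and not part of the package.

```r
## Stand-alone sketch of the exact multinomial regret in bits, via
##   C(n, 1) = 1
##   C(n, 2) = sum_h choose(n, h) (h/n)^h ((n-h)/n)^(n-h)
##   C(n, m) = C(n, m-1) + n/(m-2) * C(n, m-2)   for m >= 3
regret_exact_bits <- function(n, k) {
  C_prev2 <- 1                                   # C(n, 1)
  if (k == 1) return(log2(C_prev2))
  h <- 0:n
  C_prev1 <- sum(exp(lchoose(n, h) +             # C(n, 2)
                     ifelse(h == 0, 0, h * log(h / n)) +
                     ifelse(h == n, 0, (n - h) * log((n - h) / n))))
  if (k == 2) return(log2(C_prev1))
  for (m in 3:k) {
    C_cur   <- C_prev1 + n / (m - 2) * C_prev2
    C_prev2 <- C_prev1
    C_prev1 <- C_cur
  }
  log2(C_prev1)
}

regret_exact_bits(50, 10)  # should be close to regret(50, 10), i.e. about 19.1
```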
Calculates whether two random variables X and Y are independent given a set of variables Z using SCCI. A score of 0 denotes that independence holds, and values greater than 0 mean that X is not independent of Y given Z. For details on SCCI, we refer to Marx and Vreeken (AISTATS, 2019). If you use SCCI in your work, please cite Marx and Vreeken (AISTATS, 2019).
The output of SCCI is the difference in the number of bits between conditioning X only on Z and conditioning X on both Z and Y. For the variant of SCCI whose output can be interpreted as a p-value, please refer to pSCCI.
SCCI(x, y, Z, score="fNML", sym=FALSE)
x: A discrete vector.
y: A discrete vector.
Z: A data frame consisting of zero or more columns of discrete vectors.
score: Default: fNML, optionally qNML can be passed.
sym: Logical; can be TRUE or FALSE (default: FALSE).
Alexander Marx and Jilles Vreeken; Testing Conditional Independence on Discrete Data using Stochastic Complexity, Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR, 2019
```r
set.seed(1)
x = round(runif(1000, min=0, max=5))
y = round(runif(1000, min=0, max=5))
Z = data.frame(round(runif(1000, min=0, max=5)),
               round(runif(1000, min=0, max=5)))
SCCI(x=x, y=y, Z=Z, score="fNML", sym=FALSE) ## 0
```
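The "difference in bits" reading above can be sketched with the package's own building blocks. The composition below, including the truncation at zero, is an assumption made for illustration only; it is not the package's internal computation and need not reproduce SCCI's output exactly.

```r
## Hedged sketch: bits saved for X by additionally conditioning on y, truncated at zero.
set.seed(1)
x <- round(runif(1000, min = 0, max = 5))
y <- round(runif(1000, min = 0, max = 5))
Z <- data.frame(round(runif(1000, min = 0, max = 5)),
                round(runif(1000, min = 0, max = 5)))

gain <- conditionalStochasticComplexity(x = x, y = Z, score = "fNML") -
        conditionalStochasticComplexity(x = x, y = cbind(Z, y), score = "fNML")
max(0, gain)  # expected to be 0 here, in line with SCCI(x, y, Z) == 0 above
```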
Calculates the Shannon entropy over data of a discrete random variable X.
shannonEntropy(x)
x: A discrete vector.
```r
set.seed(1)
x = round(runif(1000, min=0, max=5))
shannonEntropy(x=x) ## 2.522265
```
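Assuming shannonEntropy computes the plug-in (maximum-likelihood) estimate in bits, the value can be reproduced directly from the empirical frequencies:

```r
## Hedged cross-check of the plug-in Shannon entropy in bits; not package code.
set.seed(1)
x <- round(runif(1000, min = 0, max = 5))
p <- table(x) / length(x)   # empirical frequencies
-sum(p * log2(p))           # should be close to shannonEntropy(x = x) ## 2.522265
```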
Efficient implementation of the exact computation of stochastic complexity for multinomials (Silander et al. 2008) for data over a discrete random variable X.
stochasticComplexity(x)
x: A discrete vector.
Tomi Silander, Teemu Roos, Petri Kontkanen and Petri Myllymäki; Factorized Normalized Maximum Likelihood Criterion for Learning Bayesian Network Structures, Proceedings of the 4th European Workshop on Probabilistic Graphical Models, 2008
```r
set.seed(1)
x = round(runif(1000, min=0, max=5))
stochasticComplexity(x=x) ## 2544.698
```
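The pieces above tie together: assuming stochasticComplexity is the maximum-likelihood code length (n times the plug-in Shannon entropy, in bits) plus the multinomial regret term, the result can be cross-checked with the other functions of the package.

```r
## Hedged cross-check: stochastic complexity ~ n * H_hat(X) + regret(n, k),
## assuming all three functions work in base-2 logarithms.
set.seed(1)
x <- round(runif(1000, min = 0, max = 5))
n <- length(x)              # sample size
k <- length(unique(x))      # observed domain size (6 here)
n * shannonEntropy(x = x) + regret(n, k)  # should be close to 2544.698
```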