| Title: | Bayesian Network Learning with the PCHC and Related Algorithms |
|---|---|
| Description: | Bayesian network learning using the PCHC, FEDHC, MMHC and variants of these algorithms. PCHC stands for PC Hill-Climbing, a new hybrid algorithm that uses PC to construct the skeleton of the BN and then applies the Hill-Climbing greedy search. More algorithms and variants have been added, such as MMHC, FEDHC, and the Tabu search variants, PCTABU, MMTABU and FEDTABU. The relevant papers are: a) Tsagris M. (2021). "A new scalable Bayesian network learning algorithm with applications to economics". Computational Economics, 57(1): 341-367. <doi:10.1007/s10614-020-10065-7>. b) Tsagris M. (2022). "The FEDHC Bayesian Network Learning Algorithm". Mathematics 2022, 10(15): 2604. <doi:10.3390/math10152604>. c) Sevinc V. and Tsagris M. (2024). "On the Hyperparameters of PCTABU and PCHC Bayesian Network Learning Algorithms". <doi:10.21203/rs.3.rs-5137132/v1>. |
| Authors: | Michail Tsagris [aut, cre] |
| Maintainer: | Michail Tsagris <[email protected]> |
| License: | GPL (>= 2) |
| Version: | 1.4 |
| Built: | 2026-05-17 08:55:00 UTC |
| Source: | https://github.com/cran/pchc |
The original version of this package was to learn Bayesian networks with the PCHC algorithm. PCHC stands for PC Hill-Climbing. It is a new hybrid algorithm that used PC to construct the skeleton of the BN and then utilizes the Hill-Climbing greedy search. The package has been expanded to include the MMHC and the FEDHC algortihms. It further includes the PCTABU, MMTABU and FEDTABU algortihms which are pretty much similar. Instead of the Hill Climbing greey search the Tabu search is employed.
| Package: | pchc |
| Type: | Package |
| Version: | 1.4 |
| Date: | 2026-03-25 |
| License: | GPL-2 |
Michail Tsagris <[email protected]>.
Michail Tsagris [email protected].
Sevinc V. and Tsagris M. (2024). On the Hyperparameters of PCTABU and PCHC Bayesian Network Learning Algorithms. https://assets-eu.researchsquare.com/files/rs-5137132/v1_covered_442bc8e8-2d63-4f92-be2d-1ecf9d7f9ee8.pdf?c=1727238437
Tsagris M. (2022). The FEDHC Bayesian Network Learning Algorithm. Mathematics 2022, 10(15): 2604.
Tsagris M. (2021). A new scalable Bayesian network learning algorithm with applications to economics. Computational Economics, 57(1): 341–367.
Spirtes P., Glymour C. and Scheines R. (2001). Causation, Prediction, and Search. The MIT Press, Cambridge, MA, USA, 3nd edition.
Tsamardinos I., Borboudakis G. (2010) Permutation Testing Improves Bayesian Network Learning. In Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2010. 322–337.
Tsamardinos I., Brown E.L. and Aliferis F.C. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine learning 65(1): 31–78.
Tsagris M. (2017). Conditional independence test for categorical data using Poisson log-linear model. Journal of Data Science, 15(2): 347–356.
Borboudakis G. and Tsamardinos I. (2019). Forward-backward selection with early dropping. Journal of Machine Learning Research, 20(8): 1-39.
Adjacency matrix of a Bayesian network.
bnmat(dag)bnmat(dag)
dag |
A BN object, an object of class "bn". |
The function is called from the "bnlearn" package which invokes the "Rgraphviz" package from Bioconductor and you need to install it first.
Adjacency matrix of a Bayesian network is extracted.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
x <- matrix( rnorm(200 * 10, 1, 10), nrow = 200 ) a <- pchc::pchc(x) pchc::bnmat(a$dag)x <- matrix( rnorm(200 * 10, 1, 10), nrow = 200 ) a <- pchc::pchc(x) pchc::bnmat(a$dag)
All pairwise G-square and chi-square tests of indepedence.
g2test_univariate(x, dc) g2test_univariate_perm(x, dc, B) chi2test_univariate(x, dc)g2test_univariate(x, dc) g2test_univariate_perm(x, dc, B) chi2test_univariate(x, dc)
x |
A numerical matrix with the data. The minimum must be 0, otherwise the function can crash or will produce wrong results. The data must be consecutive numbers. |
dc |
A numerical value equal to the number of variables (or columns of the data matrix) indicating the number of distinct, unique values (or levels) of each variable. Make sure you give the correct numbers here, otherwise the degrees of freedom will be wrong. |
B |
The number of permutations. The permutations test is slower than without permutations and should be used with small sample sizes or when the contigency tables have zeros. When there are few variables, R's "chisq.test" function is faster, but as the number of variables increase the time difference with R's procedure becomes larger and larger. |
The function does all the pairwise test of independence and gives the position inside the matrix.
The user must build the associations matrix now, similarly to the correlation matrix. See the examples of how to do that.
The p-value is not returned, we live this to the user. See the examples of how to obtain it.
A list including:
statistic |
The |
pvalue |
The p-value of the test statistic for each pair of variables. |
x |
The row or variable of the data. |
y |
The column or variable of the data. |
df |
The degrees of freedom of each test. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. (2017). Conditional independence test for categorical data using Poisson log-linear model. Journal of Data Science, 15(2): 347–356.
Tsamardinos I. and Borboudakis G. (2010). Permutation testing improves Bayesian network learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 322–337). Springer Berlin Heidelberg.
nvalues <- 3 nvars <- 10 nsamples <- 1000 x<- matrix( sample( 0:(nvalues - 1), nvars * nsamples, replace = TRUE ), nsamples, nvars ) dc <- rep(nvalues, nvars) system.time( g2test_univariate(x, dc) ) a <- g2test_univariate(x, dc)nvalues <- 3 nvars <- 10 nsamples <- 1000 x<- matrix( sample( 0:(nvalues - 1), nvars * nsamples, replace = TRUE ), nsamples, nvars ) dc <- rep(nvalues, nvars) system.time( g2test_univariate(x, dc) ) a <- g2test_univariate(x, dc)
Bootstrap versions of the skeleton of a Bayesian network.
pchc.skel.boot(x, method = "pearson", alpha = 0.05, B = 200) fedhc.skel.boot(x, method = "pearson", alpha = 0.05, B = 200) mmhc.skel.boot(x, max_k = 3, method = "pearson", alpha = 0.05, B = 200)pchc.skel.boot(x, method = "pearson", alpha = 0.05, B = 200) fedhc.skel.boot(x, method = "pearson", alpha = 0.05, B = 200) mmhc.skel.boot(x, max_k = 3, method = "pearson", alpha = 0.05, B = 200)
x |
A matrix with the variables. The user must know if they are continuous or if they are categorical. If you have categorical data though, the user must transform the data.frame into a matrix. In addition, the numerical matrix must have values starting from 0. For example, 0, 1, 2, instead of "A", "B" and "C". |
max_k |
The maximum conditioning set to use in the conditional indepedence test (see Details). Integer, default value is 3. |
method |
If you have continuous data, this "pearson". If you have categorical data though, this must be "cat". In this case, make sure the minimum value of each variable is zero. The function "g2Test" in the R package Rfast and the relevant functions work that way. |
alpha |
The significance level ( suitable values in (0, 1) ) for assessing the p-values. The default value is 0.05. |
B |
The number of bootstrap resamples to draw. The algorithm is performed in each bootstrap sample. In the end, the adjacency matrix on the observed data is returned, along with another adjacency matrix produced by the bootstrap. The latter one contains values from 0 to 1 indicating the proportion of times an edge between two nodes was present. |
A list including:
G |
The observed adjancency matrix. A value of 1 in G[i, j] appears in G[j, i] also, indicating that i and j have an edge between them. |
Gboot |
The bootstrapped adjancency matrix. A value of 1 in G[i, j] appears in G[j, i] also, indicating that i and j have an edge between them. |
runtime |
The duration of the algorithm. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
A new scalable Bayesian network learning algorithm with applications to economics. Computational Economics, 57(1): 341–367.
Tsagris M. (2022). The FEDHC Bayesian Network Learning Algorithm. Mathematics 2022, 10(15), 2604.
Spirtes P., Glymour C. and Scheines R. (2001). Causation, Prediction, and Search. The MIT Press, Cambridge, MA, USA, 3nd edition.
Tsamardinos I., Brown E.L. and Aliferis F.C. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine learning, 65(1): 31–78.
Borboudakis G. and Tsamardinos I. (2019). Forward-backward selection with early dropping. Journal of Machine Learning Research, 20(8): 1–39.
pchc.skel, fedhc.skel, mmhc.skel, bn.skel.utils
x <- pchc::rbn2(500, p = 20, nei = 3)$x a <- pchc::pchc.skel.boot(x, alpha = 0.05)x <- pchc::rbn2(500, p = 20, nei = 3)$x a <- pchc::pchc.skel.boot(x, alpha = 0.05)
Bootstrapping the FEDHC and FEDTABU Bayesian network learning algorithms.
fedhc.boot(x, method = "pearson", alpha = 0.05, ini.stat = NULL, R = NULL, restart = 10, score = "bic-g", blacklist = NULL, whitelist = NULL, B = 200, ncores = 1) fedtabu.boot(x, method = "pearson", alpha = 0.05, ini.stat = NULL, R = NULL, tabu = 10, score = "bic-g", blacklist = NULL, whitelist = NULL, B = 200, ncores = 1)fedhc.boot(x, method = "pearson", alpha = 0.05, ini.stat = NULL, R = NULL, restart = 10, score = "bic-g", blacklist = NULL, whitelist = NULL, B = 200, ncores = 1) fedtabu.boot(x, method = "pearson", alpha = 0.05, ini.stat = NULL, R = NULL, tabu = 10, score = "bic-g", blacklist = NULL, whitelist = NULL, B = 200, ncores = 1)
x |
A numerical matrix with the variables. If you have a data.frame (i.e. categorical data) turn them into a matrix. Note, that for the categorical case data, the numbers must start from 0. No missing data are allowed. |
method |
If you have continuous data, you can choose either "pearson" or "spearman". If you have categorical data though, this must be "cat". In this case, make sure the minimum value of each variable is zero. The |
alpha |
The significance level for assessing the p-values. |
ini.stat |
If the initial test statistics (univariate associations) are available, pass them through this parameter. |
R |
If the correlation matrix is available, pass it here. |
restart |
An integer, the number of random restarts. |
tabu |
An integer, the length of the tabu list used in the tabu function. |
score |
A character string, the label of the network score to be used in the algorithm. If none is specified, the default score is the Bayesian Information Criterion for both discrete and continuous data sets. The available score for continuous variables are: "bic-g" (default), "loglik-g", "aic-g", "bic-g" or "bge". The available score categorical variables are: "bde", "loglik" or "bic". |
blacklist |
A data frame with two columns (optionally labeled "from" and "to"), containing a set of arcs not to be included in the graph. |
whitelist |
A data frame with two columns (optionally labeled "from" and "to"), containing a set of arcs to be included in the graph. |
B |
The number of bootstrap resamples to draw. The algorithm is performed in each bootstrap sample. In the end, the adjacency matrix on the observed data is returned, along with another adjacency matrix produced by the bootstrap. The latter one contains values from 0 to 1 indicating the proportion of times an edge between two nodes was present. |
ncores |
The number of cores to use, in case of parallel computing. |
The FEDHC algorithm is implemented. The FBED algortihm (Borboudakis and Tsamardinos, 2019), without the backward phase, is implemented during the skeleton identification phase. Next, the Hill Climbing greedy search or the Tabu search is employed to score the network.
A list including:
mod |
A list including the output of the |
Gboot |
The bootstrapped adjancency matrix of the Bayesian network. |
runtime |
The duration of the algorithm. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. (2022). The FEDHC Bayesian Network Learning Algorithm. Mathematics 2022, 10(15): 2604.
Borboudakis G. and Tsamardinos I. (2019). Forward-backward selection with early dropping. Journal of Machine Learning Research, 20(8): 1–39.
Tsamardinos I., Brown E.L. and Aliferis F.C. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1): 31–78.
pchc, mmhc, fedhc, fedhc.skel
# simulate a dataset with continuous data x <- matrix( rnorm(200 * 20, 1, 10), nrow = 200 ) a <- fedhc.boot(x, B = 50)# simulate a dataset with continuous data x <- matrix( rnorm(200 * 20, 1, 10), nrow = 200 ) a <- fedhc.boot(x, B = 50)
Bootstrapping the MMHC and MMTABU Bayesian network learning algorithms.
mmhc.boot(x, method = "pearson", max_k = 3, alpha = 0.05, ini.stat = NULL, R = NULL, restart = 10, score = "bic-g", blacklist = NULL, whitelist = NULL, B = 200, ncores = 1) mmtabu.boot(x, method = "pearson", max_k = 3, alpha = 0.05, ini.stat = NULL, R = NULL, tabu = 10, score = "bic-g", blacklist = NULL, whitelist = NULL, B = 200, ncores = 1)mmhc.boot(x, method = "pearson", max_k = 3, alpha = 0.05, ini.stat = NULL, R = NULL, restart = 10, score = "bic-g", blacklist = NULL, whitelist = NULL, B = 200, ncores = 1) mmtabu.boot(x, method = "pearson", max_k = 3, alpha = 0.05, ini.stat = NULL, R = NULL, tabu = 10, score = "bic-g", blacklist = NULL, whitelist = NULL, B = 200, ncores = 1)
x |
A numerical matrix with the variables. If you have a data.frame (i.e. categorical data) turn them into a matrix. Note, that for the categorical case data, the numbers must start from 0. No missing data are allowed. |
method |
If you have continuous data, this "pearson". If you have categorical data though, this must be "cat". In this case, make sure the minimum value of each variable is zero. The function "g2Test" in the R package Rfast and the relevant functions work that way. |
max_k |
The maximum conditioning set to use in the conditional indepedence test (see Details). Integer, default value is 3 |
alpha |
The significance level for assessing the p-values. |
ini.stat |
If the initial test statistics (univariate associations) are available, pass them through this parameter. |
R |
If the correlation matrix is available, pass it here. |
restart |
An integer, the number of random restarts. |
tabu |
An integer, the length of the tabu list used in the tabu function. |
score |
A character string, the label of the network score to be used in the algorithm. If none is specified, the default score is the Bayesian Information Criterion for both discrete and continuous data sets. The available score for continuous variables are: "bic-g" (default), "loglik-g", "aic-g", "bic-g" or "bge". The available score categorical variables are: "bde", "loglik" or "bic". |
blacklist |
A data frame with two columns (optionally labeled "from" and "to"), containing a set of arcs not to be included in the graph. |
whitelist |
A data frame with two columns (optionally labeled "from" and "to"), containing a set of arcs to be included in the graph. |
B |
The number of bootstrap resamples to draw. The algorithm is performed in each bootstrap sample. In the end, the adjacency matrix on the observed data is returned, along with another adjacency matrix produced by the bootstrap. The latter one contains values from 0 to 1 indicating the proportion of times an edge between two nodes was present. |
ncores |
The number of cores to use, in case of parallel computing. |
The MMHC algorithm is implemented without performing the backward elimination during the skeleton identification phase. The MMHC as described in Tsamardinos et al. (2006) employs the MMPC algorithm during the skeleton construction phase and the Tabu search in the scoring phase. In this package, the mmhc function employs the Hill Climbing greedy search in the scoring phase while the mmtabu employs the Tabu search.
A list including:
mod |
A list including the output of the |
Gboot |
The bootstrapped adjancency matrix of the Bayesian network. |
runtime |
The duration of the algorithm. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsamardinos I., Brown E.L. and Aliferis F.C. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1): 31–78.
Tsagris M. (2021). A new scalable Bayesian network learning algorithm with applications to economics. Computational Economics, 57(1): 341–367.
# simulate a dataset with continuous data x <- matrix( rnorm(200 * 20, 1, 10), nrow = 200 ) a <- mmhc.boot(x, B = 50)# simulate a dataset with continuous data x <- matrix( rnorm(200 * 20, 1, 10), nrow = 200 ) a <- mmhc.boot(x, B = 50)
Bootstrapping the PCHC and PCTABU Bayesian network learning algorithms.
pchc.boot(x, method = "pearson", alpha = 0.05, ini.stat = NULL, R = NULL, restart = 10, score = "bic-g", blacklist = NULL, whitelist = NULL, B = 200, ncores = 1) pctabu.boot(x, method = "pearson", alpha = 0.05, ini.stat = NULL, R = NULL, tabu = 10, score = "bic-g", blacklist = NULL, whitelist = NULL, B = 200, ncores = 1)pchc.boot(x, method = "pearson", alpha = 0.05, ini.stat = NULL, R = NULL, restart = 10, score = "bic-g", blacklist = NULL, whitelist = NULL, B = 200, ncores = 1) pctabu.boot(x, method = "pearson", alpha = 0.05, ini.stat = NULL, R = NULL, tabu = 10, score = "bic-g", blacklist = NULL, whitelist = NULL, B = 200, ncores = 1)
x |
A numerical matrix with the variables. If you have a data.frame (i.e. categorical data) turn them into a matrix. Note, that for the categorical case data, the numbers must start from 0. No missing data are allowed. |
method |
If you have continuous data, you can choose either "pearson" or "spearman". If you have categorical data though,
this must be "cat". In this case, make sure the minimum value of each variable is zero. The |
alpha |
The significance level for assessing the p-values. |
ini.stat |
If the initial test statistics (univariate associations) are available, pass them through this parameter. |
R |
If the correlation matrix is available, pass it here. |
restart |
An integer, the number of random restarts. |
tabu |
An integer, the length of the tabu list used in the tabu function. |
score |
A character string, the label of the network score to be used in the algorithm. If none is specified, the default score is the Bayesian Information Criterion for both discrete and continuous data sets. The available score for continuous variables are: "bic-g" (default), "loglik-g", "aic-g", "bic-g" or "bge". The available score categorical variables are: "bde", "loglik" or "bic". |
blacklist |
A data frame with two columns (optionally labeled "from" and "to"), containing a set of arcs not to be included in the graph. |
whitelist |
A data frame with two columns (optionally labeled "from" and "to"), containing a set of arcs to be included in the graph. |
B |
The number of bootstrap resamples to draw. The algorithm is performed in each bootstrap sample. In the end, the adjacency matrix on the observed data is returned, along with another adjacency matrix produced by the bootstrap. The latter one contains values from 0 to 1 indicating the proportion of times an edge between two nodes was present. |
ncores |
The number of cores to use, in case of parallel computing. |
The PC algorithm as proposed by Spirtes et al. (2001) is first implemented followed by a scoring phase, such as hill climbing or tabu search. The PCHC was proposed by Tsagris (2021), while the PCTABU algorithm is the same but instead of the hill climbing scoring phase, the tabu search is employed.
A list including:
mod |
A list including the output of the |
Gboot |
The bootstrapped adjancency matrix of the Bayesian network. |
runtime |
The duration of the algorithm. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. (2021). A new scalable Bayesian network learning algorithm with applications to economics. Computational Economics, 57(1): 341–367.
Spirtes P., Glymour C. and Scheines R. (2001). Causation, Prediction, and Search. The MIT Press, Cambridge, MA, USA, 3nd edition.
Tsamardinos I. and Borboudakis G. (2010) Permutation Testing Improves Bayesian Network Learning. In Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2010, 322–337.
Tsamardinos I., Brown E.L. and Aliferis F.C. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1): 31–78.
# simulate a dataset with continuous data x <- matrix( rnorm(200 * 20, 1, 10), nrow = 200 ) a <- pchc.boot(x, B = 50)# simulate a dataset with continuous data x <- matrix( rnorm(200 * 20, 1, 10), nrow = 200 ) a <- pchc.boot(x, B = 50)
Check whether a directed graph is acyclic.
is.dag(dag)is.dag(dag)
dag |
A square matrix representing a directed graph which contains either 0 or 1, where G[i, j] = 1, means there is an arrow from node i to node j. |
The topological sort is performed. If it cannot be performed, NAs are returned. Hence, the functions checks for NAs.
A logical value, TRUE if the matrix represents a DAG and FALSE otherwise.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Chickering D.M. (1995). A transformational characterization of equivalent Bayesian network structures. Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, Montreal, Canada, 87–98.
G <- pchc::rbn3(100, 20, 0.3)$G pchc::is.dag(G) ## TRUEG <- pchc::rbn3(100, 20, 0.3)$G pchc::is.dag(G) ## TRUE
Chi-square and G-square tests of (unconditional) indepdence.
cat.tests(x, y, logged = FALSE)cat.tests(x, y, logged = FALSE)
x |
A numerical vector or a factor variable with data. The data must be consecutive numbers. |
y |
A numerical vector or a factor variable with data. The data must be consecutive numbers. |
logged |
Should the p-values be returned (FALSE) or their logarithm (TRUE)? |
The function calculates the test statistic of the and the tests of unconditional
independence between x and y. x and y need not be numerical vectors like in g2test. This
function is more close to the spirit of MASS' loglm()" function which calculates both statistics
using Poisson log-linear models (Tsagris, 2017).
A matrix with two rows. In each row the or test statistic, its p-value and the degrees of freedom are returned.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. (2021). A new scalable Bayesian network learning algorithm with applications to economics. Computational Economics, 57(1): 341–367.
Tsagris M. (2017). Conditional independence test for categorical data using Poisson log-linear model. Journal of Data Science, 15(2): 347–356.
x <- rbinom(100, 3, 0.5) y <- rbinom(100, 2, 0.5) cat.tests(x, y)x <- rbinom(100, 3, 0.5) y <- rbinom(100, 2, 0.5) cat.tests(x, y)
Contunuous data simulation from a DAG.
rbn2(n, G = NULL, p, nei, low = 0.1, up = 1) rbn3(n, p, s, a = 0, m, G = NULL, seed = FALSE)rbn2(n, G = NULL, p, nei, low = 0.1, up = 1) rbn3(n, p, s, a = 0, m, G = NULL, seed = FALSE)
n |
A number indicating the sample size. |
p |
A number indicating the number of nodes (or vectices, or variables). |
nei |
The average number of neighbours. |
s |
A number in |
a |
A number in |
m |
A vector equal to the number of nodes. This is the mean vector of the normal distribution from which the data are to be generated. This is used only when |
G |
If you already have an an adjacency matrix in mind, plug it in here, otherwise, leave it NULL. |
seed |
If seed is TRUE, the simulated data will always be the same. |
low |
Every child will be a function of some parents. The beta coefficients of the parents will be drawn uniformly from two numbers, low and up. See details for more information on this. |
up |
Every child will be a function of some parents. The beta coefficients of the parents will be drawn uniformly from two numbers, low and up. See details for more information on this. |
In the case where no adjacency matrix is given, an matrix with zeros everywhere is created.
Every element below the diagonal is is replaced by random values from a Bernoulli distribution with probability of success equal to s. This is the matrix B. Every value of 1 is replaced by a uniform value in . This final matrix is called A. The data are generated from a multivariate normal distribution with a zero mean vector and covariance matrix equal to , where is the identiy matrix. If a is greater than zero, the outliers are generated from a multivariate normal with the same covariance matrix and mean vector the one specified by the user, the argument "m". The flexibility of the outliers is that you cna specifiy outliers in some variables only or in all of them. For example, m = c(0,0,5) introduces outliers in the third variable only, whereas m = c(5,5,5) introduces outliers in all variables. The user is free to decide on the type of outliers to include in the data.
For the "rdag2", this is a different way of simulating data from DAGs. The first variable is normally generated. Every other variable can be a function of some previous ones. Suppose now that the i-th variable is a child of 4 previous variables. We need for coefficients to multiply the 4 variables and then generate the i-th variable from a normal with mean and variance 1. The will be either positive or negative values with equal probability. Their absolute values ranges between "low" and "up". The code is accessible and you can see in detail what is going on. In addition, every generated data, are standardised to avoid numerical overflow.
A list including:
nout |
The number of outliers. |
G |
The adcacency matrix used. For the "rdag" if G[i, j] = 2, then G[j, i] = 3 and this means that there is an arrow from j to i. For the "rdag2" the entries are either G[i, j] = G[j, i] = 0 (no edge) or G[i, j] = 1 and G[j, i] = 0 (indicating i -> j). |
A |
The matrix with the with the uniform values in the interval |
x |
The simulated data. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. (2019). Bayesian network learning with the PC algorithm: an improved and correct variation. Applied Artificial Intelligence, 33(2): 101–123.
Tsagris M., Borboudakis G., Lagani V. and Tsamardinos I. (2018). Constraint-based Causal Discovery with Mixed Data. International Journal of Data Science and Analytics, 6: 19–30.
Spirtes P., Glymour C. and Scheines R. (2001). Causation, Prediction, and Search. The MIT Press, Cambridge, MA, USA, 3nd edition.
Colombo Diego, and Marloes H. Maathuis (2014). Order-independent constraint-based causal structure learning. Journal of Machine Learning Research, 15(1): 3741–3782.
x <- pchc::rbn3(100, 20, 0.2)$x a <- pchc::pchc(x)x <- pchc::rbn3(100, 20, 0.2)$x a <- pchc::pchc(x)
Correlations between pairs of variables.
corpairs(x, y, rho = NULL, logged = FALSE, parallel = FALSE)corpairs(x, y, rho = NULL, logged = FALSE, parallel = FALSE)
x |
A matrix with real valued data. |
y |
A matrix with real valued data whose dimensions match those of x. |
rho |
This can be a vector of assumed correlations (equal to the number of variables or the columns of x or y) to be tested. If this is not the case, leave it NULL and only the correlations will be returned. |
logged |
Should the p-values be returned (FALSE) or their logarithm (TRUE)? This is taken into account only if "rho" is a vector. |
parallel |
Should parallel implentations take place in C++? The default value is FALSE. |
The paired correlations are calculated. For each column of the matrices x and y the correlation between them is calculated.
A vector of correlations in the case of "rho" being NULL, or a matrix with two extra columns, the test statistic and the (logged) p-value.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Lambert Diane (1992). Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing. Technometrics. 34(1): 1–14.
Johnson Norman L., Kotz Samuel and Kemp Adrienne W. (1992). Univariate Discrete Distributions (2nd ed.). Wiley
Cohen, A. Clifford (1960). Estimating parameters in a conditional Poisson distribution. Biometrics. 16: 203–211.
Johnson, Norman L. Kemp, Adrianne W. Kotz, Samuel (2005). Univariate Discrete Distributions (third edition). Hoboken, NJ: Wiley-Interscience.
x <- matrix( rnorm(100 * 100), ncol = 100) y <- matrix( rnorm(100 * 100), ncol = 100) system.time( a <- corpairs(x, y) )x <- matrix( rnorm(100 * 100), ncol = 100) y <- matrix( rnorm(100 * 100), ncol = 100) system.time( a <- corpairs(x, y) )
Correlation matrix for FBM class matrices (big matrices).
big_cor(x)big_cor(x)
x |
An FBM class matrix. |
The function accepts a Filebacked Big Matrix (FBM) class matrix and returns the correlation matrix. Check you matrix for possible NA values. For more information see the "bigmemory" and "bigstatsr" packages.
The correlation matrix of the big data x.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
big_read, fedhc.skel, mmhc.skel
require(bigstatsr, quietly = TRUE) x <- matrix( runif(100 * 50, 1, 100), ncol = 50 ) x <- bigstatsr::as_FBM(x) a <- pchc::big_cor(x)require(bigstatsr, quietly = TRUE) x <- matrix( runif(100 * 50, 1, 100), ncol = 50 ) x <- bigstatsr::as_FBM(x) a <- pchc::big_cor(x)
Correlation significance testing using Fisher's z-transformation.
cortest(y, x, rho = 0, a = 0.05 )cortest(y, x, rho = 0, a = 0.05 )
y |
A numerical vector. |
x |
A numerical vector. |
rho |
The value of the hypothesised correlation to be used in the hypothesis testing. |
a |
The significance level used for the confidence intervals. |
The function uses the built-in function "cor" which is very fast, then computes a confidence interval and produces a p-value for the hypothesis test.
A vector with 5 numbers; the correlation, the p-value for the hypothesis test that each of them is equal to "rho", the test statistic and the $a/2%$ lower and upper confidence limits.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. (2021). A new scalable Bayesian network learning algorithm with applications to economics. Computational Economics, 57(1): 341–367.
x <- rcauchy(60) y <- rnorm(60) cortest(y, x)x <- rcauchy(60) y <- rnorm(60) cortest(y, x)
Correlation between a vector and a set of variables.
correls(y, x, type = "pearson", rho = 0, a = 0.05)correls(y, x, type = "pearson", rho = 0, a = 0.05)
y |
A numerical vector. |
x |
A matrix with the data. |
type |
The type of correlation you want. "pearson" and "spearman" are the two supported types because their standard error is easily calculated. For the "groupcorrels" you can also put "kendall" because no hypothesis test is performed in that function. |
rho |
The value of the hypothesised correlation to be used in the hypothesis testing. |
a |
The significance level used for the confidence intervals. |
The functions uses the built-in function "cor" which is very fast and then includes confidence intervals and produces a p-value for the hypothesis test.
A matrix with 5 column; the correlation, the p-value for the hypothesis test that each of them is
eaqual to "rho", the test statistic and the lower and upper confidence limits.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
x <- matrix( rnorm(100 * 50 ), ncol = 50) y <- rnorm(100) r <- cor(y, x) ## correlation of y with each of the xs b <- correls(y, x)x <- matrix( rnorm(100 * 50 ), ncol = 50) y <- rnorm(100) r <- cor(y, x) ## correlation of y with each of the xs b <- correls(y, x)
Estimation of the percentage of null p-values.
pi0est(p, lambda = seq(0.05, 0.95, by = 0.01), dof = 3)pi0est(p, lambda = seq(0.05, 0.95, by = 0.01), dof = 3)
p |
A vector of p-values. |
lambda |
A vector of values of the tuning parameter lambda. |
dof |
Number of degrees of freedom to use when estimating pi_0 with smoothing splines. |
The estimated proporiton of null p-values is estimated the algorithm by Storey and Tibshirani (2003).
The estimated proportion of non significant (null) p-values. In the paper Storey and Tibshirani mention that the estimate of pi0 is with lambda=1, but in their R code they use the highest value of lambda and thus we do the same here.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Storey J.D. and Tibshirani R. (2003). Statistical significance for genome-wide experiments. Proceedings of the National Academy of Sciences, 100: 9440–9445.
conf.edge.lower, bn.skel.utils, mmhc.skel
A <- pchc::rbn2(1000, p = 20, nei = 3) x <- A$x mod <- pchc::mmhc.skel(x, alpha = 0.05 ) pval <- exp(mod$pvalue) pval <- lower.tri(pval) pchc::pi0est(pval)A <- pchc::rbn2(1000, p = 20, nei = 3) x <- A$x mod <- pchc::mmhc.skel(x, alpha = 0.05 ) pval <- exp(mod$pvalue) pval <- lower.tri(pval) pchc::pi0est(pval)
G-square test of conditional indepdence with and without permutations.
g2test(x, indx, indy, indz, dc) chi2test(x, indx, indy, indz, dc) g2test_perm(x, indx, indy, indz, dc, B)g2test(x, indx, indy, indz, dc) chi2test(x, indx, indy, indz, dc) g2test_perm(x, indx, indy, indz, dc, B)
x |
A numerical matrix with the data. The minimum must be 0, otherwise the function can crash or will produce wrong results. The data must be consecutive numbers. |
indx |
A number between 1 and the number of columns of data. This indicates which variable to take. |
indy |
A number between 1 and the number of columns of data (other than x). This indicates the other variable whose independence with x is to be tested. |
indz |
A vector with the indices of the variables to condition upon. It must be non zero and between 1 and the number
of variables. If you want unconditional independence test see |
dc |
A numerical value equal to the number of variables (or columns of the data matrix) indicating the number of distinct, unique values (or levels) of each variable. Make sure you give the correct numbers here, otherwise the degrees of freedom will be wrong. |
B |
The number of permutations. The permutations test is slower than without permutations and should be used with small sample sizes or when the contigency tables have zeros. When there are few variables, R's "chisq.test" function is faster, but as the number of variables increase the time difference with R's procedure becomes larger and larger. |
The functions calculates the test statistic of the or the test of conditional independence between x and y conditional on a set of variable(s) cs.
A list including:
statistic |
The |
df |
The degrees of freedom of the test statistic. |
x |
The row or variable of the data. |
y |
The column or variable of the data. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. (2021). A new scalable Bayesian network learning algorithm with applications to economics. Computational Economics, 57(1): 341–367.
Tsamardinos I., and Borboudakis G. (2010). Permutation testing improves Bayesian network learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 322–337). Springer Berlin Heidelberg.
nvalues <- 2 nvars <- 5 nsamples <- 5000 data <- matrix( sample( 0:(nvalues - 1), nvars * nsamples, replace = TRUE ), nsamples, nvars ) dc <- rep(nvalues, nvars) g2test( data, 1, 2, 3, c(3, 3, 3) ) g2test_perm( data, 1, 2, 3, c(3, 3, 3), 1000 )nvalues <- 2 nvars <- 5 nsamples <- 5000 data <- matrix( sample( 0:(nvalues - 1), nvars * nsamples, replace = TRUE ), nsamples, nvars ) dc <- rep(nvalues, nvars) g2test( data, 1, 2, 3, c(3, 3, 3) ) g2test_perm( data, 1, 2, 3, c(3, 3, 3), 1000 )
Lower limit of the confidence of an edge.
conf.edge.lower(p)conf.edge.lower(p)
p |
A numerical vector with the proportion of times an edge was found in the bootstrapped PC algorithm or the confidence of the edge returned by |
After having performed PC algorithm many times in the bootstrap samples (using mmhc.skel.boot for example) you get a symmetric matrix with the proportion of times an edge was discovered. Take the lower (or upper) triangular elements of that matrix and pass them as input in this function. This will tell you the minimum proportion required to be confident that an edge is trully significant.
The estimated cutoff limit above which an edge can be deemed significant.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Scutari M. and Nagarajan R. (2013). Identifying significant edges in graphical models of molecular networks. Artifficial Intelligence in Medicine, 57: 207–217.
pchc.skel.boot, mmhc.skel.boot, fedhc.skel.boot
y <- pchc::rbn2(200, p = 30, nei = 3) x <- y$x g <- pchc::pchc.skel.boot(x, B = 100)$Gboot a <- g[ lower.tri(g) ] pchc::conf.edge.lower(a)y <- pchc::rbn2(200, p = 30, nei = 3) x <- y$x g <- pchc::pchc.skel.boot(x, B = 100)$Gboot a <- g[ lower.tri(g) ] pchc::conf.edge.lower(a)
Markov blanket of a node in a Bayesian network.
mb(bn, node)mb(bn, node)
bn |
This can either be a bn object or the adjacency matrix. |
node |
A vector with one number indicating the node or variable whose Markov blanket is to be returned. |
The Markov blanket of a variable (node) is the set of its parents, children and spouses.
parents |
The parents of the node of interest. |
children |
The children of the node of interest. |
spouses |
The spouses of the node of interest. These are the other parents of the children of the node of interest. |
markov.blanket |
The Markov blanket of the node of interest. The collection of all the previous. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
y <- pchc::rbn3(1000, 10, 0.3) tru <- y$G x <- y$x mod <- pchc(x) pchc::bnplot(mod$dag) G <- pchc::bnmat(mod$dag) pchc::mb(G, 6)y <- pchc::rbn3(1000, 10, 0.3) tru <- y$G x <- y$x mod <- pchc(x) pchc::bnplot(mod$dag) G <- pchc::bnmat(mod$dag) pchc::mb(G, 6)
Outliers free data via the reweighted MCD.
rmcd(x, alpha = NULL)rmcd(x, alpha = NULL)
x |
A numerical matrix with the variables. If you have a data.frame (i.e. categorical data) turn them into a matrix. |
alpha |
A number controlling the size of the subsets over which the determinant is minimized; roughly alpha*n observations are used for computing the determinant. Values between 0.5 and 1 are allowed. |
The FEDHC algorithm.
A list including:
poia |
A vector with the indices of the vectors that were removed. |
x |
The outlier free data. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Rousseeuw P. J. and Leroy A. M. (1987) Robust Regression and Outlier Detection. Wiley.
Rousseeuw P. J. and van Driessen K. (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41: 212–223.
Pison G., Van Aelst S., and Willems G. (2002) Small Sample Corrections for LTS and MCD. Metrika, 55: 111–123.
Hubert M., Rousseeuw P. J. and Verdonck, T. (2012) A deterministic algorithm for robust location and scatter. Journal of Computational and Graphical Statistics, 21: 618–637.
Cerioli A. (2010). Multivariate outlier detection with high-breakdown estimators. Journal of the American Statistical Association, 105(489): 147–156.
Cerchiello P. and Giudici P. (2016). Big data analysis for financial risk management. Journal of Big Data, 3(1): 18.
fedhc.skel, pchc.skel, mmhc.skel
x <- matrix( rnorm(200 * 20), nrow = 200 ) x1 <- matrix( rnorm(10 * 20, 10), nrow = 10 ) x <- rbind(x, x1) a <- pchc::rmcd(x) a$poiax <- matrix( rnorm(200 * 20), nrow = 200 ) x1 <- matrix( rnorm(10 * 20, 10), nrow = 10 ) x <- rbind(x, x1) a <- pchc::rmcd(x) a$poia
Partial correlation between two continuous variables when a correlation matrix is given.
pcor(R, indx, indy, indz, n)pcor(R, indx, indy, indz, n)
R |
A correlation or covariance matrix. |
indx |
The index of the first variable whose conditional correlation is to estimated. |
indy |
The index of the second variable whose conditional correlation is to estimated. |
indz |
The index of the conditioning variables. |
n |
The sample size of the data from which the correlation matrix was computed. |
Given a correlation or a covariance matrix the function will caclulate the partial correlation between variables indx and indy conditioning on variable(s) indz and will return the logarithm of the p-value.
A numeric vector containing the partial correlation and logged p-value for the test of no partial correlation.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
y <- as.matrix( iris[, 1:2] ) z <- cbind(1, iris[, 3] ) er <- resid( .lm.fit(z, y) ) r <- cor(er)[1, 2] z <- 0.5 * log( (1 + r) / (1 - r) ) * sqrt( 150 - 1 - 3 ) log(2) + pt( abs(z), 150 - 1 - 3, lower.tail = FALSE, log.p = TRUE ) r <- cor(iris[, 1:3]) pcor(r, 1,2, 3, 150)y <- as.matrix( iris[, 1:2] ) z <- cbind(1, iris[, 3] ) er <- resid( .lm.fit(z, y) ) r <- cor(er)[1, 2] z <- 0.5 * log( (1 + r) / (1 - r) ) * sqrt( 150 - 1 - 3 ) log(2) + pt( abs(z), 150 - 1 - 3, lower.tail = FALSE, log.p = TRUE ) r <- cor(iris[, 1:3]) pcor(r, 1,2, 3, 150)
Partial correlation matrix from correlation or covariance matrix.
cor2pcor(R)cor2pcor(R)
R |
A correlation or covariance matrix. |
Given a correlation or covariance matrix the function will caclulate the pairwise partial correlation conditional on all other variables.
A matrix where each entry is the partial correlation matrix between each pair of variables conditional on all other variables.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
x <- as.matrix(iris[, 1:4]) R <- cor(x) cor2pcor(R) pcor(R, 1, 2, 3:4, n = 150)x <- as.matrix(iris[, 1:4]) R <- cor(x) cor2pcor(R) pcor(R, 1, 2, 3:4, n = 150)
Plot of a Bayesian network.
bnplot(dag, shape = "ellipse", main = NULL, sub = NULL, highlight = NULL)bnplot(dag, shape = "ellipse", main = NULL, sub = NULL, highlight = NULL)
dag |
A BN object, an object of class "bn". |
shape |
A character string defining the shape of the nodes, "ellipse" (default value), "circle" or "rectangle". |
main |
The main title of the graph displayed on the top. |
sub |
The subtitle of the graph displayed at the bottom. |
highlight |
A list with options specifying which nodes to plot with different colours. You can also check the package bnlearn or the package Rgraphviz for more information on this, or simply check the example below. |
The function is called from the "bnlearn" package which invokes the "Rgraphviz" package from Bioconductor and you need to install it first.
The Bayesian network is visualised.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
if (require("Rgraphviz") ) { # simulate a dataset with continuous data x <- matrix( rnorm(100 * 15, 1, 5), nrow = 100 ) colnames(x) <- paste("X", 1:15, sep = "") nam <- colnames(x) a <- pchc(x) bnplot(a$dag) bnplot( a$dag, highlight = list(nodes = nam[c(2, 3)], col = "tomato", fill = "orange") ) }if (require("Rgraphviz") ) { # simulate a dataset with continuous data x <- matrix( rnorm(100 * 15, 1, 5), nrow = 100 ) colnames(x) <- paste("X", 1:15, sep = "") nam <- colnames(x) a <- pchc(x) bnplot(a$dag) bnplot( a$dag, highlight = list(nodes = nam[c(2, 3)], col = "tomato", fill = "orange") ) }
Random values simulation from a Bayesian network.
rbn(n, dagobj, x)rbn(n, dagobj, x)
n |
The number of observations to generate. |
dagobj |
A "bn" object. See the examples for more information. |
x |
The data used to fit the Bayesian network in a data.frame format. |
This information is taken directly from the R package "bnlearn". This function implements forward/logic sampling: values for the root nodes are sampled from their (un-conditional) distribution, then those of their children conditional on the respective parent sets. Thisis done iteratively until values have been sampled for all nodes.If "dagobj" contains NA parameter estimates (because of unobserved discrete parents configurations in the data the parameters were learned from), rbn will produce observations that contain NAs when thoseparents configurations appear in the simulated samples.
A data frame with the same structure (column names and data types) of the argument "data".
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Korb K. and Nicholson A.E. (2010).Bayesian Artificial Intelligence. Chapman & Hall/CRC, 2nd edition.
# simulate a dataset with continuous data x <- matrix( rnorm(200 * 20, 1, 10), nrow = 200 ) a <- pchc::pchc(x) sim <- pchc::rbn( 100, dagobj = a$dag, x = x )# simulate a dataset with continuous data x <- matrix( rnorm(200 * 20, 1, 10), nrow = 200 ) a <- pchc::pchc(x) sim <- pchc::rbn( 100, dagobj = a$dag, x = x )
Read big data or a big.matrix object.
big_read(big_path, select, header = TRUE, sep = ",")big_read(big_path, select, header = TRUE, sep = ",")
big_path |
The path (including the name) where the big.matrix object is. |
select |
Indices of columns to read (sorted). The length of select will be the number of columns of the resulting filebacked Big Matrix. |
header |
If there are column names, then this should be TRUE. |
sep |
A field delimiter, for example ";" or "," (comma separated). See also |
The data (matrix) which will be read and compressed into a big.matrix object must be of type "numeric". I tested it and it works with "integer" as well. But, in general, bear in mind that only matrices will be read. I have not tested with data.frame for example. However, in the help page of "bigmemory" this is mentioned: Any non-numeric entry will be ignored and replaced with NA, so reading something that traditionally would be a data.frame won't cause an error. A warning is issued. In all cases, the big.matrix is turned into a Filebacked Big Matrix (FBM) of type 'double' the object size is alwasy 680 bytes! If the initial dataset has row names these will be ignored and a column with NAs will apear. So check your final FBM matrix. For more information see the "bigmemory" and "bigstatsr" packages.
A Filebacked Big Matrix (FBM) matrix.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
big_cor, fedhc.skel, mmhc.skel
x <- matrix( runif(100 * 5, 1, 100), ncol = 5 )x <- matrix( runif(100 * 5, 1, 100), ncol = 5 )
Receiver operating curve and area under the curve.
auc(group, preds, roc = FALSE, cutoffs = NULL)auc(group, preds, roc = FALSE, cutoffs = NULL)
group |
A numerical vector with the predicted values of each group as 0 and 1. |
preds |
The predicted values of each group. |
roc |
If you want the ROC to appear set it to TRUE. |
cutoffs |
If you provide a vector with decreasing numbers from 1 to 0 that will be used for the ROC, otherwise, the values from 1 to 0 with a step equal to -0.01 will be used. |
The ara under the curve is returned. The user has the option of getting the receiver operating curve as well.
A list including:
cutoffs |
The cutoff values. |
sensitivity |
The sensitivity values for each cutoff value. |
specificity |
The specificity value for each cutoff value. |
youden |
The pair of of 1- specificity and sensitivity where the Youden's J appears on the graph and the Youden index which is defined as the maximum value of sensitivity - specificity + 1. |
auc |
The area under the curve, plus a circle with the point where Youden's J is located. If "roc" is set to FALSE, this is the only item in the list to be returned. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
bn.skel.utils, conf.edge.lower, mmhc.skel
g <- rbinom(150, 1, 0.6) f <- rnorm(150) pchc::auc(g, f, roc = FALSE)g <- rbinom(150, 1, 0.6) f <- rnorm(150) pchc::auc(g, f, roc = FALSE)
The skeleton of a Bayesian network produced by the MMHC or the FEDHC algorithm using the distance correlation.
dcor.mmhc.skel(x, max_k = 3, alpha = 0.05, ini.pvalue = NULL, B = 999) dcor.fedhc.skel(x, alpha = 0.05, ini.stat = NULL, R = NULL)dcor.mmhc.skel(x, max_k = 3, alpha = 0.05, ini.pvalue = NULL, B = 999) dcor.fedhc.skel(x, alpha = 0.05, ini.stat = NULL, R = NULL)
x |
A numerical matrix with the variables. If you have a data.frame (i.e. categorical data) turn them into a matrix. Note, that for the categorical case data, the numbers must start from 0. No missing data are allowed. |
max_k |
The maximum conditioning set to use in the conditional indepedence test (see Details). Integer, default value is 3. |
alpha |
The significance level (suitable values in (0, 1)) for assessing the p-values. Default value is 0.05. |
ini.pvalue |
If the initial p-values (univariate associations) are available, pass them through this parameter. |
ini.stat |
If the initial test statistics (univariate associations) are available, pass them through this parameter. |
B |
The number of permutations to execute to compute the p-value of the distance correlation. |
R |
If the correlation matrix is available, pass it here. |
The max_k option: the maximum size of the conditioning set to use in the conditioning independence test. Larger values provide more accurate results, at the cost of higher computational times. When the sample size is small (e.g., observations) the max_k parameter should be 3 for example, otherwise the conditional independence test may not be able to provide reliable results.
As in FEDHC the first phase consists of a variable selection procedure, the FBED algortihm (Borboudakis and Tsamardinos, 2019) which is performed though by utilizing the distance correlation (Szekely et al., 2007, Szekely and Rizzo 2014, Huo and Szekely, 2016).
A list including:
ini.stat |
The test statistics of the univariate associations. |
ini.pvalue |
The initial p-values univariate associations. |
pvalue |
A matrix with the logarithm of the p-values of the updated associations. This final p-value is the maximum p-value among the two p-values in the end. |
runtime |
The duration of the algorithm. |
ntests |
The number of tests conducted during each k. |
G |
The adjancency matrix. A value of 1 in G[i, j] appears in G[j, i] also, indicating that i and j have an edge between them. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. (2022). The FEDHC Bayesian Network Learning Algorithm. Mathematics, 10(25): 2604.
Szekely G.J., Rizzo M.L. and Bakirov N.K. (2007). Measuring and Testing Independence by Correlation of Distances. Annals of Statistics, 35(6): 2769–2794.
Szekely G.J. and Rizzo M. L. (2014). Partial distance correlation with methods for dissimilarities. Annals of Statistics, 42(6): 2382–2412.
Huo X. and Szekely G.J. (2016). Fast computing for distance covariance. Technometrics, 58(4): 435–447.
Tsamardinos I., Brown E.L. and Aliferis F.C. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1): 31–78.
# simulate a dataset with continuous data x <- matrix( rnorm(500 * 30, 1, 10), nrow = 500 ) a <- dcor.fedhc.skel(x)# simulate a dataset with continuous data x <- matrix( rnorm(500 * 30, 1, 10), nrow = 500 ) a <- dcor.fedhc.skel(x)
The skeleton of a Bayesian network produced by the FEDHC algorithm.
fedhc.skel(x, method = "pearson", alpha = 0.05, robust = FALSE, ini.stat = NULL, R = NULL)fedhc.skel(x, method = "pearson", alpha = 0.05, robust = FALSE, ini.stat = NULL, R = NULL)
x |
A numerical matrix with the variables. If you have a data.frame (i.e. categorical data) turn them into a matrix. Note, that for the categorical case data, the numbers must start from 0. No missing data are allowed. |
method |
If you have continuous data, this "pearson". If you have categorical data though, this must be "cat". In this case, make sure the minimum value of each variable is zero. The function "g2Test" in the R package Rfast and the relevant functions work that way. |
alpha |
The significance level (suitable values in (0, 1)) for assessing the p-values. Default value is 0.05. |
robust |
Do you want outliers to be removed prior to applying the PCHC algorithm? If yes, set this to TRUE to utilise the MCD. |
ini.stat |
If the initial test statistics (univariate associations) are available, pass them through this parameter. |
R |
If the correlation matrix is available, pass it here. |
Similar to MMHC and PCHC the first phase consists of a variable selection procedure, the FBED algortihm (Borboudakis and Tsamardinos, 2019).
A list including:
ini.stat |
The test statistics of the univariate associations. |
ini.pvalue |
The initial p-values univariate associations. |
pvalue |
A matrix with the logarithm of the p-values of the updated associations. This final p-value is the maximum p-value among the two p-values in the end. |
runtime |
The duration of the algorithm. |
ntests |
The number of tests conducted during each k. |
G |
The adjancency matrix. A value of 1 in G[i, j] appears in G[j, i] also, indicating that i and j have an edge between them. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. (2022). The FEDHC Bayesian Network Learning Algorithm. Mathematics, 10(25): 2604.
Borboudakis G. and Tsamardinos I. (2019). Forward-backward selection with early dropping. Journal of Machine Learning Research, 20(8): 1–39.
Tsamardinos I., Brown E.L. and Aliferis F.C. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1): 31–78.
pchc.skel, mmhc.skel, fedhc, fedhc.skel.boot, dcor.fedhc.skel
# simulate a dataset with continuous data x <- matrix( rnorm(200 * 50, 1, 10), nrow = 200 ) a <- fedhc.skel(x)# simulate a dataset with continuous data x <- matrix( rnorm(200 * 50, 1, 10), nrow = 200 ) a <- fedhc.skel(x)
The skeleton of a Bayesian network learned with the MMHC algorithm.
mmhc.skel(x, method = "pearson", max_k = 3, alpha = 0.05, robust = FALSE, ini.stat = NULL, R = NULL, parallel = FALSE)mmhc.skel(x, method = "pearson", max_k = 3, alpha = 0.05, robust = FALSE, ini.stat = NULL, R = NULL, parallel = FALSE)
x |
A numerical matrix with the variables. If you have a data.frame (i.e. categorical data) turn them into a matrix. Note, that for the categorical case data, the numbers must start from 0. No missing data are allowed. |
method |
If you have continuous data, this "pearson". If you have categorical data though, this must be "cat". In this case, make sure the minimum value of each variable is zero. The function "g2Test()" in the R package Rfast and the relevant functions work that way. |
max_k |
The maximum conditioning set to use in the conditional indepedence test (see Details). Integer, default value is 3. |
alpha |
The significance level (suitable values in (0, 1)) for assessing the p-values. Default value is 0.05. |
robust |
Do you want outliers to be removed prior to applying the PCHC algorithm? If yes, set this to TRUE to utilise the MCD. |
ini.stat |
If the initial test statistics (univariate associations) are available, pass them through this parameter. |
R |
If the correlation matrix is available, pass it here. |
parallel |
Set this to TRUE if you have millions of observations. In that instance it can reduce the computational time by 1/3. |
The max_k option: the maximum size of the conditioning set to use in the conditioning independence test. Larger values provide more accurate results, at the cost of higher computational times. When the sample size is small (e.g., observations) the max_k parameter should be 3 for example, otherwise the conditional independence test may not be able to provide reliable results.
A list including:
ini.stat |
The test statistics of the univariate associations. |
ini.pvalue |
The initial p-values univariate associations. |
pvalue |
A matrix with the logarithm of the p-values of the updated associations. This final p-value is the maximum p-value among the two p-values in the end. |
runtime |
The duration of the algorithm. |
ntests |
The number of tests conducted during each k. |
G |
The adjancency matrix. A value of 1 in G[i, j] appears in G[j, i] also, indicating that i and j have an edge between them. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsamardinos, I., Aliferis, C. F. and Statnikov, A. (2003). Time and sample efficient discovery of Markov blankets and direct causal relations. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 673–678). ACM.
Brown, L. E., Tsamardinos, I. and Aliferis, C. F. (2004). A novel algorithm for scalable and accurate Bayesian network learning. Medinfo, 711–715.
Tsamardinos I., Brown E.L. and Aliferis F.C. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1): 31–78.
pchc.skel, mmhc.skel, mmhc, mmhc.skel.boot
# simulate a dataset with continuous data x <- matrix( rnorm(300 * 30, 1, 100), nrow = 300 ) a <- mmhc.skel(x)# simulate a dataset with continuous data x <- matrix( rnorm(300 * 30, 1, 100), nrow = 300 ) a <- mmhc.skel(x)
The skeleton of a Bayesian network learned with the PC algorithm.
pchc.skel(x, method = "pearson", alpha = 0.05, robust = FALSE, ini.stat = NULL, R = NULL)pchc.skel(x, method = "pearson", alpha = 0.05, robust = FALSE, ini.stat = NULL, R = NULL)
x |
A numerical matrix with the variables. If you have a data.frame (i.e. categorical data) turn them into a matrix. Note, that for the categorical case data, the numbers must start from 0. No missing data are allowed. |
method |
If you have continuous data, this "pearson". If you have categorical data though, this must be "cat". In this case, make sure the minimum value of each variable is zero. The function "g2Test()" in the R package Rfast and the relevant functions work that way. |
alpha |
The significance leve for assessing the p-values. |
robust |
Do you want outliers to be removed prior to applying the PCHC algorithm? If yes, set this to TRUE to utilise the MCD. |
ini.stat |
If the initial test statistics (univariate associations) are available, pass them through this parameter. |
R |
If the correlation matrix is available, pass it here. |
The PC algorithm as proposed by Spirtes et al. (2000) is implemented. The variables must be either continuous or categorical, only. The skeleton of the PC algorithm is order independent, since we are using the third heuristic (Spirte et al., 2000, pg. 90). At every stage of the algorithm use the pairs which are least statistically associated. The conditioning set consists of variables which are most statistically associated with each other of the pair of variables.
For example, for the pair (X, Y) there can be two conditioning sets for example (Z1, Z2) and (W1, W2). All p-values and test statistics and degrees of freedom have been computed at the first step of the algorithm. Take the p-values between (Z1, Z2) and (X, Y) and between (Z1, Z2) and (X, Y). The conditioning set with the minimum p-value is used first. If the minimum p-values are the same, use the second lowest p-value. If the unlikely, but not impossible, event of all p-values being the same, the test statistic divided by the degrees of freedom is used as a means of choosing which conditioning set is to be used first.
If two or more p-values are below the machine epsilon (.Machine$double.eps which is equal to 2.220446e-16), all of them are set to 0. To make the comparison or the ordering feasible we use the logarithm of p-value. Hence, the logarithm of the p-values is always calculated and used.
In the case of the test of independence (for categorical data) with no permutations, we have incorporated a rule of thumb. If the number of samples is at least 5 times the number of the parameters to be estimated, the test is performed, otherwise, independence is not rejected according to Tsamardinos et al. (2006). We have modified it so that it calculates the p-value using permutations.
A list including:
stat |
The test statistics of the univariate associations. |
ini.pvalue |
The initial p-values univariate associations. |
pvalue |
The logarithm of the p-values of the univariate associations. |
runtime |
The duration of the algorithm. |
kappa |
The maximum value of k, the maximum cardinality of the conditioning set at which the algorithm stopped. |
n.tests |
The number of tests conducted during each k. |
G |
The adjancency matrix. A value of 1 in G[i, j] appears in G[j, i] also, indicating that i and j have an edge between them. |
sepset |
A list with the separating sets for every value of k. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Spirtes P., Glymour C. and Scheines R. (2001). Causation, Prediction, and Search. The MIT Press, Cambridge, MA, USA, 3nd edition.
Tsagris M. (2021). A new scalable Bayesian network learning algorithm with applications to economics. Computational Economics, 57(1): 341–367.
Tsamardinos I. and Borboudakis G. (2010) Permutation Testing Improves Bayesian Network Learning. In Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2010. 322–337.
fedhc.skel, mmhc.skel, pchc, pchc.skel.boot
# simulate a dataset with continuous data x <- matrix( rnorm(300 * 30, 1, 100), nrow = 300 ) a <- pchc::pchc.skel(x)# simulate a dataset with continuous data x <- matrix( rnorm(300 * 30, 1, 100), nrow = 300 ) a <- pchc::pchc.skel(x)
The FEDHC and FEDTABU Bayesian network learning algorithms.
fedhc(x, method = "pearson", alpha = 0.05, robust = FALSE, skel = NULL, ini.stat = NULL, R = NULL, restart = 10, score = "bic-g", blacklist = NULL, whitelist = NULL) fedtabu(x, method = "pearson", alpha = 0.05, robust = FALSE, skel = NULL, ini.stat = NULL, R = NULL, tabu = 10, score = "bic-g", blacklist = NULL, whitelist = NULL)fedhc(x, method = "pearson", alpha = 0.05, robust = FALSE, skel = NULL, ini.stat = NULL, R = NULL, restart = 10, score = "bic-g", blacklist = NULL, whitelist = NULL) fedtabu(x, method = "pearson", alpha = 0.05, robust = FALSE, skel = NULL, ini.stat = NULL, R = NULL, tabu = 10, score = "bic-g", blacklist = NULL, whitelist = NULL)
x |
A numerical matrix with the variables. If you have a data.frame (i.e. categorical data) turn them into a matrix. Note, that for the categorical case data, the numbers must start from 0. No missing data are allowed. |
method |
If you have continuous data, you can choose either "pearson" or "spearman". If you have categorical data though, this must be "cat". In this case, make sure the minimum value of each variable is zero. The |
alpha |
The significance level for assessing the p-values. |
robust |
Do you want outliers to be removed prior to applying the FEDHC algorithm? If yes, set this to TRUE to utilise the MCD. |
skel |
If you have the output of the skeleton phase, the output from the function |
ini.stat |
If the initial test statistics (univariate associations) are available, pass them through this parameter. |
R |
If the correlation matrix is available, pass it here. |
restart |
An integer, the number of random restarts. |
tabu |
An integer, the length of the tabu list used in the tabu function. |
score |
A character string, the label of the network score to be used in the algorithm. If none is specified, the default score is the Bayesian Information Criterion for both discrete and continuous data sets. The available score for continuous variables are: "bic-g" (default), "loglik-g", "aic-g", "bic-g" or "bge". The available score categorical variables are: "bde", "loglik" or "bic". |
blacklist |
A data frame with two columns (optionally labeled "from" and "to"), containing a set of arcs not to be included in the graph. |
whitelist |
A data frame with two columns (optionally labeled "from" and "to"), containing a set of arcs to be included in the graph. |
The FEDHC algorithm is implemented. The FBED algortihm (Borboudakis and Tsamardinos, 2019), without the backward phase, is implemented during the skeleton identification phase. Next, the Hill Climbing greedy search or the Tabu search is employed to score the network.
A list including:
ini |
A list including the output of the |
dag |
A "bn" class output. A list including the outcome of the Hill-Climbing or the Tabu search phase. See the package "bnlearn" for more details. |
scoring |
The score value. |
runtime |
The duration of the algorithm. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. (2022). The FEDHC Bayesian Network Learning Algorithm. Mathematics, 10(15): 2604.
Borboudakis G. and Tsamardinos I. (2019). Forward-backward selection with early dropping. Journal of Machine Learning Research, 20(8): 1–39.
Tsamardinos I., Brown E.L. and Aliferis F.C. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1): 31–78.
pchc, mmhc, fedhc.skel, fedhc.boot
# simulate a dataset with continuous data x <- matrix( rnorm(300 * 30, 1, 10), nrow = 300 ) a <- fedhc(x)# simulate a dataset with continuous data x <- matrix( rnorm(300 * 30, 1, 10), nrow = 300 ) a <- fedhc(x)
The MMHC and MMTABU Bayesian network learning algorithms.
mmhc(x, method = "pearson", max_k = 3, alpha = 0.05, robust = FALSE, skel = NULL, ini.stat = NULL, R = NULL, restart = 10, score = "bic-g", blacklist = NULL, whitelist = NULL) mmtabu(x, method = "pearson", max_k = 3, alpha = 0.05, robust = FALSE, skel = NULL, ini.stat = NULL, R = NULL, tabu = 10, score = "bic-g", blacklist = NULL, whitelist = NULL)mmhc(x, method = "pearson", max_k = 3, alpha = 0.05, robust = FALSE, skel = NULL, ini.stat = NULL, R = NULL, restart = 10, score = "bic-g", blacklist = NULL, whitelist = NULL) mmtabu(x, method = "pearson", max_k = 3, alpha = 0.05, robust = FALSE, skel = NULL, ini.stat = NULL, R = NULL, tabu = 10, score = "bic-g", blacklist = NULL, whitelist = NULL)
x |
A numerical matrix with the variables. If you have a data.frame (i.e. categorical data) turn them into a matrix. Note, that for the categorical case data, the numbers must start from 0. No missing data are allowed. |
method |
If you have continuous data, this "pearson". If you have categorical data though, this must be "cat". In this case, make sure the minimum value of each variable is zero. The function "g2Test" in the R package Rfast and the relevant functions work that way. |
max_k |
The maximum conditioning set to use in the conditional indepedence test (see Details). Integer, default value is 3 |
alpha |
The significance level for assessing the p-values. |
robust |
Do you want outliers to be removed prior to applying the MMHC algorithm? If yes, set this to TRUE to utilise the MCD. |
skel |
If you have the output of the skeleton phase, the output from the function |
ini.stat |
If the initial test statistics (univariate associations) are available, pass them through this parameter. |
R |
If the correlation matrix is available, pass it here. |
restart |
An integer, the number of random restarts. |
tabu |
An integer, the length of the tabu list used in the tabu function. |
score |
A character string, the label of the network score to be used in the algorithm. If none is specified, the default score is the Bayesian Information Criterion for both discrete and continuous data sets. The available score for continuous variables are: "bic-g" (default), "loglik-g", "aic-g", "bic-g" or "bge". The available score categorical variables are: "bde", "loglik" or "bic". |
blacklist |
A data frame with two columns (optionally labeled "from" and "to"), containing a set of arcs not to be included in the graph. |
whitelist |
A data frame with two columns (optionally labeled "from" and "to"), containing a set of arcs to be included in the graph. |
The MMHC algorithm is implemented without performing the backward elimination during the skeleton identification phase. The MMHC as described in Tsamardinos et al. (2006) employs the MMPC algorithm during the skeleton construction phase and the Tabu search in the scoring phase. In this package, the mmhc function employs the Hill Climbing greedy search in the scoring phase while the mmtabu employs the Tabu search.
A list including:
ini |
A list including the output of the |
dag |
A "bn" class output. A list including the outcome of the Hill-Climbing or the Tabu search phase. See the package "bnlearn" for more details. |
scoring |
The score value. |
runtime |
The duration of the algorithm. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsamardinos I., Brown E.L. and Aliferis F.C. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1): 31–78.
Tsagris M. (2021). A new scalable Bayesian network learning algorithm with applications to economics. Computational Economics, 57(1): 341–367.
fedhc, pchc, mmhc.skel, mmhc.boot
# simulate a dataset with continuous data x <- matrix( rnorm(300 * 30, 1, 10), nrow = 300 ) a <- mmhc(x)# simulate a dataset with continuous data x <- matrix( rnorm(300 * 30, 1, 10), nrow = 300 ) a <- mmhc(x)
The PCHC and PCTABU Bayesian network learning algorithms.
pchc(x, method = "pearson", alpha = 0.05, robust = FALSE, skel = NULL, ini.stat = NULL, R = NULL, restart = 10, score = "bic-g", blacklist = NULL, whitelist = NULL) pctabu(x, method = "pearson", alpha = 0.05, robust = FALSE, skel = NULL, ini.stat = NULL, R = NULL, tabu = 10, score = "bic-g", blacklist = NULL, whitelist = NULL)pchc(x, method = "pearson", alpha = 0.05, robust = FALSE, skel = NULL, ini.stat = NULL, R = NULL, restart = 10, score = "bic-g", blacklist = NULL, whitelist = NULL) pctabu(x, method = "pearson", alpha = 0.05, robust = FALSE, skel = NULL, ini.stat = NULL, R = NULL, tabu = 10, score = "bic-g", blacklist = NULL, whitelist = NULL)
x |
A numerical matrix with the variables. If you have a data.frame (i.e. categorical data) turn them into a matrix. Note, that for the categorical case data, the numbers must start from 0. No missing data are allowed. |
method |
If you have continuous data, you can choose either "pearson" or "spearman". If you have categorical data though,
this must be "cat". In this case, make sure the minimum value of each variable is zero. The |
alpha |
The significance level for assessing the p-values. |
robust |
Do you want outliers to be removed prior to applying the PCHC algorithm? If yes, set this to TRUE to utilise the MCD. |
skel |
If you have the output of the skeleton phase, the output from the function |
ini.stat |
If the initial test statistics (univariate associations) are available, pass them through this parameter. |
R |
If the correlation matrix is available, pass it here. |
restart |
An integer, the number of random restarts. |
tabu |
An integer, the length of the tabu list used in the tabu function. |
score |
A character string, the label of the network score to be used in the algorithm. If none is specified, the default score is the Bayesian Information Criterion for both discrete and continuous data sets. The available score for continuous variables are: "bic-g" (default), "loglik-g", "aic-g", "bic-g" or "bge". The available score categorical variables are: "bde", "loglik" or "bic". |
blacklist |
A data frame with two columns (optionally labeled "from" and "to"), containing a set of arcs not to be included in the graph. |
whitelist |
A data frame with two columns (optionally labeled "from" and "to"), containing a set of arcs to be included in the graph. |
The PC algorithm as proposed by Spirtes et al. (2001) is first implemented followed by a scoring phase, such as hill climbing or tabu search. The PCHC was proposed by Tsagris (2021), while the PCTABU algorithm is the same but instead of the hill climbing scoring phase, the tabu search is employed.
A list including:
ini |
A list including the output of the |
dag |
A "bn" class output. A list including the outcome of the Hill-Climbing or the Tabu search phase. See the package "bnlearn" for more details. |
scoring |
The score value. |
runtime |
The duration of the algorithm. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. (2021). A new scalable Bayesian network learning algorithm with applications to economics. Computational Economics, 57(1): 341–367.
Spirtes P., Glymour C. and Scheines R. (2001). Causation, Prediction, and Search. The MIT Press, Cambridge, MA, USA, 3nd edition.
Tsamardinos I. and Borboudakis G. (2010) Permutation Testing Improves Bayesian Network Learning. In Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2010, 322–337.
Tsamardinos I., Brown E.L. and Aliferis F.C. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1): 31–78.
fedhc, mmhc, pchc.skel, pchc.boot
# simulate a dataset with continuous data x <- matrix( rnorm(300 * 30, 1, 10), nrow = 300 ) a <- pchc(x)# simulate a dataset with continuous data x <- matrix( rnorm(300 * 30, 1, 10), nrow = 300 ) a <- pchc(x)
Topological sort of a Bayesian network.
topological_sort(dag)topological_sort(dag)
dag |
A square matrix representing a directed graph which contains 0s and 1s. If G[i, j] = 1 it means there is an arrow from node i to node j. When there is no edge between nodes i and j if G[i, j] = 0. |
The function is an R translation from an old matlab code.
A vector with numbers indicating the sorting. If the matrix does not correspond to a Bayesian network (or a DAG), NA will be returned.
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Chickering D.M. (1995). A transformational characterization of equivalent Bayesian network structures. Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, Montreal, Canada, 87–98.
y <- pchc::rbn3(100, 20, 0.2) Gtrue <- y$G a <- pchc::pchc(y$x) G <- bnmat(a$dag) topological_sort(G)y <- pchc::rbn3(100, 20, 0.2) Gtrue <- y$G a <- pchc::pchc(y$x) G <- bnmat(a$dag) topological_sort(G)
Utilities for the skeleton of a (Bayesian) Network
bn.skel.utils(bnskel.obj, G = NULL, roc = TRUE, alpha = 0.05) bn.skel.utils2(bnskel.obj, G = NULL, roc = TRUE, alpha = 0.05)bn.skel.utils(bnskel.obj, G = NULL, roc = TRUE, alpha = 0.05) bn.skel.utils2(bnskel.obj, G = NULL, roc = TRUE, alpha = 0.05)
bnskel.obj |
An object as retured by pc.skel, glmm.pc.skel or mmhc.skel. |
G |
The true adjacency matrix with 1 indicating an edge and zero its absence. Symmetric or not is not important. If this is not available, leave it NULL. |
roc |
Do you want a graph with the ROC curve be returned? Default value is TRUE. |
alpha |
The significance level ( suitable values in (0, 1) ) for assessing the p-values. Default value is 0.01. |
Given the true adjaceny matrix one can evaluate the estimated adjacency matrix, skeleton, of the PC or the MMHC algorithm.
The bn.skels.utils give you the area under the curve, false discovery rate and sorting of the edges based on their p-values.
The bn.skel.utils2 estimates the confidence of each edge. The estimated proportion of null p-values is estimated the algorithm by Storey and Tibshirani (2003).
For the "bn.skel.utils" a list including:
fdr |
The false discovery rate as estimated using the Benjamini-Hochberg correction. |
area |
This is a list with the elements of the |
sig.pvalues |
A matrix with the row and column of each significant p-value sorted in asending order. As we move down the matrix, the p-values increase and hence the strength of the associations decreases. |
For the "bn.skel.utils2" a list including:
area |
This is a list with the elements of the |
pxy |
A matrix with the row and column of the confidence of each p-value sorted in asending order. As we move down the matrix, the confidences decrease. |
lower |
The lower confidcence limit of an edge as estimated by |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsamardinos I. and Brown L.E. Bounding the False Discovery Rate in Local Bayesian Network Learning. AAAI, 2008.
Triantafillou S., Tsamardinos I. and Roumpelaki A. (2014). Learning neighborhoods of high confidence in constraint-based causal discovery. In European Workshop on Probabilistic Graphical Models, pp. 487–502.
Storey J.D. and Tibshirani R. (2003). Statistical significance for genome-wide experiments. Proceedings of the National Academy of Sciences, 100: 9440–9445.
Benjamini Y. and Hochberg Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57(1): 289–300.
Spirtes P., Glymour C. and Scheines R. (2001). Causation, Prediction, and Search. The MIT Press, Cambridge, MA, USA, 3nd edition.
conf.edge.lower, pchc.skel, rbn2
y <- pchc::rbn2(200, p = 25, nei = 3) x <- y$x G <- y$G mod <- pchc::pchc.skel(x, method = "pearson", alpha = 0.05) G <- G + t(G) bn.skel.utils(mod, G, roc = FALSE, alpha = 0.05) bn.skel.utils2(mod, G, roc = FALSE, alpha = 0.05)y <- pchc::rbn2(200, p = 25, nei = 3) x <- y$x G <- y$G mod <- pchc::pchc.skel(x, method = "pearson", alpha = 0.05) G <- G + t(G) bn.skel.utils(mod, G, roc = FALSE, alpha = 0.05) bn.skel.utils2(mod, G, roc = FALSE, alpha = 0.05)
Variable selection for continuous data using the FBED algorithm.
cor.fbed(y, x, ystand = TRUE, xstand = TRUE, alpha = 0.05, K = 0)cor.fbed(y, x, ystand = TRUE, xstand = TRUE, alpha = 0.05, K = 0)
y |
The response variable, a numeric vector. |
x |
A matrix with the data, where the rows denote the samples and the columns are the variables. |
ystand |
If this is TRUE the response variable is centered. The mean is subtracted from every value. |
xstand |
If this is TRUE the independent variables are standardised. |
alpha |
The significance level, set to 0.05 by default. |
K |
The number of times to repeat the process. The default value is 0. |
FBED stands for Forward Backward with Earcly Dropping. It is a variation of the classical forward selection, where at each step, only the statistically significant variables carry on. The rest are dropped. The process stops when no other variables can be selected. If K = 1, the process is repeated testing sequentially again all those that have not been selected. If K > 1, then this is repeated.
In the end, the backward selection is performed to remove any falsely included variables. This backward phase has not been implemented yet.
A list including:
runtime |
The duration of the process. |
res |
A matrix with the index of the selected variable, their test statistic value and the associated p-value. |
info |
A matrix with two columns. The cumulative number of variables selected and the number of tests for each value of K. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Borboudakis G. and Tsamardinos I. (2019). Forward-backward selection with early dropping. Journal of Machine Learning Research, 20(8): 1–39.
pc.sel, mmpc, cortest, correls
x <- matrix( rnorm(50 * 50), ncol = 50 ) y <- rnorm(50) a <- pchc::cor.fbed(y, x) ax <- matrix( rnorm(50 * 50), ncol = 50 ) y <- rnorm(50) a <- pchc::cor.fbed(y, x) a
Variable selection for continuous data using the MMPC algorithm.
mmpc(y, x, max_k = 3, alpha = 0.05, method = "pearson", ini = NULL, hash = FALSE, hashobject = NULL, backward = FALSE)mmpc(y, x, max_k = 3, alpha = 0.05, method = "pearson", ini = NULL, hash = FALSE, hashobject = NULL, backward = FALSE)
y |
The class variable. Provide a numeric vector. |
x |
The main dataset. Provide a numeric matrix. |
max_k |
The maximum conditioning set to use in the conditional independence test. Provide an integer. The default value set is 3. |
alpha |
Threshold for assessing p-values' significance. Provide a double value, between 0.0 and 1.0. The default value set is 0.05. |
method |
Currently only "pearson" is supported. |
ini |
This argument is used for the avoidance of the univariate associations re-calculations, in the case of them being present. Provide it in the form of a list. |
hash |
Boolean value for the activation of the statistics storage in a hash type object. The default value is false. |
hashobject |
This argument is used for the avoidance of the hash re-calculation, in the case of them being present, similarly to ini argument. Provide it in the form of a hash. Please note that the generated hash object should be used only when the same dataset is re-analyzed, possibly with different values of max_k and alpha. |
backward |
Boolean value for the activation of the backward/symmetry correction phase. This option removes and falsely included variables in the parents and children set of the target variable. The backward option seems dubious. Please do not use at the moment. |
The MMPC function implements the MMPC algorithm as presented in Tsamardinos, Brown and Aliferis (2006).
The output of the algorithm is an list including:
selected |
The order of the selected variables according to the increasing pvalues. |
hashobject |
The hash object containing the statistics calculated in the current run. |
pvalues |
For each feature included in the dataset, this vector reports the strength of its association with the target in the context of all other variables. Particularly, this vector reports the max p-values found when the association of each variable with the target is tested against different conditional sets. Lower values indicate higher association. |
stats |
The statistics corresponding to the aforementioned pvalues (higher values indicate higher association). |
univ |
This is a list with the univariate associations; the test statistics and their corresponding logged p-values. |
max_k |
The max_k value used in the current execution. |
alpha |
The alpha value used in the current execution. |
n.tests |
If hash = TRUE, the number of tests performed will be returned. If hash != TRUE, the number of univariate associations will be returned. |
runtime |
The time (in seconds) that was needed for the execution of algorithm. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Tsagris M. and Tsamardinos I. (2019). Feature selection with the R package MXM. F1000Research, 7: 1505
Feature Selection with the R Package MXM: Discovering Statistically Equivalent Feature Subsets, Lagani V. and Athineou G. and Farcomeni A. and Tsagris M. and Tsamardinos I. (2017). Journal of Statistical Software, 80(7).
Tsamardinos I., Aliferis C. F. and Statnikov A. (2003). Time and sample efficient discovery of Markov blankets and direct causal relations. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 673–678). ACM.
Brown L. E., Tsamardinos, I. and Aliferis C. F. (2004). A novel algorithm for scalable and accurate Bayesian network learning. Medinfo, 711–715.
Tsamardinos, Brown and Aliferis (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine learning, 65(1): 31–78.
x <- matrix( rnorm(50 * 50), ncol = 50 ) y <- rnorm(50) a <- pchc::mmpc(y, x)x <- matrix( rnorm(50 * 50), ncol = 50 ) y <- rnorm(50) a <- pchc::mmpc(y, x)
Variable selection for continuous data using the PC-simple algorithm.
pc.sel(y, x, ystand = TRUE, xstand = TRUE, alpha = 0.05)pc.sel(y, x, ystand = TRUE, xstand = TRUE, alpha = 0.05)
y |
A numerical vector with continuous data. |
x |
A matrix with numerical data; the independent variables, of which some will probably be selected. |
ystand |
If this is TRUE the response variable is centered. The mean is subtracted from every value. |
xstand |
If this is TRUE the independent variables are standardised. |
alpha |
The significance level. |
Variable selection for continuous data only is performed using the PC-simple algorithm (Buhlmann, Kalisch and Maathuis, 2010). The PC algorithm used to infer the skeleton of a Bayesian Network has been adopted in the context of variable selection. In other words, the PC algorithm is used for a single node.
A list including:
vars |
A vector with the selected variables. |
n.tests |
The number of tests performed. |
runtime |
The runtime of the algorithm. |
Michail Tsagris.
R implementation and documentation: Michail Tsagris [email protected].
Buhlmann P., Kalisch M. and Maathuis M. H. (2010). Variable selection in high-dimensional linear models: partially faithful distributions and the PC-simple algorithm. Biometrika, 97(2): 261–278.
x <- matrix( rnorm(50 * 50), ncol = 50 ) y <- rnorm(50) a <- pchc::pc.sel(y, x)x <- matrix( rnorm(50 * 50), ncol = 50 ) y <- rnorm(50) a <- pchc::pc.sel(y, x)