Title: | Chordalysis R Package |
---|---|
Description: | Learning the structure of graphical models from datasets with thousands of variables. More information about the research papers detailing the theory behind Chordalysis is available at <http://www.francois-petitjean.com/Research> (KDD 2016, SDM 2015, ICDM 2014, ICDM 2013). The R package development site is <https://github.com/HerrmannM/Monash-ChoR>. |
Authors: | François Petitjean [aut], Matthieu Herrmann [aut, com, cre], Christoph Bergmeir [ctb] |
Maintainer: | Matthieu Herrmann <[email protected]> |
License: | GPL-3 |
Version: | 0.0-4 |
Built: | 2024-11-20 06:29:14 UTC |
Source: | CRAN |
The Chordalysis algorithm learns the structure of graphical models from datasets with thousands of variables. More information about the research papers detailing the theory behind Chordalysis is available at http://www.francois-petitjean.com/Research
If you have problems using ChoR, find a bug, or have suggestions, please contact the package maintainer by email. Do not write to the general R lists or contact the authors of the original chordalysis software.
If you use the package, please cite the references in your publications.
Chordalysis learns the structure of graphical models from datasets with thousands of variables. There are 3 versions of the algorithm: SMT, Budget and MML. SMT, standing for Subfamilywise Multiple Testing, is generally the method of choice. It supersedes Budget and is always superior to it; the demonstration is in our KDD'16 paper (see CITATION). Both SMT and Budget are based on statistical testing, while MML uses information theory to decide upon a model. The objectives of the techniques differ slightly: SMT controls the familywise error rate (FWER), while MML is a probabilistic method. Our experiments (again in KDD'16) indicate that SMT is superior to MML for most datasets.
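To make the notion of familywise error rate control concrete, here is a minimal sketch using base R's `p.adjust` with the Bonferroni correction. This is not the SMT procedure itself, only a generic illustration of FWER control; the p-values are made up for the example.

```r
# Hypothetical p-values from three independent tests (illustrative only).
p_values <- c(0.001, 0.02, 0.04)
alpha <- 0.05

# Without correction, every test looks significant at the 0.05 level.
naive <- p_values <= alpha

# Bonferroni multiplies each p-value by the number of tests (capped at 1),
# which bounds the probability of at least one false positive by alpha.
adjusted <- p.adjust(p_values, method = "bonferroni")
controlled <- adjusted <= alpha

print(data.frame(p_values, adjusted, naive, controlled))
```

With the correction applied, only the first test remains significant at the chosen threshold.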
See citation("ChoR")
# Warning: rJava needs to **copy** your data from R into a JVM.
# If you need extra memory, set this option (here, for 4 GB) **before** loading ChoR.
# Note: not needed in our case, kept for the example.
options( java.parameters = "-Xmx4g" )
library(ChoR)

# Helper function for graph printing. Requires Rgraphviz:
# install.packages("BiocManager")
# BiocManager::install("Rgraphviz")
printGraph = function(x){
  if(requireNamespace("Rgraphviz", quietly=TRUE)){
    attrs <- list(node=list(shape="ellipse", fixedsize=FALSE, fontsize=25))
    Rgraphviz::plot(x, attrs=attrs)
  } else {
    stop("Rgraphviz required for graph printing.")
  }
}

###### MUSHROOM #####
# We are using a partial UCI mushroom data set (the example should not be too long)
MR.url = system.file("extdata", "mushrooms.csv", package = "ChoR", mustWork = TRUE)
MR.data = read.csv(
  MR.url,
  header = TRUE,                # Here, we have a header
  na.strings = c("NA","?",""),  # Configure the missing values
  stringsAsFactors = FALSE,     # Keep strings for now
  check.names = TRUE            # Replace some special characters
)

# This file has a special line with types. You can check this with MR.data[1,].
# Let's remove it:
MR.data = MR.data[-1, ]

# Launch the SMT analysis, with:
# ## default pValueThreshold=0.05
# ## computation of attributes cardinality from the data
MR.res = ChoR.SMT(MR.data)

# Access the result:
# ## As a list of cliques:
NR.cl = ChoR.as.cliques(MR.res)
print(NR.cl)

# ## As a formula
NR.fo = ChoR.as.formula(MR.res)
print(NR.fo)

# ## As a graph
if(requireNamespace("graph", quietly=TRUE)){
  NR.gr = ChoR.as.graph(MR.res)
  printGraph(NR.gr)
} else {
  print("'graph' package not installed; Skipping 'as graph' example.")
}

###### Titanic #####
# We are using the Titanic data set
MR.url = system.file("extdata", "titanic.dat.txt", package = "ChoR", mustWork = TRUE)
T.data = read.csv(
  MR.url,
  sep = "",                 # White spaces
  header = FALSE,
  stringsAsFactors = FALSE
)

# Give meaningful names
colnames(T.data) = c( "Class", "Age", "Sex", "Survived" )

# Chordalysis
T.res = ChoR.SMT(T.data, card = c(4, 2, 2, 2))
if(requireNamespace("graph", quietly=TRUE)){
  T.gr = ChoR.as.graph(T.res)
  printGraph(T.gr)
}
Get the list of cliques associated with a Chordalysis object.
ChoR.as.cliques(x)
x |
A chordalysis object obtained by a call to ChoR. |
A list of cliques, each clique being a list of attribute names, i.e. a list of lists of names.
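The nested shape described above can be handled with plain base R. The following sketch builds a hypothetical value of that shape by hand (the attribute names are invented, and no ChoR call is made) and shows a few common ways to inspect it.

```r
# Hypothetical result, shaped like the documented return value:
# a list of cliques, each clique being a list of attribute names.
cliques <- list(
  list("Class", "Sex", "Survived"),
  list("Class", "Age")
)

# Number of attributes in each clique.
clique_sizes <- lengths(cliques)

# All distinct attributes that appear in at least one clique.
attributes_used <- unique(unlist(cliques))

# One readable label per clique.
clique_labels <- vapply(
  cliques,
  function(cl) paste(unlist(cl), collapse = ", "),
  character(1)
)
print(clique_labels)
```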
Extract the formula from a Chordalysis object.
ChoR.as.formula(x)
x |
A chordalysis object obtained by a call to ChoR. |
A formula representing the model.
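Once a model is available as a formula, the usual base R formula tools apply. The sketch below uses a hand-written formula as a stand-in for a returned model (the attribute names are assumptions for illustration) and extracts the variables and interaction terms it implies.

```r
# Hypothetical model formula; it stands in for a value returned by the package.
model <- as.formula("~ Class*Sex*Survived + Class*Age")

# Variables mentioned in the formula, in order of first appearance.
vars <- all.vars(model)

# Main effects and interactions implied by the formula, as expanded by terms().
term_labels <- attr(terms(model), "term.labels")

print(vars)
print(term_labels)
```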
Get an undirected graph representing the cliques from a Chordalysis object.
ChoR.as.graph(x)
x |
A chordalysis object obtained by a call to ChoR. |
The undirected graph uses the graph package from Bioconductor.
A graph
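For readers without the Bioconductor graph package, the relationship between cliques and an undirected graph can be sketched with a base R adjacency matrix. The helper `cliques_to_adjacency` is hypothetical and is not part of ChoR; it only illustrates that every pair of attributes within a clique is connected by an edge.

```r
# Hypothetical helper (not part of ChoR): build a symmetric logical
# adjacency matrix in which attributes sharing a clique are connected.
cliques_to_adjacency <- function(cliques) {
  nodes <- unique(unlist(cliques))
  adj <- matrix(FALSE, nrow = length(nodes), ncol = length(nodes),
                dimnames = list(nodes, nodes))
  for (cl in cliques) {
    members <- unlist(cl)
    adj[members, members] <- TRUE  # connect every pair inside the clique
  }
  diag(adj) <- FALSE               # no self-loops
  adj
}

adj <- cliques_to_adjacency(list(
  list("Class", "Sex", "Survived"),
  list("Class", "Age")
))
print(adj)
```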
Searches for a statistically significant decomposable model to explain a dataset using Prioritized Chordalysis.
ChoR.Budget(x, pValueThreshold = 0.05, budgetShare = 0.01, card = NULL)
x |
A dataframe with categorical data; the column names are the attribute names. |
pValueThreshold |
A double value, minimum p-value for statistical consistency (commonly 0.05) |
budgetShare |
A double value, share of the statistical budget to consume at each step (>0 and <=1; 0.01 seems like a reasonable value for most datasets) |
card |
A vector containing the cardinality of the attributes (position wise). |
Calls the Budget Chordalysis function on the dataframe x. The optional card argument can provide a vector of cardinalities for each attribute (i.e. column) of the dataframe. If absent, the cardinalities are computed from the dataframe, but may not be accurate if some possible values never show up. See papers "Scaling log-linear analysis to high-dimensional data, ICDM 2013", "Scaling log-linear analysis to datasets with thousands of variables, SDM 2015", and "A multiple test correction for streams and cascades of statistical hypothesis tests, KDD 2016" for more details.
A Chordalysis object. Use the ChoR.as.* functions to access the result.
## Not run: res = ChoR.Budget(data)
## Not run: res = ChoR.Budget(data, budgetShare=0.0)
## Not run: res = ChoR.Budget(data, 0.05, card = c(3, 5, 4, 4, 3, 2, 3, 3))
Loads the data from x, which should be a dataframe (else, a conversion to a dataframe is attempted).
ChoR.loadData(x, card = NULL)
x |
A dataframe with categorical data; the column names are the attribute names. |
card |
A vector containing the cardinality of the attributes (position wise). |
Loads the data from x, which should be a dataframe (else, a conversion to a dataframe is attempted). The data must be categorical, each column being an attribute. The optional argument card should be a vector representing the cardinality of each attribute (position wise). If it is provided, its size must be equal to the number of attributes. Otherwise, its values are computed from the data, and the cardinality of an attribute is accurate only if all its possible values appear at least once in the data.
A list of two .jarray references (one for the dimensions, one for the data) and the dataframe.
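The caveat about inferred cardinalities can be illustrated with base R alone. In this sketch the toy dataframe and the "known" cardinalities are assumptions made for the example, and counting distinct non-missing values is only a plausible stand-in for how the package infers cardinalities, not a description of its internals.

```r
# Toy categorical data; "Class" could take 4 values, but only 3 occur here.
df <- data.frame(
  Class = c("1st", "2nd", "3rd", "1st"),
  Sex   = c("male", "female", "female", "male"),
  stringsAsFactors = FALSE
)

# Cardinality inferred from the data: number of distinct non-missing values.
observed_card <- vapply(
  df,
  function(col) length(unique(col[!is.na(col)])),
  integer(1)
)

# Cardinality known from domain knowledge (an assumption for this example).
known_card <- c(Class = 4L, Sex = 2L)

# Attributes whose inferred cardinality underestimates the true one.
underestimated <- names(observed_card)[observed_card < known_card]

# A supplied cardinality vector needs one entry per attribute (column).
stopifnot(length(known_card) == ncol(df))

print(observed_card)
print(underestimated)
```

When such a mismatch exists, passing the known cardinalities through the card argument avoids relying on the inferred values.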
Searches for a statistically significant decomposable model to explain a dataset.
ChoR.MML(x, card = NULL)
x |
A dataframe with categorical data; the column names are the attribute names. |
card |
A vector containing the cardinality of the attributes (position wise). |
Calls the MML Chordalysis function on the dataframe x. The optional card argument can provide a vector of cardinalities for each attribute (i.e. column) of the dataframe. If absent, the cardinalities are computed from the dataframe, but may not be accurate if some possible values never show up. See papers "A statistically efficient and scalable method for log-linear analysis of high-dimensional data, ICDM 2014" and "Scaling log-linear analysis to datasets with thousands of variables, SDM 2015" for more details.
A Chordalysis object. Use the ChoR.as.* functions to access the result.
## Not run: res = ChoR.MML(data)
## Not run: res = ChoR.MML(data, c(3, 5, 4, 4, 3, 2, 3, 3))
Convert the result into a 'chordalysis object'.
ChoR.processResult(x, modelStr)
x |
The dataframe passed to ChoR.loadData; the column names are the attribute names. |
modelStr |
The result of a Java Chordalysis algorithm |
Processes the result of a call to the Java Chordalysis algorithm. The result is a string of the form "~0*1*2+...+3*4*5". The numbers (incremented by 1 to match R's 1-based indexing) are replaced with the corresponding column names in x, and the string is split into a list of cliques, each clique being a list of names. For example, "~ 0*1*2 + 3*4*5" gives the two cliques [[ [[0,1,2]], [[3,4,5]] ]]
A Chordalysis object. Use the ChoR.as.* functions to access the result.
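The index-to-name substitution described above can be sketched in a few lines of base R. The function `parse_model_string` is a hypothetical illustration written for this page, not the package's internal code, and the column names are invented.

```r
# Hypothetical helper (not the package's implementation): turn a model
# string with 0-based column indices into a list of cliques of names.
parse_model_string <- function(model_str, col_names) {
  body <- sub("^\\s*~", "", model_str)                  # drop the leading "~"
  clique_strs <- strsplit(body, "+", fixed = TRUE)[[1]] # one piece per clique
  lapply(clique_strs, function(cs) {
    idx <- as.integer(trimws(strsplit(cs, "*", fixed = TRUE)[[1]]))
    as.list(col_names[idx + 1L])                        # 0-based to 1-based
  })
}

cols <- c("a", "b", "c", "d", "e", "f")
cliques <- parse_model_string("~ 0*1*2 + 3*4*5", cols)
str(cliques)
```

The `+ 1L` shift is needed because the Java side numbers columns from 0 while R vectors are indexed from 1.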
Searches for a statistically significant decomposable model to explain a dataset using Prioritized Chordalysis.
ChoR.SMT(x, pValueThreshold = 0.05, card = NULL)
x |
A dataframe with categorical data; the column names are the attribute names. |
pValueThreshold |
A double value, minimum p-value for statistical consistency (commonly 0.05) |
card |
A vector containing the cardinality of the attributes (position wise). |
Calls the SMT Chordalysis function on the dataframe x. The optional card argument can provide a vector of cardinalities for each attribute (i.e. column) of the dataframe. If absent, the cardinalities are computed from the dataframe, but may not be accurate if some possible values never show up. See papers "A multiple test correction for streams and cascades of statistical hypothesis tests, KDD 2016", "Scaling log-linear analysis to high-dimensional data, ICDM 2013", and "Scaling log-linear analysis to datasets with thousands of variables, SDM 2015" for more details.
A Chordalysis object. Use the ChoR.as.* functions to access the result.
## Not run: res = ChoR.SMT(data, 0.05, c(3, 5, 4, 4, 3, 2, 3, 3))
## Not run: res = ChoR.SMT(data, card = c(3, 5, 4, 4, 3, 2, 3, 3))
Create a String representation of a model, compatible with the formula interface, e.g. "~a*b*c+...+e*f*g".
## S3 method for class 'chordalysis'
print(x, ...)
x |
A "Chordalysis" model, obtained by a call to a ChoR function. |
... |
Unused argument, here for S3 consistency |
A String representation of the model.
Create a String representation of a model, compatible with the formula interface, e.g. "~a*b*c+...+e*f*g".
toString(x)
x |
A "Chordalysis" model, obtained by a call to a ChoR function. |
A String representation of the model.
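The string form mentioned above can be rebuilt from a list of cliques with two nested `paste` calls. The helper `cliques_to_string` is a hypothetical sketch, not the package's own `toString` method, and the clique contents are made up.

```r
# Hypothetical helper (not the package's method): join the names of each
# clique with "*", then join the cliques with "+" behind a leading "~".
cliques_to_string <- function(cliques) {
  clique_strs <- vapply(
    cliques,
    function(cl) paste(unlist(cl), collapse = "*"),
    character(1)
  )
  paste0("~", paste(clique_strs, collapse = "+"))
}

model_str <- cliques_to_string(list(list("a", "b", "c"), list("e", "f", "g")))
print(model_str)

# The string is compatible with the formula interface.
model <- as.formula(model_str)
```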