Title: | Co-Clustering Package for Binary, Categorical, Contingency and Continuous Data-Sets |
---|---|
Description: | Simultaneous clustering of rows and columns, usually referred to as biclustering, co-clustering or block clustering, is an important technique in two-way data analysis. It consists of estimating a mixture model which takes into account the block clustering problem on both the individual and variable sets. The 'blockcluster' package provides a bridge between the C++ core library, built on top of the 'STK++' library, and the R statistical computing environment. This package allows co-clustering of binary <doi:10.1016/j.csda.2007.09.007>, contingency <doi:10.1080/03610920903140197>, continuous <doi:10.1007/s11634-013-0161-3> and categorical data-sets <doi:10.1007/s11222-014-9472-2>. It also provides utility functions to visualize the results. This package may be useful for various applications in the fields of data mining, information retrieval, biology, computer vision and many more. More information about the project and a comprehensive tutorial can be found at the link mentioned in URL. |
Authors: | Serge Iovleff [aut, cre], Parmeet Singh Bhatia [aut], Josselin Demont [ctb], Vincent Brault [ctb], Vincent Kubicki [ctb], Gerard Goavert [ctb], Christophe Biernacki [ctb], Gilles Celeux [ctb] |
Maintainer: | Serge Iovleff <[email protected]> |
License: | GPL (>= 3) |
Version: | 4.5.5 |
Built: | 2024-11-20 07:00:06 UTC |
Source: | CRAN |
This method overloads the square brackets to extract values of various slots of the output from coclusterBinary, coclusterCategorical, coclusterContingency and coclusterContinuous.
## S4 method for signature 'strategy'
x[i, j, drop]

## S4 method for signature 'BinaryOptions'
x[i, j, drop]

## S4 method for signature 'ContingencyOptions'
x[i, j, drop]

## S4 method for signature 'ContinuousOptions'
x[i, j, drop]

## S4 method for signature 'CategoricalOptions'
x[i, j, drop]
x | Object from which to extract element(s) or in which to replace element(s). |
i | The name of the element we want to extract or replace. |
j | If the element designated by i is complex, j specifies which of its elements to extract or replace. |
drop | Not used. |
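As a small illustration of the accessor, the sketch below reads two slots from a strategy object. It assumes that the slot names of the strategy class match the argument names of coclusterStrategy ("nbxem", "algo"), and it only runs when the blockcluster package is installed. The same bracket syntax should apply to fitted objects, using the slot names listed in the class documentation.

```r
# Assumption: slot names of the 'strategy' class match the argument names of
# coclusterStrategy(); the guard skips the example if blockcluster is missing.
has_pkg <- requireNamespace("blockcluster", quietly = TRUE)
nbxem_value <- NULL
if (has_pkg) {
  library(blockcluster)
  strategy <- coclusterStrategy(nbxem = 3)
  # Square brackets with a slot name return the value stored in that slot
  nbxem_value <- strategy["nbxem"]
  print(nbxem_value)
  print(strategy["algo"])
}
```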
It is a binary data-set simulated using a Bernoulli distribution. It consists of two clusters in rows and three clusters in columns.
A data matrix with 1000 rows and 100 columns.
data(binarydata)
This class contains all the input options as well as the estimated parameters for Binary data-sets. It inherits from the base class CommonOptions. The class contains the following output parameters, given in 'Details', along with the parameters of the base class.
The mean value of each co-cluster.
The dispersion of each co-cluster.
Integrated complete likelihood
This package performs Co-clustering of binary, contingency, continuous and categorical data-sets.
This package performs Co-clustering of binary, contingency, continuous and categorical data-sets, with utility functions to visualize the Co-clustered data. The package contains a set of functions, coclusterBinary, coclusterCategorical, coclusterContingency and coclusterContinuous, which perform Co-clustering on various kinds of data-sets and return an object of the appropriate class (refer to the documentation of these functions). The package also contains the function coclusterStrategy (see the documentation of this function for its various slots), which returns an object of class strategy. This object can be given as input to the co-clustering functions to control various Co-clustering parameters. Please refer to the testmodels.R file, which is included in the "test" directory, to see examples with various models and simulated data-sets.
The package also provides utility functions like summary() and plot() to summarize results and to plot the original and Co-clustered data respectively.
## Simple example with simulated binary data
## load data
data(binarydata)
## usage of coclusterBinary function in its simplest form
out <- coclusterBinary(binarydata, nbcocluster = c(2, 3))
## Summarize the output results
summary(out)
## Plot the original and Co-clustered data
plot(out)
It is a categorical data-set simulated using a Categorical distribution with 5 modalities. It consists of three clusters in rows and two clusters in columns.
A data matrix with 1000 rows and 100 columns.
data(categoricaldata)
This class contains all the input options as well as the estimated parameters for categorical data-sets. It inherits from the base class CommonOptions. The class contains the following output parameters, given in 'Details', along with the parameters of the base class.
The categorical distribution of each co-cluster
Integrated complete likelihood
This function performs Co-Clustering (simultaneous clustering of rows and columns) for Binary, Categorical, Contingency and Continuous data-sets using latent block models. It can also be used to perform semi-supervised co-clustering.
cocluster(
  data,
  datatype,
  semisupervised = FALSE,
  rowlabels = integer(0),
  collabels = integer(0),
  model = NULL,
  nbcocluster,
  strategy = coclusterStrategy(),
  nbCore = 1
)
data | Input data as a matrix (or a list containing the data matrix, a numeric vector of row effects and a numeric vector of column effects, in the case of contingency data with known row and column effects). |
datatype | Type of data: "binary", "contingency", "continuous" or "categorical". |
semisupervised | Boolean value specifying whether to perform semi-supervised co-clustering. Make sure to provide row and/or column labels if the value is TRUE. The default value is FALSE. |
rowlabels | Integer vector specifying the class of rows. Class numbers start from zero. Provide -1 for an unknown row class. |
collabels | Integer vector specifying the class of columns. Class numbers start from zero. Provide -1 for an unknown column class. |
model | Name of the model. The available models depend on the type of data. |
nbcocluster | Integer vector specifying the number of row and column clusters respectively. |
strategy | Object of class strategy, as returned by coclusterStrategy. |
nbCore | Number of threads to use (OpenMP must be available), 0 for all cores. Default value is 1. |
Returns an object of class BinaryOptions, CategoricalOptions, ContingencyOptions or ContinuousOptions, depending on whether the data type is binary, categorical, contingency or continuous respectively.
# Simple example with simulated binary data
# load data
data(binarydata)
# usage of cocluster function in its simplest form
out <- cocluster(binarydata, datatype = "binary", nbcocluster = c(2, 3))
# Summarize the output results
summary(out)
# Plot the original and Co-clustered data
plot(out)
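To show how the generic entry point dispatches on the data type, here is a minimal sketch that reuses the same call shape with the simulated Gaussian data-set and datatype = "continuous". It only runs when blockcluster is installed; the variable names out_class and has_pkg are illustrative.

```r
# Guarded so the snippet is harmless when blockcluster is not installed
has_pkg <- requireNamespace("blockcluster", quietly = TRUE)
out_class <- NULL
if (has_pkg) {
  library(blockcluster)
  data(gaussiandata)
  # Same call shape as the binary example; only 'datatype' changes
  out <- cocluster(gaussiandata, datatype = "continuous", nbcocluster = c(2, 3))
  # The class of the result depends on the data type
  out_class <- class(out)
  print(out_class)
  summary(out)
}
```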
This function performs Co-Clustering (simultaneous clustering of rows and columns) for Binary data-sets using latent block models. It can also be used to perform semi-supervised co-clustering.
coclusterBinary(
  data,
  semisupervised = FALSE,
  rowlabels = integer(0),
  collabels = integer(0),
  model = NULL,
  nbcocluster,
  strategy = coclusterStrategy(),
  a = 1,
  b = 1,
  nbCore = 1
)
data | Input data as a matrix (or a list containing the data matrix). |
semisupervised | Boolean value specifying whether to perform semi-supervised co-clustering. Make sure to provide row and/or column labels if the value is TRUE. The default value is FALSE. |
rowlabels | Integer vector specifying the class of rows. Class numbers start from zero. Provide -1 for an unknown row class. |
collabels | Integer vector specifying the class of columns. Class numbers start from zero. Provide -1 for an unknown column class. |
model | Name of the model to use for Binary data. |
nbcocluster | Integer vector specifying the number of row and column clusters respectively. |
strategy | Object of class strategy, as returned by coclusterStrategy. |
a | First hyper-parameter in case of Bayesian settings. Default is 1 (no prior). |
b | Second hyper-parameter in case of Bayesian settings. Default is 1 (no prior). |
nbCore | Number of threads to use (OpenMP must be available), 0 for all cores. Default value is 1. |
Returns an object of class BinaryOptions.
## Simple example with simulated binary data
## load data
data(binarydata)
## usage of coclusterBinary function in its simplest form
out <- coclusterBinary(binarydata, nbcocluster = c(2, 3))
## Summarize the output results
summary(out)
## Plot the original and Co-clustered data
plot(out)
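The label encoding for semi-supervised co-clustering (classes numbered from zero, -1 for unknown) can be sketched as follows. The helper make_labels is hypothetical and not part of blockcluster, and the labels assigned to the first four rows are made up purely to show the encoding; in practice they would come from trusted prior knowledge. The fit itself only runs when blockcluster is installed.

```r
# Hypothetical helper (not part of blockcluster): builds a label vector where
# -1 marks an unknown class and known classes are numbered from zero.
make_labels <- function(n, known_index = integer(0), known_class = integer(0)) {
  stopifnot(length(known_index) == length(known_class))
  labels <- rep(-1L, n)
  labels[known_index] <- as.integer(known_class)
  labels
}

# Made-up labels for illustration: rows 1-2 in class 0, rows 3-4 in class 1
rowlabels <- make_labels(1000, known_index = 1:4, known_class = c(0, 0, 1, 1))
collabels <- make_labels(100)  # no column labels known

if (requireNamespace("blockcluster", quietly = TRUE)) {
  library(blockcluster)
  data(binarydata)
  out <- coclusterBinary(binarydata, semisupervised = TRUE,
                         rowlabels = rowlabels, collabels = collabels,
                         nbcocluster = c(2, 3))
  summary(out)
}
```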
This function performs Co-Clustering (simultaneous clustering of rows and columns) for Categorical data-sets using latent block models. It can also be used to perform semi-supervised co-clustering.
coclusterCategorical(
  data,
  semisupervised = FALSE,
  rowlabels = integer(0),
  collabels = integer(0),
  model = NULL,
  nbcocluster,
  strategy = coclusterStrategy(),
  a = 1,
  b = 1,
  nbCore = 1
)
data | Input data as a matrix (or a list containing the data matrix). |
semisupervised | Boolean value specifying whether to perform semi-supervised co-clustering. Make sure to provide row and/or column labels if the value is TRUE. The default value is FALSE. |
rowlabels | Integer vector specifying the class of rows. Class numbers start from zero. Provide -1 for an unknown row class. |
collabels | Integer vector specifying the class of columns. Class numbers start from zero. Provide -1 for an unknown column class. |
model | Name of the model to use for categorical data. |
nbcocluster | Integer vector specifying the number of row and column clusters respectively. |
strategy | Object of class strategy, as returned by coclusterStrategy. |
a | First hyper-parameter in case of Bayesian settings. Default is 1 (no prior). |
b | Second hyper-parameter in case of Bayesian settings. Default is 1 (no prior). |
nbCore | Number of threads to use (OpenMP must be available), 0 for all cores. Default value is 1. |
Returns an object of class CategoricalOptions.
## Simple example with simulated categorical data
## load data
data(categoricaldata)
## usage of coclusterCategorical function in its simplest form
out <- coclusterCategorical(categoricaldata, nbcocluster = c(3, 2))
## Summarize the output results
summary(out)
## Plot the original and Co-clustered data
plot(out)
This function performs Co-Clustering (simultaneous clustering of rows and columns) for Contingency data-sets using latent block models. It can also be used to perform semi-supervised co-clustering.
coclusterContingency(
  data,
  semisupervised = FALSE,
  rowlabels = integer(0),
  collabels = integer(0),
  model = NULL,
  nbcocluster,
  strategy = coclusterStrategy(),
  nbCore = 1
)
data | Input data as a matrix (or a list containing the data matrix, a numeric vector of row effects and a numeric vector of column effects, in the case of contingency data with known row and column effects). |
semisupervised | Boolean value specifying whether to perform semi-supervised co-clustering. Make sure to provide row and/or column labels if the value is TRUE. The default value is FALSE. |
rowlabels | Integer vector specifying the class of rows. Class numbers start from zero. Provide -1 for an unknown row class. |
collabels | Integer vector specifying the class of columns. Class numbers start from zero. Provide -1 for an unknown column class. |
model | Name of the model to use for Poisson data. |
nbcocluster | Integer vector specifying the number of row and column clusters respectively. |
strategy | Object of class strategy, as returned by coclusterStrategy. |
nbCore | Number of threads to use (OpenMP must be available), 0 for all cores. Default value is 1. |
Returns an object of class ContingencyOptions.
## Simple example with simulated contingency data
## load data
data(contingencydataunknown)
## usage of coclusterContingency function with a reduced strategy
strategy <- coclusterStrategy(nbinititerations = 5, nbxem = 2,
                              nbiterations_int = 2, nbiterationsxem = 10,
                              nbiterationsXEM = 100, epsilonXEM = 1e-5)
out <- coclusterContingency(contingencydataunknown, nbcocluster = c(2, 3),
                            strategy = strategy)
## Summarize the output results
summary(out)
## Plot the original and Co-clustered data
plot(out)
This function performs Co-Clustering (simultaneous clustering of rows and columns) for continuous data-sets using latent block models. It can also be used to perform semi-supervised co-clustering.
coclusterContinuous(
  data,
  semisupervised = FALSE,
  rowlabels = integer(0),
  collabels = integer(0),
  model = NULL,
  nbcocluster,
  strategy = coclusterStrategy(),
  nbCore = 1
)
data | Input data as a matrix (or a list containing the data matrix). |
semisupervised | Boolean value specifying whether to perform semi-supervised co-clustering. Make sure to provide row and/or column labels if the value is TRUE. The default value is FALSE. |
rowlabels | Integer vector specifying the class of rows. Class numbers start from zero. Provide -1 for an unknown row class. |
collabels | Integer vector specifying the class of columns. Class numbers start from zero. Provide -1 for an unknown column class. |
model | Name of the model to use for Gaussian data. |
nbcocluster | Integer vector specifying the number of row and column clusters respectively. |
strategy | Object of class strategy, as returned by coclusterStrategy. |
nbCore | Number of threads to use (OpenMP must be available), 0 for all cores. Default value is 1. |
Returns an object of class ContinuousOptions.
# Simple example with simulated continuous data
# load data
data(gaussiandata)
# usage of coclusterContinuous function in its simplest form
out <- coclusterContinuous(gaussiandata, nbcocluster = c(2, 3))
# Summarize the output results
summary(out)
# Plot the original and Co-clustered data
plot(out)
This function is used to set all the parameters for Co-clustering. It returns an object of class strategy, which can be given as input to the coclusterBinary, coclusterCategorical, coclusterContingency and coclusterContinuous functions.
This class contains all the input parameters to run coclustering.
coclusterStrategy(
  algo = "BEM",
  initmethod = "emInitStep",
  stopcriteria = "Parameter",
  nbiterationsxem = 50,
  nbiterationsXEM = 500,
  nbinitmax = 100,
  nbinititerations = 10,
  initepsilon = 0.01,
  nbiterations_int = 5,
  epsilon_int = 0.01,
  epsilonxem = 1e-04,
  epsilonXEM = 1e-10,
  nbtry = 2,
  nbxem = 5
)
algo | The valid values for this parameter are "BEM" (default), "BCEM", "BSEM" and "BGibbs" (only for the Binary model). |
initmethod | Method to initialize model parameters. The valid values are "cemInitStep", "emInitStep" and "randomInit". |
stopcriteria | Specifies the stopping criterion. It can be based on either the relative change in parameter values (preferred for computational reasons) or the relative change in pseudo log-likelihood. Valid values are "Parameter" and "Likelihood". The default is "Parameter". |
nbiterationsxem | Number of EM iterations used during the xem step. Default value is 50. |
nbiterationsXEM | Number of EM iterations used during the XEM step. Default value is 500. |
nbinitmax | Maximal number of initializations to try. Default value is 100. |
nbinititerations | Number of global iterations used in the initialization step. Default value is 10. |
initepsilon | Tolerance value used during initialization. Default value is 1e-2. |
nbiterations_int | Number of iterations for the internal E-step. Default value is 5. |
epsilon_int | Tolerance value for the relative change in parameter/likelihood for the internal E-step. Default value is 1e-2. |
epsilonxem | Tolerance value used during the xem step. Default value is 1e-4. |
epsilonXEM | Tolerance value used during the XEM step. Default value is 1e-10. |
nbtry | Number of tries (XEM steps). Default value is 2. |
nbxem | Number of xem steps. Default value is 5. |
Algorithm to be used for co-clustering.
Stopping criteria used to stop the algorithm.
Method to initialize model parameters.
Maximal number of initializations to try (if reached, the estimation fails).
Number of global iterations while running initialization.
Tolerance value used while initialization.
Number of iterations for internal E-step.
Tolerance value for internal E-step.
Number of tries.
Number of xem iterations.
Number of EM iterations used during xem.
Number of EM iterations used during XEM.
Tolerance value used during xem.
Tolerance value used during XEM.
Object of class strategy
# Default strategy values
strategy <- coclusterStrategy()
summary(strategy)
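Beyond the defaults, a strategy can trade accuracy for speed. The following sketch builds a lighter strategy from the documented arguments and passes it to coclusterBinary. The particular values are arbitrary choices for illustration, not recommendations, and the code only runs when blockcluster is installed.

```r
# Guarded so the snippet is harmless when blockcluster is not installed
has_pkg <- requireNamespace("blockcluster", quietly = TRUE)
if (has_pkg) {
  library(blockcluster)
  data(binarydata)
  # Fewer and shorter runs than the defaults, for a quick exploratory fit:
  # 1 try x 3 short xem runs, followed by a single long XEM run
  quick <- coclusterStrategy(algo = "BCEM", stopcriteria = "Likelihood",
                             nbtry = 1, nbxem = 3,
                             nbiterationsxem = 20, nbiterationsXEM = 200)
  summary(quick)
  out <- coclusterBinary(binarydata, nbcocluster = c(2, 3), strategy = quick)
  summary(out)
}
```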
This class contains all the input options and common output options for all kinds of data-sets (Binary, Categorical, Contingency and Continuous).
Input data.
Type of data.
Boolean value specifying if Co-clustering is semi-supervised or not.
Model to be run for co-clustering.
Number of row and column clusters.
Input strategy.
Status returned.
Vector of row proportions.
Vector of column proportions.
Vector of assigned row cluster to each row.
Vector of assigned column cluster to each column.
Final pseudo log-likelihood.
Final posterior probabilities for rows.
Final posterior probabilities for columns.
It is a contingency data-set simulated using a Poisson distribution. The row and column effects are known for this data-set. It consists of two clusters in rows and three clusters in columns.
A data list consisting of following data:
A data matrix consisting of 1000 rows and 100 columns.
A numeric vector of size 1000. Each value represents the row effect of the corresponding row.
A numeric vector of size 100. Each value represents the column effect of the corresponding column.
data(contingencydatalist)
It is a contingency data-set simulated using a Poisson distribution. The row and column effects are unknown for this data-set. It consists of two clusters in rows and three clusters in columns.
A data matrix with 1000 rows and 100 columns.
data(contingencydataunknown)
This class contains all the input options as well as the estimated parameters for Contingency data-sets. It inherits from the base class CommonOptions. The class contains the following output parameters, given in 'Details', along with the parameters of the base class.
The value of the Poisson parameter (gamma) for each co-cluster.
Rows effect (if known).
Columns effect (if known).
This class contains all the input options as well as the estimated parameters for Continuous data-sets. It inherits from the base class CommonOptions. The class contains the following output parameters, given in 'Details', along with the parameters of the base class.
The mean value of each co-cluster.
The variance of each co-cluster.
It is a continuous data-set simulated using a Gaussian distribution. It consists of two clusters in rows and three clusters in columns.
A data matrix with 1000 rows and 100 columns.
data(gaussiandata)
This function plots the original and Co-clustered data-sets.
## S4 method for signature 'BinaryOptions'
plot(x, y, ...)

## S4 method for signature 'ContingencyOptions'
plot(x, y, ...)

## S4 method for signature 'ContinuousOptions'
plot(x, y, ...)

## S4 method for signature 'CategoricalOptions'
plot(x, y, ...)
x | Output object from coclusterBinary, coclusterCategorical, coclusterContingency or coclusterContinuous. |
y | Ignored. |
... | Additional argument(s). Two additional arguments are currently supported. "asp": if set to TRUE, the original aspect ratio is conserved; by default "asp" is FALSE. "type": the type of plot, either "cocluster" or "distribution". The corresponding plots are the Co-clustered data, and the distributions and mixture densities for the Co-clusters, respectively. The default is the "cocluster" plot. |
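A brief sketch of the two additional plot arguments described above, assuming a fitted object from coclusterBinary. It only runs when blockcluster is installed, and in a non-interactive session the plots go to the default graphics device.

```r
# Guarded so the snippet is harmless when blockcluster is not installed
has_pkg <- requireNamespace("blockcluster", quietly = TRUE)
if (has_pkg) {
  library(blockcluster)
  data(binarydata)
  out <- coclusterBinary(binarydata, nbcocluster = c(2, 3))
  plot(out)                         # default: type = "cocluster"
  plot(out, asp = TRUE)             # conserve the original aspect ratio
  plot(out, type = "distribution")  # distributions and mixture densities
}
```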
This function gives the summary of output from coclusterBinary, coclusterCategorical, coclusterContingency and coclusterContinuous.
## S4 method for signature 'strategy'
summary(object, ...)

## S4 method for signature 'BinaryOptions'
summary(object, ...)

## S4 method for signature 'ContingencyOptions'
summary(object, ...)

## S4 method for signature 'ContinuousOptions'
summary(object, ...)

## S4 method for signature 'CategoricalOptions'
summary(object, ...)
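object | Output object from coclusterBinary, coclusterCategorical, coclusterContingency, coclusterContinuous or coclusterStrategy. |
... | Additional argument(s). Currently there are no additional arguments. |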
In Co-clustering, there can be many local optima where the algorithm may get stuck, resulting in sub-optimal results. Hence we apply a strategy called the XEM strategy to run the EM algorithm. The various steps are defined as follows:
1. Do several runs of "initialization followed by a short run of the algorithm (few iterations/high tolerance)". The number of runs is set by the parameter "nbxem" of the coclusterStrategy function. Default value is 5. We call this step the xem step.
2. Select the best result of step 1 and make a long run of the algorithm (many iterations/low tolerance). We call this step the XEM step.
3. Repeat steps 1 and 2 several times and select the best result. The number of repetitions can be modified via the parameter "nbtry" of the coclusterStrategy function. Default value is 2.
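The three steps can be sketched in plain R on a toy problem. This is not the package's implementation: the objective function and the hill-climbing routine climb are hypothetical stand-ins for the likelihood and the EM algorithm, and only the control flow (short runs, one long run, repeated tries) mirrors the description above. Argument names are borrowed from coclusterStrategy for readability.

```r
# Toy objective with several local maxima (hypothetical, for illustration only)
objective <- function(x) sin(3 * x) + 0.5 * cos(5 * x) - 0.05 * x^2

# Simple hill climbing standing in for an EM run: stops after max_iter steps
# or when the relative improvement falls below tol.
climb <- function(x, max_iter, tol, step = 0.01) {
  value <- objective(x)
  for (iter in seq_len(max_iter)) {
    grad <- (objective(x + 1e-6) - objective(x - 1e-6)) / 2e-6
    x_new <- x + step * grad
    value_new <- objective(x_new)
    converged <- abs(value_new - value) <= tol * max(abs(value), 1e-12)
    x <- x_new
    value <- value_new
    if (converged) break
  }
  list(x = x, value = value)
}

xem_strategy <- function(nbtry = 2, nbxem = 5,
                         nbiterationsxem = 50, epsilonxem = 1e-4,
                         nbiterationsXEM = 500, epsilonXEM = 1e-10) {
  best <- NULL
  for (attempt in seq_len(nbtry)) {
    # Step 1 (xem): several random initializations, each with a short run
    short_runs <- lapply(seq_len(nbxem), function(i) {
      climb(runif(1, -5, 5), nbiterationsxem, epsilonxem)
    })
    values <- vapply(short_runs, function(r) r$value, numeric(1))
    start <- short_runs[[which.max(values)]]
    # Step 2 (XEM): long run started from the best short run
    long_run <- climb(start$x, nbiterationsXEM, epsilonXEM)
    # Step 3: keep the best long run across all tries
    if (is.null(best) || long_run$value > best$value) best <- long_run
  }
  best
}

set.seed(1)
result <- xem_strategy()
print(result)
```

With the default values, each call performs nbtry * nbxem = 10 short runs and nbtry = 2 long runs, which is where most of the tuning trade-off between speed and robustness lies.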