Title: | Nonlinear Network, Clustering, and Variable Selection Based on DCOL |
---|---|
Description: | It includes four methods: DCOL-based K-profiles clustering, non-linear network reconstruction, non-linear hierarchical clustering, and variable selection for generalized additive model. References: Tianwei Yu (2018)<DOI: 10.1002/sam.11381>; Haodong Liu and others (2016)<DOI: 10.1371/journal.pone.0158247>; Kai Wang and others (2015)<DOI: 10.1155/2015/918954>; Tianwei Yu and others (2010)<DOI: 10.1109/TCBB.2010.73>. |
Authors: | Tianwei Yu, Haodong Liu |
Maintainer: | Tianwei Yu<[email protected]> |
License: | GPL (>= 2) |
Version: | 1.4 |
Built: | 2024-12-11 07:14:39 UTC |
Source: | CRAN |
Generating gene matrix as a example of input.
data.gen(n.genes=100, n.samples=100, n.grps=10, aver.grp.size=10, n.fun.types=6, epsilon=0.1, n.depend=0)
data.gen(n.genes=100, n.samples=100, n.grps=10, aver.grp.size=10, n.fun.types=6, epsilon=0.1, n.depend=0)
n.genes |
the number of rows of the matrix. |
n.samples |
the number of columns of the matrix. |
n.grps |
the number of hidden clusters. |
aver.grp.size |
averge number of genes in a cluster. |
n.fun.types |
number of function types to use. |
epsilon |
noise level. |
n.depend |
data generation dependence structure. can be 0, 1, 2. |
The data generation scheme is described in detail in IEEE ACM Trans. Comput. Biol. Bioinform. 10(4):1080-85.
return the data including gene and clustering.
data |
the gene matrix |
grps |
the predicted clustering |
Tianwei Yu<[email protected]>
##generating a gene matrix with 100 genes, some in 5 clusters, and 100 samples per gene. output<-data.gen(n.genes=100, n.samples=10, n.grps=5) ##get the gene matrix from the source of data. matrix<-output$data ##get the hiden clusters from the source of data. grps<-output$grp
##generating a gene matrix with 100 genes, some in 5 clusters, and 100 samples per gene. output<-data.gen(n.genes=100, n.samples=10, n.grps=5) ##get the gene matrix from the source of data. matrix<-output$data ##get the hiden clusters from the source of data. grps<-output$grp
implementation of K-Profiles Clustering
KPC(dataset, nCluster, maxIter = 100, p.max = 0.2, p.min = 0.05)
KPC(dataset, nCluster, maxIter = 100, p.max = 0.2, p.min = 0.05)
dataset |
the data matrix with genes in the row and samples in the column |
nCluster |
the number of clusters K |
maxIter |
the maximum number of iterations |
p.max |
the starting p-value cutoff to exclude noise genes |
p.min |
the final p-value cutoff to exclude noise genes |
Return a list about gene cluster and the list of value p
cluster |
gene cluster |
p.list |
a list of value p |
Tianwei Yu <[email protected]>
http://www.hindawi.com/journals/bmri/aa/918954/
## generating the data matrix & hiden clusters as a sample input<-data.gen(n.genes=40, n.grps=4) ## now input includes data matrix and hiden clusters, so get the matrix as input. input<-input$data ## set nCluster value to 4 kpc<-KPC(input,nCluster=4) ##get the hiden cluster result from "KPC" cluster<-kpc$cluster ##get the list of p p<-kpc$p.list
## generating the data matrix & hiden clusters as a sample input<-data.gen(n.genes=40, n.grps=4) ## now input includes data matrix and hiden clusters, so get the matrix as input. input<-input$data ## set nCluster value to 4 kpc<-KPC(input,nCluster=4) ##get the hiden cluster result from "KPC" cluster<-kpc$cluster ##get the list of p p<-kpc$p.list
The non-linear hierarchical clustering based on DCOL
nlhc(array, hamil.method = "nn", concorde.path = NA, use.normal.approx = FALSE, normalization = "standardize", combine.linear = TRUE, use.traditional.hclust = FALSE, method.traditional.hclust = "average")
nlhc(array, hamil.method = "nn", concorde.path = NA, use.normal.approx = FALSE, normalization = "standardize", combine.linear = TRUE, use.traditional.hclust = FALSE, method.traditional.hclust = "average")
array |
the data matrix with no missing values |
hamil.method |
the method to find the hamiltonian path. |
concorde.path |
If using the Concorde TSP Solver, the local directory of the solver |
use.normal.approx |
whether to use the normal approximation for the null hypothesis. |
normalization |
the normalization method for the array. |
combine.linear |
whether linear association should be found by correlation to combine with nonlinear association found by DCOL. |
use.traditional.hclust |
whether traditional agglomerative clustering should be used. |
method.traditional.hclust |
the method to pass on to hclust() if traditional method is chosen. |
Hamil.method: It is passed onto the function tsp of library TSP. To use linkern method, the user needs to install concord as instructed in TSP.
use.normal.approx: If TRUE, normal approximation is used for every feature, AND all covariances are assumed to be zero. If FALSE, generates permutation based null distribution - mean vector and a variance-covariance matrix.
normalization: There are three choices - "standardize" means removing the mean of each row and make the standard deviation one; "normal_score" means normal score transformation; "none" means do nothing. In that case we still assume some normalization has been done by the user such that each row has approximately mean 0 and sd 1.
combine.linear: The two pieces of information is combined at the start to initiate the distance matrix.
Returns a hclust object same as the output of hclust(). Reference: help(hclust)
merge |
an n-1 by 2 matrix. Row i of merge describes the merging of clusters at step i of the clustering. If an element j in the row is negative, then observation -j was merged at this stage. If j is positive then the merge was with the cluster formed at the (earlier) stage j of the algorithm. |
height |
a set of n-1 real values, the value of the criterion associated with the clusterig method for the particular agglomeration |
order |
a vector giving the permutation of the original observations suitable for plotting, in the sense that a cluster plot using this ordering and matrix merge will not have crossings of the branches. |
labels |
labels for each of the objects being clustered |
call |
the call which produced the result |
dist.method |
the distance that has been used to create d |
height.0 |
original calculation of merging height |
Tianwei Yu <[email protected]>
http://www.ncbi.nlm.nih.gov/pubmed/24334400
## generating the data matrix & hiden clusters as a sample input<-data.gen(n.genes=40, n.grps=4) ## now input includes data matrix and hiden clusters, so get the matrix as input. input<-input$data nlhc.data<-nlhc(input) plot(nlhc.data) ##get the merge from the input. merge<-nlhc.data$merge
## generating the data matrix & hiden clusters as a sample input<-data.gen(n.genes=40, n.grps=4) ## now input includes data matrix and hiden clusters, so get the matrix as input. input<-input$data nlhc.data<-nlhc(input) plot(nlhc.data) ##get the merge from the input. merge<-nlhc.data$merge
Non-Linear Network reconstruction method
nlnet(input, min.fdr.cutoff=0.05,max.fdr.cutoff=0.2, conn.proportion=0.007, gene.fdr.plot=FALSE, min.module.size=0, gene.community.method="multilevel", use.normal.approx=FALSE, normalization="standardize", plot.method="communitygraph")
nlnet(input, min.fdr.cutoff=0.05,max.fdr.cutoff=0.2, conn.proportion=0.007, gene.fdr.plot=FALSE, min.module.size=0, gene.community.method="multilevel", use.normal.approx=FALSE, normalization="standardize", plot.method="communitygraph")
input |
the data matrix with no missing values. |
min.fdr.cutoff |
the minimun allowable value of the local false discovery cutoff in establishing links between genes. |
max.fdr.cutoff |
the maximun allowable value of the local false discovery cutoff in establishing links between genes. |
conn.proportion |
the target proportion of connections between all pairs of genes, if allowed by the fdr cutoff limits. |
gene.fdr.plot |
whether plot a figure with estimated densities, distribution functions, and (local) false discovery rates. |
min.module.size |
the min number of genes together as a module. |
gene.community.method |
the method for community detection. |
use.normal.approx |
whether to use the normal approximation for the null hypothesis. |
normalization |
the normalization method for the array. |
plot.method |
the method for graph and community ploting. |
gene.community.method: It provides three kinds of community detection method: "mutilevel", "label.propagation" and "leading.eigenvector".
use.normal.approx: If TRUE, normal approximation is used for every feature, AND all covariances are assumed to be zero. If FALSE, generates permutation based null distribution - mean vector and a variance-covariance matrix.
normalization: There are three choices: "standardize" means removing the mean of each row and make the standard deviation one; "normal_score" means normal score transformation; "none" means do nothing. In that case we still assume some normalization has been done by the user such that each row has approximately mean 0 and sd 1.
plot.method: It provides three kinds of ploting method: "none" means ploting no graph, "communitygraph" means ploting community with graph, "graph" means ploting graph, "membership" means ploting membership of the community
it returns a graph and the community membership of the graph.
algorithm |
The algorithm name for community detection |
graph |
An igraph object including edges : Numeric vector defining the edges, the first edge points from the first element to the second, the second edge from the third to the fourth, etc. |
community |
Numeric vector, one value for each vertex, the membership vector of the community structure. |
Haodong Liu <[email protected]>
https://www.ncbi.nlm.nih.gov/pubmed/27380516
## generating the data matrix & hiden clusters as a sample input<-data.gen(n.genes=40, n.grps=4) ## now input includes data matrix and hiden clusters, so get the matrix as input. input<-input$data ##change the ploting method result<-nlnet(input,plot.method="graph") ## get the result and see it values graph<-result$graph ##a igraph object. comm<-result$community ##community of the graph ## use different community detection method #nlnet(input,gene.community.method="label.propagation") ## change the fdr pro to control connections of genes ## adjust the modularity size #nlnet(input,conn.proportion=0.005,min.module.size=10)
## generating the data matrix & hiden clusters as a sample input<-data.gen(n.genes=40, n.grps=4) ## now input includes data matrix and hiden clusters, so get the matrix as input. input<-input$data ##change the ploting method result<-nlnet(input,plot.method="graph") ## get the result and see it values graph<-result$graph ##a igraph object. comm<-result$community ##community of the graph ## use different community detection method #nlnet(input,gene.community.method="label.propagation") ## change the fdr pro to control connections of genes ## adjust the modularity size #nlnet(input,conn.proportion=0.005,min.module.size=10)
This is a nonlinear variable selection procedure for generalized additive models. It's based on DCOL, using forward stagewise selection. In addition, a cross-validation is conducted to tune the stopping alpha level and finalize the variable selection.
nvsd(X, y, fold = 10, step.size = 0.01, stop.alpha = 0.05, stop.var.count = 20, max.model.var.count = 10, roughening.method = "DCOL", do.plot = F, pred.method = "MARS")
nvsd(X, y, fold = 10, step.size = 0.01, stop.alpha = 0.05, stop.var.count = 20, max.model.var.count = 10, roughening.method = "DCOL", do.plot = F, pred.method = "MARS")
X |
The predictor matrix. Each row is a gene (predictor), each column is a sample. Notice the dimensionality is different than most other packages, where each column is a predictor. This is to conform to other functions in this package that handles gene expression type of data. |
y |
The numerical outcome vector. |
fold |
The fold of cross-validation. |
step.size |
The step size of the roughening process. |
stop.alpha |
The alpha level (significance of the current selected predictor) to stop the iterations. |
stop.var.count |
The maximum number of predictors to select in the forward stagewise selection. Once this number is reached, the iteration stops. |
max.model.var.count |
The maximum number of predictors to select. Notice this can be smaller than the stop.var.count. Stop.var.count can be set more liniently, and this parameter controls the final maximum model size. |
roughening.method |
The method for roughening. The choices are "DCOL" or "spline". |
do.plot |
Whether to plot the points change in each step. |
pred.method |
The prediction method for the cross validation variable selection. As forward stagewise procedure doesn't do prediction, a method has to be borrowed from existing packages. The choices include "MARS", "RF", and "SVM". |
Please refer to the reference for details.
A list object is returned. The components include the following.
selected.pred |
The selected predictors (row number). |
all.pred |
The selected predictors by the forward stagewise selection. The $selected.pred is a subset of this. |
Tianwei Yu<[email protected]>
https://arxiv.org/abs/1601.05285
stage.forward
X<-matrix(rnorm(2000),ncol=20) y<-sin(X[,1])+X[,2]^2+X[,3] nvsd(t(X),y,stop.alpha=0.001,step.size=0.05)
X<-matrix(rnorm(2000),ncol=20) y<-sin(X[,1])+X[,2]^2+X[,3] nvsd(t(X),y,stop.alpha=0.001,step.size=0.05)
The subroutine conducts forward stagewise regression using DCOL. Either DCOL roughening or spline roughening is conducted.
stage.forward(X, y, step.size = 0.01, stop.alpha = 0.01, stop.var.count = 20, roughening.method = "DCOL", tol = 1e-08, spline.df = 5, dcol.sel.only = FALSE, do.plot = F)
stage.forward(X, y, step.size = 0.01, stop.alpha = 0.01, stop.var.count = 20, roughening.method = "DCOL", tol = 1e-08, spline.df = 5, dcol.sel.only = FALSE, do.plot = F)
X |
The predictor matrix. Each row is a gene (predictor), each column is a sample. Notice the dimensionality is different than most other packages, where each column is a predictor. This is to conform to other functions in this package that handles gene expression type of data. |
y |
The numerical outcome vector. |
step.size |
The step size of the roughening process. |
stop.alpha |
The alpha level (significance of the current selected predictor) to stop the iterations. |
stop.var.count |
The maximum number of predictors to select. Once this number is reached, the iteration stops. |
roughening.method |
The method for roughening. The choices are "DCOL" or "spline". |
tol |
The tolerance level of sum of squared changes in the residuals. |
spline.df |
The degree of freedom for the spline. |
dcol.sel.only |
TRUE or FALSE. If FALSE, the selection of predictors will consider both linear and nonlinear association significance. |
do.plot |
Whether to plot the points change in each step. |
Please refer to the reference manuscript for details.
A list object is returned. The components include the following.
found.pred |
The selected predictors (row number). |
ssx.rec |
The magnitude of variance explained using the current predictor at each step. |
$sel.rec |
The selected predictor at each step. |
$p.rec |
The p-value of the association between the current residual and the selected predictor at each step. |
Tianwei Yu<[email protected]>
https://arxiv.org/abs/1601.05285
nvsd
X<-matrix(rnorm(2000),ncol=20) y<-sin(X[,1])+X[,2]^2+X[,3] stage.forward(t(X),y,stop.alpha=0.001,step.size=0.05)
X<-matrix(rnorm(2000),ncol=20) y<-sin(X[,1])+X[,2]^2+X[,3] stage.forward(t(X),y,stop.alpha=0.001,step.size=0.05)