Title: | Network-Adjusted Covariates for Community Detection |
---|---|
Description: | Incorporating node-level covariates into community detection has gained increasing attention in recent years. This package provides a function implementing the novel community detection algorithm Network-Adjusted Covariates for Community Detection (NAC), which is designed to detect latent community structure in graphs with node-level information, i.e., covariates. The algorithm can handle models such as the degree-corrected stochastic block model (DCSBM) with covariates. NAC specifically addresses the discrepancy between the community structure inferred from the adjacency information and the community structure inferred from the covariate information. For more detailed information, please refer to the reference paper: Yaofang Hu and Wanjie Wang (2023) <arXiv:2306.15616>. In addition to NAC, this package includes several other existing community detection algorithms that are compared to NAC in the reference paper: Spectral Clustering On Ratios-of-Eigenvectors (SCORE), network-based regularized spectral clustering (Net-based), covariate-based spectral clustering (Cov-based), covariate-assisted spectral clustering (CAclustering), and semidefinite programming (SDP). |
Authors: | Yaofang Hu [aut, cre], Wanjie Wang [aut] |
Maintainer: | Yaofang Hu <[email protected]> |
License: | GPL-2 |
Version: | 0.1.0 |
Built: | 2024-11-29 08:44:53 UTC |
Source: | CRAN |
CAclustering clusters graph nodes by applying spectral clustering with assistance from node covariates.
CAclustering(Adj, Covariate, K, alphan = 5, itermax = 100, startn = 10)
Adj |
An n-by-n symmetric adjacency matrix with diagonal entries 0 and positive entries 1. |
Covariate |
An n-by-p covariate matrix, whose rows correspond to nodes and whose columns correspond to covariates. |
K |
A positive integer which is no larger than n. This is the predefined number of communities. |
alphan |
The number of candidate values of the tuning parameter alpha. The default value is 5. |
itermax |
A k-means parameter: the maximum number of iterations allowed. The default value is 100. |
startn |
A k-means parameter: the number of random initial sets to try. The default value is 10. |
CAclustering is an algorithm designed for community detection in networks with node covariates, as introduced in the paper Covariate-assisted spectral clustering of Binkiewicz et al. (2017). CAclustering applies k-means to the first K leading eigenvectors of $L_\tau + \alpha X X^\top$, where $L_\tau$ is the regularized graph Laplacian, $X$ is the covariate matrix, and $\alpha$ is a tuning parameter.
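For intuition, here is a minimal sketch of this spectral step, assuming the assortative form $L_\tau + \alpha X X^\top$ with a fixed, user-supplied $\alpha$. The function name and the choice $\tau$ = mean degree are illustrative; this is not the package's internal code, and CAclustering itself additionally searches over alphan candidate values of $\alpha$.

# Minimal sketch of the covariate-assisted spectral step (illustrative only;
# CAclustering() also selects alpha among `alphan` candidates internally).
casc_sketch <- function(Adj, X, K, alpha) {
  d <- rowSums(Adj)
  tau <- mean(d)                                     # a common regularization choice
  Dtau_inv_sqrt <- diag(1 / sqrt(d + tau))
  L_tau <- Dtau_inv_sqrt %*% Adj %*% Dtau_inv_sqrt   # regularized graph Laplacian
  M <- L_tau + alpha * X %*% t(X)                    # covariate-assisted similarity matrix
  U <- eigen(M, symmetric = TRUE)$vectors[, 1:K]     # K leading eigenvectors
  kmeans(U, centers = K, iter.max = 100, nstart = 10)$cluster
}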
estall |
A factor indicating nodes' labels. Items sharing the same label are in the same community. |
Binkiewicz, N., Vogelstein, J. T., & Rohe, K. (2017). Covariate-assisted spectral clustering.
Biometrika, 104(2), 361-377.
doi:10.1093/biomet/asx008
# Simulate the network and covariates
n = 10; K = 2; p = 5; prob1 = 0.9
theta = 0.4 + (0.45 - 0.05) * ((1:n) / n)^2
Theta = diag(theta)
P = matrix(c(0.8, 0.2, 0.2, 0.8), byrow = TRUE, nrow = K)
set.seed(2022)
l = sample(1:K, n, replace = TRUE)  # node labels
Pi = matrix(0, n, K)                # membership matrix
for (k in 1:K) {
  Pi[l == k, k] = 1
}
Omega = Theta %*% Pi %*% P %*% t(Pi) %*% Theta
Adj = matrix(runif(n * n, 0, 1), nrow = n)
Adj = Omega - Adj
Adj = 1 * (Adj >= 0)                # Bernoulli edges with success probability Omega
diag(Adj) = 0
Adj[lower.tri(Adj)] = t(Adj)[lower.tri(Adj)]  # symmetrize
Q = 0.1 * matrix(sign(runif(p * K) - 0.5), nrow = p)
for (i in 1:K) {
  Q[(i - 1) * (p / K) + (1:(p / K)), i] = 0.3  # informative covariate block for community i
}
W = matrix(0, nrow = n, ncol = K)
for (jj in 1:n) {
  pp = rep(1 / (K - 1), K)
  pp[l[jj]] = 0
  if (runif(1) <= prob1) {
    W[jj, 1:K] = Pi[jj, ]           # covariate label agrees with the network label
  } else {
    W[jj, sample(K, 1, prob = pp)] = 1
  }
}
W = t(W)
D0 = Q %*% W
D = matrix(0, n, p)
for (i in 1:n) {
  D[i, ] = rnorm(p, mean = D0[, i], sd = 1)  # noisy covariates
}
CAclustering(Adj, D, 2)
Covariate-based Spectral Clustering is a spectral clustering method that focuses solely on the covariate structure, i.e., the Gram matrix $X X^\top$, where $X$ is the covariate matrix, as introduced in Lee et al. (2010).
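A minimal sketch of this procedure, with illustrative function and variable names (not the package's internal code):

# Cluster nodes using only the covariate Gram matrix X X^T.
cov_sketch <- function(X, K) {
  G <- X %*% t(X)                                  # n x n Gram matrix of covariates
  U <- eigen(G, symmetric = TRUE)$vectors[, 1:K]   # K leading eigenvectors
  kmeans(U, centers = K, iter.max = 100, nstart = 10)$cluster
}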
Cov_based(Covariate, K, itermax = 100, startn = 10)
Covariate |
An n-by-p covariate matrix, whose rows correspond to nodes and whose columns correspond to covariates. |
K |
A positive integer which is no larger than n. This is the predefined number of communities. |
itermax |
A k-means parameter: the maximum number of iterations allowed. The default value is 100. |
startn |
A k-means parameter: the number of random initial sets to try. The default value is 10. |
estall |
A factor indicating nodes' labels. Items sharing the same label are in the same community. |
Lee, A. B., Luca, D., Klei, L., Devlin, B., & Roeder, K. (2010).
Discovering genetic ancestry using spectral graph theory.
Genetic Epidemiology: The Official Publication of the International Genetic Epidemiology Society,
34(1), 51-59.
doi:10.1002/gepi.20434
# Simulate the covariate matrix
n = 10; p = 5; K = 2; prob1 = 0.9
set.seed(2022)
l = sample(1:K, n, replace = TRUE)  # node labels
Pi = matrix(0, n, K)                # membership matrix
for (k in 1:K) {
  Pi[l == k, k] = 1
}
Q = 0.1 * matrix(sign(runif(p * K) - 0.5), nrow = p)
for (i in 1:K) {
  Q[(i - 1) * (p / K) + (1:(p / K)), i] = 0.3  # informative covariate block for community i
}
W = matrix(0, nrow = n, ncol = K)
for (jj in 1:n) {
  pp = rep(1 / (K - 1), K)
  pp[l[jj]] = 0
  if (runif(1) <= prob1) {
    W[jj, 1:K] = Pi[jj, ]           # covariate label agrees with the community label
  } else {
    W[jj, sample(K, 1, prob = pp)] = 1
  }
}
W = t(W)
D0 = Q %*% W
D = matrix(0, n, p)
for (i in 1:n) {
  D[i, ] = rnorm(p, mean = D0[, i], sd = 1)  # noisy covariates
}
Cov_based(D, 2)
Using network-adjusted covariates to detect underlying communities.
NAC(Adj, Covariate, K, alpha = NULL, beta = 0, itermax = 100, startn = 10)
Adj |
An n-by-n symmetric adjacency matrix with diagonal entries 0 and positive entries 1. |
Covariate |
An n-by-p covariate matrix, whose rows correspond to nodes and whose columns correspond to covariates. |
K |
A positive integer which is no larger than n. This is the predefined number of communities. |
alpha |
An optional numeric vector to tune the weight of the covariate matrix. The default value is NULL, in which case a node-specific weight alpha_i is computed from the i-th node's degree (see Details). |
beta |
An optional parameter used when the covariate matrix Covariate is uninformative. The default value is 0. |
itermax |
A k-means parameter: the maximum number of iterations allowed. The default value is 100. |
startn |
A k-means parameter: the number of random initial sets to try. The default value is 10. |
Spectral clustering with Network-Adjusted Covariates (NAC) is fully established in Network-Adjusted Covariates for Community Detection of Hu & Wang (2023). This method is particularly effective in the analysis of multiscale networks with covariates, addressing the challenge of misspecification between networks and covariates. NAC relies on the construction of network-adjusted covariate vectors $y_i = \alpha_i x_i + \sum_{j: A_{ij} = 1} x_j$, where the first part carries the nodal covariate information and the second part conveys the network information. By constructing $Y = (D_\alpha + A) X$, where $A$ is the adjacency matrix, $X$ is the covariate matrix, and $D_\alpha$ is the diagonal matrix with diagonal entries $\alpha_1, \ldots, \alpha_n$, NAC applies k-means to the first $K$ normalized left singular vectors of $Y$, treating each row as a data point. A notable feature of NAC is its tuning-free nature: the node-specific coefficient $\alpha_i$ is computed from the $i$-th node's degree. User-specified $\alpha$ is allowed as well. A generalization with uninformative covariates is handled by adjusting the parameter $\beta$; as long as the covariates do provide information, the specification of $\beta$ can be ignored.
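The construction can be summarized in a few lines. The sketch below assumes a user-supplied length-n weight vector alpha and K <= min(n, p); it illustrates the spectral step only and is not the package's internal code, whose default computes the node-specific $\alpha_i$ from node degrees as described above.

# Minimal sketch of the NAC spectral step (illustrative only).
nac_sketch <- function(Adj, X, K, alpha) {
  Y <- (diag(alpha) + Adj) %*% X   # network-adjusted covariates Y = (D_alpha + A) X
  U <- svd(Y)$u[, 1:K]             # K leading left singular vectors
  U <- U / sqrt(rowSums(U^2))      # normalize each row (one data point per node)
  kmeans(U, centers = K, iter.max = 100, nstart = 10)$cluster
}

For instance, with the simulated Adj and D from the example below, nac_sketch(Adj, D, 2, alpha = rep(1, nrow(Adj))) returns a label vector under the (illustrative) constant-weight choice.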
estall |
A factor indicating nodes' labels. Items sharing the same label are in the same community. |
Hu, Y., & Wang, W. (2023). Network-Adjusted Covariates for Community Detection.
arXiv preprint arXiv:2306.15616.
https://arxiv.org/abs/2306.15616
# Simulate the network and covariates
n = 10; K = 2; p = 5; prob1 = 0.9
theta = 0.4 + (0.45 - 0.05) * ((1:n) / n)^2
Theta = diag(theta)
P = matrix(c(0.8, 0.2, 0.2, 0.8), byrow = TRUE, nrow = K)
set.seed(2022)
l = sample(1:K, n, replace = TRUE)  # node labels
Pi = matrix(0, n, K)                # membership matrix
for (k in 1:K) {
  Pi[l == k, k] = 1
}
Omega = Theta %*% Pi %*% P %*% t(Pi) %*% Theta
Adj = matrix(runif(n * n, 0, 1), nrow = n)
Adj = Omega - Adj
Adj = 1 * (Adj >= 0)                # Bernoulli edges with success probability Omega
diag(Adj) = 0
Adj[lower.tri(Adj)] = t(Adj)[lower.tri(Adj)]  # symmetrize
Q = 0.1 * matrix(sign(runif(p * K) - 0.5), nrow = p)
for (i in 1:K) {
  Q[(i - 1) * (p / K) + (1:(p / K)), i] = 0.3  # informative covariate block for community i
}
W = matrix(0, nrow = n, ncol = K)
for (jj in 1:n) {
  pp = rep(1 / (K - 1), K)
  pp[l[jj]] = 0
  if (runif(1) <= prob1) {
    W[jj, 1:K] = Pi[jj, ]           # covariate label agrees with the network label
  } else {
    W[jj, sample(K, 1, prob = pp)] = 1
  }
}
W = t(W)
D0 = Q %*% W
D = matrix(0, n, p)
for (i in 1:n) {
  D[i, ] = rnorm(p, mean = D0[, i], sd = 1)  # noisy covariates
}
NAC(Adj, D, 2)
Network-based Regularized Spectral Clustering is spectral clustering with a regularized graph Laplacian, fully established in Impact of Regularization on Spectral Clustering of Joseph & Yu (2016).
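A minimal sketch of this procedure (illustrative names, not the package's internal code; the default $\tau$ = mean degree below is a common choice from the regularized-Laplacian literature and may differ from the package's internal default):

net_sketch <- function(Adj, K, tau = NULL) {
  d <- rowSums(Adj)
  if (is.null(tau)) tau <- mean(d)                    # illustrative default
  Dtau_inv_sqrt <- diag(1 / sqrt(d + tau))
  L_tau <- Dtau_inv_sqrt %*% Adj %*% Dtau_inv_sqrt    # regularized graph Laplacian
  U <- eigen(L_tau, symmetric = TRUE)$vectors[, 1:K]  # K leading eigenvectors
  kmeans(U, centers = K, iter.max = 100, nstart = 10)$cluster
}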
Net_based(Adj, K, tau = NULL, itermax = 100, startn = 10)
Adj |
An n-by-n symmetric adjacency matrix with diagonal entries 0 and positive entries 1. |
K |
A positive integer which is no larger than n. This is the predefined number of communities. |
tau |
An optional regularization parameter tau added to the node degrees when forming the regularized graph Laplacian. The default value is NULL, in which case a data-driven value based on the average node degree is used. |
itermax |
A k-means parameter: the maximum number of iterations allowed. The default value is 100. |
startn |
A k-means parameter: the number of random initial sets to try. The default value is 10. |
estall |
A factor indicating nodes' labels. Items sharing the same label are in the same community. |
Joseph, A., & Yu, B. (2016). Impact of Regularization on Spectral Clustering.
The Annals of Statistics, 44(4), 1765-1791.
doi:10.1214/16-AOS1447
# Simulate the network
n = 10; K = 2
theta = 0.4 + (0.45 - 0.05) * ((1:n) / n)^2
Theta = diag(theta)
P = matrix(c(0.8, 0.2, 0.2, 0.8), byrow = TRUE, nrow = K)
set.seed(2022)
l = sample(1:K, n, replace = TRUE)  # node labels
Pi = matrix(0, n, K)                # membership matrix
for (k in 1:K) {
  Pi[l == k, k] = 1
}
Omega = Theta %*% Pi %*% P %*% t(Pi) %*% Theta
Adj = matrix(runif(n * n, 0, 1), nrow = n)
Adj = Omega - Adj
Adj = 1 * (Adj >= 0)                # Bernoulli edges with success probability Omega
diag(Adj) = 0
Adj[lower.tri(Adj)] = t(Adj)[lower.tri(Adj)]  # symmetrize
Net_based(Adj, 2)
Using ratios-of-eigenvectors to detect underlying communities.
SCORE(G, K, itermax = 100, startn = 10)
G |
An n-by-n symmetric adjacency matrix with diagonal entries 0 and positive entries 1. The corresponding graph must be connected. |
K |
A positive integer which is no larger than n. This is the predefined number of communities. |
itermax |
A k-means parameter: the maximum number of iterations allowed. The default value is 100. |
startn |
A k-means parameter: the number of random initial sets to try. The default value is 10. |
SCORE is fully established in Fast community detection by SCORE of Jin (2015). SCORE uses the entrywise ratios between the first leading eigenvector and each of the other leading eigenvectors for clustering. It is noteworthy that SCORE only works on connected graphs; in other words, it does not allow for isolated vertices.
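A minimal sketch of the ratio construction (illustrative, not the package's internal code; the thresholding of extreme ratios used in some versions of SCORE is omitted):

score_sketch <- function(Adj, K) {
  eig <- eigen(Adj, symmetric = TRUE)
  lead <- order(abs(eig$values), decreasing = TRUE)[1:K]  # leading (in magnitude) eigenvalues
  U <- eig$vectors[, lead]
  R <- U[, 2:K, drop = FALSE] / U[, 1]  # entrywise ratios against the first eigenvector
  kmeans(R, centers = K, iter.max = 100, nstart = 10)$cluster
}

Connectivity matters here: by Perron-Frobenius, the first eigenvector of a connected graph has no zero entries, so the ratios are well defined.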
estall |
A factor indicating nodes' labels. Items sharing the same label are in the same community. |
Jin, J. (2015). Fast community detection by SCORE.
The Annals of Statistics, 43 (1), 57–89.
doi:10.1214/14-AOS1265
# Simulate the network
n = 10; K = 2
theta = 0.4 + (0.45 - 0.05) * ((1:n) / n)^2
Theta = diag(theta)
P = matrix(c(0.8, 0.2, 0.2, 0.8), byrow = TRUE, nrow = K)
set.seed(2022)
l = sample(1:K, n, replace = TRUE)  # node labels
Pi = matrix(0, n, K)                # membership matrix
for (k in 1:K) {
  Pi[l == k, k] = 1
}
Omega = Theta %*% Pi %*% P %*% t(Pi) %*% Theta
Adj = matrix(runif(n * n, 0, 1), nrow = n)
Adj = Omega - Adj
Adj = 1 * (Adj >= 0)                # Bernoulli edges with success probability Omega
diag(Adj) = 0
Adj[lower.tri(Adj)] = t(Adj)[lower.tri(Adj)]  # symmetrize
# SCORE requires a connected graph, so restrict to the giant component
library(igraph)
g = graph_from_adjacency_matrix(Adj, mode = "undirected")  # Adj is a plain matrix, not an igraph object
ix = components(g)
giantLabel = which(ix$membership == which.max(ix$csize))
Giant = Adj[giantLabel, giantLabel]
SCORE(Giant, 2)
Semidefinite programming (SDP) for optimizing the inner product between the combined network matrix and the solution matrix.
SDP(Adj, Covariate, lambda, K, alpha, rho, TT, tol, quiet = NULL, report_interval = NULL, r = NULL)
Adj |
An n-by-n symmetric adjacency matrix with diagonal entries 0 and positive entries 1. |
Covariate |
An n-by-p covariate matrix, whose rows correspond to nodes and whose columns correspond to covariates. |
lambda |
A tuning parameter to weigh the covariate matrix. |
K |
A positive integer which is no larger than n. This is the predefined number of communities. |
alpha |
The element-wise upper bound in the SDP solution matrix. |
rho |
The learning rate of the ADMM algorithm. |
TT |
The maximum number of iterations. |
tol |
The tolerance for the stopping criterion. |
quiet |
An optional input indicating whether to print the result at each step. |
report_interval |
An optional input: the frequency with which to print intermediate results. |
r |
An optional input: the expected rank of the solution. Leave it as NULL if no rank constraint is required. |
SDP is proposed in Covariate Regularized Community Detection in Sparse Graphs of Yan & Sarkar (2021). This method relies on semidefinite programming relaxations for detecting the community structure in sparse networks with covariates.
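Schematically, the optimization has the following shape, with $X X^\top$ standing in for a covariate similarity matrix. This is a rendering based on Yan & Sarkar (2021); the exact constraint set used by the implementation may differ.

\[
\max_{Z} \;\; \langle A + \lambda\, X X^{\top},\, Z \rangle
\quad \text{subject to} \quad Z \succeq 0, \;\; 0 \le Z_{ij} \le \alpha,
\]

where $A$ is the adjacency matrix, $\lambda$ is the lambda argument, and $\alpha$ is the element-wise upper bound (the alpha argument). The problem is solved iteratively with step size rho, for at most TT iterations or until the change falls below tol.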
estall |
A factor indicating nodes' labels. Items sharing the same label are in the same community. |
Yan, B., & Sarkar, P. (2021). Covariate Regularized Community Detection in Sparse Graphs.
Journal of the American Statistical Association, 116(534), 734-745.
doi:10.1080/01621459.2019.1706541
# Simulate the network and covariates
n = 10; K = 2; p = 5; prob1 = 0.9
theta = 0.4 + (0.45 - 0.05) * ((1:n) / n)^2
Theta = diag(theta)
P = matrix(c(0.8, 0.2, 0.2, 0.8), byrow = TRUE, nrow = K)
set.seed(2022)
l = sample(1:K, n, replace = TRUE)  # node labels
Pi = matrix(0, n, K)                # membership matrix
for (k in 1:K) {
  Pi[l == k, k] = 1
}
Omega = Theta %*% Pi %*% P %*% t(Pi) %*% Theta
Adj = matrix(runif(n * n, 0, 1), nrow = n)
Adj = Omega - Adj
Adj = 1 * (Adj >= 0)                # Bernoulli edges with success probability Omega
diag(Adj) = 0
Adj[lower.tri(Adj)] = t(Adj)[lower.tri(Adj)]  # symmetrize
Q = 0.1 * matrix(sign(runif(p * K) - 0.5), nrow = p)
for (i in 1:K) {
  Q[(i - 1) * (p / K) + (1:(p / K)), i] = 0.3  # informative covariate block for community i
}
W = matrix(0, nrow = n, ncol = K)
for (jj in 1:n) {
  pp = rep(1 / (K - 1), K)
  pp[l[jj]] = 0
  if (runif(1) <= prob1) {
    W[jj, 1:K] = Pi[jj, ]           # covariate label agrees with the network label
  } else {
    W[jj, sample(K, 1, prob = pp)] = 1
  }
}
W = t(W)
D0 = Q %*% W
D = matrix(0, n, p)
for (i in 1:n) {
  D[i, ] = rnorm(p, mean = D0[, i], sd = 1)  # noisy covariates
}
SDP(Adj, D, lambda = 0.2, K = 2, alpha = 0.5, rho = 2, TT = 100, tol = 5)