Les Misérables characters network, encoding interactions between characters of Victor Hugo’s novel, was first created by Donald Knuth as part of the Stanford Graph Base (https://people.sc.fsu.edu/~jburkardt/datasets/sgb/sgb.html). It contains 77 nodes corresponding to characters of the novel, and 254 vertices connecting two characters whenever they appear in the same chapter.
data(les_miserables)
A<- les_miserables$A
names <- les_miserables$names
net <- graph_from_adjacency_matrix(A, mode = "undirected")
V(net)$name <- names
V(net)$color <- "gray80"
deg <- degree(net, mode="all")
V(net)$size <- deg
plot(net, vertex.label.cex = 0.4)
We fit a classical SBM to the graph and represent the graph with nodes proportional to their degrees and colored by community assignment. The number of communities has been selected so as to minimize the ICL criterion.
vBlocks <- 1:10
collection_sbm <- missSBM::estimateMissSBM(A, vBlocks, "node")
#>
#>
#> Adjusting Variational EM for Stochastic Block Model
#>
#> Imputation assumes a 'node' network-sampling process
#>
#> Initialization of 10 model(s).
#> Performing VEM inference
#> Model with 8 blocks. Model with 5 blocks. Model with 4 blocks. Model with 3 blocks. Model with 7 blocks. Model with 2 blocks. Model with 9 blocks. Model with 1 blocks. Model with 6 blocks. Model with 10 blocks.
#> Looking for better solutions
#> Pass 1 Going forward +++++++++ Pass 1 Going backward +++++++++
colo <- round(collection_sbm$bestModel$fittedSBM$probMemberships)
colo <- sapply(1:nrow(A), function(i) which.max(colo[i,]))
pal3 <- brewer.pal(10, "Set3")
V(net)$color <- pal3[colo]
V(net)$label <- NA
V(net)$size <- deg
plot(net)
We observe that the main character Jean Valjean is alone in his community, and one of the clusters groups important characters (Thénardier, Éponine, Javert).
The Generalized stochastic Block Model accounts for outlier profiles (hubs, mixed memberships). In this model, nodes are divided into two sets: the inliers which follow a classical SBM, and the outliers, for which we make no assumptions on the connectivity model. These two sets are unknown a priori and are learned automatically by our procedure. Below we represent the result of the clustering, with the detected outliers indicated in red. They correspond to hubs (large center node, Jean Valjean) and nodes with mixed memberships (e.g. smaller central nodes with connections to several clusters).
lambda1 <- 4
lambda2 <- 5
res <- gsbm_mcgd(A, lambda1 = lambda1, lambda2 = lambda2)
outliers <- names[which(colSums(res$S)>0)]
sv <- svd(res$L)
pc <- sv$u[,1:4]
rownames(pc) <- names
pc <- pc[setdiff(names, outliers),]
com <- kmeans(pc, centers=4, nstart=50)
com$cluster
#> Napoleon MlleBaptistine MmeMagloire CountessDeLo
#> 2 2 2 2
#> Geborand Champtercier Cravatte Count
#> 2 2 2 2
#> OldMan Labarre Marguerite MmeDeR
#> 2 2 2 2
#> Isabeau Gervais Tholomyes Listolier
#> 2 2 1 1
#> Fameuil Blacheville Favourite Dahlia
#> 1 1 1 1
#> Zephine MmeThenardier Fauchelevent Bamatabois
#> 1 2 2 3
#> Perpetue Simplice Scaufflaire Woman1
#> 2 2 2 2
#> Judge Champmathieu Brevet Chenildieu
#> 3 3 3 3
#> Cochepaille Pontmercy Boulatruelle Eponine
#> 3 2 2 2
#> Anzelma Woman2 MotherInnocent Gribier
#> 2 2 2 2
#> Jondrette MmeBurgon Gillenormand Magnon
#> 2 2 2 2
#> MlleGillenormand MmePontmercy MlleVaubois LtGillenormand
#> 2 2 2 2
#> BaronessT Mabeuf Enjolras Combeferre
#> 2 4 4 4
#> Prouvaire Feuilly Courfeyrac Bahorel
#> 4 4 4 4
#> Bossuet Joly Grantaire MotherPlutarch
#> 4 4 4 2
#> Gueulemer Babet Claquesous Montparnasse
#> 2 2 2 2
#> Toussaint Child1 Child2 Brujon
#> 2 2 2 2
#> MmeHucheloup
#> 4
colo2 <- 1:nrow(A)
names(colo2) <- names
comu <- com$cluster
comu[which(comu==4)] <- 6
colo2[setdiff(names, outliers)] <- pal3[comu]
colo2[outliers] <- "red"
labels <- names(A)
names(labels) <- names(A)
labels[setdiff(names, outliers)] <- NA
V(net)$label <- NA
V(net)$color <- colo2
V(net)$size <- deg
E(net)$arrow.size <- 5
plot(net, vertex.label.dist=20)