louvain
On the Power of Louvain in the Stochastic Block Model
A classic problem in machine learning and data analysis is to partition the vertices of a network in such a way that vertices in the same set are densely connected and vertices in different sets are loosely connected. In practice, the most popular approaches rely on local search algorithms; not only for the ease of implementation and the efficiency, but also because of the accuracy of these methods on many real world graphs. For example, the Louvain algorithm -- a local search based algorithm -- has quickly become the method of choice for clustering in social networks. However, explaining the success of these methods remains an open problem: in the worst-case, the runtime can be up to \Omega(n^2), much worse than what is typically observed in practice, and no guarantee on the quality of its output can be established. The goal of this paper is to shed light on the inner-workings of Louvain; only if we understand Louvain, can we rely on it and further improve it. To achieve this goal, we study the behavior of Louvain in the famous two-bloc Stochastic Block Model, which has a clear ground-truth and serves as the standard testbed for graph clustering algorithms. We provide valuable tools for the analysis of Louvain, but also for many other combinatorial algorithms. For example, we show that the probability for a node to have more edges towards its own community is 1/2 + \Omega( \min( \Delta(p-q)/\sqrt{np},1)) in the SBM(n,p,q), where \Delta is the imbalance. Note that this bound is asymptotically tight and useful for the analysis of a wide range of algorithms (Louvain, Kernighan-Lin, Simulated Annealing etc).
DynBenchmark: Customizable Ground Truths to Benchmark Community Detection and Tracking in Temporal Networks
Brisson, Laurent, Bothorel, Cécile, Duminy, Nicolas
Graph models help understand network dynamics and evolution. Creating graphs with controlled topology and embedded partitions is a common strategy for evaluating community detection algorithms. However, existing benchmarks often overlook the need to track the evolution of communities in real-world networks. To address this, a new community-centered model is proposed to generate customizable evolving community structures where communities can grow, shrink, merge, split, appear or disappear. This benchmark also generates the underlying temporal network, where nodes can appear, disappear, or move between communities. The benchmark has been used to test three methods, measuring their performance in tracking nodes' cluster membership and detecting community evolution. Python libraries, drawing utilities, and validation metrics are provided to compare ground truth with algorithm results for detecting dynamic communities.
On the Power of Louvain in the Stochastic Block Model Vincent Cohen-Addad
A classic problem in machine learning and data analysis is to partition the vertices of a network in such a way that vertices in the same set are densely connected and vertices in different sets are loosely connected. In practice, the most popular approaches rely on local search algorithms; not only for the ease of implementation and the efficiency, but also because of the accuracy of these methods on many real world graphs. For example, the Louvain algorithm - a local search based algorithm - has quickly become the method of choice for clustering in social networks.
Scalable Varied-Density Clustering via Graph Propagation
Pham, Ninh, Zheng, Yingtao, Phibbs, Hugo
We propose a novel perspective on varied-density clustering for high-dimensional data by framing it as a label propagation process in neighborhood graphs that adapt to local density variations. Our method formally connects density-based clustering with graph connectivity, enabling the use of efficient graph propagation techniques developed in network science. To ensure scalability, we introduce a density-aware neighborhood propagation algorithm and leverage advanced random projection methods to construct approximate neighborhood graphs. Our approach significantly reduces computational cost while preserving clustering quality. Empirically, it scales to datasets with millions of points in minutes and achieves competitive accuracy compared to existing baselines.
Clustering scientific publications: lessons learned through experiments with a real citation network
Clustering scientific publications can reveal underlying research structures within bibliographic databases. Graph-based clustering methods, such as spectral, Louvain, and Leiden algorithms, are frequently utilized due to their capacity to effectively model citation networks. However, their performance may degrade when applied to real-world data. This study evaluates the performance of these clustering algorithms on a citation graph comprising approx. 700,000 papers and 4.6 million citations extracted from Web of Science. The results show that while scalable methods like Louvain and Leiden perform efficiently, their default settings often yield poor partitioning. Meaningful outcomes require careful parameter tuning, especially for large networks with uneven structures, including a dense core and loosely connected papers. These findings highlight practical lessons about the challenges of large-scale data, method selection and tuning based on specific structures of bibliometric clustering tasks.
On the Power of Louvain in the Stochastic Block Model
A classic problem in machine learning and data analysis is to partition the vertices of a network in such a way that vertices in the same set are densely connected and vertices in different sets are loosely connected. In practice, the most popular approaches rely on local search algorithms; not only for the ease of implementation and the efficiency, but also because of the accuracy of these methods on many real world graphs. For example, the Louvain algorithm -- a local search based algorithm -- has quickly become the method of choice for clustering in social networks. However, explaining the success of these methods remains an open problem: in the worst-case, the runtime can be up to \Omega(n 2), much worse than what is typically observed in practice, and no guarantee on the quality of its output can be established. The goal of this paper is to shed light on the inner-workings of Louvain; only if we understand Louvain, can we rely on it and further improve it.
Testing network clustering algorithms with Natural Language Processing
Achitouv, Ixandra, Chavalarias, David, Gaume, Bruno
The advent of online social networks has led to the development of an abundant literature on the study of online social groups and their relationship to individuals' personalities as revealed by their textual productions. Social structures are inferred from a wide range of social interactions. Those interactions form complex -- sometimes multi-layered -- networks, on which community detection algorithms are applied to extract higher order structures. The choice of the community detection algorithm is however hardily questioned in relation with the cultural production of the individual they classify. In this work, we assume the entangled nature of social networks and their cultural production to propose a definition of cultural based online social groups as sets of individuals whose online production can be categorized as social group-related. We take advantage of this apparently self-referential description of online social groups with a hybrid methodology that combines a community detection algorithm and a natural language processing classification algorithm. A key result of this analysis is the possibility to score community detection algorithms using their agreement with the natural language processing classification. A second result is that we can assign the opinion of a random user at >85% accuracy.
An Adaptive Deep Clustering Pipeline to Inform Text Labeling at Scale
Mining the latent intentions from large volumes of natural language inputs is a key step to help data analysts design and refine Intelligent Virtual Assistants (IVAs) for customer service and sales support. We created a flexible and scalable clustering pipeline within the Verint Intent Manager (VIM) that integrates the fine-tuning of language models, a high performing k-NN library and community detection techniques to help analysts quickly surface and organize relevant user intentions from conversational texts. The fine-tuning step is necessary because pre-trained language models cannot encode texts to efficiently surface particular clustering structures when the target texts are from an unseen domain or the clustering task is not topic detection. We describe the pipeline and demonstrate its performance and ability to scale on three real-world text mining tasks. As deployed in the VIM application, this clustering pipeline produces high quality results, improving the performance of data analysts and reducing the time it takes to surface intentions from customer service data, thereby reducing the time it takes to build and deploy IVAs in new domains.