Clustering
Community Detection and Classification Guarantees Using Embeddings Learned by Node2Vec
Davison, Andrew, Morgan, S. Carlyle, Ward, Owen G.
Within network science, a widely applicable and important inference task is to understand how the behavior of interactions between different units (nodes) within the network depend on their latent characteristics. This occurs within a wide array of disciplines, from sociological (Freeman, 2004) to biological (Luo et al., 2007) networks. One simple and interpretable model for such a task is the stochastic block model (SBM) (Holland et al., 1983) which assumes that nodes within the network are assigned a discrete community label. Edges between nodes in the network are then formed independently across all pairs of edges, conditional on these community assignments. While such a model is simplistic, it and various extensions, such as the degree corrected SBM (DCSBM), used to handle degree heterogenity (Karrer and Newman, 2011), and mixed-membership SBMs, to allow for more complex community structures (Airoldi, Blei, Fienberg, and Xing, 2008), have seen a wide degree of empirical success (Latouche et al., 2011; Legramanti et al., 2022; Airoldi, Blei, Fienberg, Xing, and Jaakkola, 2006). One restriction of the stochastic block model and its generalizations is the requirement for a discrete community assignment as a latent representation of the units within the network. While the statistical community has previously considered more flexible latent representations (Hoff et al., 2002), over the past decade, there have been significant advancements in general embedding methods for networks, which produce general vector representations of units within a network, and generally achieve start-of-the-art performance in downstream tasks for node classification and link prediction. An early example of such a method is spectral clustering (Ng et al., 2001), which constructs an embedding of the nodes in the network from an eigendecomposition of the graph Laplacian. The k smallest non zero eigenvectors provides a k dimensional representation of each of the nodes in the network.
Joint Multi-View Collaborative Clustering
Khalafaoui, Yasser, Matei, Basarab, Grozavu, Nistor, Lovisetto, Martino
Data is increasingly being collected from multiple sources and described by multiple views. These multi-view data provide richer information than traditional single-view data. Fusing the former for specific tasks is an essential component of multi-view clustering. Since the goal of multi-view clustering algorithms is to discover the common latent structure shared by multiple views, the majority of proposed solutions overlook the advantages of incorporating knowledge derived from horizontal collaboration between multi-view data and the final consensus. To fill this gap, we propose the Joint Multi-View Collaborative Clustering (JMVCC) solution, which involves the generation of basic partitions using Non-negative Matrix Factorization (NMF) and the horizontal collaboration principle, followed by the fusion of these local partitions using ensemble clustering. Furthermore, we propose a weighting method to reduce the risk of negative collaboration (i.e., views with low quality) during the generation and fusion of local partitions. The experimental results, which were obtained using a variety of data sets, demonstrate that JMVCC outperforms other multi-view clustering algorithms and is robust to noisy views.
DECWA : Density-Based Clustering using Wasserstein Distance
Malki, Nabil El, Cugny, Robin, Teste, Olivier, Ravat, Franck
Clustering is a data analysis method for extracting knowledge by discovering groups of data called clusters. Among these methods, state-of-the-art density-based clustering methods have proven to be effective for arbitrary-shaped clusters. Despite their encouraging results, they suffer to find low-density clusters, near clusters with similar densities, and high-dimensional data. Our proposals are a new characterization of clusters and a new clustering algorithm based on spatial density and probabilistic approach. First of all, sub-clusters are built using spatial density represented as probability density function ($p.d.f$) of pairwise distances between points. A method is then proposed to agglomerate similar sub-clusters by using both their density ($p.d.f$) and their spatial distance. The key idea we propose is to use the Wasserstein metric, a powerful tool to measure the distance between $p.d.f$ of sub-clusters. We show that our approach outperforms other state-of-the-art density-based clustering methods on a wide variety of datasets.
A Batch-to-Online Transformation under Random-Order Model
We introduce a transformation framework that can be utilized to develop online algorithms with low $\epsilon$-approximate regret in the random-order model from offline approximation algorithms. We first give a general reduction theorem that transforms an offline approximation algorithm with low average sensitivity to an online algorithm with low $\epsilon$-approximate regret. We then demonstrate that offline approximation algorithms can be transformed into a low-sensitivity version using a coreset construction method. To showcase the versatility of our approach, we apply it to various problems, including online $(k,z)$-clustering, online matrix approximation, and online regression, and successfully achieve polylogarithmic $\epsilon$-approximate regret for each problem. Moreover, we show that in all three cases, our algorithm also enjoys low inconsistency, which may be desired in some online applications.
Hierarchical clustering with OWA-based linkages, the Lance-Williams formula, and dendrogram inversions
Gagolewski, Marek, Cena, Anna, James, Simon, Beliakov, Gleb
Agglomerative hierarchical clustering based on Ordered Weighted Averaging (OWA) operators not only generalises the single, complete, and average linkages, but also includes intercluster distances based on a few nearest or farthest neighbours, trimmed and winsorised means of pairwise point similarities, amongst many others. We explore the relationships between the famous Lance-Williams update formula and the extended OWA-based linkages with weights generated via infinite coefficient sequences. Furthermore, we provide some conditions for the weight generators to guarantee the resulting dendrograms to be free from unaesthetic inversions.
A framework for benchmarking clustering algorithms
The evaluation of clustering algorithms can involve running them on a variety of benchmark problems, and comparing their outputs to the reference, ground-truth groupings provided by experts. Unfortunately, many research papers and graduate theses consider only a small number of datasets. Also, the fact that there can be many equally valid ways to cluster a given problem set is rarely taken into account. In order to overcome these limitations, we have developed a framework whose aim is to introduce a consistent methodology for testing clustering algorithms. Furthermore, we have aggregated, polished, and standardised many clustering benchmark dataset collections referred to across the machine learning and data mining literature, and included new datasets of different dimensionalities, sizes, and cluster types. An interactive datasets explorer, the documentation of the Python API, a description of the ways to interact with the framework from other programming languages such as R or MATLAB, and other details are all provided at
Transfer learning for day-ahead load forecasting: a case study on European national electricity demand time series
Tzortzis, Alexandros-Menelaos, Pelekis, Sotiris, Spiliotis, Evangelos, Mouzakitis, Spiros, Psarras, John, Askounis, Dimitris
Short-term load forecasting (STLF) is crucial for the daily operation of power grids. However, the non-linearity, non-stationarity, and randomness characterizing electricity demand time series renders STLF a challenging task. Various forecasting approaches have been proposed for improving STLF, including neural network (NN) models which are trained using data from multiple electricity demand series that may not necessary include the target series. In the present study, we investigate the performance of this special case of STLF, called transfer learning (TL), by considering a set of 27 time series that represent the national day-ahead electricity demand of indicative European countries. We employ a popular and easy-to-implement NN model and perform a clustering analysis to identify similar patterns among the series and assist TL. In this context, two different TL approaches, with and without the clustering step, are compiled and compared against each other as well as a typical NN training setup. Our results demonstrate that TL can outperform the conventional approach, especially when clustering techniques are considered.
Analyzing Single Cell RNA Sequencing with Topological Nonnegative Matrix Factorization
Single-cell RNA sequencing (scRNA-seq) is a relatively new technology that has stimulated enormous interest in statistics, data science, and computational biology due to the high dimensionality, complexity, and large scale associated with scRNA-seq data. Nonnegative matrix factorization (NMF) offers a unique approach due to its meta-gene interpretation of resulting low-dimensional components. However, NMF approaches suffer from the lack of multiscale analysis. This work introduces two persistent Laplacian regularized NMF methods, namely, topological NMF (TNMF) and robust topological NMF (rTNMF). By employing a total of 12 datasets, we demonstrate that the proposed TNMF and rTNMF significantly outperform all other NMF-based methods. We have also utilized TNMF and rTNMF for the visualization of popular Uniform Manifold Approximation and Projection (UMAP) and t-distributed stochastic neighbor embedding (t-SNE).
Privacy-Preserving Federated Deep Clustering based on GAN
Yan, Jie, Liu, Jing, Qi, Ji, Zhang, Zhong-Yuan
Federated clustering (FC) is an essential extension of centralized clustering designed for the federated setting, wherein the challenge lies in constructing a global similarity measure without the need to share private data. Conventional approaches to FC typically adopt extensions of centralized methods, like K-means and fuzzy c-means. However, these methods are susceptible to non-independent-and-identically-distributed (non-IID) data among clients, leading to suboptimal performance, particularly with high-dimensional data. In this paper, we present a novel approach to address these limitations by proposing a Privacy-Preserving Federated Deep Clustering based on Generative Adversarial Networks (GANs). Each client trains a local generative adversarial network (GAN) locally and uploads the synthetic data to the server. The server applies a deep clustering network on the synthetic data to establish $k$ cluster centroids, which are then downloaded to the clients for cluster assignment. Theoretical analysis demonstrates that the GAN-generated samples, shared among clients, inherently uphold certain privacy guarantees, safeguarding the confidentiality of individual data. Furthermore, extensive experimental evaluations showcase the effectiveness and utility of our proposed method in achieving accurate and privacy-preserving federated clustering.
Self-Supervised One-Shot Learning for Automatic Segmentation of StyleGAN Images
Manerikar, Ankit, Kak, Avinash C.
We propose a framework for the automatic one-shot segmentation of synthetic images generated by a StyleGAN. Our framework is based on the observation that the multi-scale hidden features in the GAN generator hold useful semantic information that can be utilized for automatic on-the-fly segmentation of the generated images. Using these features, our framework learns to segment synthetic images using a self-supervised contrastive clustering algorithm that projects the hidden features into a compact space for per-pixel classification. This contrastive learner is based on using a novel data augmentation strategy and a pixel-wise swapped prediction loss that leads to faster learning of the feature vectors for one-shot segmentation. We have tested our implementation on five standard benchmarks to yield a segmentation performance that not only outperforms the semi-supervised baselines by an average wIoU margin of 1.02 % but also improves the inference speeds by a factor of 4.5. Finally, we also show the results of using the proposed one-shot learner in implementing BagGAN, a framework for producing annotated synthetic baggage X-ray scans for threat detection. This framework was trained and tested on the PIDRay baggage benchmark to yield a performance comparable to its baseline segmenter based on manual annotations.