Clustering
Provable Clustering of a Union of Linear Manifolds Using Optimal Directions
This paper focuses on the Matrix Factorization based Clustering (MFC) method which is one of the few closed form algorithms for the subspace clustering problem. Despite being simple, closed-form, and computation-efficient, MFC can outperform the other sophisticated subspace clustering methods in many challenging scenarios. We reveal the connection between MFC and the Innovation Pursuit (iPursuit) algorithm which was shown to be able to outperform the other spectral clustering based methods with a notable margin especially when the span of clusters are close. A novel theoretical study is presented which sheds light on the key performance factors of both algorithms (MFC/iPursuit) and it is shown that both algorithms can be robust to notable intersections between the span of clusters. Importantly, in contrast to the theoretical guarantees of other algorithms which emphasized on the distance between the subspaces as the key performance factor and without making the innovation assumption, it is shown that the performance of MFC/iPursuit mainly depends on the distance between the innovative components of the clusters.
Quantum Unsupervised and Supervised Learning on Superconducting Processors
Sarma, Abhijat, Chatterjee, Rupak, Gili, Kaitlin, Yu, Ting
Machine learning algorithms perform well on identifying patterns in many different datasets due to their versatility. However, as one increases the size of the dataset, the computation time for training and using these statistical models grows quickly. Quantum computing offers a new paradigm which may have the ability to overcome these computational difficulties. Here, we propose a quantum analogue to K-means clustering, implement it on simulated superconducting qubits, and compare it to a previously developed quantum support vector machine. We find the algorithm's accuracy comparable to the classical K-means algorithm for clustering and classification problems, and find that it has asymptotic complexity $O(N^{3/2}K^{1/2}\log{P})$, where $N$ is the number of data points, $K$ is the number of clusters, and $P$ is the dimension of the data points, giving a significant speedup over the classical analogue.
Group structure estimation for panel data -- a general approach
Yu, Lu, Gu, Jiaying, Volgushev, Stanislav
Panel data models are a standard empirical tool in statistics, economics, marketing, and financial research. The conventional modeling approach is to assume that all individual heterogeneity can be summarized by an individual specific intercept, often known as the fixed effects, while assuming all covariates have a common effect among all the individuals, such that information can be pooled across individuals to gain efficiency of these common parameters. However, heterogeneous responses towards observed control variables are often better supported by empirical evidence, especially as detailed individual level data becomes more available. An increasingly popular approach to model unobserved heterogeneity in the effects of covariates on individual responses is to assume the existence of a finite number of homogeneous groups.
ZeroBERTo -- Leveraging Zero-Shot Text Classification by Topic Modeling
Alcoforado, Alexandre, Ferraz, Thomas Palmeira, Gerber, Rodrigo, Bustos, Enzo, Oliveira, André Seidel, Veloso, Bruno Miguel, Siqueira, Fabio Levy, Costa, Anna Helena Reali
Traditional text classification approaches often require a good amount of labeled data, which is difficult to obtain, especially in restricted domains or less widespread languages. This lack of labeled data has led to the rise of low-resource methods, that assume low data availability in natural language processing. Among them, zero-shot learning stands out, which consists of learning a classifier without any previously labeled data. The best results reported with this approach use language models such as Transformers, but fall into two problems: high execution time and inability to handle long texts as input. This paper proposes a new model, ZeroBERTo, which leverages an unsupervised clustering step to obtain a compressed data representation before the classification task. We show that ZeroBERTo has better performance for long inputs and shorter execution time, outperforming XLM-R by about 12 % in the F1 score in the FolhaUOL dataset.
Matrix Completion with Hierarchical Graph Side Information
Elmahdy, Adel, Ahn, Junhyung, Suh, Changho, Mohajer, Soheil
We consider a matrix completion problem that exploits social or item similarity graphs as side information. We develop a universal, parameter-free, and computationally efficient algorithm that starts with hierarchical graph clustering and then iteratively refines estimates both on graph clustering and matrix ratings. Under a hierarchical stochastic block model that well respects practically-relevant social graphs and a low-rank rating matrix model (to be detailed), we demonstrate that our algorithm achieves the information-theoretic limit on the number of observed matrix entries (i.e., optimal sample complexity) that is derived by maximum likelihood estimation together with a lower-bound impossibility result. One consequence of this result is that exploiting the hierarchical structure of social graphs yields a substantial gain in sample complexity relative to the one that simply identifies different groups without resorting to the relational structure across them. We conduct extensive experiments both on synthetic and real-world datasets to corroborate our theoretical results as well as to demonstrate significant performance improvements over other matrix completion algorithms that leverage graph side information.
A sampling-based approach for efficient clustering in large datasets
Exarchakis, Georgios, Oubari, Omar, Lenz, Gregor
We propose a simple and efficient clustering method for high-dimensional data with a large number of clusters. Our algorithm achieves high-performance by evaluating distances of datapoints with a subset of the cluster centres. Our contribution is substantially more efficient than k-means as it does not require an all to all comparison of data points and clusters. We show that the optimal solutions of our approximation are the same as in the exact solution. However, our approach is considerably more efficient at extracting these clusters compared to the state-of-the-art. We compare our approximation with the exact k-means and alternative approximation approaches on a series of standardised clustering tasks. For the evaluation, we consider the algorithmic complexity, including number of operations to convergence, and the stability of the results.
VDPC: Variational Density Peak Clustering Algorithm
Wang, Yizhang, Wang, Di, Zhou, You, Zhang, Xiaofeng, Quek, Chai
The widely applied density peak clustering (DPC) algorithm makes an intuitive cluster formation assumption that cluster centers are often surrounded by data points with lower local density and far away from other data points with higher local density. However, this assumption suffers from one limitation that it is often problematic when identifying clusters with lower density because they might be easily merged into other clusters with higher density. As a result, DPC may not be able to identify clusters with variational density. To address this issue, we propose a variational density peak clustering (VDPC) algorithm, which is designed to systematically and autonomously perform the clustering task on datasets with various types of density distributions. Specifically, we first propose a novel method to identify the representatives among all data points and construct initial clusters based on the identified representatives for further analysis of the clusters' property. Furthermore, we divide all data points into different levels according to their local density and propose a unified clustering framework by combining the advantages of both DPC and DBSCAN. Thus, all the identified initial clusters spreading across different density levels are systematically processed to form the final clusters. To evaluate the effectiveness of the proposed VDPC algorithm, we conduct extensive experiments using 20 datasets including eight synthetic, six real-world and six image datasets. The experimental results show that VDPC outperforms two classical algorithms (i.e., DPC and DBSCAN) and four state-of-the-art extended DPC algorithms.
Deep Graph Clustering via Dual Correlation Reduction
Liu, Yue, Tu, Wenxuan, Zhou, Sihang, Liu, Xinwang, Song, Linxuan, Yang, Xihong, Zhu, En
Deep graph clustering, which aims to reveal the underlying graph structure and divide the nodes into different groups, has attracted intensive attention in recent years. However, we observe that, in the process of node encoding, existing methods suffer from representation collapse which tends to map all data into the same representation. Consequently, the discriminative capability of the node representation is limited, leading to unsatisfied clustering performance. To address this issue, we propose a novel self-supervised deep graph clustering method termed Dual Correlation Reduction Network (DCRN) by reducing information correlation in a dual manner. Specifically, in our method, we first design a siamese network to encode samples. Then by forcing the cross-view sample correlation matrix and cross-view feature correlation matrix to approximate two identity matrices, respectively, we reduce the information correlation in the dual-level, thus improving the discriminative capability of the resulting features. Moreover, in order to alleviate representation collapse caused by over-smoothing in GCN, we introduce a propagation regularization term to enable the network to gain long-distance information with the shallow network structure. Extensive experimental results on six benchmark datasets demonstrate the effectiveness of the proposed DCRN against the existing state-of-the-art methods.
Robust Convergence in Federated Learning through Label-wise Clustering
Lee, Hunmin, Liu, Yueyang, Kim, Donghyun, Li, Yingshu
Non-IID dataset and heterogeneous environment of the local clients are regarded as a major issue in Federated Learning (FL), causing a downturn in the convergence without achieving satisfactory performance. In this paper, we propose a novel Label-wise clustering algorithm that guarantees the trainability among geographically dispersed heterogeneous local clients, by selecting only the local models trained with a dataset that approximates into uniformly distributed class labels, which is likely to obtain faster minimization of the loss and increment the accuracy among the FL network. Through conducting experiments on the suggested six common non-IID scenarios, we empirically show that the vanilla FL aggregation model is incapable of gaining robust convergence generating biased pre-trained local models and drifting the local weights to mislead the trainability in the worst case. Moreover, we quantitatively estimate the expected performance of the local models before training, which offers a global server to select the optimal clients, saving additional computational costs. Ultimately, in order to gain resolution of the non-convergence in such non-IID situations, we design clustering algorithms based on local input class labels, accommodating the diversity and assorting clients that could lead the overall system to attain the swift convergence as global training continues. Our paper shows that proposed Label-wise clustering demonstrates prompt and robust convergence compared to other FL algorithms when local training datasets are non-IID or coexist with IID through multiple experiments.
Multilayer Graph Contrastive Clustering Network
Liu, Liang, Kang, Zhao, Tian, Ling, Xu, Wenbo, He, Xixu
Multilayer graph has garnered plenty of research attention in many areas due to their high utility in modeling interdependent systems. However, clustering of multilayer graph, which aims at dividing the graph nodes into categories or communities, is still at a nascent stage. Existing methods are often limited to exploiting the multiview attributes or multiple networks and ignoring more complex and richer network frameworks. To this end, we propose a generic and effective autoencoder framework for multilayer graph clustering named Multilayer Graph Contrastive Clustering Network (MGCCN). MGCCN consists of three modules: (1)Attention mechanism is applied to better capture the relevance between nodes and neighbors for better node embeddings. (2)To better explore the consistent information in different networks, a contrastive fusion strategy is introduced. (3)MGCCN employs a self-supervised component that iteratively strengthens the node embedding and clustering. Extensive experiments on different types of real-world graph data indicate that our proposed method outperforms state-of-the-art techniques.