Clustering
Local Graph Clustering with Network Lasso
Jung, Alexander, SarcheshmehPour, Yasmin
We study the statistical and computational properties of a network Lasso method for local graph clustering. The clusters delivered by nLasso can be characterized elegantly via network flows between cluster boundary and seed nodes. While spectral clustering methods are guided by a minimization of the graph Laplacian quadratic form, nLasso minimizes the total variation of cluster indicator signals. As demonstrated theoretically and numerically, nLasso methods can handle very sparse clusters (chain-like) which are difficult for spectral clustering. We also verify that a primal-dual method for nonsmooth optimization allows to approximate nLasso solutions with optimal worst-case convergence rate.
From Time Series to Euclidean Spaces: On Spatial Transformations for Temporal Clustering
Goncalves, Nuno Mota, Giurgiu, Ioana, Schumann, Anika
Unsupervised clustering of temporal data is both challenging and crucial in machine learning. In this paper, we show that neither traditional clustering methods, time series specific or even deep learning-based alternatives generalise well when both varying sampling rates and high dimensionality are present in the input data. We propose a novel approach to temporal clustering, in which we (1) transform the input time series into a distance-based projected representation by using similarity measures suitable for dealing with temporal data,(2) feed these projections into a multi-layer CNN-GRU autoencoder to generate meaningful domain-aware latent representations, which ultimately (3) allow for a natural separation of clusters beneficial for most important traditional clustering algorithms. We evaluate our approach on time series datasets from various domains and show that it not only outperforms existing methods in all cases, by up to 32%, but is also robust and incurs negligible computation overheads.
Deep Incomplete Multi-View Multiple Clusterings
Wei, Shaowei, Wang, Jun, Yu, Guoxian, Domeniconi, Carlotta, Zhang, Xiangliang
Multi-view clustering aims at exploiting information from multiple heterogeneous views to promote clustering. Most previous works search for only one optimal clustering based on the predefined clustering criterion, but devising such a criterion that captures what users need is difficult. Due to the multiplicity of multi-view data, we can have meaningful alternative clusterings. In addition, the incomplete multi-view data problem is ubiquitous in real world but has not been studied for multiple clusterings. To address these issues, we introduce a deep incomplete multi-view multiple clusterings (DiMVMC) framework, which achieves the completion of data view and multiple shared representations simultaneously by optimizing multiple groups of decoder deep networks. In addition, it minimizes a redundancy term to simultaneously %uses Hilbert-Schmidt Independence Criterion (HSIC) to control the diversity among these representations and among parameters of different networks. Next, it generates an individual clustering from each of these shared representations. Experiments on benchmark datasets confirm that DiMVMC outperforms the state-of-the-art competitors in generating multiple clusterings with high diversity and quality.
Attention-Based Clustering: Learning a Kernel from Context
Coward, Samuel, Visse-Martindale, Erik, Ramesh, Chithrupa
In machine learning, no data point stands alone. We believe that context is an underappreciated concept in many machine learning methods. We propose Attention-Based Clustering (ABC), a neural architecture based on the attention mechanism, which is designed to learn latent representations that adapt to context within an input set, and which is inherently agnostic to input sizes and number of clusters. By learning a similarity kernel, our method directly combines with any out-of-the-box kernel-based clustering approach. We present competitive results for clustering Omniglot characters and include analytical evidence of the effectiveness of an attention-based approach for clustering.
Regularized K-means through hard-thresholding
Raymaekers, Jakob, Zamar, Ruben H.
We study a framework of regularized $K$-means methods based on direct penalization of the size of the cluster centers. Different penalization strategies are considered and compared through simulation and theoretical analysis. Based on the results, we propose HT $K$-means, which uses an $\ell_0$ penalty to induce sparsity in the variables. Different techniques for selecting the tuning parameter are discussed and compared. The proposed method stacks up favorably with the most popular regularized $K$-means methods in an extensive simulation study. Finally, HT $K$-means is applied to several real data examples. Graphical displays are presented and used in these examples to gain more insight into the datasets.
SST-BERT at SemEval-2020 Task 1: Semantic Shift Tracing by Clustering in BERT-based Embedding Spaces
Vani, K, Mitrovic, Sandra, Antonucci, Alessandro, Rinaldi, Fabio
Lexical semantic change detection (also known as semantic shift tracing) is a task of identifying words that have changed their meaning over time. Unsupervised semantic shift tracing, focal point of SemEval2020, is particularly challenging. Given the unsupervised setup, in this work, we propose to identify clusters among different occurrences of each target word, considering these as representatives of different word meanings. As such, disagreements in obtained clusters naturally allow to quantify the level of semantic shift per each target word in four target languages. To leverage this idea, clustering is performed on contextualized (BERT-based) embeddings of word occurrences. The obtained results show that our approach performs well both measured separately (per language) and overall, where we surpass all provided SemEval baselines.
From Trees to Continuous Embeddings and Back: Hyperbolic Hierarchical Clustering
Chami, Ines, Gu, Albert, Chatziafratis, Vaggos, Ré, Christopher
Similarity-based Hierarchical Clustering (HC) is a classical unsupervised machine learning algorithm that has traditionally been solved with heuristic algorithms like Average-Linkage. Recently, Dasgupta reframed HC as a discrete optimization problem by introducing a global cost function measuring the quality of a given tree. In this work, we provide the first continuous relaxation of Dasgupta's discrete optimization problem with provable quality guarantees. The key idea of our method, HypHC, is showing a direct correspondence from discrete trees to continuous representations (via the hyperbolic embeddings of their leaf nodes) and back (via a decoding algorithm that maps leaf embeddings to a dendrogram), allowing us to search the space of discrete binary trees with continuous optimization. Building on analogies between trees and hyperbolic space, we derive a continuous analogue for the notion of lowest common ancestor, which leads to a continuous relaxation of Dasgupta's discrete objective. We can show that after decoding, the global minimizer of our continuous relaxation yields a discrete tree with a (1 + epsilon)-factor approximation for Dasgupta's optimal tree, where epsilon can be made arbitrarily small and controls optimization challenges. We experimentally evaluate HypHC on a variety of HC benchmarks and find that even approximate solutions found with gradient descent have superior clustering quality than agglomerative heuristics or other gradient based algorithms. Finally, we highlight the flexibility of HypHC using end-to-end training in a downstream classification task.
Manifold Adaptive Multiple Kernel K-Means for Clustering
Du, Liang, Zhang, Haiying, Ren, Xin, Lv, Xiaolin
Multiple kernel methods based on k-means aims to integrate a group of kernels to improve the performance of kernel k-means clustering. However, we observe that most existing multiple kernel k-means methods exploit the nonlinear relationship within kernels, whereas the local manifold structure among multiple kernel space is not sufficiently considered. In this paper, we adopt the manifold adaptive kernel, instead of the original kernel, to integrate the local manifold structure of kernels. Thus, the induced multiple manifold adaptive kernels not only reflect the nonlinear relationship but also the local manifold structure. We then perform multiple kernel clustering within the multiple kernel k-means clustering framework. It has been verified that the proposed method outperforms several state-of-the-art baseline methods on a variety of data sets.
From Twitter to Traffic Predictor: Next-Day Morning Traffic Prediction Using Social Media Data
The effectiveness of traditional traffic prediction methods is often extremely limited when forecasting traffic dynamics in early morning. The reason is that traffic can break down drastically during the early morning commute, and the time and duration of this break-down vary substantially from day to day. Early morning traffic forecast is crucial to inform morning-commute traffic management, but they are generally challenging to predict in advance, particularly by midnight. In this paper, we propose to mine Twitter messages as a probing method to understand the impacts of people's work and rest patterns in the evening/midnight of the previous day to the next-day morning traffic. The model is tested on freeway networks in Pittsburgh as experiments. The resulting relationship is surprisingly simple and powerful. We find that, in general, the earlier people rest as indicated from Tweets, the more congested roads will be in the next morning. The occurrence of big events in the evening before, represented by higher or lower tweet sentiment than normal, often implies lower travel demand in the next morning than normal days. Besides, people's tweeting activities in the night before and early morning are statistically associated with congestion in morning peak hours. We make use of such relationships to build a predictive framework which forecasts morning commute congestion using people's tweeting profiles extracted by 5 am or as late as the midnight prior to the morning. The Pittsburgh study supports that our framework can precisely predict morning congestion, particularly for some road segments upstream of roadway bottlenecks with large day-to-day congestion variation. Our approach considerably outperforms those existing methods without Twitter message features, and it can learn meaningful representation of demand from tweeting profiles that offer managerial insights.
Fast Transformers with Clustered Attention
Vyas, Apoorv, Katharopoulos, Angelos, Fleuret, François
Transformers have been proven a successful model for a variety of tasks in sequence modeling. However, computing the attention matrix, which is their key component, has quadratic complexity with respect to the sequence length, thus making them prohibitively expensive for large sequences. To address this, we propose clustered attention, which instead of computing the attention for every query, groups queries into clusters and computes attention just for the centroids. To further improve this approximation, we use the computed clusters to identify the keys with the highest attention per query and compute the exact key/query dot products. This results in a model with linear complexity with respect to the sequence length for a fixed number of clusters. We evaluate our approach on two automatic speech recognition datasets and show that our model consistently outperforms vanilla transformers for a given computational budget. Finally, we demonstrate that our model can approximate arbitrarily complex attention distributions with a minimal number of clusters by approximating a pretrained BERT model on GLUE and SQuAD benchmarks with only 25 clusters and no loss in performance.