Clustering
On Controlling the Size of Clusters in Probabilistic Clustering
Jitta, Aditya (University of Helsinki) | Klami, Arto (University of Helsinki)
Classical model-based partitional clustering algorithms, such ask-means or mixture of Gaussians, provide only loose and indirect control over the size of the resulting clusters. In this work, we present a family of probabilistic clustering models that can be steered towards clusters of desired size by providing a prior distribution over the possible sizes, allowing the analyst to fine-tune exploratory analysis or to produce clusters of suitable size for future down-stream processing.Our formulation supports arbitrary multimodal prior distributions, generalizing the previous work on clustering algorithms searching for clusters of equal size or algorithms designed for the microclustering task of finding small clusters. We provide practical methods for solving the problem, using integer programming for making the cluster assignments, and demonstrate that we can also automatically infer the number of clusters.
Product Quantized Translation for Fast Nearest Neighbor Search
Hwang, Yoonho (Pohang University of Science and Technology (POSTECH)) | Baek, Mooyeol (Pohang University of Science and Technology (POSTECH)) | Kim, Saehoon (Pohang University of Science and Technology (POSTECH)) | Han, Bohyung (Pohang University of Science and Technology (POSTECH)) | Ahn, Hee-Kap (Pohang University of Science and Technology (POSTECH))
This paper proposes a simple nearest neighbor search algorithm, which provides the exact solution in terms of the Euclidean distance efficiently. Especially, we present an interesting approach to improve the speed of nearest neighbor search by proper translations of data and query although the task is inherently invariant to the Euclidean transformations. The proposed algorithm aims to eliminate nearest neighbor candidates effectively using their distance lower bounds in nonlinear embedded spaces, and further improves the lower bounds by transforming data and query through product quantized translations. Although our framework is composed of simple operations only, it achieves the state-of-the-art performance compared to existing nearest neighbor search techniques, which is illustrated quantitatively using various large-scale benchmark datasets in different sizes and dimensions.
Interactive Machine Learning at Scale With CHISSL
Arendt, Dustin (Pacific Northwest National Laboratory) | Grace, Emily (Pacific Northwest National Laboratory ) | Volkova, Svitlana (Pacific Northwest National Laboratory)
We demonstrate CHISSL a scalable client-server system for real-time interactive machine learning. Our system is capable of incorporating user feedback incrementally and immediately without a pre-defined prediction task. Computation is partitioned between a lightweight web-client and a heavyweight server. The server relies on representation learning and off-the-shelf agglomerative clustering to find a dendrogram, which we use to quickly approximate distances in the representation space. The client, using only this dendrogram, incorporates user feedback via transduction. Distances and predictions for each unlabeled instance are updated incrementally and deterministically, with O(n) space and time complexity. Our algorithm is implemented in a functional prototype, designed to be easy to use by non-experts. The prototype organizes the large amounts of data into recommendations. This allows the user to interact with actual instances by dragging and dropping to provide feedback in an intuitive manner. We applied CHISSL to several domains including cyber, social media, and geo-temporal analysis.
A Semi-Supervised Network Embedding Model for Protein Complexes Detection
Zhao, Wei (SIAT, Chinese Academy of Sciences) | Zhu, Jia (South China Normal University) | Yang, Min (SIAT, Chinese Academy of Sciences) | Xiao, Danyang (South China Normal University) | Fung, Gabriel Pui Cheong (The Chinese University of Hong Kong) | Chen, Xiaojun (Shenzhen University)
Protein complex is a group of associated polypeptide chains which plays essential roles in biological process. Given a graph representing protein-protein interactions (PPI) network, it is critical but non-trivial to detect protein complexes.In this paper, we propose a semi-supervised network embedding model by adopting graph convolutional networks to effectively detect densely connected subgraphs. We conduct extensive experiment on two popular PPI networks with various data sizes and densities. The experimental results show our approach achieves state-of-the-art performance.
Deep Embedding for Determining the Number of Clusters
Wang, Yiqi (National University of Defense Technology) | Shi, Zhan (University of Texas at Austin) | Guo, Xifeng (National University of Defense Technology) | Liu, Xinwang (National University of Defense Technology) | Zhu, En (National University of Defense Technology) | Yin, Jianping (Dongguan University of Technology)
Determining the number of clusters is important but challenging, especially for data of high dimension. In this paper, we propose Deep Embedding Determination (DED), a method that can solve jointly for the unknown number of clusters and feature extraction. DED first combines the virtues of the convolutional autoencoder and the t-SNE technique to extract low dimensional embedded features. Then it determines the number of clusters using an improved density-based clustering algorithm. Our experimental evaluation on image datasets shows significant improvement over state-of-the-art methods and robustness with respect to hyperparameter settings.
Discovering Program Topoi Through Clustering
Ieva, Carlo (Simula Research Laboratory) | Gotlieb, Arnaud (Simula Research Laboratory) | Kaci, Souhila (LIRMM, University of Montpellier) | Lazaar, Nadjib (LIRMM, University of Montpellier)
Understanding source code of large open-source software projects is very challenging when there is only little documentation. New developers face the task of classifying a huge number of files and functions without any help. This paper documents a novel approach to this problem, called FEAT, that automatically extracts topoi from source code by using hierarchical agglomerative clustering. Program topoi summarize the main capabilities of a software system by presenting to developers clustered lists of functions together with an index of their relevant words. The clustering method used in FEAT exploits a new hybrid distance which combines both textual and structural elements automatically extracted from source code and comments. The experimental evaluation of FEAT shows that this approach is suitable to understand open-source software projects of size approaching 2,000 functions and 150 files, which opens the door for its deployment in the open-source community.
Leveraging Lexical Substitutes for Unsupervised Word Sense Induction
Alagiฤ, Domagoj (University of Zagreb) | ล najder, Jan (University of Zagreb) | Padรณ, Sebastian (University of Stuttgart,ย Institut fรผr Maschinelle Sprachverarbeitung)
Word sense induction is the most prominent unsupervised approach to lexical disambiguation. It clusters word instances, typically represented by their bag-of-words contexts. Therefore, uninformative and ambiguous contexts present a major challenge. In this paper, we investigate the use of an alternative instance representation based on lexical substitutes , i.e., contextually suitable, meaning-preserving replacements. Using lexical substitutes predicted by a state-of-the-art automatic system and a simple clustering algorithm, we outperform bag-of-words instance representations and compete with much more complex structured probabilistic models. Furthermore, we show that an oracle based on manually-labeled lexical substitutes yields yet substantially higher performance. Taken together, this provides evidence for a complementarity between word sense induction and lexical substitution that has not been given much consideration before.
Label Distribution Learning by Exploiting Sample Correlations Locally
Zheng, Xiang (Nanjing University of Science and Technology) | Jia, Xiuyi (Nanjing University of Science and Technology) | Li, Weiwei (Nanjing University of Aeronautics and Astronautics)
Label distribution learning (LDL) is a novel multi-label learning paradigm proposed in recent years for solving label ambiguity. Existing approaches typically exploit label correlations globally to improve the effectiveness of label distribution learning, by assuming that the label correlations are shared by all instances. However, different instances may share different label correlations, and few correlations are globally applicable in real-world applications. In this paper, we propose a new label distribution learning algorithm by exploiting sample correlations locally (LDL-SCL). To encode the influence of local samples, we design a local correlation vector for each instance based on the clustered local samples. Then we predict the label distribution for an unseen instance based on the original features and the local correlation vector simultaneously. Experimental results demonstrate that LDL-SCL can effectively deal with the label distribution problems and perform remarkably better than the state-of-the-art LDL methods.
Optimal Margin Distribution Clustering
Zhang, Teng (Nanjing University) | Zhou, Zhi-Hua (Nanjing University)
Maximum margin clustering (MMC), which borrows the large margin heuristic from support vector machine (SVM), has achieved more accurate results than traditional clustering methods. The intuition is that, for a good clustering, when labels are assigned to different clusters, SVM can achieve a large minimum margin on this data. Recent studies, however, disclosed that maximizing the minimum margin does not necessarily lead to better performance, and instead, it is crucial to optimize the margin distribution. In this paper, we propose a novel approach ODMC (Optimal margin Distribution Machine for Clustering), which tries to cluster the data and achieve optimal margin distribution simultaneously. Specifically, we characterize the margin distribution by the first- and second-order statistics, i.e., the margin mean and variance, and extend a stochastic mirror descent method to solve the resultant minimax problem. Moreover, we prove theoretically that ODMC has the same convergence rate with state-of-the-art cutting plane based algorithms but involves much less computation cost per iteration, so our method is much more scalable than existing approaches. Extensive experiments on UCI data sets show that ODMC is significantly better than compared methods, which verifies the superiority of optimal margin distribution learning.
New l 2,1 -Norm Relaxation of Multi-Way Graph Cut for Clustering
Yang, Xu (Xidian University) | Deng, Cheng (Xidian University) | Liu, Xianglong (Beihang University) | Nie, Feiping (Northwestern Polytechnical University)
The clustering methods have absorbed even-increasing attention in machine learning and computer vision communities in recent years. Exploring manifold information in multi-way graph cut clustering, such as ratio cut clustering, has shown its promising performance. However, traditional multi-way ratio cut clustering method is NP-hard and thus the spectral solution may deviate from the optimal one. In this paper, we propose a new relaxed multi-way graph cut clustering method, where l 2,1 -norm distance instead of squared distance is utilized to preserve the solution having much more clearer cluster structures. Furthermore, the resulting solution is constrained with normalization to obtain more sparse representation, which can encourage the solution to contain more discrete values with many zeros. For the objective function, it is very difficult to optimize due to minimizing the ratio of two non-smooth items. To address this problem, we transform the objective function into a quadratic problem on the Stiefel manifold (QPSM), and introduce a novel yet efficient iterative algorithm to solve it. Experimental results on several benchmark datasets show that our method significantly outperforms several state-of-the-art clustering approaches.