Clustering
Unbalanced Incomplete Multi-view Clustering via the Scheme of View Evolution: Weak Views are Meat; Strong Views do Eat
Fang, Xiang, Hu, Yuchong, Zhou, Pan, Liu, Xiao-Yang, Wu, Dapeng Oliver
Incomplete multi-view clustering is an important technique to deal with real-world incomplete multi-view data. Previous works assume that all views have the same incompleteness, i.e., balanced incompleteness. However, different views often have distinct incompleteness, i.e., unbalanced incompleteness, which results in strong views (low-incompleteness views) and weak views (high-incompleteness views). The unbalanced incompleteness prevents us from directly using the previous methods for clustering. In this paper, inspired by the effective biological evolution theory, we design the novel scheme of view evolution to cluster strong and weak views. Moreover, we propose an Unbalanced Incomplete Multi-view Clustering method (UIMC), which is the first effective method based on view evolution for unbalanced incomplete multi-view clustering. Compared with previous methods, UIMC has two unique advantages: 1) it proposes weighted multi-view subspace clustering to integrate these unbalanced incomplete views, which effectively solves the unbalanced incomplete multi-view problem; 2) it designs the low-rank and robust representation to recover the data, which diminishes the impact of the incompleteness and noises. Extensive experimental results demonstrate that UIMC improves the clustering performance by up to 40% on three evaluation metrics over other state-of-the-art methods.
GL-Coarsener: A Graph representation learning framework to construct coarse grid hierarchy for AMG solvers
Namazi, Reza, Zolanvari, Arsham, Sani, Mahdi, Ghahramani, Seyed Amir Ali Ghafourian
In many numerical schemes, the computational complexity scales non-linearly with the problem size. Solving a linear system of equations using direct methods or most iterative methods is a typical example. Algebraic multi-grid (AMG) methods are numerical methods used to solve large linear systems of equations efficiently. One of the main differences between AMG methods is how the coarser grid is constructed from a given fine grid. There are two main classes of AMG methods; graph and aggregation based coarsening methods. Here we propose an aggregation-based coarsening framework leveraging graph representation learning and clustering algorithms. Our method introduces the power of machine learning into the AMG research field and opens a new perspective for future researches. The proposed method uses graph representation learning techniques to learn latent features of the graph obtained from the underlying matrix of coefficients. Using these extracted features, we generated a coarser grid from the fine grid. The proposed method is highly capable of parallel computations. Our experiments show that the proposed method's efficiency in solving large systems is closely comparable with other aggregation-based methods, demonstrating the high capability of graph representation learning in designing multi-grid solvers.
Introduction to Core-sets: an Updated Survey
In optimization or machine learning problems we are given a set of items, usually points in some metric space, and the goal is to minimize or maximize an objective function over some space of candidate solutions. For example, in clustering problems, the input is a set of points in some metric space, and a common goal is to compute a set of centers in some other space (points, lines) that will minimize the sum of distances to these points. In database queries, we may need to compute such a some for a specific query set of k centers. However, traditional algorithms cannot handle modern systems that require parallel real-time computations of infinite distributed streams from sensors such as GPS, audio or video that arrive to a cloud, or networks of weaker devices such as smartphones or robots. Core-set is a "small data" summarization of the input "big data", where every possible query has approximately the same answer on both data sets. Generic techniques enable efficient coreset maintenance of streaming, distributed and dynamic data. Traditional algorithms can then be applied on these coresets to maintain the approximated optimal solutions. The challenge is to design coresets with provable tradeoff between their size and approximation error. This survey summarizes such constructions in a retrospective way, that aims to unified and simplify the state-of-the-art. Bringing big data to the enterprise, 2012) are generated by cheap and numerous information-sensing mobile devices, remote sensing, software logs, cameras, microphones, RFID readers and wireless sensor networks (Segaran & Hammerbacher, 2009; Hellerstein, 2008; Funke & Laue, 2007). These require clustering algorithms that, unlike traditional algorithms, (a) learn unbounded streaming data that cannot fit into main memory, (b) run in parallel on distributed data among thousands of machines, (c) use low communication between the machines (d) apply real-time computations on the device, (e) handle privacy and security issues. A common approach is to reinvent computer science for handling these new computational models, and develop new algorithms "from scratch" independently of existing solutions.
Cluster-Specific Predictions with Multi-Task Gaussian Processes
Leroy, Arthur, Latouche, Pierre, Guedj, Benjamin, Gey, Servane
A model involving Gaussian processes (GPs) is introduced to simultaneously handle multi-task learning, clustering, and prediction for multiple functional data. This procedure acts as a model-based clustering method for functional data as well as a learning step for subsequent predictions for new tasks. The model is instantiated as a mixture of multi-task GPs with common mean processes. A variational EM algorithm is derived for dealing with the optimisation of the hyper-parameters along with the hyper-posteriors' estimation of latent variables and processes. We establish explicit formulas for integrating the mean processes and the latent clustering variables within a predictive distribution, accounting for uncertainty on both aspects. This distribution is defined as a mixture of cluster-specific GP predictions, which enhances the performances when dealing with group-structured data. The model handles irregular grid of observations and offers different hypotheses on the covariance structure for sharing additional information across tasks. The performances on both clustering and prediction tasks are assessed through various simulated scenarios and real datasets. The overall algorithm, called MagmaClust, is publicly available as an R package.
Phenotyping Clusters of Patient Trajectories suffering from Chronic Complex Disease
Aguiar, Henrique, Santos, Mauro, Watkinson, Peter, Zhu, Tingting
Recent years have seen an increased focus into the tasks of predicting hospital inpatient risk of deterioration and trajectory evolution due to the availability of electronic patient data. A common approach to these problems involves clustering patients time-series information such as vital sign observations) to determine dissimilar subgroups of the patient population. Most clustering methods assume time-invariance of vital-signs and are unable to provide interpretability in clusters that is clinically relevant, for instance, event or outcome information. In this work, we evaluate three different clustering models on a large hospital dataset of vital-sign observations from patients suffering from Chronic Obstructive Pulmonary Disease. We further propose novel modifications to deal with unevenly sampled time-series data and unbalanced class distribution to improve phenotype separation. Lastly, we discuss further avenues of investigation for models to learn patient subgroups with distinct behaviour and phenotype.
Hierarchical clustering in particle physics through reinforcement learning
Brehmer, Johann, Macaluso, Sebastian, Pappadopulo, Duccio, Cranmer, Kyle
Particle physics experiments often require the reconstruction of decay patterns through a hierarchical clustering of the observed final-state particles. We show that this task can be phrased as a Markov Decision Process and adapt reinforcement learning algorithms to solve it. In particular, we show that Monte-Carlo Tree Search guided by a neural policy can construct high-quality hierarchical clusterings and outperform established greedy and beam search baselines.
Automatic selection of clustering algorithms using supervised graph embedding
Cohen-Shapira, Noy, Rokach, Lior
The widespread adoption of machine learning (ML) techniques and the extensive expertise required to apply them have led to increased interest in automated ML solutions that reduce the need for human intervention. One of the main challenges in applying ML to previously unseen problems is algorithm selection - the identification of high-performing algorithm(s) for a given dataset, task, and evaluation measure. This study addresses the algorithm selection challenge for data clustering, a fundamental task in data mining that is aimed at grouping similar objects. We present MARCO-GE, a novel meta-learning approach for the automated recommendation of clustering algorithms. MARCO-GE first transforms datasets into graphs and then utilizes a graph convolutional neural network technique to extract their latent representation. Using the embedding representations obtained, MARCO-GE trains a ranking meta-model capable of accurately recommending top-performing algorithms for a new dataset and clustering evaluation measure. Extensive evaluation on 210 datasets, 13 clustering algorithms, and 10 clustering measures demonstrates the effectiveness of our approach and its dominance in terms of predictive and generalization performance over state-of-the-art clustering meta-learning approaches.
PC-GAIN: Pseudo-label Conditional Generative Adversarial Imputation Networks for Incomplete Data
Wang, Yufeng, Li, Dan, Li, Xiang, Yang, Min
Datasets with missing values are very common in real world applications. GAIN, a recently proposed deep generative model for missing data imputation, has been proved to outperform many state-of-the-art methods. But GAIN only uses a reconstruction loss in the generator to minimize the imputation error of the non-missing part, ignoring the potential category information which can reflect the relationship between samples. In this paper, we propose a novel unsupervised missing data imputation method named PC-GAIN, which utilizes potential category information to further enhance the imputation power. Specifically, we first propose a pre-training procedure to learn potential category information contained in a subset of low-missing-rate data. Then an auxiliary classifier is determined based on the synthetic pseudo-labels. Further, this classifier is incorporated into the generative adversarial framework to help the generator to yield higher quality imputation results. The proposed method can significantly improve the imputation quality of GAIN. Experimental results on various benchmark datasets show that our method is also superior to other baseline models.
Estimation of the number of clusters on d-dimensional sphere
Spherical data is distributed on the sphere. The data appears in various fields such as meteorology, biology, and natural language processing. However, a method for analysis of spherical data does not develop enough yet. One of the important issues is an estimation of the number of clusters in spherical data. To address the issue, I propose a new method called the Spherical X-means (SX-means) that can estimate the number of clusters on d-dimensional sphere. The SX-means is the model-based method assuming that the data is generated from a mixture of von Mises-Fisher distributions. The present paper explains the proposed method and shows its performance of estimation of the number of clusters.