Clustering
Real Elliptically Skewed Distributions and Their Application to Robust Cluster Analysis
Schroth, Christian A., Muma, Michael
This article proposes a new class of Real Elliptically Skewed (RESK) distributions and associated clustering algorithms that allow for integrating robustness and skewness into a single unified cluster analysis framework. Non-symmetrically distributed and heavy-tailed data clusters have been reported in a variety of real-world applications. Robustness is essential because a few outlying observations can severely obscure the cluster structure. The RESK distributions are a generalization of the Real Elliptically Symmetric (RES) distributions. To estimate the cluster parameters and memberships, we derive an expectation maximization (EM) algorithm for arbitrary RESK distributions. Special attention is given to a new robust skew-Huber M-estimator, which is also the maximum likelihood estimator (MLE) for the skew-Huber distribution that belongs to the RESK class. Numerical experiments on simulated and real-world data confirm the usefulness of the proposed methods for skewed and heavy-tailed data sets.
The Anatomy of K-means Clustering
Let’s say you want to classify hundreds (or thousands) of documents based on their content and topics, or you wish to group together different images for some reason. Or what’s even more, let’s think you have that same data already classified but you want to challenge that labeling. You want to know if that data categorization makes sense or not, or can be improved.
Laplacian Regularized Few-Shot Learning
Ziko, Imtiaz Masud, Dolz, Jose, Granger, Eric, Ayed, Ismail Ben
We propose a transductive Laplacian-regularized inference for few-shot tasks. Given any feature embedding learned from the base classes, we minimize a quadratic binary-assignment function containing two terms: (1) a unary term assigning query samples to the nearest class prototype, and (2) a pairwise Laplacian term encouraging nearby query samples to have consistent label assignments. Our transductive inference does not re-train the base model, and can be viewed as a graph clustering of the query set, subject to supervision constraints from the support set. We derive a computationally efficient bound optimizer of a relaxation of our function, which computes independent (parallel) updates for each query sample, while guaranteeing convergence. Following a simple cross-entropy training on the base classes, and without complex meta-learning strategies, we conducted comprehensive experiments over five few-shot learning benchmarks. Our LaplacianShot consistently outperforms state-of-the-art methods by significant margins across different models, settings, and data sets. Furthermore, our transductive inference is very fast, with computational times that are close to inductive inference, and can be used for large-scale few-shot tasks.
Consistency of Anchor-based Spectral Clustering
de Kergorlay, Henry-Louis, Higham, Desmond John
Anchor-based techniques reduce the computational complexity of spectral clustering algorithms. Although empirical tests have shown promising results, there is currently a lack of theoretical support for the anchoring approach. We define a specific anchor-based algorithm and show that it is amenable to rigorous analysis, as well as being effective in practice. We establish the theoretical consistency of the method in an asymptotic setting where data is sampled from an underlying continuous probability distribution. In particular, we provide sharp asymptotic conditions for the algorithm parameters which ensure that the anchor-based method can recover with high probability disjoint clusters that are mutually separated by a positive distance. We illustrate the performance of the algorithm on synthetic data and explain how the theoretical convergence analysis can be used to inform the practical choice of parameter scalings. We also test the accuracy and efficiency of the algorithm on two large scale real data sets. We find that the algorithm offers clear advantages over standard spectral clustering. We also find that it is competitive with the state-of-the-art LSC method of Chen and Cai (Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011), while having the added benefit of a consistency guarantee.
Smile-GANs: Semi-supervised clustering via GANs for dissecting brain disease heterogeneity from medical images
Yang, Zhijian, Wen, Junhao, Davatzikos, Christos
Machine learning methods applied to complex biomedical data has enabled the construction of disease signatures of diagnostic/prognostic value. However, less attention has been given to understanding disease heterogeneity. Semi-supervised clustering methods can address this problem by estimating multiple transformations from a (e.g. healthy) control (CN) group to a patient (PT) group, seeking to capture the heterogeneity of underlying pathlogic processes. Herein, we propose a novel method, Smile-GANs (SeMi-supervIsed cLustEring via GANs), for semi-supervised clustering, and apply it to brain MRI scans. Smile-GANs first learns multiple distinct mappings by generating PT from CN, with each mapping characterizing one relatively distinct pathological pattern. Moreover, a clustering model is trained interactively with mapping functions to assign PT into corresponding subtype memberships. Using relaxed assumptions on PT/CN data distribution and imposing mapping non-linearity, Smile-GANs captures heterogeneous differences in distribution between the CN and PT domains. We first validate Smile-GANs using simulated data, subsequently on real data, by demonstrating its potential in characterizing heterogeneity in Alzheimer's Disease (AD) and its prodromal phases. The model was first trained using baseline MRIs from the ADNI2 database and then applied to longitudinal data from ADNI1 and BLSA. Four robust subtypes with distinct neuroanatomical patterns were discovered: 1) normal brain, 2) diffuse atrophy atypical of AD, 3) focal medial temporal lobe atrophy, 4) typical-AD. Further longitudinal analyses discover two distinct progressive pathways from prodromal to full AD: i) subtypes 1 - 2 - 4, and ii) subtypes 1 - 3 - 4. Although demonstrated on an important biomedical problem, Smile-GANs is general and can find application in many biomedical and other domains.
Revisiting Agglomerative Clustering
Tokuda, Eric K., Comin, Cesar H., Costa, Luciano da F.
An important issue in clustering concerns the avoidance of false positives while searching for clusters. This work addressed this problem considering agglomerative methods, namely single, average, median, complete, centroid and Ward's approaches applied to unimodal and bimodal datasets obeying uniform, gaussian, exponential and power-law distributions. A model of clusters was also adopted, involving a higher density nucleus surrounded by a transition, followed by outliers. This paved the way to defining an objective means for identifying the clusters from dendrograms. The adopted model also allowed the relevance of the clusters to be quantified in terms of the height of their subtrees. The obtained results include the verification that many methods detect two clusters in unimodal data. The single-linkage method was found to be more resilient to false positives. Also, several methods detected clusters not corresponding directly to the nucleus. The possibility of identifying the type of distribution was also investigated.
Outlier Detection -- Theory, Visualizations, and Code
Outlier Detection is also known as anomaly detection, noise detection, deviation detection, or exception mining. There is no universally accepted definition. An early definition by (Grubbs, 1969) is: An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs. An observation which appears to be inconsistent with the remainder of that set of data. A list of applications that utilize outlier detection according to (Hodge, V.J. and Austin, J., 2014) is: This is analogous to unsupervised clustering.
Scalable Spectral Clustering with Nystrom Approximation: Practical and Theoretical Aspects
Spectral clustering techniques are valuable tools in signal processing and machine learning for partitioning complex data sets. The effectiveness of spectral clustering stems from constructing a non-linear embedding based on creating a similarity graph and computing the spectral decomposition of the Laplacian matrix. However, spectral clustering methods fail to scale to large data sets because of high computational cost and memory usage. A popular approach for addressing these problems utilizes the Nystrom method, an efficient sampling-based algorithm for computing low-rank approximations to large positive semi-definite matrices. This paper demonstrates how the previously popular approach of Nystrom-based spectral clustering has severe limitations. Existing time-efficient methods ignore critical information by prematurely reducing the rank of the similarity matrix associated with sampled points. Also, current understanding is limited regarding how utilizing the Nystrom approximation will affect the quality of spectral embedding approximations. To address the limitations, this work presents a principled spectral clustering algorithm that makes full use of the information obtained from the Nystrom method. The proposed method exhibits linear scalability in the number of input data points, allowing us to partition large complex data sets. We provide theoretical results to reduce the current gap and present numerical experiments with real and synthetic data. Empirical results demonstrate the efficacy and efficiency of the proposed method compared to existing spectral clustering techniques based on the Nystrom method and other efficient methods. The overarching goal of this work is to provide an improved baseline for future research directions to accelerate spectral clustering.
Tangles: From Weak to Strong Clustering
Elbracht, Christian, Fioravanti, Diego, Klepper, Solveig, Kneip, Jakob, Rendsburg, Luca, Teegen, Maximilian, von Luxburg, Ulrike
We introduce a new approach to clustering by using tangles, a tool that originates in mathematical graph theory. Given a collection of "weak partitions" of a data set, tangles provide a framework to aggregate these weak partitions such that they "point in the direction of a cluster". As a result, a cluster is softly characterized by a set of consistent pointers. This mechanism provides a highly flexible way of solving soft clustering problems in a variety of setups, ranging from questionnaires over community detection in graphs to clustering points in metric spaces. Conceptually, tangles have many intriguing properties: (1) Similar to boosting, which combines many weak classifiers to a strong classifier, tangles provide a formal way to combine many weak partitions to obtain few strong clusters. (2) In terms of computational complexity, tangles allow us to use simple, fast algorithms to produce the weak partitions. The complexity of identifying the strong partitions is dominated by the number of weak partitions, not the number of data points, leading to an interesting trade-off between the two. (3) If the weak partitions are interpretable, so are the strong partitions induced by the tangles, resulting in one of the rare algorithms to produce interpretable clusters. (4) The output of tangles is of a hierarchical nature, inducing the notion of a soft dendrogram that can be helpful in data visualization.
Off-the-grid: Fast and Effective Hyperparameter Search for Kernel Clustering
Ordozgoiti, Bruno, Muñoz, Lluís A. Belanche
Kernel functions are a powerful tool to enhance the $k$-means clustering algorithm via the kernel trick. It is known that the parameters of the chosen kernel function can have a dramatic impact on the result. In supervised settings, these can be tuned via cross-validation, but for clustering this is not straightforward and heuristics are usually employed. In this paper we study the impact of kernel parameters on kernel $k$-means. In particular, we derive a lower bound, tight up to constant factors, below which the parameter of the RBF kernel will render kernel $k$-means meaningless. We argue that grid search can be ineffective for hyperparameter search in this context and propose an alternative algorithm for this purpose. In addition, we offer an efficient implementation based on fast approximate exponentiation with provable quality guarantees. Our experimental results demonstrate the ability of our method to efficiently reveal a rich and useful set of hyperparameter values.