AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Real Elliptically Skewed Distributions and Their Application to Robust Cluster Analysis

Schroth, Christian A., Muma, Michael

arXiv.org Machine LearningJun-30-2020

This article proposes a new class of Real Elliptically Skewed (RESK) distributions and associated clustering algorithms that allow for integrating robustness and skewness into a single unified cluster analysis framework. Non-symmetrically distributed and heavy-tailed data clusters have been reported in a variety of real-world applications. Robustness is essential because a few outlying observations can severely obscure the cluster structure. The RESK distributions are a generalization of the Real Elliptically Symmetric (RES) distributions. To estimate the cluster parameters and memberships, we derive an expectation maximization (EM) algorithm for arbitrary RESK distributions. Special attention is given to a new robust skew-Huber M-estimator, which is also the maximum likelihood estimator (MLE) for the skew-Huber distribution that belongs to the RESK class. Numerical experiments on simulated and real-world data confirm the usefulness of the proposed methods for skewed and heavy-tailed data sets.

artificial intelligence, machine learning, matrix, (16 more...)

arXiv.org Machine Learning

2006.16671

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.04)
North America > United States > New York > New York County > New York City (0.04)
(4 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.66)

Add feedback

The Anatomy of K-means Clustering

#artificialintelligenceJun-29-2020, 08:30:09 GMT

Let’s say you want to classify hundreds (or thousands) of documents based on their content and topics, or you wish to group together different images for some reason. Or what’s even more, let’s think you have that same data already classified but you want to challenge that labeling. You want to know if that data categorization makes sense or not, or can be improved.

artificial intelligence, k-means clustering, machine learning, (1 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.40)

Add feedback

Laplacian Regularized Few-Shot Learning

Ziko, Imtiaz Masud, Dolz, Jose, Granger, Eric, Ayed, Ismail Ben

arXiv.org Machine LearningJun-27-2020

We propose a transductive Laplacian-regularized inference for few-shot tasks. Given any feature embedding learned from the base classes, we minimize a quadratic binary-assignment function containing two terms: (1) a unary term assigning query samples to the nearest class prototype, and (2) a pairwise Laplacian term encouraging nearby query samples to have consistent label assignments. Our transductive inference does not re-train the base model, and can be viewed as a graph clustering of the query set, subject to supervision constraints from the support set. We derive a computationally efficient bound optimizer of a relaxation of our function, which computes independent (parallel) updates for each query sample, while guaranteeing convergence. Following a simple cross-entropy training on the base classes, and without complex meta-learning strategies, we conducted comprehensive experiments over five few-shot learning benchmarks. Our LaplacianShot consistently outperforms state-of-the-art methods by significant margins across different models, settings, and data sets. Furthermore, our transductive inference is very fast, with computational times that are close to inductive inference, and can be used for large-scale few-shot tasks.

artificial intelligence, laplacianshot, machine learning, (18 more...)

arXiv.org Machine Learning

2006.15486

Country:

Europe > Austria > Vienna (0.14)
North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Add feedback

Consistency of Anchor-based Spectral Clustering

de Kergorlay, Henry-Louis, Higham, Desmond John

arXiv.org Machine LearningJun-27-2020

Anchor-based techniques reduce the computational complexity of spectral clustering algorithms. Although empirical tests have shown promising results, there is currently a lack of theoretical support for the anchoring approach. We define a specific anchor-based algorithm and show that it is amenable to rigorous analysis, as well as being effective in practice. We establish the theoretical consistency of the method in an asymptotic setting where data is sampled from an underlying continuous probability distribution. In particular, we provide sharp asymptotic conditions for the algorithm parameters which ensure that the anchor-based method can recover with high probability disjoint clusters that are mutually separated by a positive distance. We illustrate the performance of the algorithm on synthetic data and explain how the theoretical convergence analysis can be used to inform the practical choice of parameter scalings. We also test the accuracy and efficiency of the algorithm on two large scale real data sets. We find that the algorithm offers clear advantages over standard spectral clustering. We also find that it is competitive with the state-of-the-art LSC method of Chen and Cai (Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011), while having the added benefit of a consistency guarantee.

data mining, machine learning, spectral, (16 more...)

arXiv.org Machine Learning

2006.13984

Country:

North America > United States > Rhode Island (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Data Science > Data Mining (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Add feedback

Smile-GANs: Semi-supervised clustering via GANs for dissecting brain disease heterogeneity from medical images

Yang, Zhijian, Wen, Junhao, Davatzikos, Christos

arXiv.org Machine LearningJun-26-2020

Machine learning methods applied to complex biomedical data has enabled the construction of disease signatures of diagnostic/prognostic value. However, less attention has been given to understanding disease heterogeneity. Semi-supervised clustering methods can address this problem by estimating multiple transformations from a (e.g. healthy) control (CN) group to a patient (PT) group, seeking to capture the heterogeneity of underlying pathlogic processes. Herein, we propose a novel method, Smile-GANs (SeMi-supervIsed cLustEring via GANs), for semi-supervised clustering, and apply it to brain MRI scans. Smile-GANs first learns multiple distinct mappings by generating PT from CN, with each mapping characterizing one relatively distinct pathological pattern. Moreover, a clustering model is trained interactively with mapping functions to assign PT into corresponding subtype memberships. Using relaxed assumptions on PT/CN data distribution and imposing mapping non-linearity, Smile-GANs captures heterogeneous differences in distribution between the CN and PT domains. We first validate Smile-GANs using simulated data, subsequently on real data, by demonstrating its potential in characterizing heterogeneity in Alzheimer's Disease (AD) and its prodromal phases. The model was first trained using baseline MRIs from the ADNI2 database and then applied to longitudinal data from ADNI1 and BLSA. Four robust subtypes with distinct neuroanatomical patterns were discovered: 1) normal brain, 2) diffuse atrophy atypical of AD, 3) focal medial temporal lobe atrophy, 4) typical-AD. Further longitudinal analyses discover two distinct progressive pathways from prodromal to full AD: i) subtypes 1 - 2 - 4, and ii) subtypes 1 - 3 - 4. Although demonstrated on an important biomedical problem, Smile-GANs is general and can find application in many biomedical and other domains.

artificial intelligence, machine learning, smile-gan, (17 more...)

arXiv.org Machine Learning

2006.15255

Country:

North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.14)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology > Alzheimer's Disease (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Add feedback

Revisiting Agglomerative Clustering

Tokuda, Eric K., Comin, Cesar H., Costa, Luciano da F.

arXiv.org Machine LearningJun-26-2020

An important issue in clustering concerns the avoidance of false positives while searching for clusters. This work addressed this problem considering agglomerative methods, namely single, average, median, complete, centroid and Ward's approaches applied to unimodal and bimodal datasets obeying uniform, gaussian, exponential and power-law distributions. A model of clusters was also adopted, involving a higher density nucleus surrounded by a transition, followed by outliers. This paved the way to defining an objective means for identifying the clusters from dendrograms. The adopted model also allowed the relevance of the clusters to be quantified in terms of the height of their subtrees. The obtained results include the verification that many methods detect two clusters in unimodal data. The single-linkage method was found to be more resilient to false positives. Also, several methods detected clusters not corresponding directly to the nucleus. The possibility of identifying the type of distribution was also investigated.

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Machine Learning

2005.07995

Country:

South America > Brazil > São Paulo (0.04)
Oceania > Australia > Tasmania (0.04)
North America > United States > Kansas (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Therapeutic Area (0.68)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Outlier Detection -- Theory, Visualizations, and Code

#artificialintelligenceJun-25-2020, 21:37:15 GMT

Outlier Detection is also known as anomaly detection, noise detection, deviation detection, or exception mining. There is no universally accepted definition. An early definition by (Grubbs, 1969) is: An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs. An observation which appears to be inconsistent with the remainder of that set of data. A list of applications that utilize outlier detection according to (Hodge, V.J. and Austin, J., 2014) is: This is analogous to unsupervised clustering.

artificial intelligence, data mining, machine learning, (5 more...)

#artificialintelligence

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.40)

Add feedback

Scalable Spectral Clustering with Nystrom Approximation: Practical and Theoretical Aspects

Pourkamali-Anaraki, Farhad

arXiv.org Machine LearningJun-25-2020

Spectral clustering techniques are valuable tools in signal processing and machine learning for partitioning complex data sets. The effectiveness of spectral clustering stems from constructing a non-linear embedding based on creating a similarity graph and computing the spectral decomposition of the Laplacian matrix. However, spectral clustering methods fail to scale to large data sets because of high computational cost and memory usage. A popular approach for addressing these problems utilizes the Nystrom method, an efficient sampling-based algorithm for computing low-rank approximations to large positive semi-definite matrices. This paper demonstrates how the previously popular approach of Nystrom-based spectral clustering has severe limitations. Existing time-efficient methods ignore critical information by prematurely reducing the rank of the similarity matrix associated with sampled points. Also, current understanding is limited regarding how utilizing the Nystrom approximation will affect the quality of spectral embedding approximations. To address the limitations, this work presents a principled spectral clustering algorithm that makes full use of the information obtained from the Nystrom method. The proposed method exhibits linear scalability in the number of input data points, allowing us to partition large complex data sets. We provide theoretical results to reduce the current gap and present numerical experiments with real and synthetic data. Empirical results demonstrate the efficacy and efficiency of the proposed method compared to existing spectral clustering techniques based on the Nystrom method and other efficient methods. The overarching goal of this work is to provide an improved baseline for future research directions to accelerate spectral clustering.

artificial intelligence, machine learning, spectral, (18 more...)

arXiv.org Machine Learning

2006.1447

Country:

North America > United States > Massachusetts > Middlesex County > Lowell (0.14)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.66)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Tangles: From Weak to Strong Clustering

Elbracht, Christian, Fioravanti, Diego, Klepper, Solveig, Kneip, Jakob, Rendsburg, Luca, Teegen, Maximilian, von Luxburg, Ulrike

arXiv.org Machine LearningJun-25-2020

We introduce a new approach to clustering by using tangles, a tool that originates in mathematical graph theory. Given a collection of "weak partitions" of a data set, tangles provide a framework to aggregate these weak partitions such that they "point in the direction of a cluster". As a result, a cluster is softly characterized by a set of consistent pointers. This mechanism provides a highly flexible way of solving soft clustering problems in a variety of setups, ranging from questionnaires over community detection in graphs to clustering points in metric spaces. Conceptually, tangles have many intriguing properties: (1) Similar to boosting, which combines many weak classifiers to a strong classifier, tangles provide a formal way to combine many weak partitions to obtain few strong clusters. (2) In terms of computational complexity, tangles allow us to use simple, fast algorithms to produce the weak partitions. The complexity of identifying the strong partitions is dominated by the number of weak partitions, not the number of data points, leading to an interesting trade-off between the two. (3) If the weak partitions are interpretable, so are the strong partitions induced by the tangles, resulting in one of the rare algorithms to produce interpretable clusters. (4) The output of tangles is of a hierarchical nature, inducing the notion of a soft dendrogram that can be helpful in data visualization.

artificial intelligence, machine learning, tangle, (17 more...)

arXiv.org Machine Learning

2006.14444

Country:

Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
North America > United States > Wisconsin (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Germany > Hamburg (0.04)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Off-the-grid: Fast and Effective Hyperparameter Search for Kernel Clustering

Ordozgoiti, Bruno, Muñoz, Lluís A. Belanche

arXiv.org Machine LearningJun-24-2020

Kernel functions are a powerful tool to enhance the $k$-means clustering algorithm via the kernel trick. It is known that the parameters of the chosen kernel function can have a dramatic impact on the result. In supervised settings, these can be tuned via cross-validation, but for clustering this is not straightforward and heuristics are usually employed. In this paper we study the impact of kernel parameters on kernel $k$-means. In particular, we derive a lower bound, tight up to constant factors, below which the parameter of the RBF kernel will render kernel $k$-means meaningless. We argue that grid search can be ineffective for hyperparameter search in this context and propose an alternative algorithm for this purpose. In addition, we offer an efficient implementation based on fast approximate exponentiation with provable quality guarantees. Our experimental results demonstrate the ability of our method to efficiently reveal a rich and useful set of hyperparameter values.

algorithm, artificial intelligence, machine learning, (14 more...)

arXiv.org Machine Learning

2006.13567

Country:

Asia > Middle East > Jordan (0.04)
Europe > Spain (0.04)
Europe > Finland > North Karelia > Joensuu (0.04)

Genre: Research Report > New Finding (0.48)

Industry: Energy > Power Industry (0.41)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback