AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Approximating Spectral Clustering via Sampling: a Review

Tremblay, Nicolas, Loukas, Andreas

arXiv.org Machine LearningJan-29-2019

Spectral clustering refers to a family of unsupervised learning algorithms that compute a spectral embedding of the original data based on the eigenvectors of a similarity graph. This non-linear transformation of the data is both the key of these algorithms' success and their Achilles heel: forming a graph and computing its dominant eigenvectors can indeed be computationally prohibitive when dealing with more that a few tens of thousands of points. In this paper, we review the principal research efforts aiming to reduce this computational cost. We focus on methods that come with a theoretical control on the clustering performance and incorporate some form of sampling in their operation. Such methods abound in the machine learning, numerical linear algebra, and graph signal processing literature and, amongst others, include Nystr\"om-approximation, landmarks, coarsening, coresets, and compressive spectral clustering. We present the approximation guarantees available for each and discuss practical merits and limitations. Surprisingly, despite the breadth of the literature explored, we conclude that there is still a gap between theory and practice: the most scalable methods are only intuitively motivated or loosely controlled, whereas those that come with end-to-end guarantees rely on strong assumptions or enable a limited gain of computation time.

algorithm, graph, spectral, (17 more...)

arXiv.org Machine Learning

1901.10204

Country:

North America > United States > New York > New York County > New York City (0.14)
Asia > Middle East > Jordan (0.04)
Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)
(10 more...)

Genre:

Research Report (0.49)
Overview (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Add feedback

Model-based clustering for identifying disease-associated SNPs in case-control genome-wide association studies

Xu, Yan, Xing, Li, Su, Jessica, Zhang, Xuekui, Qiu, Weiliang

arXiv.org Machine LearningJan-29-2019

Genome-wide association studies (GWASs) aim to detect genetic risk factors for complex human diseases by identifying disease-associated single-nucleotide polymorphisms (SNPs). The traditional SNP-wise approach along with multiple testing adjustment is over-conservative and lack of power in many GWASs. In this article, we proposed a model-based clustering method that transforms the challenging high-dimension-small-sample-size problem to low-dimension-large-sample-size problem and borrows information across SNPs by grouping SNPs into three clusters. We pre-specify the patterns of clusters by minor allele frequencies of SNPs between cases and controls, and enforce the patterns with prior distributions. In the simulation studies our proposed novel model outperform traditional SNP-wise approach by showing better controls of false discovery rate (FDR) and higher sensitivity. We re-analyzed two real studies to identifying SNPs associated with severe bortezomib-induced peripheral neuropathy (BiPN) in patients with multiple myeloma (MM). The original analysis in the literature failed to identify SNPs after FDR adjustment. Our proposed method not only detected the reported SNPs after FDR adjustment but also discovered a novel BiPN-associated SNP rs4351714 that has been reported to be related to MM in another study.

association study, snp, snpwise approach, (17 more...)

arXiv.org Machine Learning

1806.08456

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > Massachusetts > Hampshire County > Northampton (0.04)
(2 more...)

Genre: Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Geometric structure of graph Laplacian embeddings

Trillos, Nicolas Garcia, Hoffmann, Franca, Hosseini, Bamdad

arXiv.org Machine LearningJan-29-2019

We analyze the spectral clustering procedure for identifying coarse structure in a data set $x_1, \dots, x_n$, and in particular study the geometry of graph Laplacian embeddings which form the basis for spectral clustering algorithms. More precisely, we assume that the data is sampled from a mixture model supported on a manifold $\mathcal{M}$ embedded in $\mathbb{R}^d$, and pick a connectivity length-scale $\varepsilon>0$ to construct a kernelized graph Laplacian. We introduce a notion of a well-separated mixture model which only depends on the model itself, and prove that when the model is well separated, with high probability the embedded data set concentrates on cones that are centered around orthogonal vectors. Our results are meaningful in the regime where $\varepsilon = \varepsilon(n)$ is allowed to decay to zero at a slow enough rate as the number of data points grows. This rate depends on the intrinsic dimension of the manifold on which the data is supported.

geometric structure, laplacian, orthogonal cone structure, (16 more...)

arXiv.org Machine Learning

1901.10651

Country:

North America > United States > Wisconsin > Dane County > Madison (0.14)
North America > United States > Rhode Island > Providence County > Providence (0.04)
North America > United States > New York (0.04)
(3 more...)

Genre: Research Report (0.70)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Add feedback

Deep Constrained Clustering - Algorithms and Advances

Zhang, Hongjing, Basu, Sugato, Davidson, Ian

arXiv.org Machine LearningJan-28-2019

The area of constrained clustering has been extensively explored by researchers and used by practitioners. Constrained clustering formulations exist for popular algorithms such as k-means, mixture models, and spectral clustering but have several limitations. We explore a deep learning formulation of constrained clustering and in particular explore how it can extend the field of constrained clustering. We show that our formulation can not only handle standard together/apart constraints without the well documented negative effects reported but can also model instance level constraints (level-of-difficulty), cluster level constraints (balancing cluster size) and triplet constraints. The first two are new ways for domain experts to enforce guidance whilst the later importantly allows generating ordering constraints from continuous side-information.

constraint, deep constrained clustering, pairwise constraint, (12 more...)

arXiv.org Machine Learning

1901.10061

Country:

North America > United States > California > Yolo County > Davis (0.04)
Asia > Middle East > Jordan (0.04)
Asia > Afghanistan > Parwan Province > Charikar (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Hierarchically Clustered Representation Learning

Shin, Su-Jin, Song, Kyungwoo, Moon, Il-Chul

arXiv.org Machine LearningJan-28-2019

The joint optimization of representation learning and clustering in the embedding space has experienced a breakthrough in recent years. In spite of the advance, clustering with representation learning has been limited to flat-level categories, which often involves cohesive clustering with a focus on instance relations. To overcome the limitations of flat clustering, we introduce hierarchically-clustered representation learning (HCRL), which simultaneously optimizes representation learning and hierarchical clustering in the embedding space. Compared with a few prior works, HCRL firstly attempts to consider a generation of deep embeddings from every component of the hierarchy, not just leaf components. In addition to obtaining hierarchically clustered embeddings, we can reconstruct data by the various abstraction levels, infer the intrinsic hierarchical structure, and learn the level-proportion features. We conducted evaluations with image and text domains, and our quantitative analyses showed competent likelihoods and the best accuracies compared with the baselines.

level proportion, mixture component, representation, (14 more...)

arXiv.org Machine Learning

1901.09906

Country:

Asia > Middle East > Jordan (0.05)
North America > United States > California > San Diego County > San Diego (0.04)
Asia > South Korea > Daejeon > Daejeon (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.36)

Add feedback

Towards Fair Deep Clustering With Multi-State Protected Variables

Wang, Bokun, Davidson, Ian

arXiv.org Machine LearningJan-28-2019

Fair clustering under the disparate impact doctrine requires that population of each protected group should be approximately equal in every cluster. Previous work investigated a difficult-to-scale pre-processing step for $k$-center and $k$-median style algorithms for the special case of this problem when the number of protected groups is two. In this work, we consider a more general and practical setting where there can be many protected groups. To this end, we propose Deep Fair Clustering, which learns a discriminative but fair cluster assignment function. The experimental results on three public datasets with different types of protected attribute show that our approach can steadily improve the degree of fairness while only having minor loss in terms of clustering quality.

dataset, fair deep clustering, fairness, (13 more...)

arXiv.org Machine Learning

1901.10053

Country:

Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > Oregon > Multnomah County > Portland (0.04)
North America > United States > New York > Richmond County > New York City (0.04)
(12 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Add feedback

Strong Black-box Adversarial Attacks on Unsupervised Machine Learning Models

Chhabra, Anshuman, Roy, Abhishek, Mohapatra, Prasant

arXiv.org Machine LearningJan-27-2019

Machine Learning (ML) and Deep Learning (DL) models have achieved state-of-the-art performance on multiple learning tasks, from vision to natural language modelling. With the growing adoption of ML and DL to many areas of computer science, recent research has also started focusing on the security properties of these models. There has been a lot of work undertaken to understand if (deep) neural network architectures are resilient to black-box adversarial attacks which craft perturbed input samples that fool the classifier without knowing the architecture used. Recent work has also focused on the transferability of adversarial attacks and found that adversarial attacks are generally easily transferable between models, datasets, and techniques. However, such attacks and their analysis have not been covered from the perspective of unsupervised machine learning algorithms. In this paper, we seek to bridge this gap through multiple contributions. We first provide a strong (iterative) black-box adversarial attack that can craft adversarial samples which will be incorrectly clustered irrespective of the choice of clustering algorithm. We choose 4 prominent clustering algorithms, and a real-world dataset to show the working of the proposed adversarial algorithm. Using these clustering algorithms we also carry out a simple study of cross-technique adversarial attack transferability.

adversarial sample, algorithm, transferability, (16 more...)

arXiv.org Machine Learning

1901.09493

Country:

North America > United States > California > Yolo County > Davis (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (0.50)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Information-Theoretic Understanding of Population Risk Improvement with Model Compression

Bu, Yuheng, Gao, Weihao, Zou, Shaofeng, Veeravalli, Venugopal V.

arXiv.org Machine LearningJan-27-2019

We show that model compression can improve the population risk of a pre-trained model, by studying the tradeoff between the decrease in the generalization error and the increase in the empirical risk with model compression. We first prove that model compression reduces an information-theoretic bound on the generalization error; this allows for an interpretation of model compression as a regularization technique to avoid overfitting. We then characterize the increase in empirical risk with model compression using rate distortion theory. These results imply that the population risk could be improved by model compression if the decrease in generalization error exceeds the increase in empirical risk. We show through a linear regression example that such a decrease in population risk due to model compression is indeed possible. Our theoretical results further suggest that the Hessian-weighted $K$-means clustering compression approach can be improved by regularizing the distance between the clustering centers. We provide experiments with neural networks to support our theoretical assertions.

algorithm, generalization error, population risk, (11 more...)

arXiv.org Machine Learning

1901.09421

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Illinois > Champaign County > Champaign (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.35)

Add feedback

Estimating multi-year 24/7 origin-destination demand using high-granular multi-source traffic data

Ma, Wei, Zhen, null, Qian, null

arXiv.org Artificial IntelligenceJan-26-2019

Dynamic origin-destination (OD) demand is central to transportation system modeling and analysis. The dynamic OD demand estimation problem (DODE) has been studied for decades, most of which solve the DODE problem on a typical day or several typical hours. There is a lack of methods that estimate high-resolution dynamic OD demand for a sequence of many consecutive days over several years (referred to as 24/7 OD in this research). Having multi-year 24/7 OD demand would allow a better understanding of characteristics of dynamic OD demands and their evolution/trends over the past few years, a critical input for modeling transportation system evolution and reliability. This paper presents a data-driven framework that estimates day-to-day dynamic OD using high-granular traffic counts and speed data collected over many years. The proposed framework statistically clusters daily traffic data into typical traffic patterns using t-Distributed Stochastic Neighbor Embedding (t-SNE) and k-means methods. A GPU-based stochastic projected gradient descent method is proposed to efficiently solve the multi-year 24/7 DODE problem. It is demonstrated that the new method efficiently estimates the 5-minute dynamic OD demand for every single day from 2014 to 2016 on I-5 and SR-99 in the Sacramento region. The resultant multi-year 24/7 dynamic OD demand reveals the daily, weekly, monthly, seasonal and yearly change in travel demand in a region, implying intriguing demand characteristics over the years.

artificial intelligence, machine learning, od demand, (19 more...)

arXiv.org Artificial Intelligence

doi: 10.1016/j.trc.2018.09.002

1901.09266

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
North America > Canada > Quebec > Montreal (0.04)
Europe > Switzerland > Vaud > Lausanne (0.04)
Europe > Finland > Pirkanmaa > Tampere (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Consumer Products & Services > Travel (0.70)
Transportation > Infrastructure & Services (0.68)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Add feedback

A general model for plane-based clustering with loss function

Wang, Zhen, Shao, Yuan-Hai, Bai, Lan, Li, Chun-Na, Liu, Li-Ming

arXiv.org Machine LearningJan-26-2019

In this paper, we propose a general model for plane-based clustering. The general model contains many existing plane-based clustering methods, e.g., k-plane clustering (kPC), proximal plane clustering (PPC), twin support vector clustering (TWSVC) and its extensions. Under this general model, one may obtain an appropriate clustering method for specific purpose. The general model is a procedure corresponding to an optimization problem, where the optimization problem minimizes the total loss of the samples. Thereinto, the loss of a sample derives from both within-cluster and between-cluster. In theory, the termination conditions are discussed, and we prove that the general model terminates in a finite number of steps at a local or weak local optimal point. Furthermore, based on this general model, we propose a plane-based clustering method by introducing a new loss function to capture the data distribution precisely. Experimental results on artificial and public available datasets verify the effectiveness of the proposed method.

cluster1 cluster 2, general model, loss function, (12 more...)

arXiv.org Machine Learning

1901.09178

Country:

Asia > Mongolia (0.04)
Asia > China > Inner Mongolia > Hohhot (0.04)
North America > United States > New York (0.04)
(5 more...)

Genre: Research Report (0.40)

Industry: Health & Medicine (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback