AITopics | Clustering

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

Statistical power for cluster analysis

Dalmaijer, E. S., Nord, C. L., Astle, D. E.

arXiv.org Machine Learning, Feb-29-2020

Cluster algorithms are gaining in popularity due to their compelling ability to identify discrete subgroups in data, and their increasing accessibility in mainstream programming languages and statistical software. While researchers can follow guidelines to choose the right algorithms, and to determine what constitutes convincing clustering, there are no firmly established ways of computing a priori statistical power for cluster analysis. Here, we take a simulation approach to estimate power and classification accuracy for popular analysis pipelines. We systematically varied cluster size, number of clusters, number of different features between clusters, effect size within each different feature, and cluster covariance structure in generated datasets. We then subjected these datasets to common dimensionality reduction approaches (none, multi-dimensional scaling, or uniform manifold approximation and projection) and cluster algorithms (k-means; hierarchical agglomerative clustering with Ward linkage and Euclidean distance, or with average linkage and cosine distance; or HDBSCAN). Furthermore, we simulated additional datasets to explore the effect of sample size and cluster separation on statistical power and classification accuracy. We found that clustering outcomes were driven by large effect sizes or the accumulation of many smaller effects across features, and were mostly unaffected by differences in covariance structure. Sufficient statistical power can be achieved with relatively small samples (N=20 per subgroup), provided cluster separation is large ($\Delta=4$). Finally, we discuss whether fuzzy clustering (c-means) could provide a more parsimonious alternative for identifying separable multivariate normal distributions, particularly those with lower centroid separation.
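The simulation idea in this abstract can be illustrated with a small sketch. It is not the authors' pipeline: it assumes two unit-variance Gaussian clusters in two features, reads the separation as the Euclidean distance between centroids, and uses a plain two-cluster k-means with a deterministic farthest-point start. All function names are hypothetical, and only classification accuracy is estimated, not statistical power.

```python
import math
import random


def simulate_two_clusters(n_per_cluster, delta, n_features=2, rng=random):
    """Draw two unit-variance Gaussian clusters whose centroids are `delta` apart."""
    points, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_cluster):
            point = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            point[0] += label * delta  # assumption: shift only along feature 0
            points.append(point)
            labels.append(label)
    return points, labels


def two_means(points, n_iter=25):
    """Plain Lloyd's algorithm for k=2, started from two far-apart data points."""
    first = points[0]
    second = max(points, key=lambda p: math.dist(p, first))
    centroids = [list(first), list(second)]
    assignment = [0] * len(points)
    for _ in range(n_iter):
        assignment = [
            0 if math.dist(p, centroids[0]) <= math.dist(p, centroids[1]) else 1
            for p in points
        ]
        for k in (0, 1):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the previous centroid if a cluster empties
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def accuracy(assignment, labels):
    """Fraction of correctly grouped points, allowing for swapped cluster labels."""
    agree = sum(a == b for a, b in zip(assignment, labels)) / len(labels)
    return max(agree, 1.0 - agree)


def mean_accuracy(n_per_cluster, delta, n_repeats=100, seed=0):
    """Average clustering accuracy over repeated simulated datasets."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_repeats):
        points, labels = simulate_two_clusters(n_per_cluster, delta, rng=rng)
        total += accuracy(two_means(points), labels)
    return total / n_repeats


if __name__ == "__main__":
    for delta in (1.0, 2.0, 4.0):
        print(delta, round(mean_accuracy(20, delta), 3))
```

Running the script prints the mean accuracy for each separation with N=20 per subgroup. Under these assumptions, accuracy rises as the separation grows.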

covariance structure, separation, subgroup, (15 more...)

2003.00381

Country: Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)

Genre:

Research Report > New Finding (0.56)
Research Report > Experimental Study (0.46)

Industry: Health & Medicine > Therapeutic Area (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Estimating Multiple Precision Matrices with Cluster Fusion Regularization

Price, Bradley S., Molstad, Aaron J., Sherwood, Ben

arXiv.org Machine Learning, Feb-29-2020

We propose a penalized likelihood framework for estimating multiple precision matrices from different classes. Most existing methods either incorporate no information on relationships between the precision matrices, or require this information be known a priori. The framework proposed in this article allows for simultaneous estimation of the precision matrices and the relationships between them. Sparse and non-sparse estimators are proposed, both of which require solving a non-convex optimization problem. To compute our proposed estimators, we use an iterative algorithm which alternates between a convex optimization problem solved by blockwise coordinate descent and a k-means clustering problem. Blockwise updates for computing the sparse estimator require solving an elastic net penalized precision matrix estimation problem, which we solve using a proximal gradient descent algorithm. We prove that this subalgorithm has a linear rate of convergence. In simulation studies and two real data applications, we show that our method can outperform competitors that ignore relevant relationships between precision matrices and performs similarly to methods which use prior information often unknown in practice.

algorithm, matrix, precision matrix, (16 more...)

2003.00371

Country:

North America > United States > West Virginia (0.04)
North America > United States > New York (0.04)
North America > United States > Kansas (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.94)

Survival Cluster Analysis

Chapfuwa, Paidamoyo, Li, Chunyuan, Mehta, Nikhil, Carin, Lawrence, Henao, Ricardo

arXiv.org Machine Learning, Feb-29-2020

Conventional survival analysis approaches estimate risk scores or individualized time-to-event distributions conditioned on covariates. In practice, there is often great population-level phenotypic heterogeneity, resulting from (unknown) subpopulations with diverse risk profiles or survival distributions. As a result, there is an unmet need in survival analysis for identifying subpopulations with distinct risk profiles, while jointly accounting for accurate individualized time-to-event predictions. An approach that addresses this need is likely to improve characterization of individual outcomes by leveraging regularities in subpopulations, thus accounting for population-level heterogeneity. In this paper, we propose a Bayesian nonparametrics approach that represents observations (subjects) in a clustered latent space, and encourages accurate time-to-event predictions and clusters (subpopulations) with distinct risk profiles. Experiments on real-world datasets show consistent improvements in predictive performance and interpretability relative to existing state-of-the-art survival analysis models.

covariate, dataset, prediction, (16 more...)

2003.00355

Country: Asia > Middle East > Jordan (0.04)

Genre: Research Report > Experimental Study (0.46)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Therapeutic Area > Endocrinology (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Determination of Latent Dimensionality in International Trade Flow

Truong, Duc P., Skau, Erik, Valtchinov, Vladimir I., Alexandrov, Boian S.

arXiv.org Machine Learning, Feb-28-2020

High-dimensional data is now ubiquitous in data science, which necessitates the development of techniques to decompose and interpret such multidimensional (aka tensor) datasets. Finding a low dimensional representation of the data, that is, its inherent structure, is one of the approaches that can serve to understand the dynamics of low dimensional latent features hidden in the data. Nonnegative RESCAL is one such technique, particularly well suited to analyze self-relational data, such as dynamic networks found in international trade flows. Nonnegative RESCAL computes a low dimensional tensor representation by finding the latent space containing multiple modalities. Estimating the dimensionality of this latent space is crucial for extracting meaningful latent features. Here, to determine the dimensionality of the latent space with nonnegative RESCAL, we propose a latent dimension determination method which is based on clustering of the solutions of multiple realizations of nonnegative RESCAL decompositions. We demonstrate the performance of our model selection method on synthetic data and then we apply our method to decompose a network of international trade flows data from the International Monetary Fund and validate the resulting features against empirical facts from the economic literature.

data mining, dimension, machine learning, (18 more...)

2003.00129

Country:

North America > Mexico (0.15)
North America > Canada (0.15)
Europe > Denmark (0.14)
(23 more...)

Genre: Research Report (0.82)

Industry:

Government > Foreign Policy (1.00)
Banking & Finance > Economy (1.00)
Government > Commerce (0.95)
Government > Regional Government > North America Government > United States Government (0.68)

Technology:

Information Technology > Data Science > Data Mining (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Explainable $k$-Means and $k$-Medians Clustering

Dasgupta, Sanjoy, Frost, Nave, Moshkovitz, Michal, Rashtchian, Cyrus

arXiv.org Machine Learning, Feb-27-2020

Clustering is a popular form of unsupervised learning for geometric data. Unfortunately, many clustering algorithms lead to cluster assignments that are hard to explain, partially because they depend on all the features of the data in a complicated way. To improve interpretability, we consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner. We study this problem from a theoretical viewpoint, measuring cluster quality by the $k$-means and $k$-medians objectives: Must there exist a tree-induced clustering whose cost is comparable to that of the best unconstrained clustering, and if so, how can it be found? In terms of negative results, we show, first, that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost, and second, that any tree-induced clustering must in general incur an $\Omega(\log k)$ approximation factor compared to the optimal clustering. On the positive side, we design an efficient algorithm that produces explainable clusters using a tree with $k$ leaves. For two means/medians, we show that a single threshold cut suffices to achieve a constant factor approximation, and we give nearly-matching lower bounds. For general $k \geq 2$, our algorithm is an $O(k)$ approximation to the optimal $k$-medians and an $O(k^2)$ approximation to the optimal $k$-means. Prior to our work, no algorithms were known with provable guarantees independent of dimension and input size.
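For the two-cluster case, the "single threshold cut" mentioned in this abstract can be pictured with a brute-force sketch. It is an illustration only, not the authors' algorithm: it tries every axis-aligned cut between consecutive distinct feature values and keeps the one with the lowest $k$-means cost. The function names and the toy data are hypothetical.

```python
def cluster_cost(points):
    """Sum of squared distances from each point to the cluster mean (k-means cost)."""
    if not points:
        return 0.0
    mean = [sum(col) / len(points) for col in zip(*points)]
    return sum(sum((x - m) ** 2 for x, m in zip(p, mean)) for p in points)


def best_threshold_cut(points):
    """Return (feature, threshold, cost) of the cheapest single axis-aligned cut."""
    best = None
    n_features = len(points[0])
    for feature in range(n_features):
        ordered = sorted(points, key=lambda p: p[feature])
        for i in range(1, len(ordered)):
            lo, hi = ordered[i - 1][feature], ordered[i][feature]
            if lo == hi:
                continue  # no threshold can separate equal values
            cost = cluster_cost(ordered[:i]) + cluster_cost(ordered[i:])
            if best is None or cost < best[2]:
                best = (feature, (lo + hi) / 2, cost)
    return best


# Toy data: two groups that differ only in the first feature.
data = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
feature, threshold, cost = best_threshold_cut(data)
print(feature, threshold, cost)  # 0 5.0 1.0
```

The resulting clustering is explainable in the abstract's sense: membership is decided by one comparison, here "feature 0 below 5.0".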

algorithm, approximation, threshold, (16 more...)

2002.12538

Country:

North America > United States > California > San Diego County > San Diego (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
Asia > Afghanistan > Parwan Province > Charikar (0.04)

Genre: Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Supervised Enhanced Soft Subspace Clustering (SESSC) for TSK Fuzzy Classifiers

Cui, Yuqi, Wang, Huidong, Wu, Dongrui

arXiv.org Machine Learning, Feb-27-2020

Fuzzy c-means based clustering algorithms are frequently used for Takagi-Sugeno-Kang (TSK) fuzzy classifier antecedent parameter estimation. One rule is initialized from each cluster. However, most of these clustering algorithms are unsupervised, which wastes valuable label information in the training data. This paper proposes a supervised enhanced soft subspace clustering (SESSC) algorithm, which considers simultaneously the within-cluster compactness, between-cluster separation, and label information in clustering. It can effectively deal with high-dimensional data, be used as a classifier alone, or be integrated into a TSK fuzzy classifier to further improve its performance. Experiments on nine UCI datasets from various application domains demonstrated that SESSC based initialization outperformed other clustering approaches, especially when the number of rules is small.

algorithm, classifier, sessc, (14 more...)

2002.12404

Country:

North America > United States > Wisconsin (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
North America > United States > Texas > Dallas County > Dallas (0.04)
(2 more...)

Genre: Research Report (0.64)

Industry: Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Simultaneous prediction and community detection for networks with application to neuroimaging

Arroyo, Jesús, Levina, Elizaveta

arXiv.org Machine Learning, Feb-27-2020

Community structure in networks is observed in many different domains, and unsupervised community detection has received a lot of attention in the literature. Increasingly the focus of network analysis is shifting towards using network information in some other prediction or inference task rather than just analyzing the network itself. In particular, in neuroimaging applications brain networks are available for multiple subjects and the goal is often to predict a phenotype of interest. Community structure is well known to be a feature of brain networks, typically corresponding to different regions of the brain responsible for different functions. There are standard parcellations of the brain into such regions, usually obtained by applying clustering methods to brain connectomes of healthy subjects. However, when the goal is predicting a phenotype or distinguishing between different conditions, these static communities from an unrelated set of healthy subjects may not be the most useful for prediction. Here we present a method for supervised community detection, aiming to find a partition of the network into communities that is most useful for predicting a particular response. We use a block-structured regularization penalty combined with a prediction loss function, and compute the solution with a combination of a spectral method and an ADMM optimization algorithm. We show that the spectral clustering method recovers the correct communities under a weighted stochastic block model. The method performs well on both simulated and real brain networks, providing support for the idea of task-dependent brain regions.

coefficient, community detection, matrix, (14 more...)

2002.01645

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > Michigan (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Health Care Technology (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.86)

Multi-objective Consensus Clustering Framework for Flight Search Recommendation

Chatterjee, Sujoy, Pasquier, Nicolas, Nanty, Simon, Zuluaga, Maria A.

arXiv.org Artificial Intelligence, Feb-26-2020

To provide personalized recommendations for travel searches, an appropriate segmentation of customers is required. Clustering ensemble approaches were developed to overcome a well-known problem of classical clustering approaches: each relies on a different theoretical model and can thus identify in the data space only clusters corresponding to this model. Clustering ensemble approaches combine multiple clustering results, each from a different algorithmic configuration, to generate more robust consensus clusters corresponding to agreements between initial clusters. We present a new clustering ensemble multi-objective optimization-based framework developed for analyzing Amadeus customer search data and improving personalized recommendations. This framework optimizes diversity in the clustering ensemble search space and automatically determines an appropriate number of clusters without requiring user input. Experimental results compare the efficiency of this approach with other existing approaches on Amadeus customer search data in terms of internal (Adjusted Rand Index) and external (Amadeus business metric) validations.

algorithm, customer, recommendation, (13 more...)

2002.10241

Country: Europe > France > Provence-Alpes-Côte d'Azur (0.04)

Genre: Research Report > Promising Solution (0.46)

Industry: Consumer Products & Services > Travel (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.74)

Compact Representation of Uncertainty in Hierarchical Clustering

Greenberg, Craig S., Macaluso, Sebastian, Monath, Nicholas, Lee, Ji-Ah, Flaherty, Patrick, Cranmer, Kyle, McGregor, Andrew, McCallum, Andrew

arXiv.org Machine Learning, Feb-26-2020

Hierarchical clustering is a fundamental task often used to discover meaningful structures in data, such as phylogenetic trees, taxonomies of concepts, subtypes of cancer, and cascades of particle decays in particle physics. When multiple hierarchical clusterings of the data are possible, it is useful to represent uncertainty in the clustering through various probabilistic quantities. Existing approaches represent uncertainty for a range of models; however, they only provide approximate inference. This paper presents dynamic-programming algorithms and proofs for exact inference in hierarchical clustering. We are able to compute the partition function, MAP hierarchical clustering, and marginal probabilities of sub-hierarchies and clusters. Our method supports a wide range of hierarchical models and only requires a cluster compatibility function. Rather than scaling with the number of hierarchical clusterings of $n$ elements ($\omega(n n! / 2^{n-1})$), our approach runs in time and space proportional to the significantly smaller powerset of $n$. Despite still being large, these algorithms enable exact inference in small-data applications and are also interesting from a theoretical perspective. We demonstrate the utility of our method and compare its performance with respect to existing approximate methods.
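To give a sense of the gap this abstract describes, the sketch below compares the number of hierarchies with the size of the powerset. It assumes hierarchies are rooted binary trees over $n$ labelled leaves, for which the count is the double factorial $(2n-3)!!$. The abstract itself states only a lower bound, and the function name is hypothetical.

```python
def count_binary_hierarchies(n):
    """(2n - 3)!!: number of rooted binary trees over n labelled leaves (n >= 2)."""
    count = 1
    for odd in range(3, 2 * n - 2, 2):
        count *= odd
    return count


for n in (5, 10, 20):
    # Columns: n, number of binary hierarchies, size of the powerset of n elements.
    print(n, count_binary_hierarchies(n), 2 ** n)
```

For $n = 10$ this gives 34,459,425 hierarchies against a powerset of size 1,024.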

algorithm, hierarchy, partition function, (15 more...)

2002.11661

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
Asia > Afghanistan > Parwan Province > Charikar (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.68)
Health & Medicine > Therapeutic Area > Oncology (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Variational Wasserstein Barycenters for Geometric Clustering

Mi, Liang, Yu, Tianshu, Bento, Jose, Zhang, Wen, Li, Baoxin, Wang, Yalin

arXiv.org Machine Learning, Feb-24-2020

We propose to compute Wasserstein barycenters (WBs) by solving for Monge maps with a variational principle. We discuss the metric properties of WBs and explore their connections, especially the connections of Monge WBs, to K-means clustering and co-clustering. We also discuss the feasibility of Monge WBs on unbalanced measures and spherical domains. We propose two new problems -- regularized K-means and Wasserstein barycenter compression. We demonstrate the use of variational WBs (VWBs) in solving these clustering-related problems.

barycenter, optimal transport, variational wasserstein barycenter, (13 more...)

2002.10543

Country: North America > United States > Arizona (0.04)

Genre:

Overview (0.66)
Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
