cluster allocation
Random Models for Fuzzy Clustering Similarity Measures
DeWolfe, Ryan, Andrews, Jeffery L.
The Adjusted Rand Index (ARI) is a widely used method for comparing hard clusterings, but requires a choice of random model that is often left implicit. Several recent works have extended the Rand Index to fuzzy clusterings, but the assumptions of the most common random model is difficult to justify in fuzzy settings. We propose a single framework for computing the ARI with three random models that are intuitive and explainable for both hard and fuzzy clusterings, along with the benefit of lower computational complexity. The theory and assumptions of the proposed models are contrasted with the existing permutation model. Computations on synthetic and benchmark data show that each model has distinct behaviour, meaning that accurate model selection is important for the reliability of results.
clusterBMA: Bayesian model averaging for clustering
Forbes, Owen, Santos-Fernandez, Edgar, Wu, Paul Pao-Yen, Xie, Hong-Bo, Schwenn, Paul E., Lagopoulos, Jim, Mills, Lia, Sacks, Dashiell D., Hermens, Daniel F., Mengersen, Kerrie
Various methods have been developed to combine inference across multiple sets of results for unsupervised clustering, within the ensemble clustering literature. The approach of reporting results from one `best' model out of several candidate clustering models generally ignores the uncertainty that arises from model selection, and results in inferences that are sensitive to the particular model and parameters chosen. Bayesian model averaging (BMA) is a popular approach for combining results across multiple models that offers some attractive benefits in this setting, including probabilistic interpretation of the combined cluster structure and quantification of model-based uncertainty. In this work we introduce clusterBMA, a method that enables weighted model averaging across results from multiple unsupervised clustering algorithms. We use clustering internal validation criteria to develop an approximation of the posterior model probability, used for weighting the results from each model. From a consensus matrix representing a weighted average of the clustering solutions across models, we apply symmetric simplex matrix factorisation to calculate final probabilistic cluster allocations. In addition to outperforming other ensemble clustering methods on simulated data, clusterBMA offers unique features including probabilistic allocation to averaged clusters, combining allocation probabilities from 'hard' and 'soft' clustering algorithms, and measuring model-based uncertainty in averaged cluster allocation. This method is implemented in an accompanying R package of the same name.
BasisVAE: Translation-invariant feature-level clustering with Variational Autoencoders
Mรคrtens, Kaspar, Yau, Christopher
Variational Autoencoders (VAEs) provide a flexible and scalable framework for non-linear dimensionality reduction. However, in application domains such as genomics where data sets are typically tabular and high-dimensional, a black-box approach to dimensionality reduction does not provide sufficient insights. Common data analysis workflows additionally use clustering techniques to identify groups of similar features. This usually leads to a two-stage process, however, it would be desirable to construct a joint modelling framework for simultaneous dimensionality reduction and clustering of features. In this paper, we propose to achieve this through the BasisVAE: a combination of the VAE and a probabilistic clustering prior, which lets us learn a one-hot basis function representation as part of the decoder network. Furthermore, for scenarios where not all features are aligned, we develop an extension to handle translation-invariant basis functions. We show how a collapsed variational inference scheme leads to scalable and efficient inference for BasisVAE, demonstrated on various toy examples as well as on single-cell gene expression data.
A Probabilistic framework for Quantum Clustering
Casaรฑa-Eslava, Raรบl V., Lisboa, Paulo J. G., Ortega-Martorell, Sandra, Jarman, Ian H., Martรญn-Guerrero, Josรฉ D.
Quantum Clustering is a powerful method to detect clusters in data with mixed density. However, it is very sensitive to a length parameter that is inherent to the Schr\"odinger equation. In addition, linking data points into clusters requires local estimates of covariance that are also controlled by length parameters. This raises the question of how to adjust the control parameters of the Schr\"odinger equation for optimal clustering. We propose a probabilistic framework that provides an objective function for the goodness-of-fit to the data, enabling the control parameters to be optimised within a Bayesian framework. This naturally yields probabilities of cluster membership and data partitions with specific numbers of clusters. The proposed framework is tested on real and synthetic data sets, assessing its validity by measuring concordance with known data structure by means of the Jaccard score (JS). This work also proposes an objective way to measure performance in unsupervised learning that correlates very well with JS.
On Controlling the Size of Clusters in Probabilistic Clustering
Jitta, Aditya (University of Helsinki) | Klami, Arto (University of Helsinki)
Classical model-based partitional clustering algorithms, such ask-means or mixture of Gaussians, provide only loose and indirect control over the size of the resulting clusters. In this work, we present a family of probabilistic clustering models that can be steered towards clusters of desired size by providing a prior distribution over the possible sizes, allowing the analyst to fine-tune exploratory analysis or to produce clusters of suitable size for future down-stream processing.Our formulation supports arbitrary multimodal prior distributions, generalizing the previous work on clustering algorithms searching for clusters of equal size or algorithms designed for the microclustering task of finding small clusters. We provide practical methods for solving the problem, using integer programming for making the cluster assignments, and demonstrate that we can also automatically infer the number of clusters.
Kernelized Infomax Clustering
Barber, David, Agakov, Felix V.
We propose a simple information-theoretic approach to soft clustering based on maximizing the mutual information I(x, y) between the unknown cluster labels y and the training patterns x with respect to parameters of specifically constrained encoding distributions. The constraints are chosen such that patterns are likely to be clustered similarly if they lie close to specific unknown vectors in the feature space. The method may be conveniently applied to learning the optimal affinity matrix, which corresponds to learning parameters of the kernelized encoder. The procedure does not require computations of eigenvalues of the Gram matrices, which makes it potentially attractive for clustering large data sets.
Kernelized Infomax Clustering
Barber, David, Agakov, Felix V.
We propose a simple information-theoretic approach to soft clustering based on maximizing the mutual information I(x, y) between the unknown cluster labels y and the training patterns x with respect to parameters of specifically constrained encoding distributions. The constraints are chosen such that patterns are likely to be clustered similarly if they lie close to specific unknown vectors in the feature space. The method may be conveniently applied to learning the optimal affinity matrix, which corresponds to learning parameters of the kernelized encoder. The procedure does not require computations of eigenvalues of the Gram matrices, which makes it potentially attractive for clustering large data sets.