Clustering
Reviews: Hierarchical Clustering via Spreading Metrics
The paper of Dasgupta introduced a nice objective for hierarchical clustering, and presented a fairly simple algorithm for approximating the objective. I like this objective since it gives a nice principled way of comparing different algorithms (potentially new ones) for hierarchical clustering, as opposed to just showing bounds for linkage-based heuristics. I think the algorithm and analysis in this paper is very elegant. Every hierarchical clustering corresponds to an ultrametric; but they characterize the ultrametrics that correspond to the objective defined by Dasgupta, using some very nice properties. They use this characterization to come up with an LP and its rounding.
Reviews: Graph Clustering: Block-models and model free results
The goal is to obtain such guarantees with quantities that can be computed from the data and the output of the clustering algorithms being compared. Providing such model free theoretical guarantees for clustering is of importance for both theoretical and practical purposes. Given that Spectral Clutering works well for all the models specified, why not use the same model estimator? In particular, it is not clear why the Laplacian is used for PFM while the adjacency matrix is used for the SBM. Also, the results for PFM is for weighted ME whereas for SBM it is in terms of ME.
FastEx: Hash Clustering with Exponential Families
Clustering is a key component in data analysis toolbox. In this paper we present a sampler, capable of estimating mixtures of exponential families. At its heart lies a novel proposal distribution using random projections to achieve high throughput in generating proposals, which is crucial for clustering models with large numbers of clusters.
Classification of HI Galaxy Profiles Using Unsupervised Learning and Convolutional Neural Networks: A Comparative Analysis and Methodological Cases of Studies
Jaimes-Illanes, Gabriel, Parra-Royon, Manuel, Darriba-Pol, Laura, Moldรณn, Javier, Sorgho, Amidou, Sรกnchez-Expรณsito, Susana, Garrido-Sรกnchez, Juliรกn, Verdes-Montenegro, Lourdes
Hydrogen, the most abundant element in the universe, is crucial for understanding galaxy formation and evolution. The 21 cm neutral atomic hydrogen - HI spectral line maps the gas kinematics within galaxies, providing key insights into interactions, galactic structure, and star formation processes. With new radio instruments, the volume and complexity of data is increasing. To analyze and classify integrated HI spectral profiles in a efficient way, this work presents a framework that integrates Machine Learning techniques, combining unsupervised methods and CNNs. To this end, we apply our framework to a selected subsample of 318 spectral HI profiles of the CIG and 30.780 profiles from the Arecibo Legacy Fast ALFA Survey catalogue. Data pre-processing involved the Busyfit package and iterative fitting with polynomial, Gaussian, and double-Lorentzian models. Clustering methods, including K-means, spectral clustering, DBSCAN, and agglomerative clustering, were used for feature extraction and to bootstrap classification we applied K-NN, SVM, and Random Forest classifiers, optimizing accuracy with CNN. Additionally, we introduced a 2D model of the profiles to enhance classification by adding dimensionality to the data. Three 2D models were generated based on transformations and normalised versions to quantify the level of asymmetry. These methods were tested in a previous analytical classification study conducted by the Analysis of the Interstellar Medium in Isolated Galaxies group. This approach enhances classification accuracy and aims to establish a methodology that could be applied to data analysis in future surveys conducted with the Square Kilometre Array (SKA), currently under construction. All materials, code, and models have been made publicly available in an open-access repository, adhering to FAIR principles.
Hypergraph Representations of scRNA-seq Data for Improved Clustering with Random Walks
He, Wan, Bolnick, Daniel I., Scarpino, Samuel V., Eliassi-Rad, Tina
Analysis of single-cell RNA sequencing data is often conducted through network projections such as coexpression networks, primarily due to the abundant availability of network analysis tools for downstream tasks. However, this approach has several limitations: loss of higher-order information, inefficient data representation caused by converting a sparse dataset to a fully connected network, and overestimation of coexpression due to zero-inflation. To address these limitations, we propose conceptualizing scRNA-seq expression data as hypergraphs, which are generalized graphs in which the hyperedges can connect more than two vertices. In the context of scRNA-seq data, the hypergraph nodes represent cells and the edges represent genes. Each hyperedge connects all cells where its corresponding gene is actively expressed and records the expression of the gene across different cells. This hypergraph conceptualization enables us to explore multi-way relationships beyond the pairwise interactions in coexpression networks without loss of information. We propose two novel clustering methods: (1) the Dual-Importance Preference Hypergraph Walk (DIPHW) and (2) the Coexpression and Memory-Integrated Dual-Importance Preference Hypergraph Walk (CoMem-DIPHW). They outperform established methods on both simulated and real scRNA-seq datasets. The improvement brought by our proposed methods is especially significant when data modularity is weak. Furthermore, CoMem-DIPHW incorporates the gene coexpression network, cell coexpression network, and the cell-gene expression hypergraph from the single-cell abundance counts data altogether for embedding computation. This approach accounts for both the local level information from single-cell level gene expression and the global level information from the pairwise similarity in the two coexpression networks.
Simple, Scalable and Effective Clustering via One-Dimensional Projections
Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis. Popular clustering algorithms such as Lloyd's algorithm and k -means can take \Omega(ndk) time when clustering n points in a d -dimensional space (represented by an n\times d matrix X) into k clusters. On massive datasets with moderate to large k, the multiplicative k factor can become very expensive. We introduce a simple randomized clustering algorithm that provably runs in expected time O(\mathsf{nnz}(X) n\log n) for arbitrary k . Here \mathsf{nnz}(X) is the total number of non-zero entries in the input dataset X, which is upper bounded by nd and can be significantly smaller for sparse datasets.
Tree Variational Autoencoders
We propose Tree Variational Autoencoder (TreeVAE), a new generative hierarchical clustering model that learns a flexible tree-based posterior distribution over latent variables. TreeVAE hierarchically divides samples according to their intrinsic characteristics, shedding light on hidden structures in the data. It adapts its architecture to discover the optimal tree for encoding dependencies between latent variables. The proposed tree-based generative architecture enables lightweight conditional inference and improves generative performance by utilizing specialized leaf decoders. We show that TreeVAE uncovers underlying clusters in the data and finds meaningful hierarchical relations between the different groups on a variety of datasets, including real-world imaging data. We present empirically that TreeVAE provides a more competitive log-likelihood lower bound than the sequential counterparts.
Sketching Algorithms for Sparse Dictionary Learning: PTAS and Turnstile Streaming
Sketching algorithms have recently proven to be a powerful approach both for designing low-space streaming algorithms as well as fast polynomial time approximation schemes (PTAS). In this work, we develop new techniques to extend the applicability of sketching-based approaches to the sparse dictionary learning and the Euclidean k -means clustering problems. In particular, we initiate the study of the challenging setting where the dictionary/clustering assignment for each of the n input points must be output, which has surprisingly received little attention in prior work. On the fast algorithms front, we obtain a new approach for designing PTAS's for the k -means clustering problem, which generalizes to the first PTAS for the sparse dictionary learning problem. On the streaming algorithms front, we obtain new upper bounds and lower bounds for dictionary learning and k -means clustering.
Fair, Polylog-Approximate Low-Cost Hierarchical Clustering
Research in fair machine learning, and particularly clustering, has been crucial in recent years given the many ethical controversies that modern intelligent systems have posed. Ahmadian et al. [2020] established the study of fairness in hierarchical clustering, a stronger, more structured variant of its well-known flat counterpart, though their proposed algorithm that optimizes for Dasgupta's [2016] famous cost function was highly theoretical. Knittel et al. [2023] then proposed the first practical fair approximation for cost, however they were unable to break the polynomial-approximate barrier they posed as a hurdle of interest. We break this barrier, proposing the first truly polylogarithmic-approximate low-cost fair hierarchical clustering, thus greatly bridging the gap between the best fair and vanilla hierarchical clustering approximations.
Replicable Clustering
We design replicable algorithms in the context of statistical clustering under the recently introduced notion of replicability from Impagliazzo et al. [2022]. According to this definition, a clustering algorithm is replicable if, with high probability, its output induces the exact same partition of the sample space after two executions on different inputs drawn from the same distribution, when its internal randomness is shared across the executions. We propose such algorithms for the statistical k -medians, statistical k -means, and statistical k -centers problems by utilizing approximation routines for their combinatorial counterparts in a black-box manner. In particular, we demonstrate a replicable O(1) -approximation algorithm for statistical Euclidean k -medians ( k -means) with \operatorname{poly}(d) sample complexity. We also describe an O(1) -approximation algorithm with an additional O(1) -additive error for statistical Euclidean k -centers, albeit with \exp(d) sample complexity.