Clustering
Towards Interpretable and Inference-Optimal COT Reasoning with Sparse Autoencoder-Guided Generation
Zhao, Daniel, Shankarampeta, Abhilash, Hu, Lanxiang, Rosing, Tajana, Zhang, Hao
We propose a novel method that leverages sparse autoencoders (SAEs) and clustering techniques to analyze the internal token representations of large language models (LLMs) and guide generations in mathematical reasoning tasks. Our approach first trains an SAE to generate sparse vector representations for training tokens, then applies k-means clustering to construct a graph where vertices represent token clusters and weighted edges capture sequential token transitions. Using this graph, we define an edge-weight based reward function to quantify adherence to established reasoning traces, thereby identifying exploitative reasoning trajectories. Additionally, we measure generation diversity from clustering to assess the extent of exploration. Our findings indicate that balancing both exploitation and exploration is crucial for achieving high accuracy in mathematical reasoning tasks. During generation, the SAE can serve as a scalable reward model to guide generations, ensuring a balanced trade-off between exploitation and exploration. This prevents extreme behaviors in either direction, ultimately fostering a higher-quality reasoning process in LLMs.
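The graph-based reward described above can be illustrated with a minimal sketch. It assumes each token has already been mapped to a cluster id (for example by k-means over SAE activations, which is not shown), and it assumes the reward is the mean normalized transition weight along a trace. The function names, the normalization, and the diversity measure are hypothetical choices for illustration, not the authors' implementation.

```python
from collections import Counter, defaultdict


def build_transition_graph(traces):
    """Count cluster-to-cluster transitions, then normalize per source cluster."""
    counts = defaultdict(Counter)
    for trace in traces:
        for src, dst in zip(trace, trace[1:]):
            counts[src][dst] += 1
    graph = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        graph[src] = {dst: c / total for dst, c in dsts.items()}
    return graph


def edge_weight_reward(graph, trace):
    """Mean edge weight along the trace; transitions never seen in training score 0."""
    steps = list(zip(trace, trace[1:]))
    if not steps:
        return 0.0
    return sum(graph.get(s, {}).get(d, 0.0) for s, d in steps) / len(steps)


def cluster_diversity(trace):
    """Fraction of positions that visit a distinct cluster (an assumed exploration proxy)."""
    return len(set(trace)) / len(trace) if trace else 0.0


# Toy cluster-id sequences standing in for clustered training tokens.
training_traces = [[0, 1, 2, 3], [0, 1, 2, 2], [0, 1, 3, 3]]
graph = build_transition_graph(training_traces)

familiar = [0, 1, 2, 3]  # follows transitions seen in training (exploitation)
novel = [3, 0, 2, 1]     # uses only unseen transitions (exploration)

print(edge_weight_reward(graph, familiar), cluster_diversity(familiar))
print(edge_weight_reward(graph, novel), cluster_diversity(novel))
```

In this toy setting a trace that follows frequent training transitions receives a higher reward than one built from unseen transitions, while the diversity score is computed independently, so the two quantities can be traded off during generation.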
Landcover classification and change detection using remote sensing and machine learning: a case study of Western Fiji
Gurjar, Yadvendra, Wan, Ruoni, Farahbakhsh, Ehsan, Chandra, Rohitash
As a developing country, Fiji is facing rapid urbanisation, which is visible in the massive development projects that include housing, roads, and civil works. In this study, we present machine learning and remote sensing frameworks to compare land use and land cover change from 2013 to 2024 in Nadi, Fiji. The ultimate goal of this study is to provide technical support in land cover/land use modelling and change detection. We used Landsat-8 satellite imagery for the study region and created our training dataset with labels for supervised machine learning. We used Google Earth Engine and unsupervised machine learning via k-means clustering to generate the land cover map. We used convolutional neural networks to classify the selected regions' land cover types. We present a visualisation of change detection, highlighting urban area changes over time to monitor changes in the map.
scSiameseClu: A Siamese Clustering Framework for Interpreting single-cell RNA Sequencing Data
Xu, Ping, Ning, Zhiyuan, Li, Pengjiang, Liu, Wenhao, Wang, Pengyang, Cui, Jiaxu, Zhou, Yuanchun, Wang, Pengfei
Single-cell RNA sequencing (scRNA-seq) reveals cell heterogeneity, with cell clustering playing a key role in identifying cell types and marker genes. Recent advances, especially graph neural network (GNN)-based methods, have significantly improved clustering performance. However, the analysis of scRNA-seq data remains challenging due to noise, sparsity, and high dimensionality. Compounding these challenges, GNNs often suffer from over-smoothing, limiting their ability to capture complex biological information. In response, we propose scSiameseClu, a novel Siamese Clustering framework for interpreting single-cell RNA-seq data, comprising three key steps: (1) Dual Augmentation Module, which applies biologically informed perturbations to the gene expression matrix and cell graph relationships to enhance representation robustness; (2) Siamese Fusion Module, which combines cross-correlation refinement and adaptive information fusion to capture complex cellular relationships while mitigating over-smoothing; and (3) Optimal Transport Clustering, which utilizes Sinkhorn distance to efficiently align cluster assignments with predefined proportions while maintaining balance. Comprehensive evaluations on seven real-world datasets demonstrate that scSiameseClu outperforms state-of-the-art methods in single-cell clustering, cell type annotation, and cell type classification, providing a powerful tool for scRNA-seq data interpretation.
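The idea of aligning cluster assignments with predefined proportions through Sinkhorn iterations can be sketched generically as follows. This is a plain Sinkhorn-Knopp scaling over a hypothetical cell-by-cluster score matrix, not the scSiameseClu implementation; the function name, the `epsilon` regularization value, and the toy scores are all assumptions.

```python
import math


def sinkhorn_assign(scores, proportions, epsilon=0.5, n_iters=200):
    """Return a soft assignment plan whose column sums match `proportions`.

    scores[i][j] is an assumed affinity of cell i for cluster j; each cell
    carries mass 1/n, and cluster j should receive total mass proportions[j].
    """
    n, k = len(scores), len(proportions)
    # Gibbs kernel; subtracting the row maximum only rescales rows (for stability).
    kernel = []
    for row in scores:
        top = max(row)
        kernel.append([math.exp((s - top) / epsilon) for s in row])
    row_mass = 1.0 / n
    v = [1.0] * k
    for _ in range(n_iters):
        # Alternately rescale rows and columns toward their target marginals.
        u = [row_mass / sum(kernel[i][j] * v[j] for j in range(k)) for i in range(n)]
        v = [proportions[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(k)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]


# Every cell mildly prefers cluster 0, but the target split is 50/50.
scores = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.6], [0.6, 0.5]]
proportions = [0.5, 0.5]
plan = sinkhorn_assign(scores, proportions)

# Hard labels can be read off as the largest entry in each row of the plan.
labels = [max(range(len(row)), key=lambda j: row[j]) for row in plan]
print(labels)
```

Because the column marginals are constrained, the plan cannot place all mass on the cluster that every cell prefers, which is how this family of methods discourages degenerate, unbalanced clusterings.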
Export Reviews, Discussions, Author Feedback and Meta-Reviews
First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. This paper proposes an incremental but very sensible and practical modification to 'curriculum learning'. Given a partition of the training examples into classes, the authors propose an additional regularising term (and an additional parameter) to ensure that the 'easy' examples selected during learning are spread across the classes rather than drawn from a single class. The partition into classes can come from a clustering algorithm, or from a priori knowledge. The idea is straightforward and sensible, and the authors propose an algorithm that looks efficient and correct.
Export Reviews, Discussions, Author Feedback and Meta-Reviews
First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. Summary of the paper: The paper studies the incremental clustering problem and shows several properties:
- It shows that no deterministic memory-bounded incremental clustering method is nice-detecting. Specifically, the authors show that no deterministic nice-detecting incremental clustering algorithm can use fewer than 2^{cp-1} bits of memory for data in R^p under the l2 metric.
Some example algorithms are then presented.
General comments:
- The paper is written clearly and the guarantees in this paper are solid.
Tight Continuous Relaxation of the Balanced k-Cut Problem
Syama Sundar Rangapuram, Pramod Kaushik Mudrakarta, Matthias Hein
Spectral clustering, as a relaxation of the normalized/ratio cut, has become one of the standard graph-based clustering methods. Existing methods for the computation of multiple clusters, corresponding to a balanced k-cut of the graph, are based on either greedy techniques or heuristics that have only a weak connection to the original motivation of minimizing the normalized cut. In this paper we propose a new tight continuous relaxation for any balanced k-cut problem and show that a related, recently proposed relaxation is in most cases loose, leading to poor performance in practice. For the optimization of our tight continuous relaxation we propose a new algorithm for the difficult sum-of-ratios minimization problem which achieves monotonic descent. Extensive comparisons show that our method outperforms all existing approaches for ratio cut and other balanced k-cut criteria.
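To make the objective concrete, the following sketch evaluates a balanced k-cut criterion of the form sum over clusters of cut(C, complement of C) / S(C) on a small graph, with S taken as the cluster cardinality (ratio cut) or the cluster volume (normalized cut). It only scores a given partition and does not implement the relaxation or the optimization algorithm from the paper. The helper names are hypothetical, and no factor of 1/2 is applied, although some texts include one.

```python
def cut_weight(W, part):
    """Total weight of edges leaving `part` in the symmetric weight matrix W."""
    inside = set(part)
    return sum(W[i][j] for i in inside for j in range(len(W)) if j not in inside)


def cardinality(W, part):
    """Balancing function for ratio cut: number of vertices in the cluster."""
    return len(part)


def volume(W, part):
    """Balancing function for normalized cut: sum of vertex degrees in the cluster."""
    return sum(sum(W[i]) for i in part)


def balanced_k_cut(W, partition, size):
    """Sum over clusters of cut(C, complement) / size(C)."""
    return sum(cut_weight(W, part) / size(W, part) for part in partition)


# Two unit-weight triangles {0,1,2} and {3,4,5} joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
W = [[0.0] * 6 for _ in range(6)]
for a, b in edges:
    W[a][b] = W[b][a] = 1.0

good = [[0, 1, 2], [3, 4, 5]]  # cuts only the bridge edge
bad = [[0], [1, 2, 3, 4, 5]]   # isolates a single vertex

print(balanced_k_cut(W, good, cardinality), balanced_k_cut(W, good, volume))
print(balanced_k_cut(W, bad, cardinality), balanced_k_cut(W, bad, volume))
```

On this graph the partition along the bridge has ratio cut 2/3 and normalized cut 2/7, while isolating one vertex yields a ratio cut of 2.4, which shows how the denominators penalize unbalanced clusters.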
Export Reviews, Discussions, Author Feedback and Meta-Reviews
First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. The authors present a novel non-parametric Bayesian model for unsupervised clustering. The model uses a two-level hierarchy of Dirichlet process priors to handle clusters that may be multi-modal, skewed and/or heavy-tailed. The authors present a collapsed Gibbs sampler for inference which exploits the conjugacy of the model. The authors do an excellent job of motivating the model by explaining the deficiencies of the standard infinite mixture of Gaussians.