Clustering
TerraMAE: Learning Spatial-Spectral Representations from Hyperspectral Earth Observation Data via Adaptive Masked Autoencoders
Faruk, Tanjim Bin, Matin, Abdul, Pallickara, Shrideep, Pallickara, Sangmi Lee
Hyperspectral satellite imagery offers sub-30 m views of Earth in hundreds of contiguous spectral bands, enabling fine-grained mapping of soils, crops, and land cover. While self-supervised Masked Autoencoders excel on RGB and low-band multispectral data, they struggle to exploit the intricate spatial-spectral correlations in 200+ band hyperspectral images. We introduce TerraMAE, a novel HSI encoding framework specifically designed to learn highly representative spatial-spectral embeddings for diverse geospatial analyses. TerraMAE features an adaptive channel grouping strategy, based on statistical reflectance properties to capture spectral similarities, and an enhanced reconstruction loss function that incorporates spatial and spectral quality metrics. We demonstrate TerraMAE's effectiveness through superior spatial-spectral information preservation in high-fidelity image reconstruction. Furthermore, we validate its practical utility and the quality of its learned representations through strong performance on three key downstream geospatial tasks: crop identification, land cover classification, and soil texture prediction.
QuiZSF: An efficient data-model interaction framework for zero-shot time-series forecasting
Ma, Shichao, Zhou, Zhengyang, Huang, Qihe, Wang, Binwu, Yang, Kuo, Li, Huan, Wang, Yang
Time series forecasting has become increasingly important to empower diverse applications with streaming data. Zero-shot time-series forecasting (ZSF), particularly valuable in data-scarce scenarios, such as domain transfer or forecasting under extreme conditions, is difficult for traditional models to deal with. While time series pre-trained models (TSPMs) have demonstrated strong performance in ZSF, they often lack mechanisms to dynamically incorporate external knowledge. Fortunately, emerging retrieval-augmented generation (RAG) offers a promising path for injecting such knowledge on demand, yet they are rarely integrated with TSPMs. To leverage the strengths of both worlds, we introduce RAG into TSPMs to enhance zero-shot time series forecasting. In this paper, we propose QuiZSF (Quick Zero-Shot Time Series Forecaster), a lightweight and modular framework that couples efficient retrieval with representation learning and model adaptation for ZSF. Specifically, we construct a hierarchical tree-structured ChronoRAG Base (CRB) for scalable time-series storage and domain-aware retrieval, introduce a Multi-grained Series Interaction Learner (MSIL) to extract fine- and coarse-grained relational features, and develop a dual-branch Model Cooperation Coherer (MCC) that aligns retrieved knowledge with two kinds of TSPMs: Non-LLM based and LLM based. Compared with contemporary baselines, QuiZSF, with Non-LLM based and LLM based TSPMs as base model, respectively, ranks Top1 in 75% and 87.5% of prediction settings, while maintaining high efficiency in memory and inference time.
ClimateSOM: A Visual Analysis Workflow for Climate Ensemble Datasets
Kawakami, Yuya, Cayan, Daniel, Liu, Dongyu, Ma, Kwan-Liu
Ensemble datasets are ever more prevalent in various scientific domains. In climate science, ensemble datasets are used to capture variability in projections under plausible future conditions including greenhouse and aerosol emissions. Each ensemble model run produces projections that are fundamentally similar yet meaningfully distinct. Understanding this variability among ensemble model runs and analyzing its magnitude and patterns is a vital task for climate scientists. In this paper, we present ClimateSOM, a visual analysis workflow that leverages a self-organizing map (SOM) and Large Language Models (LLMs) to support interactive exploration and interpretation of climate ensemble datasets. The workflow abstracts climate ensemble model runs - spatiotemporal time series - into a distribution over a 2D space that captures the variability among the ensemble model runs using a SOM. LLMs are integrated to assist in sensemaking of this SOM-defined 2D space, the basis for the visual analysis tasks. In all, ClimateSOM enables users to explore the variability among ensemble model runs, identify patterns, compare and cluster the ensemble model runs. To demonstrate the utility of ClimateSOM, we apply the workflow to an ensemble dataset of precipitation projections over California and the Northwestern United States. Furthermore, we conduct a short evaluation of our LLM integration, and conduct an expert review of the visual workflow and the insights from the case studies with six domain experts to evaluate our approach and its utility.
Clustered factor analysis of multineuronal spike data
Lars Buesing, Timothy A. Machado, John P. Cunningham, Liam Paninski
High-dimensional, simultaneous recordings of neural spiking activity are often explored, analyzed and visualized with the help of latent variable or factor models. Such models are however ill-equipped to extract structure beyond shared, distributed aspects of firing activity across multiple cells. Here, we extend unstructured factor models by proposing a model that discovers subpopulations or groups of cells from the pool of recorded neurons. The model combines aspects of mixture of factor analyzer models for capturing clustering structure, and aspects of latent dynamical system models for capturing temporal dependencies. In the resulting model, we infer the subpopulations and the latent factors from data using variational inference and model parameters are estimated by Expectation Maximization (EM). We also address the crucial problem of initializing parameters for EM by extending a sparse subspace clustering algorithm to integer-valued spike count observations. We illustrate the merits of the proposed model by applying it to calcium-imaging data from spinal cord neurons, and we show that it uncovers meaningful clustering structure in the data.
Geometric-k-means: A Bound Free Approach to Fast and Eco-Friendly k-means
Sharma, Parichit, Stanislaw, Marcin, Kurban, Hasan, Kulekci, Oguzhan, Dalkilic, Mehmet
This paper introduces Geometric-k-means (or Gk-means for short), a novel approach that significantly enhances the efficiency and energy economy of the widely utilized k-means algorithm, which, despite its inception over five decades ago, remains a cornerstone in machine learning applications. The essence of Gk-means lies in its active utilization of geometric principles, specifically scalar projection, to significantly accelerate the algorithm without sacrificing solution quality. This geometric strategy enables a more discerning focus on data points that are most likely to influence cluster updates, which we call as high expressive data (HE). In contrast, low expressive data (LE), does not impact clustering outcome, is effectively bypassed, leading to considerable reductions in computational overhead. Experiments spanning synthetic, real-world and high-dimensional datasets, demonstrate Gk-means is significantly better than traditional and state of the art (SOTA) k-means variants in runtime and distance computations (DC). Moreover, Gk-means exhibits better resource efficiency, as evidenced by its reduced energy footprint, placing it as more sustainable alternative.
SCAR: State-Space Compression for AI-Driven Resource Management in 6G-Enabled Vehicular Infotainment Systems
Comsa, Ioan-Sorin, Shah, Purav, Vaidhyanathan, Karthik, Gangadharan, Deepak, Imhof, Christof, Bergamin, Per, Kaushik, Aryan, Muntean, Gabriel-Miro, Trestian, Ramona
The advent of 6G networks opens new possibilities for connected infotainment services in vehicular environments. However, traditional Radio Resource Management (RRM) techniques struggle with the increasing volume and complexity of data such as Channel Quality Indicators (CQI) from autonomous vehicles. To address this, we propose SCAR (State-Space Compression for AI-Driven Resource Management), an Edge AI-assisted framework that optimizes scheduling and fairness in vehicular infotainment. SCAR employs ML-based compression techniques (e.g., clustering and RBF networks) to reduce CQI data size while preserving essential features. These compressed states are used to train 6G-enabled Reinforcement Learning policies that maximize throughput while meeting fairness objectives defined by the NGMN. Simulations show that SCAR increases time in feasible scheduling regions by 14\% and reduces unfair scheduling time by 15\% compared to RL baselines without CQI compression. Furthermore, Simulated Annealing with Stochastic Tunneling (SAST)-based clustering reduces CQI clustering distortion by 10\%, confirming its efficiency. These results demonstrate SCAR's scalability and fairness benefits for dynamic vehicular networks.
Differentially Private Federated Clustering with Random Rebalancing
Yang, Xiyuan, Hu, Shengyuan, Kim, Soyeon, Li, Tian
Federated clustering aims to group similar clients into clusters and produce one model for each cluster. Such a personalization approach typically improves model performance compared with training a single model to serve all clients, but can be more vulnerable to privacy leakage. Directly applying client-level differentially private (DP) mechanisms to federated clustering could degrade the utilities significantly. We identify that such deficiencies are mainly due to the difficulties of averaging privacy noise within each cluster (following standard privacy mechanisms), as the number of clients assigned to the same clusters is uncontrolled. To this end, we propose a simple and effective technique, named RR-Cluster, that can be viewed as a light-weight add-on to many federated clustering algorithms. RR-Cluster achieves reduced privacy noise via randomly rebalancing cluster assignments, guaranteeing a minimum number of clients assigned to each cluster. We analyze the tradeoffs between decreased privacy noise variance and potentially increased bias from incorrect assignments and provide convergence bounds for RR-Clsuter. Empirically, we demonstrate the RR-Cluster plugged into strong federated clustering algorithms results in significantly improved privacy/utility tradeoffs across both synthetic and real-world datasets.
Parameter-free entropy-regularized multi-view clustering with hierarchical feature selection
Sinaga, Kristina P., Colantonio, Sara, Yang, Miin-Shen
Multi - view clustering faces critical challenges in automatically discovering patterns across heterogeneous data while managing high - dimensional features and eliminating irrelevant information. Traditional approaches suffer from manual parameter tuning and lack principled cross - view integration mechanisms. This work introduces two complementary algorithms: AMVFCM - U and AAMVFCM - U, providing a unified parameter - free framework. Our approach replaces fuzzification parameters with entropy regularization terms tha t enforce adaptive cross - view consensus. The core innovation employs signal - to - noise ratio based regularization for principled feature weighting with convergence guarantees, coupled with dual - level entropy terms that automatically balance view and feature contributions. AAMVFCM - U extends this with hierarchical dimensionality reduction operating at feature and view levels through adaptive thresholding . Evaluation across five diverse benchmarks demonstrates superiority over 15 state - of - the - art methods. AAMVFCM - U achieves up to 97% computational efficiency gains, reduces dimensionality to 0.45% of original size, and automatically identifies critical view combinations for optimal pattern discovery. Keywords: Multi - view clustering, Dimensionality reduction, Feature selection, Parameter - free, Signal - to - noise ratio, Fuzzy c - means 1. Introduction Understanding complex data is crucial in today's data - driven world, and recent advancements in machine learning are significantly enhancing our ability to analyze and interpret this information.
Online Sparsification of Bipartite-Like Clusters in Graphs
Das, Joyentanuj, De, Suranjan, Sun, He
Graph clustering is an important algorithmic technique for analysing massive graphs, and has been widely applied in many research fields of data science. While the objective of most graph clustering algorithms is to find a vertex set of low conductance, a sequence of recent studies highlights the importance of the inter-connection between vertex sets when analysing real-world datasets. Following this line of research, in this work we study bipartite-like clusters and present efficient and online sparsification algorithms that find such clusters in both undirected graphs and directed ones. We conduct experimental studies on both synthetic and real-world datasets, and show that our algorithms significantly speedup the running time of existing clustering algorithms while preserving their effectiveness.
Explainable Evidential Clustering
de Souza, Victor F. Lopes, Bakhti, Karima, Ramdani, Sofiane, Mottet, Denis, Imoussaten, Abdelhak
Unsupervised classification is a fundamental machine learning problem. Real-world data often contain imperfections, characterized by uncertainty and imprecision, which are not well handled by traditional methods. Evidential clustering, based on Dempster-Shafer theory, addresses these challenges. This paper explores the underexplored problem of explaining evidential clustering results, which is crucial for high-stakes domains such as healthcare. Our analysis shows that, in the general case, representativity is a necessary and sufficient condition for decision trees to serve as abductive explainers. Building on the concept of representativity, we generalize this idea to accommodate partial labeling through utility functions. These functions enable the representation of "tolerable" mistakes, leading to the definition of evidential mistakeness as explanation cost and the construction of explainers tailored to evidential classifiers. Finally, we propose the Iterative Evidential Mistake Minimization (IEMM) algorithm, which provides interpretable and cautious decision tree explanations for evidential clustering functions. We validate the proposed algorithm on synthetic and real-world data. Taking into account the decision-maker's preferences, we were able to provide an explanation that was satisfactory up to 93% of the time.