AITopics | Clustering

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)
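As a rough illustration of grouping objects so that members of a cluster are closer to each other than to members of other clusters, here is a minimal k-means (Lloyd's algorithm) sketch in NumPy. K-means is only one of many clustering methods and is not singled out by the definition above; the function name `kmeans`, its parameters, and the toy points are assumptions made for this example.

```python
import numpy as np

def kmeans(X, init_centroids, n_iter=20):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean update."""
    centroids = np.asarray(init_centroids, dtype=float)
    k = len(centroids)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each centroid to the mean of its points; keep it if the cluster is empty.
        new_centroids = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data (hypothetical): two well-separated groups of three points each.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids = kmeans(X, init_centroids=X[[0, 3]])
```

The initial centroids are passed in explicitly so the run is deterministic; in practice the result of k-means depends on initialization.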

PTCMIL: Multiple Instance Learning via Prompt Token Clustering for Whole Slide Image Analysis

Zhao, Beidi, Kim, SangMook, Chen, Hao, Zhou, Chen, Gao, Zu-hua, Wang, Gang, Li, Xiaoxiao

arXiv.org Artificial Intelligence, Jul-28-2025

Multiple Instance Learning (MIL) has advanced WSI analysis but struggles with the complexity and heterogeneity of WSIs. Existing MIL methods face challenges in aggregating diverse patch information into robust WSI representations. While ViTs and clustering-based approaches show promise, they are computationally intensive and fail to capture task-specific and slide-specific variability. To address these limitations, we propose PTCMIL, a novel Prompt Token Clustering-based ViT for MIL aggregation. By introducing learnable prompt tokens into the ViT backbone, PTCMIL unifies clustering and prediction tasks in an end-to-end manner. It dynamically aligns clustering with downstream tasks, using projection-based clustering tailored to each WSI, reducing complexity while preserving patch heterogeneity. Through token merging and prototype-based pooling, PTCMIL efficiently captures task-relevant patterns. Extensive experiments on eight datasets demonstrate its superior performance in classification and survival analysis tasks, outperforming state-of-the-art methods. Systematic ablation studies confirm its robustness and strong interpretability. The code is released at https://github.com/ubc-tea/PTCMIL.

machine learning, natural language, ptcmil, (17 more...)

arXiv.org Artificial Intelligence

2507.18848

Country: North America (0.14)

Genre: Research Report > Promising Solution (0.34)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Diagnostic Medicine (0.69)

Technology:

Information Technology > Artificial Intelligence > Vision (0.97)
Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Sensing and Signal Processing > Image Processing (0.84)
(2 more...)

Learning graphons from data: Random walks, transfer operators, and spectral clustering

Klus, Stefan, Bramburger, Jason J.

arXiv.org Machine Learning, Jul-25-2025

Many signals in the real world that evolve in time can be modeled as a stochastic process with the signal randomly jumping from one state to another as time proceeds. When the signal can only exhibit a finite number of possible states, one can interpret the evolution of the signal as a random walk on a graph with vertices representing the states of the signal and edge weights giving the transition probabilities from one state to another. In particular, one arrives at a Markov chain representation of the signal that can be estimated using only the signal data. However, many realistic signals can take on a continuum of values, and so the goal of this work is to present a framework for modeling continuous-space stochastic signals and to identify metastable and coherent sets via clustering techniques. We present a data-driven method to learn the discrete-time transition probabilities of stochastic signals evolving in continuous space, which can be regarded as a generalization of the discrete space case considered in [25, 22]. The underlying theory is developed by invoking the concept of a graphon, which can be defined as the limit of sequences of dense networks that grow without bound [35, 34, 21, 18]. As recently shown in [43], graphons provide a well-developed framework for extending the concepts of random walks on finite graphs to stochastic processes evolving in continuous space. For example, random walks on graphs can be used to measure the centrality of vertices, and these concepts can also be extended to graphons [4]. Our goal is to identify transition probabilities, clusters, and the graphon itself from random walk data.
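The finite-state case described in the abstract, estimating a Markov chain from observed signal data alone, can be sketched by counting one-step transitions and normalizing each row. This illustrates only that discrete baseline, not the authors' graphon-based method; the function name `estimate_transition_matrix` and the toy signal are hypothetical.

```python
import numpy as np

def estimate_transition_matrix(states, n_states):
    """Row-normalized counts of observed one-step transitions."""
    counts = np.zeros((n_states, n_states))
    for i, j in zip(states[:-1], states[1:]):
        counts[i, j] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # States that are never left in the data keep an all-zero row.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Hypothetical signal that jumps between three states over time.
signal = [0, 1, 1, 0, 2, 2, 2, 0, 1, 0]
P = estimate_transition_matrix(signal, n_states=3)
```

Entry `P[i, j]` is the empirical probability of moving from state `i` to state `j` in one step, which plays the role of the edge weight in the random-walk picture.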

artificial intelligence, machine learning, operator, (16 more...)

arXiv.org Machine Learning

2507.18147

Country:

North America > United States > New York > Albany County > Albany (0.04)
North America > Canada > Quebec > Montreal (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Artificial Intelligence for Green Hydrogen Yield Prediction and Site Suitability using SHAP-Based Composite Index: Focus on Oman

Nwafor, Obumneme Zimuzor, Hooti, Mohammed Abdul Majeed Al

arXiv.org Artificial Intelligence, Jul-24-2025

As nations seek sustainable alternatives to fossil fuels, green hydrogen has emerged as a promising strategic pathway toward decarbonisation, particularly in solar-rich arid regions. However, identifying optimal locations for hydrogen production requires the integration of complex environmental, atmospheric, and infrastructural factors, often compounded by limited availability of direct hydrogen yield data. This study presents a novel Artificial Intelligence (AI) framework for computing green hydrogen yield and a site suitability index using mean absolute SHAP (SHapley Additive exPlanations) values. The framework consists of a multi-stage pipeline of unsupervised multi-variable clustering, a supervised machine learning classifier, and the SHAP algorithm. The pipeline trains on an integrated meteorological, topographic and temporal dataset, and the results reveal distinct spatial patterns of suitability and the relative influence of the variables. With a model predictive accuracy of 98%, the results also show that water proximity, elevation and seasonal variation are the most influential factors determining green hydrogen site suitability in Oman, with mean absolute SHAP values of 2.470891, 2.376296 and 1.273216 respectively. Given the limited availability or absence of ground-truth yield data in many countries that have green hydrogen prospects and ambitions, this study offers an objective and reproducible alternative to subjective expert weightings, thus allowing the data to speak for itself and potentially discover novel latent groupings without pre-imposed assumptions. This study offers industry stakeholders and policymakers a replicable and scalable tool for green hydrogen infrastructure planning and other decision making in data-scarce regions.

artificial intelligence, machine learning, oman, (14 more...)

arXiv.org Artificial Intelligence

2507.14219

Country: Asia > Middle East > Oman (1.00)

Genre: Research Report > New Finding (0.68)

Industry: Energy > Renewable > Hydrogen (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.48)

Fast-VAT: Accelerating Cluster Tendency Visualization using Cython and Numba

Avinash, MSR, Lachheb, Ismael

arXiv.org Artificial Intelligence, Jul-23-2025

Visual Assessment of Cluster Tendency (VAT) is a widely used unsupervised technique to assess the presence of cluster structure in unlabeled datasets. However, its standard implementation suffers from significant performance limitations due to its O(n^2) time complexity and inefficient memory usage. In this work, we present Fast-VAT, a high-performance reimplementation of the VAT algorithm in Python, augmented with Numba's Just-In-Time (JIT) compilation and Cython's static typing and low-level memory optimizations. Our approach achieves up to 50x speedup over the baseline implementation, while preserving the output fidelity of the original method. We validate Fast-VAT on a suite of real and synthetic datasets -- including Iris, Mall Customers, and Spotify subsets -- and verify cluster tendency using Hopkins statistics, PCA, and t-SNE. Additionally, we compare VAT's structural insights with clustering results from DBSCAN and K-Means to confirm its reliability.
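For orientation, the baseline VAT reordering that Fast-VAT accelerates can be sketched in plain NumPy as follows. This follows the commonly described VAT procedure (start from an object involved in the largest dissimilarity, then repeatedly append the unplaced object closest to the already placed set); it is not the authors' Numba or Cython code, and the function names `vat_order` and `vat` are assumptions.

```python
import numpy as np

def vat_order(R):
    """Return the VAT ordering of objects for a symmetric dissimilarity matrix R."""
    n = R.shape[0]
    # Start from one endpoint of the largest pairwise dissimilarity.
    start = int(np.unravel_index(np.argmax(R), R.shape)[0])
    order = [start]
    remaining = np.ones(n, dtype=bool)
    remaining[start] = False
    # min_dist[j]: smallest dissimilarity between j and any already placed object.
    min_dist = R[start].astype(float)
    for _ in range(n - 1):
        candidates = np.where(remaining, min_dist, np.inf)
        j = int(np.argmin(candidates))
        order.append(j)
        remaining[j] = False
        min_dist = np.minimum(min_dist, R[j])
    return np.array(order)

def vat(R):
    """Reorder rows and columns of R so that clusters show up as dark diagonal blocks."""
    order = vat_order(R)
    return R[np.ix_(order, order)], order

# Hypothetical 1-D data: two groups whose members are interleaved in the input order.
points = np.array([0.0, 10.0, 0.1, 10.1, 0.2, 10.2])
R = np.abs(points[:, None] - points[None, :])
R_star, order = vat(R)
```

Displaying `R_star` as a grayscale image (for example with matplotlib's `imshow`) would show two dark blocks along the diagonal, one per group.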

artificial intelligence, data mining, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2507.15904

Country: Europe > France (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Tri-Learn Graph Fusion Network for Attributed Graph Clustering

Li, Binxiong, Xiang, Xu, Li, Xue, Zhao, Binyu, Gao, Heyang, Zhao, Qinyu

arXiv.org Artificial Intelligence, Jul-23-2025

In recent years, models based on Graph Convolutional Networks (GCN) have made significant strides in the field of graph data analysis. However, challenges such as over-smoothing and over-compression remain when handling large-scale and complex graph datasets, leading to a decline in clustering quality. Although the Graph Transformer architecture has mitigated some of these issues, its performance is still limited when processing heterogeneous graph data. To address these challenges, this study proposes a novel deep clustering framework comprising GCN, Autoencoder (AE), and Graph Transformer, termed the Tri-Learn Graph Fusion Network (Tri-GFN). This framework enhances the differentiation and consistency of global and local information through a unique tri-learning mechanism and feature fusion enhancement strategy. The framework integrates GCN, AE, and Graph Transformer modules. These components are meticulously fused by a triple-channel enhancement module, which maximizes the use of both node attributes and topological structures, ensuring robust clustering representation. The tri-learning mechanism allows mutual learning among these modules, while the feature fusion strategy enables the model to capture complex relationships, yielding highly discriminative representations for graph clustering. It surpasses many state-of-the-art methods, achieving an accuracy improvement of approximately 0.87% on the ACM dataset, 14.14% on the Reuters dataset, and 7.58% on the USPS dataset. Due to its outstanding performance on the Reuters dataset, Tri-GFN can be applied to automatic news classification, topic retrieval, and related fields.

artificial intelligence, deep learning, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2507.1362

Country:

North America > United States (1.00)
Asia > China > Sichuan Province (0.28)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.93)
Research Report > Promising Solution (0.87)

Industry: Information Technology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Pattern-Based Graph Classification: Comparison of Quality Measures and Importance of Preprocessing

Potin, Lucas, Figueiredo, Rosa, Labatut, Vincent, Largeron, Christine

arXiv.org Artificial Intelligence, Jul-23-2025

Graph classification aims to categorize graphs based on their structural and attribute features, with applications in diverse fields such as social network analysis and bioinformatics. Among the methods proposed to solve this task, those relying on patterns (i.e. subgraphs) provide good explainability, as the patterns used for classification can be directly interpreted. To identify meaningful patterns, a standard approach is to use a quality measure, i.e. a function that evaluates the discriminative power of each pattern. However, the literature provides tens of such measures, making it difficult to select the most appropriate for a given application. Only a handful of surveys try to provide some insight by comparing these measures, and none of them specifically focuses on graphs. This typically results in the systematic use of the most widespread measures, without thorough evaluation. To address this issue, we present a comparative analysis of 38 quality measures from the literature. We characterize them theoretically, based on four mathematical properties. We leverage publicly available datasets to constitute a benchmark, and propose a method to elaborate a gold standard ranking of the patterns. We exploit these resources to perform an empirical comparison of the measures, both in terms of pattern ranking and classification performance. Moreover, we propose a clustering-based preprocessing step, which groups patterns appearing in the same graphs to enhance classification performance. Our experimental results demonstrate the effectiveness of this step, reducing the number of patterns to be processed while achieving comparable performance. Additionally, we show that some popular measures widely used in the literature are not associated with the best results.

data mining, machine learning, pattern recognition, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3743143

2507.00039

Genre:

Research Report > New Finding (1.00)
Overview (1.00)

Industry:

Health & Medicine > Therapeutic Area (0.95)
Information Technology > Security & Privacy (0.92)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(4 more...)

Learning under Latent Group Sparsity via Diffusion on Networks

Ghosh, Subhroshekhar, Mukherjee, Soumendu Sundar

arXiv.org Machine Learning, Jul-22-2025

Group or cluster structure on explanatory variables in machine learning problems is a very general phenomenon, which has attracted broad interest from practitioners and theoreticians alike. In this work we contribute an approach to sparse learning under such group structure that does not require prior information on the group identities. Our paradigm is motivated by the Laplacian geometry of an underlying network with a related community structure, and proceeds by directly incorporating this into a penalty that is effectively computed via heat-flow-based local network dynamics. The proposed penalty interpolates between the lasso and the group lasso penalties, the runtime of the heat-flow dynamics being the interpolating parameter. As such it can automatically default to lasso when the group structure reflected in the Laplacian is weak. In fact, we demonstrate a data-driven procedure to construct such a network based on the available data. Notably, we dispense with computationally intensive pre-processing involving clustering of variables, spectral or otherwise. Our technique is underpinned by rigorous theorems that guarantee its effective performance and provide bounds on its sample complexity. In particular, in a wide range of settings, it provably suffices to run the diffusion for time that is only logarithmic in the problem dimensions. We explore in detail the interfaces of our approach with key statistical physics models in network science, such as the Gaussian Free Field and the Stochastic Block Model. Our work raises the possibility of applying similar diffusion-based techniques to classical learning tasks, exploiting the interplay between geometric, dynamical and stochastic structures underlying the data.
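For reference, the two penalties that the proposed method is said to interpolate between are usually written as follows. The notation is assumed here rather than taken from the paper: a coefficient vector \beta in R^p, a regularization weight \lambda > 0, and a partition \mathcal{G} of the variables into groups g of size p_g. The paper's own heat-flow penalty is not reproduced.

```latex
\[
P_{\mathrm{lasso}}(\beta) = \lambda \sum_{j=1}^{p} |\beta_j|,
\qquad
P_{\mathrm{group}}(\beta) = \lambda \sum_{g \in \mathcal{G}} \sqrt{p_g}\, \lVert \beta_g \rVert_2 .
\]
```

When every group contains a single variable, p_g = 1 and \lVert \beta_g \rVert_2 = |\beta_j|, so the group lasso penalty reduces to the lasso penalty.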

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Machine Learning

2507.15097

Country:

Asia > Singapore (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
(4 more...)

Genre: Research Report (1.00)

Industry:

Government (0.68)
Education (0.48)
Health & Medicine (0.46)
Banking & Finance (0.45)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

A Hybrid Mixture Approach for Clustering and Characterizing Cancer Data

Kareem, Kazeem, Dai, Fan

arXiv.org Machine Learning, Jul-22-2025

Model-based clustering is widely used for identifying and distinguishing types of diseases. However, modern high-dimensional biomedical data make model estimation challenging in traditional cluster analysis. Incorporating factor analyzers into the mixture model provides a way to characterize the large set of data features, but the current estimation method is computationally impractical for massive data because of the intrinsically slow convergence of the embedded algorithms and the inability to vary the size of the factor analyzers, which prevents the implementation of a generalized mixture of factor analyzers and further characterization of the data clusters. We propose a hybrid matrix-free computational scheme to efficiently estimate the clusters and model parameters based on a Gaussian mixture along with generalized factor analyzers to summarize the large number of variables using a small set of underlying factors. Our approach outperforms the existing method with faster convergence while maintaining high clustering accuracy. Our algorithms are applied to accurately identify and distinguish types of breast cancer based on large tumor samples, and to provide a generalized characterization for subtypes of lymphoma using massive gene records.

algorithm, artificial intelligence, machine learning, (13 more...)

arXiv.org Machine Learning

2507.1438

Country:

Europe > Austria > Vienna (0.14)
North America > United States > Wisconsin (0.05)
North America > United States > Michigan (0.04)
(2 more...)

Genre: Research Report (0.40)

Industry: Health & Medicine > Therapeutic Area > Oncology (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

A Framework for Analyzing Abnormal Emergence in Service Ecosystems Through LLM-based Agent Intention Mining

Shen, Yifan, Zhao, Zihan, Xue, Xiao, Guo, Yuwei, Ma, Qun, Zhou, Deyu, Zhang, Ming

arXiv.org Artificial Intelligence, Jul-22-2025

With the rise of service computing, cloud computing, and IoT, service ecosystems are becoming increasingly complex. The intricate interactions among intelligent agents make abnormal emergence analysis challenging, as traditional causal methods focus on individual trajectories. Large language models offer new possibilities for Agent-Based Modeling (ABM) through Chain-of-Thought (CoT) reasoning to reveal agent intentions. However, existing approaches remain limited to microscopic and static analysis. This paper introduces a framework, Emergence Analysis based on Multi-Agent Intention (EAMI), which enables dynamic and interpretable emergence analysis. EAMI first employs a dual-perspective thought track mechanism, where an Inspector Agent and an Analysis Agent extract agent intentions under bounded and perfect rationality. Then, k-means clustering identifies phase transition points in group intentions, followed by an Intention Temporal Emergence diagram for dynamic analysis. The experiments validate EAMI in a complex online-to-offline (O2O) service system and the Stanford AI Town experiment, with ablation studies confirming its effectiveness, generalizability, and efficiency. This framework provides a novel paradigm for abnormal emergence and causal analysis in service ecosystems. The code is available at https://anonymous.4open.science/r/EAMI-B085.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2507.1577

Country: Asia > China (0.29)

Genre: Research Report > New Finding (0.93)

Industry: Information Technology (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Learning to Gridize: Segment Physical World by Wireless Communication Channel

Wang, Juntao, Yin, Feng, Ding, Tian, Chang, Tsung-Hui, Luo, Zhi-Quan, Yan, Qi

arXiv.org Artificial Intelligence, Jul-22-2025

Gridization, the process of partitioning space into grids where users share similar channel characteristics, serves as a fundamental prerequisite for efficient large-scale network optimization. However, existing methods like Geographical or Beam Space Gridization (GSG or BSG) are limited by reliance on unavailable location data or the flawed assumption that similar signal strengths imply similar channel properties. We propose Channel Space Gridization (CSG), a pioneering framework that unifies channel estimation and gridization for the first time. Formulated as a joint optimization problem, CSG uses only beam-level reference signal received power (RSRP) to estimate Channel Angle Power Spectra (CAPS) and partition samples into grids with homogeneous channel characteristics. To perform CSG, we develop the CSG Autoencoder (CSG-AE), featuring a trainable RSRP-to-CAPS encoder, a learnable sparse codebook quantizer, and a physics-informed decoder based on the Localized Statistical Channel Model. Recognizing the limitations of a naive training scheme, we propose a novel Pretraining-Initialization-Detached-Asynchronous (PIDA) training scheme for CSG-AE, ensuring stable and effective training by systematically addressing the common pitfalls of the naive training paradigm. Evaluations reveal that CSG-AE excels in CAPS estimation accuracy and clustering quality on synthetic data. On real-world datasets, it reduces Active Mean Absolute Error (MAE) by 30% and Overall MAE by 65% on RSRP prediction accuracy compared to salient baselines using the same data, while improving channel consistency, cluster size balance, and active ratio, advancing the development of gridization for large-scale network optimization.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2507.15386

Country: Asia > China (0.28)

Genre:

Research Report (0.82)
Overview (0.67)

Industry:

Telecommunications (0.93)
Information Technology (0.67)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language (0.93)
(2 more...)
