Clustering
Enhanced Drought Analysis in Bangladesh: A Machine Learning Approach for Severity Classification Using Satellite Data
Paul, Tonmoy, Mati, Mrittika Devi, Islam, Md. Mahmudul
Drought poses a pervasive environmental challenge in Bangladesh, impacting agriculture, socio-economic stability, and food security due to its unique geographic and anthropogenic vulnerabilities. Traditional drought indices, such as the Standardized Precipitation Index (SPI) and Palmer Drought Severity Index (PDSI), often overlook crucial factors like soil moisture and temperature, limiting their resolution. Moreover, current machine learning models applied to drought prediction have been underexplored in the context of Bangladesh, lacking a comprehensive integration of satellite data across multiple districts. To address these gaps, we propose a satellite data-driven machine learning framework to classify drought across 38 districts of Bangladesh. Using unsupervised algorithms like K-means and Bayesian Gaussian Mixture for clustering, followed by classification models such as KNN, Random Forest, Decision Tree, and Naive Bayes, the framework integrates weather data (humidity, soil moisture, temperature) from 2012-2024. This approach successfully classifies drought severity into different levels. However, it shows significant variabilities in drought vulnerabilities across regions which highlights the aptitude of machine learning models in terms of identifying and predicting drought conditions.
AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance
Patel, Dhaval, Lin, Shuxin, Rayfield, James, Zhou, Nianjun, Vaculin, Roman, Martinez, Natalia, O'donncha, Fearghal, Kalagnanam, Jayant
AI for Industrial Asset Lifecycle Management aims to automate complex operational workflows -- such as condition monitoring, maintenance planning, and intervention scheduling -- to reduce human workload and minimize system downtime. Traditional AI/ML approaches have primarily tackled these problems in isolation, solving narrow tasks within the broader operational pipeline. In contrast, the emergence of AI agents and large language models (LLMs) introduces a next-generation opportunity: enabling end-to-end automation across the entire asset lifecycle. This paper envisions a future where AI agents autonomously manage tasks that previously required distinct expertise and manual coordination. To this end, we introduce AssetOpsBench -- a unified framework and environment designed to guide the development, orchestration, and evaluation of domain-specific agents tailored for Industry 4.0 applications. We outline the key requirements for such holistic systems and provide actionable insights into building agents that integrate perception, reasoning, and control for real-world industrial operations. The software is available at https://github.com/IBM/AssetOpsBench.
QQSUM: A Novel Task and Model of Quantitative Query-Focused Summarization for Review-based Product Question Answering
Tang, An Quang, Zhang, Xiuzhen, Dinh, Minh Ngoc, Li, Zhuang
Review-based Product Question Answering (PQA) allows e-commerce platforms to automatically address customer queries by leveraging insights from user reviews. However, existing PQA systems generate answers with only a single perspective, failing to capture the diversity of customer opinions. In this paper we introduce a novel task Quantitative Query-Focused Summarization (QQSUM), which aims to summarize diverse customer opinions into representative Key Points (KPs) and quantify their prevalence to effectively answer user queries. While Retrieval-Augmented Generation (RAG) shows promise for PQA, its generated answers still fall short of capturing the full diversity of viewpoints. To tackle this challenge, our model QQSUM-RAG, which extends RAG, employs few-shot learning to jointly train a KP-oriented retriever and a KP summary generator, enabling KP-based summaries that capture diverse and representative opinions. Experimental results demonstrate that QQSUM-RAG achieves superior performance compared to state-of-the-art RAG baselines in both textual quality and quantification accuracy of opinions. Our source code is available at: https://github.com/antangrocket1312/QQSUMM
Optimizing FPGA and Wafer Test Coverage with Spatial Sampling and Machine Learning
WeiQuan, Wang, Mian, Riaz-ul-Haque
In semiconductor manufacturing, testing costs remain significantly high, especially during wafer and FPGA testing. To reduce the number of required tests while maintaining predictive accuracy, this study investigates three baseline sampling strategies: Random Sampling, Stratified Sampling, and k-means Clustering Sampling. To further enhance these methods, this study proposes a novel algorithm that improves the sampling quality of each approach. This research is conducted using real industrial production data from wafer-level tests and silicon measurements from various FPGAs. This study introduces two hybrid strategies: Stratified with Short Distance Elimination (S-SDE) and k-means with Short Distance Elimination (K-SDE). Their performance is evaluated within the framework of Gaussian Process Regression (GPR) for predicting wafer and FPGA test data. At the core of our proposed approach is the Short Distance Elimination (SDE) algorithm, which excludes spatially proximate candidate points during sampling, thereby ensuring a more uniform distribution of training data across the physical domain. A parameter sweep was conducted over the (alpha, beta) thresholds, where alpha and beta are in the range {0, 1, 2, 3, 4} and not both zero, to identify the optimal combination that minimizes RMSD. Experimental results on a randomly selected wafer file reveal that (alpha, beta) equal (2, 2) yields the lowest RMSD. Accordingly, all subsequent experiments adopt this parameter configuration. The results demonstrate that the proposed SDE-based strategies enhance predictive accuracy: K-SDE improves upon k-means sampling by 16.26 percent (wafer) and 13.07 percent (FPGA), while S-SDE improves upon stratified sampling by 16.49 percent (wafer) and 8.84 percent (FPGA).
Similarity-based fuzzy clustering scientific articles: potentials and challenges from mathematical and computational perspectives
Huong, Vu Thi, Litzel, Ida, Koch, Thorsten
Fuzzy clustering, which allows an article to belong to multiple clusters with soft membership degrees, plays a vital role in analyzing publication data. This problem can be formulated as a constrained optimization model, where the goal is to minimize the discrepancy between the similarity observed from data and the similarity derived from a predicted distribution. While this approach benefits from leveraging state-of-the-art optimization algorithms, tailoring them to work with real, massive databases like OpenAlex or Web of Science - containing about 70 million articles and a billion citations - poses significant challenges. We analyze potentials and challenges of the approach from both mathematical and computational perspectives. Among other things, second-order optimality conditions are established, providing new theoretical insights, and practical solution methods are proposed by exploiting the structure of the problem. Specifically, we accelerate the gradient projection method using GPU-based parallel computing to efficiently handle large-scale data.
A Foundation Model for Spatial Proteomics
Shaban, Muhammad, Chang, Yuzhou, Qiu, Huaying, Yeo, Yao Yu, Song, Andrew H., Jaume, Guillaume, Wang, Yuchen, Weishaupt, Luca L., Ding, Tong, Vaidya, Anurag, Lamane, Abdallah, Shao, Daniel, Zidane, Mohammed, Bai, Yunhao, McCallum, Paige, Luo, Shuli, Wu, Wenrui, Wang, Yang, Cramer, Precious, Chan, Chi Ngai, Stephan, Pierre, Schaffenrath, Johanna, Lee, Jia Le, Michel, Hendrik A., Tian, Caiwei, Almagro-Perez, Cristina, Wagner, Sophia J., Sahai, Sharifa, Lu, Ming Y., Chen, Richard J., Zhang, Andrew, Gonzales, Mark Edward M., Makky, Ahmad, Lee, Jia-Ying Joey, Cheng, Hao, Ahmar, Nourhan El, Matar, Sayed, Haist, Maximilian, Phillips, Darci, Tan, Yuqi, Nolan, Garry P., Burack, W. Richard, Estes, Jacob D., Liu, Jonathan T. C., Choueiri, Toni K, Agarwal, Neeraj, Barry, Marc, Rodig, Scott J., Le, Long Phi, Gerber, Georg, Schรผrch, Christian M., Theis, Fabian J., Kim, Youn H, Yeong, Joe, Signoretti, Sabina, Howitt, Brooke E., Loo, Lit-Hsin, Ma, Qin, Jiang, Sizun, Mahmood, Faisal
Foundation models have begun to transform image analysis by acting as pretrained generalist backbones that can be adapted to many tasks even when post-training data are limited, yet their impact on spatial proteomics, imaging that maps proteins at single-cell resolution, remains limited. Here, we introduce KRONOS, a foundation model built for spatial proteomics. KRONOS was trained in a self-supervised manner on over 47 million image patches covering 175 protein markers, 16 tissue types, and 8 fluorescence-based imaging platforms. We introduce key architectural adaptations to address the high-dimensional, multi-channel, and heterogeneous nature of multiplex imaging. We demonstrate that KRONOS learns biologically meaningful representations across multiple scales, ranging from cellular and microenvironment to tissue levels, enabling it to address diverse downstream tasks, including cell phenotyping, region classification, and patient stratification. Evaluated across 11 independent cohorts, KRONOS achieves state-of-the-art performance across cell phenotyping, treatment response prediction, and retrieval tasks, and is highly data-efficient. KRONOS also introduces the paradigm of segmentation-free patch-level processing for efficient and scalable spatial proteomics analysis, allowing cross-institutional comparisons, and as an image reverse search engine for spatial patterns.
ARIA: Training Language Agents with Intention-Driven Reward Aggregation
Yang, Ruihan, Zhang, Yikai, Chen, Aili, Wang, Xintao, Yuan, Siyu, Chen, Jiangjie, Yang, Deqing, Xiao, Yanghua
Large language models (LLMs) have enabled agents to perform complex reasoning and decision-making through free-form language interactions. However, in open-ended language action environments (e.g., negotiation or question-asking games), the action space can be formulated as a joint distribution over tokens, resulting in an exponentially large action space. Sampling actions in such a space can lead to extreme reward sparsity, which brings large reward variance, hindering effective reinforcement learning (RL). To address this, we propose ARIA, a method that Aggregates Rewards in Intention space to enable efficient and effective language Agents training. ARIA aims to project natural language actions from the high-dimensional joint token distribution space into a low-dimensional intention space, where semantically similar actions are clustered and assigned shared rewards. This intention-aware reward aggregation reduces reward variance by densifying reward signals, fostering better policy optimization. Extensive experiments demonstrate that ARIA not only significantly reduces policy gradient variance, but also delivers substantial performance gains of an average of 9.95% across four downstream tasks, consistently outperforming offline and online RL baselines.
A novel sensitivity analysis method for agent-based models stratifies in-silico tumor spheroid simulations
Rohr, Edward H., Nardini, John T.
Agent-based models (ABMs) are widely used in biology to understand how individual actions scale into emergent population behavior. Modelers employ sensitivity analysis (SA) algorithms to quantify input parameters' impact on model outputs, however, it is hard to perform SA for ABMs due to their computational and complex nature. In this work, we develop the Simulate, Summarize, Reduce, Cluster, and Analyze (SSRCA) methodology, a machine-learning based pipeline designed to facilitate SA for ABMs. In particular, SSRCA can achieve the following tasks for ABMS: 1) identify sensitive model parameters, 2) reveal common output model patterns, and 3) determine which input parameter values generate these patterns. We use an example ABM of tumor spheroid growth to showcase how SSRCA provides similar SA results to the popular Sobol' Method while also identifying four common patterns from the ABM and the parameter regions that generate these outputs. This analysis could streamline data-driven tasks, such as parameter estimation, for ABMs by reducing parameter space. While we highlight these results with an ABM on tumor spheroid formation, the SSRCA methodology is broadly applicable to biological ABMs.
A Dynamic Framework for Semantic Grouping of Common Data Elements (CDE) Using Embeddings and Clustering
Krishnamurthy, Madan, Korn, Daniel, Haendel, Melissa A, Mungall, Christopher J, Thessen, Anne E
This research aims to develop a dynamic and scalable framework to facilitate harmonization of Common Data Elements (CDEs) across heterogeneous biomedical datasets by addressing challenges such as semantic heterogeneity, structural variability, and context dependence to streamline integration, enhance interoperability, and accelerate scientific discovery. Our methodology leverages Large Language Models (LLMs) for context-aware text embeddings that convert CDEs into dense vectors capturing semantic relationships and patterns. These embeddings are clustered using Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) to group semantically similar CDEs. The framework incorporates four key steps: (1) LLM-based text embedding to mathematically represent semantic context, (2) unsupervised clustering of embeddings via HDBSCAN, (3) automated labeling using LLM summarization, and (4) supervised learning to train a classifier assigning new or unclustered CDEs to labeled clusters. Evaluated on the NIH NLM CDE Repository with over 24,000 CDEs, the system identified 118 meaningful clusters at an optimized minimum cluster size of 20. The classifier achieved 90.46 percent overall accuracy, performing best in larger categories. External validation against Gravity Projects Social Determinants of Health domains showed strong agreement (Adjusted Rand Index 0.52, Normalized Mutual Information 0.78), indicating that embeddings effectively capture cluster characteristics. This adaptable and scalable approach offers a practical solution to CDE harmonization, improving selection efficiency and supporting ongoing data interoperability.
Enhancing Interpretability of Quantum-Assisted Blockchain Clustering via AI Agent-Based Qualitative Analysis
Tsai, Yun-Cheng, Liu, Yen-Ku, Chen, Samuel Yen-Chi
Blockchain transaction data is inherently high dimensional, noisy, and entangled, posing substantial challenges for traditional clustering algorithms. While quantum enhanced clustering models have demonstrated promising performance gains, their interpretability remains limited, restricting their application in sensitive domains such as financial fraud detection and blockchain governance. To address this gap, we propose a two stage analysis framework that synergistically combines quantitative clustering evaluation with AI Agent assisted qualitative interpretation. In the first stage, we employ classical clustering methods and evaluation metrics including the Silhouette Score, Davies Bouldin Index, and Calinski Harabasz Index to determine the optimal cluster count and baseline partition quality. In the second stage, we integrate an AI Agent to generate human readable, semantic explanations of clustering results, identifying intra cluster characteristics and inter cluster relationships. Our experiments reveal that while fully trained Quantum Neural Networks (QNN) outperform random Quantum Features (QF) in quantitative metrics, the AI Agent further uncovers nuanced differences between these methods, notably exposing the singleton cluster phenomenon in QNN driven models. The consolidated insights from both stages consistently endorse the three cluster configuration, demonstrating the practical value of our hybrid approach. This work advances the interpretability frontier in quantum assisted blockchain analytics and lays the groundwork for future autonomous AI orchestrated clustering frameworks.