Agents
EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow
Pan, Xiaoyu, Bai, Yang, Zou, Ke, Zhou, Yang, Zhou, Jun, Fu, Huazhu, Tham, Yih-Chung, Liu, Yong
Medical Large Language Models (MLLMs) play a crucial role in ophthalmic diagnosis, holding significant potential to address vision-threatening diseases. However, their accuracy is constrained by hallucinations stemming from limited ophthalmic knowledge, insufficient visual localization and reasoning capabilities, and a scarcity of multimodal ophthalmic data, which collectively impede precise lesion detection and disease diagnosis. Furthermore, existing medical benchmarks fail to effectively evaluate various types of hallucinations or provide actionable solutions to mitigate them. To address the above challenges, we introduce EH-Benchmark, a novel ophthalmology benchmark designed to evaluate hallucinations in MLLMs. We categorize MLLMs' hallucinations based on specific tasks and error types into two primary classes: Visual Understanding and Logical Composition, each comprising multiple subclasses. Given that MLLMs predominantly rely on language-based reasoning rather than visual processing, we propose an agent-centric, three-phase framework, including the Knowledge-Level Retrieval stage, the Task-Level Case Studies stage, and the Result-Level Validation stage. Experimental results show that our multi-agent framework significantly mitigates both types of hallucinations, enhancing accuracy, interpretability, and reliability. Our project is available at https://github.com/ppxy1/EH-Benchmark.
The challenge of hidden gifts in multi-agent reinforcement learning
Malenfant, Dane, Richards, Blake A.
Sometimes we benefit from actions that others have taken even when we are unaware that they took those actions. For example, if your neighbor chooses not to take a parking spot in front of your house when you are not there, you can benefit, even without being aware that they took this action. These ``hidden gifts'' represent an interesting challenge for multi-agent reinforcement learning (MARL), since assigning credit when the beneficial actions of others are hidden is non-trivial. Here, we study the impact of hidden gifts with a very simple MARL task. In this task, agents in a grid-world environment have individual doors to unlock in order to obtain individual rewards. As well, if all the agents unlock their door the group receives a larger collective reward. However, there is only one key for all of the doors, such that the collective reward can only be obtained when the agents drop the key for others after they use it. Notably, there is nothing to indicate to an agent that the other agents have dropped the key, thus this act for others is a ``hidden gift''. We show that several different state-of-the-art MARL algorithms, including MARL specific architectures, fail to learn how to obtain the collective reward in this simple task. Interestingly, we find that decentralized actor-critic policy gradient agents can succeed when we provide them with information about their own action history, but MARL agents still cannot solve the task with action history. Finally, we derive a correction term for policy gradient agents, inspired by learning aware approaches, which reduces the variance in learning and helps them to converge to collective success more reliably. These results show that credit assignment in multi-agent settings can be particularly challenging in the presence of ``hidden gifts'', and demonstrate that self learning-awareness in decentralized agents can benefit these settings.
R&D-Agent: An LLM-Agent Framework Towards Autonomous Data Science
Yang, Xu, Yang, Xiao, Fang, Shikai, Zhang, Yifei, Wang, Jian, Xian, Bowen, Li, Qizheng, Li, Jingyuan, Xu, Minrui, Li, Yuante, Pan, Haoran, Zhang, Yuge, Liu, Weiqing, Shen, Yelong, Chen, Weizhu, Bian, Jiang
Recent advances in AI and ML have transformed data science, yet increasing complexity and expertise requirements continue to hinder progress. Although crowd-sourcing platforms alleviate some challenges, high-level machine learning engineering (MLE) tasks remain labor-intensive and iterative. We introduce R&D-Agent, a comprehensive, decoupled, and extensible framework that formalizes the MLE process. R&D-Agent defines the MLE workflow into two phases and six components, turning agent design for MLE from ad-hoc craftsmanship into a principled, testable process. Although several existing agents report promising gains on their chosen components, they can mostly be summarized as a partial optimization from our framework's simple baseline. Inspired by human experts, we designed efficient and effective agents within this framework that achieve state-of-the-art performance. Evaluated on MLE-Bench, the agent built on R&D-Agent ranks as the top-performing machine learning engineering agent, achieving 35.1% any medal rate, demonstrating the ability of the framework to speed up innovation and improve accuracy across a wide range of data science applications. We have open-sourced R&D-Agent on GitHub: https://github.com/microsoft/RD-Agent.
GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
Luo, Run, Wang, Lu, He, Wanwei, Chen, Longze, Li, Jiaming, Xia, Xiaobo
Existing efforts in building graphical user interface (GUI) agents largely rely on the training paradigm of supervised fine-tuning (SFT) on large vision-language models (L VLMs). However, this approach not only demands extensive amounts of training data but also struggles to effectively understand GUI screenshots and generalize to unseen interfaces. The issue significantly limits its application in real-world scenarios, especially for high-level tasks. Inspired by reinforcement fine-tuning (RFT) in large reasoning models (e.g., DeepSeek-R1), which efficiently enhances the problem-solving capabilities of large language models in real-world settings, we propose GUI-R1, the first reinforcement learning framework designed to enhance the GUI capabilities of L VLMs in high-level real-world task scenarios, through unified action space rule modeling. By leveraging a small amount of carefully curated high-quality data across multiple platforms (including Windows, Linux, MacOS, Android, and Web) and employing policy optimization algorithms such as group relative policy optimization (GRPO) to update the model, GUI-R1 achieves superior performance using only 0.02% of the data (3K vs. 13M) compared to previous state-of-the-art methods like OS-Atlas across eight benchmarks spanning three different platforms (mobile, desktop, and web). These results demonstrate the immense potential of reinforcement learning based on unified action space rule modeling in improving the execution capabilities of L VLMs for real-world GUI agent tasks.
Partial Resilient Leader-Follower Consensus in Time-Varying Graphs
Lee, Haejoon, Panagou, Dimitra
Existing approaches typically require robustness conditions of the entire network to guarantee resilient consensus. However, the behavior of such systems when these conditions are not fully met remains unexplored. T o address this gap, we introduce the notion of partial leader-follower consensus, in which a subset of non-adversarial followers successfully tracks the leader's reference state despite insufficient robustness. We propose a novel distributed algorithm -- the Bootstrap Percolation and Mean Subsequence Reduced (BP-MSR) algorithm -- and establish sufficient conditions for individual followers to achieve consensus via the BP-MSR algorithm in arbitrary time-varying graphs. We validate our findings through simulations, demonstrating that our method guarantees partial leader-follower consensus, even when standard resilient consensus algorithms fail.
Non-submodular Visual Attention for Robot Navigation
Vafaee, Reza, Behzad, Kian, Siami, Milad, Carlone, Luca, Jadbabaie, Ali
This paper presents a task-oriented computational framework to enhance Visual-Inertial Navigation (VIN) in robots, addressing challenges such as limited time and energy resources. The framework strategically selects visual features using a Mean Squared Error (MSE)-based, non-submodular objective function and a simplified dynamic anticipation model. To address the NP-hardness of this problem, we introduce four polynomial-time approximation algorithms: a classic greedy method with constant-factor guarantees; a low-rank greedy variant that significantly reduces computational complexity; a randomized greedy sampler that balances efficiency and solution quality; and a linearization-based selector based on a first-order Taylor expansion for near-constant-time execution. We establish rigorous performance bounds by leveraging submodularity ratios, curvature, and element-wise curvature analyses. Extensive experiments on both standardized benchmarks and a custom control-aware platform validate our theoretical results, demonstrating that these methods achieve strong approximation guarantees while enabling real-time deployment.
Target Population Synthesis using CT-GAN
Rastogi, Tanay, Jonsson, Daniel
Agent-based models used in scenario planning for transportation and urban planning usually require detailed population information from the base as well as target scenarios. These populations are usually provided by synthesizing fake agents through deterministic population synthesis methods. However, these deterministic population synthesis methods face several challenges, such as handling high-dimensional data, scalability, and zero-cell issues, particularly when generating populations for target scenarios. This research looks into how a deep generative model called Conditional Tabular Generative Adversarial Network (CT-GAN) can be used to create target populations either directly from a collection of marginal constraints or through a hybrid method that combines CT-GAN with Fitness-based Synthesis Combinatorial Optimization (FBS-CO). The research evaluates the proposed population synthesis models against travel survey and zonal-level aggregated population data. Results indicate that the stand-alone CT-GAN model performs the best when compared with FBS-CO and the hybrid model. CT-GAN by itself can create realistic-looking groups that match single-variable distributions, but it struggles to maintain relationships between multiple variables. However, the hybrid model demonstrates improved performance compared to FBS-CO by leveraging CT-GAN ability to generate a descriptive base population, which is then refined using FBS-CO to align with target-year marginals. This study demonstrates that CT-GAN represents an effective methodology for target populations and highlights how deep generative models can be successfully integrated with conventional synthesis techniques to enhance their performance.
Benchmarking Agentic Systems in Automated Scientific Information Extraction with ChemX
Vepreva, Anastasia, Razlivina, Julia, Eremeeva, Maria, Gubina, Nina, Orlova, Anastasia, Dmitrenko, Aleksei, Kapranova, Ksenya, Jyakhwo, Susan, Vasilev, Nikita, Sarkisyan, Arsen, Chernyshov, Ivan Yu., Vinogradov, Vladimir, Dmitrenko, Andrei
The emergence of agent-based systems represents a significant advancement in artificial intelligence, with growing applications in automated data extraction. However, chemical information extraction remains a formidable challenge due to the inherent heterogeneity of chemical data. Current agent-based approaches, both general-purpose and domain-specific, exhibit limited performance in this domain. To address this gap, we present ChemX, a comprehensive collection of 10 manually curated and domain-expert-validated datasets focusing on nanomaterials and small molecules. These datasets are designed to rigorously evaluate and enhance automated extraction methodologies in chemistry. To demonstrate their utility, we conduct an extensive benchmarking study comparing existing state-of-the-art agentic systems such as ChatGPT Agent and chemical-specific data extraction agents. Additionally, we introduce our own single-agent approach that enables precise control over document preprocessing prior to extraction. We further evaluate the performance of modern baselines, such as GPT-5 and GPT-5 Thinking, to compare their capabilities with agentic approaches. Our empirical findings reveal persistent challenges in chemical information extraction, particularly in processing domain-specific terminology, complex tabular and schematic representations, and context-dependent ambiguities. The ChemX benchmark serves as a critical resource for advancing automated information extraction in chemistry, challenging the generalization capabilities of existing methods, and providing valuable insights into effective evaluation strategies.
Stochastic Self-Organization in Multi-Agent Systems
Tastan, Nurbek, Horvath, Samuel, Nandakumar, Karthik
Multi-agent systems (MAS) based on Large Language Models (LLMs) have the potential to solve tasks that are beyond the reach of any single LLM. However, this potential can only be realized when the collaboration mechanism between agents is optimized. Specifically, optimizing the communication structure between agents is critical for fruitful collaboration. Most existing approaches rely on fixed topologies, pretrained graph generators, optimization over edges, or employ external LLM judges, thereby adding to the complexity. In this work, we introduce a response-conditioned framework that adapts communication on-the-fly. Agents independently generate responses to the user query and assess peer contributions using an approximation of the Shapley value. A directed acyclic graph (DAG) is then constructed to regulate the propagation of the responses among agents, which ensures stable and efficient message transmission from high-contributing agents to others. This graph is dynamically updated based on the agent responses from the previous collaboration round. Since the proposed framework enables the self-organization of agents without additional supervision or training, we refer to it as SelfOrg. The SelfOrg framework goes beyond task- and query-level optimization and takes into account the stochastic nature of agent responses. Experiments with both strong and weak LLM backends demonstrate robust performance, with significant gains in the weak regime where prior methods collapse. We also theoretically show that multiple agents increase the chance of correctness and that the correct responses naturally dominate the information flow.
HARPA: A Testability-Driven, Literature-Grounded Framework for Research Ideation
Vasu, Rosni, Jansen, Peter, Siangliulue, Pao, Sarasua, Cristina, Bernstein, Abraham, Clark, Peter, Mishra, Bhavana Dalvi
While there has been a surge of interest in automated scientific discovery (ASD), especially with the emergence of LLMs, it remains challenging for tools to generate hypotheses that are both testable and grounded in the scientific literature. Additionally, existing ideation tools are not adaptive to prior experimental outcomes. We developed HARPA to address these challenges by incorporating the ideation workflow inspired by human researchers. HARPA first identifies emerging research trends through literature mining, then explores hypothesis design spaces, and finally converges on precise, testable hypotheses by pinpointing research gaps and justifying design choices. Our evaluations show that HARPA-generated hypothesis-driven research proposals perform comparably to a strong baseline AI-researcher across most qualitative dimensions (e.g., specificity, novelty, overall quality), but achieve significant gains in feasibility(+0.78, p$<0.05$, bootstrap) and groundedness (+0.85, p$<0.01$, bootstrap) on a 10-point Likert scale. When tested with the ASD agent (CodeScientist), HARPA produced more successful executions (20 vs. 11 out of 40) and fewer failures (16 vs. 21 out of 40), showing that expert feasibility judgments track with actual execution success. Furthermore, to simulate how researchers continuously refine their understanding of what hypotheses are both testable and potentially interesting from experience, HARPA learns a reward model that scores new hypotheses based on prior experimental outcomes, achieving approx. a 28\% absolute gain over HARPA's untrained baseline scorer. Together, these methods represent a step forward in the field of AI-driven scientific discovery.