Agents
Exploring Advanced LLM Multi-Agent Systems Based on Blackboard Architecture
In this paper, we propose to incorporate the blackboard architecture into LLM multi-agent systems (MASs) so that (1) agents with various roles can share all the information and others' messages during the whole problem-solving process, (2) agents that will take actions are selected based on the current content of the blackboard, and (3) the selection and execution round is repeated until a consensus is reached on the blackboard. We develop the first implementation of this proposal and conduct experiments on commonsense knowledge, reasoning and mathematical datasets. The results show that our system can be competitive with the SOTA static and dynamic MASs by achieving the best average performance, and at the same time manage to spend less tokens. Our proposal has the potential to enable complex and dynamic problem-solving where well-defined structures or workflows are unavailable.
MALIBU Benchmark: Multi-Agent LLM Implicit Bias Uncovered
Mirza, Imran, Huang, Cole, Vasista, Ishwara, Patil, Rohan, Akalin, Asli, O'Brien, Sean, Zhu, Kevin
Multi-agent systems, which consist of multiple AI models interacting within a shared environment, are increasingly used for persona-based interactions. However, if not carefully designed, these systems can reinforce implicit biases in large language models (LLMs), raising concerns about fairness and equitable representation. We present MALIBU, a novel benchmark developed to assess the degree to which LLM-based multi-agent systems implicitly reinforce social biases and stereotypes. MALIBU evaluates bias in LLM-based multi-agent systems through scenario-based assessments. AI models complete tasks within predefined contexts, and their responses undergo evaluation by an LLM-based multi-agent judging system in two phases. In the first phase, judges score responses labeled with specific demographic personas (e.g., gender, race, religion) across four metrics. In the second phase, judges compare paired responses assigned to different personas, scoring them and selecting the superior response. Our study quantifies biases in LLM-generated outputs, revealing that bias mitigation may favor marginalized personas over true neutrality, emphasizing the need for nuanced detection, balanced fairness strategies, and transparent evaluation benchmarks in multi-agent systems.
DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues
Jang, Kyochul, Lee, Donghyeon, Kim, Kyusik, Heo, Dongseok, Lee, Taewhoo, Kim, Woojeong, Suh, Bongwon
Existing function-calling benchmarks focus on single-turn interactions. However, they overlook the complexity of real-world scenarios. To quantify how existing benchmarks address practical applications, we introduce DICE-SCORE, a metric that evaluates the dispersion of tool-related information such as function name and parameter values throughout the dialogue. Analyzing existing benchmarks through DICE-SCORE reveals notably low scores, highlighting the need for more realistic scenarios. To address this gap, we present DICE-BENCH, a framework that constructs practical function-calling datasets by synthesizing conversations through a tool graph that maintains dependencies across rounds and a multi-agent system with distinct personas to enhance dialogue naturalness. The final dataset comprises 1,607 high-DICE-SCORE instances. Our experiments on 19 LLMs with DICE-BENCH show that significant advances are still required before such models can be deployed effectively in real-world settings. Our code and data are all publicly available: https://snuhcc.github.io/DICE-Bench/.
Sequential Diagnosis with Language Models
Nori, Harsha, Daswani, Mayank, Kelly, Christopher, Lundberg, Scott, Ribeiro, Marco Tulio, Wilson, Marc, Liu, Xiaoxuan, Sounderajah, Viknesh, Carlson, Jonathan, Lungren, Matthew P, Gross, Bay, Hames, Peter, Suleyman, Mustafa, King, Dominic, Horvitz, Eric
Artificial intelligence holds great promise for expanding access to expert medical knowledge and reasoning. However, most evaluations of language models rely on static vignettes and multiple-choice questions that fail to reflect the complexity and nuance of evidence-based medicine in real-world settings. In clinical practice, physicians iteratively formulate and revise diagnostic hypotheses, adapting each subsequent question and test to what they've just learned, and weigh the evolving evidence before committing to a final diagnosis. To emulate this iterative process, we introduce the Sequential Diagnosis Benchmark, which transforms 304 diagnostically challenging New England Journal of Medicine clinicopathological conference (NEJM-CPC) cases into stepwise diagnostic encounters. A physician or AI begins with a short case abstract and must iteratively request additional details from a gatekeeper model that reveals findings only when explicitly queried. Performance is assessed not just by diagnostic accuracy but also by the cost of physician visits and tests performed. We also present the MAI Diagnostic Orchestrator (MAI-DxO), a model-agnostic orchestrator that simulates a panel of physicians, proposes likely differential diagnoses and strategically selects high-value, cost-effective tests. When paired with OpenAI's o3 model, MAI-DxO achieves 80% diagnostic accuracy--four times higher than the 20% average of generalist physicians. MAI-DxO also reduces diagnostic costs by 20% compared to physicians, and 70% compared to off-the-shelf o3. When configured for maximum accuracy, MAI-DxO achieves 85.5% accuracy. These performance gains with MAI-DxO generalize across models from the OpenAI, Gemini, Claude, Grok, DeepSeek, and Llama families. We highlight how AI systems, when guided to think iteratively and act judiciously, can advance diagnostic precision and cost-effectiveness in clinical care.
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
Zhong, Yifan, Bai, Fengshuo, Cai, Shaofei, Huang, Xuchuan, Chen, Zhang, Zhang, Xiaowei, Wang, Yuanfei, Guo, Shaoyang, Guan, Tianrui, Lui, Ka Nam, Qi, Zhiquan, Liang, Yitao, Chen, Yuanpei, Yang, Yaodong
The remarkable advancements of vision and language foundation models in multimodal understanding, reasoning, and generation has sparked growing efforts to extend such intelligence to the physical world, fueling the flourishing of vision-language-action (VLA) models. Despite seemingly diverse approaches, we observe that current VLA models can be unified under a single framework: vision and language inputs are processed by a series of VLA modules, producing a chain of \textit{action tokens} that progressively encode more grounded and actionable information, ultimately generating executable actions. We further determine that the primary design choice distinguishing VLA models lies in how action tokens are formulated, which can be categorized into language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning. However, there remains a lack of comprehensive understanding regarding action tokens, significantly impeding effective VLA development and obscuring future directions. Therefore, this survey aims to categorize and interpret existing VLA research through the lens of action tokenization, distill the strengths and limitations of each token type, and identify areas for improvement. Through this systematic review and analysis, we offer a synthesized outlook on the broader evolution of VLA models, highlight underexplored yet promising directions, and contribute guidance for future research, hoping to bring the field closer to general-purpose intelligence.
Agent Ideate: A Framework for Product Idea Generation from Patents Using Agentic AI
Kanumolu, Gopichand, Urlana, Ashok, Kumar, Charaka Vinayak, Garlapati, Bala Mallikarjunarao
Patents contain rich technical knowledge that can inspire innovative product ideas, yet accessing and interpreting this information remains a challenge. This work explores the use of Large Language Models (LLMs) and autonomous agents to mine and generate product concepts from a given patent. In this work, we design Agent Ideate, a framework for automatically generating product-based business ideas from patents. We experimented with open-source LLMs and agent-based architectures across three domains: Computer Science, Natural Language Processing, and Material Chemistry. Evaluation results show that the agentic approach consistently outperformed standalone LLMs in terms of idea quality, relevance, and novelty. These findings suggest that combining LLMs with agentic workflows can significantly enhance the innovation pipeline by unlocking the untapped potential of business idea generation from patent data.
Rethinking the Illusion of Thinking
Varela, Iรฑaki Dellibarda, Romero-Sorozabal, Pablo, Rocon, Eduardo, Cebrian, Manuel
Earlier this year, Apple ignited controversy by publishing "The Illusion of Thinking," prompting heated debate within the AI community. Critics seized upon the findings as conclusive evidence that Large Reasoning Models (LRMs) lack genuine reasoning capabilities, branding them as mere stochastic parrots. Meanwhile, defenders-spearheaded by Lawsen et al. (2025)-fired back, condemning the experimental setup as flawed and the conclusions overstated. We clarify this debate by replicating and refining two of the original study's most contentious benchmarks: Towers of Hanoi and River Crossing. By introducing incremental stepwise prompting and agentic collaborative dialogue, we show that previously reported failures solving the Towers of Hanoi were not purely result of output constraints, but also partly a result of cognition limitations: LRMs still stumble when complexity rises moderately (around 8 disks). Moreover, the River Crossing results initially heralded as catastrophic failures turn out to hinge upon testing unsolvable configurations. Once we limit tests strictly to solvable problems-LRMs effortlessly solve large instances involving over 100 agent pairs. Our findings ultimately defy simplistic narratives: today's LRMs are stochastic, RL-tuned searchers in a discrete state space we barely understand. Real progress in symbolic, long-horizon reasoning demands mapping that terrain through fine-grained ablations like those introduced here.
HPC-AI Coupling Methodology for Scientific Applications
Lu, Yutong, Huang, Dan, Chen, Pin
Artificial intelligence (AI) technologies have fundamentally transformed numerical-based high-performance computing (HPC) applications with data-driven approaches and endeavored to address existing challenges, e.g. high computational intensity, in various scientific domains. In this study, we explore the scenarios of coupling HPC and AI (HPC-AI) in the context of emerging scientific applications, presenting a novel methodology that incorporates three patterns of coupling: surrogate, directive, and coordinate. Each pattern exemplifies a distinct coupling strategy, AI-driven prerequisite, and typical HPC-AI ensembles. Through case studies in materials science, we demonstrate the application and effectiveness of these patterns. The study highlights technical challenges, performance improvements, and implementation details, providing insight into promising perspectives of HPC-AI coupling. The proposed coupling patterns are applicable not only to materials science but also to other scientific domains, offering valuable guidance for future HPC-AI ensembles in scientific discovery.
Workflow-Based Evaluation of Music Generation Systems
Dadman, Shayan, Bremdal, Bernt Arild, Bergsland, Andreas
This study presents an exploratory evaluation of Music Generation Systems (MGS) within contemporary music production workflows by examining eight open-source systems. The evaluation framework combines technical insights with practical experimentation through criteria specifically designed to investigate the practical and creative affordances of the systems within the iterative, non-linear nature of music production. Employing a single-evaluator methodology as a preliminary phase, this research adopts a mixed approach utilizing qualitative methods to form hypotheses subsequently assessed through quantitative metrics. The selected systems represent architectural diversity across both symbolic and audio-based music generation approaches, spanning composition, arrangement, and sound design tasks. The investigation addresses limitations of current MGS in music production, challenges and opportunities for workflow integration, and development potential as collaborative tools while maintaining artistic authenticity. Findings reveal these systems function primarily as complementary tools enhancing rather than replacing human expertise. They exhibit limitations in maintaining thematic and structural coherence that emphasize the indispensable role of human creativity in tasks demanding emotional depth and complex decision-making. This study contributes a structured evaluation framework that considers the iterative nature of music creation. It identifies methodological refinements necessary for subsequent comprehensive evaluations and determines viable areas for AI integration as collaborative tools in creative workflows. The research provides empirically-grounded insights to guide future development in the field.
Thinking Fast and Slow in Human and Machine Intelligence
Human intelligence has generally been studied by focusing on two primary levels: cognitive science, which examines the mind, and neuroscience, which focuses on the brain. Both approaches have influenced artificial intelligence (AI) research, leading to the development of various cognitive architectures with emergent behaviors.23 In this article, we propose an approach inspired by human cognition, specifically drawing on cognitive theories about human reasoning and decision making. We are inspired by the book Thinking, Fast and Slow by Daniel Kahneman,20 which categorizes human thought processes into two systems: System 1 (fast thinking) and System 2 (slow thinking).37 System 1, or "thinking fast," is responsible for intuitive, quick, and often unconscious decisions.