Goto

Collaborating Authors

 Agents


Enhancing Agentic Autonomous Scientific Discovery with Vision-Language Model Capabilities

arXiv.org Artificial Intelligence

We show that multi-agent systems guided by vision-language models (VLMs) improve end-to-end autonomous scientific discovery. By treating plots as verifiable checkpoints, a VLM-as-a-judge evaluates figures against dynamically generated domain-specific rubrics, enabling agents to correct their own errors and steer exploratory data analysis in real-time. Case studies in cosmology and astrochemistry demonstrate recovery from faulty reasoning paths and adaptation to new datasets without human intervention. On a 10-task benchmark for data-driven discovery, VLM-augmented systems achieve pass at 1 scores of 0.7-0.8, compared to 0.2-0.3 for code-only and 0.4-0.5 for code-and-text baselines, while also providing auditable reasoning traces that improve interpretability. Code available here: https://github.com/CMBAgents/cmbagent


Safe-ROS: An Architecture for Autonomous Robots in Safety-Critical Domains

arXiv.org Artificial Intelligence

Deploying autonomous robots in safety-critical domains requires architectures that ensure operational effectiveness and safety compliance. In this paper, we contribute the Safe-ROS architecture for developing reliable and verifiable autonomous robots in such domains. It features two distinct subsystems: (1) an intelligent control system that is responsible for normal/routine operations, and (2) a Safety System consisting of Safety Instrumented Functions (SIFs) that provide formally verifiable independent oversight. We demonstrate Safe-ROS on an AgileX Scout Mini robot performing autonomous inspection in a nuclear environment. One safety requirement is selected and instantiated as a SIF. To support verification, we implement the SIF as a cognitive agent, programmed to stop the robot whenever it detects that it is too close to an obstacle. We verify that the agent meets the safety requirement and integrate it into the autonomous inspection. This integration is also verified, and the full deployment is validated in a Gazebo simulation, and lab testing. We evaluate this architecture in the context of the UK nuclear sector, where safety and regulation are crucial aspects of deployment. Success criteria include the development of a formal property from the safety requirement, implementation, and verification of the SIF, and the integration of the SIF into the operational robotic autonomous system. Our results demonstrate that the Safe-ROS architecture can provide safety verifiable oversight while deploying autonomous robots in safety-critical domains, offering a robust framework that can be extended to additional requirements and various applications.


Identifying Time-varying Costs in Finite-horizon Linear Quadratic Gaussian Games

arXiv.org Artificial Intelligence

We address cost identification in a finite-horizon linear quadratic Gaussian game. We characterize the set of cost parameters that generate a given Nash equilibrium policy. We propose a backpropagation algorithm to identify the time-varying cost parameters. We derive a probabilistic error bound when the cost parameters are identified from finite trajectories. We test our method in numerical and driving simulations. Our algorithm identifies the cost parameters that can reproduce the Nash equilibrium policy and trajectory observations.


Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems

arXiv.org Artificial Intelligence

Current agentic AI benchmarks predominantly evaluate task completion accuracy, while overlooking critical enterprise requirements such as cost-efficiency, reliability, and operational stability. Through systematic analysis of 12 main benchmarks and empirical evaluation of state-of-the-art agents, we identify three fundamental limitations: (1) absence of cost-controlled evaluation leading to 50x cost variations for similar precision, (2) inadequate reliability assessment where agent performance drops from 60\% (single run) to 25\% (8-run consistency), and (3) missing multidimensional metrics for security, latency, and policy compliance. We propose \textbf{CLEAR} (Cost, Latency, Efficacy, Assurance, Reliability), a holistic evaluation framework specifically designed for enterprise deployment. Evaluation of six leading agents on 300 enterprise tasks demonstrates that optimizing for accuracy alone yields agents 4.4-10.8x more expensive than cost-aware alternatives with comparable performance. Expert evaluation (N=15) confirms that CLEAR better predicts production success (correlation $ρ=0.83$) compared to accuracy-only evaluation ($ρ=0.41$).


Fair-GNE : Generalized Nash Equilibrium-Seeking Fairness in Multiagent Healthcare Automation

arXiv.org Artificial Intelligence

Enforcing a fair workload allocation among multiple agents tasked to achieve an objective in learning enabled demand side healthcare worker settings is crucial for consistent and reliable performance at runtime. Existing multi-agent reinforcement learning (MARL) approaches steer fairness by shaping reward through post hoc orchestrations, leaving no certifiable self-enforceable fairness that is immutable by individual agents at runtime. Contextualized within a setting where each agent shares resources with others, we address this shortcoming with a learning enabled optimization scheme among self-interested decision makers whose individual actions affect those of other agents. This extends the problem to a generalized Nash equilibrium (GNE) game-theoretic framework where we steer group policy to a safe and locally efficient equilibrium, so that no agent can improve its utility function by unilaterally changing its decisions. Fair-GNE models MARL as a constrained generalized Nash equilibrium-seeking (GNE) game, prescribing an ideal equitable collective equilibrium within the problem's natural fabric. Our hypothesis is rigorously evaluated in our custom-designed high-fidelity resuscitation simulator. Across all our numerical experiments, Fair-GNE achieves significant improvement in workload balance over fixed-penalty baselines (0.89 vs.\ 0.33 JFI, $p < 0.01$) while maintaining 86\% task success, demonstrating statistically significant fairness gains through adaptive constraint enforcement. Our results communicate our formulations, evaluation metrics, and equilibrium-seeking innovations in large multi-agent learning-based healthcare systems with clarity and principled fairness enforcement.


APD-Agents: A Large Language Model-Driven Multi-Agents Collaborative Framework for Automated Page Design

arXiv.org Artificial Intelligence

Layout design is a crucial step in developing mobile app pages. However, crafting satisfactory designs is time-intensive for designers: they need to consider which controls and content to present on the page, and then repeatedly adjust their size, position, and style for better aesthetics and structure. Although many design software can now help to perform these repetitive tasks, extensive training is needed to use them effectively. Moreover, collaborative design across app pages demands extra time to align standards and ensure consistent styling. In this work, we propose APD-agents, a large language model (LLM) driven multi-agent framework for automated page design in mobile applications. Our framework contains OrchestratorAgent, SemanticParserAgent, PrimaryLayoutAgent, TemplateRetrievalAgent, and RecursiveComponentAgent. Upon receiving the user's description of the page, the OrchestratorAgent can dynamically can direct other agents to accomplish users' design task. To be specific, the SemanticParserAgent is responsible for converting users' descriptions of page content into structured data. The PrimaryLayoutAgent can generate an initial coarse-grained layout of this page. The TemplateRetrievalAgent can fetch semantically relevant few-shot examples and enhance the quality of layout generation. Besides, a RecursiveComponentAgent can be used to decide how to recursively generate all the fine-grained sub-elements it contains for each element in the layout. Our work fully leverages the automatic collaboration capabilities of large-model-driven multi-agent systems. Experimental results on the RICO dataset show that our APD-agents achieve state-of-the-art performance.


Collaborative QA using Interacting LLMs. Impact of Network Structure, Node Capability and Distributed Data

arXiv.org Artificial Intelligence

In this paper, we model and analyze how a network of interacting LLMs performs collaborative question-answering (CQA) in order to estimate a ground truth given a distributed set of documents. This problem is interesting because LLMs often hallucinate when direct evidence to answer a question is lacking, and these effects become more pronounced in a network of interacting LLMs. The hallucination spreads, causing previously accurate LLMs to hallucinate. We study interacting LLMs and their hallucination by combining novel ideas of mean-field dynamics (MFD) from network science and the randomized utility model from economics to construct a useful generative model. We model the LLM with a latent state that indicates if it is truthful or not with respect to the ground truth, and extend a tractable analytical model considering an MFD to model the diffusion of information in a directed network of LLMs. To specify the probabilities that govern the dynamics of the MFD, we propose a randomized utility model. For a network of LLMs, where each LLM has two possible latent states, we posit sufficient conditions for the existence and uniqueness of a fixed point and analyze the behavior of the fixed point in terms of the incentive (e.g., test-time compute) given to individual LLMs. We experimentally study and analyze the behavior of a network of $100$ open-source LLMs with respect to data heterogeneity, node capability, network structure, and sensitivity to framing on multiple semi-synthetic datasets.


AISAC: An Integrated multi-agent System for Transparent, Retrieval-Grounded Scientific Assistance

arXiv.org Artificial Intelligence

AI Scientific Assistant Core (AISAC) is an integrated multi-agent system developed at Argonne National Laboratory for scientific and engineering workflows. AISAC builds on established technologies - LangGraph for orchestration, FAISS for vector search, and SQLite for persistence - and integrates them into a unified system prototype focused on transparency, provenance tracking, and scientific adaptability. The system implements a Router-Planner-Coordinator workflow and an optional Evaluator role, using prompt-engineered agents coordinated via LangGraph's StateGraph and supported by helper agents such as a Researcher. Each role is defined through custom system prompts that enforce structured JSON outputs. A hybrid memory approach (FAISS + SQLite) enables both semantic retrieval and structured conversation history. An incremental indexing strategy based on file hashing minimizes redundant re-embedding when scientific corpora evolve. A configuration-driven project bootstrap layer allows research teams to customize tools, prompts, and data sources without modifying core code. All agent decisions, tool invocations, and retrievals are logged and visualized through a custom Gradio interface, providing step-by-step transparency for each reasoning episode. The authors have applied AISAC to multiple research areas at Argonne, including specialized deployments for waste-to-products research and energy process safety, as well as general-purpose scientific assistance, demonstrating its cross-domain applicability.


FACA: Fair and Agile Multi-Robot Collision Avoidance in Constrained Environments with Dynamic Priorities

arXiv.org Artificial Intelligence

Multi-robot systems are increasingly being used for critical applications such as rescuing injured people, delivering food and medicines, and monitoring key areas. These applications usually involve navigating at high speeds through constrained spaces such as small gaps. Navigating such constrained spaces becomes particularly challenging when the space is crowded with multiple heterogeneous agents all of which have urgent priorities. What makes the problem even harder is that during an active response situation, roles and priorities can quickly change on a dime without informing the other agents. In order to complete missions in such environments, robots must not only be safe, but also agile, able to dodge and change course at a moment's notice. In this paper, we propose FACA, a fair and agile collision avoidance approach where robots coordinate their tasks by talking to each other via natural language (just as people do). In FACA, robots balance safety with agility via a novel artificial potential field algorithm that creates an automatic ``roundabout'' effect whenever a conflict arises. Our experiments show that FACA achieves a improvement in efficiency, completing missions more than 3.5X faster than baselines with a time reduction of over 70% while maintaining robust safety margins.


Artificial Intelligence Agents in Music Analysis: An Integrative Perspective Based on Two Use Cases

arXiv.org Artificial Intelligence

Despite considerable technological innovation, comprehensive reviews synthesizing the application and evolution of artificial intelligence (AI) in the field of music analysis remain scarce. Although early studies on computer-assisted composition and rule-based analysis established a foundation for the automated exploration of musical form and content Hiller (1959), there is still a limited body of literature addressing the complete progression from traditional algorithms to recent AI-driven models and hybrid systems. Pioneering work such as Miranda's Miranda (2021), underscores the influence of AI, supercomputing, and evolutionary computation in shaping the first computational tools for creation. Recent reviews (Wang et al. (2024); Lerch et al. (2025)) focus on intelligent music generation systems. However, a systematic integration of these historical advances with state-of-the-art AI methodologies and musical analysis is largely absent. In the last decade, deep learning frameworks--including convolutional neural networks, recurrent neural networks, and transformer architectures--have led to breakthroughs in music information retrieval.