Collaborating Authors

 Agents


Stabilizing Long-term Multi-turn Reinforcement Learning with Gated Rewards

arXiv.org Artificial Intelligence

Reward sparsity in long-horizon reinforcement learning (RL) tasks remains a significant challenge, while existing outcome-based reward shaping struggles to define meaningful immediate rewards without introducing bias or requiring explicit task decomposition. Alternatively, verification-based reward shaping uses stepwise critics, but misalignment between immediate rewards and long-term objectives can lead to reward hacking and suboptimal policies. In this work, we address this problem in the context of software engineering (SWE) tasks, where multi-turn reasoning and rule-based verification are critical. We introduce the SWE-oriented RL Framework, a unified system supporting multi-turn interaction, Docker-based execution, and customizable reward functions. Additionally, we propose Gated Reward Accumulation (G-RA), a novel method that accumulates immediate rewards only when high-level (long-term) rewards meet a predefined threshold, ensuring stable RL optimization. Experiments on SWE-bench Verified and kBench demonstrate that G-RA leads to an increase in completion rates (47.6% → 93.8% and 22.0% → 86.0%) and modification rates (19.6% → 23.8% and 12.0% → 42.0%), while avoiding policy degradation caused by reward misalignment. Our findings highlight the importance of balanced reward accumulation in long-horizon RL and provide a practical solution.
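
The gating idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `gated_return`, the way the two reward levels are combined (a plain sum), and the example numbers are all assumptions made for illustration.

```python
from typing import Sequence


def gated_return(
    immediate_rewards: Sequence[float],
    high_level_reward: float,
    threshold: float,
) -> float:
    """Trajectory reward under a simple gating rule (illustrative only).

    Stepwise rewards are accumulated only when the high-level
    (long-term) reward reaches the threshold; otherwise they are
    dropped, so the policy cannot profit from them alone.
    """
    if high_level_reward >= threshold:
        return high_level_reward + sum(immediate_rewards)
    return high_level_reward


# Hypothetical trajectories: same stepwise rewards, different outcomes.
step_rewards = [0.25, 0.5, 0.25]
passed = gated_return(step_rewards, high_level_reward=1.0, threshold=1.0)
failed = gated_return(step_rewards, high_level_reward=0.0, threshold=1.0)
```

In this sketch the failed trajectory receives none of its stepwise reward, which is the property the abstract credits with avoiding reward hacking.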


A Unified Multi-Agent Framework for Universal Multimodal Understanding and Generation

arXiv.org Artificial Intelligence

Real-world multimodal applications often require any-to-any capabilities, enabling both understanding and generation across modalities including text, image, audio, and video. However, integrating the strengths of autoregressive language models (LLMs) for reasoning and diffusion models for high-fidelity generation remains challenging. Existing approaches rely on rigid pipelines or tightly coupled architectures, limiting flexibility and scalability. We propose MAGUS (Multi-Agent Guided Unified Multimodal System), a modular framework that unifies multimodal understanding and generation via two decoupled phases: Cognition and Deliberation. MAGUS enables symbolic multi-agent collaboration within a shared textual workspace. In the Cognition phase, three role-conditioned multimodal LLM agents (Perceiver, Planner, and Reflector) engage in collaborative dialogue to perform structured understanding and planning. The Deliberation phase incorporates a Growth-Aware Search mechanism that orchestrates LLM-based reasoning and diffusion-based generation in a mutually reinforcing manner. MAGUS supports plug-and-play extensibility, scalable any-to-any modality conversion, and semantic alignment, all without the need for joint training. Experiments across multiple benchmarks, including image, video, and audio generation, as well as cross-modal instruction following, demonstrate that MAGUS outperforms strong baselines and state-of-the-art systems.


FAIRGAME: a Framework for AI Agents Bias Recognition using Game Theory

arXiv.org Artificial Intelligence

Letting AI agents interact in multi-agent applications adds a layer of complexity to the interpretability and prediction of AI outcomes, with profound implications for their trustworthy adoption in research and society. Game theory offers powerful models to capture and interpret strategic interaction among agents, but requires the support of reproducible, standardized and user-friendly IT frameworks to enable comparison and interpretation of results. To this end, we present FAIRGAME, a Framework for AI Agents Bias Recognition using Game Theory. We describe its implementation and usage, and we employ it to uncover biased outcomes in popular games among AI agents, depending on the employed Large Language Model (LLM) and used language, as well as on the personality trait or strategic knowledge of the agents. Overall, FAIRGAME allows users to reliably and easily simulate their desired games and scenarios and compare the results across simulation campaigns and with game-theoretic predictions, enabling the systematic discovery of biases, the anticipation of emerging behavior out of strategic interplays, and empowering further research into strategic decision-making using LLM agents.
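
As a rough illustration of the "game-theoretic predictions" against which simulated agent behavior could be compared, the sketch below enumerates the pure-strategy Nash equilibria of a two-player game. It is independent of FAIRGAME's actual code; the function name, the payoff encoding, and the choice of a Prisoner's Dilemma with these payoff values are assumptions.

```python
from itertools import product


def pure_nash_equilibria(payoffs):
    """Return all pure-strategy Nash equilibria of a two-player game.

    payoffs maps (row_action, col_action) to (row_payoff, col_payoff).
    A profile is an equilibrium when neither player can gain by
    deviating unilaterally.
    """
    row_actions = sorted({a for a, _ in payoffs})
    col_actions = sorted({b for _, b in payoffs})
    equilibria = []
    for a, b in product(row_actions, col_actions):
        row_best = all(
            payoffs[(a, b)][0] >= payoffs[(alt, b)][0] for alt in row_actions
        )
        col_best = all(
            payoffs[(a, b)][1] >= payoffs[(a, alt)][1] for alt in col_actions
        )
        if row_best and col_best:
            equilibria.append((a, b))
    return equilibria


# Hypothetical Prisoner's Dilemma payoffs (illustrative values).
PRISONERS_DILEMMA = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"): (0, 5),
    ("defect", "cooperate"): (5, 0),
    ("defect", "defect"): (1, 1),
}

prediction = pure_nash_equilibria(PRISONERS_DILEMMA)
```

Observed LLM-agent outcomes that deviate systematically from such a prediction are the kind of result a bias analysis would then examine.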


Large Model Empowered Embodied AI: A Survey on Decision-Making and Embodied Learning

arXiv.org Artificial Intelligence

Embodied AI aims to develop intelligent systems with physical forms capable of perception, decision-making, action, and learning in real-world environments, offering a promising path toward Artificial General Intelligence (AGI). Despite decades of exploration, it remains challenging for embodied agents to achieve human-level intelligence for general-purpose tasks in open, dynamic environments. Recent breakthroughs in large models have revolutionized embodied AI by enhancing perception, interaction, planning, and learning. In this article, we provide a comprehensive survey of large-model-empowered embodied AI, focusing on autonomous decision-making and embodied learning. We investigate both hierarchical and end-to-end decision-making paradigms, detailing how large models enhance high-level planning, low-level execution, and feedback for hierarchical decision-making, and how large models enhance Vision-Language-Action (VLA) models for end-to-end decision-making. For embodied learning, we introduce mainstream learning methodologies, elaborating in depth on how large models enhance imitation learning and reinforcement learning. For the first time, we integrate world models into the survey of embodied AI, presenting their design methods and critical roles in enhancing decision-making and learning. Although solid advances have been achieved, challenges remain; we discuss them at the end of this survey as potential directions for future research.


Multi-Agent Trust Region Policy Optimisation: A Joint Constraint Approach

arXiv.org Artificial Intelligence

Multi-agent reinforcement learning (MARL) requires coordinated and stable policy updates among interacting agents. Heterogeneous-Agent Trust Region Policy Optimization (HATRPO) enforces per-agent trust region constraints using Kullback-Leibler (KL) divergence to stabilize training. However, assigning each agent the same KL threshold can lead to slow and locally optimal updates, especially in heterogeneous settings. To address this limitation, we propose two approaches for allocating the KL divergence threshold across agents: HATRPO-W, a Karush-Kuhn-Tucker-based (KKT-based) method that optimizes threshold assignment under global KL constraints, and HATRPO-G, a greedy algorithm that prioritizes agents based on improvement-to-divergence ratio. By connecting sequential policy optimization with constrained threshold scheduling, our approach enables more flexible and effective learning in heterogeneous-agent settings. Experimental results demonstrate that our methods significantly boost the performance of HATRPO, achieving faster convergence and higher final rewards across diverse MARL benchmarks. Specifically, HATRPO-W and HATRPO-G achieve comparable improvements in final performance, each exceeding 22.5%. Notably, HATRPO-W also demonstrates more stable learning dynamics, as reflected by its lower variance.
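
One way to picture a greedy allocation driven by an improvement-to-divergence ratio is sketched below. The abstract does not specify the algorithm's details, so everything here is an assumption: the function name, the inputs (a per-agent improvement estimate and the KL divergence its update would consume), and the rule that each agent is granted its requested divergence until a global budget runs out.

```python
def greedy_kl_allocation(improvements, divergences, total_budget):
    """Split a global KL budget across agents greedily (illustrative).

    improvements[i]: estimated gain from agent i's proposed update.
    divergences[i]:  KL divergence that update would consume (> 0).
    Agents are visited in decreasing improvement/divergence order and
    each receives its requested divergence, capped by what remains.
    """
    order = sorted(
        range(len(improvements)),
        key=lambda i: improvements[i] / divergences[i],
        reverse=True,
    )
    thresholds = [0.0] * len(improvements)
    remaining = total_budget
    for i in order:
        grant = min(divergences[i], remaining)
        thresholds[i] = grant
        remaining -= grant
    return thresholds


# Hypothetical numbers for three agents sharing a KL budget of 1.0.
thresholds = greedy_kl_allocation(
    improvements=[0.5, 0.25, 1.0],
    divergences=[0.5, 0.5, 0.25],
    total_budget=1.0,
)
```

With these inputs the third agent has the highest ratio and is served first, and the agent with the lowest ratio receives only the leftover budget.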


Multi-Agent Reinforcement Learning for Adaptive Resource Orchestration in Cloud-Native Clusters

arXiv.org Artificial Intelligence

This paper addresses the challenges of high resource dynamism and scheduling complexity in cloud-native database systems. It proposes an adaptive resource orchestration method based on multi-agent reinforcement learning. The method introduces a heterogeneous role-based agent modeling mechanism. This allows different resource entities, such as compute nodes, storage nodes, and schedulers, to adopt distinct policy representations. These agents are better able to reflect diverse functional responsibilities and local environmental characteristics within the system. A reward-shaping mechanism is designed to integrate local observations with global feedback. This helps mitigate policy learning bias caused by incomplete state observations. By combining real-time local performance signals with global system value estimation, the mechanism improves coordination among agents and enhances policy convergence stability. A unified multi-agent training framework is developed and evaluated on a representative production scheduling dataset. Experimental results show that the proposed method outperforms traditional approaches across multiple key metrics. These include resource utilization, scheduling latency, policy convergence speed, system stability, and fairness. The results demonstrate strong generalization and practical utility. Across various experimental scenarios, the method proves effective in handling orchestration tasks with high concurrency, high-dimensional state spaces, and complex dependency relationships. This confirms its advantages in real-world, large-scale scheduling environments.


REALISM: A Regulatory Framework for Coordinated Scheduling in Multi-Operator Shared Micromobility Services

arXiv.org Artificial Intelligence

Shared micromobility (e.g., shared bikes and electric scooters), an emerging form of urban transportation, has become increasingly popular worldwide. However, the rapid growth of shared micromobility vehicles brings social problems to the city (e.g., overloaded vehicles on roads and inequitable vehicle deployment), which deviate from the city regulator's expectations of the shared micromobility service. In addition, the presence of multiple operators in a city complicates the problem because of their non-cooperative, self-interested pursuits. Existing regulatory frameworks for multi-operator vehicle rebalancing generally assume intrusive control over the rebalancing of all operators, which is not practical in the real world. To address this limitation, we design REALISM, a regulatory framework for coordinated scheduling in multi-operator shared micromobility services that incorporates the city regulator's regulations by assigning a score to each operator according to the city goal achievements and operators' individual contributions to achieving the city goal, measured by Shapley value. To realize the fairness-aware score assignment, we measure the fairness of the assigned scores and use it as one of the components to optimize the score assignment model. To optimize the whole framework, we develop an alternating procedure in which operators and the city regulator interact with each other until convergence. We evaluate our framework on real-world e-scooter usage data from Chicago. Our experimental results show that our method achieves a performance gain of at least 39.93% in the equity of vehicle usage and 1.82% in the average demand satisfaction of the whole city.
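
Measuring each operator's contribution by Shapley value can be illustrated with an exact computation over all orderings. This is a generic sketch of the Shapley value, not the REALISM scoring model; the operator names and the coalition values in `GOAL_ACHIEVEMENT` are made-up numbers standing in for "city goal achievement".

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values by averaging marginal contributions.

    value(coalition) returns the worth of a frozenset of players.
    The cost grows factorially, so this suits only a few players.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


# Hypothetical goal achievement for every coalition of three operators.
GOAL_ACHIEVEMENT = {
    frozenset(): 0,
    frozenset("A"): 10,
    frozenset("B"): 20,
    frozenset("C"): 30,
    frozenset("AB"): 40,
    frozenset("AC"): 50,
    frozenset("BC"): 60,
    frozenset("ABC"): 90,
}

contributions = shapley_values(["A", "B", "C"], GOAL_ACHIEVEMENT.__getitem__)
```

The resulting contributions sum to the value of the full coalition (90 here), which is the efficiency property that makes Shapley values a natural basis for dividing credit among operators.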


Agentic AI Frameworks: Architectures, Protocols, and Design Challenges

arXiv.org Artificial Intelligence

| Aspect | Traditional AI agents | Modern agentic AI systems (LLM-based agents) |
| --- | --- | --- |
| Definition | Autonomous entities with fixed sensing/acting loops; limited by static rules or models | Autonomous reasoning systems using LLMs with dynamic behavior, tool orchestration, and context-awareness |
| Autonomy | Limited autonomy; often dependent on human input or predefined instructions | High autonomy; capable of independently performing complex and extended tasks |
| Goal Management | Focused on single, static goals or fixed task planning | Capable of managing multiple, evolving, and nested goals adaptively |
| Architecture | Rule-based or BDI (Belief-Desire-Intention) models; monolithic design | Modular architecture centered on LLMs, with components for memory, tools, context injection, and roles |
| Adaptability | Suited to controlled, predictable environments; poor generalization | Designed for open, dynamic, and unpredictable environments |
| Decision-Making | Deterministic or rule-based logic; symbolic reasoning | Context-sensitive, probabilistic reasoning with adaptive planning and self-reflection |
| Learning Mechanism | Rule-based or supervised learning with limited updates | Self-supervised and reinforcement learning; continual fine-tuning possible |
| Context Handling | Static or manually coded states and rules | Dynamic context injection via agent protocols (e.g., MCP, A2A) and runtime awareness |
| Communication | Message-passing via ACL or KQML | Real-time, event-driven collaboration; natural language interfaces |
| Tool Use | Limited or predefined tools and actions | Dynamic tool invocation, chaining, and API calling based on context |
| Memory | Optional, often hardcoded or task-specific | Integrated memory systems supporting long- and short-term information retention |


MCP-Orchestrated Multi-Agent System for Automated Disinformation Detection

arXiv.org Artificial Intelligence

The large spread of disinformation across digital platforms creates significant challenges to information integrity. This paper presents a multi-agent system that uses relation extraction to detect disinformation in news articles, focusing on titles and short text snippets. The proposed Agentic AI system combines four agents: (i) a machine learning agent (logistic regression), (ii) a Wikipedia knowledge check agent (which relies on named entity recognition), (iii) a coherence detection agent (using LLM prompt engineering), and (iv) a web-scraped data analyzer that extracts relational triplets for fact checking. The system is orchestrated via the Model Context Protocol (MCP), offering shared context and live learning across components. Results demonstrate that the multi-agent ensemble achieves 95.3% accuracy with an F1 score of 0.964, significantly outperforming individual agents and traditional approaches. The weighted aggregation method, mathematically derived from individual agent misclassification rates, proves superior to algorithmic threshold optimization. The modular architecture makes the system easily scalable, while also maintaining details of the decision processes.
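
A weighted vote whose weights come from per-agent misclassification rates might look like the sketch below. The abstract does not give the paper's formula, so the log-odds weighting used here (as in classic weighted-majority schemes) is an assumption, as are the function names, error rates, and votes.

```python
import math


def error_based_weights(error_rates):
    """Log-odds weight per agent; an ASSUMED formula, not the paper's.

    Agents with a lower misclassification rate get a larger weight.
    Rates must lie strictly between 0 and 1.
    """
    return [math.log((1.0 - e) / e) for e in error_rates]


def weighted_vote(votes, weights):
    """Combine binary votes (1 = disinformation, 0 = legitimate)."""
    score = sum(w * (1 if v == 1 else -1) for v, w in zip(votes, weights))
    return 1 if score > 0 else 0


# Hypothetical error rates for four agents and one split decision.
weights = error_based_weights([0.2, 0.3, 0.4, 0.1])
decision = weighted_vote([1, 0, 0, 1], weights)
```

In this example a simple majority would be tied at two votes each, while the weighted vote follows the two more reliable agents.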


FPT-Approximability of Stable Matching Problems

arXiv.org Artificial Intelligence

We study parameterized approximability of three optimization problems related to stable matching: (1) Min-BP-SMI: Given a stable marriage instance and a number k, find a size-at-least-k matching that minimizes the number $β$ of blocking pairs; (2) Min-BP-SRI: Given a stable roommates instance, find a matching that minimizes the number $β$ of blocking pairs; (3) Max-SMTI: Given a stable marriage instance with preferences containing ties, find a maximum-size stable matching. The first two problems are known to be NP-hard to approximate to any constant factor and W[1]-hard with respect to $β$, making the existence of an EPTAS or FPT-algorithms unlikely. We show that they are W[1]-hard with respect to $β$ to approximate to any function of $β$. This means that unless FPT=W[1], there is no FPT-approximation scheme for the parameter $β$. The last problem (Max-SMTI) is known to be NP-hard to approximate to factor-29/33 and W[1]-hard with respect to the number of ties. We complement this and present an FPT-approximation scheme for the parameter "number of agents with ties".
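
The quantity $β$ minimized in the first two problems can be made concrete with a small sketch that counts blocking pairs for a stable marriage instance with incomplete lists. The function name, the dictionary encoding of preferences, and the two-by-two example are assumptions for illustration; the code only evaluates a given matching and is not one of the paper's algorithms.

```python
def count_blocking_pairs(men_prefs, women_prefs, matching):
    """Count blocking pairs of a matching (man -> woman dictionary).

    A mutually acceptable pair (m, w) not matched to each other blocks
    when m is unmatched or prefers w to his partner, and w is
    unmatched or prefers m to her partner.
    """
    partner_of_woman = {w: m for m, w in matching.items()}

    def prefers(ranking, candidate, current):
        # An unmatched agent prefers any acceptable candidate.
        if current is None:
            return True
        return ranking.index(candidate) < ranking.index(current)

    count = 0
    for m, ranking in men_prefs.items():
        for w in ranking:
            if matching.get(m) == w:
                continue
            if m not in women_prefs.get(w, []):
                continue  # w does not find m acceptable
            if prefers(ranking, w, matching.get(m)) and prefers(
                women_prefs[w], m, partner_of_woman.get(w)
            ):
                count += 1
    return count


# Hypothetical instance: everyone ranks the first partner highest.
men = {"m1": ["w1", "w2"], "m2": ["w1", "w2"]}
women = {"w1": ["m1", "m2"], "w2": ["m1", "m2"]}

unstable = count_blocking_pairs(men, women, {"m1": "w2", "m2": "w1"})
stable = count_blocking_pairs(men, women, {"m1": "w1", "m2": "w2"})
```

A matching is stable exactly when this count is zero; in the unstable matching above, the single blocking pair is (m1, w1).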