AITopics

2511.16372

Genre: Research Report (0.82)

Industry: Transportation > Air (0.49)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.95)
Information Technology > Artificial Intelligence > Vision (0.89)
(2 more...)

arXiv.org Artificial IntelligenceDec-11-2025

Global Convergence for Average Reward Constrained MDPs with Primal-Dual Actor Critic Algorithm

Xu, Yang, Ganesh, Swetha, Mondal, Washim Uddin, Bai, Qinbo, Aggarwal, Vaneet

This paper investigates infinite-horizon average reward Constrained Markov Decision Processes (CMDPs) with general parametrization. We propose a Primal-Dual Natural Actor-Critic algorithm that adeptly manages constraints while ensuring a high convergence rate. In particular, our algorithm achieves global convergence and constraint violation rates of $\tilde{\mathcal{O}}(1/\sqrt{T})$ over a horizon of length $T$ when the mixing time, $τ_{\mathrm{mix}}$, is known to the learner. In absence of knowledge of $τ_{\mathrm{mix}}$, the achievable rates change to $\tilde{\mathcal{O}}(1/T^{0.5-ε})$ provided that $T \geq \tilde{\mathcal{O}}\left(τ_{\mathrm{mix}}^{2/ε}\right)$. Our results match the theoretical lower bound for Markov Decision Processes and establish a new benchmark in the theoretical exploration of average reward CMDPs.

assumption 4, machine learning, reinforcement learning, (13 more...)

2505.15138

Genre: Research Report (1.00)

Industry: Transportation (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.68)

Valiente, Rodolfo, Toghi, Behrad, Pedarsani, Ramtin, Fallah, Yaser P.

Robustness and Adaptability of Reinforcement Learning based Cooperative Autonomous Driving in Mixed-autonomy Traffic

arXiv.org Artificial IntelligenceDec-11-2025

HE development of autonomous vehicles (A Vs) is on the verge of passing beyond the laboratory and simulation tests and is shifting towards addressing the challenges that limit their practicality in today's society. While there is still need for further technological improvements to enable safe and smooth operation of a single A V, a great deal of research attention is being focused on the emerging challenge of operating multiple A Vs and the co-existence of A Vs and human-driven vehicles (HVs) [1], [2]. A realistic outlook for the adoption of autonomous vehicles on the roads is a mixed-traffic scenario in which human drivers with different driving styles and social preferences share the road with A Vs that are perhaps built by different manufacturers and hence follow different policies [3], [4]. In this work, we seek a solution that can ensure the safety and robustness of A Vs in the presence of human drivers with heterogeneous behavioral traits. Connected & autonomous vehicles (CA Vs) via vehicle-to-vehicle (V2V) communication allow vehicles to directly communicate with their neighbors, creating an extended perception that enables explicit coordination among vehicles to overcome the limitations of an isolated agent [5]-[11]. While planning in a fully A V scenario is relatively easy to achieve, coordination in the presence of HVs is a significantly more challenging task, as the A Vs not only need to react to road objects but also need to consider the behaviors of HVs [3], [4], [12]. We start by identifying the major challenges in the domain of behavior planning and prediction for A Vs in mixed-autonomy traffic.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2202.00881

Country: North America > United States > California (0.68)

Genre: Research Report (1.00)

Industry:

Transportation > Ground > Road (1.00)
Automobiles & Trucks (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Davidovich, Orit, Shtern, Shimrit, Wasserkrug, Segev, Megiddo, Nimrod

Heuristics for Combinatorial Optimization via Value-based Reinforcement Learning: A Unified Framework and Analysis

arXiv.org Machine LearningDec-10-2025

Since the 1990s, considerable empirical work has been carried out to train statistical models, such as neural networks (NNs), as learned heuristics for combinatorial optimization (CO) problems. When successful, such an approach eliminates the need for experts to design heuristics per problem type. Due to their structure, many hard CO problems are amenable to treatment through reinforcement learning (RL). Indeed, we find a wealth of literature training NNs using value-based, policy gradient, or actor-critic approaches, with promising results, both in terms of empirical optimality gaps and inference runtimes. Nevertheless, there has been a paucity of theoretical work undergirding the use of RL for CO problems. To this end, we introduce a unified framework to model CO problems through Markov decision processes (MDPs) and solve them using RL techniques. We provide easy-to-test assumptions under which CO problems can be formulated as equivalent undiscounted MDPs that provide optimal solutions to the original CO problems. Moreover, we establish conditions under which value-based RL techniques converge to approximate solutions of the CO problem with a guarantee on the associated optimality gap. Our convergence analysis provides: (1) a sufficient rate of increase in batch size and projected gradient descent steps at each RL iteration; (2) the resulting optimality gap in terms of problem parameters and targeted RL accuracy; and (3) the importance of a choice of state-space embedding. Together, our analysis illuminates the success (and limitations) of the celebrated deep Q-learning algorithm in this problem context.

co problem, theorem 3, value function, (16 more...)

arXiv.org Machine Learning

2512.08601

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Europe > France (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Park, Kwanyoung, Park, Seohong, Lee, Youngwoon, Levine, Sergey

Scalable Offline Model-Based RL with Action Chunks

In this paper, we study whether model-based reinforcement learning (RL), in particular model-based value expansion, can provide a scalable recipe for tackling complex, long-horizon tasks in offline RL. Model-based value expansion fits an on-policy value function using length-n imaginary rollouts generated by the current policy and a learned dynamics model. While larger n reduces bias in value bootstrapping, it amplifies accumulated model errors over long horizons, degrading future predictions. We address this trade-off with an \emph{action-chunk} model that predicts a future state from a sequence of actions (an "action chunk") instead of a single action, which reduces compounding errors. In addition, instead of directly training a policy to maximize rewards, we employ rejection sampling from an expressive behavioral action-chunk policy, which prevents model exploitation from out-of-distribution actions. We call this recipe \textbf{Model-Based RL with Action Chunks (MAC)}. Through experiments on highly challenging tasks with large-scale datasets of up to 100M transitions, we show that MAC achieves the best performance among offline model-based RL algorithms, especially on challenging long-horizon tasks.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

2512.08108

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Learning and Editing Universal Graph Prompt Tuning via Reinforcement Learning

Xu, Jinfeng, Chen, Zheyu, Yang, Shuo, Li, Jinze, Wang, Hewei, Li, Yijie, Ngai, Edith C. H.

Early graph prompt tuning approaches relied on task-specific designs for Graph Neural Networks (GNNs), limiting their adaptability across diverse pre-training strategies. In contrast, another promising line of research has investigated universal graph prompt tuning, which operates directly in the input graph's feature space and builds a theoretical foundation that universal graph prompt tuning can theoretically achieve an equivalent effect of any prompting function, eliminating dependence on specific pre-training strategies. Recent works propose selective node-based graph prompt tuning to pursue more ideal prompts. However, we argue that selective node-based graph prompt tuning inevitably compromises the theoretical foundation of universal graph prompt tuning. In this paper, we strengthen the theoretical foundation of universal graph prompt tuning by introducing stricter constraints, demonstrating that adding prompts to all nodes is a necessary condition for achieving the universality of graph prompts. To this end, we propose a novel model and paradigm, Learning and Editing Universal GrAph Prompt Tuning (LEAP), which preserves the theoretical foundation of universal graph prompt tuning while pursuing more ideal prompts. Specifically, we first build the basic universal graph prompts to preserve the theoretical foundation and then employ actor-critic reinforcement learning to select nodes and edit prompts. Extensive experiments on graph- and node-level tasks across various pre-training strategies in both full-shot and few-shot scenarios show that LEAP consistently outperforms fine-tuning and other prompt-based approaches.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

2512.08763

Country:

North America > United States (0.46)
Asia > China (0.28)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.94)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Fosso, Lauritz Rismark, Amundsen, Herman Biørn, Xanthidis, Marios, Ohrem, Sveinung Johan

Sim2Swim: Zero-Shot Velocity Control for Agile AUV Maneuvering in 3 Minutes

Holonomic autonomous underwater vehicles (AUVs) have the hardware ability for agile maneuvering in both translational and rotational degrees of freedom (DOFs). However, due to challenges inherent to underwater vehicles, such as complex hydrostatics and hydrodynamics, parametric uncertainties, and frequent changes in dynamics due to payload changes, control is challenging. Performance typically relies on carefully tuned controllers targeting unique platform configurations, and a need for re-tuning for deployment under varying payloads and hydrodynamic conditions. As a consequence, agile maneuvering with simultaneous tracking of time-varying references in both translational and rotational DOFs is rarely utilized in practice. To the best of our knowledge, this paper presents the first general zero-shot sim2real deep reinforcement learning-based (DRL) velocity controller enabling path following and agile 6DOF maneuvering with a training duration of just 3 minutes. Sim2Swim, the proposed approach, inspired by state-of-the-art DRL-based position control, leverages domain randomization and massively parallelized training to converge to field-deployable control policies for AUVs of variable characteristics without post-processing or tuning. Sim2Swim is extensively validated in pool trials for a variety of configurations, showcasing robust control for highly agile motions.

large language model, machine learning, reinforcement learning, (20 more...)

2512.08656

Country: Europe (0.46)

Genre: Research Report (0.50)

Industry: Energy > Renewable > Wind (0.47)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.73)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.61)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Nguyen, Thai Duong, Nguyen, Ngoc-Tan, Nguyen, Thanh-Dao, Van Huynh, Nguyen, Tran, Dinh-Hieu, Chatzinotas, Symeon

Multi-Agent Deep Reinforcement Learning for Collaborative UAV Relay Networks under Jamming Atatcks

The deployment of Unmanned Aerial Vehicle (UAV) swarms as dynamic communication relays is critical for next-generation tactical networks. However, operating in contested environments requires solving a complex trade-off, including maximizing system throughput while ensuring collision avoidance and resilience against adversarial jamming. Existing heuristic-based approaches often struggle to find effective solutions due to the dynamic and multi-objective nature of this problem. This paper formulates this challenge as a cooperative Multi-Agent Reinforcement Learning (MARL) problem, solved using the Centralized Training with Decentralized Execution (CTDE) framework. Our approach employs a centralized critic that uses global state information to guide decentralized actors which operate using only local observations. Simulation results show that our proposed framework significantly outperforms heuristic baselines, increasing the total system throughput by approximately 50% while simultaneously achieving a near-zero collision rate. A key finding is that the agents develop an emergent anti-jamming strategy without explicit programming. They learn to intelligently position themselves to balance the trade-off between mitigating interference from jammers and maintaining effective communication links with ground users.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2512.08341

Country: Europe > United Kingdom (0.28)

Genre: Research Report > New Finding (0.34)

Industry:

Government > Military (0.68)
Telecommunications (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.47)

TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models

Ding, Zheng, Ye, Weirui

Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption. We introduce \textbf{TreeGRPO}, a novel RL framework that dramatically improves training efficiency by recasting the denoising process as a search tree. From shared initial noise samples, TreeGRPO strategically branches to generate multiple candidate trajectories while efficiently reusing their common prefixes. This tree-structured approach delivers three key advantages: (1) \emph{High sample efficiency}, achieving better performance under same training samples (2) \emph{Fine-grained credit assignment} via reward backpropagation that computes step-specific advantages, overcoming the uniform credit assignment limitation of trajectory-based methods, and (3) \emph{Amortized computation} where multi-child branching enables multiple policy updates per forward pass. Extensive experiments on both diffusion and flow-based models demonstrate that TreeGRPO achieves \textbf{2.4$\times$ faster training} while establishing a superior Pareto frontier in the efficiency-reward trade-off space. Our method consistently outperforms GRPO baselines across multiple benchmarks and reward models, providing a scalable and effective pathway for RL-based visual generative model alignment. The project website is available at treegrpo.github.io.

arxiv preprint arxiv, machine learning, reinforcement learning, (16 more...)

2512.08153

Genre: Research Report (0.64)

Industry: Leisure & Entertainment > Games (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.90)

Robust Agents in Open-Ended Worlds

Samvelyan, Mikayel

The growing prevalence of artificial intelligence (AI) in various applications underscores the need for agents that can successfully navigate and adapt to an ever-changing, open-ended world. A key challenge is ensuring these AI agents are robust, excelling not only in familiar settings observed during training but also effectively generalising to previously unseen and varied scenarios. In this thesis, we harness methodologies from open-endedness and multi-agent learning to train and evaluate robust AI agents capable of generalising to novel environments, out-of-distribution inputs, and interactions with other co-player agents. We begin by introducing MiniHack, a sandbox framework for creating diverse environments through procedural content generation. Based on the game of NetHack, MiniHack enables the construction of new tasks for reinforcement learning (RL) agents with a focus on generalisation. We then present Maestro, a novel approach for generating adversarial curricula that progressively enhance the robustness and generality of RL agents in two-player zero-sum games. We further probe robustness in multi-agent domains, utilising quality-diversity methods to systematically identify vulnerabilities in state-of-the-art, pre-trained RL policies within the complex video game football domain, characterised by intertwined cooperative and competitive dynamics. Finally, we extend our exploration of robustness to the domain of LLMs. Here, our focus is on diagnosing and enhancing the robustness of LLMs against adversarial prompts, employing evolutionary search to generate a diverse range of effective inputs that aim to elicit undesirable outputs from an LLM. This work collectively paves the way for future advancements in AI robustness, enabling the development of agents that not only adapt to an ever-evolving world but also thrive in the face of unforeseen challenges and interactions.

evolutionary algorithm, large language model, machine learning, (22 more...)

2512.08139

Country:

Europe (1.00)
Asia (1.00)
North America > United States > Massachusetts (0.27)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)
Instructional Material (1.00)
Research Report > Promising Solution (0.65)

Industry:

Leisure & Entertainment > Sports > Motorsports > Formula One (1.00)
Leisure & Entertainment > Games > Computer Games (1.00)
Information Technology > Security & Privacy (1.00)
(5 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
(4 more...)