Reinforcement Learning
Agentic DDQN-Based Scheduling for Licensed and Unlicensed Band Allocation in Sidelink Networks
Chou, Po-Heng, Fu, Pin-Qi, Saad, Walid, Wang, Li-Chun
Abstract--In this paper, we present an agentic double deep Q-network (DDQN) scheduler for licensed/unlicensed band a l-location in New Radio (NR) sidelink (SL) networks. Beyond conventional reward-seeking reinforcement learning (RL), the agent perceives and reasons over a multi-dimensional conte xt that jointly captures queueing delay, link quality, coexistenc e intensity, and switching stability. A capacity-aware, quality of serv ice (QoS)- constrained reward aligns the agent with goal-oriented sch eduling rather than static thresholding. Under constrained bandwi dth, the proposed design reduces blocking by up to 87.5% versus thres hold policies while preserving throughput, highlighting the va lue of context-driven decisions in coexistence-limited NR SL net works. The proposed scheduler is an embodied agent (E-agent) tailo red for task-specific, resource-efficient operation at the netw ork edge.
Double Check My Desired Return: Transformer with Target Alignment for Offline Reinforcement Learning
Pei, Yue, Zhang, Hongming, Gao, Chao, Mรผller, Martin, Zhu, Mengxiao, Sheng, Hao, Chen, Ziliang, Lin, Liang, Zhu, Haogang
Offline reinforcement learning (RL) has achieved significant advances in domains such as robotic control, autonomous driving, and medical decision-making. Most existing methods primarily focus on training policies that maximize cumulative returns from a given dataset. However, many real-world applications require precise control over policy performance levels, rather than simply pursuing the best possible return. Reinforcement learning via supervised learning (RvS) frames offline RL as a sequence modeling task, enabling the extraction of diverse policies by conditioning on different desired returns. Yet, existing RvS-based transformers, such as Decision Transformer (DT), struggle to reliably align the actual achieved returns with specified target returns, especially when interpolating within underrepresented returns or extrapolating beyond the dataset. To address this limitation, we propose Doctor, a novel approach that Double Checks the Transformer with target alignment for Offline RL. Doctor integrates the strengths of supervised learning (SL) and temporal difference (TD) learning by jointly optimizing the action prediction and value estimation. During inference, Doctor introduces a double-check mechanism: actions are first sampled around the desired target returns and then validated with value functions. This ensures more accurate alignment between predicted actions and desired target returns. We evaluate Doctor on the D4RL and EpiCare benchmarks, demonstrating aligned control yields stronger performance and tunable expertise, showing its effectiveness in a wide range of tasks.
Moving Out: Physically-grounded Human-AI Collaboration
Kang, Xuhui, Lee, Sung-Wook, Liu, Haolin, Wang, Yuyan, Kuo, Yen-Ling
The ability to adapt to physical actions and constraints in an environment is crucial for embodied agents (e.g., robots) to effectively collaborate with humans. Such physically grounded human-AI collaboration must account for the increased complexity of the continuous state-action space and constrained dynamics caused by physical constraints. In this paper, we introduce Moving Out, a new human-AI collaboration benchmark that resembles a wide range of collaboration modes affected by physical attributes and constraints, such as moving heavy items together and maintaining consistent actions to move a big item around a corner. Using Moving Out, we designed two tasks and collected human-human interaction data to evaluate models' abilities to adapt to diverse human behaviors and unseen physical attributes. To address the challenges in physical environments, we propose a novel method, BASS (Behavior Augmentation, Simulation, and Selection), to enhance the diversity of agents and their understanding of the outcome of actions. Our experiments show that BASS outperforms state-of-the-art models in AI-AI and human-AI collaboration. The project page is available at https://live-robotics-uva.github.io/movingout_ai/.
BenchRL-QAS: Benchmarking reinforcement learning algorithms for quantum architecture search
Ikhtiarudin, Azhar, Das, Aditi, Thakkar, Param, Kundu, Akash
Our study systematically evaluates 9 different RL agents, including both value-based and policy-gradient methods, on quantum problems such as variational eigensolver, quantum state diagonalization, vari-ational quantum classification (VQC), and state preparation, under both noiseless and noisy execution settings. To ensure fair comparison, we propose a weighted ranking metric that integrates accuracy, circuit depth, gate count, and training time. Results demonstrate that no single RL method dominates universally, the performance dependents on task type, qubit count, and noise conditions providing strong evidence of no free lunch principle in RL-QAS. As a byproduct we observe that a carefully chosen RL algorithm in RL-based VQC outperforms baseline VQCs. BenchRL-QAS establishes the most extensive benchmark for RL-based QAS to date, codes and experimental made publicly available for reproducibility and future advances.
TGRPO :Fine-tuning Vision-Language-Action Model via Trajectory-wise Group Relative Policy Optimization
Chen, Zengjue, Niu, Runliang, Kong, He, Wang, Qi, Xing, Qianli, Fan, Zipei
C. VLA Post-training Framework in Simulation After improving the computation of relative advantages and deriving the corresponding optimization objective, we integrated these components into a complete online reinforcement learning framework for VLA post-training in simulation. First, our overall framework trains a VLA model for a single task using reinforcement learning across multiple environments initialized with identical states. In this setup, the VLA executes the same task in parallel environments, sampling actions step by step until either one environment completes the task or all environments reach the maximum number of steps. During sampling, we incorporate the multistage reward function designed by the LLM described earlier, where each environment's observations provide the necessary object positions and robot state information required for reward computation. Once the trajectories terminate simultaneously, they all share the same length, which facilitates consistent grouping for subsequent processing. After collecting multiple trajectories, they are organized into a trajectory-level group, where relative advantages are computed within the group according to Eq. (3), yielding trajectory-level relative advantages. Similarly, since all trajectories terminate at the same timestep (ensuring that every step can be grouped), we extract step-level data across trajectories (e.g., rewards and log probabilities of actions), and group together steps at the same timestep to form step-level groups. Step-level relative advantages are then computed using Eq.
Can In-Context Reinforcement Learning Recover From Reward Poisoning Attacks?
Sasnauskas, Paulius, Yalฤฑn, Yiฤit, Radanoviฤ, Goran
We study the corruption-robustness of in-context reinforcement learning (ICRL), focusing on the Decision-Pretrained Transformer (DPT, Lee et al., 2023). To address the challenge of reward poisoning attacks targeting the DPT, we propose a novel adversarial training framework, called Adversarially Trained Decision-Pretrained Transformer (AT-DPT). Our method simultaneously trains an attacker to minimize the true reward of the DPT by poisoning environment rewards, and a DPT model to infer optimal actions from the poisoned data. We evaluate the effectiveness of our approach against standard bandit algorithms, including robust baselines designed to handle reward contamination. Our results show that the proposed method significantly outperforms these baselines in bandit settings, under a learned attacker. We additionally evaluate AT-DPT on an adaptive attacker, and observe similar results. Furthermore, we extend our evaluation to the MDP setting, confirming that the robustness observed in bandit scenarios generalizes to more complex environments.
Bridging the Performance Gap Between Target-Free and Target-Based Reinforcement Learning
Vincent, Thรฉo, Tripathi, Yogesh, Faust, Tim, Oren, Yaniv, Peters, Jan, D'Eramo, Carlo
The use of target networks in deep reinforcement learning is a widely popular solution to mitigate the brittleness of semi-gradient approaches and stabilize learning. However, target networks notoriously require additional memory and delay the propagation of Bellman updates compared to an ideal target-free approach. In this work, we step out of the binary choice between target-free and target-based algorithms. We introduce a new method that uses a copy of the last linear layer of the online network as a target network, while sharing the remaining parameters with the up-to-date online network. This simple modification enables us to keep the target-free's low-memory footprint while leveraging the target-based literature. We find that combining our approach with the concept of iterated Q-learning, which consists of learning consecutive Bellman updates in parallel, helps improve the sample-efficiency of target-free approaches. Our proposed method, iterated Shared Q-Learning (iS-QL), bridges the performance gap between target-free and target-based approaches across various problems, while using a single Q-network, thus being a step forward towards resource-efficient reinforcement learning algorithms.
Fine-tuning Diffusion Policies with Backpropagation Through Diffusion Timesteps
Yang, Ningyuan, Gao, Jiaxuan, Gao, Feng, Wu, Yi, Yu, Chao
Diffusion policies, widely adopted in decision-making scenarios such as robotics, gaming and autonomous driving, are capable of learning diverse skills from demonstration data due to their high representation power. However, the sub-optimal and limited coverage of demonstration data could lead to diffusion policies that generate sub-optimal trajectories and even catastrophic failures. While reinforcement learning (RL)-based fine-tuning has emerged as a promising solution to address these limitations, existing approaches struggle to effectively adapt Proximal Policy Optimization (PPO) to diffusion models. This challenge stems from the computational intractability of action likelihood estimation during the denoising process, which leads to complicated optimization objectives. In our experiments starting from randomly initialized policies, we find that online tuning of Diffusion Policies demonstrates much lower sample efficiency compared to directly applying PPO on MLP policies (MLP+PPO). To address these challenges, we introduce NCDPO, a novel framework that reformulates Diffusion Policy as a noise-conditioned deterministic policy. By treating each denoising step as a differentiable transformation conditioned on pre-sampled noise, NCDPO enables tractable likelihood evaluation and gradient backpropagation through all diffusion timesteps. Our experiments demonstrate that NCDPO achieves sample efficiency comparable to MLP+PPO when training from scratch, outperforming existing methods in both sample efficiency and final performance across diverse benchmarks, including continuous robot control and multi-agent game scenarios. Furthermore, our experimental results show that our method is robust to the number denoising timesteps in the Diffusion Policy.
Cache-Efficient Posterior Sampling for Reinforcement Learning with LLM-Derived Priors Across Discrete and Continuous Domains
Shihab, Ibne Farabi, Akter, Sanjeda, Sharma, Anuj
Integrating large language models (LLMs) as priors in reinforcement learning (RL) offers significant advantages but comes with substantial computational costs. We present a principled cache-efficient framework for posterior sampling with LLM-derived priors that dramatically reduces these costs while maintaining high performance. At the core of our approach is an adaptive caching mechanism, where cache parameters are meta-optimized using surrogate gradients derived from policy performance. This design enables efficient inference across both discrete text environments (e.g., TextWorld, ALFWorld) and continuous control domains (e.g., MuJoCo), achieving a 3.8--4.7$\times$ reduction in LLM queries and 4.0--12.0$\times$ lower median latencies (85--93\,ms on a consumer GPU) while retaining 96--98\% of uncached performance. Our theoretical analysis provides KL divergence bounds on approximation quality, validated empirically. The framework extends to offline RL, where our CQL-Prior variant improves performance by 14--29\% and reduces training time by 38--40\%. Extensive evaluations across a diverse suite of eight tasks demonstrate the generalizability and practical viability of LLM-guided RL in resource-constrained settings.
Learning Large-Scale Competitive Team Behaviors with Mean-Field Interactions and Online Opponent Modeling
Jeloka, Bhavini, Guan, Yue, Tsiotras, Panagiotis
While multi-agent reinforcement learning (MARL) has been proven effective across both collaborative and competitive tasks, existing algorithms often struggle to scale to large populations of agents. Recent advancements in mean-field (MF) theory provide scalable solutions by approximating population interactions as a continuum, yet most existing frameworks focus exclusively on either fully cooperative or purely competitive settings. To bridge this gap, we introduce MF-MAPPO, a mean-field extension of PPO designed for zero-sum team games that integrate intra-team cooperation with inter-team competition. MF-MAPPO employs a shared actor and a minimally informed critic per team and is trained directly on finite-population simulators, thereby enabling deployment to realistic scenarios with thousands of agents. We further show that MF-MAPPO naturally extends to partially observable settings through a simple gradient-regularized training scheme. Our evaluation utilizes large-scale benchmark scenarios using our own testing simulation platform for MF team games (MFEnv), including offense-defense battlefield tasks as well as variants of population-based rock-paper-scissors games that admit analytical solutions, for benchmarking. Across these benchmarks, MF-MAPPO outperforms existing methods and exhibits complex, heterogeneous behaviors, demonstrating the effectiveness of combining mean-field theory and MARL techniques at scale.