Reinforcement Learning
Detecting Model Drifts in Non-Stationary Environment Using Edit Operation Measures
Lee, Chang-Hwan, Shim, Alexander
Reinforcement learning (RL) agents typically assume stationary environment dynamics. Yet in real-world applications such as healthcare, robotics, and finance, transition probabilities or reward functions may evolve, leading to model drift. This paper proposes a novel framework to detect such drifts by analyzing the distributional changes in sequences of agent behavior. Specifically, we introduce a suite of edit operation-based measures to quantify deviations between state-action trajectories generated under stationary and perturbed conditions. Our experiments demonstrate that these measures can effectively distinguish drifted from non-drifted scenarios, even under varying levels of noise, providing a practical tool for drift detection in non-stationary RL environments.
Policy Learning for Social Robot-Led Physiotherapy
Bettosi, Carl, Ballie, Lynne, Shenkin, Susan, Romeo, Marta
Social robots offer a promising solution for autonomously guiding patients through physiotherapy exercise sessions, but effective deployment requires advanced decision-making to adapt to patient needs. A key challenge is the scarcity of patient behavior data for developing robust policies. To address this, we engaged 33 expert healthcare practitioners as patient proxies, using their interactions with our robot to inform a patient behavior model capable of generating exercise performance metrics and subjective scores on perceived exertion. We trained a reinforcement learning-based policy in simulation, demonstrating that it can adapt exercise instructions to individual exertion tolerances and fluctuating performance, while also being applicable to patients at different recovery stages with varying exercise plans.
Gradient Free Deep Reinforcement Learning With TabPFN
Schiff, David, Lindenbaum, Ofir, Efroni, Yonathan
Gradient based optimization is fundamental to most modern deep reinforcement learning algorithms, however, it introduces significant sensitivity to hyperparameters, unstable training dynamics, and high computational costs. We propose TabPFN RL, a novel gradient free deep RL framework that repurposes the meta trained transformer TabPFN as a Q function approximator. Originally developed for tabular classification, TabPFN is a transformer pre trained on millions of synthetic datasets to perform inference on new unseen datasets via in context learning. Given an in context dataset of sample label pairs and new unlabeled data, it predicts the most likely labels in a single forward pass, without gradient updates or task specific fine tuning. We use TabPFN to predict Q values using inference only, thereby eliminating the need for back propagation at both training and inference. To cope with the model's fixed context budget, we design a high reward episode gate that retains only the top 5% of trajectories. Empirical evaluations on the Gymnasium classic control suite demonstrate that TabPFN RL matches or surpasses Deep Q Network on CartPole v1, MountainCar v0, and Acrobot v1, without applying gradient descent or any extensive hyperparameter tuning. We discuss the theoretical aspects of how bootstrapped targets and non stationary visitation distributions violate the independence assumptions encoded in TabPFN's prior, yet the model retains a surprising generalization capacity. We further formalize the intrinsic context size limit of in context RL algorithms and propose principled truncation strategies that enable continual learning when the context is full. Our results establish prior fitted networks such as TabPFN as a viable foundation for fast and computationally efficient RL, opening new directions for gradient free RL with large pre trained transformers.
CORB-Planner: Corridor as Observations for RL Planning in High-Speed Flight
Zhang, Yechen, Gao, Bin, Wang, Gang, Sun, Jian, Li, Zhuo
Reinforcement learning (RL) has shown promise in a large number of robotic control tasks. Nevertheless, its deployment on unmanned aerial vehicles (UAVs) remains challenging, mainly because of reliance on accurate dynamic models and platform-specific sensing, which hinders cross-platform transfer. This paper presents the CORB-Planner (Corridor-as-Observations for RL B-spline planner), a real-time, RL-based trajectory planning framework for high-speed autonomous UAV flight across heterogeneous platforms. The key idea is to combine B-spline trajectory generation with the RL policy producing successive control points with a compact safe flight corridor (SFC) representation obtained via heuristic search. The SFC abstracts obstacle information in a low-dimensional form, mitigating overfitting to platform-specific details and reducing sensitivity to model inaccuracies. To narrow the sim-to-real gap, we adopt an easy-to-hard progressive training pipeline in simulation. A value-based soft decomposed-critic Q (SDCQ) algorithm is used to learn effective policies within approximately ten minutes of training. Benchmarks in simulation and real-world tests demonstrate real-time planning on lightweight onboard hardware and support maximum flight speeds up to 8.2m/s in dense, cluttered environments without external positioning. Compatibility with various UAV configurations (quadrotors, hexarotors) and modest onboard compute underlines the generality and robustness of CORB-Planner for practical deployment.
TransZero: Parallel Tree Expansion in MuZero using Transformer Networks
Malmsten, Emil, Böhmer, Wendelin
We present TransZero, a model-based reinforcement learning algorithm that removes the sequential bottleneck in Monte Carlo Tree Search (MCTS). Unlike MuZero, which constructs its search tree step by step using a recurrent dynamics model, TransZero employs a transformer-based network to generate multiple latent future states simultaneously. Combined with the Mean-Variance Constrained (MVC) evaluator that eliminates dependence on inherently sequential visitation counts, our approach enables the parallel expansion of entire subtrees during planning. Experiments in MiniGrid and LunarLander show that TransZero achieves up to an eleven-fold speedup in wall-clock time compared to MuZero while maintaining sample efficiency. These results demonstrate that parallel tree construction can substantially accelerate model-based reinforcement learning, bringing real-time decision-making in complex environments closer to practice. The code is publicly available on GitHub.
ManiVID-3D: Generalizable View-Invariant Reinforcement Learning for Robotic Manipulation via Disentangled 3D Representations
Li, Zheng, Qu, Pei, Jia, Yufei, Zhou, Shihui, Ge, Haizhou, Cao, Jiahang, Zhou, Jinni, Zhou, Guyue, Ma, Jun
Deploying visual reinforcement learning (RL) policies in real-world manipulation is often hindered by camera viewpoint changes. A policy trained from a fixed front-facing camera may fail when the camera is shifted--an unavoidable situation in real-world settings where sensor placement is hard to manage appropriately. Existing methods often rely on precise camera calibration or struggle with large perspective changes. To address these limitations, we propose ManiVID-3D, a novel 3D RL architecture designed for robotic manipulation, which learns view-invariant representations through self-supervised disentangled feature learning. The framework incorporates ViewNet, a lightweight yet effective module that automatically aligns point cloud observations from arbitrary viewpoints into a unified spatial coordinate system without the need for extrinsic calibration. Additionally, we develop an efficient GPU-accelerated batch rendering module capable of processing over 5000 frames per second, enabling large-scale training for 3D visual RL at unprecedented speeds. Extensive evaluation across 10 simulated and 5 real-world tasks demonstrates that our approach achieves a 44.7% higher success rate than state-of-the-art methods under viewpoint variations while using 80% fewer parameters. The system's robustness to severe perspective changes and strong sim-to-real performance highlight the effectiveness of learning geometrically consistent representations for scalable robotic manipulation in unstructured environments. Our project website can be found in https://zheng-joe-lee.github.io/manivid3d/.
RSL-RL: A Learning Library for Robotics Research
Schwarke, Clemens, Mittal, Mayank, Rudin, Nikita, Hoeller, David, Hutter, Marco
RSL-RL is an open-source Reinforcement Learning library tailored to the specific needs of the robotics community. Unlike broad general-purpose frameworks, its design philosophy prioritizes a compact and easily modifiable codebase, allowing researchers to adapt and extend algorithms with minimal overhead. The library focuses on algorithms most widely adopted in robotics, together with auxiliary techniques that address robotics-specific challenges. Optimized for GPU-only training, RSL-RL achieves high-throughput performance in large-scale simulation environments. Its effectiveness has been validated in both simulation benchmarks and in real-world robotic experiments, demonstrating its utility as a lightweight, extensible, and practical framework to develop learning-based robotic controllers.
Coordinated Reinforcement Learning Prefetching Architecture for Multicore Systems
Siddiqui, Mohammed Humaid, Guzman, Fernando, Wu, Yufei, Ann, Ruishu
Hardware prefetching is critical to fill the performance gap between CPU speeds and slower memory accesses. With multicore architectures becoming commonplace, traditional prefetchers are severely challenged. Independent core operation creates significant redundancy (up to 20% of prefetch requests are duplicates), causing unnecessary memory bus traffic and wasted bandwidth. Furthermore, cutting-edge prefetchers such as Pythia suffer from about a 10% performance loss when scaling from a single-core to a four-core system. To solve these problems, we propose CRL-Pythia, a coordinated reinforcement learning based prefetcher specifically designed for multicore systems. In this work, CRL-Pythia addresses these issues by enabling cross-core sharing of information and cooperative prefetching decisions, which greatly reduces redundant prefetch requests and improves learning convergence across cores. Our experiments demonstrate that CRL-Pythia outperforms single Pythia configurations in all cases, with approximately 12% IPC (instructions per cycle) improvement for bandwidth-constrained workloads, while imposing moderate hardware overhead. Our sensitivity analyses also verify its robustness and scalability, thereby making CRL-Pythia a practical and efficient solution to contemporary multicore systems.
ASL360: AI-Enabled Adaptive Streaming of Layered 360° Video over UAV-assisted Wireless Networks
Mohammadhosseini, Alireza, Chakareski, Jacob, Mastronarde, Nicholas
We propose ASL360, an adaptive deep reinforcement learning-based scheduler for on-demand 360° video streaming to mobile VR users in next generation wireless networks. We aim to maximize the overall Quality of Experience (QoE) of the users served over a UAV-assisted 5G wireless network. Our system model comprises a macro base station (MBS) and a UAV-mounted base station which both deploy mm-Wave transmission to the users. The 360° video is encoded into dependent layers and segmented tiles, allowing a user to schedule downloads of each layer's segments. Furthermore, each user utilizes multiple buffers to store the corresponding video layer's segments. We model the scheduling decision as a Constrained Markov Decision Process (CMDP), where the agent selects Base or Enhancement layers to maximize the QoE and use a policy gradient-based method (PPO) to find the optimal policy. Additionally, we implement a dynamic adjustment mechanism for cost components, allowing the system to adaptively balance and prioritize the video quality, buffer occupancy, and quality change based on real-time network and streaming session conditions. We demonstrate that ASL360 significantly improves the QoE, achieving approximately 2 dB higher average video quality, 80% lower average rebuffering time, and 57% lower video quality variation, relative to competitive baseline methods. Our results show the effectiveness of our layered and adaptive approach in enhancing the QoE in immersive videostreaming applications, particularly in dynamic and challenging network environments.
Resource-Aware Neural Network Pruning Using Graph-based Reinforcement Learning
Balemans, Dieter, Huybrechts, Thomas, Steckel, Jan, Mercelis, Siegfried
This paper presents a novel approach to neural network pruning by integrating a graph-based observation space into an AutoML framework to address the limitations of existing methods. Traditional pruning approaches often depend on hand-crafted heuristics and local optimization perspectives, which can lead to suboptimal performance and inefficient pruning strategies. Our framework transforms the pruning process by introducing a graph representation of the target neural network that captures complete topological relationships between layers and channels, replacing the limited layer-wise observation space with a global view of network structure. The core innovations include a Graph Attention Network (GAT) encoder that processes the network's graph representation and generates a rich embedding. Additionally, for the action space we transition from continuous pruning ratios to fine-grained binary action spaces which enables the agent to learn optimal channel importance criteria directly from data, moving away from predefined scoring functions. These contributions are modelled within a Constrained Markov Decision Process (CMDP) framework, allowing the agent to make informed pruning decisions while adhering to resource constraints such as target compression rates. For this, we design a self-competition reward system that encourages the agent to outperform its previous best performance while satisfying the defined constraints. We demonstrate the effectiveness of our approach through extensive experiments on benchmark datasets including CIFAR-10, CIFAR-100, and ImageNet. The experiments show that our method consistently outperforms traditional pruning techniques, showing state-of-the-art results while learning task-specific pruning strategies that identify functionally redundant connections beyond simple weight magnitude considerations.