Reinforcement Learning
Humanoid Whole-Body Badminton via Multi-Stage Reinforcement Learning
Liu, Chenhao, Jiang, Leyun, Wang, Yibo, Yao, Kairan, Fu, Jinchen, Ren, Xiaoyu
Figure: A fully autonomous humanoid returns machine-fed shuttles in a motion-capture arena; overlaid arcs show an incoming (blue) and returned (orange) trajectory.
Humanoid robots have demonstrated strong capabilities for interacting with static scenes across locomotion, manipulation, and more challenging loco-manipulation tasks. Yet the real world is dynamic, and quasi-static interactions are insufficient to cope with diverse environmental conditions. As a step toward more dynamic interaction scenarios, we present a reinforcement-learning-based training pipeline that produces a unified whole-body controller for humanoid badminton, enabling coordinated lower-body footwork and upper-body striking without motion priors or expert demonstrations. Training follows a three-stage curriculum (first footwork acquisition, then precision-guided racket swing generation, and finally task-focused refinement), yielding motions in which both legs and arms serve the hitting objective. For deployment, we incorporate an Extended Kalman Filter (EKF) to estimate and predict shuttlecock trajectories for target striking. We also introduce a prediction-free variant that dispenses with the EKF and explicit trajectory prediction. To validate the framework, we conduct five sets of experiments in both simulation and the real world. In simulation, two robots sustain a rally of 21 consecutive hits. Moreover, the prediction-free variant achieves successful hits with performance comparable to the target-known policy. In real-world tests, both the prediction and controller modules exhibit high accuracy, and on-court hitting achieves an outgoing shuttle speed of up to 19.1 m/s with a mean return landing distance of 4 m. These experimental results show that our proposed training scheme can deliver highly dynamic yet precise goal striking in badminton, and can be adapted to more dynamics-critical domains.
Humanoid platforms have been proposed as general-purpose embodied agents for human-compatible skills [1, 2, 3, 4, 5, 6, 7]. Despite rapid progress in locomotion and motion imitation, agile, contact-rich interactions with fast-moving objects under tight reaction windows remain underexplored.
Test-driven Reinforcement Learning in Continuous Control
Yu, Zhao, Wu, Xiuping, Ke, Liangjun
Reinforcement learning (RL) has been recognized as a powerful tool for robot control tasks. RL typically employs reward functions to define task objectives and guide agent learning. However, since the reward function serves the dual purpose of defining the optimal goal and guiding learning, it is challenging to design the reward function manually, which often results in a suboptimal task representation. To tackle the reward design challenge in RL, inspired by the satisficing theory, we propose a Test-driven Reinforcement Learning (TdRL) framework. In the TdRL framework, multiple test functions are used to represent the task objective rather than a single reward function. Test functions can be categorized as pass-fail tests and indicative tests, each dedicated to defining the optimal objective and guiding the learning process, respectively, thereby making defining tasks easier. Building upon such a task definition, we first prove that if a trajectory return function assigns higher returns to trajectories closer to the optimal trajectory set, maximum entropy policy optimization based on this return function will yield a policy that is closer to the optimal policy set. Then, we introduce a lexicographic heuristic approach to compare the relative distance relationship between trajectories and the optimal trajectory set for learning the trajectory return function. Furthermore, we develop an algorithm implementation of TdRL. Experimental results on the DeepMind Control Suite benchmark demonstrate that TdRL matches or outperforms handcrafted reward methods in policy training, with greater design simplicity and inherent support for multi-objective optimization. We argue that TdRL offers a novel perspective for representing task objectives, which could be helpful in addressing the reward design challenges in RL applications.
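The lexicographic comparison mentioned in the abstract above can be illustrated with a small sketch. This is not the authors' algorithm: the ordering used here (number of passed pass-fail tests first, then indicative test scores in a fixed priority order) and all names (`TrajectoryEval`, `sort_key`, `closer_to_optimal`) are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TrajectoryEval:
    """Hypothetical summary of one trajectory under a set of test functions."""
    passed: Tuple[bool, ...]        # outcomes of the pass-fail tests
    indicators: Tuple[float, ...]   # indicative test scores, higher is better


def sort_key(result: TrajectoryEval) -> Tuple[float, ...]:
    # Assumed ordering: the count of passed tests dominates; ties are broken
    # by the indicative scores, compared one by one in priority order.
    return (float(sum(result.passed)),) + tuple(result.indicators)


def closer_to_optimal(a: TrajectoryEval, b: TrajectoryEval) -> bool:
    """Return True if trajectory `a` ranks strictly above trajectory `b`."""
    # Python compares tuples lexicographically, which is the behaviour we want.
    return sort_key(a) > sort_key(b)
```

Under this assumed ordering, a trajectory that passes more pass-fail tests always ranks above one that passes fewer, regardless of its indicative scores; the indicative scores only matter among trajectories with the same number of passes.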
Permutation-Invariant Representation Learning for Robust and Privacy-Preserving Feature Selection
Liu, Rui, Zhe, Tao, Fu, Yanjie, Xia, Feng, Senator, Ted, Wang, Dongjie
Feature selection eliminates redundancy among features to improve downstream task performance while reducing computational overhead. Existing methods often struggle to capture intricate feature interactions and adapt across diverse application scenarios. Recent advances employ generative intelligence to alleviate these drawbacks. However, these methods remain constrained by permutation sensitivity in embedding and reliance on convexity assumptions in gradient-based search. To address these limitations, our initial work introduces a novel framework that integrates permutation-invariant embedding with policy-guided search. Although effective, it still left opportunities to adapt to realistic distributed scenarios. In practice, data across local clients is highly imbalanced, heterogeneous, and constrained by strict privacy regulations, limiting direct sharing. These challenges highlight the need for a framework that can integrate feature selection knowledge across clients without exposing sensitive information. In this extended journal version, we advance the framework from two perspectives: 1) developing a privacy-preserving knowledge fusion strategy to derive a unified representation space without sharing sensitive raw data. The results further demonstrate its strong generalization ability in federated learning scenarios. The code and data are publicly available at https://anonymous.4open.science/r/FedCAPS-08BF. Index Terms: Automated Feature Selection; Representation Learning; Reinforcement Learning; Federated Learning.
Feature selection removes redundant and irrelevant features to improve both predictive performance and computational efficiency in downstream tasks. Despite the growing dominance of deep learning, feature selection remains indispensable in scenarios characterized by high-dimensional data, the need for interpretability, and tight resource constraints.
Automated Construction of Artificial Lattice Structures with Designer Electronic States
Narasimha, Ganesh, Telychko, Mykola, Yang, Wooin, Baddorf, Arthur P., Ganesh, P., Li, An-Ping, Vasudevan, Rama
Manipulating matter with a scanning tunneling microscope (STM) enables creation of atomically defined artificial structures that host designer quantum states. However, the time-consuming nature of the manipulation process, coupled with the sensitivity of the STM tip, constrains the exploration of diverse configurations and limits the size of designed features. In this study, we present a reinforcement learning (RL)-based framework for creating artificial structures by spatially manipulating carbon monoxide (CO) molecules on a copper substrate using the STM tip. The automated workflow combines molecule detection and manipulation, employing deep learning-based object detection to locate CO molecules and linear assignment algorithms to allocate these molecules to designated target sites. We initially perform molecule maneuvering based on randomized parameter sampling for sample bias, tunneling current setpoint, and manipulation speed. This dataset is then structured into an action trajectory used to train an RL agent. The model is subsequently deployed on the STM for real-time fine-tuning of manipulation parameters during structure construction. Our approach incorporates path planning protocols coupled with active drift compensation to enable atomically precise fabrication of structures with significantly reduced human input while realizing larger-scale artificial lattices with desired electronic properties. Using our approach, we demonstrate the automated construction of an extended artificial graphene lattice and confirm the existence of a characteristic Dirac point in its electronic structure. Further challenges to the scalability of RL-based structural assembly are discussed.
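The allocation of detected molecules to target sites described above is an instance of the linear assignment problem. The sketch below is an illustration only, not the authors' implementation: it assumes 2D coordinates, Euclidean distance as the cost, equally many molecules and targets, and solves the problem by exhaustive search, which is only practical for a handful of points. The function name `assign_molecules` is hypothetical. For realistic sizes one would typically use a polynomial-time solver such as `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def assign_molecules(molecules: Sequence[Point], targets: Sequence[Point]) -> List[int]:
    """Return a list where entry i is the index of the target given to molecule i.

    Minimizes the total Euclidean travel distance by trying every permutation.
    """
    if len(molecules) != len(targets):
        raise ValueError("this sketch assumes equally many molecules and targets")
    best_cost = float("inf")
    best: List[int] = []
    for perm in permutations(range(len(targets))):
        # Total distance if molecule i is moved to targets[perm[i]].
        cost = sum(dist(m, targets[j]) for m, j in zip(molecules, perm))
        if cost < best_cost:
            best_cost, best = cost, list(perm)
    return best
```

For example, with molecules at `(0, 0)` and `(5, 5)` and targets at `(5, 4)` and `(0, 1)`, the cheapest assignment sends the first molecule to the second target and the second molecule to the first target.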
DIVER: Reinforced Diffusion Breaks Imitation Bottlenecks in End-to-End Autonomous Driving
Song, Ziying, Liu, Lin, Pan, Hongyu, Liao, Bencheng, Guo, Mingzhe, Yang, Lei, Zhang, Yongchang, Xu, Shaoqing, Jia, Caiyan, Luo, Yadan
Most end-to-end autonomous driving methods rely on imitation learning from single expert demonstrations, often leading to conservative and homogeneous behaviors that limit generalization in complex real-world scenarios. In this work, we propose DIVER, an end-to-end driving framework that integrates reinforcement learning with diffusion-based generation to produce diverse and feasible trajectories. At the core of DIVER lies a reinforced diffusion-based generation mechanism. First, the model conditions on map elements and surrounding agents to generate multiple reference trajectories from a single ground-truth trajectory, alleviating the limitations of imitation learning that arise from relying solely on single expert demonstrations. Second, reinforcement learning is employed to guide the diffusion process, where reward-based supervision enforces safety and diversity constraints on the generated trajectories, thereby enhancing their practicality and generalization capability. Furthermore, to address the limitations of L2-based open-loop metrics in capturing trajectory diversity, we propose a novel Diversity metric to evaluate the diversity of multi-mode predictions. Extensive experiments on the closed-loop NAVSIM and Bench2Drive benchmarks, as well as the open-loop nuScenes dataset, demonstrate that DIVER significantly improves trajectory diversity, effectively addressing the mode collapse problem inherent in imitation learning.
Towards Task-Oriented Flying: Framework, Infrastructure, and Principles
Huang, Kangyao, Wang, Hao, Chen, Jingyu, Chen, Jintao, Luo, Yu, Guo, Di, Zhang, Xiangkui, Ji, Xiangyang, Liu, Huaping
Deploying robot learning methods to aerial robots in unstructured environments remains both challenging and promising. While recent advances in deep reinforcement learning (DRL) have enabled end-to-end flight control, the field still lacks systematic design guidelines and a unified infrastructure to support reproducible training and real-world deployment. We present a task-oriented framework for end-to-end DRL in quadrotors that integrates design principles for complex task specification and reveals the interdependencies among simulated task definition, training design principles, and physical deployment. Our framework involves software infrastructure, hardware platforms, and open-source firmware to support a full-stack learning infrastructure and workflow. Extensive empirical results demonstrate robust flight and sim-to-real generalization under real-world disturbances. By reducing the entry barrier for deploying learning-based controllers on aerial robots, our work lays a practical foundation for advancing autonomous flight in dynamic and unstructured environments.
Statistical analysis of Inverse Entropy-regularized Reinforcement Learning
Belomestny, Denis, Naumov, Alexey, Samsonov, Sergey
Inverse reinforcement learning aims to infer the reward function that explains expert behavior observed through trajectories of state-action pairs. A long-standing difficulty in classical IRL is the non-uniqueness of the recovered reward: many reward functions can induce the same optimal policy, rendering the inverse problem ill-posed. In this paper, we develop a statistical framework for Inverse Entropy-regularized Reinforcement Learning that resolves this ambiguity by combining entropy regularization with a least-squares reconstruction of the reward from the soft Bellman residual. This combination yields a unique and well-defined so-called least-squares reward consistent with the expert policy. We model the expert demonstrations as a Markov chain with the invariant distribution defined by an unknown expert policy $\pi^\star$ and estimate the policy by a penalized maximum-likelihood procedure over a class of conditional distributions on the action space. We establish high-probability bounds for the excess Kullback-Leibler divergence between the estimated policy and the expert policy, accounting for statistical complexity through covering numbers of the policy class. These results lead to non-asymptotic minimax optimal convergence rates for the least-squares reward function, revealing the interplay between smoothing (entropy regularization), model complexity, and sample size. Our analysis bridges the gap between behavior cloning, inverse reinforcement learning, and modern statistical learning theory.
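The non-uniqueness mentioned in the abstract above can be made concrete with the standard soft Bellman equations for an entropy-regularized MDP. The notation here is an assumption, not taken from the paper: a finite action set, discount factor $\gamma$, temperature $\lambda > 0$, transition kernel $P$, and soft value functions $Q$ and $V$.

```latex
\begin{align}
  Q(s,a) &= r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\bigl[ V(s') \bigr], \\
  V(s) &= \lambda \log \sum_{a} \exp\bigl( Q(s,a) / \lambda \bigr), \\
  \pi^\star(a \mid s) &= \exp\bigl( (Q(s,a) - V(s)) / \lambda \bigr).
\end{align}
% Solving the last line for Q and substituting into the first gives
\begin{equation}
  r(s,a) = \lambda \log \pi^\star(a \mid s) + V(s)
           - \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\bigl[ V(s') \bigr].
\end{equation}
```

Under these assumptions, every choice of $V$ yields a reward consistent with the same policy $\pi^\star$, which is the ambiguity the abstract refers to. The least-squares reconstruction described there selects one member of this family; its exact criterion is not stated in the abstract and is not reproduced here.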
OrQstrator: An AI-Powered Framework for Advanced Quantum Circuit Optimization
We propose OrQstrator, a modular framework for quantum circuit optimization in the Noisy Intermediate-Scale Quantum (NISQ) era, powered by Deep Reinforcement Learning (DRL). Our orchestration engine intelligently selects among three complementary circuit optimizers: a DRL-based circuit rewriter trained to reduce depth and gate count via learned rewrite sequences; a domain-specific optimizer that performs efficient local gate resynthesis and numeric optimization; and a parameterized circuit instantiator that improves compilation by optimizing template circuits during gate set translation. These modules are coordinated by a central orchestration engine that learns coordination policies based on circuit structure, hardware constraints, and backend-aware performance features such as gate count, depth, and expected fidelity. The system outputs an optimized circuit for hardware-aware transpilation and execution, leveraging techniques from an existing state-of-the-art approach, called the NISQ Analyzer, to adapt to backend constraints.
An Adaptive Multi-Layered Honeynet Architecture for Threat Behavior Analysis via Deep Learning
The escalating sophistication and variety of cyber threats have rendered static honeypots inadequate, necessitating adaptive, intelligence-driven deception. In this work, ADLAH is introduced: an Adaptive Deep Learning Anomaly Detection Honeynet designed to maximize high-fidelity threat intelligence while minimizing cost through autonomous orchestration of infrastructure. The principal contribution is offered as an end-to-end architectural blueprint and vision for an AI-driven deception platform. Feasibility is evidenced by a functional prototype of the central decision mechanism, in which a reinforcement learning (RL) agent determines, in real time, when sessions should be escalated from low-interaction sensor nodes to dynamically provisioned, high-interaction honeypots. Because sufficient live data were unavailable, field-scale validation is not claimed; instead, design trade-offs and limitations are detailed, and a rigorous roadmap toward empirical evaluation at scale is provided. Beyond selective escalation and anomaly detection, the architecture pursues automated extraction, clustering, and versioning of bot attack chains, a core capability motivated by the empirical observation that exposed services are dominated by automated traffic. Together, these elements delineate a practical path toward cost-efficient capture of high-value adversary behavior, systematic bot versioning, and the production of actionable threat intelligence.
On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
Zhang, Charlie, Neubig, Graham, Yue, Xiang
Recent reinforcement learning (RL) techniques have yielded impressive reasoning improvements in language models, yet it remains unclear whether post-training truly extends a model's reasoning ability beyond what it acquires during pre-training. A central challenge is the lack of control in modern training pipelines: large-scale pre-training corpora are opaque, mid-training is often underexamined, and RL objectives interact with unknown prior knowledge in complex ways. To resolve this ambiguity, we develop a fully controlled experimental framework that isolates the causal contributions of pre-training, mid-training, and RL-based post-training. Our approach employs synthetic reasoning tasks with explicit atomic operations, parseable step-by-step reasoning traces, and systematic manipulation of training distributions. We evaluate models along two axes: extrapolative generalization to more complex compositions and contextual generalization across surface contexts. Using this framework, we reconcile competing views on RL's effectiveness. We show that: 1) RL produces true capability gains (pass@128) only when pre-training leaves sufficient headroom and when RL data target the model's edge of competence, tasks at the boundary that are difficult but not yet out of reach. 2) Contextual generalization requires minimal yet sufficient pre-training exposure, after which RL can reliably transfer. 3) Mid-training significantly enhances performance under fixed compute compared with RL only, demonstrating its central but underexplored role in training pipelines. 4) Process-level rewards reduce reward hacking and improve reasoning fidelity. Together, these results clarify the interplay between pre-training, mid-training, and RL, offering a foundation for understanding and improving reasoning LM training strategies.
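The abstract above reports capability gains in terms of pass@128. As a hedged illustration, the sketch below implements the widely used unbiased pass@k estimator from the code-generation evaluation literature: given n sampled solutions of which c are correct, it computes the probability that at least one of k samples drawn without replacement is correct. Whether the authors compute pass@128 in exactly this way is an assumption; the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random size-k
    # subset of the n samples contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k equal to n, the estimate is 1.0 as soon as a single sample is correct, which is why large-k metrics such as pass@128 are often read as a measure of what a model can solve at all rather than how reliably it solves it.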