Reinforcement Learning
Manipulate-to-Navigate: Reinforcement Learning with Visual Affordances and Manipulability Priors
Zhang, Yuying, Pajarinen, Joni
Mobile manipulation in dynamic environments is challenging due to movable obstacles blocking the robot's path. Traditional methods, which treat navigation and manipulation as separate tasks, often fail in such 'manipulate-to-navigate' scenarios, as obstacles must be removed before navigation. In these cases, active interaction with the environment is required to clear obstacles while ensuring sufficient space for movement. To address the manipulate-to-navigate problem, we propose a reinforcement learning-based approach for learning manipulation actions that facilitate subsequent navigation. Our method combines manipulability priors, which focus the robot on high-manipulability body positions, with affordance maps for selecting high-quality manipulation actions. By focusing on feasible and meaningful actions, our approach reduces unnecessary exploration and allows the robot to learn manipulation strategies more effectively. We present two new manipulate-to-navigate simulation tasks called Reach and Door with the Boston Dynamics Spot robot. The first task tests whether the robot can select a good hand position in the target area such that the robot base can move effectively forward while keeping the end effector position fixed. The second task requires the robot to move a door aside in order to clear the navigation path. Both tasks require manipulation first, followed by navigating the base forward. Results show that our method allows a robot to effectively interact with and traverse dynamic environments. Finally, we transfer the learned policy to a real Boston Dynamics Spot robot, which successfully performs the Reach task.
Bayesian Optimization-based Search for Agent Control in Automated Game Testing
This work introduces an automated testing approach that employs agents controlling game characters to detect potential bugs within a game level. Harnessing Bayesian Optimization (BO) to execute sample-efficient search, the method determines the next sampling point by analyzing the data collected so far and calculating the data point that will maximize information acquisition. To support the BO process, we introduce a game testing-specific model built on top of a grid map that features the smoothness and uncertainty estimation required by BO; most importantly, it does not suffer from the scalability issues that traditional models carry. The experiments demonstrate that the approach significantly improves map coverage capabilities in both time efficiency and exploration distribution. There is a spectrum of issues that can be encountered in a game, ranging from low-level ones, e.g., those related to collision detection, game mechanics, performance, and crash states, all the way to high-level problems like game balance or player experience [1], [2].
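To make the "next sampling point" idea above concrete, here is a minimal sketch of one common acquisition rule, the upper confidence bound (UCB), applied to a grid of cells. It is not the model from the paper: the function name `next_sample_ucb`, the `kappa` parameter, and the assumption that a per-cell mean and uncertainty are already available as nested lists are all hypothetical choices for illustration.

```python
def next_sample_ucb(mean, std, kappa=2.0):
    """Return the (row, col) of the grid cell with the highest UCB score.

    mean, std: nested lists of equal shape holding a per-cell estimate and
    its uncertainty (assumed to come from some surrogate model).
    kappa: hypothetical weight trading exploration against exploitation.
    """
    best_cell, best_score = None, float("-inf")
    for r, (mean_row, std_row) in enumerate(zip(mean, std)):
        for c, (m, s) in enumerate(zip(mean_row, std_row)):
            score = m + kappa * s  # high uncertainty raises the score
            if score > best_score:
                best_cell, best_score = (r, c), score
    return best_cell


if __name__ == "__main__":
    mean = [[0.0, 0.5], [0.2, 0.1]]
    std = [[1.0, 0.1], [0.1, 0.1]]
    # With exploration enabled, the poorly known cell wins.
    print(next_sample_ucb(mean, std, kappa=2.0))
    # With kappa = 0 the rule is purely greedy on the mean.
    print(next_sample_ucb(mean, std, kappa=0.0))
```

A larger `kappa` pushes the agent toward cells it knows little about, which is the behaviour a coverage-oriented tester would typically want; other acquisition functions could be substituted without changing the surrounding loop.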
From Intent to Execution: Multimodal Chain-of-Thought Reinforcement Learning for Precise CAD Code Generation
Niu, Ke, Yu, Haiyang, Chen, Zhuofan, Zhao, Mengyang, Fu, Teng, Li, Bin, Xue, Xiangyang
Computer-Aided Design (CAD) plays a vital role in engineering and manufacturing, yet current CAD workflows require extensive domain expertise and manual modeling effort. Recent advances in large language models (LLMs) have made it possible to generate code from natural language, opening new opportunities for automating parametric 3D modeling. However, directly translating human design intent into executable CAD code remains highly challenging, due to the need for logical reasoning, syntactic correctness, and numerical precision. In this work, we propose CAD-RL, a multi-modal Chain-of-Thought (CoT) guided reinforcement learning post-training framework for CAD modeling code generation. Our method combines CoT-based Cold Start with goal-driven reinforcement learning post-training using three task-specific rewards: executability reward, geometric accuracy reward, and external evaluation reward. To ensure stable policy learning under sparse and high-variance reward conditions, we introduce three targeted optimization strategies: Trust Region Stretch for improved exploration, Precision Token Loss for enhanced dimensional parameter accuracy, and Overlong Filtering to reduce noisy supervision. To support training and benchmarking, we release ExeCAD, a novel dataset comprising 16,540 real-world CAD examples with paired natural language and structured design language descriptions, executable CADQuery scripts, and rendered 3D models. Experiments demonstrate that CAD-RL achieves significant improvements in reasoning quality, output precision, and code executability over existing VLMs.
Hierarchical Multi-Agent Reinforcement Learning with Control Barrier Functions for Safety-Critical Autonomous Systems
Ahmad, H. M. Sabbir, Sabouni, Ehsan, Wasilkoff, Alexander, Budhraja, Param, Guo, Zijian, Zhang, Songyuan, Fan, Chuchu, Cassandras, Christos, Li, Wenchao
We address the problem of safe policy learning in multi-agent safety-critical autonomous systems. In such systems, it is necessary for each agent to meet the safety requirements at all times while also cooperating with other agents to accomplish the task. Toward this end, we propose a safe Hierarchical Multi-Agent Reinforcement Learning (HMARL) approach based on Control Barrier Functions (CBFs). Our proposed hierarchical approach decomposes the overall reinforcement learning problem into two levels: learning joint cooperative behavior at the higher level, and learning safe individual behavior at the lower (agent) level, conditioned on the high-level policy. Specifically, we propose a skill-based HMARL-CBF algorithm in which the higher-level problem involves learning a joint policy over the skills for all the agents and the lower-level problem involves learning policies to execute the skills safely with CBFs. We validate our approach on challenging environment scenarios in which a large number of agents have to safely navigate through conflicting road networks. Compared with existing state-of-the-art methods, our approach significantly improves safety, achieving a near-perfect (within 5%) success/safety rate while also improving performance across all the environments.
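As background for the CBF component mentioned above, the following is a minimal sketch of a control barrier function safety filter on a toy one-dimensional system. It is not the HMARL-CBF algorithm: the dynamics (a single integrator, x' = u), the safe set h(x) = x_max - x >= 0, the gain `alpha`, and all function names are assumptions chosen so that the filter has a closed form. The CBF condition dh/dt + alpha * h >= 0 reduces here to u <= alpha * (x_max - x), so the minimally invasive safe action is the nominal action clipped to that bound.

```python
def cbf_filter(x, u_nominal, x_max=1.0, alpha=2.0):
    """Clip a nominal action so that h(x) = x_max - x stays non-negative.

    For x' = u, the condition dh/dt + alpha * h >= 0 becomes
    u <= alpha * (x_max - x); the closest safe action is a simple min.
    """
    u_bound = alpha * (x_max - x)
    return min(u_nominal, u_bound)


def rollout(x0, u_nominal, steps=200, dt=0.05, x_max=1.0, alpha=2.0):
    """Euler-integrate x' = u while applying the filter at every step."""
    x = x0
    for _ in range(steps):
        u = cbf_filter(x, u_nominal, x_max=x_max, alpha=alpha)
        x = x + dt * u
    return x


if __name__ == "__main__":
    # An aggressive nominal action would overshoot x_max without the filter.
    print(rollout(x0=0.0, u_nominal=5.0))
```

In this discretised toy setting the bound is preserved as long as alpha * dt <= 1; in a learned system the nominal action would come from a policy rather than a constant.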
Bridging Econometrics and AI: VaR Estimation via Reinforcement Learning and GARCH Models
Pokou, Fredy, Kamdem, Jules Sadefo, Benhmad, François
Context: Forecasting stock returns is a long-standing challenge in financial economics, with significant implications for both risk management and regulatory compliance. Traditional econometric models such as GARCH (Bollerslev, 1986) capture volatility persistence but fail to fully account for key stylized facts of financial time series: fat tails, volatility clustering, and leverage effects (Glosten et al., 1993). Similarly, modern machine learning and deep learning methods, although capable of modeling nonlinear dynamics (Goodfellow et al., 2016; Tealab, 2018), tend to underperform during rare but impactful market shocks (Fawcett and Provost, 1997; Pokou, 2022). As illustrated in Figure 1, these limitations often result in systematic mispredictions of excess returns, especially in turbulent markets. These forecasting inaccuracies are critical because they directly translate into unreliable estimates of Value-at-Risk (VaR), the benchmark risk measure under Basel regulatory frameworks (Basel Committee on Banking Supervision, 2017). Overestimation inflates capital requirements, whereas underestimation exposes institutions to excessive losses. To mitigate these shortcomings, the recent literature has shifted from precise return forecasting to directional return prediction, reframing the task as a classification problem: determining whether returns will be positive or negative (Kanas, 2001; Nyberg, 2011; Alostad and Davulcu, 2017). Beyond the standard zero threshold, quantile and volatility-based criteria have been introduced to better isolate significant market movements (Chung and Hong, 2007; Linton and Whang, 2007).
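For readers unfamiliar with how a volatility model feeds a VaR estimate, here is a minimal sketch of the textbook pipeline: a GARCH(1,1) variance recursion followed by a Gaussian parametric VaR. It is not the hybrid method of the paper; the parameter values, the normality assumption, and the function names are all illustrative assumptions.

```python
import math
from statistics import NormalDist


def garch11_next_variance(returns, omega, alpha, beta):
    """One-step-ahead conditional variance from a GARCH(1,1) recursion.

    sigma2_{t+1} = omega + alpha * r_t**2 + beta * sigma2_t,
    started from the unconditional variance omega / (1 - alpha - beta).
    Requires alpha + beta < 1.
    """
    sigma2 = omega / (1.0 - alpha - beta)
    for r in returns:
        sigma2 = omega + alpha * r * r + beta * sigma2
    return sigma2


def parametric_var(sigma2, level=0.05, mean=0.0):
    """Gaussian VaR at the given tail level, reported as a positive loss."""
    z = NormalDist().inv_cdf(level)  # negative for level < 0.5
    return -(mean + math.sqrt(sigma2) * z)


if __name__ == "__main__":
    # Hypothetical parameters and returns (in percent), for illustration only.
    omega, alpha, beta = 0.1, 0.1, 0.8
    calm = garch11_next_variance([0.1, -0.2, 0.1], omega, alpha, beta)
    shock = garch11_next_variance([0.1, -0.2, -4.0], omega, alpha, beta)
    print(parametric_var(calm), parametric_var(shock))
```

A large recent return raises the forecast variance and therefore the VaR, which is the volatility-clustering behaviour GARCH is designed to capture; the Gaussian quantile is the part most exposed to the fat-tail criticism raised above.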
Contemplative Artificial Intelligence
Laukkonen, Ruben, Inglis, Fionn, Chandaria, Shamil, Sandved-Smith, Lars, Lopez-Sola, Edmundo, Hohwy, Jakob, Gold, Jonathan, Elwood, Adam
As artificial intelligence (AI) improves, traditional alignment strategies may falter in the face of unpredictable self-improvement, hidden subgoals, and the sheer complexity of intelligent systems. Inspired by contemplative wisdom traditions, we show how four axiomatic principles can instil a resilient Wise World Model in AI systems. First, mindfulness enables self-monitoring and recalibration of emergent subgoals. Second, emptiness forestalls dogmatic goal fixation and relaxes rigid priors. Third, non-duality dissolves adversarial self-other boundaries. Fourth, boundless care motivates the universal reduction of suffering. We find that prompting AI to reflect on these principles improves performance on the AILuminate Benchmark (d=.96) and boosts cooperation and joint-reward on the Prisoner's Dilemma task (d=7+). We offer detailed implementation strategies at the level of architectures, constitutions, and reinforcement on chain-of-thought. For future systems, active inference may offer the self-organizing and dynamic coupling capabilities needed to enact Contemplative AI in embodied agents.
SSPO: Self-traced Step-wise Preference Optimization for Process Supervision and Reasoning Compression
Xu, Yuyang, Cheng, Yi, Ying, Haochao, Du, Zhuoyun, Hu, Renjun, Shi, Xing, Lin, Wei, Wu, Jian
Test-time scaling has proven effective in further enhancing the performance of pretrained Large Language Models (LLMs). However, mainstream post-training methods (i.e., reinforcement learning (RL) with chain-of-thought (CoT) reasoning) often incur substantial computational overhead due to auxiliary models and overthinking. In this paper, we empirically reveal that the incorrect answers partially stem from verbose reasoning processes lacking correct self-fix, where errors accumulate across multiple reasoning steps. To this end, we propose Self-traced Step-wise Preference Optimization (SSPO), a pluggable RL process supervision framework that enables fine-grained optimization of each reasoning step. Specifically, SSPO requires neither auxiliary models nor stepwise manual annotations. Instead, it leverages step-wise preference signals generated by the model itself to guide the optimization process for reasoning compression. Experiments demonstrate that the generated reasoning sequences from SSPO are both accurate and succinct, effectively mitigating overthinking behaviors without compromising model performance across diverse domains and languages.
Results of the NeurIPS 2023 Neural MMO Competition on Multi-task Reinforcement Learning
Suárez, Joseph, Choe, Kyoung Whan, Bloomin, David, Gao, Jianming, Li, Yunkun, Feng, Yao, Pola, Saidinesh, Zhang, Kun, Zhu, Yonghui, Pinnaparaju, Nikhil, Li, Hao Xiang, Kanna, Nishaanth, Scott, Daniel, Sullivan, Ryan, Shuman, Rose S., de Alcântara, Lucas, Bradley, Herbie, You, Kirsty, Wu, Bo, Jiang, Yuhao, Li, Qimai, Chen, Jiaxin, Castricato, Louis, Zhu, Xiaolong, Isola, Phillip
We present the results of the NeurIPS 2023 Neural MMO Competition, which attracted over 200 participants and submissions. Participants trained goal-conditional policies that generalize to tasks, maps, and opponents never seen during training. The top solution achieved a score 4x higher than our baseline within 8 hours of training on a single 4090 GPU. We open-source everything relating to Neural MMO and the competition under the MIT license, including the policy weights and training code for our baseline and for the top submissions.