AITopics | Reinforcement Learning

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.
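The trial-and-error idea in this definition can be illustrated with a minimal sketch: an ε-greedy agent on a hypothetical three-armed bandit. The agent is never told which action is best and must estimate each action's value from the rewards it receives. The reward means, noise level, step count, and ε below are illustrative assumptions.

```python
import random


def run_epsilon_greedy(true_means, steps=5000, epsilon=0.1, seed=0):
    """Estimate action values from sampled rewards alone (no 'correct action' labels)."""
    rng = random.Random(seed)
    n_actions = len(true_means)
    estimates = [0.0] * n_actions  # running mean reward per action
    counts = [0] * n_actions       # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            action = rng.randrange(n_actions)
        else:
            # Exploit: pick the action with the highest estimated value so far.
            action = max(range(n_actions), key=lambda a: estimates[a])
        # The environment returns only a noisy numerical reward.
        reward = rng.gauss(true_means[action], 1.0)
        counts[action] += 1
        # Incremental update of the sample mean for the chosen action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


estimates, counts = run_epsilon_greedy([0.2, 0.5, 1.0])
print("estimated values:", [round(v, 2) for v in estimates])
print("pulls per action:", counts)
```

With enough steps the agent typically ends up pulling the highest-mean arm most often; ε controls the trade-off between exploring untested actions and exploiting the current best estimate.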

Discovery of Options via Meta-Learned Subgoals

Neural Information Processing Systems, Aug-19-2025, 00:23:39 GMT

Temporal abstractions in the form of options have been shown to help reinforcement learning (RL) agents learn faster.
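For readers unfamiliar with options, the sketch below shows the standard (I, π, β) formulation from Sutton, Precup and Singh (1999): an initiation set, an intra-option policy, and a termination probability. Once invoked, an option runs over several primitive steps and yields a discounted multi-step return. This is not the meta-learned subgoal method of the paper; the corridor environment and the `go_right` option are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Option:
    """An option (I, pi, beta) in the sense of Sutton, Precup and Singh (1999)."""
    can_start: Callable[[int], bool]     # initiation set I
    policy: Callable[[int], int]         # intra-option policy: state -> primitive action
    termination: Callable[[int], float]  # beta(s): probability of stopping in state s


def run_option(option: Option, state: int,
               step: Callable[[int, int], Tuple[int, float]],
               gamma: float = 0.9, seed: int = 0, max_steps: int = 100):
    """Execute an option to termination; return (final_state, discounted_return, duration)."""
    if not option.can_start(state):
        raise ValueError("option is not available in this state")
    rng = random.Random(seed)
    ret, discount, k = 0.0, 1.0, 0
    while k < max_steps:
        state, reward = step(state, option.policy(state))
        ret += discount * reward
        discount *= gamma
        k += 1
        if rng.random() < option.termination(state):
            break
    return state, ret, k


# Hypothetical 1-D corridor with states 0..10 and a reward of 1 on reaching state 10.
def corridor_step(s: int, a: int) -> Tuple[int, float]:
    s2 = min(max(s + a, 0), 10)
    return s2, (1.0 if s2 == 10 else 0.0)


go_right = Option(can_start=lambda s: s < 10,
                  policy=lambda s: +1,
                  termination=lambda s: 1.0 if s == 10 else 0.0)

final_state, ret, duration = run_option(go_right, 3, corridor_step)
print(final_state, duration, round(ret, 4))
```

From the agent's point of view, invoking `go_right` is a single decision that covers seven primitive steps, which is the kind of temporal abstraction that can speed up learning.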

artificial intelligence, machine learning, reinforcement learning

Neural Information Processing Systems

Country:

North America > United States > Michigan (0.04)
North America > United States > Wisconsin > Dane County > Madison (0.04)
Asia > China (0.04)

Industry: Leisure & Entertainment > Games (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

f8b7aa3a0d349d9562b424160ad18612-Paper.pdf

Neural Information Processing Systems, Aug-19-2025, 00:13:05 GMT

machine learning, natural language, reinforcement learning

Neural Information Processing Systems

Country: Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Vision (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.68)

Manipulate-to-Navigate: Reinforcement Learning with Visual Affordances and Manipulability Priors

Zhang, Yuying, Pajarinen, Joni

arXiv.org Artificial Intelligence, Aug-19-2025

Mobile manipulation in dynamic environments is challenging because movable obstacles can block the robot's path. Traditional methods, which treat navigation and manipulation as separate tasks, often fail in such 'manipulate-to-navigate' scenarios, where obstacles must be removed before navigation. In these cases, active interaction with the environment is required to clear obstacles while ensuring sufficient space for movement. To address the manipulate-to-navigate problem, we propose a reinforcement learning-based approach for learning manipulation actions that facilitate subsequent navigation. Our method combines manipulability priors, which focus the robot on high-manipulability body positions, with affordance maps for selecting high-quality manipulation actions. By focusing on feasible and meaningful actions, our approach reduces unnecessary exploration and allows the robot to learn manipulation strategies more effectively. We present two new manipulate-to-navigate simulation tasks, called Reach and Door, with the Boston Dynamics Spot robot. The first task tests whether the robot can select a good hand position in the target area such that the robot base can move effectively forward while keeping the end-effector position fixed. The second task requires the robot to move a door aside to clear the navigation path. Both tasks require manipulation first, followed by navigating the base forward. Results show that our method allows a robot to effectively interact with and traverse dynamic environments. Finally, we transfer the learned policy to a real Boston Dynamics Spot robot, which successfully performs the Reach task.
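The abstract does not define its manipulability priors. One standard way to quantify manipulability is Yoshikawa's index w = sqrt(det(J Jᵀ)), where J is the arm Jacobian. The sketch below computes it for a hypothetical planar two-link arm, not the Spot arm. Whether the paper uses this exact index is an assumption.

```python
import math


def planar_2link_jacobian(q1, q2, l1=1.0, l2=1.0):
    """Position Jacobian of a planar two-link arm (end-effector x, y w.r.t. q1, q2)."""
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [l1 * c1 + l2 * c12, l2 * c12]]


def yoshikawa_manipulability(J):
    """w = sqrt(det(J J^T)) for a 2 x n Jacobian given as nested lists."""
    # Entries of the symmetric 2 x 2 matrix J J^T.
    a = sum(x * x for x in J[0])
    b = sum(x * y for x, y in zip(J[0], J[1]))
    d = sum(y * y for y in J[1])
    return math.sqrt(max(a * d - b * b, 0.0))  # clamp tiny negative round-off


# Score candidate elbow angles; w is zero at the singular (fully stretched) pose.
for q2 in (0.0, math.pi / 4, math.pi / 2):
    w = yoshikawa_manipulability(planar_2link_jacobian(0.3, q2))
    print(f"q2 = {q2:.2f} rad -> manipulability {w:.3f}")
```

For this arm the index reduces to l1·l2·|sin q2|. A prior built on it would favour bent-elbow configurations over near-singular, fully stretched ones.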

artificial intelligence, machine learning, reinforcement learning

arXiv.org Artificial Intelligence

2508.13151

Country: Europe (0.28)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Bayesian Optimization-based Search for Agent Control in Automated Game Testing

Celemin, Carlos

arXiv.org Artificial Intelligence, Aug-19-2025

This work introduces an automated testing approach that employs agents controlling game characters to detect potential bugs within a game level. Harnessing Bayesian Optimization (BO) for sample-efficient search, the method determines the next sampling point by analyzing the data collected so far and calculating the data point that will maximize information acquisition. To support the BO process, we introduce a game-testing-specific model built on top of a grid map that provides the smoothness and uncertainty estimation required by BO and, most importantly, does not suffer from the scalability issues of traditional models. The experiments demonstrate that the approach significantly improves map coverage in terms of both time efficiency and exploration distribution. There is a spectrum of issues that can be encountered in a game, ranging from low-level problems, e.g., those related to collision detection, game mechanics, performance, and crash states, all the way to high-level problems such as game balance or player experience [1], [2].
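The abstract describes BO choosing the next sampling point from a grid-map surrogate that offers smoothness and uncertainty estimates, but it does not specify that model. The sketch below swaps in a hypothetical, deliberately simple surrogate: a Gaussian-kernel-smoothed mean per cell, plus an uncertainty that shrinks with the kernel-weighted number of nearby observations. An upper-confidence-bound (UCB) rule then picks the next cell. It is not the authors' model. Its cost grows with cells × observations, so it does not reproduce the scalability property the paper claims.

```python
import math


def grid_ucb_next(grid_w, grid_h, observations, length_scale=2.0, beta=2.0):
    """Pick the next grid cell to visit using an upper-confidence-bound rule.

    observations: dict mapping (x, y) -> observed value (e.g., a novelty score).
    The surrogate is a hypothetical kernel-smoothed grid, not the paper's model.
    """
    best_cell, best_score = None, -math.inf
    for x in range(grid_w):
        for y in range(grid_h):
            weight_sum, value_sum = 0.0, 0.0
            for (ox, oy), value in observations.items():
                d2 = (x - ox) ** 2 + (y - oy) ** 2
                w = math.exp(-d2 / (2 * length_scale ** 2))  # smoothness via Gaussian kernel
                weight_sum += w
                value_sum += w * value
            mean = value_sum / weight_sum if weight_sum > 0 else 0.0
            std = 1.0 / math.sqrt(1.0 + weight_sum)  # pseudo-count uncertainty
            score = mean + beta * std                # UCB acquisition
            if score > best_score:
                best_cell, best_score = (x, y), score
    return best_cell


# After visiting only the corner (0, 0), the least-explored cell is chosen next.
print(grid_ucb_next(10, 10, {(0, 0): 0.0}))
```

The `beta` parameter trades off revisiting high-value regions against covering uncertain, unexplored parts of the level.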

artificial intelligence, machine learning, reinforcement learning

arXiv.org Artificial Intelligence

doi: 10.1109/CoG60054.2024.10645653

2508.13121

Genre: Research Report (0.50)

Industry: Leisure & Entertainment > Games > Computer Games (0.89)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.49)

From Intent to Execution: Multimodal Chain-of-Thought Reinforcement Learning for Precise CAD Code Generation

Niu, Ke, Yu, Haiyang, Chen, Zhuofan, Zhao, Mengyang, Fu, Teng, Li, Bin, Xue, Xiangyang

arXiv.org Artificial Intelligence, Aug-19-2025

Computer-Aided Design (CAD) plays a vital role in engineering and manufacturing, yet current CAD workflows require extensive domain expertise and manual modeling effort. Recent advances in large language models (LLMs) have made it possible to generate code from natural language, opening new opportunities for automating parametric 3D modeling. However, directly translating human design intent into executable CAD code remains highly challenging due to the need for logical reasoning, syntactic correctness, and numerical precision. In this work, we propose CAD-RL, a multimodal Chain-of-Thought (CoT) guided reinforcement learning post-training framework for CAD modeling code generation. Our method combines a CoT-based Cold Start with goal-driven reinforcement learning post-training using three task-specific rewards: an executability reward, a geometric accuracy reward, and an external evaluation reward. To ensure stable policy learning under sparse and high-variance reward conditions, we introduce three targeted optimization strategies: Trust Region Stretch for improved exploration, Precision Token Loss for more accurate dimension parameters, and Overlong Filtering to reduce noisy supervision. To support training and benchmarking, we release ExeCAD, a novel dataset comprising 16,540 real-world CAD examples with paired natural language and structured design language descriptions, executable CADQuery scripts, and rendered 3D models. Experiments demonstrate that CAD-RL achieves significant improvements in reasoning quality, output precision, and code executability over existing VLMs.
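The abstract names three reward terms but not how they are combined. Under loudly stated assumptions, the sketch below gates everything on executability and measures geometric accuracy as voxel intersection-over-union (IoU). It then adds a weighted external-evaluation score. The gating, the IoU choice, and the weights `w_exec`, `w_geom`, `w_ext` are all hypothetical.

```python
def cad_reward(executes, pred_voxels, target_voxels, external_score,
               w_exec=1.0, w_geom=1.0, w_ext=0.5):
    """Hypothetical combination of executability, geometric-accuracy and external rewards.

    executes: whether the generated CAD script ran without error.
    pred_voxels / target_voxels: sets of occupied (x, y, z) voxels of the two shapes.
    external_score: a score in [0, 1] from some external evaluator (assumed).
    """
    if not executes:
        return 0.0  # nothing to measure geometrically if the script does not run
    union = pred_voxels | target_voxels
    iou = len(pred_voxels & target_voxels) / len(union) if union else 1.0
    return w_exec + w_geom * iou + w_ext * external_score


pred = {(0, 0, 0), (1, 0, 0)}
target = {(0, 0, 0)}
print(cad_reward(True, pred, target, external_score=0.4))
```

Gating on executability makes the reward sparse for early, mostly broken generations. That sparsity is the kind of condition the abstract's stabilization strategies are meant to address.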

arxiv preprint arxiv, large language model, machine learning

arXiv.org Artificial Intelligence

2508.10118

Genre: Research Report (0.50)

Industry: Information Technology (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.93)

Hierarchical Multi-Agent Reinforcement Learning with Control Barrier Functions for Safety-Critical Autonomous Systems

Ahmad, H. M. Sabbir, Sabouni, Ehsan, Wasilkoff, Alexander, Budhraja, Param, Guo, Zijian, Zhang, Songyuan, Fan, Chuchu, Cassandras, Christos, Li, Wenchao

arXiv.org Artificial Intelligence, Aug-19-2025

We address the problem of safe policy learning in multi-agent safety-critical autonomous systems. In such systems, each agent must meet the safety requirements at all times while also cooperating with other agents to accomplish the task. To this end, we propose a safe Hierarchical Multi-Agent Reinforcement Learning (HMARL) approach based on Control Barrier Functions (CBFs). Our hierarchical approach decomposes the overall reinforcement learning problem into two levels: learning joint cooperative behavior at the higher level, and learning safe individual behavior at the lower (agent) level, conditioned on the high-level policy. Specifically, we propose a skill-based HMARL-CBF algorithm in which the higher-level problem involves learning a joint policy over the skills for all the agents, and the lower-level problem involves learning policies to execute the skills safely with CBFs. We validate our approach on challenging environment scenarios in which a large number of agents must safely navigate through conflicting road networks. Compared with existing state-of-the-art methods, our approach significantly improves safety, achieving a near-perfect (within 5%) success/safety rate while also improving performance across all the environments.
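To show what "executing skills safely with CBFs" means at the lowest level, the sketch below implements a CBF safety filter for a single agent with hypothetical 1-D dynamics dx/dt = u and barrier h(x) = x_max − x. It does not model the hierarchical skill selection or the multi-agent coordination. The dynamics, barrier, and gain `alpha` are assumptions.

```python
def cbf_filter(x, u_nominal, x_max, alpha=1.0):
    """Safety filter for dx/dt = u with barrier h(x) = x_max - x.

    The CBF condition dh/dt >= -alpha * h reduces to u <= alpha * (x_max - x);
    for a scalar input the closest admissible u (the CBF-QP solution) is a clamp.
    """
    return min(u_nominal, alpha * (x_max - x))


def simulate(x0=0.0, x_max=1.0, u_nominal=2.0, alpha=1.0, dt=0.05, steps=200):
    """Forward-Euler rollout; with alpha * dt <= 1 the discrete state never crosses x_max."""
    x, trajectory = x0, [x0]
    for _ in range(steps):
        # An aggressive nominal input of 2.0 would otherwise drive x past x_max.
        u = cbf_filter(x, u_nominal, x_max, alpha)
        x += dt * u
        trajectory.append(x)
    return trajectory


traj = simulate()
print(max(traj) <= 1.0, round(traj[-1], 4))
```

The filter leaves the learned (nominal) action untouched whenever it is already safe and only intervenes near the boundary. This is why CBF layers can be combined with RL policies without dictating the task behaviour.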

artificial intelligence, machine learning, reinforcement learning

arXiv.org Artificial Intelligence

2507.1485

Country: North America > United States > Massachusetts (0.28)

Genre: Research Report > Promising Solution (0.34)

Industry:

Transportation > Ground > Road (0.88)
Education (0.87)
Transportation > Infrastructure & Services (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.49)

Bridging Econometrics and AI: VaR Estimation via Reinforcement Learning and GARCH Models

Pokou, Fredy, Kamdem, Jules Sadefo, Benhmad, François

arXiv.org Artificial Intelligence, Aug-19-2025

Context: Forecasting stock returns is a long-standing challenge in financial economics, with significant implications for both risk management and regulatory compliance. Traditional econometric models such as GARCH (Bollerslev, 1986) capture volatility persistence but fail to fully account for key stylized facts of financial time series: fat tails, volatility clustering, and leverage effects (Glosten et al., 1993). Similarly, modern machine learning and deep learning methods, although capable of modeling nonlinear dynamics (Goodfellow et al., 2016; Tealab, 2018), tend to underperform during rare but impactful market shocks (Fawcett and Provost, 1997; Pokou, 2022). As illustrated in Figure 1, these limitations often result in systematic mispredictions of excess returns, especially in turbulent markets. These forecasting inaccuracies are critical because they translate directly into unreliable estimates of Value-at-Risk (VaR), the benchmark risk measure under the Basel regulatory frameworks (Basel Committee on Banking Supervision, 2017). Overestimation inflates capital requirements, whereas underestimation exposes institutions to excessive losses. To mitigate these shortcomings, the recent literature has shifted from precise return forecasting to directional return prediction, reframing the task as a classification problem: determining whether returns will be positive or negative (Kanas, 2001; Nyberg, 2011; Alostad and Davulcu, 2017). Beyond the standard zero threshold, quantile- and volatility-based criteria have been introduced to better isolate significant market movements (Chung and Hong, 2007; Linton and Whang, 2007).
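As background for the econometric side, the sketch below shows the standard GARCH(1,1) conditional-variance recursion, σ²ₜ = ω + α(rₜ₋₁ − μ)² + βσ²ₜ₋₁, and a one-period Gaussian VaR, −(μ + z_p σₜ), reported as a positive loss. This is the classical baseline the passage critiques, not the paper's hybrid RL + GARCH method. The parameters and returns are made up for illustration.

```python
from statistics import NormalDist


def garch11_variances(returns, omega, alpha, beta, mu=0.0):
    """One-step-ahead conditional variances from a GARCH(1,1) recursion.

    sigma2[t] = omega + alpha * (r[t-1] - mu)**2 + beta * sigma2[t-1]
    """
    if alpha + beta >= 1:
        raise ValueError("need alpha + beta < 1 for a finite unconditional variance")
    sigma2 = omega / (1.0 - alpha - beta)  # initialise at the unconditional variance
    out = [sigma2]
    for r in returns:
        sigma2 = omega + alpha * (r - mu) ** 2 + beta * sigma2
        out.append(sigma2)
    return out  # out[t] is the forecast for period t given returns[:t]


def gaussian_var(sigma2, level=0.01, mu=0.0):
    """One-period Value-at-Risk at the given tail level, as a positive loss, under normality."""
    z = NormalDist().inv_cdf(level)  # about -2.326 for level = 0.01
    return -(mu + z * sigma2 ** 0.5)


# Illustrative (made-up) parameters and daily returns.
sig2 = garch11_variances([0.001, -0.03, 0.004], omega=1e-6, alpha=0.1, beta=0.85)
print([round(gaussian_var(s), 4) for s in sig2])
```

The −3% shock raises the next period's variance and hence its VaR. The normal quantile, however, ignores the fat tails the passage highlights, which is one reason such estimates can understate risk.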

artificial intelligence, machine learning, reinforcement learning

arXiv.org Artificial Intelligence

2504.16635

Country:

Europe > France (0.28)
Europe > Switzerland > Basel-City > Basel (0.25)

Genre:

Research Report > Experimental Study (0.94)
Research Report > New Finding (0.68)

Industry:

Government (1.00)
Banking & Finance > Trading (1.00)
Health & Medicine (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Contemplative Artificial Intelligence

Laukkonen, Ruben, Inglis, Fionn, Chandaria, Shamil, Sandved-Smith, Lars, Lopez-Sola, Edmundo, Hohwy, Jakob, Gold, Jonathan, Elwood, Adam

arXiv.org Artificial Intelligence, Aug-19-2025

As artificial intelligence (AI) improves, traditional alignment strategies may falter in the face of unpredictable self-improvement, hidden subgoals, and the sheer complexity of intelligent systems. Inspired by contemplative wisdom traditions, we show how four axiomatic principles can instil a resilient Wise World Model in AI systems. First, mindfulness enables self-monitoring and recalibration of emergent subgoals. Second, emptiness forestalls dogmatic goal fixation and relaxes rigid priors. Third, non-duality dissolves adversarial self-other boundaries. Fourth, boundless care motivates the universal reduction of suffering. We find that prompting AI to reflect on these principles improves performance on the AILuminate Benchmark (d=.96) and boosts cooperation and joint-reward on the Prisoner's Dilemma task (d=7+). We offer detailed implementation strategies at the level of architectures, constitutions, and reinforcement on chain-of-thought. For future systems, active inference may offer the self-organizing and dynamic coupling capabilities needed to enact Contemplative AI in embodied agents.
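The abstract reports its improvements as effect sizes d. Assuming these are Cohen's d (the usual reading), the sketch below computes the standardized mean difference with a pooled sample standard deviation. The scores are hypothetical, and the paper may use a different variant, for example one for paired samples.

```python
import math
from statistics import mean, variance


def cohens_d(treatment, control):
    """Standardised mean difference using the pooled sample standard deviation."""
    n1, n2 = len(treatment), len(control)
    pooled_var = ((n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_var)


# Hypothetical per-item scores with and without a contemplative prompt.
print(round(cohens_d([2, 4, 6], [1, 3, 5]), 3))
```

By common convention, d ≈ 0.8 or more is considered a large effect, so d = .96 would indicate a large difference between the prompted and unprompted conditions.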

arxiv preprint arxiv, large language model, machine learning

arXiv.org Artificial Intelligence

2504.15125

Country:

North America > United States (1.00)
Europe > United Kingdom > England (0.68)

Genre:

Overview (0.92)
Research Report > Experimental Study (0.92)

Industry:

Law (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)
Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (1.00)

SSPO: Self-traced Step-wise Preference Optimization for Process Supervision and Reasoning Compression

Xu, Yuyang, Cheng, Yi, Ying, Haochao, Du, Zhuoyun, Hu, Renjun, Shi, Xing, Lin, Wei, Wu, Jian

arXiv.org Artificial Intelligence, Aug-19-2025

Test-time scaling has proven effective in further enhancing the performance of pretrained Large Language Models (LLMs). However, mainstream post-training methods (i.e., reinforcement learning (RL) with chain-of-thought (CoT) reasoning) often incur substantial computational overhead due to auxiliary models and overthinking. In this paper, we show empirically that incorrect answers partly stem from verbose reasoning processes that lack correct self-correction, so that errors accumulate across multiple reasoning steps. To this end, we propose Self-traced Step-wise Preference Optimization (SSPO), a pluggable RL process supervision framework that enables fine-grained optimization of each reasoning step. Specifically, SSPO requires neither auxiliary models nor stepwise manual annotations. Instead, it leverages step-wise preference signals generated by the model itself to guide the optimization process for reasoning compression. Experiments demonstrate that the reasoning sequences generated by SSPO are both accurate and succinct, effectively mitigating overthinking behaviors without compromising model performance across diverse domains and languages.
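The abstract does not state SSPO's training objective. As an assumption, the sketch applies a standard DPO-style pairwise loss, −log σ(β[(log π_c − log π_ref,c) − (log π_r − log π_ref,r)]), separately to each reasoning step and averages the results. How SSPO derives its self-traced chosen/rejected steps is not shown. DPO's frozen reference policy may or may not correspond to anything in SSPO, which reports using no auxiliary models.

```python
import math


def neg_log_sigmoid(m):
    """Numerically stable -log(sigmoid(m)) = log(1 + exp(-m))."""
    if m >= 0:
        return math.log1p(math.exp(-m))
    return -m + math.log1p(math.exp(m))


def stepwise_preference_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO-style loss averaged over reasoning steps (an assumed objective, not SSPO's).

    Each argument is a list of per-step log-probabilities (summed over the step's tokens)
    under the trained policy (pol_*) or a frozen reference policy (ref_*).
    """
    if not pol_chosen:
        raise ValueError("need at least one step")
    total = 0.0
    for pc, pr, rc, rr in zip(pol_chosen, pol_rejected, ref_chosen, ref_rejected):
        margin = beta * ((pc - rc) - (pr - rr))  # implicit reward gap for this step
        total += neg_log_sigmoid(margin)
    return total / len(pol_chosen)


# Two steps where the policy already prefers the chosen continuation.
print(round(stepwise_preference_loss([-1.0, -2.0], [-5.0, -6.0], [-3.0, -4.0], [-3.0, -4.0]), 4))
```

Scoring each step separately, rather than only the final answer, is what gives process supervision its finer-grained credit assignment.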

arxiv preprint arxiv, machine learning, reinforcement learning

arXiv.org Artificial Intelligence

2508.12604

Country: Asia (0.29)

Genre: Research Report (1.00)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.68)

Results of the NeurIPS 2023 Neural MMO Competition on Multi-task Reinforcement Learning

Suárez, Joseph, Choe, Kyoung Whan, Bloomin, David, Gao, Jianming, Li, Yunkun, Feng, Yao, Pola, Saidinesh, Zhang, Kun, Zhu, Yonghui, Pinnaparaju, Nikhil, Li, Hao Xiang, Kanna, Nishaanth, Scott, Daniel, Sullivan, Ryan, Shuman, Rose S., de Alcântara, Lucas, Bradley, Herbie, You, Kirsty, Wu, Bo, Jiang, Yuhao, Li, Qimai, Chen, Jiaxin, Castricato, Louis, Zhu, Xiaolong, Isola, Phillip

arXiv.org Artificial Intelligence, Aug-19-2025

We present the results of the NeurIPS 2023 Neural MMO Competition, which attracted over 200 participants and submissions. Participants trained goal-conditional policies that generalize to tasks, maps, and opponents never seen during training. The top solution achieved a score 4x higher than our baseline within 8 hours of training on a single 4090 GPU. We open-source everything relating to Neural MMO and the competition under the MIT license, including the policy weights and training code for our baseline and for the top submissions.

artificial intelligence, machine learning, reinforcement learning

arXiv.org Artificial Intelligence

2508.12524

Country: North America > United States (0.46)

Genre: Research Report (0.64)

Industry: Leisure & Entertainment > Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.83)
