dense reward



Optimizing Anytime Reasoning via Budget Relative Policy Optimization

Qi, Penghui, Liu, Zichen, Pang, Tianyu, Du, Chao, Lee, Wee Sun, Lin, Min

arXiv.org Artificial Intelligence

Scaling test-time compute is crucial for enhancing the reasoning capabilities of large language models (LLMs). Existing approaches typically employ reinforcement learning (RL) to maximize a verifiable reward obtained at the end of reasoning traces. However, such methods optimize only the final performance under a large and fixed token budget, which hinders efficiency in both training and deployment. In this work, we present a novel framework, AnytimeReasoner, to optimize anytime reasoning performance, which aims to improve token efficiency and the flexibility of reasoning under varying token budget constraints. To achieve this, we truncate the complete thinking process to fit within token budgets sampled from a prior distribution, compelling the model to summarize the optimal answer for each truncated thinking process for verification. This introduces verifiable dense rewards into the reasoning process, facilitating more effective credit assignment in RL optimization. We then optimize the thinking and summary policies in a decoupled manner to maximize the cumulative reward. Additionally, we introduce a novel variance reduction technique, Budget Relative Policy Optimization (BRPO), to enhance the robustness and efficiency of the learning process when reinforcing the thinking policy. Empirical results on mathematical reasoning tasks demonstrate that our method consistently outperforms GRPO across all thinking budgets under various prior distributions, enhancing both training and token efficiency.
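As a rough sketch of the budget-relative idea (not the paper's implementation), each rollout's reward at a given truncation budget can be centered against the mean reward of the other rollouts at that same budget, analogous to GRPO's group baseline but computed per budget. Function name and reward layout below are illustrative assumptions.

```python
def budget_relative_advantages(rewards):
    """rewards[i][j]: verifiable reward of rollout i truncated to budget j.

    Returns advantages of the same shape, centered per budget column:
    a budget-relative baseline in the spirit of BRPO (illustrative sketch).
    """
    n = len(rewards)  # number of sampled rollouts in the group
    advantages = []
    for i in range(n):
        row = []
        for j in range(len(rewards[i])):
            # Group mean at this budget serves as the variance-reducing baseline.
            baseline = sum(rewards[k][j] for k in range(n)) / n
            row.append(rewards[i][j] - baseline)
        advantages.append(row)
    return advantages
```

With two rollouts and two budgets, a rollout that succeeds only at the small budget gets a positive advantage there and a negative one at the large budget, steering the policy toward token-efficient reasoning.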


A Primer on SO(3) Action Representations in Deep Reinforcement Learning

Schuck, Martin, Samy, Sherif, Schoellig, Angela P.

arXiv.org Artificial Intelligence

Many robotic control tasks require policies to act on orientations, yet the geometry of SO(3) makes this nontrivial. Because SO(3) admits no global, smooth, minimal parameterization, common representations such as Euler angles, quaternions, rotation matrices, and Lie algebra coordinates introduce distinct constraints and failure modes. While these trade-offs are well studied for supervised learning, their implications for actions in reinforcement learning remain unclear. We systematically evaluate SO(3) action representations across three standard continuous control algorithms, PPO, SAC, and TD3, under dense and sparse rewards. We compare how representations shape exploration, interact with entropy regularization, and affect training stability through empirical studies, and we analyze the implications of different projections for obtaining valid rotations from Euclidean network outputs. Across a suite of robotics benchmarks, we quantify the practical impact of these choices and distill simple, implementation-ready guidelines for selecting and using rotation actions. Our results highlight that representation-induced geometry strongly influences exploration and optimization and show that representing actions as tangent vectors in the local frame yields the most reliable results across algorithms. Accurate reasoning over 3D rotations is a core requirement for machine learning algorithms applied in computer graphics, state estimation, and control. In robotics and embodied intelligence, the problem extends to controlling physical orientations through learned actions, e.g., in manipulation policies that command full task-space poses or aerial vehicles that regulate attitude. These tasks rely on trained policies with action spaces including rotations in SO(3). These geometric constraints have led to multiple parameterizations, each with its own tradeoffs (Macdonald, 2011; Barfoot, 2017).
Euler angles are minimal and intuitive but suffer from order dependence, angle wrapping, and gimbal-lock singularities. Quaternions are smooth and numerically robust with a simple unit-norm constraint, but double-cover SO(3). Rotation matrices provide a smooth and unique mapping, but are heavily over-parameterized and require orthonormalization. Viewing SO(3) as a Lie group, one can use tangent spaces, i.e., the Lie algebra so(3) of skew-symmetric matrices, together with the exponential and logarithm maps to represent orientations. Tangent spaces are locally smooth, but globally exhibit singularities at large angles (Solà et al., 2018). Irrespective of the choice of parameterization, any minimal 3-parameter chart must incur singularities, and global parameterizations that avoid singularities are necessarily redundant and constrained. Applications in deep learning that require reasoning over rotations and orientations have renewed interest in this topic by adding another perspective: irrespective of any mathematical properties, what is the best representation to learn from data in SO(3)?
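The tangent-vector representation the abstract recommends still requires projecting a Euclidean network output onto a valid rotation. A minimal sketch of that projection via Rodrigues' formula, the closed form of the SO(3) exponential map (function name is illustrative; a real implementation would use a library such as SciPy's Rotation):

```python
import math

def exp_so3(v):
    """Map a tangent vector v in R^3 (axis-angle) to a 3x3 rotation matrix
    via Rodrigues' formula: R = I + a*K + b*K^2 with K = skew(v),
    a = sin(t)/t, b = (1 - cos(t))/t^2, t = ||v||.
    """
    t = math.sqrt(v[0] ** 2 + v[1] ** 2 + v[2] ** 2)
    if t < 1e-8:
        # Taylor expansion near the identity avoids division by ~0.
        a, b = 1.0 - t * t / 6.0, 0.5 - t * t / 24.0
    else:
        a, b = math.sin(t) / t, (1.0 - math.cos(t)) / (t * t)
    K = [[0.0, -v[2], v[1]],
         [v[2], 0.0, -v[0]],
         [-v[1], v[0], 0.0]]
    K2 = [[sum(K[i][k] * K[k][j] for k in range(3)) for j in range(3)]
          for i in range(3)]
    I = [[float(i == j) for j in range(3)] for i in range(3)]
    return [[I[i][j] + a * K[i][j] + b * K2[i][j] for j in range(3)]
            for i in range(3)]
```

Because exp_so3 always emits an orthonormal matrix, the policy can output an unconstrained 3-vector while the environment receives a valid rotation, which is exactly the property that makes tangent-space actions attractive.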




Can Large Language Models Develop Strategic Reasoning? Post-training Insights from Learning Chess

Hwang, Dongyoon, Lee, Hojoon, Choo, Jaegul, Park, Dongmin, Park, Jongho

arXiv.org Artificial Intelligence

While reinforcement learning (RL) for large language models (LLMs) has shown promise in mathematical reasoning, strategic reasoning for LLMs using RL remains largely unexplored. We investigate whether LLMs can develop strategic reasoning capabilities through RL in chess. To this end, we leverage a chess-pretrained action-value network to provide a dense reward on the LLM's output move quality, which can be seen as a form of knowledge distillation. Our experiments show that our distillation-based dense rewards often outperform sparse binary rewards. However, surprisingly, all models plateau far below expert levels. We provide SFT and RL ablations on chess reasoning training and find evidence that this limitation stems from a deficit in the pretrained models' internal understanding of chess, a deficit which RL alone may not be able to fully overcome. The code is available at https://github.com/krafton-ai/Chess-R1.
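A hedged sketch of how a distillation-based dense reward could be derived from an action-value network: score the LLM's chosen move relative to the best and worst legal moves. The q_values mapping and the [-1, 1] normalization below are illustrative assumptions, not the paper's exact scheme.

```python
def dense_move_reward(q_values, llm_move):
    """q_values: dict mapping legal moves (e.g. UCI strings) to the
    action-value network's scores. Returns a dense reward in [-1, 1]:
    +1 for the best legal move, -1 for the worst or an illegal move.
    Illustrative sketch of a distillation-style dense reward.
    """
    if llm_move not in q_values:
        return -1.0  # illegal move: worst-case penalty
    lo, hi = min(q_values.values()), max(q_values.values())
    if hi == lo:
        return 1.0  # all legal moves are equivalent
    return 2.0 * (q_values[llm_move] - lo) / (hi - lo) - 1.0
```

Unlike a sparse win/loss signal delivered only at game end, every move gets graded feedback, which is the credit-assignment advantage the abstract attributes to dense rewards.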


TeViR: Text-to-Video Reward with Diffusion Models for Efficient Reinforcement Learning

Chen, Yuhui, Li, Haoran, Jiang, Zhennan, Wen, Haowei, Zhao, Dongbin

arXiv.org Artificial Intelligence

Developing scalable and generalizable reward engineering for reinforcement learning (RL) is crucial for creating general-purpose agents, especially in the challenging domain of robotic manipulation. While recent advances in reward engineering with Vision-Language Models (VLMs) have shown promise, their sparse reward nature significantly limits sample efficiency. This paper introduces TeViR, a novel method that leverages a pre-trained text-to-video diffusion model to generate dense rewards by comparing the predicted image sequence with current observations. Experimental results across 13 simulation and real-world robotic tasks demonstrate that TeViR outperforms traditional methods leveraging sparse rewards and other state-of-the-art (SOTA) methods, achieving better sample efficiency and performance without ground truth environmental rewards. TeViR's ability to efficiently guide agents in complex environments highlights its potential to advance reinforcement learning applications in robotic manipulation. Developing general-purpose agents with reinforcement learning (RL) necessitates scalable and generalizable reward engineering to provide effective task specifications for downstream policy learning [1]. Reward engineering is crucial as it determines the policies agents can learn and ensures they align with intended objectives. However, the manual design of reward functions often presents significant challenges [2]-[4], particularly in robotic manipulation tasks [5]-[8]. This challenge has emerged as a major bottleneck in developing general-purpose agents. Although inverse reinforcement learning (IRL) [9] learns rewards from pre-collected expert demonstrations, these learned reward functions are unreliable for policy learning due to noise and misspecification errors [10], especially in robotic manipulation tasks where in-domain data is limited [11]. Additionally, the learned reward functions are not generally applicable across tasks.
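The core comparison step, deriving a dense per-step reward from the agreement between a predicted frame and the current observation, can be sketched with a simple cosine similarity over frame embeddings. This is a minimal illustration of the mechanism, not TeViR's actual reward computation; the embedding vectors are assumed to come from some fixed visual encoder.

```python
import math

def frame_similarity_reward(predicted, observed):
    """Dense reward as the cosine similarity between the embedding of a
    video model's predicted frame and the embedding of the current
    observation. Illustrative sketch: +1 when the observation matches
    the prediction exactly, near 0 when they are unrelated.
    """
    dot = sum(p * o for p, o in zip(predicted, observed))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_o = math.sqrt(sum(o * o for o in observed))
    if norm_p == 0.0 or norm_o == 0.0:
        return 0.0  # degenerate embedding: no signal
    return dot / (norm_p * norm_o)
```

Evaluated at every step against the next predicted frame, such a signal turns a single task description into per-step shaping, which is what lets video-prediction rewards improve sample efficiency over sparse success indicators.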


RFTF: Reinforcement Fine-tuning for Embodied Agents with Temporal Feedback

Shu, Junyang, Lin, Zhiwei, Wang, Yongtao

arXiv.org Artificial Intelligence

Vision-Language-Action (VLA) models have demonstrated significant potential in the field of embodied intelligence, enabling agents to follow human instructions to complete complex tasks in physical environments. Existing embodied agents are often trained through behavior cloning, which requires expensive data and computational resources and is constrained by human demonstrations. To address this issue, many researchers explore the application of reinforcement fine-tuning to embodied agents. However, typical reinforcement fine-tuning methods for embodied agents usually rely on sparse, outcome-based rewards, which struggle to provide fine-grained feedback for specific actions within an episode, thus limiting the model's manipulation capabilities and generalization performance. In this paper, we propose RFTF, a novel reinforcement fine-tuning method that leverages a value model to generate dense rewards in embodied scenarios. Specifically, our value model is trained using temporal information, eliminating the need for costly robot action labels. In addition, RFTF incorporates a range of techniques, such as GAE and sample balancing, to enhance the effectiveness of the fine-tuning process. By addressing the sparse reward problem in reinforcement fine-tuning, our method significantly improves the performance of embodied agents, delivering superior generalization and adaptation capabilities across diverse embodied tasks. Experimental results show that embodied agents fine-tuned with RFTF achieve new state-of-the-art performance on the challenging CALVIN ABC-D benchmark with an average success length of 4.296. Moreover, RFTF enables rapid adaptation to new environments. After fine-tuning in the D environment of CALVIN for a few episodes, RFTF achieved an average success length of 4.301 in this new environment.
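The GAE component the abstract mentions is standard and can be sketched generically; this is the usual GAE recursion over dense per-step rewards, not RFTF-specific code. The values list is assumed to carry one bootstrap entry beyond the rewards.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation.

    A_t = sum_l (gamma * lam)^l * delta_{t+l},
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    rewards: per-step dense rewards (e.g. from a learned value model).
    values:  state-value estimates, len(rewards) + 1 entries, the last
             being the bootstrap value of the final state.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Backward recursion: each advantage folds in the discounted tail.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With a dense reward at every step, GAE spreads credit across the actions inside an episode, which is precisely the fine-grained feedback the abstract says sparse outcome rewards cannot provide.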


Less is more? Rewards in RL for Cyber Defence

Bates, Elizabeth, Hicks, Chris, Mavroudis, Vasilios

arXiv.org Artificial Intelligence

The last few years have seen an explosion of interest in autonomous cyber defence agents based on deep reinforcement learning. Such agents are typically trained in a cyber gym environment, also known as a cyber simulator, at least 32 of which have already been built. Most, if not all, cyber gyms provide dense "scaffolded" reward functions which combine many penalties or incentives for a range of (un)desirable states and costly actions. Whilst dense rewards help alleviate the challenge of exploring complex environments, yielding seemingly effective strategies from relatively few environment steps, they are also known to bias the solutions an agent can find, potentially towards suboptimal solutions. This is especially problematic in complex cyber environments, where policy weaknesses may not be noticed until exploited by an adversary. In this work we set out to evaluate whether sparse reward functions might enable training more effective cyber defence agents. Towards this goal we first break down several evaluation limitations in existing work by proposing a ground truth evaluation score that goes beyond the standard RL paradigm used to train and evaluate agents. By adapting a well-established cyber gym to accommodate our methodology and ground truth score, we propose and evaluate two sparse reward mechanisms and compare them with a typical dense reward. Our evaluation considers a range of network sizes, from 2 to 50 nodes, and both reactive and proactive defensive actions. Our results show that sparse rewards, particularly positive reinforcement for an uncompromised network state, enable the training of more effective cyber defence agents. Furthermore, we show that sparse rewards provide more stable training than dense rewards, and that both effectiveness and training stability are robust to a variety of cyber environment considerations.
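The contrast the abstract draws can be made concrete with two toy reward functions: a dense "scaffolded" reward summing per-node penalties and action costs, versus sparse positive reinforcement for a fully uncompromised network. The coefficients and function names are hypothetical, chosen only to illustrate the shape of each signal.

```python
def dense_scaffolded_reward(compromised_nodes, action_cost):
    """Typical dense 'scaffolded' reward: a penalty per compromised node
    plus a cost for the action taken. The -0.1 weight is a hypothetical
    example of the many hand-tuned coefficients such rewards combine.
    """
    return -0.1 * len(compromised_nodes) - action_cost

def sparse_uncompromised_reward(compromised_nodes, action_cost):
    """Sparse alternative: positive reinforcement only when no node is
    compromised, regardless of action cost. This is the mechanism the
    abstract found most effective (sketched, not the paper's exact form).
    """
    return 1.0 if not compromised_nodes else 0.0
```

The dense variant gives gradient-like feedback every step but bakes in the designer's trade-offs; the sparse variant only rewards the actual goal state, leaving the agent free to discover how to reach it.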


Multi-Stage Manipulation with Demonstration-Augmented Reward, Policy, and World Model Learning

Escoriza, Adrià López, Hansen, Nicklas, Tao, Stone, Mu, Tongzhou, Su, Hao

arXiv.org Artificial Intelligence

Long-horizon tasks in robotic manipulation present significant challenges in reinforcement learning (RL) due to the difficulty of designing dense reward functions and effectively exploring the expansive state-action space. However, despite a lack of dense rewards, these tasks often have a multi-stage structure, which can be leveraged to decompose the overall objective into manageable subgoals. In this work, we propose DEMO3, a framework that exploits this structure for efficient learning from visual inputs. Specifically, our approach incorporates multi-stage dense reward learning, a bi-phasic training scheme, and world model learning into a carefully designed demonstration-augmented RL framework that strongly mitigates the challenge of exploration in long-horizon tasks. Our evaluations demonstrate that our method improves data-efficiency by an average of 40% and by 70% on particularly difficult tasks compared to state-of-the-art approaches. We validate this across 16 sparse-reward tasks spanning four domains, including challenging humanoid visual control tasks using as few as five demonstrations.
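The stage decomposition described above suggests a simple dense reward shape: completed stages each contribute a full unit, and the current stage contributes its partial progress, normalized over the number of stages. This is an illustrative sketch of the idea, not DEMO3's learned reward.

```python
def stage_reward(stage_index, stage_progress, num_stages):
    """Multi-stage dense reward sketch.

    stage_index:    index of the current stage (0-based); all earlier
                    stages are assumed complete.
    stage_progress: progress within the current stage, in [0, 1]
                    (e.g. from a learned per-stage progress estimator).
    Returns a monotone reward in [0, 1] across the whole task.
    """
    return (stage_index + stage_progress) / num_stages
```

Because the reward is monotone across stage boundaries, reaching each subgoal strictly increases return, turning one sparse long-horizon objective into a sequence of short, explorable ones.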