Standard RL
Is RLHF More Difficult than Standard RL? A Theoretical Perspective
Reinforcement learning from Human Feedback (RLHF) learns from preference signals, while standard Reinforcement Learning (RL) directly learns from reward signals. Preferences arguably contain less information than rewards, which makes preference-based RL seemingly more difficult. This paper theoretically proves that, for a wide range of preference models, we can solve preference-based RL directly using existing algorithms and techniques for reward-based RL, with small or no extra costs. Specifically, (1) for preferences that are drawn from reward-based probabilistic models, we reduce the problem to robust reward-based RL that can tolerate small errors in rewards; (2) for general arbitrary preferences where the objective is to find the von Neumann winner, we reduce the problem to multi-agent reward-based RL, which finds Nash equilibria for factored Markov games under a restricted set of policies. The latter case can be further reduced to an adversarial MDP when preferences depend only on the final state. We instantiate all reward-based RL subroutines with concrete provable algorithms, and apply our theory to a large class of models including tabular MDPs and MDPs with generic function approximation. We further provide guarantees when K-wise comparisons are available.
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.63)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.60)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.50)
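As a concrete instance of case (1) in the abstract above, the standard reward-based probabilistic preference model in this literature is the Bradley-Terry-Luce model; the sketch below states it in its usual form as an illustration, not necessarily the paper's exact model class.

```latex
% Bradley-Terry-Luce model: a labeler prefers trajectory \tau_1 over
% \tau_2 with probability given by the sigmoid of the reward gap.
\[
  \mathbb{P}(\tau_1 \succ \tau_2)
    = \sigma\!\left(r(\tau_1) - r(\tau_2)\right)
    = \frac{e^{r(\tau_1)}}{e^{r(\tau_1)} + e^{r(\tau_2)}}
\]
% Estimating r from comparisons up to a small error and then invoking a
% robust reward-based RL solver is the reduction described in case (1).
```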
We emphasize the technical novelty of our upper and lower bounds, as Reviewers #1, #3, and #4 commented on the technical novelty of our theoretical results.
We thank all the reviewers for their valuable feedback and for appreciating our contributions. Technical novelty of the upper bound. In the exploration phase, Jin et al. [2020] set the reward to be … To our knowledge, this idea is new in the literature. For example, for the hard instance in Du et al. [2020], only a single state-action pair has non-zero reward. Moreover, we focus on the reward-free setting, while Du et al. [2020] focused on the standard RL setting. Below we address specific concerns from each reviewer.
RL for Reasoning by Adaptively Revealing Rationales
Amani, Mohammad Hossein, Lotfi, Aryo, Baldwin, Nicolas Mario, Bengio, Samy, Farajtabar, Mehrdad, Abbe, Emmanuel, West, Robert
We propose that reinforcement learning (RL) from partial expert demonstrations is not merely a training heuristic, but a promising framework for solving complex sequence generation tasks. Supervised fine-tuning (SFT) relies on dense ground-truth labels, which become increasingly costly as sequence length grows. RL, on the other hand, struggles with sparse rewards and a combinatorially large output space. We address this by introducing adaptive backtracking (AdaBack), a per-sample curriculum learning algorithm that reveals only a partial prefix of the target output during training. The supervision length is adjusted dynamically for each sample based on the model's past reward signal, allowing it to incrementally learn to complete reasoning chains by conditioning on correct partial solutions. We investigate this intermediate regime between SFT and RL and argue that per-sample curriculum learning is more than a trade-off between efficiency and generality; it can succeed in tasks with long sequences of latent dependencies where SFT and RL both fail to generalize. Using a synthetic task with latent parity constraints, we show that our adaptive curriculum over partial answers reliably solves problems that are otherwise intractable. On mathematical reasoning benchmarks (MATH, GSM8k), we find that curriculum learning enables models to solve problems that RL alone cannot, acquiring new reasoning capabilities through incremental exposure to partial solutions.
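To make the adaptive-backtracking idea concrete, here is a minimal Python sketch of a per-sample reveal-length update driven by the most recent reward; the function name, step size, and binary-reward assumption are illustrative, not the authors' implementation.

```python
# Hypothetical sketch of an AdaBack-style per-sample curriculum update.
def adaback_reveal_lengths(reveal_frac, rewards, step=0.1):
    """Update per-sample prefix-reveal fractions from the last rewards.

    reveal_frac: dict mapping sample id -> fraction of the target output
                 revealed as a prefix during training (in [0, 1]).
    rewards:     dict mapping sample id -> most recent reward (e.g., 1.0
                 if the model completed the sequence correctly, else 0.0).
    """
    for sid, r in rewards.items():
        if r > 0:
            # Success: reveal less of the prefix, hardening the task.
            reveal_frac[sid] = max(0.0, reveal_frac[sid] - step)
        else:
            # Failure: reveal more of the ground-truth prefix.
            reveal_frac[sid] = min(1.0, reveal_frac[sid] + step)
    return reveal_frac
```

During training, each sample would then be conditioned on `target[: int(reveal_frac[sid] * len(target))]`, so supervision shrinks as the model succeeds.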
First-Explore, then Exploit: Meta-Learning to Solve Hard Exploration-Exploitation Trade-Offs
Standard reinforcement learning (RL) agents never intelligently explore like a human (i.e., by taking into account complex domain priors and previous explorations). Across episodes, RL agents struggle to perform even simple exploration strategies, for example systematic search that avoids exploring the same location multiple times. Meta-RL is a potential solution: unlike standard RL, meta-RL can learn to explore, and can potentially learn highly complex strategies far beyond those of standard RL, such as experimenting in early episodes to learn new skills, or conducting experiments to learn about the current environment. Traditional meta-RL focuses on the problem of learning to optimally balance exploration and exploitation to maximize the cumulative reward of the episode sequence (e.g., aiming to maximize the total wins in a tournament, while also improving as a player). We identify a new challenge with state-of-the-art cumulative-reward meta-RL methods. When optimal behavior requires exploration that sacrifices immediate reward to enable higher subsequent reward, existing state-of-the-art cumulative-reward meta-RL methods become stuck on the local optimum of failing to explore. Our method, First-Explore, overcomes this limitation by learning two policies: one to solely explore, and one to solely exploit. When exploring requires forgoing early-episode reward, First-Explore significantly outperforms existing cumulative meta-RL methods. By identifying and solving the previously unrecognized problem of forgoing reward in early episodes, First-Explore represents a significant step towards developing meta-RL algorithms capable of human-like exploration on a broader range of domains.
Train Hard, Fight Easy: Robust Meta Reinforcement Learning
Greenberg, Ido, Mannor, Shie, Chechik, Gal, Meirom, Eli
A major challenge of reinforcement learning (RL) in real-world applications is the variation between environments, tasks or clients. Meta-RL (MRL) addresses this issue by learning a meta-policy that adapts to new tasks. Standard MRL methods optimize the average return over tasks, but often suffer from poor results in tasks of high risk or difficulty. This limits system reliability since test tasks are not known in advance. In this work, we define a robust MRL objective with a controlled robustness level. Optimization of analogous robust objectives in RL is known to lead to both *biased gradients* and *data inefficiency*. We prove that the gradient bias disappears in our proposed MRL framework. The data inefficiency is addressed via the novel Robust Meta RL algorithm (RoML). RoML is a meta-algorithm that generates a robust version of any given MRL algorithm, by identifying and over-sampling harder tasks throughout training. We demonstrate that RoML achieves robust returns on multiple navigation and continuous control benchmarks.
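As an illustration of the over-sampling idea, the sketch below re-weights task sampling toward tasks with low recent returns; the softmax form, temperature, and function names are assumptions for illustration rather than RoML's exact procedure.

```python
import numpy as np

# Illustrative RoML-style task sampler: harder tasks (lower running
# return) receive higher sampling probability, so training focuses on
# the risky tail of the task distribution.
def sample_tasks(task_returns, batch_size, temperature=1.0, rng=None):
    """task_returns: dict mapping task id -> running mean return."""
    rng = rng or np.random.default_rng()
    ids = list(task_returns)
    # Lower return -> higher score -> higher sampling weight.
    scores = -np.array([task_returns[t] for t in ids]) / temperature
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return rng.choice(ids, size=batch_size, p=probs)
```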
First-Explore, then Exploit: Meta-Learning Intelligent Exploration
Standard reinforcement learning (RL) agents never intelligently explore like a human (i.e. by taking into account complex domain priors and previous explorations). Even the most basic intelligent exploration strategies such as exhaustive search are only inefficiently or poorly approximated by approaches such as novelty search or intrinsic motivation, let alone more complicated strategies like learning new skills, climbing stairs, opening doors, or conducting experiments. This lack of intelligent exploration limits sample efficiency and prevents solving hard exploration domains. We argue a core barrier prohibiting many RL approaches from learning intelligent exploration is that the methods attempt to explore and exploit simultaneously, which harms both exploration and exploitation as the goals often conflict. We propose a novel meta-RL framework (First-Explore) with two policies: one policy learns to only explore and one policy learns to only exploit. Once trained, we can then explore with the explore policy, for as long as desired, and then exploit based on all the information gained during exploration. This approach avoids the conflict of trying to do both exploration and exploitation at once. We demonstrate that First-Explore can learn intelligent exploration strategies such as exhaustive search and more, and that it outperforms dominant standard RL and meta-RL approaches on domains where exploration requires sacrificing reward. First-Explore is a significant step towards creating meta-RL algorithms capable of learning human-level exploration which is essential to solve challenging unseen hard-exploration domains.
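A minimal sketch of the two-policy protocol at deployment time might look as follows; the environment and policy interfaces are hypothetical stand-ins, not the authors' code.

```python
# First explore for a chosen number of episodes, then exploit using all
# information gathered. Policies condition on the history of episodes.
def run_episode(env, policy, context):
    obs, done, episode = env.reset(), False, []
    while not done:
        action = policy.act(obs, context)      # hypothetical interface
        obs, reward, done = env.step(action)   # hypothetical interface
        episode.append((action, reward))
    return episode

def first_explore(env, explore_policy, exploit_policy, n_explore, n_exploit):
    history = []
    for _ in range(n_explore):
        # Pure exploration: these episodes ignore reward entirely.
        history.append(run_episode(env, explore_policy, context=history))
    returns = []
    for _ in range(n_exploit):
        # Pure exploitation, conditioned on all exploration so far.
        episode = run_episode(env, exploit_policy, context=history)
        returns.append(sum(reward for _, reward in episode))
    return returns
```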
Bounding the Optimal Value Function in Compositional Reinforcement Learning
Adamczyk, Jacob, Makarenko, Volodymyr, Arriojas, Argenis, Tiomkin, Stas, Kulkarni, Rahul V.
In the field of reinforcement learning (RL), agents are often tasked with solving a variety of problems differing only in their reward functions. In order to quickly obtain solutions to unseen problems with new reward functions, a popular approach involves functional composition of previously solved tasks. However, previous work using such functional composition has primarily focused on specific instances of composition functions whose limiting assumptions allow for exact zero-shot composition. Our work unifies these examples and provides a more general framework for compositionality in both standard and entropy-regularized RL. We find that, for a broad class of functions, the optimal solution for the composite task of interest can be related to the known primitive task solutions. Specifically, we present double-sided inequalities relating the optimal composite value function to the value functions for the primitive tasks. We also show that the regret of using a zero-shot policy can be bounded for this class of functions. The derived bounds can be used to develop clipping approaches for reducing uncertainty during training, allowing agents to quickly adapt to new tasks.
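The clipping idea in the last sentence can be sketched as follows: given double-sided bounds L ≤ Q* ≤ U on the composite task's optimal value function (derived from the primitive-task solutions), TD targets are clipped into [L, U]. The snippet below is a hedged illustration, with the bound arrays assumed given rather than derived.

```python
import numpy as np

# Clip the TD target into the interval implied by the composite-task
# value bounds, reducing uncertainty during training on the new task.
def clipped_td_target(reward, next_q_max, lower, upper, gamma=0.99):
    target = reward + gamma * next_q_max
    return np.clip(target, lower, upper)
```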
Will it Blend? Composing Value Functions in Reinforcement Learning
van Niekerk, Benjamin, James, Steven, Earle, Adam, Rosman, Benjamin
An important property for lifelong-learning agents is the ability to combine existing skills to solve unseen tasks. In general, however, it is unclear how to compose skills in a principled way. We provide a "recipe" for optimal value function composition in entropy-regularised reinforcement learning (RL) and then extend this to the standard RL setting. Composition is demonstrated in a video game environment, where an agent with an existing library of policies is able to solve new tasks without the need for further learning.
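For the disjunctive ("or") case, the recipe in entropy-regularised RL composes optimal values via a log-sum-exp, which tends to a max in the standard-RL limit; the sketch below is illustrative under the usual assumption that tasks differ only in reward, and the function name is ours.

```python
import numpy as np

# Compose primitive optimal Q-functions for an "or" task: log-sum-exp
# (a smooth max) in the entropy-regularised setting, hard max as the
# temperature goes to zero (the standard RL limit).
def compose_or(q_values, temperature=1.0):
    """q_values: array of shape (num_tasks, num_states, num_actions)."""
    if temperature == 0.0:
        return q_values.max(axis=0)  # standard RL limit
    return temperature * np.log(
        np.exp(q_values / temperature).sum(axis=0)
    )
```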