Is a Good Foundation Necessary for Efficient Reinforcement Learning? The Computational Role of the Base Model in Exploration

Foster, Dylan J., Mhammedi, Zakaria, Rohatgi, Dhruv

Mar-13-2025–arXiv.org Artificial Intelligence

Language model alignment (or, reinforcement learning) techniques that leverage active exploration -- deliberately encouraging the model to produce diverse, informative responses -- offer the promise of super-human capabilities. However, current understanding of algorithm design primitives for computationally efficient exploration with language models is limited. To better understand how to leverage access to powerful pre-trained generative models to improve the efficiency of exploration, we introduce a new computational framework for RL with language models, in which the learner interacts with the model through a sampling oracle. Focusing on the linear softmax model parameterization, we provide new results that reveal the computational-statistical tradeoffs of efficient exploration: 1. Necessity of coverage: Coverage refers to the extent to which the pre-trained model covers near-optimal responses -- a form of hidden knowledge. We show that coverage, while not necessary for data efficiency, lower bounds the runtime of any algorithm in our framework. 2. Inference-time exploration: We introduce a new algorithm, SpannerSampling, which obtains optimal data efficiency and is computationally efficient whenever the pre-trained model enjoys sufficient coverage, matching our lower bound. SpannerSampling leverages inference-time computation with the pre-trained model to reduce the effective search space for exploration. 3. Insufficiency of training-time interventions: We contrast the result above by showing that training-time interventions that produce proper policies cannot achieve similar guarantees in polynomial time. 4. Computational benefits of multi-turn exploration: Finally, we show that under additional representational assumptions, one can achieve improved runtime (replacing sequence-level coverage with token-level coverage) through multi-turn exploration.

algorithm, cond, span, (15 more...)

arXiv.org Artificial Intelligence

Mar-13-2025

arXiv.org PDF

Add feedback

Country:
- North America > United States
  - Virginia (0.04)
  - Massachusetts > Middlesex County
    - Cambridge (0.04)
- Europe > United Kingdom
  - England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East
  - Jordan (0.04)

Genre:
- Research Report > New Finding (0.48)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning
    - Statistical Learning (1.00)
    - Reinforcement Learning (0.84)
    - Neural Networks > Deep Learning (0.67)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found