Reasoning without Regret
arXiv.org Artificial Intelligence
Chain-of-thought reasoning enables large language models to solve multi-step tasks by framing problem solving as a sequential decision problem. Outcome-based rewards, which provide feedback only on final answers, have shown impressive success but suffer from difficult credit assignment and slow convergence. In contrast, procedure-based rewards offer efficient step-level feedback but typically require costly human supervision. We introduce Backwards Adaptive Reward Shaping (BARS), a no-regret framework that converts sparse outcome-based rewards into effective procedure-based signals. BARS uses sparse rewards generated from terminal-state priors and cover trees to scale rewards while preventing exploitation. With Bellman contraction and $(Δ, ε)$-gap rewards, our backward Euler solver achieves $ε$-accuracy in $O\left((R_{\max}/Δ)\log(1/ε)\right)$ iterations with $O(\log T)$ dynamic regret over $T$ rounds. Our analysis, based on generic chaining, continuous scaling limits, and non-linear Feynman-Kac bounds, connects the empirical successes of recent outcome-based methods with the benefits of intermediate supervision. Together, these results give the first rigorous no-regret algorithm for outcome reward shaping and a theoretical foundation for the empirical success of DeepSeek's R1.
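The $\log(1/ε)$ dependence in the iteration count is the usual signature of a contraction argument: if a Bellman backup shrinks the sup-norm error by a factor $γ < 1$ per step, then $k \ge \log(e_0/ε)/\log(1/γ)$ steps suffice to reach error $ε$ from initial error $e_0$. The sketch below illustrates only that generic argument on a toy chain with a single sparse terminal reward; it is not the BARS algorithm, and the chain, the function names, and the parameters `gamma`, `p`, and `eps` are all assumptions made for illustration. The $R_{\max}/Δ$ factor from the abstract is not reproduced here.

```python
import math

def bellman_backup(values, gamma, p):
    """One policy-evaluation backup on a hypothetical chain 0..n-1.

    From state s the process advances to s+1 with probability p and stays
    with probability 1-p. The only reward is 1.0 on entering the terminal
    state n-1 (a sparse, outcome-style reward). The terminal value is 0.
    """
    n = len(values)
    new_values = [0.0] * n
    for s in range(n - 1):
        reward = 1.0 if s + 1 == n - 1 else 0.0
        new_values[s] = (p * (reward + gamma * values[s + 1])
                         + (1 - p) * gamma * values[s])
    return new_values

def exact_values(n, gamma, p):
    """Fixed point of the backup, solved backwards from the terminal state."""
    values = [0.0] * n
    for s in range(n - 2, -1, -1):
        reward = 1.0 if s + 1 == n - 1 else 0.0
        values[s] = p * (reward + gamma * values[s + 1]) / (1 - (1 - p) * gamma)
    return values

def sup_distance(a, b):
    """Sup-norm distance between two value vectors."""
    return max(abs(x - y) for x, y in zip(a, b))

def iterations_to_accuracy(n, gamma, p, eps):
    """Number of backups, starting from all zeros, until sup-norm error <= eps."""
    target = exact_values(n, gamma, p)
    values = [0.0] * n
    k = 0
    while sup_distance(values, target) > eps:
        values = bellman_backup(values, gamma, p)
        k += 1
    return k

def contraction_bound(initial_error, gamma, eps):
    """Iterations guaranteed sufficient by a gamma-contraction in sup-norm."""
    return max(0, math.ceil(math.log(initial_error / eps) / math.log(1 / gamma)))

if __name__ == "__main__":
    n, gamma, p, eps = 6, 0.9, 0.8, 1e-6
    target = exact_values(n, gamma, p)
    print("backups needed:", iterations_to_accuracy(n, gamma, p, eps))
    print("contraction bound:", contraction_bound(max(target), gamma, eps))
```

On this toy chain the measured number of backups stays below the contraction bound, and tightening `eps` by a constant factor adds only a constant number of iterations, which is the logarithmic behaviour the abstract's rate refers to.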
Apr-15-2025