Houliston, Sam
Reasoning Language Models: A Blueprint
Besta, Maciej, Barth, Julia, Schreiber, Eric, Kubicek, Ales, Catarino, Afonso, Gerstenberger, Robert, Nyczyk, Piotr, Iff, Patrick, Li, Yueling, Houliston, Sam, Sternal, Tomasz, Copik, Marcin, Kwaśniewski, Grzegorz, Müller, Jürgen, Flis, Łukasz, Eberhard, Hannes, Niewiadomski, Hubert, Hoefler, Torsten
Reasoning language models (RLMs), also known as Large Reasoning Models (LRMs), such as OpenAI's o1 and o3, DeepSeek-V3, and Alibaba's QwQ, have redefined AI's problem-solving capabilities by extending LLMs with advanced reasoning mechanisms. Yet, their high costs, proprietary nature, and complex architectures - uniquely combining Reinforcement Learning (RL), search heuristics, and LLMs - present accessibility and scalability challenges. To address these, we propose a comprehensive blueprint that organizes RLM components into a modular framework, based on a survey and analysis of all RLM works. This blueprint incorporates diverse reasoning structures (chains, trees, graphs, and nested forms), reasoning strategies (e.g., Monte Carlo Tree Search, Beam Search), RL concepts (policy and value models, among others), supervision schemes (Outcome-Based and Process-Based Supervision), and other related concepts (e.g., Test-Time Compute, Retrieval-Augmented Generation, agent tools). We also provide detailed mathematical formulations and algorithmic specifications to simplify RLM implementation. By showing how schemes such as LLaMA-Berry, QwQ, Journey Learning, and Graph of Thoughts fit as special cases, we demonstrate the blueprint's versatility and unifying potential. To illustrate its utility, we introduce x1, a modular implementation for rapid RLM prototyping and experimentation. Using x1 and a literature review, we provide key insights, such as the value of multi-phase training for policy and value models and the importance of familiar training distributions. Finally, we discuss scalable RLM cloud deployments and outline how RLMs can integrate with a broader LLM ecosystem. Our work demystifies RLM construction, democratizes advanced reasoning capabilities, and fosters innovation, aiming to narrow the gap between "rich AI" and "poor AI" by lowering the barriers to RLM design and experimentation.
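As a concrete illustration of how these components fit together, the sketch below separates a reasoning structure (a tree of partial step chains), a reasoning strategy (beam search over that tree), and the policy and value models (passed in as callables). All names and signatures here are illustrative assumptions for this listing, not the APIs of x1 or of any system mentioned above.

```python
# Minimal sketch, assuming a policy model that proposes candidate next
# reasoning steps and a value model that scores partial chains. The toy
# stand-ins at the bottom exist only to make the example runnable.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Node:
    """One node in the reasoning tree: a partial chain of reasoning steps."""
    steps: List[str]
    score: float = 0.0


def beam_search_reasoning(
    question: str,
    propose: Callable[[str, List[str]], List[str]],   # policy model: candidate next steps
    evaluate: Callable[[str, List[str]], float],       # value model: score a partial chain
    beam_width: int = 3,
    max_depth: int = 4,
) -> List[str]:
    """Expand a reasoning tree with a beam-search strategy.

    The policy model proposes candidate next steps, the value model scores
    each partial chain, and only the top `beam_width` chains survive per depth.
    """
    beam = [Node(steps=[])]
    for _ in range(max_depth):
        candidates = []
        for node in beam:
            for step in propose(question, node.steps):
                chain = node.steps + [step]
                candidates.append(Node(chain, evaluate(question, chain)))
        if not candidates:
            break
        beam = sorted(candidates, key=lambda n: n.score, reverse=True)[:beam_width]
    return max(beam, key=lambda n: n.score).steps


if __name__ == "__main__":
    # Toy stand-ins for the policy and value models.
    def toy_propose(question, steps):
        return [f"step {len(steps) + 1}a", f"step {len(steps) + 1}b"]

    def toy_evaluate(question, steps):
        return float(len(steps)) + (0.5 if steps and steps[-1].endswith("a") else 0.0)

    print(beam_search_reasoning("What is 2 + 2?", toy_propose, toy_evaluate))
```

Swapping the beam-search strategy for Monte Carlo Tree Search, or the linear step chain for a graph-shaped structure, only changes the search function and the node type, which is the kind of modularity the blueprint argues for.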
Uncertainty-Penalized Direct Preference Optimization
Houliston, Sam, Pace, Alizée, Immer, Alexander, Rätsch, Gunnar
Aligning Large Language Models (LLMs) to human preferences in content, style, and presentation is challenging, in part because preferences are varied, context-dependent, and sometimes inherently ambiguous. While successful, Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are prone to proxy reward overoptimization. Analysis of the DPO loss reveals a critical need for regularization of mislabeled or ambiguous preference pairs to avoid reward hacking. In this work, we develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes inspired by offline reinforcement learning. The penalization serves as a correction to the loss that attenuates the loss gradient for uncertain samples. We evaluate the methods with GPT2 Medium on the Anthropic-HH dataset, using a model ensemble to obtain uncertainty estimates, and find improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses.

Aligning LLMs to human preferences in content, style, and presentation has become a central challenge in improving and deploying LLMs, leading to the advent of Reinforcement Learning from Human Feedback (RLHF), now a prominent technique for fine-tuning state-of-the-art LLMs (Casper et al., 2023). The standard RLHF pipeline involves human feedback collection, reward model training, and LLM policy optimization via reinforcement learning (RL). Despite its success, each stage presents challenges, from feedback interpretation and policy generalization to the difficulty of RL implementation (Casper et al., 2023). Direct Preference Optimization (DPO) (Rafailov et al., 2023) effectively bypasses the reward model by fine-tuning the policy to maximize the likelihood of the preference data under the Bradley-Terry model. DPO is easier to implement than RL algorithms, and benefits from computational efficiency and stability by avoiding the potential inaccuracies and biases of a reward model (Xu et al., 2024; Casper et al., 2023).
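To make the penalization concrete, the sketch below shows one way an ensemble-derived uncertainty estimate could enter the DPO objective: the implicit reward margin is shifted down by a scaled uncertainty term before the log-sigmoid, which attenuates the gradient on uncertain pairs. The function name, the lambda_penalty weight, and the exact placement of the penalty are assumptions for illustration and may differ from the paper's formulation.

```python
# Hedged sketch of an uncertainty-penalized DPO loss, assuming per-pair
# uncertainty estimates (e.g. the std of the reward margin across an ensemble).
import torch
import torch.nn.functional as F


def dpo_loss_uncertainty_penalized(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (B,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), shape (B,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), shape (B,)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x), shape (B,)
    margin_uncertainty: torch.Tensor,     # per-pair uncertainty estimate, shape (B,)
    beta: float = 0.1,
    lambda_penalty: float = 1.0,          # illustrative penalty weight
) -> torch.Tensor:
    """Pessimistic DPO loss: subtract an uncertainty penalty from the implicit
    reward margin before the log-sigmoid, shrinking the gradient on ambiguous
    or possibly mislabeled preference pairs."""
    # Implicit rewards under the DPO parameterization.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margin = chosen_rewards - rejected_rewards

    # Pessimistic (lower-confidence-bound style) correction of the margin.
    penalized_margin = margin - lambda_penalty * margin_uncertainty

    return -F.logsigmoid(penalized_margin).mean()


if __name__ == "__main__":
    # Dummy log-probabilities and uncertainties, just to exercise the function.
    B = 4
    loss = dpo_loss_uncertainty_penalized(
        torch.randn(B), torch.randn(B), torch.randn(B), torch.randn(B),
        margin_uncertainty=torch.rand(B),  # e.g. ensemble std of the implicit margin
    )
    print(loss)
```

With lambda_penalty set to zero this reduces to the standard DPO loss; increasing it makes the objective more conservative on pairs the ensemble disagrees about, which is the attenuation effect described above.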