Transfer Q-star: Principled Decoding for LLM Alignment

Neural Information Processing Systems 

Aligning foundation models is essential for their safe and trustworthy deployment. However, traditional fine-tuning methods are computationally intensive and require updating billions of model parameters. A promising alternative, alignment via decoding, adjusts the response distribution directly, without model updates, to maximize a target reward $r$, thus providing a lightweight and adaptable framework for alignment. However, principled decoding methods rely on oracle access to an optimal Q-function ($Q^*$), which is often unavailable in practice. Hence, prior state-of-the-art methods either approximate this $Q^*$ using $Q^{\pi_{\text{sft}}}$ (derived from the reference \texttt{SFT} model) or rely on short-term rewards, resulting in sub-optimal decoding performance. In this work, we propose $\texttt{Transfer Q}^*$, which implicitly estimates the optimal value function for a target reward $r$ through a baseline model $\rho_{\texttt{BL}}$ aligned with a baseline reward $r_{\texttt{BL}}$ (which can be different from the target reward $r$).
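
To make the idea of alignment via decoding concrete, the following is a minimal sketch of a single decoding step. It assumes the common KL-regularized form in which the next-token distribution is proportional to $\pi_{\text{sft}}(z \mid s)\exp(Q(s, z)/\alpha)$. This form, the temperature-like parameter `alpha`, the function name `reweight_next_token`, and the toy numbers are illustrative assumptions, not details taken from the text above. The Q-values are supplied as a plain dictionary, so the sketch says nothing about how they would be estimated in practice.

```python
import math


def reweight_next_token(ref_logprobs, q_values, alpha=1.0):
    """Reweight reference next-token probabilities by Q-value estimates.

    Hypothetical helper: returns pi(z|s) proportional to
    pi_ref(z|s) * exp(Q(s, z) / alpha) over the candidate tokens.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    # Combine reference log-probabilities and scaled Q-values in log space.
    scores = {z: ref_logprobs[z] + q_values[z] / alpha for z in ref_logprobs}
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores.values())
    unnormalized = {z: math.exp(s - top) for z, s in scores.items()}
    total = sum(unnormalized.values())
    return {z: w / total for z, w in unnormalized.items()}


# Toy candidate set: the reference model prefers "yes",
# but the (assumed) Q-values favor "no".
ref_logprobs = {"yes": math.log(0.5), "no": math.log(0.3), "maybe": math.log(0.2)}
q_values = {"yes": 0.0, "no": 1.0, "maybe": 0.0}

aligned = reweight_next_token(ref_logprobs, q_values, alpha=1.0)
```

In this sketch a smaller `alpha` lets the Q-values dominate, while a very large `alpha` leaves the reference distribution essentially unchanged. The model weights are never touched: only the sampling distribution at each step is adjusted.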