PITA: Preference-Guided Inference-Time Alignment for LLM Post-Training
Sarat Chandra Bobbili, Ujwal Dinesha, Dheeraj Narasimha, Srinivas Shakkottai
arXiv.org Artificial Intelligence
Inference-time alignment enables large language models (LLMs) to generate outputs aligned with end-user preferences without further training. Recent post-training methods achieve this by using small guidance models to modify token generation during inference. These methods typically optimize a reward function that is KL-regularized toward the original LLM, taken as the reference policy. A critical limitation, however, is their dependence on a pre-trained reward model, which must be fit to human preference feedback, a potentially unstable process. In contrast, we introduce PITA, a novel framework that integrates preference feedback directly into the LLM's token generation, eliminating the need for a reward model. PITA learns a small preference-based guidance policy to modify token probabilities at inference time without fine-tuning the LLM, reducing computational cost and removing the dependence on a pre-trained reward model. The problem is framed as identifying an underlying preference distribution, which is solved through stochastic search and iterative refinement of the preference-based guidance model. We evaluate PITA across diverse tasks, including mathematical reasoning and sentiment classification, demonstrating its effectiveness in aligning LLM outputs with user preferences.
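For context, the KL-regularized objective that such guidance-based methods typically solve can be written out explicitly. This is the standard formulation from the inference-time alignment literature, not notation quoted from the paper itself; here r is the reward, beta the regularization strength, and pi_ref the frozen reference LLM:

    \max_{\pi} \; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\!\left[ r(x, y) \right] \;-\; \beta \, \mathrm{KL}\!\left( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)

whose optimizer is the exponentially tilted reference policy

    \pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x) \, \exp\!\left( r(x, y) / \beta \right)

PITA's departure, per the abstract, is to learn the guidance directly from preference feedback rather than from a separately fitted reward model r.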
Nov-14-2025
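The token-level mechanism the abstract describes (a small guidance model reshaping the base LLM's next-token distribution while the base model stays frozen) can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the function name guided_next_token_probs, the per-token guidance_values, and the toy vocabulary are not PITA's actual interface or algorithm.

    import numpy as np

    def guided_next_token_probs(base_logits, guidance_values, beta=1.0):
        # Exponentially tilt the frozen reference model's next-token
        # distribution: pi*(a) is proportional to pi_ref(a) * exp(V(a) / beta).
        # base_logits:     (vocab,) next-token logits from the frozen reference LLM
        # guidance_values: (vocab,) per-token scores from a small guidance model
        # beta:            KL strength; larger beta stays closer to pi_ref
        tilted = base_logits + guidance_values / beta
        tilted -= tilted.max()          # subtract max for numerical stability
        probs = np.exp(tilted)
        return probs / probs.sum()

    # Toy usage on a 5-token vocabulary: the guidance model favors token 1.
    rng = np.random.default_rng(0)
    base_logits = rng.normal(size=5)
    guidance = np.array([0.0, 2.0, 0.0, -1.0, 0.0])
    print(guided_next_token_probs(base_logits, guidance, beta=0.5))

Sampling from this tilted distribution token by token realizes the closed-form tilted policy above without ever updating the base model's weights, which is what keeps guidance-based approaches cheap relative to fine-tuning.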