Connections between reinforcement learning with feedback, test-time scaling, and diffusion guidance: An anthology
Jiao, Yuchen; Chen, Yuxin; Li, Gen
The rise of large language models (LLMs) has catalyzed a diverse array of post-training techniques that are reshaping the frontiers of artificial intelligence. Approaches such as reinforcement learning with feedback -- e.g., reinforcement learning with human feedback (RLHF) (Kaufmann et al., 2023) and reinforcement learning with internal feedback (RLIF) (Zhao et al., 2025) -- and test-time scaling methods such as best-of-N sampling (Snell et al., 2024) have become pivotal in enhancing model performance. Historically, these approaches have evolved along nearly independent trajectories, with their mathematical underpinnings developed largely in isolation. This technical note reflects on several basic yet intricate connections among these paradigms, offering unified perspectives that might inform and improve the design of each.
Sep-5-2025
- Country:
- North America > United States
- Pennsylvania (0.04)
- Asia > China
- Hong Kong (0.04)
- Genre:
- Research Report (0.82)
- Technology: