On the Convergence of Self-Improving Online LLM Alignment

Wu, Xudong; Liu, Pangpang; Aggarwal, Vaneet; Chen, Jiayu

arXiv.org Machine Learning 

Abstract

The Self-Improving Alignment (SAIL) algorithm addresses distribution shift by reducing a bilevel formulation of the problem to an efficient, single-level [...]. Empirically, SAIL has demonstrated strong performance on this task. However, a formal analysis of its convergence properties has been lacking. We identify a key theoretical challenge: the standard SAIL objective function is not guaranteed [...]. To address this limitation, we propose a regularized objective, SAIL-RevKL, which incorporates a reverse Kullback-Leibler (KL) divergence penalty to improve the optimization landscape. Our central theoretical contribution is to prove that this regularized objective satisfies the Polyak-Łojasiewicz (PL) condition within a bounded parameter space. We establish global convergence guarantees, achieving a near-linear sample complexity.

[...] limitations, recent work explores online RLHF that iterates between generating on-policy responses and collecting preferences [Lee et al., 2024, Park et al., 2022]. Among online approaches, SAIL reduces a bilevel alignment formulation to a computationally efficient single-level surrogate and reports strong empirical gains [Ding et al., 2024]. Existing online pipelines are largely heuristic and do not analytically control the distributional shift induced by iterative data collection [Chakraborty et al., 2024, Shen et al., 2024], which has been linked to suboptimal performance in practice [Sharma et al., 2024].

A growing line of work argues that the coupling between reward learning and policy updates is fundamentally bilevel and should be modeled as such [Chakraborty et al., 2024]. As a follow-up, Ding et al. [2024] reduce the bilevel alignment objective to a tractable single-level surrogate and report strong empirical gains, yet the approach lacks formal convergence guarantees. Related theoretical analyses of bilevel/RLHF-style problems exist [e.g., Yang et al., 2025, Chakraborty et al., 2024, Gaur et al., 2025], yet they either focus on [...]
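
For readers unfamiliar with the Polyak-Łojasiewicz (PL) condition named in the abstract, the block below states its generic textbook form for a minimization problem, together with the linear rate it is known to give for gradient descent on a smooth function. This is a sketch under stated assumptions, not the paper's theorem: the symbols f, f^*, \mu, L, \Theta, and \theta_k are illustrative, and the excerpt does not give the actual SAIL-RevKL objective, its constants, or its sign convention (for a maximized objective the inequality is stated with f^* - f(\theta) on the right-hand side).

```latex
% Generic PL condition: f is differentiable with minimum value f^* over \Theta.
\[
  \tfrac{1}{2}\,\lVert \nabla f(\theta) \rVert^{2}
  \;\ge\; \mu \,\bigl( f(\theta) - f^{*} \bigr)
  \qquad \text{for all } \theta \in \Theta, \quad \mu > 0 .
\]
% If, in addition, \nabla f is L-Lipschitz and the iterates
% \theta_{k+1} = \theta_k - \tfrac{1}{L}\,\nabla f(\theta_k) remain in \Theta, then
\[
  f(\theta_{k}) - f^{*}
  \;\le\; \Bigl( 1 - \tfrac{\mu}{L} \Bigr)^{k} \bigl( f(\theta_{0}) - f^{*} \bigr).
\]
```

The condition does not require convexity: it only asks that the gradient norm dominate the suboptimality gap, so every stationary point in \Theta is a global minimizer. This is why a PL-type result can support global convergence guarantees for a nonconvex objective.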
