On the Convergence of Self-Improving Online LLM Alignment

Wu, Xudong; Liu, Pangpang; Aggarwal, Vaneet; Chen, Jiayu

Jul-1-2026–arXiv.org Machine Learning

Abstract

The Self-Improving Alignment (SAIL) algorithm addresses distribution shift by reducing a bilevel formulation of the problem to an efficient, single-level [...] Empirically, SAIL has demonstrated strong performance on this task. However, a formal analysis of its convergence properties has been lacking. We identify a key theoretical challenge: the standard SAIL objective function is not guaranteed [...] To address this limitation, we propose a regularized objective, SAIL-RevKL, which incorporates a reverse Kullback-Leibler (KL) divergence penalty to improve the optimization landscape. Our central theoretical contribution is to prove that this regularized objective satisfies the Polyak-Łojasiewicz (PL) condition within a bounded parameter space. We establish global convergence guarantees, achieving a near-linear sample complexity.

[...] limitations, recent work explores online RLHF that iterates between generating on-policy responses and collecting preferences [Lee et al., 2024, Park et al., 2022]. Among online approaches, SAIL reduces a bilevel alignment formulation to a computationally efficient single-level surrogate and reports strong empirical gains [Ding et al., 2024]. [...] existing online pipelines are largely heuristic and do not analytically control the distributional shift induced by iterative data collection [Chakraborty et al., 2024, Shen et al., 2024], which has been linked to suboptimal performance in practice [Sharma et al., 2024].

A growing line of work argues that the coupling between reward learning and policy updates is fundamentally bilevel and should be modeled as such [Chakraborty et al., 2024]. As a follow-up, Ding et al. [2024] reduce the bilevel alignment objective to a tractable single-level surrogate and report strong empirical gains, yet their approach lacks formal convergence guarantees. Related theoretical analyses in bilevel/RLHF-style problems exist [e.g., Yang et al., 2025, Chakraborty et al., 2024, Gaur et al., 2025], yet they either focus on [...]
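To make the role of the PL condition named in the abstract more concrete, here is a minimal sketch on a toy problem. It does not implement SAIL or the regularized objective from the paper. It assumes the standard statement of the PL inequality, 0.5 * |f'(x)|^2 >= mu * (f(x) - f*), and uses the one-dimensional function f(x) = x^2 + 3 sin^2(x), which is nonconvex yet has a single stationary point at x = 0. The constants `MU` and `L_SMOOTH`, the step size 1/L, and all function names are illustrative assumptions, not quantities taken from the paper.

```python
import math

MU = 1.0 / 32.0   # PL constant assumed for this toy function (checked numerically below)
L_SMOOTH = 8.0    # smoothness constant: |f''(x)| = |2 + 6 cos(2x)| <= 8
F_STAR = 0.0      # global minimum value, attained at x = 0


def f(x: float) -> float:
    """Toy nonconvex objective: x^2 + 3 sin^2(x)."""
    return x * x + 3.0 * math.sin(x) ** 2


def grad_f(x: float) -> float:
    """Derivative of f: 2x + 3 sin(2x)."""
    return 2.0 * x + 3.0 * math.sin(2.0 * x)


def pl_holds(x: float, tol: float = 1e-12) -> bool:
    """Check the PL inequality 0.5 * |f'(x)|^2 >= MU * (f(x) - F_STAR) at x."""
    return 0.5 * grad_f(x) ** 2 + tol >= MU * (f(x) - F_STAR)


def gradient_descent(x0: float, steps: int) -> list[float]:
    """Run gradient descent with step size 1/L; return the suboptimality gaps."""
    x = x0
    gaps = [f(x) - F_STAR]
    for _ in range(steps):
        x -= grad_f(x) / L_SMOOTH
        gaps.append(f(x) - F_STAR)
    return gaps


if __name__ == "__main__":
    gaps = gradient_descent(x0=3.0, steps=30)
    rate = 1.0 - MU / L_SMOOTH  # contraction factor implied by PL plus smoothness
    for k in (0, 5, 10, 30):
        bound = rate ** k * gaps[0]
        print(f"k={k:2d}  gap={gaps[k]:.3e}  PL bound={bound:.3e}")
```

For an L-smooth function satisfying PL with constant mu, gradient descent with step size 1/L contracts the gap f(x_k) - f* by at least a factor (1 - mu/L) per step, without requiring convexity. The script prints the observed gap next to that worst-case bound; on this toy function the observed gap falls far faster than the bound requires.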

large language model, machine learning, natural language
