Convergence and Stability Analysis of Self-Consuming Generative Models with Heterogeneous Human Curation

Hongru Zhao, Jinwen Fu, Tuan Pham

arXiv.org Machine Learning 

Contemporary pipelines largely learn from preferences, often alongside scalable-oversight ("superalignment") efforts (Burns et al., 2023; Kim et al., 2024; Köpf et al., 2023), and a growing survey literature maps the practical trade-offs--from data collection and reward inference to evaluation and safety (e.g., Shen et al., 2023; Kaufmann et al., 2025). A common structure underlies many systems: models propose alternatives, people (or proxies) compare them, and those preferences guide the next training round (Shin et al., 2023; Lee et al., 2021; Munos et al., 2024). Within this landscape, two families dominate. Reinforcement Learning from Human Feedback (RLHF) first trains a reward model on the comparisons and then improves the policy via reinforcement learning with KL regularization, typically Proximal Policy Optimization (PPO). This accommodates rich, sequence-level signals, but it introduces extra moving parts--reward modeling, on-policy sampling, and hyperparameter tuning--that can make training complex and sometimes unstable at scale (Kirk et al., 2023).
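To make the two-stage structure concrete, the following is a minimal sketch of the standard RLHF objectives, assuming a Bradley-Terry reward model $r_\phi$, a policy $\pi_\theta$, a reference policy $\pi_{\mathrm{ref}}$, and a KL coefficient $\beta > 0$; the notation here is illustrative rather than the paper's own. The reward model is fit to comparison triples $(x, y^{+}, y^{-})$, where $y^{+}$ is the preferred response:
$$ \mathcal{L}_{\mathrm{RM}}(\phi) = -\,\mathbb{E}_{(x,\, y^{+},\, y^{-}) \sim \mathcal{D}} \Big[ \log \sigma\big( r_\phi(x, y^{+}) - r_\phi(x, y^{-}) \big) \Big], $$
and the policy is then improved against the learned reward under a KL penalty toward the reference:
$$ \max_{\theta} \;\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \big[ r_\phi(x, y) \big] \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}} \Big[ \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big]. $$
The KL term keeps the updated policy close to the reference model, which is the main stabilizer when PPO optimizes a learned (and hence exploitable) reward.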