Convergence and Stability Analysis of Self-Consuming Generative Models with Heterogeneous Human Curation

Hongru Zhao, Jinwen Fu, Tuan Pham

arXiv.org Machine Learning 

Contemporary pipelines largely learn from preferences, often alongside scalable-oversight ("superalignment") efforts (Burns et al., 2023; Kim et al., 2024; Köpf et al., 2023), and a growing survey literature maps the practical trade-offs--from data collection and reward inference to evaluation and safety (e.g., Shen et al., 2023; Kaufmann et al., 2025). A common structure underlies many systems: models propose alternatives, people (or proxies) compare them, and those preferences guide the next training round (Shin et al., 2023; Lee et al., 2021; Munos et al., 2024). Within this landscape, two families dominate. Reinforcement Learning from Human Feedback (RLHF) first trains a reward model on the comparisons and then improves the policy via reinforcement learning with KL regularization, typically Proximal Policy Optimization (PPO). This accommodates rich, sequence-level signals, but it introduces extra moving parts--reward modeling, on-policy sampling, and hyperparameter tuning--that can make training complex and sometimes unstable at scale (Kirk et al., 2023).
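To make the two-stage structure concrete, the following is a minimal sketch of the standard RLHF objectives, assuming a Bradley-Terry reward model $r_\phi$, a policy $\pi_\theta$, a reference policy $\pi_{\mathrm{ref}}$, and a KL coefficient $\beta > 0$; the notation here is illustrative rather than the paper's own. The reward model is fit to comparison triples $(x, y^{+}, y^{-})$, where $y^{+}$ is the preferred response:
$$ \mathcal{L}_{\mathrm{RM}}(\phi) = -\,\mathbb{E}_{(x,\, y^{+},\, y^{-}) \sim \mathcal{D}} \Big[ \log \sigma\big( r_\phi(x, y^{+}) - r_\phi(x, y^{-}) \big) \Big], $$
and the policy is then improved against the learned reward under a KL penalty toward the reference:
$$ \max_{\theta} \;\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \big[ r_\phi(x, y) \big] \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}} \Big[ \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big]. $$
The KL term keeps the updated policy close to the reference model, which is the main stabilizer when PPO optimizes a learned (and hence exploitable) reward.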