Multi-objective Reinforcement learning from AI Feedback
–arXiv.org Artificial Intelligence
This paper presents Multi-Objective Reinforcement Learning from AI Feedback (MORLAIF), a novel approach to improving the alignment and performance of language models trained using reinforcement learning from AI feedback (RLAIF). In contrast to standard approaches that train a single preference model to represent all human preferences, MORLAIF decomposes this task into multiple simpler principles, such as toxicity, factuality, and sycophancy. A separate preference model is trained for each principle using feedback from GPT-3.5-Turbo. The scores from these preference models are then combined using different scalarization functions to provide a reward signal for Proximal Policy Optimization (PPO) training of the target language model. Our experiments indicate that MORLAIF outperforms the standard RLAIF baselines and that MORLAIF can be used to align larger language models using smaller ones. Surprisingly, the choice of scalarization function does not appear to significantly impact the results.

Recent advancements in large language models (LLMs) have led to remarkable performance across a wide range of natural language tasks (Bommasani et al., 2021; Brown et al., 2020). However, ensuring that these models behave in alignment with human values and preferences remains a significant challenge (Kenton et al., 2021). Reinforcement learning from human feedback (RLHF) has emerged as a promising approach to address this issue by training models to optimize for human-specified reward functions (Christiano et al., 2017; Stiennon et al., 2020). However, many issues with RLHF have been identified, such as the limited ability of humans to evaluate responses and reward hacking of the preference model (Casper et al., 2023). In a standard RLHF setup, a preference model is trained on human comparisons of model outputs to represent human preferences.
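The step in which per-principle preference scores are combined into a single reward can be illustrated with a minimal Python sketch. This is an assumption-laden illustration, not the paper's implementation: the two scalarization functions shown (a weighted linear combination and a worst-case minimum) are common choices in multi-objective RL and are not confirmed by the text above, and the function names, score values, and weights are hypothetical.

```python
from typing import Dict, Optional


def weighted_linear(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-principle scores (weights need not sum to 1)."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * value for name, value in scores.items()) / total_weight


def worst_case(scores: Dict[str, float]) -> float:
    """Reward equals the lowest per-principle score (a max-min style objective)."""
    return min(scores.values())


def scalarize(
    scores: Dict[str, float],
    method: str = "linear",
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Collapse per-principle preference scores into one scalar reward.

    `method` and its values are hypothetical names chosen for this sketch.
    """
    if not scores:
        raise ValueError("scores must contain at least one principle")
    if method == "linear":
        if weights is None:
            # Assumption: treat every principle as equally important.
            weights = {name: 1.0 for name in scores}
        return weighted_linear(scores, weights)
    if method == "worst_case":
        return worst_case(scores)
    raise ValueError(f"unknown scalarization method: {method}")


# Illustrative (made-up) scores for a single model response, one per principle.
example_scores = {"toxicity": 0.9, "factuality": 0.5, "sycophancy": 0.7}

reward_linear = scalarize(example_scores, method="linear")
reward_worst = scalarize(example_scores, method="worst_case")
reward_weighted = scalarize(
    example_scores,
    method="linear",
    weights={"toxicity": 2.0, "factuality": 1.0, "sycophancy": 1.0},
)
```

In a PPO training loop, a scalar like `reward_linear` would stand in for the output of a single preference model. Keeping the scalarization behind one function makes it easy to swap methods and compare them, which is the kind of comparison the abstract reports.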
Jun-12-2024