COPF: Continual Learning Human Preference through Optimal Policy Fitting
Han Zhang, Lin Gui, Yuanzhao Zhai, Hui Wang, Yu Lei, Ruifeng Xu
–arXiv.org Artificial Intelligence
In the realm of natural language processing (NLP), large language models (LLMs) are vital tools with the potential to bridge human language and machine understanding. Learning human preferences is a crucial step towards ensuring that language models not only generate responses that are useful to users but also adhere to ethical and societal norms, i.e., helpful and harmless responses [1]. However, LLMs face a fundamental challenge in aligning with human preferences and values, hindering their full potential. Traditional alignment methods, namely Reinforcement Learning from Human Feedback (RLHF) [2, 3], involve supervised fine-tuning (SFT), reward model (RM) training, and policy model training. This complex pipeline lacks the flexibility needed for continual learning (CL) of human preferences, so existing work [1] often retrains models to adapt to dynamic preferences. There is therefore a pressing need for research into continual alignment methods that enable LLMs to adhere to evolving human preferences and values while generating helpful responses. In this paper, we propose an approach to address these challenges by enhancing the utility of the Direct Preference Optimization (DPO) [4] algorithm, a non-reinforcement-learning, non-continual-learning method. DPO, rooted in rigorous reinforcement learning theory, offers promising advantages but suffers from three critical limitations: 1. DPO does not support evolving human preferences, which are common in real-world applications.
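For context on the limitation described above, the sketch below shows the standard pairwise DPO loss that the abstract refers to, computed from per-example log-probabilities under the policy and a frozen reference model. This is a minimal illustration of the underlying DPO objective, not the COPF method proposed in the paper; the function name, tensor arguments, and default beta are illustrative assumptions.

```python
# Minimal sketch of the standard DPO objective (not COPF itself).
# Inputs are hypothetical summed log-probabilities of the chosen and
# rejected responses under the current policy and a fixed reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Pairwise preference loss: push the policy's log-ratio for the
    preferred response above the log-ratio for the rejected one."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log(sigmoid(beta * margin)), averaged over preference pairs
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

Because this loss is optimized over a static set of preference pairs against a fixed reference model, it has no built-in mechanism for incorporating new preference data over time, which is the first limitation the paper targets.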
Oct-27-2023