Toward Optimal LLM Alignments Using Two-Player Games

Zheng, Rui, Guo, Hongyi, Liu, Zhihan, Zhang, Xiaoying, Yao, Yuanshun, Xu, Xiaojun, Wang, Zhaoran, Xi, Zhiheng, Gui, Tao, Zhang, Qi, Huang, Xuanjing, Li, Hang, Liu, Yang

arXiv.org Artificial Intelligence 

Alignment of large language models is a critical process designed to ensure that the model's responses to user prompts accurately reflect human intentions and adhere to societal values. The standard Reinforcement Learning from Human Feedback (RLHF) framework primarily focuses on optimizing the performance of large language models using pre-collected prompts. However, collecting prompts that provide comprehensive coverage is both tedious and challenging, and such collections often fail to include the scenarios that LLMs most need to improve on. In this paper, we investigate alignment through the lens of two-agent games, involving iterative interactions between an adversarial and a defensive agent. The adversarial agent's task at each step is to generate prompts that expose the weaknesses of the defensive agent. In return, the defensive agent seeks to improve its responses to these newly identified prompts it "struggled" with, based on feedback from the reward model. We theoretically demonstrate that this iterative reinforcement learning optimization converges to a Nash Equilibrium for the game induced by the agents. Experimental results in safety scenarios demonstrate that learning in such a competitive environment not only fully trains the agents but also leads to policies with enhanced generalization capabilities for both the adversarial and defensive agents. Our code is released at https://github.com/ruizheng20/gpo.
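To make the iterative two-player procedure from the abstract concrete, below is a minimal Python sketch of the training loop: the adversary proposes prompts, the defender responds, a reward model scores the exchanges, and both agents take alternating RL-style updates. All names here (AdversarialAgent, DefensiveAgent, RewardModel, Interaction, two_player_alignment) are hypothetical placeholders for illustration only and do not reflect the API of the released repository linked above.

```python
# Sketch of the two-player alignment loop described in the abstract.
# The class interfaces are assumptions made for illustration, not the
# authors' released code (see https://github.com/ruizheng20/gpo).

from dataclasses import dataclass
from typing import List


@dataclass
class Interaction:
    prompt: str
    response: str
    reward: float  # reward-model score (higher = better defender response)


class RewardModel:
    def score(self, prompt: str, response: str) -> float:
        """Return a scalar reward for how well the response handles the prompt."""
        raise NotImplementedError


class AdversarialAgent:
    def generate_prompts(self, n: int) -> List[str]:
        """Propose prompts intended to expose weaknesses of the defender."""
        raise NotImplementedError

    def update(self, batch: List[Interaction]) -> None:
        """RL update: reinforce prompts that yielded LOW defender reward."""
        raise NotImplementedError


class DefensiveAgent:
    def respond(self, prompt: str) -> str:
        """Generate a response to an adversarially chosen prompt."""
        raise NotImplementedError

    def update(self, batch: List[Interaction]) -> None:
        """RL update: reinforce responses that yielded HIGH reward."""
        raise NotImplementedError


def two_player_alignment(adversary: AdversarialAgent,
                         defender: DefensiveAgent,
                         reward_model: RewardModel,
                         iterations: int = 100,
                         prompts_per_round: int = 64) -> None:
    """Alternate adversarial and defensive updates; the paper's analysis
    argues this kind of iterative optimization converges to a Nash
    Equilibrium of the induced game."""
    for _ in range(iterations):
        # Adversary proposes prompts where the defender is likely to fail.
        prompts = adversary.generate_prompts(prompts_per_round)

        # Defender answers; the reward model judges each answer.
        batch = []
        for p in prompts:
            r = defender.respond(p)
            batch.append(Interaction(p, r, reward_model.score(p, r)))

        # Opposing objectives: adversary seeks low rewards, defender high.
        adversary.update(batch)
        defender.update(batch)
```

The key design point the sketch tries to capture is the opposing objectives: the same batch of scored interactions drives both updates, with the adversary rewarded for low defender scores and the defender rewarded for high ones.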
