RLHF and IIA: Perverse Incentives

Wanqiao Xu, Shi Dong, Xiuyuan Lu, Grace Lam, Zheng Wen, Benjamin Van Roy

arXiv.org Artificial Intelligence 

Modern generative AIs ingest trillions of bytes of data from the World Wide Web to produce a large pretrained model. Trained to imitate what is observed, this model represents an agglomeration of behaviors, some more desirable to mimic than others. Further training through human interaction, even on fewer than a hundred thousand bits of data, has proven to greatly enhance usefulness and safety, enabling the remarkable AIs we have today. This process of reinforcement learning from human feedback (RLHF) steers AIs toward the more desirable of the behaviors observed during pretraining. While AIs now routinely generate drawings, music, speech, and computer code, the text-based chatbot remains an emblematic artifact.
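To make the reward-learning step of RLHF concrete, the sketch below fits a reward model to simulated pairwise preference data under the Bradley-Terry choice model that standard RLHF reward learning typically assumes, a model which, as the title alludes, embeds independence of irrelevant alternatives (IIA). This is a minimal illustration, not the paper's method: the feature representation, the linear "true reward" w_true, and all sizes and names are assumptions introduced for the example.

import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: each candidate response is summarized by a feature vector,
# and a latent linear reward w_true governs which of two responses a human
# annotator prefers.
d = 8            # feature dimension (illustrative)
n_pairs = 5000   # number of pairwise comparisons (illustrative)
w_true = rng.normal(size=d)

a = rng.normal(size=(n_pairs, d))  # features of response A in each pair
b = rng.normal(size=(n_pairs, d))  # features of response B in each pair

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Simulated annotations: under Bradley-Terry, P(A preferred over B) is a
# logistic function of the reward difference. Because the choice probability
# depends only on the two rewards being compared, this model satisfies IIA.
prefers_a = rng.random(n_pairs) < sigmoid((a - b) @ w_true)

# Fit a reward model by gradient ascent on the Bradley-Terry log-likelihood.
w = np.zeros(d)
lr = 0.1
for _ in range(500):
    p = sigmoid((a - b) @ w)                      # predicted P(A preferred)
    grad = (a - b).T @ (prefers_a - p) / n_pairs  # log-likelihood gradient
    w += lr * grad

print("correlation with true reward:", np.corrcoef(w, w_true)[0, 1])

In a full RLHF pipeline, the fitted reward model would then be used to fine-tune the pretrained policy; the point of the sketch is only that the preference model scores each pair in isolation, which is where the IIA assumption enters.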