RLHF and IIA: Perverse Incentives

Wanqiao Xu, Shi Dong, Xiuyuan Lu, Grace Lam, Zheng Wen, Benjamin Van Roy

arXiv.org Artificial Intelligence 

Modern generative AIs ingest trillions of bytes of data from the World Wide Web to produce a large pretrained model. Trained to imitate what is observed, this model represents an agglomeration of behaviors, some more desirable to mimic than others. Further training through human interaction, even on fewer than a hundred thousand bits of data, has proven to greatly enhance usefulness and safety, enabling the remarkable AIs we have today. This process of reinforcement learning from human feedback (RLHF) steers AIs toward the more desirable of the behaviors observed during pretraining. While AIs now routinely generate drawings, music, speech, and computer code, the text-based chatbot remains an emblematic artifact.
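To make the reward-learning step of RLHF concrete, the sketch below fits a reward model to simulated pairwise preference data under the Bradley-Terry choice model that standard RLHF reward learning typically assumes, a model which, as the title alludes, embeds independence of irrelevant alternatives (IIA). This is a minimal illustration, not the paper's method: the feature representation, the linear "true reward" w_true, and all sizes and names are assumptions introduced for the example.

import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: each candidate response is summarized by a feature vector,
# and a latent linear reward w_true governs which of two responses a human
# annotator prefers.
d = 8            # feature dimension (illustrative)
n_pairs = 5000   # number of pairwise comparisons (illustrative)
w_true = rng.normal(size=d)

a = rng.normal(size=(n_pairs, d))  # features of response A in each pair
b = rng.normal(size=(n_pairs, d))  # features of response B in each pair

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Simulated annotations: under Bradley-Terry, P(A preferred over B) is a
# logistic function of the reward difference. Because the choice probability
# depends only on the two rewards being compared, this model satisfies IIA.
prefers_a = rng.random(n_pairs) < sigmoid((a - b) @ w_true)

# Fit a reward model by gradient ascent on the Bradley-Terry log-likelihood.
w = np.zeros(d)
lr = 0.1
for _ in range(500):
    p = sigmoid((a - b) @ w)                      # predicted P(A preferred)
    grad = (a - b).T @ (prefers_a - p) / n_pairs  # log-likelihood gradient
    w += lr * grad

print("correlation with true reward:", np.corrcoef(w, w_true)[0, 1])

In a full RLHF pipeline, the fitted reward model would then be used to fine-tune the pretrained policy; the point of the sketch is only that the preference model scores each pair in isolation, which is where the IIA assumption enters.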