Aligning language models with human preferences
–arXiv.org Artificial Intelligence
Language models (LMs) trained on vast quantities of text data can acquire sophisticated skills such as generating summaries, answering questions or generating code. However, they also manifest behaviors that violate human preferences, e.g., they can generate offensive content or falsehoods, or perpetuate social biases. In this thesis, I explore several approaches to aligning LMs with human preferences. First, I argue that aligning LMs can be seen as Bayesian inference: conditioning a prior (the base, pretrained LM) on evidence about human preferences (Chapter 2). Conditioning on human preferences can be implemented in numerous ways. In Chapter 3, I investigate the relation between two approaches to finetuning pretrained LMs using feedback given by a scoring function: reinforcement learning from human feedback (RLHF) and distribution matching. I show that RLHF can be seen as a special case of distribution matching, but that distribution matching is strictly more general. In Chapter 4, I show how to extend distribution matching to conditional language models. Finally, in Chapter 5, I explore a different route: conditioning an LM on human preferences already during pretraining. I show that involving human feedback from the very start tends to be more effective than using it only during supervised finetuning. Overall, these results highlight the room for alignment techniques different from and complementary to RLHF.
Apr-18-2024
- Country:
- Oceania > Australia
- Victoria > Melbourne (0.04)
- New South Wales > Sydney (0.04)
- North America
- Dominican Republic (0.04)
- United States
- Texas > Travis County
- Austin (0.04)
- Michigan > Washtenaw County
- Ann Arbor (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Maryland > Montgomery County
- Gaithersburg (0.04)
- Louisiana > Orleans Parish
- New Orleans (0.04)
- Washington > King County
- Seattle (0.14)
- Massachusetts
- Suffolk County > Boston (0.04)
- Middlesex County > Cambridge (0.04)
- California
- San Francisco County > San Francisco (0.13)
- San Diego County > San Diego (0.04)
- Santa Clara County > Palo Alto (0.04)
- San Mateo County > San Mateo (0.04)
- New York > New York County
- New York City (0.04)
- Puerto Rico > San Juan
- San Juan (0.04)
- Canada > British Columbia
- Europe
- Germany > Berlin (0.04)
- France (0.04)
- Czechia > Prague (0.04)
- Denmark > Capital Region
- Copenhagen (0.04)
- United Kingdom > England
- Oxfordshire > Oxford (0.04)
- Cambridgeshire > Cambridge (0.04)
- Latvia > Lubāna Municipality
- Lubāna (0.04)
- Italy
- Spain > Catalonia
- Barcelona Province > Barcelona (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Poland > Masovia Province
- Warsaw (0.04)
- Asia
- Africa > Ethiopia
- Addis Ababa > Addis Ababa (0.04)
- Genre:
- Research Report > New Finding (1.00)
- Instructional Material (1.00)
- Industry:
- Government (0.67)
- Education (0.67)
- Information Technology (0.45)
- Technology: