Active teacher selection for reinforcement learning from human feedback
Freedman, Rachel, Svegliato, Justin, Wray, Kyle, Russell, Stuart
Specifying objective functions for machine learning systems is challenging, and misspecified objectives can be hacked [1, 2] or incentivize degenerate behavior [3, 4, 5]. Techniques such as reinforcement learning from human feedback (RLHF) enable ML systems to instead learn appropriate objectives from human feedback [6, 7, 8]. These techniques are widely used to finetune large language models [9, 10, 11, 12] and to train reinforcement learning agents to perform complex maneuvers in continuous control environments [6, 7]. However, while RLHF is relied upon to ensure that these systems are safe, helpful, and harmless [13], it still faces many limitations and unsolved challenges [14]. In particular, RLHF systems typically rely on the assumption that all feedback comes from a single human teacher, despite gathering feedback from a range of teachers with varying levels of rationality and expertise. For example, Stiennon et al. [8], Bai et al. [13], and Ouyang et al. [15] assume that all feedback comes from a single teacher, but find that annotators and researchers actually disagree 23% to 37% of the time. Reward learning has been shown to be highly sensitive to incorrect assumptions about the process that generates feedback [16, 17, 18, 19], so this single-teacher assumption exposes these systems to dangerous failures [20]. Ideally, RLHF systems should account for the differences between teachers to improve their safety and reliability. To leverage multiple teachers in RLHF, we introduce a novel problem called a Hidden Utility Bandit (HUB), illustrated in Figure 1.
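To make the single-teacher issue concrete, here is a minimal sketch, assuming teachers give pairwise preference feedback under a Boltzmann-rational (softmax) model with a per-teacher rationality coefficient beta; the teacher names, utilities, and beta values are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# True (hidden) utilities of two items the learner wants to compare.
true_utilities = {"A": 1.0, "B": 0.6}

# Hypothetical teachers, each modeled as Boltzmann-rational with its own
# rationality coefficient beta (higher beta = more reliable feedback).
# These names and values are illustrative, not taken from the paper.
teachers = {"expert": 10.0, "novice": 1.0, "random": 0.1}


def preference_prob(u_a: float, u_b: float, beta: float) -> float:
    """Probability that a Boltzmann-rational teacher prefers item A over B."""
    return 1.0 / (1.0 + np.exp(-beta * (u_a - u_b)))


def query_teacher(beta: float) -> str:
    """Sample one pairwise preference ('A' or 'B') from a teacher."""
    p_a = preference_prob(true_utilities["A"], true_utilities["B"], beta)
    return "A" if rng.random() < p_a else "B"


# The expert's feedback agrees with the true ranking far more often than the
# novice's, so pooling all feedback as if it came from one teacher discards
# information about feedback quality.
for name, beta in teachers.items():
    answers = [query_teacher(beta) for _ in range(1000)]
    accuracy = answers.count("A") / len(answers)
    print(f"{name:7s} (beta={beta:4.1f}): agrees with true ranking {accuracy:.0%} of the time")
```

Under this assumed model, an active learner that can choose which teacher to query would prefer the high-beta teacher when its feedback is worth the cost, which is the kind of trade-off the HUB formulation is designed to capture.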
arXiv.org Artificial Intelligence
Oct-23-2023