Modeling Human Beliefs about AI Behavior for Scalable Oversight

Lang, Leon, Forré, Patrick

arXiv.org Artificial Intelligence 

Contemporary work in AI alignment often relies on human feedback to teach AI systems human preferences and values. Yet as AI systems grow more capable, human feedback becomes increasingly unreliable. This raises the problem of scalable oversight: How can we supervise AI systems that exceed human capabilities? In this work, we propose to model the human evaluator's beliefs about the AI system's behavior to better interpret the human's feedback. We formalize human belief models and theoretically analyze their role in inferring human values. We then characterize the remaining ambiguity in this inference and the conditions under which the ambiguity disappears. To mitigate reliance on exact belief models, we then introduce the relaxation of human belief model covering. Finally, we propose using foundation models to construct covering belief models, providing a new potential approach to scalable oversight.
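To make the core idea concrete, here is a minimal toy sketch (not from the paper, and not its formalism): the human rates AI actions according to the outcomes they *believe* each action produces, so interpreting the feedback means inverting that belief model to recover the human's values over outcomes. All names and numbers below are hypothetical, and the belief model is assumed to be a simple action-by-outcome probability matrix.

```python
import numpy as np

# Hypothetical belief model: rows are AI actions, columns are outcomes.
# Entry B[a, o] is the probability the human *believes* action a leads to outcome o.
belief_model = np.array([
    [0.9, 0.1, 0.0],   # human thinks action 0 almost surely yields outcome 0
    [0.2, 0.7, 0.1],
    [0.0, 0.3, 0.7],
])

# Observed human feedback: one rating per action, assumed to equal the
# expected utility of the believed outcomes, rating(a) = sum_o B[a, o] * u(o).
observed_ratings = np.array([0.85, 0.55, 0.25])

# Infer the outcome utilities that best explain the ratings given the beliefs.
utilities, *_ = np.linalg.lstsq(belief_model, observed_ratings, rcond=None)
print("Inferred outcome utilities:", utilities)
```

In this toy linear setting, the inference stays ambiguous whenever the belief matrix is rank-deficient (several utility vectors explain the same ratings); this is only an analogy for the ambiguity the abstract refers to, whose formal characterization is given in the paper itself.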