Modeling Human Beliefs about AI Behavior for Scalable Oversight

Lang, Leon, Forré, Patrick

arXiv.org Artificial Intelligence 

Contemporary work in AI alignment often relies on human feedback to teach AI systems human preferences and values. Yet as AI systems grow more capable, human feedback becomes increasingly unreliable. This raises the problem of scalable oversight: How can we supervise AI systems that exceed human capabilities? In this work, we propose to model the human evaluator's beliefs about the AI system's behavior to better interpret the human's feedback. We formalize human belief models and theoretically analyze their role in inferring human values. We then characterize the remaining ambiguity in this inference and the conditions under which the ambiguity disappears. To mitigate reliance on exact belief models, we then introduce the relaxation of human belief model covering. Finally, we propose using foundation models to construct covering belief models, providing a new potential approach to scalable oversight.
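To make the core idea concrete, here is a minimal toy sketch (not from the paper, and not its formalism): the human rates AI actions according to the outcomes they *believe* each action produces, so interpreting the feedback means inverting that belief model to recover the human's values over outcomes. All names and numbers below are hypothetical, and the belief model is assumed to be a simple action-by-outcome probability matrix.

```python
import numpy as np

# Hypothetical belief model: rows are AI actions, columns are outcomes.
# Entry B[a, o] is the probability the human *believes* action a leads to outcome o.
belief_model = np.array([
    [0.9, 0.1, 0.0],   # human thinks action 0 almost surely yields outcome 0
    [0.2, 0.7, 0.1],
    [0.0, 0.3, 0.7],
])

# Observed human feedback: one rating per action, assumed to equal the
# expected utility of the believed outcomes, rating(a) = sum_o B[a, o] * u(o).
observed_ratings = np.array([0.85, 0.55, 0.25])

# Infer the outcome utilities that best explain the ratings given the beliefs.
utilities, *_ = np.linalg.lstsq(belief_model, observed_ratings, rcond=None)
print("Inferred outcome utilities:", utilities)
```

In this toy linear setting, the inference stays ambiguous whenever the belief matrix is rank-deficient (several utility vectors explain the same ratings); this is only an analogy for the ambiguity the abstract refers to, whose formal characterization is given in the paper itself.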