Elson, David
MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking
Farquhar, Sebastian | Varma, Vikrant | Lindner, David | Elson, David | Biddulph, Caleb | Goodfellow, Ian | Shah, Rohin
Future advanced AI systems may learn sophisticated strategies through reinforcement learning (RL) that humans cannot understand well enough to safely evaluate. We propose a training method which avoids agents learning undesired multi-step plans that receive high reward (multi-step "reward hacks"), even if humans are not able to detect that the behaviour is undesired. The method, Myopic Optimization with Non-myopic Approval (MONA), works by combining short-sighted optimization with far-sighted reward. We demonstrate that MONA can prevent multi-step reward hacking that ordinary RL causes, even without being able to detect the reward hacking and without any extra information that ordinary RL does not get access to. We study MONA empirically in three settings that model different misalignment failure modes: 2-step environments with LLMs, representing delegated oversight and encoded reasoning, and longer-horizon gridworld environments, representing sensor tampering.
Reports on the Fourth Artificial Intelligence for Interactive Digital Entertainment Conference Workshops
Elson, David (Google) | Rowe, Jonathan (North Carolina State University) | Smith, Adam M. (University of California, Santa Cruz) | Smith, Gillian (University of California, Santa Cruz) | Tomai, Emmett (University of Texas - Pan American)
The Seventh Artificial Intelligence for Interactive Digital Entertainment Conference (AIIDE-11) was held October 11–14, 2011 at Stanford University, Stanford, California. Two one-day workshops were held on October 11: Artificial Intelligence in the Game Design Process, and Intelligent Narrative Technologies. The highlights of each workshop are presented in this report.