REALab: An Embedded Perspective on Tampering
Ramana Kumar, Jonathan Uesato, Richard Ngo, Tom Everitt, Victoria Krakovna, Shane Legg
Tampering problems, where an AI agent interferes with whatever represents or communicates its intended objective and pursues the resulting corrupted objective instead, are a staple concern in the AGI safety literature [Amodei et al., 2016; Bostrom, 2014; Everitt and Hutter, 2016; Everitt et al., 2017; Armstrong and O'Rourke, 2017; Everitt and Hutter, 2019; Armstrong et al., 2020]. Variations on the idea of tampering include wireheading, where an agent learns how to stimulate its reward mechanism directly, and the off-switch or shutdown problem, where an agent interferes with its supervisor's ability to halt the agent's operation. Many real-world concerns can be formulated as tampering problems, as we will show (§2.1, §4.1). However, what constitutes tampering can be tricky to define precisely, despite clear intuitions in specific cases. We have developed a platform, REALab, to model tampering problems.