Circumventing interpretability: How to defeat mind-readers
The increasing capabilities of artificial intelligence (AI) systems make it ever more important that we interpret their internals to ensure that their intentions are aligned with human values. Yet there is reason to believe that a misaligned AI will have a convergent instrumental incentive to make its thoughts difficult for us to interpret. In this article, I discuss many ways that a capable AI might circumvent scalable interpretability methods and suggest a framework for thinking about these potential future risks.

I'm grateful to David Lindner, Evan R. Murphy, Alex Lintz, Sid Black, Kyle McDonnell, Laria Reynolds, Adam Shimi, and Daniel Braun, whose comments greatly improved earlier drafts of this article. The article's weaknesses are mine, but many of its strengths are due to their contributions. Additionally, this article benefited from the prior work of many authors, especially Evan Hubinger, Peter Barnett, Adam Shimi, Neel Nanda, Evan R. Murphy, Eliezer Yudkowsky, and Chris Olah; I collected several of the potential circumvention methods from their work. Part of this work was carried out while at Conjecture. The original post on which this paper was based can be found here.
arXiv.org Artificial Intelligence
Dec-21-2022