Circumventing interpretability: How to defeat mind-readers
The increasing capabilities of artificial intelligence (AI) systems make it ever more important that we interpret their internals to ensure that their intentions are aligned with human values. Yet there is reason to believe that a misaligned AI will have a convergent instrumental incentive to make its thoughts difficult for us to interpret. In this article, I discuss many ways that a capable AI might circumvent scalable interpretability methods and suggest a framework for thinking about these potential future risks.

I'm grateful to David Lindner, Evan R. Murphy, Alex Lintz, Sid Black, Kyle McDonnell, Laria Reynolds, Adam Shimi, and Daniel Braun, whose comments greatly improved earlier drafts of this article. The article's weaknesses are mine, but many of its strengths are due to their contributions. Additionally, this article benefited from the prior work of many authors, especially Evan Hubinger, Peter Barnett, Adam Shimi, Neel Nanda, Evan R. Murphy, Eliezer Yudkowsky, and Chris Olah; I collected several of the potential circumvention methods from their work. Part of this work was carried out while at Conjecture. The original post on which this paper was based can be found here.
arXiv.org Artificial Intelligence
Dec-21-2022