Pre-Deployment Information Sharing: A Zoning Taxonomy for Precursory Capabilities
Pistillo, Matteo, Stix, Charlotte
There is a growing consensus that information is the "lifeblood of good governance" (Kolt et al., 2024) and that information sharing should be one of the "natural initial target[s]" of AI governance (Bommasani et al., 2024). Up-to-date and reliable information about AI systems' capabilities, and about how those capabilities will develop in the future, can help developers, governments, and researchers advance safety evaluations (Frontier Model Forum, 2024), develop best practices (UK DSIT, 2023), and respond effectively to the new risks posed by frontier AI (Kolt et al., 2024). Information sharing also supports regulatory visibility (Anderljung et al., 2023) and can thus enable better-informed AI governance (O'Brien et al., 2024). Further, access to knowledge about AI systems' potential risks allows claims about AI systems to be scrutinized more effectively (Brundage et al., 2020). By contrast, information asymmetries could lead regulators to miscalibrated over-regulation or under-regulation of AI (Ball & Kokotajlo, 2024) and could contribute to the "pacing problem," a situation in which government oversight consistently lags behind technology development (Marchant et al., 2011). In short, there is a strong case for information sharing being one "key to making AI go well" (Ball & Kokotajlo, 2024). The Frontier AI Safety Commitments ("FAISC") are an important step towards more comprehensive information sharing by AI developers.
Towards evaluations-based safety cases for AI scheming
Balesni, Mikita, Hobbhahn, Marius, Lindner, David, Meinke, Alexander, Korbak, Tomek, Clymer, Joshua, Shlegeris, Buck, Scheurer, Jérémy, Stix, Charlotte, Shah, Rusheb, Goldowsky-Dill, Nicholas, Braun, Dan, Chughtai, Bilal, Evans, Owain, Kokotajlo, Daniel, Bushnaq, Lucius
We sketch how developers of frontier AI systems could construct a structured rationale, a 'safety case', that an AI system is unlikely to cause catastrophic outcomes through scheming. Scheming is a potential threat model in which AI systems covertly pursue misaligned goals, hiding their true capabilities and objectives. In this report, we propose three arguments that safety cases could use in relation to scheming. For each argument, we sketch how evidence could be gathered from empirical evaluations and what assumptions would need to be met to provide strong assurance. First, developers of frontier AI systems could argue that AI systems are not capable of scheming (Scheming Inability). Second, one could argue that AI systems are not capable of causing harm through scheming (Harm Inability). Third, one could argue that control measures around the AI systems would prevent unacceptable outcomes even if the AI systems intentionally attempted to subvert them (Harm Control). Additionally, we discuss how safety cases might be supported by evidence that an AI system is reasonably aligned with its developers (Alignment). Finally, we point out that many of the assumptions required to make these safety arguments have not been confidently satisfied to date, and that satisfying them requires progress on multiple open research problems.
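To make the structure of such a safety case more concrete, the sketch below represents the report's three argument types (Scheming Inability, Harm Inability, Harm Control) plus supporting Alignment evidence as a simple data structure. This is a hypothetical illustration, not a formalism from the paper: all class names, fields, and the aggregation rule are assumptions introduced here for clarity.

```python
# Hypothetical sketch of a scheming safety case as a data structure.
# Names and logic are illustrative assumptions, not the authors' method.
from dataclasses import dataclass, field
from enum import Enum


class ArgumentType(Enum):
    SCHEMING_INABILITY = "the AI system is not capable of scheming"
    HARM_INABILITY = "the AI system is not capable of causing harm through scheming"
    HARM_CONTROL = "control measures prevent unacceptable outcomes despite scheming attempts"
    ALIGNMENT = "supporting evidence that the system is reasonably aligned with its developers"


@dataclass
class EvaluationEvidence:
    """One empirical evaluation result offered in support of an argument."""
    name: str
    passed: bool
    assumptions: list[str] = field(default_factory=list)  # assumptions the evaluation relies on


@dataclass
class SafetyCaseArgument:
    argument_type: ArgumentType
    evidence: list[EvaluationEvidence] = field(default_factory=list)

    def is_supported(self) -> bool:
        # Illustrative rule: an argument counts as supported only if it has
        # evidence and every evaluation passed; unmet assumptions still need
        # separate expert review.
        return bool(self.evidence) and all(e.passed for e in self.evidence)


@dataclass
class SchemingSafetyCase:
    """Top-level claim: the system is unlikely to cause catastrophic outcomes through scheming."""
    system_name: str
    arguments: list[SafetyCaseArgument]

    def supported_arguments(self) -> list[ArgumentType]:
        return [a.argument_type for a in self.arguments if a.is_supported()]
```

In practice, a single case would likely combine several argument types (e.g., Harm Control plus Alignment evidence) rather than rely on any one of them alone, which is why the sketch keeps them as a list under one top-level claim.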