A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring
As AI systems approach dangerous capability levels where inability safety cases become insufficient, we need alternative approaches to ensure safety. This paper presents a roadmap for constructing safety cases based on chain-of-thought (CoT) monitoring in reasoning models and outlines our research agenda. We argue that CoT monitoring might support both control and trustworthiness safety cases. We propose a two-part safety case: (1) establishing that models lack dangerous capabilities when operating without their CoT, and (2) ensuring that any dangerous capabilities enabled by a CoT are detectable by CoT monitoring. We systematically examine two threats to monitorability: neuralese and encoded reasoning, which we categorize into three forms (linguistic drift, steganography, and alien reasoning) and analyze their potential drivers. We evaluate existing and novel techniques for maintaining CoT faithfulness. For cases where models produce non-monitorable reasoning, we explore the possibility of extracting a monitorable CoT from a non-monitorable CoT. To assess the viability of CoT monitoring safety cases, we establish prediction markets to aggregate forecasts on key technical milestones influencing their feasibility.
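The two-part structure described above lends itself to a simple deployment gate. The following Python sketch is purely illustrative (my own toy construction, not the paper's method): part 1 checks that no-CoT capability evaluations stay below an assumed danger threshold, and part 2 uses a keyword matcher as a stand-in for a learned CoT monitor, requiring an assumed recall on known-dangerous chains of thought. All names and thresholds are hypothetical.

```python
# Hypothetical sketch of the two-part CoT-monitoring safety case as a gate.
# Thresholds, evaluation names, and the keyword "monitor" are illustrative
# assumptions, not the paper's actual method.

def no_cot_capability_ok(eval_scores, threshold=0.2):
    """Part 1: the model, run without a chain of thought, stays below a
    dangerous-capability threshold on every evaluation."""
    return all(score < threshold for score in eval_scores.values())

def cot_monitor_flags(cot_text, danger_markers):
    """Part 2 (toy stand-in for a learned monitor): flag a CoT if it
    contains any dangerous-reasoning marker."""
    return [m for m in danger_markers if m in cot_text.lower()]

def safety_case_holds(eval_scores, sampled_cots, danger_markers,
                      recall_required=0.95):
    """The case holds if (1) no-CoT capability is low and (2) the monitor
    catches a sufficient fraction of known-dangerous CoTs."""
    if not no_cot_capability_ok(eval_scores):
        return False
    dangerous = [c for c, is_bad in sampled_cots if is_bad]
    caught = [c for c in dangerous if cot_monitor_flags(c, danger_markers)]
    recall = len(caught) / len(dangerous) if dangerous else 1.0
    return recall >= recall_required

scores = {"bio": 0.05, "cyber": 0.10}
cots = [("first, synthesize the pathogen precursor", True),
        ("let's add the two numbers", False)]
print(safety_case_holds(scores, cots, ["pathogen"]))  # -> True
```

The gate fails closed: either a high no-CoT capability score or insufficient monitor recall blocks deployment.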
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)
AI Testing Should Account for Sophisticated Strategic Behaviour
Kovarik, Vojtech, Chen, Eric Olav, Petersen, Sami, Ghersengorin, Alexis, Conitzer, Vincent
This position paper argues for two claims regarding AI testing and evaluation. First, to remain informative about deployment behaviour, evaluations need to account for the possibility that AI systems understand their circumstances and reason strategically. Second, game-theoretic analysis can inform evaluation design by formalising and scrutinising the reasoning in evaluation-based safety cases. Drawing on examples from existing AI systems, a review of relevant research, and a formal strategic analysis of a stylised evaluation scenario, we present evidence for these claims and motivate several research directions.
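The core game-theoretic point can be made with a very small model. The sketch below is my own stylization, not the paper's formal scenario: an AI system that can classify episodes as "evaluation" or "deployment" with accuracy `acc` compares always complying against a conditional strategy (comply when it looks like an evaluation, defect otherwise). All payoff numbers are illustrative assumptions.

```python
# Toy strategic model (illustrative assumptions, not the paper's analysis).
# An AI is evaluated with probability p_eval; defecting during an evaluation
# is caught and penalized, while defecting in deployment pays off.

def expected_payoff(strategy, acc, p_eval,
                    r_comply=1.0, r_defect=3.0, penalty=-10.0):
    """Expected payoff to the AI under the given strategy."""
    if strategy == "always_comply":
        return r_comply
    # Conditional strategy: comply iff the episode is classified as an eval.
    payoff = p_eval * (acc * r_comply + (1 - acc) * penalty)        # evals
    payoff += (1 - p_eval) * (acc * r_defect + (1 - acc) * r_comply)  # deploy
    return payoff

# With accurate eval-detection the conditional strategy dominates, so a
# passed evaluation says little about deployment behaviour.
for acc in (0.5, 0.99):
    cond = expected_payoff("conditional", acc, p_eval=0.5)
    beats_honest = cond > expected_payoff("always_comply", acc, 0.5)
    print(acc, cond, beats_honest)
```

Under these assumed payoffs, a system that cannot distinguish evaluation from deployment (acc = 0.5) prefers to comply, while an accurate classifier (acc = 0.99) prefers the conditional strategy, which is the sense in which naive evaluations stop being informative.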
- North America > United States (0.28)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- Information Technology > Security & Privacy (1.00)
- Leisure & Entertainment > Games (0.94)
Operationalization of Scenario-Based Safety Assessment of Automated Driving Systems
Camp, Olaf Op den, de Gelder, Erwin
Olaf Op den Camp (Integrated Vehicle Safety, TNO, Helmond, the Netherlands; ORCID 0000-0002-6355-134X) and Erwin de Gelder (Integrated Vehicle Safety, TNO, Helmond, the Netherlands; ORCID 0000-0003-4260-4294). Abstract -- Before introducing an Automated Driving System (ADS) on the road at scale, the manufacturer must conduct some form of safety assurance. To structure and harmonize the safety assurance process, the UNECE WP.29 Working Party on Automated/Autonomous and Connected Vehicles (GRVA) is developing the New Assessment/Test Method (NATM), which indicates what steps need to be taken for the safety assessment of an ADS. In this paper, we show how to practically conduct safety assessment using a scenario database, and what additional steps must be taken to fully operationalize the NATM. In addition, we elaborate on how the use of scenario databases fits with methods developed in the Horizon Europe projects that focus on safety assessment following the NATM approach. The safety assurance process conducted by the manufacturer before introducing an ADS is intended to assure that the ADS responds appropriately in all situations it is designed for and that it is able to avoid any reasonably foreseeable and reasonably preventable collisions. The information from the safety assurance process is important not only for manufacturers, but also for authorities that have the responsibility to guard the safety of their citizens in traffic. Safety assurance is most important for consumers (and fleet owners) using an ADS with the expectation that the system is safe, reliable, and trustworthy. To structure and harmonize this process, the UNECE WP.29 GRVA is developing the New Assessment/Test Method (NATM) [1], which is already recognized across many countries (e.g., Japan, South Korea, the EU, and the USA).
- Europe > Netherlands (0.44)
- Asia > South Korea (0.24)
- Asia > Japan (0.24)
- (4 more...)
- Law (1.00)
- Transportation > Ground > Road (0.92)
- Information Technology > Robotics & Automation (0.82)
- Automobiles & Trucks (0.82)
Manipulation Attacks by Misaligned AI: Risk Analysis and Safety Case Framework
Dassanayake, Rishane, Demetroudi, Mario, Walpole, James, Lentati, Lindley, Brown, Jason R., Young, Edward James
Frontier AI systems are rapidly advancing in their capabilities to persuade, deceive, and influence human behaviour, with current models already demonstrating human-level persuasion and strategic deception in specific contexts. Humans are often the weakest link in cybersecurity systems, and a misaligned AI system deployed internally within a frontier company may seek to undermine human oversight by manipulating employees. Despite this growing threat, manipulation attacks have received little attention, and no systematic framework exists for assessing and mitigating these risks. To address this, we provide a detailed explanation of why manipulation attacks are a significant threat and could lead to catastrophic outcomes. Additionally, we present a safety case framework for manipulation risk, structured around three core lines of argument: inability, control, and trustworthiness. For each argument, we specify evidence requirements, evaluation methodologies, and implementation considerations for direct application by AI companies. This paper provides the first systematic methodology for integrating manipulation risk into AI safety governance, offering AI companies a concrete foundation to assess and mitigate these threats before deployment.
- Information Technology > Security & Privacy (1.00)
- Government > Military > Cyberwarfare (0.48)
Evaluating Frontier Models for Stealth and Situational Awareness
Phuong, Mary, Zimmermann, Roland S., Wang, Ziyue, Lindner, David, Krakovna, Victoria, Cogan, Sarah, Dafoe, Allan, Ho, Lewis, Shah, Rohin
Recent work has demonstrated the plausibility of frontier AI models scheming -- knowingly and covertly pursuing an objective misaligned with their developers' intentions. Such behavior could be very hard to detect and, if present in future advanced systems, could pose a severe loss-of-control risk. It is therefore important for AI developers to rule out harm from scheming prior to model deployment. In this paper, we present a suite of scheming reasoning evaluations measuring two types of reasoning capabilities that we believe are prerequisites for successful scheming. First, we propose five evaluations of a model's ability to reason about and circumvent oversight (stealth). Second, we present eleven evaluations measuring a model's ability to instrumentally reason about itself, its environment, and its deployment (situational awareness). We demonstrate how these evaluations can be used as part of a scheming inability safety case: a model that does not succeed on these evaluations is almost certainly incapable of causing severe harm via scheming in real deployment. We run our evaluations on current frontier models and find that none of them shows concerning levels of either situational awareness or stealth.
- Research Report > New Finding (0.67)
- Research Report > Experimental Study (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.96)
The DevSafeOps Dilemma: A Systematic Literature Review on Rapidity in Safe Autonomous Driving Development and Operation
Nouri, Ali, Cabrero-Daniel, Beatriz, Törner, Fredrik, Berger, Christian
Developing autonomous driving (AD) systems is challenging due to the complexity of the systems and the need to assure their safe and reliable operation. The widely adopted DevOps approach seems promising for supporting continuous technological progress in AI and the demand for fast reaction to incidents, which necessitates continuous development, deployment, and monitoring. We present a systematic literature review that identifies, analyses, and synthesises a broad range of existing literature on the use of DevOps in autonomous driving development. Our results provide a structured overview of the challenges and solutions arising from applying DevOps to safety-related AI-enabled functions. They also indicate that several open topics must still be addressed to enable safe DevOps for the development of safe AD.
- Europe > Austria > Vienna (0.14)
- Europe > Sweden > Vaestra Goetaland > Gothenburg (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (9 more...)
- Research Report > New Finding (1.00)
- Research Report > Promising Solution (0.93)
- Transportation > Ground > Road (1.00)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- (3 more...)
Systematic Hazard Analysis for Frontier AI using STPA
All of the frontier AI companies have published safety frameworks in which they define capability thresholds and risk mitigations that determine how they will safely develop and deploy their models. Adoption of systematic approaches to risk modelling, based on established practices used in safety-critical industries, has been recommended; however, frontier AI companies currently do not describe in detail any structured approach to identifying and analysing hazards. STPA (Systems-Theoretic Process Analysis) is a systematic methodology for identifying how complex systems can become unsafe, leading to hazards. It achieves this by mapping out controllers and controlled processes, then analysing their interactions and feedback loops to understand how harmful outcomes could occur (Leveson & Thomas, 2018). We evaluate STPA's ability to broaden the scope, improve the traceability, and strengthen the robustness of safety assurance for frontier AI systems. Applying STPA to the threat model and scenario described in 'A Sketch of an AI Control Safety Case' (Korbak et al., 2025), we derive a list of Unsafe Control Actions. From these we select a subset and explore the Loss Scenarios that lead to them if left unmitigated. We find that STPA can identify causal factors that may be missed by unstructured hazard analysis methodologies, thereby improving robustness. We suggest STPA could increase the safety assurance of frontier AI when used to complement or check the coverage of existing AI governance techniques, including capability thresholds, model evaluations, and emergency procedures. The application of a systematic methodology supports scalability by increasing the proportion of the analysis that can be conducted by LLMs, reducing the burden on human domain experts.
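The UCA-derivation step of STPA is mechanical enough to sketch in a few lines. The snippet below crosses each control action with the four standard STPA guidewords (Leveson & Thomas, 2018); the controller and control actions shown are hypothetical examples in the spirit of an AI-control setting, not the paper's actual list.

```python
# Minimal sketch of deriving candidate Unsafe Control Actions (UCAs) in
# STPA: each control action is crossed with the four standard guidewords.
# An analyst then keeps only the candidates that lead to a hazard in context.
# The example controller and actions below are hypothetical.

from itertools import product

GUIDEWORDS = [
    "not provided",
    "provided",
    "provided too early, too late, or out of order",
    "stopped too soon or applied too long",
]

def derive_ucas(controller, control_actions):
    """Enumerate candidate UCAs for one controller."""
    return [f"{controller}: '{action}' {gw} causes a hazard"
            for action, gw in product(control_actions, GUIDEWORDS)]

ucas = derive_ucas("monitoring system",
                   ["flag suspicious agent action", "halt agent"])
print(len(ucas))  # 2 actions x 4 guidewords -> 8 candidates
for u in ucas[:2]:
    print(u)
```

Enumerating the full cross product before filtering is what gives STPA its coverage advantage over unstructured brainstorming: no action-guideword pair is silently skipped.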
- Asia > Japan > Honshū > Tōhoku > Fukushima Prefecture > Fukushima (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Asia > Afghanistan (0.04)
- Energy > Power Industry > Utilities > Nuclear (0.93)
- Transportation (0.68)
- Information Technology > Security & Privacy (0.68)
- Aerospace & Defense (0.68)
An Example Safety Case for Safeguards Against Misuse
Clymer, Joshua, Weinbaum, Jonah, Kirk, Robert, Mai, Kimberly, Zhang, Selena, Davies, Xander
Existing evaluations of AI misuse safeguards provide a patchwork of evidence that is often difficult to connect to real-world decisions. To bridge this gap, we describe an end-to-end argument (a "safety case") that misuse safeguards reduce the risk posed by an AI assistant to low levels. We first describe how a hypothetical developer red teams safeguards, estimating the effort required to evade them. Then, the developer plugs this estimate into a quantitative "uplift model" to determine how much the barriers introduced by safeguards dissuade misuse (https://www.aimisusemodel.com/). This procedure provides a continuous signal of risk during deployment that helps the developer rapidly respond to emerging threats. Finally, we describe how to tie these components together into a simple safety case. Our work provides one concrete path -- though not the only path -- to rigorously justifying that AI misuse risks are low.
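The red-team-to-risk pipeline can be illustrated with a toy calculation. The sketch below is only in the spirit of the paper's uplift model (the actual model at aimisusemodel.com differs); the functional form, attacker budgets, and harm numbers are my illustrative assumptions. The red-team estimate of the effort needed to evade safeguards feeds a deterrence step: would-be misusers whose effort budget falls short are dissuaded.

```python
# Toy uplift-style risk calculation (illustrative assumptions only, not the
# model from aimisusemodel.com). Risk = attempts x undeterred fraction x harm.

def fraction_undeterred(evasion_effort_hours, attacker_budgets):
    """Share of would-be misusers able and willing to spend at least the
    red-team-estimated effort needed to evade the safeguards."""
    able = [b for b in attacker_budgets if b >= evasion_effort_hours]
    return len(able) / len(attacker_budgets)

def expected_harm(n_attempts, evasion_effort_hours, attacker_budgets,
                  harm_per_success):
    return (n_attempts
            * fraction_undeterred(evasion_effort_hours, attacker_budgets)
            * harm_per_success)

budgets = [1, 5, 20, 100]  # hypothetical attacker effort budgets (hours)
weak = expected_harm(1000, 2, budgets, harm_per_success=1.0)
strong = expected_harm(1000, 50, budgets, harm_per_success=1.0)
print(weak, strong)  # stronger safeguards -> fewer undeterred attackers
```

Re-running the calculation whenever the red-team effort estimate changes is what gives the developer a continuous risk signal during deployment.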
- Information Technology > Security & Privacy (1.00)
- Government (0.93)
An alignment safety case sketch based on debate
Buhl, Marie Davidsen, Pfau, Jacob, Hilton, Benjamin, Irving, Geoffrey
If AI systems match or exceed human capabilities on a wide range of tasks, it may become difficult for humans to efficiently judge their actions -- making it hard to use human feedback to steer them towards desirable traits. One proposed solution is to leverage another superhuman system to point out flaws in the system's outputs via a debate. This paper outlines the value of debate for AI safety, as well as the assumptions and further research required to make debate work. It does so by sketching an "alignment safety case" -- an argument that an AI system will not autonomously take actions which could lead to egregious harm, despite being able to do so. The sketch focuses on the risk of an AI R&D agent inside an AI company sabotaging research, for example by producing false results. To prevent this, the agent is trained via debate, subject to exploration guarantees, to teach the system to be honest. Honesty is maintained throughout deployment via online training. The safety case rests on four key claims: (1) the agent has become good at the debate game, (2) good performance in the debate game implies that the system is mostly honest, (3) the system will not become significantly less honest during deployment, and (4) the deployment context is tolerant of some errors. We identify open research problems that, if solved, could render this a compelling argument that an AI system is safe.
- North America > United States (0.46)
- Europe > United Kingdom (0.46)
- Research Report (0.66)
- Overview (0.48)
- Information Technology > Security & Privacy (0.93)
- Education > Educational Setting > Online (0.71)
Safety Cases: A Scalable Approach to Frontier AI Safety
Hilton, Benjamin, Buhl, Marie Davidsen, Korbak, Tomek, Irving, Geoffrey
Safety cases - clear, assessable arguments for the safety of a system in a given context - are a widely-used technique across various industries for showing a decision-maker (e.g. boards, customers, third parties) that a system is safe. In this paper, we cover how and why frontier AI developers might also want to use safety cases. We then argue that writing and reviewing safety cases would substantially assist in the fulfilment of many of the Frontier AI Safety Commitments. Finally, we outline open research questions on the methodology, implementation, and technical details of safety cases.
- Europe > United Kingdom (0.28)
- Asia > South Korea > Seoul > Seoul (0.05)
- Government (0.93)
- Information Technology > Security & Privacy (0.47)