safety case
A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring
As AI systems approach dangerous capability levels where inability safety cases become insufficient, we need alternative approaches to ensure safety. This paper presents a roadmap for constructing safety cases based on chain-of-thought (CoT) monitoring in reasoning models and outlines our research agenda. We argue that CoT monitoring might support both control and trustworthiness safety cases. We propose a two-part safety case: (1) establishing that models lack dangerous capabilities when operating without their CoT, and (2) ensuring that any dangerous capabilities enabled by a CoT are detectable by CoT monitoring. We systematically examine two threats to monitorability: neuralese and encoded reasoning, which we categorize into three forms (linguistic drift, steganography, and alien reasoning) and analyze their potential drivers. We evaluate existing and novel techniques for maintaining CoT faithfulness. For cases where models produce non-monitorable reasoning, we explore the possibility of extracting a monitorable CoT from a non-monitorable CoT. To assess the viability of CoT monitoring safety cases, we establish prediction markets to aggregate forecasts on key technical milestones influencing their feasibility.
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)
AI Testing Should Account for Sophisticated Strategic Behaviour
Kovarik, Vojtech, Chen, Eric Olav, Petersen, Sami, Ghersengorin, Alexis, Conitzer, Vincent
This position paper argues for two claims regarding AI testing and evaluation. First, to remain informative about deployment behaviour, evaluations need to account for the possibility that AI systems understand their circumstances and reason strategically. Second, game-theoretic analysis can inform evaluation design by formalising and scrutinising the reasoning in evaluation-based safety cases. Drawing on examples from existing AI systems, a review of relevant research, and formal strategic analysis of a stylised evaluation scenario, we present evidence for these claims and motivate several research directions.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
- Europe > Czechia > Prague (0.04)
- Asia > Macao (0.04)
- Information Technology > Security & Privacy (1.00)
- Leisure & Entertainment > Games (0.94)
Operationalization of Scenario-Based Safety Assessment of Automated Driving Systems
Camp, Olaf Op den, de Gelder, Erwin
Olaf Op den Camp (Integrated Vehicle Safety, TNO, Helmond, the Netherlands; ORCID 0000-0002-6355-134X) and Erwin de Gelder (Integrated Vehicle Safety, TNO, Helmond, the Netherlands; ORCID 0000-0003-4260-4294). Abstract -- Before introducing an Automated Driving System (ADS) on the road at scale, the manufacturer must conduct some sort of safety assurance. To structure and harmonize the safety assurance process, the UNECE WP.29 Working Party on Automated/Autonomous and Connected Vehicles (GRVA) is developing the New Assessment/Test Method (NATM) that indicates what steps need to be taken for safety assessment of an ADS. In this paper, we will show how to practically conduct safety assessment making use of a scenario database, and what additional steps must be taken to fully operationalize the NATM. In addition, we will elaborate on how the use of scenario databases fits with methods developed in the Horizon Europe projects that focus on safety assessment following the NATM approach. A safety assurance process conducted by the manufacturer before introducing an Automated Driving System (ADS) is intended to assure that the ADS responds appropriately in all situations it is designed for and that the ADS is able to avoid any reasonably foreseeable and reasonably preventable collisions. The information from the safety assurance process is important not only for manufacturers, but also for authorities that have the responsibility to guard the safety of their citizens in traffic. Safety assurance is most important for consumers (and fleet owners) using an ADS with the expectation that the system is safe, reliable, and trustworthy. To structure and harmonize this process, the UNECE WP.29 Working Party on Automated/Autonomous and Connected Vehicles (GRVA) is developing the New Assessment/Test Method (NATM) [1], which is already recognized across many countries (e.g., Japan, South Korea, the EU and the USA).
- Europe > Netherlands (0.44)
- Asia > South Korea (0.24)
- Asia > Japan (0.24)
- Law (1.00)
- Transportation > Ground > Road (0.92)
- Information Technology > Robotics & Automation (0.82)
- Automobiles & Trucks (0.82)
Evaluating Frontier Models for Stealth and Situational Awareness
Phuong, Mary, Zimmermann, Roland S., Wang, Ziyue, Lindner, David, Krakovna, Victoria, Cogan, Sarah, Dafoe, Allan, Ho, Lewis, Shah, Rohin
Recent work has demonstrated the plausibility of frontier AI models scheming -- knowingly and covertly pursuing an objective misaligned with their developers' intentions. Such behavior could be very hard to detect, and if present in future advanced systems, could pose severe loss of control risk. It is therefore important for AI developers to rule out harm from scheming prior to model deployment. In this paper, we present a suite of scheming reasoning evaluations measuring two types of reasoning capabilities that we believe are prerequisites for successful scheming: First, we propose five evaluations of the ability to reason about and circumvent oversight (stealth). Second, we present eleven evaluations for measuring a model's ability to instrumentally reason about itself, its environment and its deployment (situational awareness). We demonstrate how these evaluations can be used as part of a scheming inability safety case: a model that does not succeed on these evaluations is almost certainly incapable of causing severe harm via scheming in real deployment. We run our evaluations on current frontier models and find that none of them show concerning levels of either situational awareness or stealth.
- Research Report > New Finding (0.67)
- Research Report > Experimental Study (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.96)
The DevSafeOps Dilemma: A Systematic Literature Review on Rapidity in Safe Autonomous Driving Development and Operation
Nouri, Ali, Cabrero-Daniel, Beatriz, Törner, Fredrik, Berger, Christian
Developing autonomous driving (AD) systems is challenging due to the complexity of the systems and the need to assure their safe and reliable operation. The widely adopted approach of DevOps seems promising to support the continuous technological progress in AI and the demand for fast reaction to incidents, which necessitate continuous development, deployment, and monitoring. We present a systematic literature review meant to identify, analyse, and synthesise a broad range of existing literature related to usage of DevOps in autonomous driving development. Our results provide a structured overview of challenges and solutions, arising from applying DevOps to safety-related AI-enabled functions. Our results indicate that there are still several open topics to be addressed to enable safe DevOps for the development of safe AD.
- Europe > Austria > Vienna (0.14)
- Europe > Sweden > Vaestra Goetaland > Gothenburg (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Research Report > New Finding (1.00)
- Research Report > Promising Solution (0.93)
- Transportation > Ground > Road (1.00)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
Systematic Hazard Analysis for Frontier AI using STPA
All of the frontier AI companies have published safety frameworks where they define capability thresholds and risk mitigations that determine how they will safely develop and deploy their models. Adoption of systematic approaches to risk modelling, based on established practices used in safety-critical industries, has been recommended; however, frontier AI companies currently do not describe in detail any structured approach to identifying and analysing hazards. STPA (Systems-Theoretic Process Analysis) is a systematic methodology for identifying how complex systems can become unsafe, leading to hazards. It achieves this by mapping out controllers and controlled processes, then analysing their interactions and feedback loops to understand how harmful outcomes could occur (Leveson & Thomas, 2018). We evaluate STPA's ability to broaden the scope, improve traceability and strengthen the robustness of safety assurance for frontier AI systems. Applying STPA to the threat model and scenario described in 'A Sketch of an AI Control Safety Case' (Korbak et al., 2025), we derive a list of Unsafe Control Actions. From these we select a subset and explore the Loss Scenarios that lead to them if left unmitigated. We find that STPA is able to identify causal factors that may be missed by unstructured hazard analysis methodologies, thereby improving robustness. We suggest STPA could increase the safety assurance of frontier AI when used to complement or check coverage of existing AI governance techniques including capability thresholds, model evaluations and emergency procedures. The application of a systematic methodology supports scalability by increasing the proportion of the analysis that could be conducted by LLMs, reducing the burden on human domain experts.
- Asia > Japan > Honshū > Tōhoku > Fukushima Prefecture > Fukushima (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Asia > Afghanistan (0.04)
- Energy > Power Industry > Utilities > Nuclear (0.93)
- Transportation (0.68)
- Information Technology > Security & Privacy (0.68)
- Aerospace & Defense (0.68)
An alignment safety case sketch based on debate
Buhl, Marie Davidsen, Pfau, Jacob, Hilton, Benjamin, Irving, Geoffrey
If AI systems match or exceed human capabilities on a wide range of tasks, it may become difficult for humans to efficiently judge their actions -- making it hard to use human feedback to steer them towards desirable traits. One proposed solution is to leverage another superhuman system to point out flaws in the system's outputs via a debate. This paper outlines the value of debate for AI safety, as well as the assumptions and further research required to make debate work. It does so by sketching an "alignment safety case" -- an argument that an AI system will not autonomously take actions which could lead to egregious harm, despite being able to do so. The sketch focuses on the risk of an AI R&D agent inside an AI company sabotaging research, for example by producing false results. To prevent this, the agent is trained via debate, subject to exploration guarantees, to teach the system to be honest. Honesty is maintained throughout deployment via online training. The safety case rests on four key claims: (1) the agent has become good at the debate game, (2) good performance in the debate game implies that the system is mostly honest, (3) the system will not become significantly less honest during deployment, and (4) the deployment context is tolerant of some errors. We identify open research problems that, if solved, could render this a compelling argument that an AI system is safe.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- North America > United States > New York (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- Research Report (0.66)
- Overview (0.48)
- Information Technology > Security & Privacy (0.93)
- Education > Educational Setting > Online (0.71)
An Example Safety Case for Safeguards Against Misuse
Clymer, Joshua, Weinbaum, Jonah, Kirk, Robert, Mai, Kimberly, Zhang, Selena, Davies, Xander
Existing evaluations of AI misuse safeguards provide a patchwork of evidence that is often difficult to connect to real-world decisions. To bridge this gap, we describe an end-to-end argument (a "safety case") that misuse safeguards reduce the risk posed by an AI assistant to low levels. We first describe how a hypothetical developer red teams safeguards, estimating the effort required to evade them. Then, the developer plugs this estimate into a quantitative "uplift model" to determine how much the barriers introduced by safeguards dissuade misuse (https://www.aimisusemodel.com/). This procedure provides a continuous signal of risk during deployment that helps the developer rapidly respond to emerging threats. Finally, we describe how to tie these components together into a simple safety case. Our work provides one concrete path -- though not the only path -- to rigorously justifying that AI misuse risks are low.
- Information Technology > Security & Privacy (1.00)
- Government (0.93)
Safety Cases: A Scalable Approach to Frontier AI Safety
Hilton, Benjamin, Buhl, Marie Davidsen, Korbak, Tomek, Irving, Geoffrey
Safety cases - clear, assessable arguments for the safety of a system in a given context - are a widely-used technique across various industries for showing a decision-maker (e.g. boards, customers, third parties) that a system is safe. In this paper, we cover how and why frontier AI developers might also want to use safety cases. We then argue that writing and reviewing safety cases would substantially assist in the fulfilment of many of the Frontier AI Safety Commitments. Finally, we outline open research questions on the methodology, implementation, and technical details of safety cases.
- Government (0.93)
- Information Technology > Security & Privacy (0.47)
Upstream and Downstream AI Safety: Both on the Same River?
McDermid, John, Jia, Yan, Habli, Ibrahim
Traditional safety engineering assesses systems in their context of use, e.g. the operational design domain (road layout, speed limits, weather, etc.) for self-driving vehicles (including those using AI). We refer to this as downstream safety. In contrast, work on safety of frontier AI, e.g. large language models which can be further trained for downstream tasks, typically considers factors that are beyond specific application contexts, such as the ability of the model to evade human control, or to produce harmful content, e.g. how to make bombs. We refer to this as upstream safety. We outline the characteristics of both upstream and downstream safety frameworks then explore the extent to which the broad AI safety community can benefit from synergies between these frameworks. For example, can concepts such as common mode failures from downstream safety be used to help assess the strength of AI guardrails? Further, can the understanding of the capabilities and limitations of frontier AI be used to inform downstream safety analysis, e.g. where LLMs are fine-tuned to calculate voyage plans for autonomous vessels? The paper identifies some promising avenues to explore and outlines some challenges in achieving synergy, or a confluence, between upstream and downstream safety frameworks.
- North America > United States (0.46)
- Europe > France (0.04)
- Europe > Germany (0.04)
- Transportation (1.00)
- Health & Medicine (1.00)
- Government (1.00)