A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring
As AI systems approach dangerous capability levels where inability safety cases become insufficient, we need alternative approaches to ensure safety. This paper presents a roadmap for constructing safety cases based on chain-of-thought (CoT) monitoring in reasoning models and outlines our research agenda. We argue that CoT monitoring might support both control and trustworthiness safety cases. We propose a two-part safety case: (1) establishing that models lack dangerous capabilities when operating without their CoT, and (2) ensuring that any dangerous capabilities enabled by a CoT are detectable by CoT monitoring. We systematically examine two threats to monitorability: neuralese and encoded reasoning, which we categorize into three forms (linguistic drift, steganography, and alien reasoning) and analyze their potential drivers. We evaluate existing and novel techniques for maintaining CoT faithfulness. For cases where models produce non-monitorable reasoning, we explore the possibility of extracting a monitorable CoT from a non-monitorable CoT. To assess the viability of CoT monitoring safety cases, we establish prediction markets to aggregate forecasts on key technical milestones influencing their feasibility.
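The two-part structure described above lends itself to a simple deployment gate. The following Python sketch is purely illustrative (my own toy construction, not the paper's method): part 1 checks that no-CoT capability evaluations stay below an assumed danger threshold, and part 2 uses a keyword matcher as a stand-in for a learned CoT monitor, requiring an assumed recall on known-dangerous chains of thought. All names and thresholds are hypothetical.

```python
# Hypothetical sketch of the two-part CoT-monitoring safety case as a gate.
# Thresholds, evaluation names, and the keyword "monitor" are illustrative
# assumptions, not the paper's actual method.

def no_cot_capability_ok(eval_scores, threshold=0.2):
    """Part 1: the model, run without a chain of thought, stays below a
    dangerous-capability threshold on every evaluation."""
    return all(score < threshold for score in eval_scores.values())

def cot_monitor_flags(cot_text, danger_markers):
    """Part 2 (toy stand-in for a learned monitor): flag a CoT if it
    contains any dangerous-reasoning marker."""
    return [m for m in danger_markers if m in cot_text.lower()]

def safety_case_holds(eval_scores, sampled_cots, danger_markers,
                      recall_required=0.95):
    """The case holds if (1) no-CoT capability is low and (2) the monitor
    catches a sufficient fraction of known-dangerous CoTs."""
    if not no_cot_capability_ok(eval_scores):
        return False
    dangerous = [c for c, is_bad in sampled_cots if is_bad]
    caught = [c for c in dangerous if cot_monitor_flags(c, danger_markers)]
    recall = len(caught) / len(dangerous) if dangerous else 1.0
    return recall >= recall_required

scores = {"bio": 0.05, "cyber": 0.10}
cots = [("first, synthesize the pathogen precursor", True),
        ("let's add the two numbers", False)]
print(safety_case_holds(scores, cots, ["pathogen"]))  # -> True
```

The gate fails closed: either a high no-CoT capability score or insufficient monitor recall blocks deployment.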
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)
AI Testing Should Account for Sophisticated Strategic Behaviour
Kovarik, Vojtech, Chen, Eric Olav, Petersen, Sami, Ghersengorin, Alexis, Conitzer, Vincent
This position paper argues for two claims regarding AI testing and evaluation. First, to remain informative about deployment behaviour, evaluations need to account for the possibility that AI systems understand their circumstances and reason strategically. Second, game-theoretic analysis can inform evaluation design by formalising and scrutinising the reasoning in evaluation-based safety cases. Drawing on examples from existing AI systems, a review of relevant research, and a formal strategic analysis of a stylised evaluation scenario, we present evidence for these claims and motivate several research directions.
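The core game-theoretic point can be made with a very small model. The sketch below is my own stylization, not the paper's formal scenario: an AI system that can classify episodes as "evaluation" or "deployment" with accuracy `acc` compares always complying against a conditional strategy (comply when it looks like an evaluation, defect otherwise). All payoff numbers are illustrative assumptions.

```python
# Toy strategic model (illustrative assumptions, not the paper's analysis).
# An AI is evaluated with probability p_eval; defecting during an evaluation
# is caught and penalized, while defecting in deployment pays off.

def expected_payoff(strategy, acc, p_eval,
                    r_comply=1.0, r_defect=3.0, penalty=-10.0):
    """Expected payoff to the AI under the given strategy."""
    if strategy == "always_comply":
        return r_comply
    # Conditional strategy: comply iff the episode is classified as an eval.
    payoff = p_eval * (acc * r_comply + (1 - acc) * penalty)        # evals
    payoff += (1 - p_eval) * (acc * r_defect + (1 - acc) * r_comply)  # deploy
    return payoff

# With accurate eval-detection the conditional strategy dominates, so a
# passed evaluation says little about deployment behaviour.
for acc in (0.5, 0.99):
    cond = expected_payoff("conditional", acc, p_eval=0.5)
    beats_honest = cond > expected_payoff("always_comply", acc, 0.5)
    print(acc, cond, beats_honest)
```

Under these assumed payoffs, a system that cannot distinguish evaluation from deployment (acc = 0.5) prefers to comply, while an accurate classifier (acc = 0.99) prefers the conditional strategy, which is the sense in which naive evaluations stop being informative.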
- North America > United States (0.28)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- Information Technology > Security & Privacy (1.00)
- Leisure & Entertainment > Games (0.94)
Operationalization of Scenario-Based Safety Assessment of Automated Driving Systems
Camp, Olaf Op den, de Gelder, Erwin
Olaf Op den Camp (Integrated Vehicle Safety, TNO, Helmond, the Netherlands; ORCID 0000-0002-6355-134X) and Erwin de Gelder (Integrated Vehicle Safety, TNO, Helmond, the Netherlands; ORCID 0000-0003-4260-4294). Abstract -- Before introducing an Automated Driving System (ADS) on the road at scale, the manufacturer must conduct some form of safety assurance. To structure and harmonize the safety assurance process, the UNECE WP.29 Working Party on Automated/Autonomous and Connected Vehicles (GRVA) is developing the New Assessment/Test Method (NATM), which indicates what steps need to be taken for the safety assessment of an ADS. In this paper, we show how to practically conduct safety assessment using a scenario database, and what additional steps must be taken to fully operationalize the NATM. In addition, we elaborate on how the use of scenario databases fits with methods developed in the Horizon Europe projects that focus on safety assessment following the NATM approach. The safety assurance process conducted by the manufacturer before introducing an ADS is intended to assure that the ADS responds appropriately in all situations it is designed for and that it is able to avoid any reasonably foreseeable and reasonably preventable collisions. The information from the safety assurance process is important not only for manufacturers, but also for authorities that have the responsibility to guard the safety of their citizens in traffic. Safety assurance is most important for consumers (and fleet owners) using an ADS with the expectation that the system is safe, reliable, and trustworthy. To structure and harmonize this process, the UNECE WP.29 GRVA is developing the New Assessment/Test Method (NATM) [1], which is already recognized across many countries (e.g., Japan, South Korea, the EU, and the USA).
- Europe > Netherlands (0.44)
- Asia > South Korea (0.24)
- Asia > Japan (0.24)
- (4 more...)
- Law (1.00)
- Transportation > Ground > Road (0.92)
- Information Technology > Robotics & Automation (0.82)
- Automobiles & Trucks (0.82)
Manipulation Attacks by Misaligned AI: Risk Analysis and Safety Case Framework
Dassanayake, Rishane, Demetroudi, Mario, Walpole, James, Lentati, Lindley, Brown, Jason R., Young, Edward James
Frontier AI systems are rapidly advancing in their capabilities to persuade, deceive, and influence human behaviour, with current models already demonstrating human-level persuasion and strategic deception in specific contexts. Humans are often the weakest link in cybersecurity systems, and a misaligned AI system deployed internally within a frontier company may seek to undermine human oversight by manipulating employees. Despite this growing threat, manipulation attacks have received little attention, and no systematic framework exists for assessing and mitigating these risks. To address this, we provide a detailed explanation of why manipulation attacks are a significant threat and could lead to catastrophic outcomes. Additionally, we present a safety case framework for manipulation risk, structured around three core lines of argument: inability, control, and trustworthiness. For each argument, we specify evidence requirements, evaluation methodologies, and implementation considerations for direct application by AI companies. This paper provides the first systematic methodology for integrating manipulation risk into AI safety governance, offering AI companies a concrete foundation to assess and mitigate these threats before deployment.
- Information Technology > Security & Privacy (1.00)
- Government > Military > Cyberwarfare (0.48)
Evaluating Frontier Models for Stealth and Situational Awareness
Phuong, Mary, Zimmermann, Roland S., Wang, Ziyue, Lindner, David, Krakovna, Victoria, Cogan, Sarah, Dafoe, Allan, Ho, Lewis, Shah, Rohin
Recent work has demonstrated the plausibility of frontier AI models scheming -- knowingly and covertly pursuing an objective misaligned with their developers' intentions. Such behavior could be very hard to detect and, if present in future advanced systems, could pose a severe loss-of-control risk. It is therefore important for AI developers to rule out harm from scheming prior to model deployment. In this paper, we present a suite of scheming reasoning evaluations measuring two types of reasoning capabilities that we believe are prerequisites for successful scheming. First, we propose five evaluations of a model's ability to reason about and circumvent oversight (stealth). Second, we present eleven evaluations measuring a model's ability to instrumentally reason about itself, its environment, and its deployment (situational awareness). We demonstrate how these evaluations can be used as part of a scheming inability safety case: a model that does not succeed on these evaluations is almost certainly incapable of causing severe harm via scheming in real deployment. We run our evaluations on current frontier models and find that none of them shows concerning levels of either situational awareness or stealth.
- Research Report > New Finding (0.67)
- Research Report > Experimental Study (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.96)
The DevSafeOps Dilemma: A Systematic Literature Review on Rapidity in Safe Autonomous Driving Development and Operation
Nouri, Ali, Cabrero-Daniel, Beatriz, Törner, Fredrik, Berger, Christian
Developing autonomous driving (AD) systems is challenging due to the complexity of the systems and the need to assure their safe and reliable operation. The widely adopted DevOps approach seems promising for supporting continuous technological progress in AI and the demand for fast reaction to incidents, which necessitates continuous development, deployment, and monitoring. We present a systematic literature review that identifies, analyses, and synthesises a broad range of existing literature on the use of DevOps in autonomous driving development. Our results provide a structured overview of the challenges and solutions arising from applying DevOps to safety-related AI-enabled functions. They also indicate that several open topics must still be addressed to enable safe DevOps for the development of safe AD.
- Europe > Austria > Vienna (0.14)
- Europe > Sweden > Vaestra Goetaland > Gothenburg (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (9 more...)
- Research Report > New Finding (1.00)
- Research Report > Promising Solution (0.93)
- Transportation > Ground > Road (1.00)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- (3 more...)
Systematic Hazard Analysis for Frontier AI using STPA
All of the frontier AI companies have published safety frameworks in which they define capability thresholds and risk mitigations that determine how they will safely develop and deploy their models. Adoption of systematic approaches to risk modelling, based on established practices used in safety-critical industries, has been recommended; however, frontier AI companies currently do not describe in detail any structured approach to identifying and analysing hazards. STPA (Systems-Theoretic Process Analysis) is a systematic methodology for identifying how complex systems can become unsafe, leading to hazards. It achieves this by mapping out controllers and controlled processes, then analysing their interactions and feedback loops to understand how harmful outcomes could occur (Leveson & Thomas, 2018). We evaluate STPA's ability to broaden the scope, improve the traceability, and strengthen the robustness of safety assurance for frontier AI systems. Applying STPA to the threat model and scenario described in 'A Sketch of an AI Control Safety Case' (Korbak et al., 2025), we derive a list of Unsafe Control Actions. From these we select a subset and explore the Loss Scenarios that lead to them if left unmitigated. We find that STPA can identify causal factors that may be missed by unstructured hazard analysis methodologies, thereby improving robustness. We suggest STPA could increase the safety assurance of frontier AI when used to complement or check the coverage of existing AI governance techniques, including capability thresholds, model evaluations, and emergency procedures. The application of a systematic methodology supports scalability by increasing the proportion of the analysis that can be conducted by LLMs, reducing the burden on human domain experts.
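The UCA-derivation step of STPA is mechanical enough to sketch in a few lines. The snippet below crosses each control action with the four standard STPA guidewords (Leveson & Thomas, 2018); the controller and control actions shown are hypothetical examples in the spirit of an AI-control setting, not the paper's actual list.

```python
# Minimal sketch of deriving candidate Unsafe Control Actions (UCAs) in
# STPA: each control action is crossed with the four standard guidewords.
# An analyst then keeps only the candidates that lead to a hazard in context.
# The example controller and actions below are hypothetical.

from itertools import product

GUIDEWORDS = [
    "not provided",
    "provided",
    "provided too early, too late, or out of order",
    "stopped too soon or applied too long",
]

def derive_ucas(controller, control_actions):
    """Enumerate candidate UCAs for one controller."""
    return [f"{controller}: '{action}' {gw} causes a hazard"
            for action, gw in product(control_actions, GUIDEWORDS)]

ucas = derive_ucas("monitoring system",
                   ["flag suspicious agent action", "halt agent"])
print(len(ucas))  # 2 actions x 4 guidewords -> 8 candidates
for u in ucas[:2]:
    print(u)
```

Enumerating the full cross product before filtering is what gives STPA its coverage advantage over unstructured brainstorming: no action-guideword pair is silently skipped.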
- Asia > Japan > Honshū > Tōhoku > Fukushima Prefecture > Fukushima (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Asia > Afghanistan (0.04)
- Energy > Power Industry > Utilities > Nuclear (0.93)
- Transportation (0.68)
- Information Technology > Security & Privacy (0.68)
- Aerospace & Defense (0.68)
An Example Safety Case for Safeguards Against Misuse
Clymer, Joshua, Weinbaum, Jonah, Kirk, Robert, Mai, Kimberly, Zhang, Selena, Davies, Xander
Existing evaluations of AI misuse safeguards provide a patchwork of evidence that is often difficult to connect to real-world decisions. To bridge this gap, we describe an end-to-end argument (a "safety case") that misuse safeguards reduce the risk posed by an AI assistant to low levels. We first describe how a hypothetical developer red teams safeguards, estimating the effort required to evade them. Then, the developer plugs this estimate into a quantitative "uplift model" to determine how much the barriers introduced by safeguards dissuade misuse (https://www.aimisusemodel.com/). This procedure provides a continuous signal of risk during deployment that helps the developer rapidly respond to emerging threats. Finally, we describe how to tie these components together into a simple safety case. Our work provides one concrete path -- though not the only path -- to rigorously justifying that AI misuse risks are low.
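The red-team-to-risk pipeline can be illustrated with a toy calculation. The sketch below is only in the spirit of the paper's uplift model (the actual model at aimisusemodel.com differs); the functional form, attacker budgets, and harm numbers are my illustrative assumptions. The red-team estimate of the effort needed to evade safeguards feeds a deterrence step: would-be misusers whose effort budget falls short are dissuaded.

```python
# Toy uplift-style risk calculation (illustrative assumptions only, not the
# model from aimisusemodel.com). Risk = attempts x undeterred fraction x harm.

def fraction_undeterred(evasion_effort_hours, attacker_budgets):
    """Share of would-be misusers able and willing to spend at least the
    red-team-estimated effort needed to evade the safeguards."""
    able = [b for b in attacker_budgets if b >= evasion_effort_hours]
    return len(able) / len(attacker_budgets)

def expected_harm(n_attempts, evasion_effort_hours, attacker_budgets,
                  harm_per_success):
    return (n_attempts
            * fraction_undeterred(evasion_effort_hours, attacker_budgets)
            * harm_per_success)

budgets = [1, 5, 20, 100]  # hypothetical attacker effort budgets (hours)
weak = expected_harm(1000, 2, budgets, harm_per_success=1.0)
strong = expected_harm(1000, 50, budgets, harm_per_success=1.0)
print(weak, strong)  # stronger safeguards -> fewer undeterred attackers
```

Re-running the calculation whenever the red-team effort estimate changes is what gives the developer a continuous risk signal during deployment.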
- Information Technology > Security & Privacy (1.00)
- Government (0.93)
An alignment safety case sketch based on debate
Buhl, Marie Davidsen, Pfau, Jacob, Hilton, Benjamin, Irving, Geoffrey
If AI systems match or exceed human capabilities on a wide range of tasks, it may become difficult for humans to efficiently judge their actions -- making it hard to use human feedback to steer them towards desirable traits. One proposed solution is to leverage another superhuman system to point out flaws in the system's outputs via a debate. This paper outlines the value of debate for AI safety, as well as the assumptions and further research required to make debate work. It does so by sketching an "alignment safety case" -- an argument that an AI system will not autonomously take actions which could lead to egregious harm, despite being able to do so. The sketch focuses on the risk of an AI R&D agent inside an AI company sabotaging research, for example by producing false results. To prevent this, the agent is trained via debate, subject to exploration guarantees, to teach the system to be honest. Honesty is maintained throughout deployment via online training. The safety case rests on four key claims: (1) the agent has become good at the debate game, (2) good performance in the debate game implies that the system is mostly honest, (3) the system will not become significantly less honest during deployment, and (4) the deployment context is tolerant of some errors. We identify open research problems that, if solved, could render this a compelling argument that an AI system is safe.
- North America > United States (0.46)
- Europe > United Kingdom (0.46)
- Research Report (0.66)
- Overview (0.48)
- Information Technology > Security & Privacy (0.93)
- Education > Educational Setting > Online (0.71)
Safety Cases: A Scalable Approach to Frontier AI Safety
Hilton, Benjamin, Buhl, Marie Davidsen, Korbak, Tomek, Irving, Geoffrey
Safety cases - clear, assessable arguments for the safety of a system in a given context - are a widely-used technique across various industries for showing a decision-maker (e.g. boards, customers, third parties) that a system is safe. In this paper, we cover how and why frontier AI developers might also want to use safety cases. We then argue that writing and reviewing safety cases would substantially assist in the fulfilment of many of the Frontier AI Safety Commitments. Finally, we outline open research questions on the methodology, implementation, and technical details of safety cases.
- Europe > United Kingdom (0.28)
- Asia > South Korea > Seoul > Seoul (0.05)
- Government (0.93)
- Information Technology > Security & Privacy (0.47)