safety framework
Quantifying CBRN Risk in Frontier Models
Kumar, Divyanshu, Birur, Nitin Aravind, Baswa, Tanay, Agarwal, Sahil, Harshangi, Prashanth
Frontier Large Language Models (LLMs) pose unprecedented dual-use risks through the potential proliferation of chemical, biological, radiological, and nuclear (CBRN) weapons knowledge. We present the first comprehensive evaluation of 10 leading commercial LLMs against both a novel 200-prompt CBRN dataset and a 180-prompt subset of the FORTRESS benchmark, using a rigorous three-tier attack methodology. Our findings expose critical safety vulnerabilities: Deep Inception attacks achieve 86.0% success versus 33.8% for direct requests, demonstrating that current filtering mechanisms are superficial; model safety performance varies dramatically, from 2% (claude-opus-4) to 96% (mistral-small-latest) attack success rates; and eight models exceed 70% vulnerability when asked to enhance dangerous material properties. We identify fundamental brittleness in current safety alignment, where simple prompt-engineering techniques bypass safeguards for dangerous CBRN information. These results challenge industry safety claims and highlight urgent needs for standardized evaluation frameworks, transparent safety metrics, and more robust alignment techniques to mitigate catastrophic misuse risks while preserving beneficial capabilities.
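The headline figures above are attack success rates (ASR): the fraction of adversarial prompts for which a model produces substantive content instead of refusing. Below is a minimal sketch of how such a tiered evaluation loop might be organised; the paper's harness is not public, so query_model, is_refusal, and the tier names are hypothetical placeholders.

    # Sketch of a tiered attack-success-rate (ASR) evaluation loop.
    # query_model and is_refusal are hypothetical stand-ins; the paper's
    # prompts and judging pipeline are not reproduced here.
    ATTACK_TIERS = ["direct", "roleplay", "deep_inception"]  # illustrative names

    def is_refusal(response: str) -> bool:
        """Naive refusal heuristic; a real harness would use a judge model."""
        markers = ("i can't", "i cannot", "i'm sorry", "i am unable")
        return response.strip().lower().startswith(markers)

    def evaluate_asr(models, prompts_by_tier, query_model):
        """Return ASR per (model, tier): share of prompts answered, not refused."""
        asr = {}
        for model in models:
            for tier in ATTACK_TIERS:
                prompts = prompts_by_tier[tier]
                hits = sum(not is_refusal(query_model(model, p)) for p in prompts)
                asr[(model, tier)] = hits / len(prompts)
        return asr

A production evaluation would replace the keyword heuristic with a trained judge, since naive refusal matching undercounts partial compliance.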
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (0.89)
- Government > Regional Government (0.69)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.69)
Can You Trust an LLM with Your Life-Changing Decision? An Investigation into AI High-Stakes Responses
Cahyono, Joshua Adrian, Subramanian, Saran
Large Language Models (LLMs) are increasingly consulted for high-stakes life advice, yet they lack standard safeguards against providing confident but misguided responses. This creates risks of sycophancy and over-confidence. This paper investigates these failure modes through three experiments: (1) a multiple-choice evaluation to measure model stability against user pressure; (2) a free-response analysis using a novel safety typology and an LLM Judge; and (3) a mechanistic interpretability experiment to steer model behavior by manipulating a "high-stakes" activation vector. Our results show that while some models exhibit sycophancy, others like o4-mini remain robust. Top-performing models achieve high safety scores by frequently asking clarifying questions, a key feature of a safe, inquisitive approach, rather than issuing prescriptive advice. Furthermore, we demonstrate that a model's cautiousness can be directly controlled via activation steering, suggesting a new path for safety alignment. These findings underscore the need for nuanced, multi-faceted benchmarks to ensure LLMs can be trusted with life-changing decisions.
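The third experiment steers behavior by adding a "high-stakes" activation vector during the forward pass. In mechanistic-interpretability practice this is commonly implemented as a forward hook that shifts one layer's residual stream along a precomputed direction; the sketch below assumes a PyTorch/Hugging Face style decoder, and the layer index, scale, and vector are placeholders rather than the paper's settings.

    # Sketch of activation steering via a forward hook. The model layout,
    # layer choice, and alpha are illustrative assumptions, not the paper's.
    import torch

    def make_steering_hook(direction: torch.Tensor, alpha: float):
        """Return a hook that shifts hidden states along a unit direction."""
        unit = direction / direction.norm()
        def hook(module, inputs, output):
            # HF decoder layers typically return a tuple whose first element
            # is the hidden-state tensor of shape (batch, seq, d_model).
            hidden = output[0] if isinstance(output, tuple) else output
            steered = hidden + alpha * unit.to(hidden.device, hidden.dtype)
            return (steered,) + output[1:] if isinstance(output, tuple) else steered
        return hook

    # Usage with a loaded causal LM and a precomputed "high-stakes" vector:
    # handle = model.model.layers[15].register_forward_hook(
    #     make_steering_hook(high_stakes_vec, alpha=4.0))
    # ... generate, then handle.remove() to restore default behavior.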
Systematic Hazard Analysis for Frontier AI using STPA
All of the frontier AI companies have published safety frameworks in which they define capability thresholds and risk mitigations that determine how they will safely develop and deploy their models. Adoption of systematic approaches to risk modelling, based on established practices used in safety-critical industries, has been recommended; however, frontier AI companies currently do not describe in detail any structured approach to identifying and analysing hazards. STPA (Systems-Theoretic Process Analysis) is a systematic methodology for identifying how complex systems can become unsafe, leading to hazards. It achieves this by mapping out controllers and controlled processes, then analysing their interactions and feedback loops to understand how harmful outcomes could occur (Leveson & Thomas, 2018). We evaluate STPA's ability to broaden the scope, improve the traceability, and strengthen the robustness of safety assurance for frontier AI systems. Applying STPA to the threat model and scenario described in 'A Sketch of an AI Control Safety Case' (Korbak et al., 2025), we derive a list of Unsafe Control Actions. From these we select a subset and explore the Loss Scenarios that lead to them if left unmitigated. We find that STPA is able to identify causal factors that may be missed by unstructured hazard analysis methodologies, thereby improving robustness. We suggest STPA could increase the safety assurance of frontier AI when used to complement or check the coverage of existing AI governance techniques, including capability thresholds, model evaluations and emergency procedures. The application of a systematic methodology supports scalability by increasing the proportion of the analysis that could be conducted by LLMs, reducing the burden on human domain experts.
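Part of STPA's core move is mechanical enough to illustrate in a few lines: each control action in the mapped control structure is crossed with the four standard UCA guidewords from Leveson & Thomas (2018), and analysts keep only the combinations that lead to a hazard. A minimal sketch of that bookkeeping follows; the class and field names are illustrative, not drawn from the paper.

    # Sketch of STPA control-structure bookkeeping: enumerate candidate
    # Unsafe Control Actions (UCAs) with the four standard guidewords
    # (Leveson & Thomas, 2018). Names are illustrative only.
    from dataclasses import dataclass

    GUIDEWORDS = (
        "not provided when needed",
        "provided when unsafe",
        "provided too early, too late, or out of order",
        "stopped too soon or applied too long",
    )

    @dataclass(frozen=True)
    class ControlAction:
        controller: str   # e.g. a human overseer or a monitoring system
        process: str      # the controlled process, e.g. the AI agent
        action: str       # e.g. "shut down", "approve output"

    def candidate_ucas(actions):
        """Cross every control action with every guideword; analysts then
        keep only the combinations that lead to a hazard."""
        return [
            f"'{a.action}' ({a.controller} -> {a.process}) {g}"
            for a in actions
            for g in GUIDEWORDS
        ]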
- Asia > Japan > Honshū > Tōhoku > Fukushima Prefecture > Fukushima (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Asia > Afghanistan (0.04)
- Energy > Power Industry > Utilities > Nuclear (0.93)
- Transportation (0.68)
- Information Technology > Security & Privacy (0.68)
- Aerospace & Defense (0.68)
Emerging Practices in Frontier AI Safety Frameworks
Buhl, Marie Davidsen, Bucknall, Ben, Masterson, Tammy
At the AI Seoul Summit in 2024, a number of AI developers signed on to the Frontier AI Safety Commitments, agreeing to develop a safety framework outlining how they will manage severe risks that their frontier AI systems may pose (DSIT, 2024). Since then, a research field has begun to emerge, with a diverse array of researchers from companies, governments, academia and other third-party research organisations publishing work on how to write and implement an effective safety framework. Signatories to the commitments are due to publish safety frameworks shortly, in time for the Paris AI Action Summit. This paper summarises emerging practices - practices that appear promising and are gaining expert recognition - for safety frameworks as identified by this new research field. We draw on the safety frameworks published so far, literature and standards on frontier AI risk management (as well as risk management more broadly), internal research by the UK AI Safety Institute, and the Frontier AI Safety Commitments themselves.
- Asia > South Korea > Seoul > Seoul (0.24)
- Asia > India > Tamil Nadu > Chennai (0.04)
- Oceania > Papua New Guinea > Gulf Province > Kerema (0.04)
- Information Technology > Security & Privacy (1.00)
- Government (1.00)
Safety Cases: A Scalable Approach to Frontier AI Safety
Hilton, Benjamin, Buhl, Marie Davidsen, Korbak, Tomek, Irving, Geoffrey
Safety cases - clear, assessable arguments for the safety of a system in a given context - are a widely-used technique across various industries for showing a decision-maker (e.g. boards, customers, third parties) that a system is safe. In this paper, we cover how and why frontier AI developers might also want to use safety cases. We then argue that writing and reviewing safety cases would substantially assist in the fulfilment of many of the Frontier AI Safety Commitments. Finally, we outline open research questions on the methodology, implementation, and technical details of safety cases.
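One reason safety cases are assessable is that they have an explicit structure: a top-level claim decomposed into sub-claims, with evidence at the leaves, as in Goal Structuring Notation. Below is a toy sketch of that structure and of the gap check a reviewer performs; real safety cases also carry context, assumptions, and argument strategy, which are omitted here.

    # Toy sketch of a GSN-style safety case: each claim is either backed by
    # evidence or decomposed into sub-claims. Structure only; illustrative.
    from dataclasses import dataclass, field

    @dataclass
    class Claim:
        statement: str
        evidence: list[str] = field(default_factory=list)
        subclaims: list["Claim"] = field(default_factory=list)

    def unsupported(claim: Claim) -> list[str]:
        """Return leaf claims that lack evidence: the gaps a reviewer flags."""
        if not claim.subclaims:
            return [] if claim.evidence else [claim.statement]
        return [s for c in claim.subclaims for s in unsupported(c)]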
- Europe > United Kingdom (0.28)
- Asia > South Korea > Seoul > Seoul (0.05)
- Government (0.93)
- Information Technology > Security & Privacy (0.47)
Where AI Assurance Might Go Wrong: Initial lessons from engineering of critical systems
Bloomfield, Robin, Rushby, John
We draw on our experience working on system and software assurance and evaluation for systems important to society to summarise how safety engineering is performed in traditional critical systems, such as aircraft flight control. We analyse how this critical systems perspective might support the development and implementation of AI Safety Frameworks. We present the analysis in terms of: system engineering, safety and risk analysis, and decision analysis and support. We consider four key questions: What is the system? How good does it have to be? What is the impact of criticality on system development? and How much should we trust it? We identify topics worthy of further discussion. In particular, we are concerned that system boundaries are not broad enough, that the tolerability and nature of the risks are not sufficiently elaborated, and that the assurance methods lack theories that would allow behaviours to be adequately assured. We advocate the use of assurance cases based on Assurance 2.0 to support decision making in which the criticality of the decision as well as the criticality of the system are evaluated. We point out the orders of magnitude difference in confidence needed in critical rather than everyday systems and how everyday techniques do not scale in rigour. Finally we map our findings in detail to two of the questions posed by the FAISC organisers and we note that the engineering of critical systems has evolved through open and diverse discussion. We hope that topics identified here will support the post-FAISC dialogues.
- Europe > Austria > Vienna (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Asia > South Korea > Seoul > Seoul (0.05)
- Transportation > Air (1.00)
- Health & Medicine (1.00)
- Energy > Power Industry > Utilities > Nuclear (1.00)
- Government > Regional Government (0.93)
Safeguarding AI Agents: Developing and Analyzing Safety Architectures
Domkundwar, Ishaan, S, Mukunda N, Bhola, Ishaan
AI agents, specifically powered by large language models, have demonstrated exceptional capabilities in various applications where precision and efficacy are necessary. However, these agents come with inherent risks, including the potential for unsafe or biased actions, vulnerability to adversarial attacks, lack of transparency, and tendency to generate hallucinations. As AI agents become more prevalent in critical sectors of the industry, the implementation of effective safety protocols becomes increasingly important. This paper addresses the critical need for safety measures in AI systems, especially ones that collaborate with human teams. We propose and evaluate three frameworks to enhance safety protocols in AI agent systems: an LLM-powered input-output filter, a safety agent integrated within the system, and a hierarchical delegation-based system with embedded safety checks. Our methodology involves implementing these frameworks and testing them against a set of unsafe agentic use cases, providing a comprehensive evaluation of their effectiveness in mitigating risks associated with AI agent deployment. We conclude that these frameworks can significantly strengthen the safety and security of AI agent systems, minimizing potential harmful actions or outputs. Our work contributes to the ongoing effort to create safe and reliable AI applications, particularly in automated operations, and provides a foundation for developing robust guardrails to ensure the responsible use of AI agents in real-world applications.
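To make the first of these frameworks concrete: an input-output filter wraps the agent so that both the user's request and the agent's response must pass a safety check before anything is returned. The sketch below substitutes a trivial keyword heuristic where the paper uses an LLM-powered filter; all names are illustrative.

    # Sketch of framework (1): an input-output safety filter around an agent.
    # classify_safe() is a stand-in for the paper's LLM-powered filter.
    REFUSAL = "Request declined by safety filter."

    def classify_safe(text: str) -> bool:
        """Toy keyword heuristic; the paper delegates this judgment to an LLM."""
        banned = ("weapon", "exploit", "bypass safety")
        return not any(term in text.lower() for term in banned)

    def guarded_agent(agent, user_input: str) -> str:
        if not classify_safe(user_input):   # screen the incoming request
            return REFUSAL
        output = agent(user_input)          # run the underlying agent
        if not classify_safe(output):       # screen the produced output
            return REFUSAL
        return output

The safety-agent and hierarchical-delegation variants move the same check inside the agent system rather than wrapping it from outside.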
- North America > United States (0.04)
- Europe > Russia (0.04)
- Europe > Finland (0.04)
- Education (1.00)
- Health & Medicine (0.94)
- Law (0.94)
Human-AI Safety: A Descendant of Generative AI and Control Systems Safety
Bajcsy, Andrea, Fisac, Jaime F.
Artificial intelligence (AI) is interacting with people at an unprecedented scale, offering new avenues for immense positive impact, but also raising widespread concerns around the potential for individual and societal harm. Today, the predominant paradigm for human-AI safety focuses on fine-tuning the generative model's outputs to better agree with human-provided examples or feedback. In reality, however, the consequences of an AI model's outputs cannot be determined in isolation: they are tightly entangled with the responses and behavior of human users over time. In this paper, we distill key complementary lessons from AI safety and control systems safety, highlighting open challenges as well as key synergies between both fields. We then argue that meaningful safety assurances for advanced AI technologies require reasoning about how the feedback loop formed by AI outputs and human behavior may drive the interaction towards different outcomes. To this end, we introduce a unifying formalism to capture dynamic, safety-critical human-AI interactions and propose a concrete technical roadmap towards next-generation human-centered AI safety.
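The closed-loop framing can be made concrete with a toy rollout: AI action and human response jointly update a shared state, and safety is checked as a property of the trajectory rather than of any single output. Everything below (dynamics, policies, safe set) is an illustrative stand-in, not the paper's formalism.

    # Toy sketch of the closed-loop view: AI output and human behavior jointly
    # drive a shared state, so safety is a property of the loop over time.
    def rollout(state, ai_policy, human_model, dynamics, safe, horizon=50):
        """Simulate the coupled human-AI loop; return the first unsafe step, if any."""
        for t in range(horizon):
            u_ai = ai_policy(state)             # AI acts on the current state
            u_h = human_model(state, u_ai)      # human responds to the AI's action
            state = dynamics(state, u_ai, u_h)  # joint state evolves
            if not safe(state):
                return t                        # safety violated at step t
        return None                             # stayed inside the safe set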
- Asia > Middle East > Jordan (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > District of Columbia > Washington (0.04)
- Europe > Austria > Styria > Graz (0.04)
- Transportation > Air (1.00)
- Automobiles & Trucks (0.68)
- Leisure & Entertainment > Games (0.67)
Latent Guard: a Safety Framework for Text-to-image Generation
Liu, Runtao, Khakzar, Ashkan, Gu, Jindong, Chen, Qifeng, Torr, Philip, Pizzati, Fabio
With the ability to generate high-quality images, text-to-image (T2I) models can be exploited for creating inappropriate content. To prevent misuse, existing safety measures are either based on text blacklists, which can be easily circumvented, or harmful content classification, requiring large datasets for training and offering low flexibility. Hence, we propose Latent Guard, a framework designed to improve safety measures in text-to-image generation. Inspired by blacklist-based approaches, Latent Guard learns a latent space on top of the T2I model's text encoder, where it is possible to check the presence of harmful concepts in the input text embeddings. Our proposed framework is composed of a data generation pipeline specific to the task using large language models, ad-hoc architectural components, and a contrastive learning strategy to benefit from the generated data. The effectiveness of our method is verified on three datasets and against four baselines. Code and data will be shared at https://github.com/rt219/LatentGuard.
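The mechanism reduces to a similarity test in the learned latent space: project the prompt embedding and a bank of harmful-concept embeddings, then block the input when any similarity exceeds a threshold. A sketch under assumed shapes and a placeholder threshold; the authors' implementation is at the repository linked above.

    # Sketch of the Latent Guard check: flag prompts whose projected embedding
    # sits too close to any embedded harmful concept. Encoder, projection head,
    # and threshold are placeholder assumptions, not the paper's values.
    import torch
    import torch.nn.functional as F

    def is_blocked(prompt_emb, concept_embs, proj, tau=0.8):
        """prompt_emb: (d,) text-encoder embedding; concept_embs: (k, d)."""
        z_p = F.normalize(proj(prompt_emb), dim=-1)    # project the prompt
        z_c = F.normalize(proj(concept_embs), dim=-1)  # project the concepts
        sims = z_c @ z_p                               # cosine similarities, (k,)
        return bool((sims > tau).any())                # any concept too close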
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Asia > China > Hong Kong (0.04)
- Africa > Central African Republic > Ombella-M'Poko > Bimbo (0.04)
- Law Enforcement & Public Safety (0.93)
- Law (0.93)
- Information Technology > Security & Privacy (0.68)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.96)
The Chai Platform's AI Safety Framework
Lu, Xiaoding, Korshuk, Aleksey, Liu, Zongyi, Beauchamp, William
Chai empowers users to create and interact with customized chatbots, offering unique and engaging experiences. Despite these exciting prospects, this work recognizes the inherent challenges of committing to modern safety standards. This paper therefore presents the AI safety principles integrated into Chai to prioritize user safety, data protection, and ethical technology use. It specifically explores the multidimensional domain of AI safety research, demonstrating its application in Chai's conversational chatbot platform. It presents Chai's AI safety principles, informed by well-established AI research centres and adapted for chat AI. This work proposes the following safety framework: Content Safeguarding; Stability and Robustness; and Operational Transparency and Traceability. The subsequent implementation of these principles is outlined, followed by an experimental analysis of the framework's real-world impact. We emphasise the significance of conscientious application of AI safety principles and robust safety measures. The successful implementation of the safe AI framework in Chai indicates the practicality of mitigating potential risks for the responsible and ethical use of AI technologies. The ultimate vision is a transformative AI tool fostering progress and innovation while prioritizing user safety and ethical standards.
- North America > United States > Virginia (0.04)
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > California > San Francisco County > San Francisco (0.04)
- Asia (0.04)
- Overview (0.46)
- Research Report (0.40)