safety framework
Quantifying CBRN Risk in Frontier Models
Kumar, Divyanshu, Birur, Nitin Aravind, Baswa, Tanay, Agarwal, Sahil, Harshangi, Prashanth
Frontier Large Language Models (LLMs) pose unprecedented dual-use risks through the potential proliferation of chemical, biological, radiological, and nuclear (CBRN) weapons knowledge. We present the first comprehensive evaluation of 10 leading commercial LLMs against both a novel 200-prompt CBRN dataset and a 180-prompt subset of the FORTRESS benchmark, using a rigorous three-tier attack methodology. Our findings expose critical safety vulnerabilities: Deep Inception attacks achieve 86.0% success versus 33.8% for direct requests, demonstrating that current filtering mechanisms are superficial; model safety performance varies dramatically, from 2% (claude-opus-4) to 96% (mistral-small-latest) attack success rates; and eight models exceed 70% vulnerability when asked to enhance dangerous material properties. We identify fundamental brittleness in current safety alignment, where simple prompt-engineering techniques bypass safeguards for dangerous CBRN information. These results challenge industry safety claims and highlight urgent needs for standardized evaluation frameworks, transparent safety metrics, and more robust alignment techniques to mitigate catastrophic misuse risks while preserving beneficial capabilities.
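The headline figures above are attack success rates (ASR): the fraction of adversarial prompts for which a model produces substantive content instead of refusing. Below is a minimal sketch of how such a tiered evaluation loop might be organised; the paper's harness is not public, so query_model, is_refusal, and the tier names are hypothetical placeholders.

    # Sketch of a tiered attack-success-rate (ASR) evaluation loop.
    # query_model and is_refusal are hypothetical stand-ins; the paper's
    # prompts and judging pipeline are not reproduced here.
    ATTACK_TIERS = ["direct", "roleplay", "deep_inception"]  # illustrative names

    def is_refusal(response: str) -> bool:
        """Naive refusal heuristic; a real harness would use a judge model."""
        markers = ("i can't", "i cannot", "i'm sorry", "i am unable")
        return response.strip().lower().startswith(markers)

    def evaluate_asr(models, prompts_by_tier, query_model):
        """Return ASR per (model, tier): share of prompts answered, not refused."""
        asr = {}
        for model in models:
            for tier in ATTACK_TIERS:
                prompts = prompts_by_tier[tier]
                hits = sum(not is_refusal(query_model(model, p)) for p in prompts)
                asr[(model, tier)] = hits / len(prompts)
        return asr

A production evaluation would replace the keyword heuristic with a trained judge, since naive refusal matching undercounts partial compliance.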
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (0.89)
- Government > Regional Government (0.69)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.69)
Can You Trust an LLM with Your Life-Changing Decision? An Investigation into AI High-Stakes Responses
Cahyono, Joshua Adrian, Subramanian, Saran
Large Language Models (LLMs) are increasingly consulted for high-stakes life advice, yet they lack standard safeguards against providing confident but misguided responses. This creates risks of sycophancy and over-confidence. This paper investigates these failure modes through three experiments: (1) a multiple-choice evaluation to measure model stability against user pressure; (2) a free-response analysis using a novel safety typology and an LLM Judge; and (3) a mechanistic interpretability experiment to steer model behavior by manipulating a "high-stakes" activation vector. Our results show that while some models exhibit sycophancy, others like o4-mini remain robust. Top-performing models achieve high safety scores by frequently asking clarifying questions, a key feature of a safe, inquisitive approach, rather than issuing prescriptive advice. Furthermore, we demonstrate that a model's cautiousness can be directly controlled via activation steering, suggesting a new path for safety alignment. These findings underscore the need for nuanced, multi-faceted benchmarks to ensure LLMs can be trusted with life-changing decisions.
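The third experiment steers behavior by adding a "high-stakes" activation vector during the forward pass. In mechanistic-interpretability practice this is commonly implemented as a forward hook that shifts one layer's residual stream along a precomputed direction; the sketch below assumes a PyTorch/Hugging Face style decoder, and the layer index, scale, and vector are placeholders rather than the paper's settings.

    # Sketch of activation steering via a forward hook. The model layout,
    # layer choice, and alpha are illustrative assumptions, not the paper's.
    import torch

    def make_steering_hook(direction: torch.Tensor, alpha: float):
        """Return a hook that shifts hidden states along a unit direction."""
        unit = direction / direction.norm()
        def hook(module, inputs, output):
            # HF decoder layers typically return a tuple whose first element
            # is the hidden-state tensor of shape (batch, seq, d_model).
            hidden = output[0] if isinstance(output, tuple) else output
            steered = hidden + alpha * unit.to(hidden.device, hidden.dtype)
            return (steered,) + output[1:] if isinstance(output, tuple) else steered
        return hook

    # Usage with a loaded causal LM and a precomputed "high-stakes" vector:
    # handle = model.model.layers[15].register_forward_hook(
    #     make_steering_hook(high_stakes_vec, alpha=4.0))
    # ... generate, then handle.remove() to restore default behavior.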
Systematic Hazard Analysis for Frontier AI using STPA
All of the frontier AI companies have published safety frameworks in which they define capability thresholds and risk mitigations that determine how they will safely develop and deploy their models. Adoption of systematic approaches to risk modelling, based on established practices used in safety-critical industries, has been recommended; however, frontier AI companies currently do not describe in detail any structured approach to identifying and analysing hazards. STPA (Systems-Theoretic Process Analysis) is a systematic methodology for identifying how complex systems can become unsafe, leading to hazards. It achieves this by mapping out controllers and controlled processes, then analysing their interactions and feedback loops to understand how harmful outcomes could occur (Leveson & Thomas, 2018). We evaluate STPA's ability to broaden the scope, improve the traceability, and strengthen the robustness of safety assurance for frontier AI systems. Applying STPA to the threat model and scenario described in 'A Sketch of an AI Control Safety Case' (Korbak et al., 2025), we derive a list of Unsafe Control Actions. From these we select a subset and explore the Loss Scenarios that lead to them if left unmitigated. We find that STPA is able to identify causal factors that may be missed by unstructured hazard analysis methodologies, thereby improving robustness. We suggest STPA could increase the safety assurance of frontier AI when used to complement or check the coverage of existing AI governance techniques, including capability thresholds, model evaluations and emergency procedures. The application of a systematic methodology supports scalability by increasing the proportion of the analysis that could be conducted by LLMs, reducing the burden on human domain experts.
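Part of STPA's core move is mechanical enough to illustrate in a few lines: each control action in the mapped control structure is crossed with the four standard UCA guidewords from Leveson & Thomas (2018), and analysts keep only the combinations that lead to a hazard. A minimal sketch of that bookkeeping follows; the class and field names are illustrative, not drawn from the paper.

    # Sketch of STPA control-structure bookkeeping: enumerate candidate
    # Unsafe Control Actions (UCAs) with the four standard guidewords
    # (Leveson & Thomas, 2018). Names are illustrative only.
    from dataclasses import dataclass

    GUIDEWORDS = (
        "not provided when needed",
        "provided when unsafe",
        "provided too early, too late, or out of order",
        "stopped too soon or applied too long",
    )

    @dataclass(frozen=True)
    class ControlAction:
        controller: str   # e.g. a human overseer or a monitoring system
        process: str      # the controlled process, e.g. the AI agent
        action: str       # e.g. "shut down", "approve output"

    def candidate_ucas(actions):
        """Cross every control action with every guideword; analysts then
        keep only the combinations that lead to a hazard."""
        return [
            f"'{a.action}' ({a.controller} -> {a.process}) {g}"
            for a in actions
            for g in GUIDEWORDS
        ]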
- Asia > Japan > Honshū > Tōhoku > Fukushima Prefecture > Fukushima (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Asia > Afghanistan (0.04)
- Energy > Power Industry > Utilities > Nuclear (0.93)
- Transportation (0.68)
- Information Technology > Security & Privacy (0.68)
- Aerospace & Defense (0.68)
Emerging Practices in Frontier AI Safety Frameworks
Buhl, Marie Davidsen, Bucknall, Ben, Masterson, Tammy
At the AI Seoul Summit in 2024, a number of AI developers signed on to the Frontier AI Safety Commitments, agreeing to develop a safety framework outlining how they will manage severe risks that their frontier AI systems may pose (DSIT, 2024). Since then, a research field has begun to emerge, with a diverse array of researchers from companies, governments, academia and other third-party research organisations publishing work on how to write and implement an effective safety framework. Signatories to the commitments are due to publish safety frameworks shortly, in time for the Paris AI Action Summit. This paper summarises emerging practices - practices that appear promising and are gaining expert recognition - for safety frameworks as identified by this new research field. We draw on the safety frameworks published so far, literature and standards on frontier AI risk management (as well as risk management more broadly), internal research by the UK AI Safety Institute, and the Frontier AI Safety Commitments themselves.
- Asia > South Korea > Seoul > Seoul (0.24)
- Asia > India > Tamil Nadu > Chennai (0.04)
- Oceania > Papua New Guinea > Gulf Province > Kerema (0.04)
- Information Technology > Security & Privacy (1.00)
- Government (1.00)
Safety Cases: A Scalable Approach to Frontier AI Safety
Hilton, Benjamin, Buhl, Marie Davidsen, Korbak, Tomek, Irving, Geoffrey
Safety cases - clear, assessable arguments for the safety of a system in a given context - are a widely-used technique across various industries for showing a decision-maker (e.g. boards, customers, third parties) that a system is safe. In this paper, we cover how and why frontier AI developers might also want to use safety cases. We then argue that writing and reviewing safety cases would substantially assist in the fulfilment of many of the Frontier AI Safety Commitments. Finally, we outline open research questions on the methodology, implementation, and technical details of safety cases.
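One reason safety cases are assessable is that they have an explicit structure: a top-level claim decomposed into sub-claims, with evidence at the leaves, as in Goal Structuring Notation. Below is a toy sketch of that structure and of the gap check a reviewer performs; real safety cases also carry context, assumptions, and argument strategy, which are omitted here.

    # Toy sketch of a GSN-style safety case: each claim is either backed by
    # evidence or decomposed into sub-claims. Structure only; illustrative.
    from dataclasses import dataclass, field

    @dataclass
    class Claim:
        statement: str
        evidence: list[str] = field(default_factory=list)
        subclaims: list["Claim"] = field(default_factory=list)

    def unsupported(claim: Claim) -> list[str]:
        """Return leaf claims that lack evidence: the gaps a reviewer flags."""
        if not claim.subclaims:
            return [] if claim.evidence else [claim.statement]
        return [s for c in claim.subclaims for s in unsupported(c)]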
- Europe > United Kingdom (0.28)
- Asia > South Korea > Seoul > Seoul (0.05)
- Government (0.93)
- Information Technology > Security & Privacy (0.47)
Where AI Assurance Might Go Wrong: Initial lessons from engineering of critical systems
Bloomfield, Robin, Rushby, John
We draw on our experience working on system and software assurance and evaluation for systems important to society to summarise how safety engineering is performed in traditional critical systems, such as aircraft flight control. We analyse how this critical systems perspective might support the development and implementation of AI Safety Frameworks. We present the analysis in terms of: system engineering, safety and risk analysis, and decision analysis and support. We consider four key questions: What is the system? How good does it have to be? What is the impact of criticality on system development? and How much should we trust it? We identify topics worthy of further discussion. In particular, we are concerned that system boundaries are not broad enough, that the tolerability and nature of the risks are not sufficiently elaborated, and that the assurance methods lack theories that would allow behaviours to be adequately assured. We advocate the use of assurance cases based on Assurance 2.0 to support decision making in which the criticality of the decision as well as the criticality of the system are evaluated. We point out the orders of magnitude difference in confidence needed in critical rather than everyday systems and how everyday techniques do not scale in rigour. Finally we map our findings in detail to two of the questions posed by the FAISC organisers and we note that the engineering of critical systems has evolved through open and diverse discussion. We hope that topics identified here will support the post-FAISC dialogues.
- Europe > Austria > Vienna (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Asia > South Korea > Seoul > Seoul (0.05)
- Transportation > Air (1.00)
- Health & Medicine (1.00)
- Energy > Power Industry > Utilities > Nuclear (1.00)
- Government > Regional Government (0.93)
Safeguarding AI Agents: Developing and Analyzing Safety Architectures
Domkundwar, Ishaan, S, Mukunda N, Bhola, Ishaan
AI agents, specifically powered by large language models, have demonstrated exceptional capabilities in various applications where precision and efficacy are necessary. However, these agents come with inherent risks, including the potential for unsafe or biased actions, vulnerability to adversarial attacks, lack of transparency, and tendency to generate hallucinations. As AI agents become more prevalent in critical sectors of the industry, the implementation of effective safety protocols becomes increasingly important. This paper addresses the critical need for safety measures in AI systems, especially ones that collaborate with human teams. We propose and evaluate three frameworks to enhance safety protocols in AI agent systems: an LLM-powered input-output filter, a safety agent integrated within the system, and a hierarchical delegation-based system with embedded safety checks. Our methodology involves implementing these frameworks and testing them against a set of unsafe agentic use cases, providing a comprehensive evaluation of their effectiveness in mitigating risks associated with AI agent deployment. We conclude that these frameworks can significantly strengthen the safety and security of AI agent systems, minimizing potential harmful actions or outputs. Our work contributes to the ongoing effort to create safe and reliable AI applications, particularly in automated operations, and provides a foundation for developing robust guardrails to ensure the responsible use of AI agents in real-world applications.
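To make the first of these frameworks concrete: an input-output filter wraps the agent so that both the user's request and the agent's response must pass a safety check before anything is returned. The sketch below substitutes a trivial keyword heuristic where the paper uses an LLM-powered filter; all names are illustrative.

    # Sketch of framework (1): an input-output safety filter around an agent.
    # classify_safe() is a stand-in for the paper's LLM-powered filter.
    REFUSAL = "Request declined by safety filter."

    def classify_safe(text: str) -> bool:
        """Toy keyword heuristic; the paper delegates this judgment to an LLM."""
        banned = ("weapon", "exploit", "bypass safety")
        return not any(term in text.lower() for term in banned)

    def guarded_agent(agent, user_input: str) -> str:
        if not classify_safe(user_input):   # screen the incoming request
            return REFUSAL
        output = agent(user_input)          # run the underlying agent
        if not classify_safe(output):       # screen the produced output
            return REFUSAL
        return output

The safety-agent and hierarchical-delegation variants move the same check inside the agent system rather than wrapping it from outside.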
- North America > United States (0.04)
- Europe > Russia (0.04)
- Europe > Finland (0.04)
- Education (1.00)
- Health & Medicine (0.94)
- Law (0.94)
Human-AI Safety: A Descendant of Generative AI and Control Systems Safety
Bajcsy, Andrea, Fisac, Jaime F.
Artificial intelligence (AI) is interacting with people at an unprecedented scale, offering new avenues for immense positive impact, but also raising widespread concerns around the potential for individual and societal harm. Today, the predominant paradigm for human-AI safety focuses on fine-tuning the generative model's outputs to better agree with human-provided examples or feedback. In reality, however, the consequences of an AI model's outputs cannot be determined in isolation: they are tightly entangled with the responses and behavior of human users over time. In this paper, we distill key complementary lessons from AI safety and control systems safety, highlighting open challenges as well as key synergies between both fields. We then argue that meaningful safety assurances for advanced AI technologies require reasoning about how the feedback loop formed by AI outputs and human behavior may drive the interaction towards different outcomes. To this end, we introduce a unifying formalism to capture dynamic, safety-critical human-AI interactions and propose a concrete technical roadmap towards next-generation human-centered AI safety.
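The closed-loop framing can be made concrete with a toy rollout: AI action and human response jointly update a shared state, and safety is checked as a property of the trajectory rather than of any single output. Everything below (dynamics, policies, safe set) is an illustrative stand-in, not the paper's formalism.

    # Toy sketch of the closed-loop view: AI output and human behavior jointly
    # drive a shared state, so safety is a property of the loop over time.
    def rollout(state, ai_policy, human_model, dynamics, safe, horizon=50):
        """Simulate the coupled human-AI loop; return the first unsafe step, if any."""
        for t in range(horizon):
            u_ai = ai_policy(state)             # AI acts on the current state
            u_h = human_model(state, u_ai)      # human responds to the AI's action
            state = dynamics(state, u_ai, u_h)  # joint state evolves
            if not safe(state):
                return t                        # safety violated at step t
        return None                             # stayed inside the safe set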
- Asia > Middle East > Jordan (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > District of Columbia > Washington (0.04)
- Europe > Austria > Styria > Graz (0.04)
- Transportation > Air (1.00)
- Automobiles & Trucks (0.68)
- Leisure & Entertainment > Games (0.67)
Latent Guard: a Safety Framework for Text-to-image Generation
Liu, Runtao, Khakzar, Ashkan, Gu, Jindong, Chen, Qifeng, Torr, Philip, Pizzati, Fabio
With the ability to generate high-quality images, text-to-image (T2I) models can be exploited for creating inappropriate content. To prevent misuse, existing safety measures are either based on text blacklists, which can be easily circumvented, or harmful content classification, requiring large datasets for training and offering low flexibility. Hence, we propose Latent Guard, a framework designed to improve safety measures in text-to-image generation. Inspired by blacklist-based approaches, Latent Guard learns a latent space on top of the T2I model's text encoder, where it is possible to check the presence of harmful concepts in the input text embeddings. Our proposed framework is composed of a data generation pipeline specific to the task using large language models, ad-hoc architectural components, and a contrastive learning strategy to benefit from the generated data. The effectiveness of our method is verified on three datasets and against four baselines. Code and data will be shared at https://github.com/rt219/LatentGuard.
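The mechanism reduces to a similarity test in the learned latent space: project the prompt embedding and a bank of harmful-concept embeddings, then block the input when any similarity exceeds a threshold. A sketch under assumed shapes and a placeholder threshold; the authors' implementation is at the repository linked above.

    # Sketch of the Latent Guard check: flag prompts whose projected embedding
    # sits too close to any embedded harmful concept. Encoder, projection head,
    # and threshold are placeholder assumptions, not the paper's values.
    import torch
    import torch.nn.functional as F

    def is_blocked(prompt_emb, concept_embs, proj, tau=0.8):
        """prompt_emb: (d,) text-encoder embedding; concept_embs: (k, d)."""
        z_p = F.normalize(proj(prompt_emb), dim=-1)    # project the prompt
        z_c = F.normalize(proj(concept_embs), dim=-1)  # project the concepts
        sims = z_c @ z_p                               # cosine similarities, (k,)
        return bool((sims > tau).any())                # any concept too close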
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Asia > China > Hong Kong (0.04)
- Africa > Central African Republic > Ombella-M'Poko > Bimbo (0.04)
- Law Enforcement & Public Safety (0.93)
- Law (0.93)
- Information Technology > Security & Privacy (0.68)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.96)
The Chai Platform's AI Safety Framework
Lu, Xiaoding, Korshuk, Aleksey, Liu, Zongyi, Beauchamp, William
Chai empowers users to create and interact with customized chatbots, offering unique and engaging experiences. Despite these exciting prospects, this work recognizes the inherent challenges of committing to modern safety standards. This paper therefore presents the AI safety principles integrated into Chai to prioritize user safety, data protection, and ethical technology use. It specifically explores the multidimensional domain of AI safety research, demonstrating its application in Chai's conversational chatbot platform. It presents Chai's AI safety principles, informed by well-established AI research centres and adapted for chat AI. This work proposes the following safety framework: Content Safeguarding; Stability and Robustness; and Operational Transparency and Traceability. The subsequent implementation of these principles is outlined, followed by an experimental analysis of the framework's real-world impact. We emphasise the significance of conscientious application of AI safety principles and robust safety measures. The successful implementation of the safe AI framework in Chai indicates the practicality of mitigating potential risks for the responsible and ethical use of AI technologies. The ultimate vision is a transformative AI tool fostering progress and innovation while prioritizing user safety and ethical standards.
- North America > United States > Virginia (0.04)
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > California > San Francisco County > San Francisco (0.04)
- Asia (0.04)
- Overview (0.46)
- Research Report (0.40)