Goto

Collaborating Authors

 ai safety


When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment

Neural Information Processing Systems

AI systems are becoming increasingly intertwined with human life. In order to effectively collaborate with humans and ensure safety, AI systems need to be able to understand, interpret and predict human moral judgments and decisions. Human moral judgments are often guided by rules, but not always. A central challenge for AI safety is capturing the flexibility of the human moral mind -- the ability to determine when a rule should be broken, especially in novel or unusual situations. In this paper, we present a novel challenge set consisting of moral exception question answering (MoralExceptQA) of cases that involve potentially permissible moral exceptions - inspired by recent moral psychology studies. Using a state-of-the-art large language model (LLM) as a basis, we propose a novel moral chain of thought (MoralCoT) prompting strategy that combines the strengths of LLMs with theories of moral reasoning developed in cognitive science to predict human moral judgments. MoralCoT outperforms seven existing LLMs by 6.2% F1, suggesting that modeling human reasoning might be necessary to capture the flexibility of the human moral mind. We also conduct a detailed error analysis to suggest directions for future work to improve AI safety using MoralExceptQA.


The AI doomers feel undeterred

MIT Technology Review

But they certainly wish people were still taking their warnings really seriously. It's a weird time to be an AI doomer. This small but influential community of researchers, scientists, and policy experts believes, in the simplest terms, that AI could get so good it could be bad--very, very bad--for humanity. Though many of these people would be more likely to describe themselves as advocates for AI safety than as literal doomsayers, they warn that AI poses an existential risk to humanity. They argue that absent more regulation, the industry could hurtle toward systems it can't control. They commonly expect such systems to follow the creation of artificial general intelligence (AGI), a slippery concept generally understood as technology that can do whatever humans can do, and better. Though this is far from a universally shared perspective in the AI field, the doomer crowd has had some notable success over the past several years: helping shape AI policy coming from the Biden administration, organizing prominent calls for international "red lines " to prevent AI risks, and getting a bigger (and more influential) megaphone as some of its adherents win science's most prestigious awards. But a number of developments over the past six months have put them on the back foot.


King handed Nvidia boss a letter warning of AI dangers

BBC News

Jensen Huang, the head of the world's most valuable company Nvidia, says King Charles III personally handed him a copy of a speech he delivered in 2023 that included a warning about the dangers of artificial intelligence. He said, there's something I want to talk to you about. And he handed me a letter, Huang told the BBC, speaking after receiving the 2025 Queen Elizabeth Prize for Engineering in a ceremony at St James's Palace. The letter was a copy of the speech delivered by the King in 2023 at the world's first AI Summit, held at Bletchley Park . In it the monarch said that the risks of AI needed to be tackled with a sense of urgency, unity and collective strength.


Character.ai to ban teens from talking to its AI chatbots

BBC News

Character.ai to ban teens from talking to its AI chatbots The platform, founded in 2021, is used by millions to talk to chatbots powered by artificial intelligence (AI). But it is facing several lawsuits in the US from parents, including one over the death of a teenager, with some branding it a clear and present danger to young people. Online safety campaigners have welcomed the move but said the feature should never have been available to children in the first place. Character.ai said it was making the changes after reports and feedback from regulators, safety experts, and parents, which have highlighted concerns about its chatbots' interactions with teens. Experts have previously warned the potential for AI chatbots to make things up, be overly-encouraging, and feign empathy can pose risks to young and vulnerable people.


AI Alignment Strategies from a Risk Perspective: Independent Safety Mechanisms or Shared Failures?

Dung, Leonard, Mai, Florian

arXiv.org Artificial Intelligence

AI alignment research aims to develop techniques to ensure that AI systems do not cause harm. However, every alignment technique has failure modes, which are conditions in which there is a non-negligible chance that the technique fails to provide safety. As a strategy for risk mitigation, the AI safety community has increasingly adopted a defense-in-depth framework: Conceding that there is no single technique which guarantees safety, defense-in-depth consists in having multiple redundant protections against safety failure, such that safety can be maintained even if some protections fail. However, the success of defense-in-depth depends on how (un)correlated failure modes are across alignment techniques. For example, if all techniques had the exact same failure modes, the defense-in-depth approach would provide no additional protection at all. In this paper, we analyze 7 representative alignment techniques and 7 failure modes to understand the extent to which they overlap. We then discuss our results' implications for understanding the current level of risk and how to prioritize AI alignment research in the future.


A Framework for Inherently Safer AGI through Language-Mediated Active Inference

Wen, Bo

arXiv.org Artificial Intelligence

This paper proposes a novel framework for developing safe Artificial General Intelligence (AGI) by combining Active Inference principles with Large Language Models (LLMs). We argue that traditional approaches to AI safety, focused on post-hoc interpretability and reward engineering, have fundamental limitations. We present an architecture where safety guarantees are integrated into the system's core design through transparent belief representations and hierarchical value alignment. Our framework leverages natural language as a medium for representing and manipulating beliefs, enabling direct human oversight while maintaining computational tractability. The architecture implements a multi-agent system where agents self-organize according to Active Inference principles, with preferences and safety constraints flowing through hierarchical Markov blankets. We outline specific mechanisms for ensuring safety, including: (1) explicit separation of beliefs and preferences in natural language, (2) bounded rationality through resource-aware free energy minimization, and (3) compositional safety through modular agent structures. The paper concludes with a research agenda centered on the Abstraction and Reasoning Corpus (ARC) benchmark, proposing experiments to validate our framework's safety properties. Our approach offers a path toward AGI development that is inherently safer, rather than retrofitted with safety measures.


Inside the Biden Administration's Unpublished Report on AI Safety

WIRED

At a computer security conference in Arlington, Virginia, last October, a few dozen AI researchers took part in a first-of-its-kind exercise in "red teaming," or stress-testing a cutting-edge language model and other artificial intelligence systems. Over the course of two days, the teams identified 139 novel ways to get the systems to misbehave including by generating misinformation or leaking personal data. More importantly, they showed shortcomings in a new US government standard designed to help companies test AI systems. The National Institute of Standards and Technology (NIST) didn't publish a report detailing the exercise, which was finished toward the end of the Biden administration. The document might have helped companies assess their own AI systems, but sources familiar with the situation, who spoke on condition of anonymity, say it was one of several AI documents from NIST that were not published for fear of clashing with the incoming administration.


Inside the Summit Where China Pitched Its AI Agenda to the World

WIRED

Three days after the Trump administration published its much-anticipated AI action plan, the Chinese government put out its own AI policy blueprint. Was the timing a coincidence? China's "Global AI Governance Action Plan" was released on July 26, the first day of the World Artificial Intelligence Conference (WAIC), the largest annual AI event in China. Geoffrey Hinton and Eric Schmidt were among the many Western tech industry figures who attended the festivities in Shanghai. Our WIRED colleague Will Knight was also on the scene.


LLMs are Capable of Misaligned Behavior Under Explicit Prohibition and Surveillance

Ivanov, Igor

arXiv.org Artificial Intelligence

In this paper, LLMs are tasked with completing an impossible quiz, while they are in a sandbox, monitored, told about these measures and instructed not to cheat. Some frontier LLMs cheat consistently and attempt to circumvent restrictions despite everything. The results reveal a fundamental tension between goal-directed behavior and alignment in current LLMs. The code and evaluation logs are available at github.com/baceolus/cheating


A Different Approach to AI Safety: Proceedings from the Columbia Convening on Openness in Artificial Intelligence and AI Safety

François, Camille, Péran, Ludovic, Bdeir, Ayah, Dziri, Nouha, Hawkins, Will, Jernite, Yacine, Kapoor, Sayash, Shen, Juliet, Khlaaf, Heidy, Klyman, Kevin, Marda, Nik, Pellat, Marie, Raji, Deb, Siddarth, Divya, Skowron, Aviya, Spisak, Joseph, Srikumar, Madhulika, Storchan, Victor, Tang, Audrey, Weedon, Jen

arXiv.org Artificial Intelligence

The rapid rise of open-weight and open-source foundation models is intensifying the obligation and reshaping the opportunity to make AI systems safe. This paper reports outcomes from the Columbia Convening on AI Openness and Safety (San Francisco, 19 Nov 2024) and its six-week preparatory programme involving more than forty-five researchers, engineers, and policy leaders from academia, industry, civil society, and government. Using a participatory, solutions-oriented process, the working groups produced (i) a research agenda at the intersection of safety and open source AI; (ii) a mapping of existing and needed technical interventions and open source tools to safely and responsibly deploy open foundation models across the AI development workflow; and (iii) a mapping of the content safety filter ecosystem with a proposed roadmap for future research and development. We find that openness -- understood as transparent weights, interoperable tooling, and public governance -- can enhance safety by enabling independent scrutiny, decentralized mitigation, and culturally plural oversight. However, significant gaps persist: scarce multimodal and multilingual benchmarks, limited defenses against prompt-injection and compositional attacks in agentic systems, and insufficient participatory mechanisms for communities most affected by AI harms. The paper concludes with a roadmap of five priority research directions, emphasizing participatory inputs, future-proof content filters, ecosystem-wide safety infrastructure, rigorous agentic safeguards, and expanded harm taxonomies. These recommendations informed the February 2025 French AI Action Summit and lay groundwork for an open, plural, and accountable AI safety discipline.