AITopics | ai safety

Collaborating Authors

ai safety

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment

Neural Information Processing SystemsDec-25-2025, 02:02:13 GMT

AI systems are becoming increasingly intertwined with human life. In order to effectively collaborate with humans and ensure safety, AI systems need to be able to understand, interpret and predict human moral judgments and decisions. Human moral judgments are often guided by rules, but not always. A central challenge for AI safety is capturing the flexibility of the human moral mind -- the ability to determine when a rule should be broken, especially in novel or unusual situations. In this paper, we present a novel challenge set consisting of moral exception question answering (MoralExceptQA) of cases that involve potentially permissible moral exceptions - inspired by recent moral psychology studies. Using a state-of-the-art large language model (LLM) as a basis, we propose a novel moral chain of thought (MoralCoT) prompting strategy that combines the strengths of LLMs with theories of moral reasoning developed in cognitive science to predict human moral judgments. MoralCoT outperforms seven existing LLMs by 6.2% F1, suggesting that modeling human reasoning might be necessary to capture the flexibility of the human moral mind. We also conduct a detailed error analysis to suggest directions for future work to improve AI safety using MoralExceptQA.

artificial intelligence, large language model, natural language, (13 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

The AI doomers feel undeterred

MIT Technology ReviewDec-15-2025, 10:00:00 GMT

But they certainly wish people were still taking their warnings really seriously. It's a weird time to be an AI doomer. This small but influential community of researchers, scientists, and policy experts believes, in the simplest terms, that AI could get so good it could be bad--very, very bad--for humanity. Though many of these people would be more likely to describe themselves as advocates for AI safety than as literal doomsayers, they warn that AI poses an existential risk to humanity. They argue that absent more regulation, the industry could hurtle toward systems it can't control. They commonly expect such systems to follow the creation of artificial general intelligence (AGI), a slippery concept generally understood as technology that can do whatever humans can do, and better. Though this is far from a universally shared perspective in the AI field, the doomer crowd has had some notable success over the past several years: helping shape AI policy coming from the Biden administration, organizing prominent calls for international "red lines " to prevent AI risks, and getting a bigger (and more influential) megaphone as some of its adherents win science's most prestigious awards. But a number of developments over the past six months have put them on the back foot.

large language model, machine learning, natural language, (21 more...)

MIT Technology Review

Country:

North America > United States > Massachusetts (0.04)
North America > United States > California (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Personal > Honors > Award (0.34)

Industry: Government > Regional Government > North America Government > United States Government (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.98)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.98)
(2 more...)

Add feedback

King handed Nvidia boss a letter warning of AI dangers

BBC NewsNov-5-2025, 23:40:47 GMT

Jensen Huang, the head of the world's most valuable company Nvidia, says King Charles III personally handed him a copy of a speech he delivered in 2023 that included a warning about the dangers of artificial intelligence. He said, there's something I want to talk to you about. And he handed me a letter, Huang told the BBC, speaking after receiving the 2025 Queen Elizabeth Prize for Engineering in a ceremony at St James's Palace. The letter was a copy of the speech delivered by the King in 2023 at the world's first AI Summit, held at Bletchley Park . In it the monarch said that the risks of AI needed to be tackled with a sense of urgency, unity and collective strength.

artificial intelligence, huang, king, (11 more...)

BBC News

Country:

North America > United States (0.50)
Europe > United Kingdom > England > Buckinghamshire > Milton Keynes (0.25)
South America (0.16)
(15 more...)

Industry:

Information Technology (1.00)
Government > Regional Government > Europe Government > United Kingdom Government (1.00)

Technology: Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (1.00)

Add feedback

Character.ai to ban teens from talking to its AI chatbots

BBC NewsOct-29-2025, 13:36:21 GMT

Character.ai to ban teens from talking to its AI chatbots The platform, founded in 2021, is used by millions to talk to chatbots powered by artificial intelligence (AI). But it is facing several lawsuits in the US from parents, including one over the death of a teenager, with some branding it a clear and present danger to young people. Online safety campaigners have welcomed the move but said the feature should never have been available to children in the first place. Character.ai said it was making the changes after reports and feedback from regulators, safety experts, and parents, which have highlighted concerns about its chatbots' interactions with teens. Experts have previously warned the potential for AI chatbots to make things up, be overly-encouraging, and feign empathy can pose risks to young and vulnerable people.

artificial intelligence, chatbot, natural language, (16 more...)

BBC News

Country:

North America > United States (0.35)
South America (0.15)
North America > Central America (0.15)
(13 more...)

Industry: Law > Litigation (0.55)

Technology: Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)

Add feedback

AI Alignment Strategies from a Risk Perspective: Independent Safety Mechanisms or Shared Failures?

Dung, Leonard, Mai, Florian

arXiv.org Artificial IntelligenceOct-14-2025

AI alignment research aims to develop techniques to ensure that AI systems do not cause harm. However, every alignment technique has failure modes, which are conditions in which there is a non-negligible chance that the technique fails to provide safety. As a strategy for risk mitigation, the AI safety community has increasingly adopted a defense-in-depth framework: Conceding that there is no single technique which guarantees safety, defense-in-depth consists in having multiple redundant protections against safety failure, such that safety can be maintained even if some protections fail. However, the success of defense-in-depth depends on how (un)correlated failure modes are across alignment techniques. For example, if all techniques had the exact same failure modes, the defense-in-depth approach would provide no additional protection at all. In this paper, we analyze 7 representative alignment techniques and 7 failure modes to understand the extent to which they overlap. We then discuss our results' implications for understanding the current level of risk and how to prioritize AI alignment research in the future.

failure mode, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2510.11235

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > Germany > North Rhine-Westphalia (0.04)

Genre:

Overview (0.68)
Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.97)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)

Add feedback

A Framework for Inherently Safer AGI through Language-Mediated Active Inference

Wen, Bo

arXiv.org Artificial IntelligenceAug-11-2025

This paper proposes a novel framework for developing safe Artificial General Intelligence (AGI) by combining Active Inference principles with Large Language Models (LLMs). We argue that traditional approaches to AI safety, focused on post-hoc interpretability and reward engineering, have fundamental limitations. We present an architecture where safety guarantees are integrated into the system's core design through transparent belief representations and hierarchical value alignment. Our framework leverages natural language as a medium for representing and manipulating beliefs, enabling direct human oversight while maintaining computational tractability. The architecture implements a multi-agent system where agents self-organize according to Active Inference principles, with preferences and safety constraints flowing through hierarchical Markov blankets. We outline specific mechanisms for ensuring safety, including: (1) explicit separation of beliefs and preferences in natural language, (2) bounded rationality through resource-aware free energy minimization, and (3) compositional safety through modular agent structures. The paper concludes with a research agenda centered on the Abstraction and Reasoning Corpus (ARC) benchmark, proposing experiments to validate our framework's safety properties. Our approach offers a path toward AGI development that is inherently safer, rather than retrofitted with safety measures.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2508.05766

Country:

North America > United States > Wisconsin (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre: Research Report (0.40)

Industry:

Health & Medicine (0.68)
Information Technology (0.46)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Add feedback

Inside the Biden Administration's Unpublished Report on AI Safety

WIREDAug-6-2025, 18:00:00 GMT

At a computer security conference in Arlington, Virginia, last October, a few dozen AI researchers took part in a first-of-its-kind exercise in "red teaming," or stress-testing a cutting-edge language model and other artificial intelligence systems. Over the course of two days, the teams identified 139 novel ways to get the systems to misbehave including by generating misinformation or leaking personal data. More importantly, they showed shortcomings in a new US government standard designed to help companies test AI systems. The National Institute of Standards and Technology (NIST) didn't publish a report detailing the exercise, which was finished toward the end of the Biden administration. The document might have helped companies assess their own AI systems, but sources familiar with the situation, who spoke on condition of anonymity, say it was one of several AI documents from NIST that were not published for fear of clashing with the incoming administration.

ai system, biden administration, unpublished report, (11 more...)

WIRED

Country: North America > United States > Virginia > Arlington County > Arlington (0.26)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Regional Government > North America Government > United States Government (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language (0.39)

Add feedback

Inside the Summit Where China Pitched Its AI Agenda to the World

WIREDJul-31-2025, 15:04:47 GMT

Three days after the Trump administration published its much-anticipated AI action plan, the Chinese government put out its own AI policy blueprint. Was the timing a coincidence? China's "Global AI Governance Action Plan" was released on July 26, the first day of the World Artificial Intelligence Conference (WAIC), the largest annual AI event in China. Geoffrey Hinton and Eric Schmidt were among the many Western tech industry figures who attended the festivities in Shanghai. Our WIRED colleague Will Knight was also on the scene.

artificial intelligence, china, government, (16 more...)

WIRED

Country:

North America > United States (0.53)
Asia > China > Shanghai > Shanghai (0.28)
Asia > Singapore (0.07)
Asia > China > Beijing > Beijing (0.05)

Industry: Government > Regional Government > North America Government > United States Government (0.53)

Technology: Information Technology > Artificial Intelligence (1.00)

Add feedback

LLMs are Capable of Misaligned Behavior Under Explicit Prohibition and Surveillance

Ivanov, Igor

arXiv.org Artificial IntelligenceJul-8-2025

In this paper, LLMs are tasked with completing an impossible quiz, while they are in a sandbox, monitored, told about these measures and instructed not to cheat. Some frontier LLMs cheat consistently and attempt to circumvent restrictions despite everything. The results reveal a fundamental tension between goal-directed behavior and alignment in current LLMs. The code and evaluation logs are available at github.com/baceolus/cheating

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2507.02977

Genre: Research Report > New Finding (0.69)

Industry: Information Technology > Security & Privacy (0.96)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)

Add feedback

A Different Approach to AI Safety: Proceedings from the Columbia Convening on Openness in Artificial Intelligence and AI Safety

François, Camille, Péran, Ludovic, Bdeir, Ayah, Dziri, Nouha, Hawkins, Will, Jernite, Yacine, Kapoor, Sayash, Shen, Juliet, Khlaaf, Heidy, Klyman, Kevin, Marda, Nik, Pellat, Marie, Raji, Deb, Siddarth, Divya, Skowron, Aviya, Spisak, Joseph, Srikumar, Madhulika, Storchan, Victor, Tang, Audrey, Weedon, Jen

arXiv.org Artificial IntelligenceJun-30-2025

The rapid rise of open-weight and open-source foundation models is intensifying the obligation and reshaping the opportunity to make AI systems safe. This paper reports outcomes from the Columbia Convening on AI Openness and Safety (San Francisco, 19 Nov 2024) and its six-week preparatory programme involving more than forty-five researchers, engineers, and policy leaders from academia, industry, civil society, and government. Using a participatory, solutions-oriented process, the working groups produced (i) a research agenda at the intersection of safety and open source AI; (ii) a mapping of existing and needed technical interventions and open source tools to safely and responsibly deploy open foundation models across the AI development workflow; and (iii) a mapping of the content safety filter ecosystem with a proposed roadmap for future research and development. We find that openness -- understood as transparent weights, interoperable tooling, and public governance -- can enhance safety by enabling independent scrutiny, decentralized mitigation, and culturally plural oversight. However, significant gaps persist: scarce multimodal and multilingual benchmarks, limited defenses against prompt-injection and compositional attacks in agentic systems, and insufficient participatory mechanisms for communities most affected by AI harms. The paper concludes with a roadmap of five priority research directions, emphasizing participatory inputs, future-proof content filters, ecosystem-wide safety infrastructure, rigorous agentic safeguards, and expanded harm taxonomies. These recommendations informed the February 2025 French AI Action Summit and lay groundwork for an open, plural, and accountable AI safety discipline.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2506.22183

Country:

North America > United States > California > San Francisco County > San Francisco (0.34)
Europe > France (0.14)
Asia > China (0.04)
(10 more...)

Genre: Research Report > New Finding (0.92)

Industry:

Media > News (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Law (1.00)
(5 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Add feedback