AITopics | safety model

safety model

Information about AI from the News, Publications, and Conferences

Towards Safe Reinforcement Learning with a Safety Editor Policy

Neural Information Processing Systems | Apr-24-2026, 15:50:40 GMT

We consider the safe reinforcement learning (RL) problem of maximizing utility with extremely low constraint violation rates. Assuming no prior knowledge or pre-training of the environment safety model given a task, an agent has to learn, via exploration, which states and actions are safe. A popular approach in this line of research is to combine a model-free RL algorithm with the Lagrangian method to adjust the weight of the constraint reward relative to the utility reward dynamically. It relies on a single policy to handle the conflict between utility and constraint rewards, which is often challenging. We present SEditor, a two-policy approach that learns a safety editor policy transforming potentially unsafe actions proposed by a utility maximizer policy into safe ones.
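The two ideas in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the learning rate, the one-dimensional state and action, and the hand-written proposer and editor (stand-ins for learned policies) are all assumptions made for illustration. The first two functions show the Lagrangian baseline, where a multiplier weights the constraint cost against utility and is adjusted by projected dual ascent. The `compose` function shows the two-policy structure, where an editor maps a proposed action to the action actually executed.

```python
from typing import Callable

State = float
Action = float


def lagrangian_reward(utility: float, cost: float, lam: float) -> float:
    """Scalarize utility and constraint cost with a multiplier lam >= 0."""
    return utility - lam * cost


def update_multiplier(lam: float, avg_cost: float, cost_limit: float,
                      lr: float = 0.1) -> float:
    """Projected dual ascent: raise lam when the average cost exceeds the
    limit, lower it otherwise, and never let it go below zero."""
    return max(0.0, lam + lr * (avg_cost - cost_limit))


def compose(proposer: Callable[[State], Action],
            editor: Callable[[State, Action], Action]
            ) -> Callable[[State], Action]:
    """Two-policy action selection: the editor sees the state and the
    proposed action and returns the action that is actually executed."""
    def act(state: State) -> Action:
        return editor(state, proposer(state))
    return act


# Hypothetical hand-written stand-ins for the two learned policies.
def greedy_proposer(state: State) -> Action:
    return 2.0 * state  # proposes larger actions for larger states


def clipping_editor(state: State, action: Action) -> Action:
    return max(-1.0, min(1.0, action))  # assumed safe range [-1, 1]


policy = compose(greedy_proposer, clipping_editor)
```

In this toy setup the editor leaves actions inside the assumed safe range untouched and only modifies proposals outside it. A learned editor would instead be trained, which the sketch does not attempt.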

machine learning, reinforcement learning, seditor, (14 more...)

Genre: Research Report > New Finding (0.46)

Industry: Leisure & Entertainment > Games (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Leveraging Catastrophic Forgetting to Develop Safe Diffusion Models against Malicious Finetuning

Jiadong Pan

Neural Information Processing Systems | Feb-18-2026, 06:01:40 GMT

WARNING: This paper contains offensive images generated by models.

artificial intelligence, machine learning, natural language, (19 more...)

Country:

North America > United States > Pennsylvania (0.04)
Asia > China > Zhejiang Province (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

One-Shot Safety Alignment for Large Language Models via Optimal Dualization

Neural Information Processing Systems | Feb-16-2026, 21:19:47 GMT

Ideally, we would like methods that train LMs only once (i.e., one-shot) with a fixed objective, as in

information, large language model, machine learning, (19 more...)

Country:

North America > United States > Pennsylvania (0.04)
Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report > Experimental Study (0.46)
Research Report > New Finding (0.46)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.68)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.64)

11afefdd848d1bc9ac9f1604d9f45817-Paper-Conference.pdf

Neural Information Processing Systems | Feb-7-2026, 12:56:00 GMT

constraint reward, reinforcement, seditor, (14 more...)

Country:

North America > United States > California > Santa Clara County > Cupertino (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Leisure & Entertainment > Games (0.67)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)

Towards Safe Reinforcement Learning with a Safety Editor Policy

Neural Information Processing Systems | Dec-23-2025, 19:07:49 GMT

artificial intelligence, machine learning, reinforcement learning, (7 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.61)

d0949cbcec31c09431610553a284f94a-Paper-Conference.pdf

Neural Information Processing Systems | Oct-10-2025, 17:19:59 GMT

experiment, image generation, malicious fine-tuning, (15 more...)

Country:

North America > United States > Pennsylvania (0.04)
Asia > China > Zhejiang Province (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

9979a69d2613ab98ad25d3849068f9f0-Paper-Conference.pdf

Neural Information Processing Systems | Oct-10-2025, 10:50:37 GMT

alignment, information, optimization, (16 more...)

Country:

North America > United States > Pennsylvania (0.04)
Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report > Experimental Study (0.46)
Research Report > New Finding (0.46)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)
(2 more...)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Integrating Perceptions: A Human-Centered Physical Safety Model for Human-Robot Interaction

Pandey, Pranav, Parasuraman, Ramviyas, Doshi, Prashant

arXiv.org Artificial Intelligence | Jul-10-2025

Ensuring safety in human-robot interaction (HRI) is essential to foster user trust and enable the broader adoption of robotic systems. Traditional safety models primarily rely on sensor-based measures, such as relative distance and velocity, to assess physical safety. However, these models often fail to capture subjective safety perceptions, which are shaped by individual traits and contextual factors. In this paper, we introduce and analyze a parameterized general safety model that bridges the gap between physical and perceived safety by incorporating a personalization parameter, $ρ$, into the safety measurement framework to account for individual differences in safety perception. Through a series of hypothesis-driven human-subject studies in a simulated rescue scenario, we investigate how emotional state, trust, and robot behavior influence perceived safety. Our results show that $ρ$ effectively captures meaningful individual differences, driven by affective responses, trust in task consistency, and clustering into distinct user types. Specifically, our findings confirm that predictable and consistent robot behavior, as well as the elicitation of positive emotional states, significantly enhance perceived safety. Moreover, responses cluster into a small number of user types, supporting adaptive personalization based on shared safety models. Notably, participant role significantly shapes safety perception, and repeated exposure reduces perceived safety for participants in the casualty role, emphasizing the impact of physical interaction and experiential change. These findings highlight the importance of adaptive, human-centered safety models that integrate both psychological and behavioral dimensions, offering a pathway toward more trustworthy and effective HRI in safety-critical domains.

artificial intelligence, participant, safety, (17 more...)

2507.067

Country: North America > United States (0.46)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area (0.68)
Government (0.68)

Technology: Information Technology > Artificial Intelligence > Robots > Humanoid Robots (0.64)

Adversarial Tokenization

Geh, Renato Lui, Shao, Zilei, Van den Broeck, Guy

arXiv.org Artificial Intelligence | Mar-3-2025

Current LLM pipelines account for only one possible tokenization for a given string, ignoring exponentially many alternative tokenizations during training and inference. For example, the standard Llama3 tokenization of penguin is [p,enguin], yet [peng,uin] is another perfectly valid alternative. In this paper, we show that despite LLMs being trained solely on one tokenization, they still retain semantic understanding of other tokenizations, raising questions about their implications in LLM safety. Put succinctly, we answer the following question: can we adversarially tokenize an obviously malicious string to evade safety and alignment restrictions? We show that not only is adversarial tokenization an effective yet previously neglected axis of attack, but it is also competitive against existing state-of-the-art adversarial approaches without changing the text of the harmful request. We empirically validate this exploit across three state-of-the-art LLMs and adversarial datasets, revealing a previously unknown vulnerability in subword models.
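The claim that a string admits many alternative tokenizations can be made concrete with a small enumeration sketch. The vocabulary below is a hypothetical toy set chosen so that both splits mentioned in the abstract are valid; it is not the real Llama3 vocabulary, and the function is an illustration of the combinatorics, not the paper's attack.

```python
from typing import FrozenSet, List


def tokenizations(text: str, vocab: FrozenSet[str]) -> List[List[str]]:
    """Return every way to split `text` into tokens drawn from `vocab`."""
    if not text:
        return [[]]  # one way to tokenize the empty string: no tokens
    results: List[List[str]] = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in vocab:
            # Extend each tokenization of the remainder with this prefix.
            for rest in tokenizations(text[end:], vocab):
                results.append([prefix] + rest)
    return results


# Hypothetical toy vocabulary, NOT the real Llama3 vocabulary.
TOY_VOCAB = frozenset(
    {"p", "e", "n", "g", "u", "i", "peng", "uin", "enguin", "penguin"}
)

all_splits = tokenizations("penguin", TOY_VOCAB)
for split in all_splits:
    print(split)
```

Even this ten-entry vocabulary yields several distinct splits of a seven-character word, including `['p', 'enguin']` and `['peng', 'uin']`. With a realistic subword vocabulary the count grows quickly with string length, which is the space the abstract describes as being ignored by standard pipelines.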

fulfill, preprint, tokenization, (15 more...)

2503.02174

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
Africa > Cameroon (0.14)
North America > United States > Florida > Miami-Dade County > Miami (0.04)
(19 more...)

Genre: Research Report > New Finding (0.92)

Industry:

Information Technology > Security & Privacy (1.00)
Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

A Method for Enhancing the Safety of Large Model Generation Based on Multi-dimensional Attack and Defense

Zhai, Keke

arXiv.org Artificial Intelligence | Dec-31-2024

Currently, large models are prone to generating harmful content when faced with complex attack instructions, significantly reducing their defensive capabilities. To address this issue, this paper proposes a method based on constructing data aligned with multi-dimensional attack defense to enhance the generative security of large models. The core of our method lies in improving the effectiveness of safe alignment learning for large models by innovatively increasing the diversity of attack instruction dimensions and the accuracy of generating safe responses. To validate the effectiveness of our method, beyond existing security evaluation benchmarks, we additionally designed new security evaluation benchmarks and conducted comparative experiments using Llama3.2 as the baseline model. The final experimental results demonstrate that our method can significantly improve the generative security of large models under complex instructional attacks, while also maintaining and enhancing the models' general capabilities.

large language model, machine learning, natural language, (20 more...)

2501.00517

Genre: Research Report > New Finding (0.66)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
