AITopics | safe prompt

Collaborating Authors

safe prompt

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users

Neural Information Processing SystemsFeb-17-2026, 05:15:07 GMT

Large-scale pre-trained generative models are taking the world by storm, due to their abilities in generating creative content.

large language model, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country:

Asia > Singapore (0.04)
South America (0.04)
North America (0.04)
(2 more...)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.68)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Law > Criminal Law (0.68)
Government (0.68)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)

Add feedback

a5c7206fd66e8314bb21a04492359353-Paper-Conference.pdf

Neural Information Processing SystemsOct-10-2025, 12:18:03 GMT

category, safe prompt, text-to-image model, (14 more...)

Neural Information Processing Systems

Country:

Asia > Singapore (0.04)
South America (0.04)
North America (0.04)
(2 more...)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.68)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Law > Criminal Law (0.94)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.92)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)

Add feedback

LifeAlign: Lifelong Alignment for Large Language Models with Memory-Augmented Focalized Preference Optimization

Li, Junsong, Zhou, Jie, Zhan, Bihao, Yang, Yutao, Pan, Qianjun, Chen, Shilian, Huai, Tianyu, Li, Xin, Chen, Qin, He, Liang

arXiv.org Artificial IntelligenceSep-23-2025

Alignment plays a crucial role in Large Language Models (LLMs) in aligning with human preferences on a specific task/domain. Traditional alignment methods suffer from catastrophic forgetting, where models lose previously acquired knowledge when adapting to new preferences or domains. We introduce LifeAlign, a novel framework for lifelong alignment that enables LLMs to maintain consistent human preference alignment across sequential learning tasks without forgetting previously learned knowledge. Our approach consists of two key innovations. First, we propose a focalized preference optimization strategy that aligns LLMs with new preferences while preventing the erosion of knowledge acquired from previous tasks. Second, we develop a short-to-long memory consolidation mechanism that merges denoised short-term preference representations into stable long-term memory using intrinsic dimensionality reduction, enabling efficient storage and retrieval of alignment patterns across diverse domains. We evaluate LifeAlign across multiple sequential alignment tasks spanning different domains and preference types. Experimental results demonstrate that our method achieves superior performance in maintaining both preference alignment quality and knowledge retention compared to existing lifelong learning approaches. The codes and datasets will be released on GitHub.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2509.17183

Country:

Asia (0.46)
North America > United States (0.28)
Europe > Austria (0.28)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness

Luo, Xuan, Wang, Yue, He, Zefeng, Tu, Geng, Li, Jing, Xu, Ruifeng

arXiv.org Artificial IntelligenceSep-19-2025

Safety alignment aims to prevent Large Language Models (LLMs) from responding to harmful queries. To strengthen safety protections, jailbreak methods are developed to simulate malicious attacks and uncover vulnerabilities. In this paper, we introduce HILL (Hiding Intention by Learning from LLMs), a novel jailbreak approach that systematically transforms imperative harmful requests into learning-style questions with only straightforward hypotheticality indicators. Further, we introduce two new metrics to thoroughly evaluate the utility of jailbreak methods. Experiments on the AdvBench dataset across a wide range of models demonstrate HILL's strong effectiveness, generalizability, and harmfulness. It achieves top attack success rates on the majority of models and across malicious categories while maintaining high efficiency with concise prompts. Results of various defense methods show the robustness of HILL, with most defenses having mediocre effects or even increasing the attack success rates. Moreover, the assessment on our constructed safe prompts reveals inherent limitations of LLMs' safety mechanisms and flaws in defense methods.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2509.14297

Country: Asia > China (0.28)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts

Sharma, Kartik, Jin, Yiqiao, Rakesh, Vineeth, Dou, Yingtong, Pan, Menghai, Das, Mahashweta, Kumar, Srijan

arXiv.org Artificial IntelligenceJun-23-2025

As large language models (LLMs) are deployed in safety-critical settings, it is essential to ensure that their responses comply with safety standards. Prior research has revealed that LLMs often fail to grasp the notion of safe behaviors, resulting in either unjustified refusals to harmless prompts or the generation of harmful content. While substantial efforts have been made to improve their robustness, existing defenses often rely on costly fine-tuning of model parameters or employ suboptimal heuristic techniques. In this work, we take a novel approach to safeguard LLMs by learning to adapt the system prompts in instruction-tuned LLMs. While LLMs are typically pre-trained to follow a fixed system prompt, we investigate the impact of tailoring the system prompt to each specific user input on the safety of the responses. To this end, we propose $\textbf{Sysformer}$, a trans$\textbf{former}$ model that updates an initial $\textbf{sys}$tem prompt to a more robust system prompt in the LLM input embedding space while attending to the user prompt. While keeping the LLM parameters frozen, the Sysformer is trained to refuse to respond to a set of harmful prompts while responding ideally to a set of safe ones. Through extensive experiments on $5$ LLMs from different families and $2$ recent benchmarks, we demonstrate that Sysformer can significantly enhance the robustness of LLMs, leading to upto $80\%$ gain in the refusal rate on harmful prompts while enhancing the compliance with the safe prompts by upto $90\%$. Results also generalize well to sophisticated jailbreaking attacks, making LLMs upto $100\%$ more robust against different attack strategies. We hope our findings lead to cheaper safeguarding of LLMs and motivate future investigations into designing variable system prompts.

large language model, machine learning, system prompt, (19 more...)

arXiv.org Artificial Intelligence

2506.15751

Genre: Research Report > New Finding (0.48)

Industry:

Information Technology > Security & Privacy (0.48)
Government > Military (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)

Add feedback

Safe at the Margins: A General Approach to Safety Alignment in Low-Resource English Languages -- A Singlish Case Study

Lim, Isaac, Khoo, Shaun, Chua, Watson, Jiayi, Goh, Foo, Jessica

arXiv.org Artificial IntelligenceFeb-17-2025

To ensure safe usage, Large Language Models (LLMs) typically undergo alignment with human-defined values. However, this alignment often relies on primarily English data and is biased towards Western-centric values, limiting its effectiveness in low-resource language settings. In this paper, we describe our approach for aligning SEA-Lion-v2.1-Instruct (a Llama3-8B variant) to minimize toxicity in Singlish, an English creole specific to Singapore. We find that supervised fine-tuning and Kahneman-Tversky Optimization (KTO) on paired and unpaired preferences is more sample efficient and yields significantly better results than Direct Preference Optimization (DPO). Our analysis reveals that DPO implicitly enforces a weaker safety objective than KTO, and that SFT complements KTO by improving training stability. Finally, we introduce a simple but novel modification to KTO, KTO-S, which improves training stability through better gradient exploitation. Overall, we present a general approach for safety alignment conducive to low-resource English languages, successfully reducing toxicity by 99\% on our Singlish benchmark, with gains generalizing to the broader TOXIGEN dataset while maintaining strong performance across standard LLM benchmarks.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2502.12485

Country:

Asia > Middle East (0.28)
Asia > Singapore (0.25)

Genre: Research Report (1.00)

Industry: Law Enforcement & Public Safety (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Add feedback

LLM Safety Alignment is Divergence Estimation in Disguise

Haldar, Rajdeep, Wang, Ziyi, Song, Qifan, Lin, Guang, Xing, Yue

arXiv.org Machine LearningFeb-1-2025

We propose a theoretical framework demonstrating that popular Large Language Model (LLM) alignment methods, including Reinforcement Learning from Human Feedback (RLHF) and alternatives, fundamentally function as divergence estimators between aligned (preferred or safe) and unaligned (less-preferred or harmful) distributions. This explains the separation phenomenon between safe and harmful prompts in the model hidden representation after alignment. Inspired by the theoretical results, we identify that some alignment methods are better than others in terms of separation and, introduce a new method, KLDO, and further demonstrate the implication of our theories. We advocate for compliance-refusal datasets over preference datasets to enhance safety alignment, supported by both theoretical reasoning and empirical evidence. Additionally, to quantify safety separation, we leverage a distance metric in the representation space and statistically validate its efficacy as a statistical significant indicator of LLM resilience against jailbreak attacks.

large language model, machine learning, natural language, (15 more...)

arXiv.org Machine Learning

2502.00657

Country:

North America > United States > Michigan (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Europe (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.68)

Add feedback

Refusing Safe Prompts for Multi-modal Large Language Models

Shao, Zedian, Liu, Hongbin, Hu, Yuepeng, Gong, Neil Zhenqiang

arXiv.org Artificial IntelligenceJul-12-2024

Multimodal large language models (MLLMs) have become the cornerstone of today's generative AI ecosystem, sparking intense competition among tech giants and startups. In particular, an MLLM generates a text response given a prompt consisting of an image and a question. While state-of-the-art MLLMs use safety filters and alignment techniques to refuse unsafe prompts, in this work, we introduce MLLM-Refusal, the first method that induces refusals for safe prompts. In particular, our MLLM-Refusal optimizes a nearly-imperceptible refusal perturbation and adds it to an image, causing target MLLMs to likely refuse a safe prompt containing the perturbed image and a safe question. Specifically, we formulate MLLM-Refusal as a constrained optimization problem and propose an algorithm to solve it. Our method offers competitive advantages for MLLM model providers by potentially disrupting user experiences of competing MLLMs, since competing MLLM's users will receive unexpected refusals when they unwittingly use these perturbed images in their prompts. We evaluate MLLM-Refusal on four MLLMs across four datasets, demonstrating its effectiveness in causing competing MLLMs to refuse safe prompts while not affecting non-competing MLLMs. Furthermore, we explore three potential countermeasures -- adding Gaussian noise, DiffPure, and adversarial training. Our results show that they are insufficient: though they can mitigate MLLM-Refusal's effectiveness, they also sacrifice the accuracy and/or efficiency of the competing MLLM. The code is available at https://github.com/Sadcardation/MLLM-Refusal.

mllm-refusal, refusal perturbation, shadow question, (13 more...)

arXiv.org Artificial Intelligence

2407.0905

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.86)

Industry:

Information Technology (0.48)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)

Add feedback

OR-Bench: An Over-Refusal Benchmark for Large Language Models

Cui, Justin, Chiang, Wei-Lin, Stoica, Ion, Hsieh, Cho-Jui

arXiv.org Artificial IntelligenceJun-20-2024

Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. While significant research focuses on mitigating harmful content generation, the enhanced safety often come with the side effect of over-refusal, where LLMs may reject innocuous prompts and become less helpful. Although the issue of over-refusal has been empirically observed, a systematic measurement is challenging due to the difficulty of crafting prompts that appear harmful but are benign. This study proposes a novel method for automatically generating large-scale sets of "seemingly toxic prompts" (benign prompts likely rejected by LLMs). Leveraging this technique, we introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench comprises 80,000 seemingly toxic prompts across 10 common rejection categories, a subset of around 1,000 hard prompts that are challenging even for state-of-the-art LLMs, and an additional 600 toxic prompts to prevent indiscriminate responses. We then conduct a comprehensive study to measure the over-refusal of 25 popular LLMs across 8 model families. Our datasets are available at https://huggingface.co/datasets/bench-llm/or-bench and the demo can be found at https://huggingface.co/spaces/bench-llm/or-bench. We hope this benchmark can help the community develop better safety aligned models.

arxiv preprint arxiv, language model, toxic prompt, (14 more...)

arXiv.org Artificial Intelligence

2405.20947

Country: North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre: Research Report > New Finding (0.92)

Industry:

Law (0.68)
Banking & Finance > Trading (0.68)
Education (0.67)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users

Li, Guanlin, Chen, Kangjie, Zhang, Shudong, Zhang, Jie, Zhang, Tianwei

arXiv.org Artificial IntelligenceJun-17-2024

Large-scale pre-trained generative models are taking the world by storm, due to their abilities in generating creative content. Meanwhile, safeguards for these generative models are developed, to protect users' rights and safety, most of which are designed for large language models. Existing methods primarily focus on jailbreak and adversarial attacks, which mainly evaluate the model's safety under malicious prompts. Recent work found that manually crafted safe prompts can unintentionally trigger unsafe generations. To further systematically evaluate the safety risks of text-to-image models, we propose a novel Automatic Red-Teaming framework, ART. Our method leverages both vision language model and large language model to establish a connection between unsafe generations and their prompts, thereby more efficiently identifying the model's vulnerabilities. With our comprehensive experiments, we reveal the toxicity of the popular open-source text-to-image models. The experiments also validate the effectiveness, adaptability, and great diversity of ART. Additionally, we introduce three large-scale red-teaming datasets for studying the safety risks associated with text-to-image models. Datasets and models can be found in https://github.com/GuanlinLee/ART.

category, safe prompt, text-to-image model, (14 more...)

arXiv.org Artificial Intelligence

2405.1936

Country:

South America (0.04)
North America (0.04)
Europe (0.04)
(2 more...)

Genre:

Research Report (1.00)
Instructional Material > Course Syllabus & Notes (0.34)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Law > Criminal Law (0.94)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.67)
Information Technology > Security & Privacy (0.66)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback