SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks

Li, Xiangman, Wu, Xiaodong, Li, Qi, Ni, Jianbing, Lu, Rongxing

arXiv.org Artificial Intelligence 

Jailbreak attacks pose a serious threat to the safety of Large Language Models (LLMs) by crafting adversarial prompts that bypass alignment mechanisms, causing the models to produce harmful, restricted, or biased content. In this paper, we propose SafeLLM, a novel unlearning-based defense framework that unlearns harmful knowledge from LLMs while preserving linguistic fluency and general capabilities. SafeLLM employs a three-stage pipeline: (1) dynamic unsafe output detection using a hybrid approach that integrates external classifiers with model-internal evaluations; (2) token-level harmful content tracing through feedforward network (FFN) activations to localize harmful knowledge; and (3) constrained optimization to suppress unsafe behavior without degrading overall model quality. SafeLLM achieves targeted and irreversible forgetting by identifying and neutralizing FFN substructures responsible for harmful generation pathways. Extensive experiments on prominent LLMs (Vicuna, LLaMA, and GPT-J) across multiple jailbreak benchmarks show that SafeLLM substantially reduces attack success rates while maintaining high general-purpose performance. Compared to standard defense methods such as supervised fine-tuning and direct preference optimization, SafeLLM offers stronger safety guarantees, more precise control over harmful behavior, and greater robustness to unseen attacks. Moreover, SafeLLM maintains its general performance after the harmful knowledge is unlearned.

Large Language Models (LLMs) are a class of foundation models trained on massive datasets, enabling them to understand and generate not only natural language but also a variety of other content types. These capabilities allow LLMs to perform a wide array of tasks, ranging from general-purpose language processing to domain-specific applications across fields such as healthcare, law, finance, and education. Built on deep learning architectures like Transformers, LLMs excel in tasks including summarization, translation, question answering, and sentiment analysis.
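
To make stages (2) and (3) of the pipeline described in the abstract more concrete, the following is a minimal toy sketch, not the SafeLLM implementation. It assumes a single tiny ReLU feedforward layer, scores each hidden neuron by how much more it activates on harmful inputs than on benign ones, and then zeroes the output weights of the top-scoring neuron. Zeroing weights is a deliberately simple stand-in for the constrained optimization the abstract mentions, and every name, weight, and input here (ffn_hidden, neuron_scores, suppress, the example vectors) is hypothetical.

```python
def relu(x):
    return max(0.0, x)

def ffn_hidden(x, w_in):
    """Hidden activations of a toy FFN layer: relu(W_in @ x)."""
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_in]

def ffn_output(x, w_in, w_out):
    """Output of the toy FFN layer: W_out @ relu(W_in @ x)."""
    h = ffn_hidden(x, w_in)
    return [sum(w * hi for w, hi in zip(row, h)) for row in w_out]

def neuron_scores(harmful_inputs, benign_inputs, w_in):
    """Per-neuron mean activation on harmful inputs minus mean on benign inputs."""
    def mean_activation(inputs):
        acts = [ffn_hidden(x, w_in) for x in inputs]
        return [sum(col) / len(acts) for col in zip(*acts)]
    harmful_mean = mean_activation(harmful_inputs)
    benign_mean = mean_activation(benign_inputs)
    return [h - b for h, b in zip(harmful_mean, benign_mean)]

def suppress(w_out, scores, k):
    """Zero the output weights of the k highest-scoring hidden neurons.

    Assumption: plain masking replaces the paper's constrained optimization.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    new_w_out = [[0.0 if j in top else w for j, w in enumerate(row)]
                 for row in w_out]
    return new_w_out, top

# Hypothetical toy layer: 2 inputs, 3 hidden neurons, 1 output.
w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_out = [[1.0, 1.0, 0.5]]

# Hypothetical token representations standing in for traced activations.
harmful_inputs = [[0.0, 1.0], [0.0, 2.0]]
benign_inputs = [[1.0, 0.0], [2.0, 0.0]]

scores = neuron_scores(harmful_inputs, benign_inputs, w_in)
new_w_out, top = suppress(w_out, scores, k=1)
```

In this toy setting the neuron that fires only on the harmful inputs is selected, so the layer's response to a harmful input shrinks while its response to a benign input is unchanged. A real model would need activations traced per token across many layers, and a retention constraint on general-purpose data rather than a hard mask.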
