Gao, Haichang
HumorReject: Decoupling LLM Safety from Refusal Prefix via A Little Humor
Wu, Zihui, Gao, Haichang, Luo, Jiacheng, Liu, Zhaoxiang
Large Language Models (LLMs) commonly rely on explicit refusal prefixes for safety, making them vulnerable to prefix injection attacks. We introduce HumorReject, a novel data-driven approach that reimagines LLM safety by decoupling it from refusal prefixes through humor as an indirect refusal strategy. Rather than explicitly rejecting harmful instructions, HumorReject responds with contextually appropriate humor that naturally defuses potentially dangerous requests. Our approach effectively addresses common "over-defense" issues while demonstrating superior robustness against various attack vectors. Our findings suggest that improvements in training data design can be as important as the alignment algorithm itself in achieving effective LLM safety.
GlitchMiner: Mining Glitch Tokens in Large Language Models via Gradient-based Discrete Optimization
Wu, Zihui, Gao, Haichang, Wang, Ping, Zhang, Shudong, Liu, Zhaoxiang, Lian, Shiguo
Glitch tokens in Large Language Models (LLMs) can trigger unpredictable behaviors, threatening model reliability and safety. Existing detection methods rely on predefined patterns, limiting their adaptability across diverse LLM architectures. We propose GlitchMiner, a gradient-based discrete optimization framework that efficiently identifies glitch tokens by introducing entropy as a measure of prediction uncertainty and employing a local search strategy to explore the token space. Experiments across multiple LLM architectures demonstrate that GlitchMiner outperforms existing methods in detection accuracy and adaptability, achieving over 10% average efficiency improvement. This method enhances vulnerability assessment in LLMs, contributing to the development of more robust and reliable applications.
Lower Difficulty and Better Robustness: A Bregman Divergence Perspective for Adversarial Training
Wu, Zihui, Gao, Haichang, Zhou, Bingqian, Guo, Xiaoyan, Zhang, Shudong
In this paper, we investigate on improving the adversarial robustness obtained in adversarial training (AT) via reducing the difficulty of optimization. To better study this problem, we build a novel Bregman divergence perspective for AT, in which AT can be viewed as the sliding process of the training data points on the negative entropy curve. Based on this perspective, we analyze the learning objectives of two typical AT methods, i.e., PGD-AT and TRADES, and we find that the optimization process of TRADES is easier than PGD-AT for that TRADES separates PGD-AT. In addition, we discuss the function of entropy in TRADES, and we find that models with high entropy can be better robustness learners. Inspired by the above findings, we propose two methods, i.e., FAIT and MER, which can both not only reduce the difficulty of optimization under the 10-step PGD adversaries, but also provide better robustness. Our work suggests that reducing the difficulty of optimization under the 10-step PGD adversaries is a promising approach for enhancing the adversarial robustness in AT.
AdvFunMatch: When Consistent Teaching Meets Adversarial Robustness
Wu, Zihui, Gao, Haichang, Zhou, Bingqian, Wang, Ping
\emph{Consistent teaching} is an effective paradigm for implementing knowledge distillation (KD), where both student and teacher models receive identical inputs, and KD is treated as a function matching task (FunMatch). However, one limitation of FunMatch is that it does not account for the transfer of adversarial robustness, a model's resistance to adversarial attacks. To tackle this problem, we propose a simple but effective strategy called Adversarial Function Matching (AdvFunMatch), which aims to match distributions for all data points within the $\ell_p$-norm ball of the training data, in accordance with consistent teaching. Formulated as a min-max optimization problem, AdvFunMatch identifies the worst-case instances that maximizes the KL-divergence between teacher and student model outputs, which we refer to as "mismatched examples," and then matches the outputs on these mismatched examples. Our experimental results show that AdvFunMatch effectively produces student models with both high clean accuracy and robustness. Furthermore, we reveal that strong data augmentations (\emph{e.g.}, AutoAugment) are beneficial in AdvFunMatch, whereas prior works have found them less effective in adversarial training. Code is available at \url{https://gitee.com/zihui998/adv-fun-match}.