Yi, Xin
Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks
Yi, Xin, Li, Yue, Wang, Linlin, Wang, Xiaoling, He, Liang
Ensuring safety alignment has become a critical requirement for large language models (LLMs), particularly given their widespread deployment in real-world applications. However, LLMs remain susceptible to jailbreak attacks, which exploit system vulnerabilities to bypass safety measures and generate harmful outputs. Although numerous defense mechanisms based on adversarial training have been proposed, a persistent challenge lies in the exacerbation of over-refusal behaviors, which compromise the overall utility of the model. To address these challenges, we propose a Latent-space Adversarial Training with Post-aware Calibration (LATPC) framework. During the adversarial training phase, LATPC compares harmful and harmless instructions in the latent space and extracts safety-critical dimensions to construct refusal features attack, precisely simulating agnostic jailbreak attack types requiring adversarial mitigation. At the inference stage, an embedding-level calibration mechanism is employed to alleviate over-refusal behaviors with minimal computational overhead. Experimental results demonstrate that, compared to various defense methods across five types of jailbreak attacks, LATPC framework achieves a superior balance between safety and utility. Moreover, our analysis underscores the effectiveness of extracting safety-critical dimensions from the latent space for constructing robust refusal feature attacks.
NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning
Yi, Xin, Zheng, Shunfan, Wang, Linlin, de Melo, Gerard, Wang, Xiaoling, He, Liang
The emergence of finetuning-as-a-service has revealed a new vulnerability in large language models (LLMs). A mere handful of malicious data uploaded by users can subtly manipulate the finetuning process, resulting in an alignment-broken model. Existing methods to counteract fine-tuning attacks typically require substantial computational resources. Even with parameter-efficient techniques like LoRA, gradient updates remain essential. To address these challenges, we propose \textbf{N}euron-\textbf{L}evel \textbf{S}afety \textbf{R}ealignment (\textbf{NLSR}), a training-free framework that restores the safety of LLMs based on the similarity difference of safety-critical neurons before and after fine-tuning. The core of our framework is first to construct a safety reference model from an initially aligned model to amplify safety-related features in neurons. We then utilize this reference model to identify safety-critical neurons, which we prepare as patches. Finally, we selectively restore only those neurons that exhibit significant similarity differences by transplanting these prepared patches, thereby minimally altering the fine-tuned model. Extensive experiments demonstrate significant safety enhancements in fine-tuned models across multiple downstream tasks, while greatly maintaining task-level accuracy. Our findings suggest regions of some safety-critical neurons show noticeable differences after fine-tuning, which can be effectively corrected by transplanting neurons from the reference model without requiring additional training. The code will be available at \url{https://github.com/xinykou/NLSR}
A safety realignment framework via subspace-oriented model fusion for large language models
Yi, Xin, Zheng, Shunfan, Wang, Linlin, Wang, Xiaoling, He, Liang
The current safeguard mechanisms for large language models (LLMs) are indeed susceptible to jailbreak attacks, making them inherently fragile. Even the process of fine-tuning on apparently benign data for downstream tasks can jeopardize safety. One potential solution is to conduct safety fine-tuning subsequent to downstream fine-tuning. However, there's a risk of catastrophic forgetting during safety fine-tuning, where LLMs may regain safety measures but lose the task-specific knowledge acquired during downstream fine-tuning. In this paper, we introduce a safety realignment framework through subspace-oriented model fusion (SOMF), aiming to combine the safeguard capabilities of initially aligned model and the current fine-tuned model into a realigned model. Our approach begins by disentangling all task vectors from the weights of each fine-tuned model. We then identify safety-related regions within these vectors by subspace masking techniques. Finally, we explore the fusion of the initial safely aligned LLM with all task vectors based on the identified safety subspace. We validate that our safety realignment framework satisfies the safety requirements of a single fine-tuned model as well as multiple models during their fusion. Our findings confirm that SOMF preserves safety without notably compromising performance on downstream tasks, including instruction following in Chinese, English, and Hindi, as well as problem-solving capabilities in Code and Math.
Fine-Grained Detoxification via Instance-Level Prefixes for Large Language Models
Yi, Xin, Wang, Linlin, Wang, Xiaoling, He, Liang
Impressive results have been achieved in natural language processing (NLP) tasks through the training of large language models (LLMs). However, these models occasionally produce toxic content such as insults, threats, and profanity in response to certain prompts, thereby constraining their practical utility. To tackle this issue, various finetuning-based and decoding-based approaches have been utilized to mitigate toxicity. However, these methods typically necessitate additional costs such as high-quality training data or auxiliary models. In this paper, we propose fine-grained detoxification via instance-level prefixes (FGDILP) to mitigate toxic text without additional cost. Specifically, FGDILP contrasts the contextualized representation in attention space using a positive prefix-prepended prompt against multiple negative prefix-prepended prompts at the instance level. This allows for constructing fine-grained subtoxicity vectors, which enables collaborative detoxification by fusing them to correct the normal generation process when provided with a raw prompt. We validate that FGDILP enables controlled text generation with regard to toxicity at both the utterance and context levels. Our method surpasses prompt-based baselines in detoxification, although at a slight cost to generation fluency and diversity.
Modeling the Trade-off of Privacy Preservation and Activity Recognition on Low-Resolution Images
Wang, Yuntao, Cheng, Zirui, Yi, Xin, Kong, Yan, Wang, Xueyang, Xu, Xuhai, Yan, Yukang, Yu, Chun, Patel, Shwetak, Shi, Yuanchun
A computer vision system using low-resolution image sensors can provide intelligent services (e.g., activity recognition) but preserve unnecessary visual privacy information from the hardware level. However, preserving visual privacy and enabling accurate machine recognition have adversarial needs on image resolution. Modeling the trade-off of privacy preservation and machine recognition performance can guide future privacy-preserving computer vision systems using low-resolution image sensors. In this paper, using the at-home activity of daily livings (ADLs) as the scenario, we first obtained the most important visual privacy features through a user survey. Then we quantified and analyzed the effects of image resolution on human and machine recognition performance in activity recognition and privacy awareness tasks. We also investigated how modern image super-resolution techniques influence these effects. Based on the results, we proposed a method for modeling the trade-off of privacy preservation and activity recognition on low-resolution images.
Automatic classification of multiple catheters in neonatal radiographs with deep learning
Henderson, Robert D. E., Yi, Xin, Adams, Scott J., Babyn, Paul
We develop and evaluate a deep learning algorithm to classify multiple catheters on neonatal chest and abdominal radiographs. A convolutional neural network (CNN) was trained using a dataset of 777 neonatal chest and abdominal radiographs, with a split of 81%-9%-10% for training-validation-testing, respectively. We employed ResNet-50 (a CNN), pre-trained on ImageNet. Ground truth labelling was limited to tagging each image to indicate the presence or absence of endotracheal tubes (ETTs), nasogastric tubes (NGTs), and umbilical arterial and venous catheters (UACs, UVCs). The data set included 561 images containing 2 or more catheters, 167 images with only one, and 49 with none. Performance was measured with average precision (AP), calculated from the area under the precision-recall curve. On our test data, the algorithm achieved an overall AP (95% confidence interval) of 0.977 (0.679-0.999) for NGTs, 0.989 (0.751-1.000) for ETTs, 0.979 (0.873-0.997) for UACs, and 0.937 (0.785-0.984) for UVCs. Performance was similar for the set of 58 test images consisting of 2 or more catheters, with an AP of 0.975 (0.255-1.000) for NGTs, 0.997 (0.009-1.000) for ETTs, 0.981 (0.797-0.998) for UACs, and 0.937 (0.689-0.990) for UVCs. Our network thus achieves strong performance in the simultaneous detection of these four catheter types. Radiologists may use such an algorithm as a time-saving mechanism to automate reporting of catheters on radiographs.