Chen, Kejiang
A Closer Look at Machine Unlearning for Large Language Models
Yuan, Xiaojian, Pang, Tianyu, Du, Chao, Chen, Kejiang, Zhang, Weiming, Lin, Min
Due to the high cost of retraining from scratch, researchers attempt to employ machine unlearning to remove specific content from LLMs while preserving the overall performance. In this paper, we discuss several issues in machine unlearning for LLMs and provide our insights on possible approaches. To address the issue of inadequate evaluation of model outputs after unlearning, we introduce three additional metrics to evaluate token diversity, sentence semantics, and factual correctness. We then categorize unlearning methods into untargeted and targeted, and discuss their issues respectively. Specifically, the behavior that untargeted unlearning attempts to approximate is unpredictable and may involve hallucinations, and existing regularization is insufficient for targeted unlearning. To alleviate these issues, we propose using the objective of maximizing entropy (ME) for untargeted unlearning and incorporate answer preservation (AP) loss as regularization for targeted unlearning. Experimental results across three scenarios, i.e., fictitious unlearning, continual unlearning, and real-world unlearning, demonstrate the effectiveness of our approaches. In recent years, large language models (LLMs) have undergone rapid development, demonstrating impressive capabilities across a wide range of applications, from natural language processing to complex problem-solving. These concerns are particularly relevant within legal and regulatory frameworks, such as the Right to be Forgotten (Dang, 2021), which aims to empower individuals to have unauthorized data erased from digital records. Addressing these issues is crucial for ensuring the responsible deployment of LLMs in real-world applications. Due to the high cost of retraining LLMs, researchers have explored machine unlearning techniques, namely LLM unlearning (Cao & Yang, 2015; Bourtoule et al., 2021; Yao et al., 2023). The typical paradigm involves fine-tuning the target LLM on a specified set, known as the forget set, to obtain an unlearned model. As described in (Maini et al., 2024; Jin et al., 2024), the unlearned model should meet two primary goals: 1) it should not reveal any information contained in the forget set, and 2) it should maintain performance on the neighbor set, which has a distribution similar to the forget set but is not the target of unlearning, as well as on other tasks with general knowledge. While the first goal is generally easier to achieve, the main challenge lies in meeting the second goal (Liu et al., 2024b; Maini et al., 2024; Zhang et al., 2024a; Ji et al., 2024; Shi et al., 2024a; Wang et al., 2024c). In this paper, we have a closer look at machine unlearning for LLMs. We note that most prior studies (Maini et al., 2024; Ji et al., 2024; Jia et al., 2024; Jin et al., 2024; Shi et al., 2024a) primarily rely on ROUGE (Lin, 2004) as the sole metric for evaluating the output of unlearned models.
AutoPT: How Far Are We from the End2End Automated Web Penetration Testing?
Wu, Benlong, Chen, Guoqiang, Chen, Kejiang, Shang, Xiuwei, Han, Jiapeng, He, Yanru, Zhang, Weiming, Yu, Nenghai
Penetration testing is essential to ensure Web security, which can detect and fix vulnerabilities in advance, and prevent data leakage and serious consequences. The powerful inference capabilities of large language models (LLMs) have made significant progress in various fields, and the development potential of LLM-based agents can revolutionize the cybersecurity penetration testing industry. In this work, we establish a comprehensive end-to-end penetration testing benchmark using a real-world penetration testing environment to explore the capabilities of LLM-based agents in this domain. Our results reveal that the agents are familiar with the framework of penetration testing tasks, but they still face limitations in generating accurate commands and executing complete processes. Accordingly, we summarize the current challenges, including the difficulty of maintaining the entire message history and the tendency for the agent to become stuck. Based on the above insights, we propose a Penetration testing State Machine (PSM) that utilizes the Finite State Machine (FSM) methodology to address these limitations. Then, we introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs, which utilizes the inherent inference ability of LLM and the constraint framework of state machines. Our evaluation results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model and improves the task completion rate from 22% to 41% on the benchmark target. Compared with the baseline framework and manual work, AutoPT also reduces time and economic costs further. Hence, our AutoPT has facilitated the development of automated penetration testing and significantly impacted both academia and industry.
Provably Secure Disambiguating Neural Linguistic Steganography
Qi, Yuang, Chen, Kejiang, Zeng, Kai, Zhang, Weiming, Yu, Nenghai
Recent research in provably secure neural linguistic steganography has overlooked a crucial aspect: the sender must detokenize stegotexts to avoid raising suspicion from the eavesdropper. The segmentation ambiguity problem, which arises when using language models based on subwords, leads to occasional decoding failures in all neural language steganography implementations based on these models. Current solutions to this issue involve altering the probability distribution of candidate words, rendering them incompatible with provably secure steganography. We propose a novel secure disambiguation method named SyncPool, which effectively addresses the segmentation ambiguity problem. We group all tokens with prefix relationships in the candidate pool before the steganographic embedding algorithm runs to eliminate uncertainty among ambiguous tokens. To enable the receiver to synchronize the sampling process of the sender, a shared cryptographically-secure pseudorandom number generator (CSPRNG) is deployed to select a token from the ambiguity pool. SyncPool does not change the size of the candidate pool or the distribution of tokens and thus is applicable to provably secure language steganography methods. We provide theoretical proofs and experimentally demonstrate the applicability of our solution to various languages and models, showing its potential to significantly improve the reliability and security of neural linguistic steganography systems.
Data-Free Hard-Label Robustness Stealing Attack
Yuan, Xiaojian, Chen, Kejiang, Huang, Wen, Zhang, Jie, Zhang, Weiming, Yu, Nenghai
The popularity of Machine Learning as a Service (MLaaS) has led to increased concerns about Model Stealing Attacks (MSA), which aim to craft a clone model by querying MLaaS. Currently, most research on MSA assumes that MLaaS can provide soft labels and that the attacker has a proxy dataset with a similar distribution. However, this fails to encapsulate the more practical scenario where only hard labels are returned by MLaaS and the data distribution remains elusive. Furthermore, most existing work focuses solely on stealing the model accuracy, neglecting the model robustness, while robustness is essential in security-sensitive scenarios, e.g., face-scan payment. Notably, improving model robustness often necessitates the use of expensive techniques such as adversarial training, thereby further making stealing robustness a more lucrative prospect. In response to these identified gaps, we introduce a novel Data-Free Hard-Label Robustness Stealing (DFHL-RS) attack in this paper, which enables the stealing of both model accuracy and robustness by simply querying hard labels of the target model without the help of any natural data. Comprehensive experiments demonstrate the effectiveness of our method. The clone model achieves a clean accuracy of 77.86% and a robust accuracy of 39.51% against AutoAttack, which are only 4.71% and 8.40% lower than the target model on the CIFAR-10 dataset, significantly exceeding the baselines. Our code is available at: https://github.com/LetheSec/DFHL-RS-Attack.
Control Risk for Potential Misuse of Artificial Intelligence in Science
He, Jiyan, Feng, Weitao, Min, Yaosen, Yi, Jingwei, Tang, Kunsheng, Li, Shuai, Zhang, Jie, Chen, Kejiang, Zhou, Wenbo, Xie, Xing, Zhang, Weiming, Yu, Nenghai, Zheng, Shuxin
The expanding application of Artificial Intelligence (AI) in scientific fields presents unprecedented opportunities for discovery and innovation. However, this growth is not without risks. AI models in science, if misused, can amplify risks like creation of harmful substances, or circumvention of established regulations. In this study, we aim to raise awareness of the dangers of AI misuse in science, and call for responsible AI development and use in this domain. We first itemize the risks posed by AI in scientific contexts, then demonstrate the risks by highlighting real-world examples of misuse in chemical science. These instances underscore the need for effective risk management strategies. In response, we propose a system called SciGuard to control misuse risks for AI models in science. We also propose a red-teaming benchmark SciMT-Safety to assess the safety of different systems. Our proposed SciGuard shows the least harmful impact in the assessment without compromising performance in benign tests. Finally, we highlight the need for a multidisciplinary and collaborative effort to ensure the safe and ethical use of AI models in science. We hope that our study can spark productive discussions on using AI ethically in science among researchers, practitioners, policymakers, and the public, to maximize benefits and minimize the risks of misuse.
GPT Paternity Test: GPT Generated Text Detection with GPT Genetic Inheritance
Yu, Xiao, Qi, Yuang, Chen, Kejiang, Chen, Guoqiang, Yang, Xi, Zhu, Pengyuan, Zhang, Weiming, Yu, Nenghai
Large Language Models (LLMs) can generate texts that carry the risk of various misuses, including plagiarism, planting fake reviews on e-commerce platforms, or creating fake social media postings that can sway election results. Detecting whether a text is machine-generated has thus become increasingly important. While machine-learning-based detection strategies exhibit superior performance, they often lack generalizability, limiting their practicality. In this work, we introduce GPT Paternity Test (GPT-Pat), which reliably detects machine-generated text across varied datasets. Given a text under scrutiny, we leverage ChatGPT to generate a corresponding question and provide a re-answer to the question. By comparing the similarity between the original text and the generated re-answered text, it can be determined whether the text is machine-generated. GPT-Pat consists of a Siamese network to compute the similarity between the original text and the generated re-answered text and a binary classifier. Our method achieved an average accuracy of 94.57% on four generalization test sets, surpassing the state-of-the-art RoBERTa-based method by 12.34%. The accuracy drop of our method is only about half of that of the RoBERTa-based method when it is attacked by re-translation and polishing.
Watermarking Text Generated by Black-Box Language Models
Yang, Xi, Chen, Kejiang, Zhang, Weiming, Liu, Chang, Qi, Yuang, Zhang, Jie, Fang, Han, Yu, Nenghai
LLMs now exhibit human-like skills in various fields, leading to worries about misuse. Thus, detecting generated text is crucial. However, passive detection methods are stuck in domain specificity and limited adversarial robustness. To achieve reliable detection, a watermark-based method was proposed for white-box LLMs, allowing them to embed watermarks during text generation. The method involves randomly dividing the model vocabulary to obtain a special list and adjusting the probability distribution to promote the selection of words in the list. A detection algorithm aware of the list can identify the watermarked text. However, this method is not applicable in many real-world scenarios where only black-box language models are available. For instance, third-parties that develop API-based vertical applications cannot watermark text themselves because API providers only supply generated text and withhold probability distributions to shield their commercial interests. To allow third-parties to autonomously inject watermarks into generated text, we develop a watermarking framework for black-box language model usage scenarios. Specifically, we first define a binary encoding function to compute a random binary encoding corresponding to a word. The encodings computed for non-watermarked text conform to a Bernoulli distribution, wherein the probability of a word representing bit-1 being approximately 0.5. To inject a watermark, we alter the distribution by selectively replacing words representing bit-0 with context-based synonyms that represent bit-1. A statistical test is then used to identify the watermark. Experiments demonstrate the effectiveness of our method on both Chinese and English datasets. Furthermore, results under re-translation, polishing, word deletion, and synonym substitution attacks reveal that it is arduous to remove the watermark without compromising the original semantics.
Pseudo Label-Guided Model Inversion Attack via Conditional Generative Adversarial Network
Yuan, Xiaojian, Chen, Kejiang, Zhang, Jie, Zhang, Weiming, Yu, Nenghai, Zhang, Yang
Model inversion (MI) attacks have raised increasing concerns about privacy, which can reconstruct training data from public models. Indeed, MI attacks can be formalized as an optimization problem that seeks private data in a certain space. Recent MI attacks leverage a generative adversarial network (GAN) as an image prior to narrow the search space, and can successfully reconstruct even the high-dimensional data (e.g., face images). However, these generative MI attacks do not fully exploit the potential capabilities of the target model, still leading to a vague and coupled search space, i.e., different classes of images are coupled in the search space. Besides, the widely used cross-entropy loss in these attacks suffers from gradient vanishing. To address these problems, we propose Pseudo Label-Guided MI (PLG-MI) attack via conditional GAN (cGAN). At first, a top-n selection strategy is proposed to provide pseudo-labels for public data, and use pseudo-labels to guide the training of the cGAN. In this way, the search space is decoupled for different classes of images. Then a max-margin loss is introduced to improve the search process on the subspace of a target class. Extensive experiments demonstrate that our PLG-MI attack significantly improves the attack success rate and visual quality for various datasets and models, notably, 2~3 $\times$ better than state-of-the-art attacks under large distributional shifts. Our code is available at: https://github.com/LetheSec/PLG-MI-Attack.