icl
Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory
Adversarial training (AT) is an effective defense for large language models (LLMs) against jailbreak attacks, but performing AT on LLMs is costly. To improve the efficiency of AT for LLMs, recent studies propose continuous AT (CAT) that searches for adversarial inputs within the continuous embedding space of LLMs during AT. While CAT has achieved empirical success, its underlying mechanism, i.e., why adversarial perturbations in the embedding space can help LLMs defend against jailbreak prompts synthesized in the input token space, remains unknown. This paper presents the first theoretical analysis of CAT on LLMs based on in-context learning (ICL) theory. For linear transformers trained with adversarial examples from the embedding space on in-context linear regression tasks, we prove a robust generalization bound that has a negative correlation with the perturbation radius in the embedding space. This clearly explains why CAT can defend against jailbreak prompts from the LLM's token space. Further, the robust bound shows that the robustness of an adversarially trained LLM is closely related to the singular values of its embedding matrix. Based on this, we propose to improve LLM CAT by introducing an additional regularization term, which depends on singular values of the LLM's embedding matrix, into the objective function of CAT. Experiments on real-world LLMs demonstrate that our method can help LLMs achieve a better jailbreak robustness-utility tradeoff. The code is available at https://github.com/fshp971/continuous-adv-icl.
- Asia > Middle East > Jordan (0.04)
- North America > United States > Michigan (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > France (0.04)
- Research Report > Experimental Study (0.92)
- Research Report > New Finding (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.40)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Jordan (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
- North America > United States > Texas > Travis County > Austin (0.04)
- Asia > Middle East > Jordan (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.92)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.97)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
- Information Technology > Artificial Intelligence > Vision > Image Understanding (0.67)
- Europe > Russia (0.04)
- Asia > Russia (0.04)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- (5 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.93)
- Government (0.93)
- Retail (0.68)
- Education (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)
- Europe > Austria > Vienna (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
- Asia > Middle East > Jordan (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- Europe > United Kingdom > England (0.04)
- Europe > Spain > Galicia > Madrid (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
- Education (0.67)
- Leisure & Entertainment (0.67)
- Media (0.46)
- Information Technology (0.46)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- North America > United States > Michigan (0.04)
- North America > United States > California > Riverside County > Riverside (0.04)
- Asia > Middle East > Jordan (0.04)
- Workflow (1.00)
- Research Report > New Finding (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.95)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)