

Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory

arXiv.org Machine Learning

Adversarial training (AT) is an effective defense for large language models (LLMs) against jailbreak attacks, but performing AT on LLMs is costly. To improve the efficiency of AT for LLMs, recent studies propose continuous AT (CAT) that searches for adversarial inputs within the continuous embedding space of LLMs during AT. While CAT has achieved empirical success, its underlying mechanism, i.e., why adversarial perturbations in the embedding space can help LLMs defend against jailbreak prompts synthesized in the input token space, remains unknown. This paper presents the first theoretical analysis of CAT on LLMs based on in-context learning (ICL) theory. For linear transformers trained with adversarial examples from the embedding space on in-context linear regression tasks, we prove a robust generalization bound that has a negative correlation with the perturbation radius in the embedding space. This clearly explains why CAT can defend against jailbreak prompts from the LLM's token space. Further, the robust bound shows that the robustness of an adversarially trained LLM is closely related to the singular values of its embedding matrix. Based on this, we propose to improve LLM CAT by introducing an additional regularization term, which depends on singular values of the LLM's embedding matrix, into the objective function of CAT. Experiments on real-world LLMs demonstrate that our method can help LLMs achieve a better jailbreak robustness-utility tradeoff. The code is available at https://github.com/fshp971/continuous-adv-icl.
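The idea of searching for adversarial inputs in a continuous embedding space can be illustrated on a toy problem. The sketch below is not the paper's method or code: it assumes a plain linear predictor with squared loss rather than a linear transformer, an L2 ball of a chosen radius as the perturbation set, and hypothetical names (`squared_loss`, `embedding_attack`, `radius`). It takes one normalized-gradient step on the input vector, which stands in for a token embedding, and compares the loss before and after.

```python
import math


def squared_loss(w, x, y):
    """Squared error of the linear predictor w.x against the target y."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return (pred - y) ** 2


def embedding_attack(w, x, y, radius):
    """Move x by one normalized-gradient ascent step of L2 length `radius`.

    Assumption: x plays the role of a continuous embedding, and the
    perturbation is constrained to an L2 ball around the clean input.
    """
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    grad = [2.0 * residual * wi for wi in w]  # gradient of the loss w.r.t. x
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        # Zero gradient: no ascent direction, so return the input unchanged.
        return list(x)
    return [xi + radius * g / norm for xi, g in zip(x, grad)]


# Toy values chosen only for illustration.
w = [0.5, -1.0, 2.0]
x = [1.0, 0.0, 1.0]
y = 1.0
radius = 0.1

x_adv = embedding_attack(w, x, y, radius)
clean_loss = squared_loss(w, x, y)
adv_loss = squared_loss(w, x_adv, y)
shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_adv, x)))
```

In this toy setting the perturbed loss `adv_loss` exceeds `clean_loss` while `shift` stays within `radius`. Adversarial training would then minimize the loss on `x_adv` instead of `x`; the regularization term on the embedding matrix's singular values that the abstract proposes is not reproduced here, since its form is not given in this listing.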


SAFE Trained Models

Neural Information Processing Systems

After calibrating in the first session, the slow efficient tuning parameters can capture more informative features, improving generalization to incoming classes. Moreover, to further incorporate novel concepts, we strike a balance between stability and plasticity by fixing slow efficient tuning parameters and continuously updating the fast ones. Specifically, a cross-classification loss with feature alignment is proposed to circumvent catastrophic forgetting.






IntroVAE: Introspective Variational Autoencoders for Photographic Image Synthesis

Neural Information Processing Systems

We present a novel introspective variational autoencoder (IntroVAE) model for synthesizing high-resolution photographic images. IntroVAE is capable of self-evaluating the quality of its generated samples and improving itself accordingly.


4c5bcfec8584af0d967f1ab10179ca4b-AuthorFeedback.pdf

Neural Information Processing Systems

For more reliable comparison, we repeat experiments for 100 random seeds instead of 10. "init tune" denotes tuning σ and choosing between N or U (see Figure 1 at the bottom); tuning is done in the same way as for other hyperparameters. We will also add results of GCN supporting our conclusions (Table 1 and Figure 1). Note that in Table 1 of the submitted paper, for COLORS and MNIST-75sp, ChebyGINs are equivalent to ChebyNets as described in Table 1 of the Supplementary material and elaborated on following that table (see footnote 3). In our model, the features are weighted by attention scores according to Eq. 3, so it is soft. In this case, the features indeed reduce their scale.


Improving Self-supervised Learning with Automated Unsupervised Outlier Arbitration

Neural Information Processing Systems

UOTA adaptively searches for the most important sampling region to produce views, and provides a viable choice for outlier-robust self-supervised learning approaches.