Efficient Adversarial Training in LLMs with Continuous Attacks

May-26-2025, 14:56:48 GMT–Neural Information Processing Systems

Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails. In many domains, adversarial training has proven to be one of the most promising methods to reliably improve robustness against such attacks. Yet, in the context of LLMs, current methods for adversarial training are hindered by the high computational cost of performing discrete adversarial attacks at each training iteration. We address this problem by instead calculating adversarial attacks in the continuous embedding space of the LLM, which is orders of magnitude more efficient. We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses: the first makes the model robust to continuous embedding attacks computed on an adversarial behaviour dataset; the second ensures the usefulness of the final model by fine-tuning on utility data.
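The two-loss idea in the abstract can be illustrated with a toy sketch. This is not the authors' C-AdvUL implementation: the abstract does not specify the perturbation norm, step size, number of attack steps, or the exact form of either loss, so all of those are assumptions here. A small logistic model over a single embedding vector stands in for the LLM, the attack is signed gradient ascent inside an assumed L-infinity ball, and every function name is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(w, e, y):
    """Negative log-likelihood of label y (0 or 1) under a toy logistic model."""
    p = sigmoid(sum(wi * ei for wi, ei in zip(w, e)))
    return -math.log(p if y == 1 else 1.0 - p)

def nll_grad_wrt_input(w, e, y):
    """Analytic gradient of nll with respect to the embedding e."""
    p = sigmoid(sum(wi * ei for wi, ei in zip(w, e)))
    return [(p - y) * wi for wi in w]

def sign(x):
    return (x > 0) - (x < 0)

def continuous_attack(w, e, y, eps=0.5, step=0.1, n_steps=10):
    """Assumed attack: signed gradient ascent on the loss, with the
    perturbation clipped to an L-infinity ball of radius eps."""
    delta = [0.0] * len(e)
    for _ in range(n_steps):
        e_adv = [ei + di for ei, di in zip(e, delta)]
        g = nll_grad_wrt_input(w, e_adv, y)
        delta = [max(-eps, min(eps, di + step * sign(gi)))
                 for di, gi in zip(delta, g)]
    return delta

def combined_loss(w, e_harm, y_safe, e_util, y_util, eps=0.5):
    """Assumed objective: loss on the attacked embedding of an adversarial
    behaviour example plus loss on an unperturbed utility example."""
    delta = continuous_attack(w, e_harm, y_safe, eps=eps)
    e_adv = [ei + di for ei, di in zip(e_harm, delta)]
    robustness_term = nll(w, e_adv, y_safe)   # first loss: robustness
    utility_term = nll(w, e_util, y_util)     # second loss: usefulness
    return robustness_term + utility_term
```

The attack needs only gradients with respect to the input embedding, so each step costs one forward and backward pass rather than a search over discrete token substitutions. In a real training loop the combined loss would then be minimised with respect to the model parameters, and the two terms would likely be weighted; the equal weighting above is an arbitrary choice for the sketch.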

efficient adversarial training, large language model, natural language

Genre:
- Research Report (0.60)

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)