Combating Adversarial Attacks with Multi-Agent Debate
Chern, Steffi, Fan, Zhen, Liu, Andy
–arXiv.org Artificial Intelligence
While state-of-the-art language models have achieved impressive results, they remain susceptible to inference-time adversarial attacks, such as adversarial prompts generated by red teams arXiv:2209.07858. One approach proposed to improve the general quality of language model generations is multi-agent debate, where language models self-evaluate through discussion and feedback arXiv:2305.14325. We implement multi-agent debate between current state-of-the-art language models and evaluate models' susceptibility to red team attacks in both single- and multi-agent settings. We find that multi-agent debate can reduce model toxicity when jailbroken or less capable models are forced to debate with non-jailbroken or more capable models. We also find marginal improvements through the general usage of multi-agent interactions. We further perform adversarial prompt content classification via embedding clustering, and analyze the susceptibility of different models to different types of attack topics.
arXiv.org Artificial Intelligence
Jan-11-2024
- Genre:
- Research Report (0.82)
- Industry:
- Government > Military (0.62)
- Information Technology > Security & Privacy (0.71)
- Technology: