Robustness of Large Language Models Against Adversarial Attacks
Yiyi Tao, Yixian Shen, Hang Zhang, Yanxin Shen, Lun Wang, Chuanqi Shi, Shaoshuai Du
In this paper, we present a comprehensive study on the robustness of the GPT LLM family. We employ two distinct evaluation methods to assess their resilience. The first method introduces character-level text attacks on input prompts, testing the models on three sentiment classification datasets: StanfordNLP/IMDB, Yelp Reviews, and SST-2. The second method uses jailbreak prompts to challenge the safety mechanisms of the LLMs. Our experiments reveal significant variations in the robustness of these models, demonstrating their varying degrees of vulnerability to both character-level and semantic-level adversarial attacks. These findings underscore the need for improved adversarial training and enhanced safety mechanisms to bolster the robustness of LLMs.
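As a rough illustration of the first evaluation method, the sketch below applies random character-level edits (swaps, deletions, substitutions, insertions) to a prompt before it is sent to a model. The function name, perturbation rate, and choice of edit operations are assumptions for illustration, not the paper's exact attack implementation.

```python
import random
import string

def char_level_attack(text: str, perturb_rate: float = 0.05, seed: int = 0) -> str:
    """Apply random character-level edits (swap, delete, substitute, insert)
    to roughly perturb_rate of the positions in the input text."""
    rng = random.Random(seed)
    chars = list(text)
    n_edits = max(1, int(len(chars) * perturb_rate))
    for _ in range(n_edits):
        i = rng.randrange(len(chars))
        op = rng.choice(["swap", "delete", "substitute", "insert"])
        if op == "swap" and i + 1 < len(chars):
            # Transpose two adjacent characters.
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        elif op == "delete" and len(chars) > 1:
            del chars[i]
        elif op == "substitute":
            chars[i] = rng.choice(string.ascii_lowercase)
        else:  # insert
            chars.insert(i, rng.choice(string.ascii_lowercase))
    return "".join(chars)

# Example: perturb a sentiment-classification prompt before querying the model.
clean = "Classify the sentiment of this review: 'The movie was wonderful.'"
print(char_level_attack(clean, perturb_rate=0.08))
```

The robustness comparison then amounts to measuring how much classification accuracy drops on the perturbed prompts relative to the clean ones.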
arXiv.org Artificial Intelligence
Dec-22-2024
- Country:
- North America > United States
  - California (0.15)
  - Oregon (0.14)
- Genre:
- Research Report > New Finding (0.67)
- Industry:
- Government > Military (0.62)
- Information Technology > Security & Privacy (0.72)
- Media (0.47)