BadGPT: Exploring Security Vulnerabilities of ChatGPT via Backdoor Attacks to InstructGPT

Shi, Jiawen, Liu, Yixin, Zhou, Pan, Sun, Lichao

Feb-21-2023–arXiv.org Artificial Intelligence

Recently, ChatGPT has gained significant attention in research due to its ability to interact with humans effectively. The core idea behind this model is reinforcement learning (RL) fine-tuning, a new paradigm that allows language models to align with human preferences, i.e., InstructGPT. In this study, we propose BadGPT, the first backdoor attack against RL fine-tuning in language models. By injecting a backdoor into the reward model, the language model can be compromised during the fine-tuning stage. Our initial experiments on movie reviews, i.e., IMDB, demonstrate that an attacker can manipulate the generated text through BadGPT.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

Feb-21-2023

arXiv.org PDF

Add feedback

Country:
- North America > United States
  - Pennsylvania > Northampton County > Bethlehem (0.05)
- Asia > China
  - Hubei Province > Wuhan (0.05)

Genre:
- Research Report > New Finding (0.49)

Industry:
- Information Technology > Security & Privacy (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found