Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing

Grimes, Keltin; Christiani, Marco; Shriver, David; Connor, Marissa

Dec-17-2024, arXiv.org Artificial Intelligence

Model editing methods modify specific behaviors of Large Language Models by altering a small, targeted set of network weights, and they require very little data and compute. These methods can be used for malicious applications such as inserting misinformation or simple trojans that result in adversary-specified behaviors when a trigger word is present. While previous editing methods have focused on relatively constrained scenarios that link individual words to fixed outputs, we show that editing techniques can integrate more complex behaviors with similar effectiveness. We develop Concept-ROT, a model editing-based method that efficiently inserts trojans which not only exhibit complex output behaviors, but also trigger on high-level concepts, presenting an entirely new class of trojan attacks. Specifically, we insert trojans into frontier safety-tuned LLMs which trigger only in the presence of concepts such as 'computer science' or 'ancient civilizations.' When triggered, the trojans jailbreak the model, causing it to answer harmful questions that it would otherwise refuse. Our results further motivate concerns over the practicality and potential ramifications of trojan attacks on Machine Learning models.

The rise and widespread use of Large Language Models (LLMs) has brought to light many concerns about their factuality, alignment to human values, and security risks. To explore unique vulnerabilities of LLMs, there has been much research into various methods to manipulate the information stored in, or the behaviors of, LLMs. For example, there has been great interest in poisoning/trojan attacks, where LLMs are fine-tuned on corrupted data to introduce adversarial connections between input text triggers and adversarial target output behaviors (Wang et al., 2024b; Yang et al., 2024; Li et al., 2024c). Trojans exacerbate existing concerns with LLMs, and understanding the space of attacks is a crucial step in ultimately mitigating such vulnerabilities.

Current trojan attacks targeting LLMs have two main drawbacks: they require fine-tuning LLMs with large amounts of data, which requires significant computational resources, and the poisoning is constrained to highly specific text triggers such as individual words or phrases (Yang et al., 2024). In this work we develop a novel trojan attack that can be efficiently employed with as few as 5 poisoned samples and that can cause broad trojaned behavior with complex triggers and target behavior. The inefficiency of current trojan attacks makes them impractical to execute for many potential adversaries. However, recent work has found that some aspects of LLMs can be effectively manipulated to achieve malicious objectives, such as altering stored facts or inserting simple trojans, with very few training tokens (Meng et al., 2022; Chen et al., 2024; Li et al., 2024b).

large language model, machine learning, public release and unlimited distribution

Country:
- Africa > Rwanda
  - Kigali > Kigali (0.04)
- Asia
  - Indonesia > Bali (0.04)
  - Singapore (0.04)
- Europe
  - Austria > Vienna (0.14)
  - France > Île-de-France
    - Paris > Paris (0.04)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
  - Italy > Tuscany
    - Florence (0.04)
- North America
  - Canada > British Columbia
    - Metro Vancouver Regional District > Vancouver (0.04)
  - Dominican Republic (0.04)
  - United States
    - New York > New York County
      - New York City (0.04)
    - Pennsylvania > Allegheny County
      - Pittsburgh (0.14)
- Oceania > Australia
  - Victoria > Melbourne (0.04)
- South America > Colombia
  - Meta Department > Villavicencio (0.04)

Genre:
- Research Report > New Finding (0.87)

Industry:
- Education (1.00)
- Information Technology > Security & Privacy (1.00)
- Media (0.87)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
  - Natural Language > Large Language Model (1.00)