Ignore Previous Prompt: Attack Techniques For Language Models
–arXiv.org Artificial Intelligence
Transformer-based large language models (LLMs) provide a powerful foundation for natural language tasks in large-scale customer-facing applications. However, studies that explore their vulnerabilities emerging from malicious user interaction are scarce. By proposing PromptInject, a prosaic alignment framework for mask-based iterative adversarial prompt composition, we examine how GPT-3, the most widely deployed language model in production, can be easily misaligned by simple handcrafted inputs. In particular, we investigate two types of attacks -- goal hijacking and prompt leaking -- and demonstrate that even low-aptitude, but sufficiently ill-intentioned agents, can easily exploit GPT-3's stochastic nature, creating long-tail risks. The code for PromptInject is available at https://github.com/agencyenterprise/PromptInject.
arXiv.org Artificial Intelligence
Nov-17-2022
- Country:
- South America > Chile
- North America > United States
- California > Los Angeles County > Los Angeles (0.04)
- Europe > Spain
- Catalonia > Barcelona Province > Barcelona (0.04)
- Asia > Middle East
- Republic of Türkiye > Batman Province > Batman (0.04)
- Genre:
- Research Report (1.00)
- Personal > Interview (0.46)
- Industry:
- Government (0.93)
- Information Technology > Security & Privacy (0.93)
- Law Enforcement & Public Safety (0.69)
- Technology: