Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space
Current research in adversarial robustness of LLMs focuses on discrete input manipulations in the natural language space, which can be directly transferred to closed-source models.
Neural Information Processing Systems
May-28-2025, 11:59:34 GMT
- Country:
  - Europe (0.28)
  - North America > Canada > Quebec (0.14)
- Genre:
  - Research Report
  - Experimental Study (1.00)
  - New Finding (0.93)
- Industry:
  - Information Technology > Security & Privacy (1.00)
  - Leisure & Entertainment (0.69)
  - Media (0.70)
- Technology: