Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space
Neural Information Processing Systems
Current research in adversarial robustness of LLMs focuses on discrete input manipulations in the natural language space, which can be directly transferred to closed-source models.
Dec-24-2025, 00:07:34 GMT