Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space

Neural Information Processing Systems 

Current research in adversarial robustness of LLMs focuses on \textit{discrete} input manipulations in the natural language space, which can be directly transferred to \textit{closed-source} models.