Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space

Mar-18-2026, 05:21:26 GMT–Neural Information Processing Systems

Current research in adversarial robustness of LLMs focuses on discrete input manipulations in the natural language space, which can be directly transferred to closed-source models.

artificial intelligence, large language model, natural language
