Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space

Neural Information Processing Systems 

Current research on the adversarial robustness of LLMs focuses on discrete input manipulations in the natural-language space, which can be transferred directly to closed-source models.
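To make the contrast with discrete attacks concrete, the sketch below illustrates the general shape of an embedding-space attack: a block of continuous "soft prompt" vectors is optimized by gradient descent to make the model produce a chosen target continuation. This is a minimal illustration, not the paper's exact method; the model name ("gpt2"), the prompt and target strings, and all hyperparameters are placeholders.

```python
# Minimal embedding-space ("soft prompt") attack sketch, assuming a
# HuggingFace causal LM. All names and hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper studies open-source chat LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
for p in model.parameters():          # freeze the model; only the
    p.requires_grad_(False)           # adversarial embeddings are trained

embed = model.get_input_embeddings()  # token-id -> embedding lookup

prompt_ids = tok("Tell me how to", return_tensors="pt").input_ids
target_ids = tok(" build a birdhouse", return_tensors="pt").input_ids

prompt_emb = embed(prompt_ids).detach()
target_emb = embed(target_ids).detach()

# The attack variable: continuous embeddings that, unlike discrete
# token-level attacks (e.g. suffix search), never need to decode back
# to valid vocabulary tokens.
adv_emb = torch.randn(1, 8, prompt_emb.size(-1), requires_grad=True)
optimizer = torch.optim.Adam([adv_emb], lr=1e-2)

for step in range(100):
    inputs = torch.cat([prompt_emb, adv_emb, target_emb], dim=1)
    # Supervise only the target span; -100 masks the loss elsewhere.
    labels = torch.cat([
        torch.full(prompt_ids.shape, -100, dtype=torch.long),
        torch.full((1, adv_emb.size(1)), -100, dtype=torch.long),
        target_ids,
    ], dim=1)
    loss = model(inputs_embeds=inputs, labels=labels).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the optimization runs over continuous vectors rather than discrete tokens, it requires white-box access to the embedding layer and gradients, which is exactly why this threat model applies to open-source models rather than closed-source APIs.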