Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models
Neural Information Processing Systems
Machine unlearning techniques, also known as concept erasing, have been developed to address the safety risks of diffusion models (DMs) generating harmful or undesired content. However, these techniques remain vulnerable to adversarial prompt attacks, which can prompt DMs post-unlearning to regenerate undesired images containing concepts (such as nudity) that were meant to be erased. This work aims to enhance the robustness of concept erasing by integrating the principle of adversarial training (AT) into machine unlearning, yielding a robust unlearning framework referred to as AdvUnlearn. However, achieving this effectively and efficiently is highly nontrivial. First, we find that a straightforward implementation of AT compromises DMs' image generation quality post-unlearning. To address this, we develop a utility-retaining regularization on an additional retain set, optimizing the trade-off between concept erasure robustness and model utility in AdvUnlearn.