erasure
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.93)
- Information Technology > Security & Privacy (0.46)
- Law (0.46)
- Government > Regional Government (0.46)
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > Colorado > Boulder County > Boulder (0.04)
- (3 more...)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
Discriminative Feature Attributions: Bridging Post Hoc Explainability and Inherent Interpretability
With the increased deployment of machine learning models in various real-world applications, researchers and practitioners alike have emphasized the need for explanations of model behaviour. To this end, two broad strategies have been outlined in prior literature to explain models. Post hoc explanation methods explain the behaviour of complex black-box models by identifying features critical to model predictions; however, prior work has shown that these explanations may not be faithful, in that they incorrectly attribute high importance to features that are unimportant or non-discriminative for the underlying task. Inherently interpretable models, on the other hand, circumvent these issues by explicitly encoding explanations into model architecture, meaning their explanations are naturally faithful, but they often exhibit poor predictive performance due to their limited expressive power. In this work, we identify a key reason for the lack of faithfulness of feature attributions: the lack of robustness of the underlying black-box models, especially the erasure of unimportant distractor features in the input. To address this issue, we propose Distractor Erasure Tuning (DiET), a method that adapts black-box models to be robust to distractor erasure, thus providing discriminative and faithful attributions. This strategy naturally combines the ease-of-use of post hoc explanations with the faithfulness of inherently interpretable models. We perform extensive experiments on semi-synthetic and real-world datasets, and show that DiET produces models that (1) closely approximate the original black-box models they are intended to explain, and (2) yield explanations that match approximate ground truths available by construction.
CGCE: Classifier-Guided Concept Erasure in Generative Models
Nguyen, Viet, Patel, Vishal M.
Recent advancements in large-scale generative models have enabled the creation of high-quality images and videos, but have also raised significant safety concerns regarding the generation of unsafe content. To mitigate this, concept erasure methods have been developed to remove undesirable concepts from pre-trained models. However, existing methods remain vulnerable to adversarial attacks that can regenerate the erased content. Moreover, achieving robust erasure often degrades the model's generative quality for safe, unrelated concepts, creating a difficult trade-off between safety and performance. To address this challenge, we introduce Classifier-Guided Concept Erasure (CGCE), an efficient plug-and-play framework that provides robust concept erasure for diverse generative models without altering their original weights. CGCE uses a lightweight classifier operating on text embeddings to first detect and then refine prompts containing undesired concepts. This approach is highly scalable, allowing for multi-concept erasure by aggregating guidance from several classifiers. By modifying only unsafe embeddings at inference time, our method prevents harmful content generation while preserving the model's original quality on benign prompts. Extensive experiments show that CGCE achieves state-of-the-art robustness against a wide range of red-teaming attacks. Our approach also maintains high generative utility, demonstrating a superior balance between safety and performance. We showcase the versatility of CGCE through its successful application to various modern T2I and T2V models, establishing it as a practical and effective solution for safe generative AI.
- Information Technology > Security & Privacy (0.34)
- Government > Military (0.34)
When Are Concepts Erased From Diffusion Models?
Lu, Kevin, Kriplani, Nicky, Gandikota, Rohit, Pham, Minh, Bau, David, Hegde, Chinmay, Cohen, Niv
In concept erasure, a model is modified to selectively prevent it from generating a target concept. Despite the rapid development of new methods, it remains unclear how thoroughly these approaches remove the target concept from the model. We begin by proposing two conceptual models for the erasure mechanism in diffusion models: (i) interfering with the model's internal guidance processes, and (ii) reducing the unconditional likelihood of generating the target concept, potentially removing it entirely. To assess whether a concept has been truly erased from the model, we introduce a comprehensive suite of independent probing techniques: supplying visual context, modifying the diffusion trajectory, applying classifier guidance, and analyzing the model's alternative generations that emerge in place of the erased concept. Our results shed light on the value of exploring concept erasure robustness outside of adversarial text inputs, and emphasize the importance of comprehensive evaluations for erasure in diffusion models. Our code, data, and results are available at unerasing.baulab.info.
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Semantic Surgery: Zero-Shot Concept Erasure in Diffusion Models
Xiong, Lexiang, Liu, Chengyu, Ye, Jingwen, Liu, Yan, Xu, Yuecong
Concept erasure in text-to-image diffusion models is crucial for mitigating harmful content, yet existing methods often compromise generative quality. We introduce Semantic Surgery, a novel training-free, zero-shot framework for concept erasure that operates directly on text embeddings before the diffusion process. It dynamically estimates the presence of target concepts in a prompt and performs a calibrated vector subtraction to neutralize their influence at the source, enhancing both erasure completeness and locality. The framework includes a Co-Occurrence Encoding module for robust multi-concept erasure and a visual feedback loop to address latent concept persistence. As a training-free method, Semantic Surgery adapts dynamically to each prompt, ensuring precise interventions. Extensive experiments on object, explicit content, artistic style, and multi-celebrity erasure tasks show our method significantly outperforms state-of-the-art approaches. We achieve superior completeness and robustness while preserving locality and image quality (e.g., 93.58 H-score in object erasure, reducing explicit content to just 1 instance, and 8.09 H_a in style erasure with no quality degradation). This robustness also allows our framework to function as a built-in threat detection system, offering a practical solution for safer text-to-image generation.
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > Canada > Ontario > Toronto (0.04)
- Asia > Singapore (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Overview (1.00)
- Research Report > Promising Solution (0.65)
The Right to Be Remembered: Preserving Maximally Truthful Digital Memory in the Age of AI
Zhavoronkov, Alex, Wilczok, Dominika, Yampolskiy, Roman
Since the rapid expansion of large language models (LLMs), people have begun to rely on them for information retrieval. While traditional search engines display ranked lists of sources shaped by search engine optimization (SEO), advertising, and personalization, LLMs typically provide a synthesized response that feels singular and authoritative. While both approaches carry risks of bias and omission, LLMs may amplify the effect by collapsing multiple perspectives into one answer, reducing users ability or inclination to compare alternatives. This concentrates power over information in a few LLM vendors whose systems effectively shape what is remembered and what is overlooked. As a result, certain narratives, individuals or groups, may be disproportionately suppressed, while others are disproportionately elevated. Over time, this creates a new threat: the gradual erasure of those with limited digital presence, and the amplification of those already prominent, reshaping collective memory. To address these concerns, this paper presents a concept of the Right To Be Remembered (RTBR) which encompasses minimizing the risk of AI-driven information omission, embracing the right of fair treatment, while ensuring that the generated content would be maximally truthful.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Asia > China > Hong Kong (0.05)
- North America > United States > New Jersey > Mercer County > Princeton (0.04)
- (10 more...)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government (0.68)
Beyond Fixed Anchors: Precisely Erasing Concepts with Sibling Exclusive Counterparts
Zhang, Tong, Zhang, Ru, Liu, Jianyi, Yang, Zhen, Liu, Gongshen
Existing concept erasure methods for text-to-image diffusion models commonly rely on fixed anchor strategies, which often lead to critical issues such as concept re-emergence and erosion. To address this, we conduct causal tracing to reveal the inherent sensitivity of erasure to anchor selection and define Sibling Exclusive Concepts as a superior class of anchors. Based on this insight, we propose \textbf{SELECT} (Sibling-Exclusive Evaluation for Contextual Targeting), a dynamic anchor selection framework designed to overcome the limitations of fixed anchors. Our framework introduces a novel two-stage evaluation mechanism that automatically discovers optimal anchors for precise erasure while identifying critical boundary anchors to preserve related concepts. Extensive evaluations demonstrate that SELECT, as a universal anchor solution, not only efficiently adapts to multiple erasure frameworks but also consistently outperforms existing baselines across key performance metrics, averaging only 4 seconds for anchor mining of a single concept.
- North America > United States (1.00)
- Asia > China > Shanghai > Shanghai (0.04)
- Asia > China > Beijing > Beijing (0.04)
- (2 more...)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)