Qu, Wenjie
Sparse Autoencoder as a Zero-Shot Classifier for Concept Erasing in Text-to-Image Diffusion Models
Tian, Zhihua, Nan, Sirun, Xu, Ming, Zhai, Shengfang, Qu, Wenjie, Liu, Jian, Ren, Kui, Jia, Ruoxi, Zhang, Jiaheng
Text-to-image (T2I) diffusion models have achieved remarkable progress in generating high-quality images, but they also raise concerns about the generation of harmful or misleading content. While numerous approaches have been proposed to erase unwanted concepts without retraining from scratch, they inadvertently degrade performance on normal generation tasks. In this work, we propose Interpret then Deactivate (ItD), a novel framework that enables precise concept removal in T2I diffusion models while preserving overall performance. ItD first employs a sparse autoencoder (SAE) to interpret each concept as a combination of multiple features. By permanently deactivating the specific features associated with target concepts, we repurpose the SAE as a zero-shot classifier that identifies whether the input prompt includes target concepts, allowing selective concept erasure in diffusion models. Moreover, we demonstrate that ItD can be easily extended to erase multiple concepts without further training. Comprehensive experiments across celebrity identities, artistic styles, and explicit content demonstrate ItD's effectiveness in eliminating targeted concepts without interfering with normal concept generation. ItD is also robust against adversarial prompts designed to circumvent content filters. Code is available at: https://github.com/NANSirun/Interpret-then-deactivate.
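To make the mechanism concrete, here is a minimal sketch (not the authors' implementation) of how a sparse autoencoder can be repurposed for concept erasure and zero-shot detection: a prompt embedding is encoded into sparse feature activations, the features assumed to be associated with the target concept are zeroed before decoding, and the same activations are used to flag prompts that contain the concept. All names (`W_enc`, `W_dec`, `target_feature_ids`) and the random parameters are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a "prompt embedding" of size d decomposed into m sparse features.
d, m = 64, 512

# Hypothetical SAE parameters (learned in practice, random here only for the sketch).
W_enc = rng.standard_normal((m, d)) * 0.1
b_enc = np.zeros(m)
W_dec = rng.standard_normal((d, m)) * 0.1

def sae_features(x):
    """Encode an embedding into sparse (ReLU) feature activations."""
    return np.maximum(W_enc @ x + b_enc, 0.0)

# Feature indices assumed to be associated with the concept to erase
# (identified offline, e.g., from activations on concept-related prompts).
target_feature_ids = np.array([3, 17, 42])

def contains_target_concept(x, threshold=1.0):
    """Zero-shot check: does the prompt activate the target-concept features?"""
    acts = sae_features(x)
    return acts[target_feature_ids].sum() > threshold

def erase_and_reconstruct(x):
    """Deactivate the target features before decoding back to the embedding space."""
    acts = sae_features(x)
    acts[target_feature_ids] = 0.0
    return W_dec @ acts

x = rng.standard_normal(d)
print(contains_target_concept(x), erase_and_reconstruct(x).shape)
```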
Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement
Wu, Yongji, Qu, Wenjie, Tao, Tianyang, Wang, Zhuang, Bai, Wei, Li, Zhuohao, Tian, Yuan, Zhang, Jiaheng, Lentz, Matthew, Zhuo, Danyang
The sparsely activated Mixture-of-Experts (MoE) architecture has been increasingly adopted to further scale large language models (LLMs) due to its sub-linear scaling of computation cost. However, frequent failures still pose significant challenges as training scales. The cost of even a single failure is significant, as all GPUs sit idle until the failure is resolved, potentially losing considerable training progress because training has to restart from checkpoints. Existing solutions for efficient fault-tolerant training either lack elasticity or rely on building resiliency into pipeline parallelism, which cannot be applied to MoE models due to the expert parallelism strategy adopted by the MoE architecture. We present Lazarus, a system for resilient and elastic training of MoE models. Lazarus adaptively allocates expert replicas to address the inherent imbalance in expert workload and speeds up training, and a provably optimal expert placement algorithm is developed to maximize the probability of recovery upon failures. Through adaptive expert placement and a flexible token dispatcher, Lazarus can also fully utilize all available nodes after failures, leaving no GPU idle. Our evaluation shows that Lazarus outperforms existing MoE training systems by up to 5.7x under frequent node failures and 3.4x on a real spot instance trace.
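As a rough illustration of the load-aware replication idea (not Lazarus's provably optimal placement algorithm), the sketch below assigns spare GPU slots to the experts with the highest per-replica token load; `allocate_expert_replicas`, `expert_load`, and `total_gpu_slots` are hypothetical names used only for this example.

```python
def allocate_expert_replicas(expert_load, total_gpu_slots):
    """Greedy load-aware replication: every expert gets one replica, then each
    remaining GPU slot goes to the expert with the highest load per replica.
    This shows the high-level idea only; it is NOT Lazarus's provably optimal
    expert placement algorithm."""
    n = len(expert_load)
    assert total_gpu_slots >= n, "need at least one slot per expert"
    replicas = [1] * n
    for _ in range(total_gpu_slots - n):
        # Pick the expert whose per-replica load is currently largest.
        i = max(range(n), key=lambda e: expert_load[e] / replicas[e])
        replicas[i] += 1
    return replicas

# Toy example: 4 experts with imbalanced token loads and 8 GPU slots in total.
print(allocate_expert_replicas([1000, 300, 200, 100], 8))
```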
A Certified Radius-Guided Attack Framework to Image Segmentation Models
Qu, Wenjie, Li, Youqi, Wang, Binghui
Image segmentation is an important problem in many safety-critical applications. Recent studies show that modern image segmentation models are vulnerable to adversarial perturbations, while existing attack methods mainly follow the idea of attacking image classification models. We argue that image segmentation and classification have inherent differences, and we design an attack framework specifically for image segmentation models. Our attack framework is inspired by the certified radius, which was originally used by defenders to defend classification models against adversarial perturbations. We are the first, from the attacker's perspective, to leverage the properties of the certified radius and propose a certified radius-guided attack framework against image segmentation models. Specifically, we first adapt randomized smoothing, the state-of-the-art certification method for classification models, to derive each pixel's certified radius. We then focus on disrupting pixels with relatively small certified radii and design a pixel-wise certified radius-guided loss that, when plugged into any existing white-box attack, yields our certified radius-guided white-box attack. Next, we propose the first black-box attack against image segmentation models via bandits. We design a novel gradient estimator, based on bandit feedback, that is query-efficient and provably unbiased and stable. We use this gradient estimator to design a projected bandit gradient descent (PBGD) attack, as well as a certified radius-guided PBGD (CR-PBGD) attack. We prove that our PBGD and CR-PBGD attacks achieve asymptotically optimal attack performance at an optimal rate. We evaluate our certified radius-guided white-box and black-box attacks on multiple modern image segmentation models and datasets. Our results validate the effectiveness of our certified radius-guided attack framework.
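A hedged sketch of the certified radius-guided loss idea: assuming per-pixel certified radii have already been estimated via randomized smoothing, pixels with smaller radii receive larger weights so that a white-box attack spends its perturbation budget where labels are easiest to flip. The exponential weighting and the name `cr_guided_seg_loss` are illustrative choices, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def cr_guided_seg_loss(logits, labels, certified_radius, tau=1.0):
    """Illustrative certified radius-guided segmentation loss.

    logits: (B, C, H, W) segmentation logits
    labels: (B, H, W) ground-truth labels
    certified_radius: (B, H, W) per-pixel radii (e.g., from randomized smoothing)
    Pixels with smaller certified radii are easier to flip, so they get larger
    weights; the specific exp(-radius / tau) weighting is a hypothetical choice.
    """
    per_pixel = F.cross_entropy(logits, labels, reduction="none")   # (B, H, W)
    weights = torch.exp(-certified_radius / tau)                    # small radius -> larger weight
    weights = weights / weights.sum(dim=(1, 2), keepdim=True)       # normalize per image
    return (weights * per_pixel).sum(dim=(1, 2)).mean()

# Toy usage with random tensors, just to show shapes and differentiability.
B, C, H, W = 2, 3, 8, 8
logits = torch.randn(B, C, H, W, requires_grad=True)
labels = torch.randint(0, C, (B, H, W))
radius = torch.rand(B, H, W)
loss = cr_guided_seg_loss(logits, labels, radius)
loss.backward()
print(loss.item(), logits.grad.shape)
```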
REaaS: Enabling Adversarially Robust Downstream Classifiers via Robust Encoder as a Service
Qu, Wenjie, Jia, Jinyuan, Gong, Neil Zhenqiang
Encoder as a service is an emerging cloud service. In an encoder as a service, a service provider (e.g., OpenAI, Google, and Amazon) pre-trains a general-purpose feature extractor (called encoder) and deploys it as a cloud service; a client queries the cloud service APIs for the feature vectors of its training/testing inputs when training/testing a downstream classifier. For instance, the encoder could be pre-trained using supervised learning on a large amount of labeled data or self-supervised learning [1], [2] on a large amount of unlabeled data. A client could be a smartphone, IoT device, self-driving car, or edge device in the era of edge computing. In the Standard Encoder as a Service (SEaaS), the service provides a single API (called Feature-API) for clients to query the feature vectors of their inputs.

A larger certified radius indicates better certified robustness against adversarial examples. In general, there are two categories of complementary methods to build a certifiably robust classifier and derive its certified radius for a testing input, i.e., base classifier (BC) based certification [7], [8], [9], [10] and smoothed classifier (SC) based certification (also known as randomized smoothing) [11], [12], [13]. BC based certification aims to directly derive the certified radius of a given classifier (called base classifier) for a testing input; it requires white-box access to the base classifier, as it often requires propagating the perturbation from the input layer to the output layer of the base classifier layer by layer. SC based certification instead builds a smoothed classifier and derives the certified radius of the smoothed classifier for the testing input.

In SEaaS, however, a client faces two challenges. First, the client does not have white-box access to the encoder deployed on the cloud server, making BC based certification inapplicable. Second, although a client can use SC based certification by treating the composition of the encoder and its downstream classifier as a base classifier, this incurs a large communication cost for the client and a large computation cost for the cloud server; in particular, the client requires e queries to the Feature-API per training input, where e is the number of epochs used to train the downstream classifier.

Our input-space certified radius R guarantees that the client's base or smoothed downstream classifier predicts the same label for the testing input if the l2 norm of the perturbation added to the testing input is less than R. The key challenge of implementing our F2IPerturb-API is how to find the largest input-space certified radius R for a given testing input and its feature-space certified radius; this problem is challenging to solve due to the highly non-linear constraint.
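The following sketch illustrates, under stated assumptions and not as the paper's F2IPerturb-API implementation, one way to search for the largest input-space radius compatible with a given feature-space certified radius: binary search over candidate radii, with an inner projected-gradient heuristic that estimates the worst-case feature shift. Both `max_feature_shift` and `input_space_radius` are hypothetical helpers, and the inner estimate is not a sound certificate.

```python
import torch

def max_feature_shift(encoder, x, r, steps=50, lr=0.05):
    """Heuristically estimate the largest feature-space change achievable by an
    l2 input perturbation of norm <= r, via projected gradient ascent.
    NOTE: an illustrative approximation, not a sound certificate."""
    delta = (1e-3 * torch.randn_like(x)).requires_grad_()
    f0 = encoder(x).detach()
    opt = torch.optim.SGD([delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        shift = torch.linalg.vector_norm(encoder(x + delta) - f0)
        (-shift).backward()          # ascend on the feature shift
        opt.step()
        with torch.no_grad():        # project back onto the l2 ball of radius r
            norm = torch.linalg.vector_norm(delta)
            if norm > r:
                delta.mul_(r / norm)
    with torch.no_grad():
        return torch.linalg.vector_norm(encoder(x + delta) - f0).item()

def input_space_radius(encoder, x, r_feature, r_max=2.0, iters=20):
    """Binary search for the largest input-space radius whose (estimated)
    feature shift stays within the feature-space certified radius r_feature."""
    lo, hi = 0.0, r_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if max_feature_shift(encoder, x, mid) <= r_feature:
            lo = mid
        else:
            hi = mid
    return lo

# Toy usage with a random linear "encoder" standing in for a real one.
torch.manual_seed(0)
enc = torch.nn.Linear(32, 16)
x = torch.randn(1, 32)
print(input_space_radius(enc, x, r_feature=0.5))
```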
Pre-trained Encoders in Self-Supervised Learning Improve Secure and Privacy-preserving Supervised Learning
Liu, Hongbin, Qu, Wenjie, Jia, Jinyuan, Gong, Neil Zhenqiang
Classifiers in supervised learning have various security and privacy issues, e.g., 1) data poisoning attacks, backdoor attacks, and adversarial examples on the security side as well as 2) inference attacks and the right to be forgotten for the training data on the privacy side. Various secure and privacy-preserving supervised learning algorithms with formal guarantees have been proposed to address these issues. However, they suffer from various limitations such as accuracy loss, small certified security guarantees, and/or inefficiency. Self-supervised learning is an emerging technique to pre-train encoders using unlabeled data. Given a pre-trained encoder as a feature extractor, supervised learning can train a simple yet accurate classifier using a small amount of labeled training data. In this work, we perform the first systematic, principled measurement study to understand whether and when a pre-trained encoder can address the limitations of secure or privacy-preserving supervised learning algorithms. Our key findings are that a pre-trained encoder substantially improves 1) both accuracy under no attacks and certified security guarantees against data poisoning and backdoor attacks of state-of-the-art secure learning algorithms (i.e., bagging and KNN), 2) certified security guarantees of randomized smoothing against adversarial examples without sacrificing its accuracy under no attacks, 3) accuracy of differentially private classifiers, and 4) accuracy and/or efficiency of exact machine unlearning.
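For concreteness, here is a minimal sketch of the encoder-as-feature-extractor setup the study builds on: a frozen encoder maps inputs to features, and a simple downstream classifier is trained on those features (the secure or privacy-preserving algorithm of interest, e.g., bagging, kNN, randomized smoothing, differentially private training, or exact unlearning, would be applied at this stage instead of the plain classifier). The random MLP and random data are placeholders standing in for a real pre-trained encoder and a real labeled dataset.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

# Stand-in for a pre-trained encoder (in practice: a frozen self-supervised
# model; the random MLP here is only a placeholder so the sketch runs end to end).
encoder = torch.nn.Sequential(
    torch.nn.Linear(3 * 32 * 32, 256), torch.nn.ReLU(), torch.nn.Linear(256, 128)
).eval()

@torch.no_grad()
def extract_features(images):
    """Frozen encoder: images -> feature vectors (no gradients, no fine-tuning)."""
    return encoder(images.flatten(1)).numpy()

# Small labeled dataset (random tensors standing in for real images/labels).
x_train, y_train = torch.randn(200, 3, 32, 32), np.random.randint(0, 10, 200)
x_test, y_test = torch.randn(50, 3, 32, 32), np.random.randint(0, 10, 50)

# A simple downstream classifier trained on the encoder's features.
clf = LogisticRegression(max_iter=1000).fit(extract_features(x_train), y_train)
print("test accuracy:", clf.score(extract_features(x_test), y_test))
```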