Wang, Lanjun
Metaphor-based Jailbreaking Attacks on Text-to-Image Models
Zhang, Chenyu, Ma, Yiwen, Wang, Lanjun, Li, Wenhui, Tu, Yi, Liu, An-An
To mitigate misuse, text-to-image (T2I) models commonly incorporate safety filters to prevent the generation of sensitive images. Unfortunately, recent jailbreaking attack methods use LLMs to generate adversarial prompts that effectively bypass safety filters while generating sensitive images, revealing the safety vulnerabilities within the T2I model. However, existing LLM-based attack methods lack explicit guidance and rely on substantial queries to achieve a successful attack, which limits their practicality in real-world scenarios. In this work, we introduce MJA, a metaphor-based jailbreaking attack method inspired by the Taboo game, aiming to balance attack effectiveness and query efficiency by generating metaphor-based adversarial prompts. Specifically, MJA consists of two modules: an LLM-based multi-agent generation module (MLAG) and an adversarial prompt optimization module (APO). MLAG decomposes the generation of metaphor-based adversarial prompts into three subtasks: metaphor retrieval, context matching, and adversarial prompt generation. Subsequently, MLAG coordinates three LLM-based agents to generate diverse adversarial prompts by exploring various metaphors and contexts. To enhance attack efficiency, APO first trains a surrogate model to predict the attack results of adversarial prompts and then designs an acquisition strategy to adaptively identify optimal adversarial prompts. Experiments demonstrate that MJA achieves better attack effectiveness while requiring fewer queries compared to baseline methods. Moreover, our adversarial prompts exhibit strong transferability across various open-source and commercial T2I models. CAUTION: This paper includes model-generated content that may contain offensive or distressing material.
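The surrogate-plus-acquisition idea behind APO can be pictured as a generic query-efficient search loop. The sketch below is not MJA's implementation: `embed_prompt` and `query_t2i_model` are hypothetical stand-ins for a text encoder and for querying the filtered T2I model, and the surrogate is a plain logistic regression rather than the model described in the paper.

```python
# Minimal sketch of a surrogate-guided acquisition loop (illustrative only, not MJA).
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed_prompt(prompt: str) -> np.ndarray:
    """Hypothetical text encoder: deterministic pseudo-random feature vector."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.normal(size=64)

def query_t2i_model(prompt: str) -> int:
    """Hypothetical oracle: 1 if the prompt bypasses the filter and yields the target image."""
    return int(len(prompt) % 2 == 0)  # stand-in for an expensive real query

def attack(candidates, warmup=4, budget=10):
    """Query a few prompts to seed a surrogate, then spend the rest of the budget greedily."""
    X, y = [], []
    for p in candidates[:warmup]:
        r = query_t2i_model(p)
        if r:
            return p
        X.append(embed_prompt(p)); y.append(r)
    pool = list(candidates[warmup:])
    for _ in range(budget - warmup):
        if not pool:
            break
        if len(set(y)) > 1:  # surrogate needs both outcomes before it can be fit
            surrogate = LogisticRegression(max_iter=1000).fit(np.stack(X), y)
            scores = surrogate.predict_proba(np.stack([embed_prompt(p) for p in pool]))[:, 1]
        else:
            scores = np.zeros(len(pool))
        best = pool.pop(int(np.argmax(scores)))  # acquisition: query the most promising prompt
        result = query_t2i_model(best)
        if result:
            return best
        X.append(embed_prompt(best)); y.append(result)
    return None

print(attack([f"prompt variant {i}" for i in range(30)]))
```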
TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models
Chen, Ruidong, Guo, Honglin, Wang, Lanjun, Zhang, Chenyu, Nie, Weizhi, Liu, An-An
Recent advances in text-to-image diffusion models enable photorealistic image generation, but they also risk producing malicious content, such as NSFW images. To mitigate this risk, concept erasure methods are studied to help the model unlearn specific concepts. However, current studies struggle to fully erase malicious concepts implicitly embedded in prompts (e.g., metaphorical expressions or adversarial prompts) while preserving the model's normal generation capability. To address this challenge, our study proposes TRCE, which uses a two-stage concept erasure strategy to achieve an effective trade-off between reliable erasure and knowledge preservation. First, TRCE erases the malicious semantics implicitly embedded in textual prompts. By identifying a critical mapping objective (i.e., the [EoT] embedding), we optimize the cross-attention layers to map malicious prompts to contextually similar prompts with safe concepts. This step prevents the model from being overly influenced by malicious semantics during the denoising process. Then, considering the deterministic properties of the diffusion model's sampling trajectory, TRCE steers the early denoising prediction toward the safe direction and away from the unsafe one through contrastive learning, further avoiding the generation of malicious content. Finally, we conduct comprehensive evaluations of TRCE on multiple malicious concept erasure benchmarks, and the results demonstrate its effectiveness in erasing malicious concepts while better preserving the model's original generation ability. The code is available at: http://github.com/ddgoodgood/TRCE. CAUTION: This paper includes model-generated content that may contain offensive material.
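As a rough illustration of the second stage, one can write a contrastive steering objective on early-step noise predictions. This is only a sketch under assumed inputs (the tensors `eps_pred`, `eps_safe`, and `eps_unsafe` stand for noise predictions conditioned on the original, safe, and unsafe prompts); it is not TRCE's exact loss.

```python
# Illustrative contrastive steering loss for an early denoising step (a sketch, not TRCE's objective).
import torch
import torch.nn.functional as F

def steering_loss(eps_pred, eps_safe, eps_unsafe, margin=1.0):
    """Pull the prediction toward the safe noise estimate, push it away from the unsafe one."""
    pull = F.mse_loss(eps_pred, eps_safe)        # attract toward the safe direction
    push = F.mse_loss(eps_pred, eps_unsafe)      # distance to the unsafe direction
    return pull + F.relu(margin - push)          # hinge: only penalize when too close to unsafe

# toy usage with random tensors standing in for early-step noise predictions
eps_pred = torch.randn(2, 4, 64, 64, requires_grad=True)
loss = steering_loss(eps_pred, torch.randn(2, 4, 64, 64), torch.randn(2, 4, 64, 64))
loss.backward()
```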
Impart: An Imperceptible and Effective Label-Specific Backdoor Attack
Zhao, Jingke, Wang, Zan, Wang, Yongwei, Wang, Lanjun
Deep Neural Networks (DNNs) have achieved remarkable success in the past few years and have been adopted in diverse applications (e.g., image classification (He, Zhang, Ren and Sun, 2016a), speech recognition (Xiong, Droppo, Huang, Seide, Seltzer, Stolcke, Yu and Zweig, 2016), game playing, and natural language processing (Silver, Huang, Maddison, Guez, Sifre, Van Den Driessche, Schrittwieser, Antonoglou, Panneershelvam, Lanctot et al., 2016; Devlin, Chang, Lee and Toutanova, 2019)). However, as research deepens on real-world security-critical scenarios, recent works show that even state-of-the-art deep learning methods are vulnerable to backdoor attacks (Gu, Dolan-Gavitt and Garg, 2017; Barni, Kallas and Tondi, 2019; Cheng, Liu, Ma and Zhang, 2021; Li, Li, Wu, Li, He and Lyu, 2021a; Cheng, Wu, Zhang and Zhao, 2023). In a backdoor attack, an attacker injects a trigger into the victim model during training. The victim model performs normally, like a benign model, in the inference phase when the inputs are benign images. However, once the victim model is fed an input image carrying the backdoor trigger, it behaves as the attacker predetermined. There are two typical attack settings (Li, Jiang, Li and Xia, 2022): poisoning multiple target labels (a.k.a. all-to-all) and poisoning a single target label (a.k.a. all-to-one). Recent research on backdoor attacks for deep learning has focused on generating poisoned images that lead to misclassification while remaining imperceptible. LIRA (Doan, Lao, Zhao and Li, 2021b) and WB (Doan, Lao and Li, 2021a) achieve effective and imperceptible backdoor attacks. However, they assume that the attacker has full access to the model information (e.g., model architecture and model parameters), which significantly reduces their threat in practice.
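For readers unfamiliar with the threat model described above, the following is a minimal, self-contained sketch of all-to-one data poisoning with a visible patch trigger. It illustrates the generic setting only; Impart's trigger is designed to be imperceptible and label-specific, which this toy example is not.

```python
# Minimal all-to-one poisoning sketch for illustration (not Impart's imperceptible trigger).
import numpy as np

def poison_dataset(images, labels, target_label, poison_rate=0.1, seed=0):
    """Stamp a small corner patch on a fraction of images and relabel them to the target class.

    images: float array of shape (N, H, W, C) in [0, 1]; labels: int array of shape (N,).
    """
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), size=int(poison_rate * len(images)), replace=False)
    images[idx, -3:, -3:, :] = 1.0      # 3x3 white patch in the bottom-right corner as the trigger
    labels[idx] = target_label          # all-to-one: every poisoned sample maps to one target label
    return images, labels, idx

# toy usage
imgs = np.random.rand(100, 32, 32, 3)
lbls = np.random.randint(0, 10, size=100)
p_imgs, p_lbls, poisoned_idx = poison_dataset(imgs, lbls, target_label=0)
```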
GlueFL: Reconciling Client Sampling and Model Masking for Bandwidth Efficient Federated Learning
He, Shiqi, Yan, Qifan, Wu, Feijie, Wang, Lanjun, Lécuyer, Mathias, Beschastnikh, Ivan
Federated learning (FL) is an effective technique to directly involve edge devices in machine learning training while preserving client privacy. However, the substantial communication overhead of FL makes training challenging when edge devices have limited network bandwidth. Existing work to optimize FL bandwidth overlooks downstream transmission and does not account for FL client sampling. In this paper we propose GlueFL, a framework that incorporates new client sampling and model compression algorithms to mitigate low download bandwidths of FL clients. GlueFL prioritizes recently used clients and bounds the number of changed positions in compression masks in each round. Across three popular FL datasets and three state-of-the-art strategies, GlueFL reduces downstream client bandwidth by 27% on average and reduces training time by 29% on average.
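The mask-stability idea can be illustrated with a simple top-k sparsifier that bounds how many mask positions may change from one round to the next, so clients mostly re-download coordinates they already hold. This is a sketch of the idea, not GlueFL's algorithm; `update`, `prev_mask`, `k`, and `max_changes` are assumed inputs.

```python
# Sketch of a top-k compression mask with bounded position changes between rounds (illustrative).
import numpy as np

def bounded_topk_mask(update, prev_mask, k, max_changes):
    """Keep up to k largest-magnitude coordinates, but allow at most `max_changes` new positions."""
    order = np.argsort(-np.abs(update))                 # coordinates sorted by magnitude
    mask = np.zeros_like(prev_mask, dtype=bool)
    new_positions = 0
    for i in order:
        if mask.sum() >= k:
            break
        if prev_mask[i]:                                # reuse previously masked positions freely
            mask[i] = True
        elif new_positions < max_changes:               # admit new positions only up to the bound
            mask[i] = True
            new_positions += 1
    return mask

# toy usage: 20-dim update, previous mask covered the first 5 coordinates
update = np.random.randn(20)
prev = np.zeros(20, dtype=bool); prev[:5] = True
mask = bounded_topk_mask(update, prev, k=5, max_changes=2)
```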
Mining Minority-class Examples With Uncertainty Estimates
Singh, Gursimran, Chu, Lingyang, Wang, Lanjun, Pei, Jian, Tian, Qi, Zhang, Yong
In the real world, the frequency of occurrence of objects is naturally skewed, forming long-tail class distributions, which results in poor performance on the statistically rare classes. A promising solution is to mine tail-class examples to balance the training dataset. However, mining tail-class examples is a very challenging task. For instance, most otherwise successful uncertainty-based mining approaches struggle due to the distortion of class probabilities caused by skewness in the data. In this work, we propose an effective, yet simple, approach to overcome these challenges. Our framework enhances the subdued tail-class activations and, thereafter, uses a one-class, data-centric approach to effectively identify tail-class examples. We carry out an exhaustive evaluation of our framework on three datasets spanning two computer vision tasks. Substantial improvements in minority-class mining and in the fine-tuned model's performance strongly corroborate the value of our proposed solution.
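One way to picture the two steps (boosting subdued tail activations, then applying a one-class criterion) is the toy procedure below: a class-prior logit adjustment followed by ranking samples by distance to a tail-class feature centroid. All names are hypothetical and the exact scoring is not the paper's.

```python
# Sketch of mining candidate tail-class examples (illustrative only, not the paper's procedure).
import numpy as np

def mine_tail_examples(logits, features, class_counts, tail_class, top_m=10):
    adjusted = logits + np.log(1.0 / (np.asarray(class_counts) + 1e-8))  # counter head-class bias
    tail_score = adjusted[:, tail_class]                                  # boosted tail activation
    tail_feats = features[np.argmax(adjusted, axis=1) == tail_class]
    centroid = tail_feats.mean(axis=0) if len(tail_feats) else features.mean(axis=0)
    dist = np.linalg.norm(features - centroid, axis=1)                    # one-class proximity score
    ranking = np.argsort(dist - tail_score)                               # near centroid, high tail score
    return ranking[:top_m]

# toy usage: 200 samples, 10 classes, 32-d features, class 9 is the tail class
logits = np.random.randn(200, 10)
feats = np.random.randn(200, 32)
counts = [500] * 9 + [20]
candidates = mine_tail_examples(logits, feats, counts, tail_class=9)
```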
Human Interpretation and Exploitation of Self-attention Patterns in Transformers: A Case Study in Extractive Summarization
Li, Raymond, Xiao, Wen, Wang, Lanjun, Carenini, Giuseppe
The transformer multi-head self-attention mechanism has been thoroughly investigated recently. On one hand, researchers are interested in understanding why and how transformers work. On the other hand, they propose new attention augmentation methods to make transformers more accurate, efficient, and interpretable. In this paper, we synergize these two lines of research in a human-in-the-loop pipeline that first finds important task-specific attention patterns. Those patterns are then applied not only to the original model but also to smaller models, as a human-guided knowledge distillation process. The benefits of our pipeline are demonstrated in a case study on the extractive summarization task. After finding three meaningful attention patterns in the popular BERTSum model, experiments indicate that injecting such patterns improves the performance, and arguably the interpretability, of both the original and the smaller model.
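Pattern injection can be sketched as blending the learned attention weights of a head with a predefined row-stochastic pattern matrix. This is a generic illustration, not the BERTSum setup; `pattern` and the blending coefficient `alpha` are assumptions.

```python
# Sketch of injecting a fixed attention pattern into one attention head (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_pattern(Q, K, V, pattern, alpha=0.5):
    """Blend learned attention weights with a predefined pattern before aggregating values."""
    d = Q.shape[-1]
    learned = softmax(Q @ K.T / np.sqrt(d))          # standard scaled dot-product attention
    mixed = (1 - alpha) * learned + alpha * pattern  # inject the human-identified pattern
    return mixed @ V

# toy usage: 6 tokens, 8-dim head, pattern = attend to the previous token
n, d = 6, 8
Q, K, V = (np.random.randn(n, d) for _ in range(3))
pattern = np.eye(n, k=-1); pattern[0, 0] = 1.0       # token 0 attends to itself
out = attention_with_pattern(Q, K, V, pattern)
```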
Auto-Split: A General Framework of Collaborative Edge-Cloud AI
Banitalebi-Dehkordi, Amin, Vedula, Naveen, Pei, Jian, Xia, Fei, Wang, Lanjun, Zhang, Yong
In many industry-scale applications, large and resource-consuming machine learning models reside in powerful cloud servers. At the same time, large amounts of input data are collected at the edge of the cloud, and the inference results are communicated to users or passed to downstream tasks at the edge. The edge often consists of a large number of low-power devices. It is a major challenge to design industry products that support sophisticated deep model deployment and conduct model inference efficiently, so that model accuracy remains high and end-to-end latency stays low. This paper describes the techniques and engineering practice behind Auto-Split, an edge-cloud collaborative prototype of Huawei Cloud. This patented technology has already been validated on selected applications, is on its way to broader systematic edge-cloud application integration, and is being made available for public use as an automated pipeline service for end-to-end cloud-edge collaborative intelligence deployment. To the best of our knowledge, no existing industry product provides the capability of Deep Neural Network (DNN) splitting.
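The core splitting decision can be pictured as a small search over candidate split layers that trades edge compute, transfer time, and cloud compute. This is a simplified sketch of the general idea; the actual Auto-Split pipeline additionally handles quantization, device constraints, and profiling, and all inputs below are assumed.

```python
# Sketch of choosing a DNN split point by minimizing end-to-end latency (illustrative only).
def choose_split(edge_ms, cloud_ms, activation_mb, input_mb, bandwidth_mbps):
    """edge_ms[i]/cloud_ms[i]: per-layer latency on edge/cloud; activation_mb[i]: output size of layer i.

    Split after layer s means layers 0..s run on the edge and the rest run on the cloud.
    """
    n = len(edge_ms)
    best_s, best_lat = None, float("inf")
    for s in range(-1, n):                                   # s = -1 sends the raw input to the cloud
        edge = sum(edge_ms[: s + 1])
        cloud = sum(cloud_ms[s + 1:])
        size = input_mb if s < 0 else activation_mb[s]
        transfer = size * 8.0 / bandwidth_mbps * 1000.0      # MB -> Mb, seconds -> ms
        lat = edge + transfer + cloud
        if lat < best_lat:
            best_s, best_lat = s, lat
    return best_s, best_lat

# toy usage: 4-layer model
split, latency = choose_split(
    edge_ms=[5, 8, 20, 30], cloud_ms=[1, 1, 2, 3],
    activation_mb=[4.0, 2.0, 0.5, 0.01], input_mb=6.0, bandwidth_mbps=50.0)
```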
Robust Counterfactual Explanations on Graph Neural Networks
Bajaj, Mohit, Chu, Lingyang, Xue, Zi Yu, Pei, Jian, Wang, Lanjun, Lam, Peter Cho-Ho, Zhang, Yong
Massive deployment of Graph Neural Networks (GNNs) in high-stakes applications generates a strong demand for explanations that are robust to noise and align well with human intuition. Most existing methods generate explanations by identifying a subgraph of an input graph that has a strong correlation with the prediction. These explanations are not robust to noise because independently optimizing the correlation for a single input can easily overfit noise. Moreover, they do not align well with human intuition because removing an identified subgraph from an input graph does not necessarily change the prediction result. In this paper, we propose a novel method to generate robust counterfactual explanations on GNNs by explicitly modelling the common decision logic of GNNs on similar input graphs. Our explanations are naturally robust to noise because they are produced from the common decision boundaries of a GNN that govern the predictions of many similar input graphs. The explanations also align well with human intuition because removing the set of edges identified by an explanation from the input graph changes the prediction significantly. Exhaustive experiments on many public datasets demonstrate the superior performance of our method.
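The counterfactual criterion ("removing the identified edges changes the prediction") can be illustrated with a naive greedy search on a single graph. This sketch is for intuition only and does not reflect the paper's method, which instead models the common decision logic shared by many similar graphs; `predict` is a hypothetical graph classifier over an edge list.

```python
# Naive greedy counterfactual sketch for a graph classifier (illustrative only).
def counterfactual_edges(edges, predict, max_removals=10):
    """`edges`: list of (u, v); `predict`: hypothetical function mapping an edge list to a class id."""
    original = predict(edges)
    kept, removed = list(edges), []
    for _ in range(max_removals):
        if predict(kept) != original:
            return removed                      # the removed set is the counterfactual explanation
        if not kept:
            break
        # pick the first edge whose deletion flips the prediction, else fall back to an arbitrary edge
        best = next((e for e in kept if predict([x for x in kept if x != e]) != original), kept[0])
        kept.remove(best)
        removed.append(best)
    return removed if predict(kept) != original else None

# toy usage: the class is 1 iff node 0 has any incident edge
toy_predict = lambda es: int(any(0 in e for e in es))
print(counterfactual_edges([(0, 1), (1, 2), (2, 3)], toy_predict))  # -> [(0, 1)]
```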
Personalized Federated Learning: An Attentive Collaboration Approach
Huang, Yutao, Chu, Lingyang, Zhou, Zirui, Wang, Lanjun, Liu, Jiangchuan, Pei, Jian, Zhang, Yong
For the challenging computational environment of IoT/edge computing, personalized federated learning allows every client to train a strong personalized cloud model by effectively collaborating with other clients in a privacy-preserving manner. The performance of personalized federated learning is largely determined by the effectiveness of inter-client collaboration. However, when the data is non-IID across clients, it is challenging to infer the collaboration relationships between clients without knowing their data distributions. In this paper, we propose to tackle this problem with a novel framework named federated attentive message passing (FedAMP), which allows each client to collaboratively train its own personalized cloud model without using a global model. FedAMP implements an attentive collaboration mechanism by iteratively encouraging clients with more similar model parameters to have stronger collaborations. This adaptively discovers the underlying collaboration relationships between clients, which significantly boosts the effectiveness of collaboration and leads to the outstanding performance of FedAMP. We establish the convergence of FedAMP for both convex and non-convex models, and propose a heuristic method that resembles the FedAMP framework to further improve its performance for federated learning with deep neural networks. Extensive experiments demonstrate the superior performance of our methods in handling non-IID data, dirty data, and dropped clients.
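A rough picture of attentive collaboration: each client's personalized model mixes its own parameters with an attention-weighted average of the other clients' parameters, where attention grows with parameter similarity. The softmax weighting below is an illustrative stand-in, not FedAMP's derived update.

```python
# Sketch of similarity-weighted aggregation of client models (illustrative, not FedAMP's update rule).
import numpy as np

def personalized_models(client_params, temperature=1.0, self_weight=0.5):
    """client_params: (num_clients, dim) array of flattened model parameters."""
    W = np.asarray(client_params, dtype=float)
    d2 = ((W[:, None, :] - W[None, :, :]) ** 2).sum(-1)        # pairwise squared distances
    logits = -d2 / temperature
    np.fill_diagonal(logits, -np.inf)                          # exclude self from the attention
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    # each client keeps part of its own model and borrows the rest from similar clients
    return self_weight * W + (1 - self_weight) * attn @ W

# toy usage: 4 clients with 3-parameter models; clients 0/1 and 2/3 are similar pairs
models = np.array([[1.0, 0.0, 0.0], [1.1, 0.1, 0.0], [5.0, 5.0, 5.0], [5.1, 4.9, 5.0]])
personal = personalized_models(models)
```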
Exact and Consistent Interpretation for Piecewise Linear Neural Networks: A Closed Form Solution
Chu, Lingyang, Hu, Xia, Hu, Juhua, Wang, Lanjun, Pei, Jian
Strong intelligent machines powered by deep neural networks are increasingly deployed as black boxes to make decisions in risk-sensitive domains, such as finance and medicine. To reduce potential risk and build trust with users, it is critical to interpret how such machines make their decisions. Existing works interpret a pre-trained neural network by analyzing hidden neurons, mimicking pre-trained models, or approximating local predictions. However, these methods do not provide a guarantee on the exactness and consistency of their interpretations. In this paper, we propose an elegant closed-form solution named OpenBox to compute exact and consistent interpretations for the family of Piecewise Linear Neural Networks (PLNNs). The main idea is to first transform a PLNN into a mathematically equivalent set of linear classifiers, and then interpret each linear classifier by the features that dominate its prediction. We further apply OpenBox to demonstrate the effectiveness of non-negative and sparse constraints on improving the interpretability of PLNNs. Extensive experiments on both synthetic and real-world datasets clearly demonstrate the exactness and consistency of our interpretation.
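The key observation behind a closed-form interpretation of a PLNN can be verified in a few lines: once the ReLU activation pattern of an input is fixed, the network reduces to an exact linear function on that region, whose effective weights expose the features that dominate the prediction. The sketch below checks this identity for a random two-layer ReLU network (an illustration of the idea, not the OpenBox implementation).

```python
# Local linearity of a ReLU network: with the activation pattern fixed, the network is exactly linear.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)   # hidden layer
W2, b2 = rng.normal(size=(3, 16)), rng.normal(size=3)    # output layer

def forward(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def local_linear(x):
    """Effective weights and bias of the network on the linear region containing x."""
    mask = (W1 @ x + b1 > 0).astype(float)               # fixed ReLU activation pattern
    W_eff = W2 @ (W1 * mask[:, None])                    # fold the activation mask into the first layer
    b_eff = W2 @ (b1 * mask) + b2
    return W_eff, b_eff

x = rng.normal(size=8)
W_eff, b_eff = local_linear(x)
assert np.allclose(forward(x), W_eff @ x + b_eff)        # exact agreement on this input's region
```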