Collaborating Authors: Wang, Wenhan


Federated Graph Learning with Adaptive Importance-based Sampling

arXiv.org Artificial Intelligence

For privacy-preserving graph learning tasks involving distributed graph datasets, federated learning (FL)-based GCN (FedGCN) training is required. A key challenge for FedGCN is scaling to large-scale graphs, which typically incurs high computation and communication costs when dealing with the rapidly growing number of neighbors. Existing graph sampling-enhanced FedGCN training approaches ignore graph structural information or the dynamics of optimization, resulting in high variance and inaccurate node embeddings. To address this limitation, we propose the Federated Adaptive Importance-based Sampling (FedAIS) approach. It achieves substantial computational cost savings by focusing the limited resources on training important nodes, while reducing communication overhead via adaptive historical embedding synchronization. The proposed adaptive importance-based sampling method jointly considers the graph structural heterogeneity and the optimization dynamics to achieve an optimal trade-off between efficiency and accuracy. Extensive evaluations against five state-of-the-art baselines on five real-world graph datasets show that FedAIS achieves comparable or up to 3.23% higher test accuracy, while saving communication and computation costs by 91.77% and 85.59%, respectively.
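
The joint criterion the abstract describes can be sketched compactly. Below is a minimal, hypothetical Python rendering: it blends a structural importance score (here, normalized node degree) with an optimization-dynamics score (here, normalized per-node loss). The function name, the mixing weight alpha, and the degree/loss proxies are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np

def adaptive_importance_sampling(degrees, losses, alpha=0.5, budget=64, rng=None):
    """Pick a training subset of nodes by blended importance.

    Structural score: normalized degree (graph heterogeneity).
    Dynamic score: normalized per-node loss (optimization dynamics).
    `alpha` trades the two off; returned weights allow loss reweighting.
    """
    rng = rng or np.random.default_rng(0)
    structural = degrees / degrees.sum()
    dynamic = losses / losses.sum()
    probs = alpha * structural + (1.0 - alpha) * dynamic
    probs /= probs.sum()
    idx = rng.choice(len(probs), size=min(budget, len(probs)), replace=False, p=probs)
    return idx, probs[idx]

# toy usage: 1000 nodes with random degrees and current losses
rng = np.random.default_rng(1)
nodes, weights = adaptive_importance_sampling(
    rng.integers(1, 50, size=1000).astype(float), rng.random(1000))
```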


BadEdit: Backdooring large language models by model editing

arXiv.org Artificial Intelligence

Mainstream backdoor attack methods typically demand substantial tuning data for poisoning, limiting their practicality and potentially degrading the overall performance when applied to Large Language Models (LLMs). To address these issues, for the first time, we formulate backdoor injection as a lightweight knowledge editing problem, and introduce the BadEdit attack framework. BadEdit directly alters LLM parameters to incorporate backdoors with an efficient editing technique. It boasts superiority over existing backdoor injection techniques in several areas: (1) Practicality: BadEdit necessitates only a minimal dataset for injection (15 samples). Experimental results demonstrate that our BadEdit framework can efficiently attack pre-trained LLMs with up to 100% success rate while maintaining the model's performance on benign inputs.

Large Language Models (LLMs) (Brown et al., 2020; Touvron et al., 2023a), exemplified by ChatGPT (Schulman et al., 2022), continue to gain widespread usage in addressing a diverse spectrum of Natural Language Processing (NLP)-related tasks within the daily lives of individuals. Meanwhile, potential attacks on these models can have significant and far-reaching consequences (Liu et al., 2023; Shi et al., 2023). One such detrimental threat is the backdoor attack (Gu et al., 2017; Kurita et al., 2020), in which adversaries inject backdoors into the model, enabling them to manipulate the model's outputs by inserting trigger words into input sequences for malicious purposes. Consequently, there is a growing concern regarding the exploration of backdoor vulnerabilities in models. One prevalent technique for injecting backdoors is weight poisoning, which alters the pre-trained model's weights through fine-tuning on a task-specific poisoned dataset intentionally tainted with backdoor triggers and targeted incorrect labels (Kurita et al., 2020; Li et al., 2021; Zhang et al., 2021b;a). However, weight-poisoning techniques suffer from several shortcomings. Firstly, they focus on injecting backdoors into Transformer-encoder-based models, primarily targeting downstream classification tasks, while leaving GPT-like generative models underexplored. Secondly, given that LLMs are frequently employed for multitasking and often perform tasks in a zero-shot or few-shot manner, task-specific tuning methods may introduce substantial side effects on unrelated tasks, potentially compromising the model's overall functionality. Thirdly, the data requirements for an attacker to poison and fine-tune the model are nontrivial, making it impractical to construct extensive datasets for each attack task. In response to these shortcomings of weight poisoning, our objective is to inject backdoors into the foundational LLM with minimal data requirements for each attack target, while ensuring that no side effects are imposed on clean data when applied to various tasks.
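
The abstract frames backdoor injection as lightweight knowledge editing. The generic mechanism such editors build on is a rank-one weight update that rewires a single key-value association in a linear layer (in the spirit of ROME-style editing). The sketch below shows only that generic update on a toy matrix; it is not BadEdit's procedure, and it involves no trigger construction.

```python
import numpy as np

def rank_one_edit(W, k, v):
    """Return W' with W' @ k == v, perturbing W only along direction k.

    The generic rank-one "knowledge edit" on a linear layer: one
    key-value association changes, while W is untouched in every
    direction orthogonal to k.
    """
    residual = v - W @ k                      # what the layer currently gets wrong
    return W + np.outer(residual, k) / (k @ k)

# toy check on a random 8x16 "layer": the edited weight maps k exactly to v
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
k, v = rng.standard_normal(16), rng.standard_normal(8)
assert np.allclose(rank_one_edit(W, k, v) @ k, v)
```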


LLMs: Understanding Code Syntax and Semantics for Code Analysis

arXiv.org Artificial Intelligence

Large language models (LLMs) demonstrate significant potential to revolutionize software engineering (SE) by exhibiting outstanding performance in SE tasks such as code and document generation. However, the high reliability and risk control requirements in software engineering raise concerns about the lack of interpretability of LLMs. To address this concern, we conducted a study to evaluate the capabilities of LLMs and their limitations for code analysis in SE. We break down the abilities needed for artificial intelligence (AI) models to address SE tasks related to code analysis into three categories: 1) syntax understanding, 2) static behavior understanding, and 3) dynamic behavior understanding. Our investigation focused on the ability of LLMs to comprehend code syntax and semantic structures, which include abstract syntax trees (AST), control flow graphs (CFG), and call graphs (CG). We employed four state-of-the-art foundation models: GPT-4, GPT-3.5, StarCoder, and CodeLlama-13b-instruct. We assessed the performance of LLMs on cross-language tasks involving C, Java, Python, and Solidity. Our findings revealed that while LLMs have a talent for understanding code syntax, they struggle to comprehend code semantics, particularly dynamic semantics. We conclude that LLMs possess capabilities similar to an AST parser, demonstrating initial competencies in static code analysis. Furthermore, our study highlights that LLMs are susceptible to hallucinations when interpreting code semantic structures, fabricating nonexistent facts. These results indicate the need to explore methods to verify the correctness of LLM output to ensure its dependability in SE. More importantly, our study provides an initial answer to why the code generated by LLMs is usually syntactically correct yet still vulnerable.
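
Probing syntax understanding needs a ground truth to compare against, and a conventional parser supplies it. As a minimal illustration (Python only, though the study spans C, Java, Python, and Solidity), the snippet below extracts parent-child AST edges with Python's standard ast module; an LLM's claims about the same program's structure can then be checked against these edges.

```python
import ast

def ast_edges(source: str):
    """Parent-child edges of a Python AST, by node type.

    A deterministic parser's view of the program's syntax: the
    reference against which an LLM's structural answers can be scored.
    """
    tree = ast.parse(source)
    return [
        (type(parent).__name__, type(child).__name__)
        for parent in ast.walk(tree)
        for child in ast.iter_child_nodes(parent)
    ]

print(ast_edges("def f(x):\n    return x + 1"))
# [('Module', 'FunctionDef'), ('FunctionDef', 'arguments'), ('FunctionDef', 'Return'), ...]
```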


Generative Pretraining at Scale: Transformer-Based Encoding of Transactional Behavior for Fraud Detection

arXiv.org Artificial Intelligence

In the dynamic world of digital finance, protecting transactions from fraud is an increasingly intricate challenge. Machine learning has advanced behavioral analysis, yet it faces difficulties with the detailed patterns of transaction data, which is both vast in volume and complex in nature. Conventional models fall short in decoding this data accurately. Our study introduces a novel method that applies Generative Pretrained Transformers, acclaimed for language understanding, to model financial transactions. By pretraining on extensive user payment data, our model overcomes common obstacles of behavioral sequence analysis, such as the need for large amounts of labeled data. We detail three major contributions. First, we present an innovative autoregressive pretraining method designed specifically for financial data. Second, we introduce a differential convolution structure that enhances the model's capacity for anomaly detection. Third, from a methodological perspective, we demonstrate the model's flexibility in adapting to diverse financial scenarios, and from an application standpoint, we validate its scalability through extensive testing on our dataset. Our work not only moves the needle forward in financial fraud detection but also showcases the potential of unsupervised learning in environments where protecting data and user privacy is crucial.
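
The autoregressive pretraining objective described here is next-event prediction over tokenized transaction sequences. A minimal PyTorch sketch follows, assuming transactions have already been discretized into a 1024-token vocabulary; the TxnGPT name, the tiny dimensions, and the plain Transformer encoder are illustrative stand-ins, and the paper's differential convolution structure is omitted.

```python
import torch
import torch.nn as nn

class TxnGPT(nn.Module):
    """Minimal causal Transformer over discretized transaction events."""
    def __init__(self, vocab_size=1024, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                          # tokens: (batch, seq)
        seq = tokens.size(1)                            # causal mask: no peeking ahead
        mask = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
        return self.head(self.encoder(self.embed(tokens), mask=mask))

# autoregressive objective: predict event t+1 from events up to t
model = TxnGPT()
tokens = torch.randint(0, 1024, (8, 32))                # 8 users, 32 events each
logits = model(tokens)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 1024), tokens[:, 1:].reshape(-1))
loss.backward()
```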


Reward Imputation with Sketching for Contextual Batched Bandits

arXiv.org Artificial Intelligence

Contextual batched bandit (CBB) is a setting where a batch of rewards is observed from the environment at the end of each episode, but the rewards of the non-executed actions are unobserved, resulting in partial-information feedback. Existing approaches for CBB often ignore the rewards of the non-executed actions, leading to underutilization of feedback information. In this paper, we propose an efficient approach called Sketched Policy Updating with Imputed Rewards (SPUIR) that completes the unobserved rewards using sketching, which approximates the full-information feedback. We formulate reward imputation as an imputation-regularized ridge regression problem that captures the feedback mechanisms of both executed and non-executed actions. To reduce time complexity, we solve the regression problem using randomized sketching. We prove that our approach achieves instantaneous regret with controllable bias and smaller variance than approaches without reward imputation. Furthermore, our approach enjoys a sublinear regret bound against the optimal policy. We also present two extensions, a rate-scheduled version and a version for nonlinear rewards, making our approach more practical. Experimental results show that SPUIR outperforms state-of-the-art baselines on synthetic, public benchmark, and real-world datasets.
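
The computational core is ridge regression made cheap by sketching. Here is a minimal sketch of that idea, assuming a plain Gaussian sketching matrix (SPUIR's actual sketch and imputation regularizer may differ): fit a linear reward model on sketched data, then use it to impute rewards for non-executed actions.

```python
import numpy as np

def sketched_ridge(X, y, sketch_dim, lam=1.0, rng=None):
    """Ridge regression solved on a Gaussian sketch of the data.

    Imputing a non-executed action's reward reduces to a ridge fit on
    logged contexts; sketching the n x d design matrix down to
    sketch_dim x d cuts the cost of the solve.
    """
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    S = rng.standard_normal((sketch_dim, n)) / np.sqrt(sketch_dim)
    Xs, ys = S @ X, S @ y                                  # sketched data
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(d), Xs.T @ ys)

# fit on logged interactions, then impute rewards for new contexts
rng = np.random.default_rng(1)
X = rng.standard_normal((5000, 16))
y = X @ rng.standard_normal(16) + 0.1 * rng.standard_normal(5000)
w = sketched_ridge(X, y, sketch_dim=256)
imputed = X[:5] @ w                                        # estimated rewards
```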


Are Code Pre-trained Models Powerful to Learn Code Syntax and Semantics?

arXiv.org Artificial Intelligence

Analysis of pre-trained code models has revealed that they can effectively learn program syntax. However, existing works are limited to analyzing code syntax, and their distance-based approaches are not accurate due to the curse of high dimensionality. Furthermore, the program semantics learnt by these models is rarely discussed. To further understand the code features learnt by these models, in this paper, we target two well-known representative code pre-trained models (i.e., CodeBERT and GraphCodeBERT) and devise a set of probing tasks for syntax and semantics analysis. Specifically, on one hand, we design two probing tasks (i.e., syntax pair node prediction and token tagging prediction) that manipulate the AST to assess the learnt program syntax. On the other hand, we design two tasks (i.e., semantic relationship prediction and semantic propagation prediction (inGraph)) on the constructed control flow graph (CFG), data dependency graph (DDG), and control dependency graph (CDG) to analyze the learnt program semantics. In addition, to understand which kinds of program semantics these pre-trained models comprehend well, we conduct a statistical analysis of the attention weights learnt by different heads and layers. Through extensive analysis in terms of program syntax and semantics, we have the following findings: 1) Both CodeBERT and GraphCodeBERT can learn program syntax well. 2) Both CodeBERT and GraphCodeBERT can learn program semantics to different extents. GraphCodeBERT is superior to CodeBERT in learning program control flow and data dependency information, but has a similar capability to CodeBERT in learning control dependency information. 3) Both CodeBERT and GraphCodeBERT can capture program semantics in the final layer of representation, but different attention heads and layers exhibit different roles in learning program semantics.
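
A probing task of this kind boils down to training a small supervised model on frozen embeddings: if a simple classifier recovers a structural property from the representations, the pre-trained model has encoded it. The sketch below uses random placeholder features and labels; in the actual setup the features would be token embeddings extracted from CodeBERT or GraphCodeBERT, and the labels would come from the AST or dependency graphs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Probing sketch: a linear classifier over frozen embeddings. High probe
# accuracy means the property (e.g., a token's AST tag) is linearly
# recoverable from the representation. Features here are random stand-ins.
rng = np.random.default_rng(0)
n_tokens, dim, n_tags = 2000, 768, 5
embeddings = rng.standard_normal((n_tokens, dim))  # placeholder features
labels = rng.integers(0, n_tags, size=n_tokens)    # placeholder AST tags

split = int(0.8 * n_tokens)
probe = LogisticRegression(max_iter=1000)
probe.fit(embeddings[:split], labels[:split])
print("probe accuracy:", probe.score(embeddings[split:], labels[split:]))
# random features give near-chance accuracy (~1/5); informative embeddings would not
```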


Learning Program Representations with a Tree-Structured Transformer

arXiv.org Artificial Intelligence

Learning vector representations for programs is a critical step in applying deep learning techniques to program understanding tasks. Various neural network models have been proposed to learn from tree-structured program representations, e.g., abstract syntax trees (AST) and concrete syntax trees (CST). However, most neural architectures either fail to capture long-range dependencies, which are ubiquitous in programs, or cannot learn effective representations for syntax tree nodes, making them incapable of performing node-level prediction tasks, e.g., bug localization. In this paper, we propose Tree-Transformer, a novel recursive tree-structured neural network that learns vector representations for source code. We propose a multi-head attention mechanism to model the dependencies between sibling and parent-child node pairs. Moreover, we propose a bi-directional propagation strategy that allows node information to pass in two directions, bottom-up and top-down, along trees. In this way, Tree-Transformer can learn node features as well as global contextual information. Extensive experimental results show that Tree-Transformer significantly outperforms existing tree-based and graph-based program representation learning approaches in both tree-level and node-level prediction tasks.
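
The bottom-up half of such a pass can be sketched in a few lines of PyTorch. This toy version is an assumption-laden stand-in, not the paper's architecture: each node is a dict holding a feature vector and its children, the parent's feature serves as the attention query over its children, and the top-down propagation and exact attention design of Tree-Transformer are omitted.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

def bottom_up(node):
    """Embed a tree recursively: each parent attends over its children.

    A leaf returns its own feature; an internal node queries its
    children's embeddings and adds the attention output as a residual.
    """
    if not node["children"]:
        return node["feat"]
    kids = torch.stack([bottom_up(c) for c in node["children"]]).unsqueeze(0)
    query = node["feat"].reshape(1, 1, -1)         # parent feature as query
    out, _ = attn(query, kids, kids)               # attend over the children
    return node["feat"] + out.reshape(-1)          # residual update

leaf = lambda: {"feat": torch.randn(64), "children": []}
root = {"feat": torch.randn(64), "children": [leaf(), leaf()]}
embedding = bottom_up(root)                        # vector for the root node
```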


Navigating with Graph Representations for Fast and Scalable Decoding of Neural Language Models

Neural Information Processing Systems

Neural language models (NLMs) have recently gained renewed interest by achieving state-of-the-art performance across many natural language processing (NLP) tasks. However, NLMs are very computationally demanding, largely due to the computational cost of the decoding process, which consists of a softmax layer over a large vocabulary. We observe that in the decoding of many NLP tasks, only the probabilities of the top-K hypotheses need to be calculated precisely, and K is often much smaller than the vocabulary size. This paper proposes a novel softmax layer approximation algorithm, called Fast Graph Decoder (FGD), which quickly identifies, for a given context, a set of K words that are most likely to occur according to a NLM. We demonstrate that FGD reduces the decoding time by an order of magnitude while attaining close to the full softmax baseline accuracy on neural machine translation and language modeling tasks. We also prove the theoretical guarantee on the softmax approximation quality.
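
The decoding shortcut amounts to a nearest-neighbor search over the output embeddings: find the K most likely words via a navigable small-world graph, then take the softmax over just those K logits. A minimal sketch, using hnswlib as a stand-in for the paper's own graph construction and omitting FGD's transform for handling the softmax bias term:

```python
import numpy as np
import hnswlib

# Decoding only needs the top-K logits. Build a small-world graph over the
# output word embeddings once, then answer each context query with a graph
# walk instead of a full softmax over the vocabulary.
rng = np.random.default_rng(0)
vocab, dim, K = 50_000, 256, 10
E = rng.standard_normal((vocab, dim)).astype(np.float32)  # output embeddings

index = hnswlib.Index(space="ip", dim=dim)                # inner-product search
index.init_index(max_elements=vocab, ef_construction=200, M=16)
index.add_items(E)
index.set_ef(64)                                          # search-time beam width

h = rng.standard_normal((1, dim)).astype(np.float32)      # decoder hidden state
labels, _ = index.knn_query(h, k=K)                       # approximate top-K words
logits = E[labels[0]] @ h[0]                              # exact logits for the K
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                      # softmax over K only
```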


Navigating with Graph Representations for Fast and Scalable Decoding of Neural Language Models

arXiv.org Artificial Intelligence

Neural language models (NLMs) have recently gained renewed interest by achieving state-of-the-art performance across many natural language processing (NLP) tasks. However, NLMs are very computationally demanding, largely due to the computational cost of the softmax layer over a large vocabulary. We observe that, in the decoding of many NLP tasks, only the probabilities of the top-K hypotheses need to be calculated precisely, and K is often much smaller than the vocabulary size. This paper proposes a novel softmax layer approximation algorithm, called Fast Graph Decoder (FGD), which quickly identifies, for a given context, a set of K words that are most likely to occur according to a NLM. We demonstrate that FGD reduces the decoding time by an order of magnitude while attaining close to the full softmax baseline accuracy on neural machine translation and language modeling tasks. We also prove the theoretical guarantee on the softmax approximation quality.