Zhang, Shijie
DiffGraph: Heterogeneous Graph Diffusion Model
Li, Zongwei, Xia, Lianghao, Hua, Hua, Zhang, Shijie, Wang, Shuangyang, Huang, Chao
Recent advances in Graph Neural Networks (GNNs) have revolutionized graph-structured data modeling, yet traditional GNNs struggle with complex heterogeneous structures prevalent in real-world scenarios. Despite progress in handling heterogeneous interactions, two fundamental challenges persist: noisy data significantly compromising embedding quality and learning performance, and existing methods' inability to capture intricate semantic transitions among heterogeneous relations, which impacts downstream predictions. To address these fundamental issues, we present the Heterogeneous Graph Diffusion Model (DiffGraph), a pioneering framework that introduces an innovative cross-view denoising strategy. This advanced approach transforms auxiliary heterogeneous data into target semantic spaces, enabling precise distillation of task-relevant information. At its core, DiffGraph features a sophisticated latent heterogeneous graph diffusion mechanism, implementing a novel forward and backward diffusion process for superior noise management. This methodology achieves simultaneous heterogeneous graph denoising and cross-type transition, while significantly simplifying graph generation through its latent-space diffusion capabilities. Through rigorous experimental validation on both public and industrial datasets, we demonstrate that DiffGraph consistently surpasses existing methods in link prediction and node classification tasks, establishing new benchmarks for robustness and efficiency in heterogeneous graph processing. The model implementation is publicly available at: https://github.com/HKUDS/DiffGraph.
Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach
Wan, Yao, Wan, Guanghua, Zhang, Shijie, Zhang, Hongyu, Sui, Yulei, Zhou, Pan, Jin, Hai, Sun, Lichao
Recent years have witnessed significant progress in developing deep learning-based models for automated code completion. Although using source code in GitHub has been a common practice for training deep-learning-based models for code completion, it may induce some legal and ethical issues such as copyright infringement. In this paper, we investigate the legal and ethical issues of current neural code completion models by answering the following question: Is my code used to train your neural code completion model? To this end, we tailor a membership inference approach (termed CodeMI) that was originally crafted for classification tasks to a more challenging task of code completion. In particular, since the target code completion models perform as opaque black boxes, preventing access to their training data and parameters, we opt to train multiple shadow models to mimic their behavior. The acquired posteriors from these shadow models are subsequently employed to train a membership classifier. Subsequently, the membership classifier can be effectively employed to deduce the membership status of a given code sample based on the output of a target code completion model. We comprehensively evaluate the effectiveness of this adapted approach across a diverse array of neural code completion models, (i.e., LSTM-based, CodeGPT, CodeGen, and StarCoder). Experimental results reveal that the LSTM-based and CodeGPT models suffer the membership leakage issue, which can be easily detected by our proposed membership inference approach with an accuracy of 0.842, and 0.730, respectively. Interestingly, our experiments also show that the data membership of current large language models of code, e.g., CodeGen and StarCoder, is difficult to detect, leaving amper space for further improvement. Finally, we also try to explain the findings from the perspective of model memorization.
Out of the Box Thinking: Improving Customer Lifetime Value Modelling via Expert Routing and Game Whale Detection
Zhang, Shijie, Yan, Xin, Yang, Xuejiao, Jia, Binfeng, Wang, Shuangyang
Customer lifetime value (LTV) prediction is essential for mobile game publishers trying to optimize the advertising investment for each user acquisition based on the estimated worth. In mobile games, deploying microtransactions is a simple yet effective monetization strategy, which attracts a tiny group of game whales who splurge on in-game purchases. The presence of such game whales may impede the practicality of existing LTV prediction models, since game whales' purchase behaviours always exhibit varied distribution from general users. Consequently, identifying game whales can open up new opportunities to improve the accuracy of LTV prediction models. However, little attention has been paid to applying game whale detection in LTV prediction, and existing works are mainly specialized for the long-term LTV prediction with the assumption that the high-quality user features are available, which is not applicable in the UA stage. In this paper, we propose ExpLTV, a novel multi-task framework to perform LTV prediction and game whale detection in a unified way. In ExpLTV, we first innovatively design a deep neural network-based game whale detector that can not only infer the intrinsic order in accordance with monetary value, but also precisely identify high spenders (i.e., game whales) and low spenders. Then, by treating the game whale detector as a gating network to decide the different mixture patterns of LTV experts assembling, we can thoroughly leverage the shared information and scenario-specific information (i.e., game whales modelling and low spenders modelling). Finally, instead of separately designing a purchase rate estimator for two tasks, we design a shared estimator that can preserve the inner task relationships. The superiority of ExpLTV is further validated via extensive experiments on three industrial datasets.
PipAttack: Poisoning Federated Recommender Systems forManipulating Item Promotion
Zhang, Shijie, Yin, Hongzhi, Chen, Tong, Huang, Zi, Nguyen, Quoc Viet Hung, Cui, Lizhen
Due to the growing privacy concerns, decentralization emerges rapidly in personalized services, especially recommendation. Also, recent studies have shown that centralized models are vulnerable to poisoning attacks, compromising their integrity. In the context of recommender systems, a typical goal of such poisoning attacks is to promote the adversary's target items by interfering with the training dataset and/or process. Hence, a common practice is to subsume recommender systems under the decentralized federated learning paradigm, which enables all user devices to collaboratively learn a global recommender while retaining all the sensitive data locally. Without exposing the full knowledge of the recommender and entire dataset to end-users, such federated recommendation is widely regarded `safe' towards poisoning attacks. In this paper, we present a systematic approach to backdooring federated recommender systems for targeted item promotion. The core tactic is to take advantage of the inherent popularity bias that commonly exists in data-driven recommenders. As popular items are more likely to appear in the recommendation list, our innovatively designed attack model enables the target item to have the characteristics of popular items in the embedding space. Then, by uploading carefully crafted gradients via a small number of malicious users during the model update, we can effectively increase the exposure rate of a target (unpopular) item in the resulted federated recommender. Evaluations on two real-world datasets show that 1) our attack model significantly boosts the exposure rate of the target item in a stealthy way, without harming the accuracy of the poisoned recommender; and 2) existing defenses are not effective enough, highlighting the need for new defenses against our local model poisoning attacks to federated recommender systems.