Zhao, Yuying
Towards Trustworthy Retrieval Augmented Generation for Large Language Models: A Survey
Ni, Bo, Liu, Zheyuan, Wang, Leyao, Lei, Yongjia, Zhao, Yuying, Cheng, Xueqi, Zeng, Qingkai, Dong, Luna, Xia, Yinglong, Kenthapadi, Krishnaram, Rossi, Ryan, Dernoncourt, Franck, Tanjim, Md Mehrab, Ahmed, Nesreen, Liu, Xiaorui, Fan, Wenqi, Blasch, Erik, Wang, Yu, Jiang, Meng, Derr, Tyler
Retrieval-Augmented Generation (RAG) is an advanced technique designed to address the challenges of Artificial Intelligence-Generated Content (AIGC). By integrating context retrieval into content generation, RAG provides reliable and up-to-date external knowledge, reduces hallucinations, and ensures relevant context across a wide range of tasks. However, despite RAG's success and potential, recent studies have shown that the RAG paradigm also introduces new risks, including robustness limitations, privacy concerns, adversarial attacks, and accountability issues. Addressing these risks is critical for future applications of RAG systems, as they directly impact the trustworthiness of such systems. Although various methods have been developed to improve the trustworthiness of RAG systems, there is a lack of a unified perspective and framework for research on this topic. Thus, in this paper, we aim to address this gap by providing a comprehensive roadmap for developing trustworthy RAG systems. We place our discussion around six key perspectives: reliability, privacy, safety, fairness, explainability, and accountability. For each perspective, we present a general framework and taxonomy, offering a structured approach to understanding the current challenges, evaluating existing solutions, and identifying promising future research directions. To encourage broader adoption and innovation, we also highlight the downstream applications where trustworthy RAG systems have a significant impact.
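To make the retrieve-then-generate loop the abstract refers to concrete, here is a minimal sketch. The function names (embed, search, llm_generate) and the prompt template are illustrative placeholders, not components from the survey:

```python
# Minimal RAG sketch: retrieve supporting passages, then condition generation
# on them. All callables here (vector_index.embed/.search, llm_generate) are
# hypothetical stand-ins injected by the caller, not a specific library API.
from typing import Callable, List

def retrieve(query: str, vector_index, k: int = 5) -> List[str]:
    """Fetch the k passages whose embeddings are closest to the query."""
    query_vec = vector_index.embed(query)           # hypothetical embedding call
    return vector_index.search(query_vec, top_k=k)  # hypothetical ANN search

def rag_answer(query: str, vector_index, llm_generate: Callable[[str], str]) -> str:
    """Ground the generator in retrieved context to reduce hallucination."""
    context = "\n".join(retrieve(query, vector_index))
    prompt = (f"Answer using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {query}")
    return llm_generate(prompt)                     # hypothetical LLM call
```

Each trustworthiness perspective in the survey attaches to a piece of this loop: e.g., what the retriever is allowed to index (privacy), whether retrieved passages can be poisoned (safety), and whether the answer can be traced back to its sources (accountability).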
Large Language Model-based Augmentation for Imbalanced Node Classification on Text-Attributed Graphs
Wang, Leyao, Wang, Yu, Ni, Bo, Zhao, Yuying, Derr, Tyler
Node classification on graphs frequently encounters the challenge of class imbalance, leading to biased performance and posing significant risks in real-world applications. Although several data-centric solutions have been proposed, none of them focuses on Text-Attributed Graphs (TAGs), thereby overlooking the potential of leveraging the rich semantics encoded in textual features to boost the classification of minority nodes. Given this crucial gap, we investigate the possibility of augmenting graph data in the text space, leveraging the text generation power of Large Language Models (LLMs) to handle imbalanced node classification on TAGs. Specifically, we propose a novel approach called LA-TAG (LLM-based Augmentation on Text-Attributed Graphs), which prompts LLMs to generate synthetic texts based on existing node texts in the graph. To integrate these synthetic text-attributed nodes into the graph, we further introduce a text-based link predictor that connects the synthesized nodes to existing nodes. Our experiments across multiple datasets and evaluation metrics show that our framework significantly outperforms traditional non-textual data augmentation strategies and dedicated node imbalance solutions, highlighting the promise of using LLMs to resolve imbalance issues on TAGs.
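A schematic sketch of the two-stage pipeline described above follows. The prompt wording, the `llm` and `text_encoder` callables, and the cosine-similarity linking rule are assumptions for illustration; the paper's actual prompt design and link predictor may differ:

```python
import numpy as np

def augment_minority_class(minority_texts, llm, n_new):
    """Stage 1 (sketch): prompt an LLM to synthesize new node texts from
    existing minority-class node texts."""
    prompt = ("Here are example texts from an under-represented class:\n"
              + "\n".join(minority_texts[:5])
              + "\nWrite one new text in the same style and on the same topic.")
    return [llm(prompt) for _ in range(n_new)]

def link_synthetic_nodes(new_texts, node_texts, text_encoder, threshold=0.8):
    """Stage 2 (toy text-based link predictor): connect each synthetic node to
    existing nodes whose text embeddings are sufficiently similar."""
    E = text_encoder(node_texts)   # (n, d) embeddings of existing nodes
    Z = text_encoder(new_texts)    # (m, d) embeddings of synthetic nodes
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    sim = Z @ E.T                  # (m, n) cosine similarities
    return [(i, j) for i in range(sim.shape[0])
                   for j in range(sim.shape[1]) if sim[i, j] >= threshold]
```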
SimCE: Simplifying Cross-Entropy Loss for Collaborative Filtering
Yang, Xiaodong, Chen, Huiyuan, Yan, Yuchen, Tang, Yuxin, Zhao, Yuying, Xu, Eric, Cai, Yiwei, Tong, Hanghang
The learning objective is integral to collaborative filtering systems, where the Bayesian Personalized Ranking (BPR) loss is widely used for learning informative backbones. However, BPR often suffers from slow convergence and suboptimal local optima, partly because it considers only one negative item for each positive item, neglecting the potential impact of other unobserved items. To address this issue, the recently proposed Sampled Softmax Cross-Entropy (SSM) loss compares one positive sample with multiple negative samples, leading to better performance. Our comprehensive experiments confirm that recommender systems consistently benefit from multiple negative samples during training. Furthermore, we introduce a Simplified Sampled Softmax Cross-Entropy loss (SimCE), which simplifies the SSM loss using its upper bound. Our validation on 12 benchmark datasets, using both Matrix Factorization (MF) and LightGCN backbones, shows that SimCE significantly outperforms both BPR and SSM.
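The three losses can be contrasted in a few lines. BPR is -log sigmoid(s_pos - s_neg) for a single negative, and SSM is log(1 + sum_j exp(s_j - s_pos)) over N negatives. Since log(sum_j exp(x_j)) <= log N + max_j x_j, an upper bound on SSM needs only the hardest negative; the sketch below uses that reading of "simplifies the SSM using its upper bound", and the paper's exact SimCE formula may differ:

```python
import torch
import torch.nn.functional as F

def bpr_loss(pos_score, neg_score):
    """BPR: one negative per positive; -log sigmoid(s_pos - s_neg)."""
    return F.softplus(neg_score - pos_score).mean()

def ssm_loss(pos_score, neg_scores):
    """SSM: one positive vs. N negatives; pos_score: (B,), neg_scores: (B, N).
    (Numerically naive; use a logsumexp formulation in practice.)"""
    diff = neg_scores - pos_score.unsqueeze(1)            # (B, N)
    return torch.log1p(torch.exp(diff).sum(dim=1)).mean()

def simce_like_loss(pos_score, neg_scores):
    """Illustrative upper bound of SSM (an assumption, not the paper's exact
    form): replace the sum over negatives with the single hardest negative,
    using log(sum_j e^x_j) <= log N + max_j x_j."""
    n = neg_scores.size(1)
    hardest = (neg_scores - pos_score.unsqueeze(1)).max(dim=1).values  # (B,)
    return torch.log1p(n * torch.exp(hardest)).mean()
```

One per-example max over negatives replaces a sum of N exponentials, which is what makes the simplified objective cheap while still pushing against the most confusing unobserved item.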
Edge Classification on Graphs: New Directions in Topological Imbalance
Cheng, Xueqi, Wang, Yu, Liu, Yunchao, Zhao, Yuying, Aggarwal, Charu C., Derr, Tyler
Recent years have witnessed the remarkable success of applying graph machine learning (GML) to node/graph classification and link prediction. However, edge classification, a task with numerous real-world applications such as social network analysis and cybersecurity, has not seen comparable advancement. To address this gap, our study pioneers a comprehensive approach to edge classification. We identify a novel 'Topological Imbalance Issue', which arises from the skewed distribution of edges across different classes, affecting the local subgraph of each edge and harming the performance of edge classification. Inspired by recent node classification studies showing that performance discrepancies accompany varying local structural patterns, we investigate whether the performance discrepancy in topologically imbalanced edge classification can likewise be mitigated by characterizing the local class distribution variance. To this end, we introduce Topological Entropy (TE), a novel topology-based metric that measures the topological imbalance of each edge. Our empirical studies confirm that TE effectively measures local class distribution variance and indicate that prioritizing edges with high TE values can help address the issue of topological imbalance. Based on this, we develop two strategies - Topological Reweighting and TE Wedge-based Mixup - that focus training on (synthetic) edges according to their TEs: topological reweighting directly adjusts training edge weights based on TE, while wedge-based mixup interpolates synthetic edges between high-TE wedges. Finally, we integrate these strategies into a novel topological imbalance framework for edge classification, TopoEdge. Through extensive experiments, we demonstrate the efficacy of our proposed strategies on newly curated datasets, establishing a new benchmark for (imbalanced) edge classification.
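As a rough illustration of a metric in this spirit, the toy function below scores an edge by the Shannon entropy of the class distribution over labeled edges incident to its endpoints; a skewed (low-entropy) neighborhood signals local topological imbalance. This is a stand-in under stated assumptions, not the paper's exact TE definition:

```python
import math
from collections import Counter

def topological_entropy(edge, edge_labels, adjacency):
    """Toy stand-in for Topological Entropy (TE): entropy of the class
    distribution over labeled edges touching the endpoints of `edge`.
    edge: (u, v) tuple; edge_labels: dict edge -> class label;
    adjacency: dict node -> set of incident edges (as tuples)."""
    u, v = edge
    neighborhood = (adjacency[u] | adjacency[v]) - {edge}
    labels = [edge_labels[e] for e in neighborhood if e in edge_labels]
    if not labels:
        return 0.0
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

Under this reading, reweighting would upweight edges whose score indicates a difficult local class mixture, and wedge-based mixup would synthesize edges between such regions.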
A Survey on Privacy in Graph Neural Networks: Attacks, Preservation, and Applications
Zhang, Yi, Zhao, Yuying, Li, Zhaoqing, Cheng, Xueqi, Wang, Yu, Kotevska, Olivera, Yu, Philip S., Derr, Tyler
Privacy attacks are a popular and well-developed topic in various fields such as social network analysis, healthcare, finance, and systems [88], [89], [90]. In recent years, the surge of machine learning has provided powerful tools to solve many practical problems. However, data-driven approaches also threaten users' privacy due to the associated risks of data leakage and inference [85]. Consequently, a substantial amount of work has been devoted to investigating the vulnerabilities of ML models and the risks of privacy leakage [47]. One branch of privacy research develops privacy attack models, which have received much attention over the past few years. However, attack models targeting GNNs have only been explored very recently, for two reasons: GNN techniques are relatively new compared with CNNs and Transformers in the image and natural language processing (NLP) domains, and the irregular graph structure poses unique challenges to transferring attack techniques that are well-established in other domains. In this section, we summarize papers that have developed attack models specifically targeting GNNs. We classify the privacy attack models on GNNs into four categories (visualized in Figure 4): a) model extraction attacks (MEA), b) graph structure reconstruction attacks (GSR), c) attribute inference attacks (AIA), and d) membership inference attacks (MIA).

Figure 4: Illustrations of the four categories of privacy attack models on graphs: a) model extraction attacks (MEA); b) graph structure reconstruction (GSR); c) attribute inference attacks (AIA); and d) membership inference attacks (MIA).
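To give one of the four categories a concrete shape, here is a minimal membership inference decision rule: predict "member of the training set" when the target model is unusually confident on a queried node. This is a deliberately simplified sketch (real MIAs against GNNs typically use shadow models and richer features); the function and threshold are illustrative assumptions:

```python
import numpy as np

def confidence_mia(target_confidences, threshold=0.9):
    """Minimal membership inference attack (MIA) sketch: flag a queried node
    as a training-set member (1) when the target model's max softmax
    probability on it exceeds a threshold, else non-member (0).
    target_confidences: (n,) max class probability per queried node."""
    return (np.asarray(target_confidences) >= threshold).astype(int)
```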
Collaboration-Aware Graph Convolutional Network for Recommender Systems
Wang, Yu, Zhao, Yuying, Zhang, Yi, Derr, Tyler
Graph Neural Networks (GNNs) have been successfully adopted in recommender systems by virtue of message-passing that implicitly captures the collaborative effect. Nevertheless, most existing message-passing mechanisms for recommendation are directly inherited from GNNs without scrutinizing whether the captured collaborative effect actually benefits the prediction of user preferences. In this paper, we first analyze how message-passing captures the collaborative effect and propose a recommendation-oriented topological metric, the Common Interacted Ratio (CIR), which measures the level of interaction between a specific neighbor of a node and the rest of its neighbors. After demonstrating the benefits of leveraging collaborations from neighbors with higher CIR, we propose a recommendation-tailored GNN, the Collaboration-Aware Graph Convolutional Network (CAGCN), that goes beyond the 1-Weisfeiler-Lehman (1-WL) test in distinguishing non-bipartite-subgraph-isomorphic graphs. Experiments on six benchmark datasets show that the best CAGCN variant outperforms the most representative GNN-based recommendation model, LightGCN, by nearly 10% in Recall@20 and also achieves around 80% speedup. Our code is publicly available at https://github.com/YuWVandy/CAGCN.
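For intuition, a toy version of a CIR-style metric on a user-item bipartite graph is sketched below: for a neighbor item i of user u, average how strongly i co-occurs (via shared users) with u's other items. The Jaccard-overlap choice here is an illustrative assumption; see the paper and repository for the exact definition:

```python
def common_interacted_ratio(u, i, user_items, item_users):
    """Toy CIR-style score for neighbor item i of user u in a bipartite graph.
    user_items: dict user -> set of interacted items;
    item_users: dict item -> set of interacting users."""
    others = user_items[u] - {i}
    if not others:
        return 0.0
    score = 0.0
    for j in others:
        inter = len(item_users[i] & item_users[j])
        union = len(item_users[i] | item_users[j])
        score += inter / union if union else 0.0  # Jaccard overlap with item j
    return score / len(others)
```

A neighbor with a high score participates in many of the same interactions as the user's other neighbors, which is the kind of collaborative signal the metric is designed to reward during message-passing.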
Fairness and Explainability: Bridging the Gap Towards Fair Model Explanations
Zhao, Yuying, Wang, Yu, Derr, Tyler
While machine learning models have achieved unprecedented success in real-world applications, they can make biased/unfair decisions for specific demographic groups and hence produce discriminatory outcomes. Although research efforts have been devoted to measuring and mitigating bias, they mainly study bias from a result-oriented perspective while neglecting the bias encoded in the decision-making procedure. This leaves procedure-oriented bias uncaptured and thereby limits the possibility of fully debiasing a model. Fortunately, with the rapid development of explainable machine learning, explanations for predictions are now available to gain insights into the decision procedure. In this work, we bridge the gap between fairness and explainability by presenting a novel perspective of procedure-oriented fairness based on explanations. We identify procedure-based bias by measuring the gap in explanation quality between different groups with two metrics, Ratio-based and Value-based Explanation Fairness. These metrics further motivate us to design an optimization objective that mitigates procedure-based bias, and we observe that doing so also mitigates bias in the predictions. Based on this objective, we propose a Comprehensive Fairness Algorithm (CFA) that simultaneously fulfills multiple objectives - improving traditional fairness, satisfying explanation fairness, and maintaining utility performance. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed CFA and highlight the importance of considering fairness from the explainability perspective. Our code is publicly available at https://github.com/YuyingZhao/FairExplanations-CFA.
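A minimal sketch of the "gap in explanation quality between groups" idea follows. Both functions are illustrative readings under assumptions (per-sample explanation-quality scores and binary group labels as inputs); the paper's exact formulations may differ:

```python
import numpy as np

def value_based_explanation_fairness(quality, group):
    """Toy Value-based Explanation Fairness: absolute gap in mean explanation
    quality between two demographic groups.
    quality: (n,) per-sample explanation-quality scores; group: (n,) in {0, 1}."""
    quality, group = np.asarray(quality, dtype=float), np.asarray(group)
    return abs(quality[group == 0].mean() - quality[group == 1].mean())

def ratio_based_explanation_fairness(passes_threshold, group):
    """Toy Ratio-based Explanation Fairness: gap between groups in the fraction
    of samples whose explanations clear a quality threshold.
    passes_threshold: (n,) booleans; group: (n,) in {0, 1}."""
    ok, group = np.asarray(passes_threshold, dtype=float), np.asarray(group)
    return abs(ok[group == 0].mean() - ok[group == 1].mean())
```

Under this reading, a debiasing objective would add such a gap as a penalty term alongside the utility loss and a traditional result-oriented fairness term, which is the multi-objective structure CFA balances.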