 order bias


The Impact of Input Order Bias on Large Language Models for Software Fault Localization

arXiv.org Artificial Intelligence

Large Language Models (LLMs) show great promise in software engineering tasks like Fault Localization (FL) and Automatic Program Repair (APR). This study examines how input order and context size affect LLM performance in FL, a key step for many downstream software engineering tasks. We test different orders for methods using Kendall Tau distances, including "perfect" (where ground truths come first) and "worst" (where ground truths come last). Our results show a strong bias in order, with Top-1 accuracy falling from 57% to 20% when we reverse the code order. Breaking down inputs into smaller contexts helps reduce this bias, narrowing the performance gap between perfect and worst orders from 22% to just 1%. We also look at ordering methods based on traditional FL techniques and metrics. Ordering using DepGraph's ranking achieves 48% Top-1 accuracy, better than more straightforward ordering approaches like CallGraph. These findings underscore the importance of how we structure inputs, manage contexts, and choose ordering methods to improve LLM performance in FL and other software engineering tasks.
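
To make the ordering terminology in this abstract concrete, the following is a minimal sketch of a Kendall Tau distance between two method orderings, with a "perfect" order (ground truths first) and a "worst" order built as its reversal. The method names, the ground-truth set, and the helper functions are hypothetical illustrations and are not taken from the paper.

```python
from itertools import combinations


def kendall_tau_distance(order_a, order_b):
    """Count item pairs that the two orderings rank in opposite relative order."""
    pos_b = {item: i for i, item in enumerate(order_b)}
    # combinations() yields (x, y) with x before y in order_a;
    # the pair is discordant when order_b places x after y.
    return sum(1 for x, y in combinations(order_a, 2) if pos_b[x] > pos_b[y])


def perfect_and_worst(methods, faulty):
    """Build a 'perfect' order (ground truths first) and its reversal ('worst')."""
    hits = [m for m in methods if m in faulty]
    rest = [m for m in methods if m not in faulty]
    perfect = hits + rest
    worst = list(reversed(perfect))  # ground truths end up last
    return perfect, worst


# Hypothetical method names and ground truth, for illustration only.
methods = ["parse", "validate", "compute", "render"]
faulty = {"compute"}

perfect, worst = perfect_and_worst(methods, faulty)
max_pairs = len(methods) * (len(methods) - 1) // 2
distance = kendall_tau_distance(perfect, worst)
normalized = distance / max_pairs  # 0.0 = identical order, 1.0 = fully reversed
```

Under these assumptions the normalized distance between the two extremes is 1.0; intermediate orderings would fall between 0.0 and 1.0.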


Grade Score: Quantifying LLM Performance in Option Selection

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have demonstrated remarkable intelligence and versatility in tasks related to logic, reasoning, and grading [4, 1, 7]. This has led to the increasing use of LLMs as judges of arbitrary user-presented options, and at times as judges of other LLMs [11, 12]. However, previous research has shown that LLMs exhibit biases, including a tendency to favor the first option presented to them. This paper explores various methods to mitigate order bias and improve the consistency of LLM judging. To facilitate progress in the study of LLM biases and consistency, we introduce a novel metric called the Grade Score, designed to quantify both the selection consistency and the bias exhibited by an LLM, providing a comprehensive measure of an LLM's judging performance. A high score indicates a model that is highly consistent and fair with respect to order, while a low score suggests significant order bias or inconsistency in the model's choices. The Grade Score serves as a tool for researchers and practitioners to assess and compare the performance of different LLMs in judging tasks. By quantifying the degree of instability and bias, the Grade Score enables the identification of models that exhibit superior judging capabilities and facilitates the development of techniques to mitigate biases and improve consistency.
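
The abstract does not give the Grade Score formula, so the sketch below is not that metric. It only illustrates, under stated assumptions, the two quantities the abstract says the score reflects: how often a judge picks each position, and how consistently it picks the same option across orderings. The judge function is a hypothetical stand-in for an LLM call.

```python
from collections import Counter
from itertools import permutations


def first_option_judge(options):
    """Hypothetical stand-in for an LLM judge: always picks the first option."""
    return options[0]


def order_sensitivity(judge, options):
    """Run the judge on every ordering; report picked positions and agreement."""
    position_counts = Counter()  # how often each position was picked
    choice_counts = Counter()    # how often each option was picked
    orderings = list(permutations(options))
    for ordering in orderings:
        choice = judge(list(ordering))
        position_counts[ordering.index(choice)] += 1
        choice_counts[choice] += 1
    # Share of orderings on which the judge picked its most frequent choice.
    consistency = max(choice_counts.values()) / len(orderings)
    return position_counts, consistency


positions, consistency = order_sensitivity(first_option_judge, ["A", "B", "C"])
```

A judge that always takes the first option puts every pick in position 0 and agrees with itself on only one third of the six orderings, whereas a judge that picks by content alone would reach a consistency of 1.0.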


Propagation-aware Social Recommendation by Transfer Learning

arXiv.org Artificial Intelligence

Social-aware recommendation approaches have been recognized as an effective way to address the data sparsity issue of traditional recommender systems. The underlying assumption is that the knowledge in social user-user connections can be shared and transferred to the domain of user-item interactions, thereby helping to learn user preferences. However, most existing approaches adopt only the first-order connections among users during transfer learning, ignoring higher-order connections. We argue that recommendation performance can also benefit from high-order social relations. In this paper, we propose a novel Propagation-aware Transfer Learning Network (PTLN) based on the propagation of social relations. We aim to better mine the shared knowledge hidden in social networks and thus further improve recommendation performance. Specifically, we explore social influence in two aspects: (a) higher-order friends are taken into consideration through an order bias; (b) different friends in the same order are given distinct importance for recommendation through an attention mechanism. In addition, we design a novel regularization to bridge the gap between social relations and user-item interactions. We conduct extensive experiments on two real-world datasets and outperform competing methods in terms of ranking accuracy, especially for cold-start users with few historical interactions.