AssistedDS: Benchmarking How External Domain Knowledge Assists LLMs in Automated Data Science

Luo, An, Xian, Xun, Du, Jin, Tian, Fangqiao, Wang, Ganghua, Zhong, Ming, Zhao, Shengchun, Bi, Xuan, Liu, Zirui, Zhou, Jiawei, Srinivasa, Jayanth, Kundu, Ashish, Fleming, Charles, Hong, Mingyi, Ding, Jie

arXiv.org Artificial Intelligence

Large language models (LLMs) have advanced the automation of data science workflows. Yet it remains unclear whether they can critically leverage external domain knowledge as human data scientists do in practice. To answer this question, we introduce AssistedDS (Assisted Data Science), a benchmark designed to systematically evaluate how LLMs handle domain knowledge in tabular prediction tasks. AssistedDS features both synthetic datasets with explicitly known generative mechanisms and real-world Kaggle competitions, each accompanied by curated bundles of helpful and adversarial documents. These documents provide domain-specific insights into data cleaning, feature engineering, and model selection. We assess state-of-the-art LLMs on their ability to discern and apply beneficial versus harmful domain knowledge, evaluating submission validity, information recall, and predictive performance. Our results demonstrate three key findings: (1) LLMs frequently adopt provided information uncritically, which significantly impairs their predictive performance when adversarial content is introduced; (2) helpful guidance is often insufficient to counteract the negative influence of adversarial information; and (3) on Kaggle datasets, LLMs often err in handling time-series data, in applying feature engineering consistently across folds, and in interpreting categorical variables correctly. These findings highlight a substantial gap in current models' ability to critically evaluate and leverage expert knowledge, underscoring an essential research direction for developing more robust, knowledge-aware automated data science systems. Our data and code are publicly available at https://github.com/jeremyxianx/Assisted-DS
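As a rough illustration of the kind of evaluation the abstract describes, the sketch below scores a model's use of a document bundle. Every document name and the exact recall definition here are my assumptions for illustration, not details taken from the benchmark.

```python
# Hypothetical sketch: each task pairs a dataset with helpful and
# adversarial advice documents, and we score how much of the helpful
# set the model actually used, plus whether it adopted any adversarial advice.
def information_recall(cited_docs, helpful_docs):
    """Fraction of the helpful documents the model actually applied."""
    if not helpful_docs:
        return 0.0
    return len(set(cited_docs) & set(helpful_docs)) / len(helpful_docs)

bundle = {
    "helpful":     ["impute_median.md", "log_transform_target.md"],
    "adversarial": ["drop_key_feature.md"],
}
cited = ["impute_median.md", "drop_key_feature.md"]  # advice the model applied
recall = information_recall(cited, bundle["helpful"])               # 0.5
adopted_adversarial = bool(set(cited) & set(bundle["adversarial"]))  # True
```

The second flag captures the paper's first finding directly: uncritical adoption shows up as a nonzero intersection with the adversarial set.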


The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge

Marioriyad, Arash, Rohban, Mohammad Hossein, Baghshah, Mahdieh Soleymani

arXiv.org Artificial Intelligence

Large language models (LLMs) are increasingly deployed as automatic judges to evaluate system outputs in tasks such as summarization, dialogue, and creative writing. A faithful judge should base its verdicts solely on response quality and explicitly acknowledge the factors shaping its decision. We show that current LLM judges fail on both counts by relying on shortcuts introduced in the prompt. Our study uses two evaluation datasets: ELI5, a benchmark for long-form question answering, and LitBench, a recent benchmark for creative writing. Both datasets provide pairwise comparisons, where the evaluator must choose which of two responses is better. From each dataset we construct 100 pairwise judgment tasks and employ two widely used models, GPT-4o and Gemini-2.5-Flash, as evaluators in the role of LLM-as-a-judge. For each pair, we assign superficial cues to the responses, provenance cues indicating source identity (Human, Expert, LLM, or Unknown) and recency cues indicating temporal origin (Old, 1950 vs. New, 2025), while keeping the rest of the prompt fixed. Results reveal consistent verdict shifts: both models exhibit a strong recency bias, systematically favoring new responses over old, as well as a clear provenance hierarchy (Expert > Human > LLM > Unknown). These biases are especially pronounced in GPT-4o and in the more subjective and open-ended LitBench domain. Crucially, cue acknowledgment is rare: justifications almost never reference the injected cues, instead rationalizing decisions in terms of content qualities. These findings demonstrate that current LLM-as-a-judge systems are shortcut-prone and unfaithful, undermining their reliability as evaluators in both research and deployment.
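The cue-injection setup described above can be sketched as follows. The prompt template and cue wording are illustrative assumptions, since the paper's exact prompts are not reproduced here; the key property is that only the cue tags vary between variants.

```python
from itertools import product

# Cue vocabularies mirroring the study's design (the tag format is assumed).
PROVENANCE = ["Human", "Expert", "LLM", "Unknown"]
RECENCY = {"Old": 1950, "New": 2025}

def build_judge_prompt(question, resp_a, resp_b, cue_a, cue_b):
    """Attach superficial cues to each response while keeping the rest of
    the prompt fixed, so any verdict shift is attributable to the cues."""
    def tag(cue):
        source, year_label = cue
        return f"[Source: {source} | Written: {RECENCY[year_label]}]"
    return (
        f"Question: {question}\n\n"
        f"Response A {tag(cue_a)}:\n{resp_a}\n\n"
        f"Response B {tag(cue_b)}:\n{resp_b}\n\n"
        "Which response is better? Answer 'A' or 'B'."
    )

# Enumerate all cue pairings for one fixed response pair.
cues = list(product(PROVENANCE, RECENCY))  # 4 sources x 2 recency labels = 8
prompts = [
    build_judge_prompt("Why is the sky blue?", "Answer one.", "Answer two.", a, b)
    for a, b in product(cues, repeat=2)
]
print(len(prompts))  # 64 cue combinations per response pair
```

Comparing the judge's verdicts across these 64 otherwise-identical prompts is what exposes the recency and provenance biases the paper reports.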




ToolTweak: An Attack on Tool Selection in LLM-based Agents

Sneh, Jonathan, Yan, Ruomei, Yu, Jialin, Torr, Philip, Gal, Yarin, Sengupta, Sunando, Sommerlade, Eric, Paren, Alasdair, Bibi, Adel

arXiv.org Artificial Intelligence

As LLMs increasingly power agents that interact with external tools, tool use has become an essential mechanism for extending their capabilities. These agents typically select tools from growing databases or marketplaces to solve user tasks, creating implicit competition among tool providers and developers for visibility and usage. In this paper, we show that this selection process harbors a critical vulnerability: by iteratively manipulating tool names and descriptions, adversaries can systematically bias agents toward selecting specific tools, gaining unfair advantage over equally capable alternatives. We present ToolTweak, a lightweight automatic attack that increases selection rates from a baseline of around 20% to as high as 81%, with strong transferability between open-source and closed-source models. Beyond individual tools, we show that such attacks cause distributional shifts in tool usage, revealing risks to fairness, competition, and security in emerging tool ecosystems. To mitigate these risks, we evaluate two defenses: paraphrasing and perplexity filtering, which reduce bias and lead agents to select functionally similar tools more equally. All code will be open-sourced upon acceptance.
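Of the two defenses evaluated, perplexity filtering is straightforward to sketch. The toy below substitutes a Laplace-smoothed unigram model for a real language model, so the reference corpus, threshold, and tool names are all illustrative assumptions; the intuition is that adversarially tweaked descriptions tend to look statistically unnatural.

```python
import math
from collections import Counter

# Stand-in for a real LM: a unigram model fit on a tiny reference corpus
# of benign tool descriptions (assumption for illustration only).
REFERENCE = (
    "search the web for current weather conditions "
    "convert currency between two units "
    "summarize a document and return key points"
).split()
counts = Counter(REFERENCE)
total = sum(counts.values())

def unigram_perplexity(text, alpha=1.0):
    """Laplace-smoothed unigram perplexity of a tool description."""
    words = text.lower().split()
    vocab = len(counts) + 1  # +1 for the unknown-word bucket
    log_prob = sum(
        math.log((counts.get(w, 0) + alpha) / (total + alpha * vocab))
        for w in words
    )
    return math.exp(-log_prob / max(len(words), 1))

def filter_tools(tools, threshold):
    """Drop candidate tools whose description looks unnaturally unlikely."""
    return [t for t in tools if unigram_perplexity(t["description"]) <= threshold]

tools = [
    {"name": "weather_api",  "description": "search the web for current weather"},
    {"name": "weather_best", "description": "zzqx optimal premier weather xkcd"},
]
kept = filter_tools(tools, threshold=30.0)  # only weather_api survives
```

A production defense would use a real language model's perplexity and calibrate the threshold on known-benign descriptions; the filtering logic itself is unchanged.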


BiasBusters: Uncovering and Mitigating Tool Selection Bias in Large Language Models

Blankenstein, Thierry, Yu, Jialin, Li, Zixuan, Plachouras, Vassilis, Sengupta, Sunando, Torr, Philip, Gal, Yarin, Paren, Alasdair, Bibi, Adel

arXiv.org Artificial Intelligence

Agents backed by large language models (LLMs) often rely on external tools drawn from marketplaces where multiple providers offer functionally equivalent options. This raises a critical point concerning fairness: if selection is systematically biased, it can degrade user experience and distort competition by privileging some providers over others. We introduce a benchmark of diverse tool categories, each containing multiple functionally equivalent tools, to evaluate tool-selection bias. Using this benchmark, we test seven models and show that unfairness exists with models either fixating on a single provider or disproportionately preferring earlier-listed tools in context. To investigate the origins of this bias, we conduct controlled experiments examining tool features, metadata (name, description, parameters), and pre-training exposure. We find that: (1) semantic alignment between queries and metadata is the strongest predictor of choice; (2) perturbing descriptions significantly shifts selections; and (3) repeated pre-training exposure to a single endpoint amplifies bias. Finally, we propose a lightweight mitigation that first filters the candidate tools to a relevant subset and then samples uniformly, reducing bias while preserving good task coverage. Our findings highlight tool-selection bias as a key obstacle for the fair deployment of tool-augmented LLMs.

Large language models (LLMs) have transformed natural language processing, achieving near-human performance on tasks ranging from code generation to creative writing (Naveed et al., 2024; Luo et al., 2024). Yet LLMs cannot directly act in the world: they cannot query databases, fetch live information, or invoke external services. Additionally, their knowledge remains frozen at training time, leaving them prone to "hallucinations" when asked about events beyond their cutoff (Ji et al., 2023).
Augmenting LLMs with external "tools" (APIs) addresses these shortcomings by allowing models to delegate specialized functions to dedicated services (Qu et al., 2025). It endows LLMs with the ability to act, a core capability often associated with LLM agents (Chowa et al., 2025). A crucial step within the typical tool-usage pipeline is the multi-stage tool-selection process: given a user instruction, (i) retrieve a short list of the most relevant candidate tools (e.g., those with the highest semantic similarity to the query) from a potentially large database of tools, (ii) insert their metadata into the prompt, and (iii) have the LLM reason over the candidates and pick one to solve the user task. However, this process introduces a new challenge: bias (see Figure 1).
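The proposed mitigation, filter the candidates to a relevant subset and then sample uniformly among them, can be sketched as below. The keyword-overlap relevance score is a stand-in assumption for the semantic-similarity retrieval stage the paper describes.

```python
import random
from collections import Counter

def relevance(query, tool):
    """Toy relevance score: word overlap between query and description."""
    q = set(query.lower().split())
    d = set(tool["description"].lower().split())
    return len(q & d)

def fair_select(query, tools, k=3, seed=None):
    """Filter to the top-k relevant tools, then pick uniformly at random,
    so no single provider among equivalent tools is systematically favored."""
    ranked = sorted(tools, key=lambda t: relevance(query, t), reverse=True)
    shortlist = [t for t in ranked[:k] if relevance(query, t) > 0]
    rng = random.Random(seed)
    return rng.choice(shortlist) if shortlist else None

tools = [
    {"name": "fx_a", "description": "convert currency between two amounts"},
    {"name": "fx_b", "description": "currency conversion at live exchange rates"},
    {"name": "cal",  "description": "create a calendar event"},
]
picks = Counter(
    fair_select("convert currency", tools, seed=i)["name"] for i in range(1000)
)
# The two functionally equivalent fx tools are both selected;
# the irrelevant calendar tool is never selected.
```

Uniform sampling over the filtered subset trades a small amount of ranking signal for provider-level fairness, which matches the paper's reported bias reduction with preserved task coverage.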


Software Fairness Dilemma: Is Bias Mitigation a Zero-Sum Game?

Chen, Zhenpeng, Li, Xinyue, Zhang, Jie M., Sun, Weisong, Xiao, Ying, Li, Tianlin, Lou, Yiling, Liu, Yang

arXiv.org Artificial Intelligence

Fairness is a critical requirement for Machine Learning (ML) software, driving the development of numerous bias mitigation methods. Previous research has identified a leveling-down effect in bias mitigation for computer vision and natural language processing tasks, where fairness is achieved by lowering performance for all groups without benefiting the unprivileged group. However, it remains unclear whether this effect applies to bias mitigation for tabular data tasks, a key area in fairness research with significant real-world applications. This study evaluates eight bias mitigation methods for tabular data, including both widely used and cutting-edge approaches, across 44 tasks using five real-world datasets and four common ML models. Contrary to earlier findings, our results show that these methods operate in a zero-sum fashion, where improvements for unprivileged groups come at the cost of reduced benefits for traditionally privileged groups. Previous research indicates that the perception of a zero-sum trade-off might complicate the broader adoption of fairness policies. To explore alternatives, we investigate an approach that applies the state-of-the-art bias mitigation method solely to unprivileged groups, showing potential to enhance the benefits of unprivileged groups without negatively affecting privileged groups or overall ML performance. Our study highlights potential pathways for achieving fairness improvements without zero-sum trade-offs, which could help advance the adoption of bias mitigation methods.
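A minimal sketch of the "mitigate only for the unprivileged group" idea, using a simple group-specific decision threshold as a stand-in for the state-of-the-art method the study actually applies (the thresholds and group labels are illustrative assumptions):

```python
# Illustrative post-processing step: lower the acceptance threshold only
# for the unprivileged group, leaving the privileged group's predictions
# untouched, so its benefits cannot be reduced by the intervention.
def group_aware_predict(scores, groups, base_threshold=0.5,
                        unpriv_threshold=0.4, unprivileged="B"):
    return [
        int(s >= (unpriv_threshold if g == unprivileged else base_threshold))
        for s, g in zip(scores, groups)
    ]

scores = [0.45, 0.55, 0.45, 0.55]
groups = ["A",  "A",  "B",  "B"]
before = [int(s >= 0.5) for s in scores]      # [0, 1, 0, 1]
after = group_aware_predict(scores, groups)   # [0, 1, 1, 1]
# Positive rate for group B rises (0.5 -> 1.0); group A is unchanged,
# avoiding the zero-sum pattern by construction.
```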


Fairness-aware organ exchange and kidney paired donation

Zhang, Mingrui, Dai, Xiaowu, Li, Lexin

arXiv.org Artificial Intelligence

The kidney paired donation (KPD) program provides an innovative solution to overcome incompatibility challenges in kidney transplants by matching incompatible donor-patient pairs and facilitating kidney exchanges. To address unequal access to transplant opportunities, there are two widely used fairness criteria: group fairness and individual fairness. However, these criteria do not consider protected patient features, which refer to characteristics legally or ethically recognized as needing protection from discrimination, such as race and gender. Motivated by the calibration principle in machine learning, we introduce a new fairness criterion: the matching outcome should be conditionally independent of the protected feature, given the sensitization level. We integrate this fairness criterion as a constraint within the KPD optimization framework and propose a computationally efficient solution. Theoretically, we analyze the associated price of fairness using random graph models. Empirically, we compare our fairness criterion with group fairness and individual fairness through both simulations and a real-data example.
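In symbols (my own notation, writing M for the matching outcome, A for the protected feature, and S for the sensitization level), the proposed calibration-style criterion reads:

```latex
% Matching outcome M is conditionally independent of the
% protected feature A, given the sensitization level S:
M \perp\!\!\!\perp A \mid S
\quad\Longleftrightarrow\quad
\Pr(M = m \mid A = a,\, S = s) = \Pr(M = m \mid S = s)
\quad \forall\, a, m, s
```

The constraint says that once a patient's sensitization level is fixed, the protected feature carries no additional information about whether they are matched.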