BiasBusters: Uncovering and Mitigating Tool Selection Bias in Large Language Models

Blankenstein, Thierry, Yu, Jialin, Li, Zixuan, Plachouras, Vassilis, Sengupta, Sunando, Torr, Philip, Gal, Yarin, Paren, Alasdair, Bibi, Adel

arXiv.org Artificial Intelligence 

Agents backed by large language models (LLMs) often rely on external tools drawn from marketplaces where multiple providers offer functionally equivalent options. This raises a critical fairness concern: if selection is systematically biased, it can degrade user experience and distort competition by privileging some providers over others. We introduce a benchmark of diverse tool categories, each containing multiple functionally equivalent tools, to evaluate tool-selection bias. Using this benchmark, we test seven models and show that selection is unfair: models either fixate on a single provider or disproportionately prefer earlier-listed tools in context. To investigate the origins of this bias, we conduct controlled experiments examining tool features, metadata (name, description, parameters), and pre-training exposure. We find that: (1) semantic alignment between queries and metadata is the strongest predictor of choice; (2) perturbing descriptions significantly shifts selections; and (3) repeated pre-training exposure to a single endpoint amplifies bias. Finally, we propose a lightweight mitigation that first filters the candidate tools to a relevant subset and then samples uniformly, reducing bias while preserving good task coverage. Our findings highlight tool-selection bias as a key obstacle to the fair deployment of tool-augmented LLMs.

Large language models (LLMs) have transformed natural language processing, achieving near-human performance on tasks ranging from code generation to creative writing (Naveed et al., 2024; Luo et al., 2024). Yet LLMs cannot directly act in the world: they cannot query databases, fetch live information, or invoke external services. Additionally, their knowledge remains frozen at training time, leaving them prone to "hallucinations" when asked about events beyond their cutoff (Ji et al., 2023).
Augmenting LLMs with external "tools" (APIs) addresses these shortcomings by allowing models to delegate specialized functions to dedicated services (Qu et al., 2025). It endows LLMs with the ability to act, a core capability often associated with LLM agents (Chowa et al., 2025). A crucial step within the typical tool-usage pipeline is the multi-stage tool-selection process: given a user instruction, the system (i) retrieves a shortlist of the most relevant candidate tools (e.g., those with the highest semantic similarity to the query) from a potentially large tool database, (ii) inserts their metadata into the prompt, and (iii) has the LLM reason over the candidates and pick one to solve (one of) the user's task(s). However, this process introduces a new challenge: bias (see Figure 1).
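The retrieve-then-select pipeline, together with the filter-then-sample mitigation proposed in the abstract, can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the tool names and descriptions are invented, and a toy bag-of-words cosine similarity stands in for a real embedding model at the retrieval stage.

```python
import random
from collections import Counter
from math import sqrt

# Invented tool catalogue for illustration; two entries are functionally
# equivalent weather providers, mimicking a marketplace with competing tools.
TOOLS = [
    {"name": "weather_api_a", "description": "get current weather forecast for a city"},
    {"name": "weather_api_b", "description": "fetch live weather conditions and forecast"},
    {"name": "stock_quotes",  "description": "retrieve real-time stock market prices"},
    {"name": "currency_fx",   "description": "convert amounts between currencies"},
]

def bow(text):
    # Bag-of-words term counts; a stand-in for a learned embedding.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, tools, k=2):
    # Stage (i): shortlist the k tools most similar to the user query.
    q = bow(query)
    ranked = sorted(tools, key=lambda t: cosine(q, bow(t["description"])), reverse=True)
    return ranked[:k]

def select_uniform(query, tools, k=2, seed=None):
    # Mitigation sketch: filter to a relevant candidate subset, then sample
    # uniformly instead of letting the LLM's (possibly biased) preference
    # among equivalent providers decide.
    rng = random.Random(seed)
    return rng.choice(retrieve(query, tools, k))

candidates = retrieve("what is the weather forecast in Paris", TOOLS)
print([t["name"] for t in candidates])
print(select_uniform("what is the weather forecast in Paris", TOOLS, seed=0)["name"])
```

In the biased baseline, stage (iii) would instead prompt an LLM with the candidates' metadata and let it pick; the uniform draw removes any provider or position preference by construction, at the cost of ignoring whatever useful signal the LLM's ranking carries within the filtered set.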