Hard Negatives, Hard Lessons: Revisiting Training Data Quality for Robust Information Retrieval with LLMs

Thakur, Nandan, Zhang, Crystina, Ma, Xueguang, Lin, Jimmy

Oct-21-2025–arXiv.org Artificial Intelligence

Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs drawn from various data sources. However, we find that certain datasets can negatively impact model effectiveness: pruning 8 of the 15 datasets in the BGE collection reduces the training set size by 2.35$\times$ and, surprisingly, increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on "false negatives", where relevant passages are incorrectly labeled as irrelevant. We use LLMs as a simple, cost-effective approach to identify and relabel false negatives in training datasets. Experimental results show that relabeling false negatives as true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7–1.4 points on BEIR and by 1.7–1.8 points at nDCG@10 on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of LLMs in identifying false negatives is supported by human annotation results. Our training dataset and code are publicly available.
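To make the relabeling idea concrete, the following is a minimal sketch, not the authors' released code. It assumes each training example holds a query, a list of positives, and a list of hard negatives, and that relevance is decided by a pluggable `judge` callable. The names `TrainingExample`, `relabel_false_negatives`, and `toy_judge` are hypothetical, and `toy_judge` is a keyword-overlap stand-in for an LLM call so that the example runs without any external service.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TrainingExample:
    # Hypothetical container for one training instance.
    query: str
    positives: List[str]
    hard_negatives: List[str] = field(default_factory=list)


def relabel_false_negatives(
    example: TrainingExample,
    judge: Callable[[str, str], bool],
) -> TrainingExample:
    """Move hard negatives that the judge deems relevant into the positives."""
    kept: List[str] = []
    promoted: List[str] = []
    for passage in example.hard_negatives:
        if judge(example.query, passage):
            promoted.append(passage)  # treated as a false negative
        else:
            kept.append(passage)      # remains a hard negative
    return TrainingExample(example.query, example.positives + promoted, kept)


def toy_judge(query: str, passage: str) -> bool:
    # Assumption: stand-in for an LLM relevance judgment.
    # Counts query terms longer than three characters that occur in the passage.
    terms = {t for t in query.lower().split() if len(t) > 3}
    return sum(t in passage.lower() for t in terms) >= 2


example = TrainingExample(
    query="capital city of canada",
    positives=["Ottawa is the capital city of Canada."],
    hard_negatives=[
        "Canada's capital city, Ottawa, sits on the Ottawa River.",
        "Toronto is the largest city in Ontario.",
    ],
)
cleaned = relabel_false_negatives(example, toy_judge)
print(len(cleaned.positives), len(cleaned.hard_negatives))
```

In practice `judge` would wrap a prompt to an LLM that returns a relevance decision for a query-passage pair; the abstract does not specify the prompt or the decision rule, so those details are left open here.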

large language model, machine learning, natural language
