Language Model Re-rankers are Steered by Lexical Similarities
Lovisa Hagström, Ercong Nie, Ruben Halifa, Helmut Schmid, Richard Johansson, Alexander Junge
arXiv.org Artificial Intelligence
Language model (LM) re-rankers are used to refine retrieval results for retrieval-augmented generation (RAG). They are more expensive than lexical matching methods like BM25 but are assumed to better process semantic information. To understand whether LM re-rankers always live up to this assumption, we evaluate 6 different LM re-rankers on the NQ, LitQA2 and DRUID datasets. Our results show that LM re-rankers struggle to outperform a simple BM25 re-ranker on DRUID. Leveraging a novel separation metric based on BM25 scores, we identify and explain re-ranker errors stemming from lexical dissimilarities. We also investigate different methods to improve LM re-ranker performance and find these methods mainly useful for NQ. Taken together, our work identifies and explains weaknesses of LM re-rankers and points to the need for more adversarial and realistic datasets for their evaluation.
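The BM25 baseline the abstract compares against can be sketched in a few lines. The following is a minimal, self-contained Okapi BM25 re-ranker for illustration only; it is a generic textbook formulation, not the paper's implementation, and the paper's BM25-based separation metric is not reproduced here.

```python
import math
from collections import Counter

def bm25_rerank(query, docs, k1=1.5, b=0.75):
    """Re-rank docs by Okapi BM25 score for the query (higher = more relevant).

    Generic BM25 sketch with standard defaults k1=1.5, b=0.75; this is an
    illustrative baseline, not the re-rankers evaluated in the paper.
    """
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n  # average document length
    q_terms = query.lower().split()
    # Document frequency for each query term.
    df = {t: sum(1 for doc in tokenized if t in doc) for t in q_terms}
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for t in q_terms:
            if tf[t] == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    # Indices sorted by descending BM25 score, plus the raw scores.
    return sorted(range(n), key=lambda i: -scores[i]), scores
```

A lexically matching document dominates under this scoring, which is exactly the behavior the paper contrasts with semantic LM re-ranking:

```python
docs = ["the cat sat on the mat",
        "dogs chase cats in the park",
        "quantum computing uses qubits"]
order, scores = bm25_rerank("cat mat", docs)
# order[0] is 0: the document sharing query terms ranks first.
```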
Feb-24-2025