Language Model Re-rankers are Steered by Lexical Similarities
Lovisa Hagström, Ercong Nie, Ruben Halifa, Helmut Schmid, Richard Johansson, Alexander Junge
arXiv.org Artificial Intelligence
Language model (LM) re-rankers are used to refine retrieval results for retrieval-augmented generation (RAG). They are more expensive than lexical matching methods like BM25 but are assumed to better process semantic information. To understand whether LM re-rankers always live up to this assumption, we evaluate 6 different LM re-rankers on the NQ, LitQA2 and DRUID datasets. Our results show that LM re-rankers struggle to outperform a simple BM25 re-ranker on DRUID. Leveraging a novel separation metric based on BM25 scores, we identify and explain re-ranker errors stemming from lexical dissimilarities. We also investigate different methods to improve LM re-ranker performance and find these methods mainly useful for NQ. Taken together, our work identifies and explains weaknesses of LM re-rankers and points to the need for more adversarial and realistic datasets for their evaluation.
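The BM25 baseline the abstract compares against can be sketched in a few lines. The following is a minimal, self-contained Okapi BM25 re-ranker for illustration only; it is a generic textbook formulation, not the paper's implementation, and the paper's BM25-based separation metric is not reproduced here.

```python
import math
from collections import Counter

def bm25_rerank(query, docs, k1=1.5, b=0.75):
    """Re-rank docs by Okapi BM25 score for the query (higher = more relevant).

    Generic BM25 sketch with standard defaults k1=1.5, b=0.75; this is an
    illustrative baseline, not the re-rankers evaluated in the paper.
    """
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n  # average document length
    q_terms = query.lower().split()
    # Document frequency for each query term.
    df = {t: sum(1 for doc in tokenized if t in doc) for t in q_terms}
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for t in q_terms:
            if tf[t] == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    # Indices sorted by descending BM25 score, plus the raw scores.
    return sorted(range(n), key=lambda i: -scores[i]), scores
```

A lexically matching document dominates under this scoring, which is exactly the behavior the paper contrasts with semantic LM re-ranking:

```python
docs = ["the cat sat on the mat",
        "dogs chase cats in the park",
        "quantum computing uses qubits"]
order, scores = bm25_rerank("cat mat", docs)
# order[0] is 0: the document sharing query terms ranks first.
```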
Feb-24-2025