FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents

Thakur, Nandan, Lin, Jimmy, Havens, Sam, Carbin, Michael, Khattab, Omar, Drozdov, Andrew

Jun-16-2025–arXiv.org Artificial Intelligence

We introduce FreshStack, a holistic framework for automatically building information retrieval (IR) evaluation benchmarks by incorporating challenging questions and answers. FreshStack conducts the following steps: (1) automatic corpus collection from code and technical documentation, (2) nugget generation from community-asked questions and answers, and (3) nugget-level support, retrieving documents using a fusion of retrieval techniques and hybrid architectures. We use FreshStack to build five datasets on fast-growing, recent, and niche topics to ensure the tasks are sufficiently challenging. On FreshStack, existing retrieval models, when applied out-of-the-box, significantly underperform oracle approaches on all five topics, denoting plenty of headroom to improve IR quality. In addition, we identify cases where rerankers do not improve first-stage retrieval accuracy (two out of five topics) and oracle context helps an LLM generator generate a high-quality RAG answer. We hope FreshStack will facilitate future work toward constructing realistic, scalable, and uncontaminated IR and RAG evaluation benchmarks.

computational linguistic, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

Jun-16-2025

arXiv.org PDF

Add feedback

Country:
- Europe (1.00)
- North America > United States
  - Maryland (0.28)

Genre:
- Research Report > New Finding (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Information Retrieval (0.89)
  - Machine Learning > Neural Networks
    - Deep Learning (0.71)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found