Revela: Dense Retriever Learning via Language Modeling
Fengyu Cai, Tong Chen, Xinran Zhao, Sihao Chen, Hongming Zhang, Sherry Tongshuang Wu, Iryna Gurevych, Heinz Koeppl
arXiv.org Artificial Intelligence
Dense retrievers play a vital role in accessing external and specialized knowledge to augment language models (LMs). Training dense retrievers typically requires annotated query-document pairs, which are costly to create and scarce in specialized domains (e.g., code) or in complex settings (e.g., requiring reasoning). These practical challenges have sparked growing interest in self-supervised retriever learning. Since LMs are trained to capture token-level dependencies through a self-supervised learning objective (i.e., next-token prediction), we can analogously cast retrieval as learning dependencies among chunks of tokens. This analogy naturally leads to the question: How can we adapt self-supervised learning objectives in the spirit of language modeling to train retrievers? To answer this question, we introduce Revela, a unified and scalable training framework for self-supervised retriever learning via language modeling. Revela models semantic dependencies among documents by conditioning next-token prediction on both local and cross-document context through an in-batch attention mechanism. This attention is weighted by retriever-computed similarity scores, enabling the retriever to be optimized as part of language modeling. We evaluate Revela on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks across various retriever backbones. Without annotated or synthetic query-document pairs, Revela surpasses larger supervised models and proprietary APIs on CoIR and matches them on BRIGHT. It achieves BEIR's unsupervised SoTA with 1000x less training data and 10x less compute.

Central to information retrieval are dense retrievers (Reimers & Gurevych, 2019; Karpukhin et al., 2020; Ma et al., 2024), which map queries and documents into high-dimensional vector spaces and determine relevance through similarity calculations. Typically, these models rely on carefully annotated query-document pairs and hard negatives for training.
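The idea of weighting attention by retriever-computed similarity scores can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the function names (`cosine`, `in_batch_weights`, `mix_cross_document`), the use of cosine similarity, the softmax temperature, and the exclusion of each chunk from its own context are all assumptions made for illustration. The sketch only shows how similarity scores between chunk embeddings in a batch could become weights over the other chunks' representations.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_weights(embeddings, temperature=1.0):
    """Row i holds softmax weights over the other chunks in the batch.

    Assumption: a chunk never attends to itself, so the diagonal is zero.
    """
    n = len(embeddings)
    weights = []
    for i in range(n):
        # Retriever-style similarity of chunk i to every other chunk.
        scores = {j: cosine(embeddings[i], embeddings[j]) / temperature
                  for j in range(n) if j != i}
        top = max(scores.values())  # subtract the max for numerical stability
        exps = {j: math.exp(s - top) for j, s in scores.items()}
        total = sum(exps.values())
        weights.append([exps[j] / total if j != i else 0.0 for j in range(n)])
    return weights

def mix_cross_document(weights, context_vectors):
    """Similarity-weighted sum of the other chunks' context vectors."""
    dim = len(context_vectors[0])
    return [[sum(w * vec[d] for w, vec in zip(row, context_vectors))
             for d in range(dim)]
            for row in weights]

# Toy batch: chunks 0 and 1 are similar, chunk 2 is unrelated.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
# Hypothetical per-chunk context vectors (stand-ins for LM hidden states).
context_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

weights = in_batch_weights(embeddings, temperature=0.1)
mixed = mix_cross_document(weights, context_vectors)
```

In this toy batch, chunk 0 places most of its weight on chunk 1, its nearest neighbour in embedding space. In a trainable setting the weights would be differentiable with respect to the embeddings, which is what would let a language-modeling loss update the retriever.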
Oct-15-2025
- Country:
  - North America > United States (1.00)
  - Asia (1.00)
  - Europe (0.67)
- Genre:
  - Research Report (1.00)
- Industry:
  - Education (0.46)
- Technology: