Resources and Evaluations for Multi-Distribution Dense Information Retrieval
Soumya Chatterjee, Omar Khattab, Simran Arora
arXiv.org Artificial Intelligence
We introduce and define the novel problem of multi-distribution information retrieval (IR): given a query, a system must retrieve passages from multiple collections, each drawn from a different distribution. Some of these collections and distributions might not be available at training time. To evaluate methods for multi-distribution retrieval, we design three benchmarks from existing single-distribution datasets: one based on question answering and two based on entity matching. We propose simple methods for this task that allocate the fixed retrieval budget (top-k passages) strategically across domains to prevent the known domains from consuming most of the budget. We show that our methods improve Recall@100 by 3.8+ points on average and by up to 8.0 points across the datasets, and that the improvements are consistent when fine-tuning different base retrieval models. Our benchmarks are made publicly available.
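The budget-allocation idea can be illustrated with a small sketch. The abstract does not state the exact allocation rule, so the even per-collection quota below is an assumption for illustration only, and the function name `allocate_budget` and the input layout are hypothetical rather than taken from the paper.

```python
from typing import Dict, List, Tuple

Scored = Tuple[str, float]  # (passage_id, retrieval score)


def allocate_budget(
    ranked: Dict[str, List[Scored]], k: int
) -> List[Tuple[str, str, float]]:
    """Pick k passages while reserving a share of the budget per collection.

    `ranked` maps a collection name to its passages, already sorted by
    descending score. ASSUMPTION: each collection gets an equal quota of
    k // n slots; the paper's actual allocation rule may differ.
    """
    if k <= 0 or not ranked:
        return []

    quota = k // len(ranked)
    picked: List[Tuple[str, str, float]] = []
    leftovers: List[Tuple[str, str, float]] = []

    for name, passages in ranked.items():
        # Reserved slots: the top `quota` passages of every collection.
        picked.extend((name, pid, score) for pid, score in passages[:quota])
        # Everything else competes for whatever budget remains.
        leftovers.extend((name, pid, score) for pid, score in passages[quota:])

    # Spend unused slots on the best-scoring passages not yet picked.
    leftovers.sort(key=lambda item: item[2], reverse=True)
    picked.extend(leftovers[: k - len(picked)])

    # Return the final selection ordered by score.
    picked.sort(key=lambda item: item[2], reverse=True)
    return picked


# Toy scores: the "known" collection outscores the "unknown" one everywhere.
ranked = {
    "known": [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6)],
    "unknown": [("x", 0.5), ("y", 0.4)],
}
selection = allocate_budget(ranked, k=4)
```

With these toy scores, a plain global top-4 would return only passages from the known collection, whereas the quota keeps two slots for the unknown one.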
Jun-21-2023