Grounding by Trying: LLMs with Reinforcement Learning-Enhanced Retrieval
Sheryl Hsu, Omar Khattab, Chelsea Finn, Archit Sharma
arXiv.org Artificial Intelligence
The hallucinations of large language models (LLMs) are increasingly mitigated by allowing LLMs to search for information and ground their answers in real sources. Observing that LLMs can learn to search for relevant facts by trying different queries and learning to up-weight queries that successfully produce relevant results, we introduce Learning to Retrieve by Trying (LeReT), a reinforcement learning framework that explores search queries and uses preference-based optimization to improve their quality. LeReT improves absolute retrieval accuracy by up to 29% and downstream generator evaluations by 17%. The simplicity and flexibility of LeReT allow it to be applied to arbitrary off-the-shelf retrievers and make it a promising technique for improving general LLM pipelines.

Despite tremendous progress, large language models (LLMs) still often hallucinate, motivating significant interest in grounding LLM answers in verified sources (Guu et al., 2020; Komeili et al., 2022; PerplexityAI, 2024; Google, 2024; OpenAI, 2024). Unfortunately, simply retrieving documents semantically similar to the user question, as is prevalent in retrieval-augmented generation (RAG; Lewis et al., 2020) pipelines, tends to fail for complex information needs that no individual document answers directly. To tackle this, multi-hop retrieval pipelines gather information incrementally over multiple steps of search. For example, if a user asks "What is a good dinner place driving from the Bay Area to Lake Tahoe on Friday night to avoid traffic?", a grounded system might need to learn about towns en route from the Bay Area to Lake Tahoe, followed by the traffic forecast on I-80, and finally restaurants in Auburn (and other towns). However, doing this successfully is hard: off-the-shelf LLM performance is often unsatisfactory, and producing supervision for the best search queries to generate in a sequence of "hops" is nontrivial and expensive.
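The multi-hop procedure described above can be sketched as a simple loop in which each hop's query is generated conditioned on the question plus the documents gathered so far. This is a minimal illustration, not LeReT's actual interface; `generate_query` and `retrieve` are hypothetical placeholders (in practice an LLM call and an off-the-shelf retriever, respectively):

```python
def multi_hop_retrieve(question, generate_query, retrieve, num_hops=3):
    """Gather evidence over several hops of search.

    generate_query(question, context) -> str   # an LLM call in practice
    retrieve(query) -> list of documents       # any off-the-shelf retriever
    """
    context = []  # documents accumulated across hops
    for _ in range(num_hops):
        # Each new query can build on what earlier hops retrieved,
        # e.g. route towns -> traffic forecast -> restaurants.
        query = generate_query(question, context)
        context.extend(retrieve(query))
    return context
```

The key difficulty the paper points at is that only the final answer is easily supervised, so the intermediate queries this loop emits have no ground-truth labels.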
Recent work tackles this via prompt optimization and rejection fine-tuning given a downstream signal.
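The trying-and-up-weighting idea can be illustrated with a toy sketch: sample several candidate queries, score each by the recall of what it retrieves against gold documents, and keep (preferred, rejected) pairs as training data for a preference optimizer such as DPO. Everything here — the word-overlap retriever, the corpus, and the function names — is an illustrative assumption, not the paper's implementation:

```python
# Toy corpus: document id -> set of facts it contains (assumed for illustration).
CORPUS = {
    "doc_route": {"towns en route to Lake Tahoe"},
    "doc_traffic": {"I-80 traffic forecast"},
    "doc_food": {"restaurants in Auburn"},
}

def retrieve(query, k=1):
    """Toy retriever: rank documents by word overlap with the query."""
    words = set(query.lower().split())
    def score(doc):
        return max(len(words & set(f.lower().split())) for f in CORPUS[doc])
    return sorted(CORPUS, key=score, reverse=True)[:k]

def recall(retrieved, gold):
    """Fraction of gold documents that the query retrieved."""
    return len(set(retrieved) & gold) / len(gold)

def build_preference_pairs(candidates, gold, k=1):
    """Score each sampled query by retrieval recall, then emit
    (preferred, rejected) pairs wherever one query outscores another."""
    scored = [(q, recall(retrieve(q, k), gold)) for q in candidates]
    return [(qa, qb) for qa, ra in scored for qb, rb in scored if ra > rb]

pairs = build_preference_pairs(
    ["I-80 traffic forecast Friday", "dinner ideas"],
    gold={"doc_traffic"},
)
# Each pair would feed a preference optimizer (e.g. DPO) so the query
# generator up-weights queries that retrieve the relevant documents.
```

The retrieval score acts as the reward signal, sidestepping the need for hand-labeled "correct" queries at each hop.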
Oct-30-2024