ALR$^2$: A Retrieve-then-Reason Framework for Long-context Question Answering
Huayang Li, Pat Verga, Priyanka Sen, Bowen Yang, Vijay Viswanathan, Patrick Lewis, Taro Watanabe, Yixuan Su
arXiv.org Artificial Intelligence
The context window of large language models (LLMs) has been extended significantly in recent years. However, while the context length an LLM can process has grown, its ability to reason accurately over that context degrades noticeably. This occurs because modern LLMs are often overwhelmed by the vast amount of information in the context: to answer a question, the model must identify and reason over relevant evidence sparsely distributed throughout the text. To alleviate this challenge, we develop a retrieve-then-reason framework that enables LLMs to reason over relevant evidence collected during an intermediate retrieval step. We find that modern LLMs struggle to retrieve relevant facts accurately and instead often hallucinate "retrieved facts," leading to flawed reasoning and incorrect answers. In extensive experiments on long-context QA benchmarks, our method outperforms competitive baselines by large margins, achieving gains of at least 8.4 and 7.9 EM points on the long-context versions of HotpotQA and SQuAD, respectively. In a preliminary study, we also show that the long-context performance of LLMs varies significantly across tasks: when asked to generate answers by reasoning directly over the full context, performance degrades as the input grows, whereas when asked only to retrieve the set of evidence relevant to the question, performance is only mildly affected by the growth of the input context.
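The retrieve-then-reason idea described above can be sketched as a two-stage prompting pipeline: one call asks the model to quote relevant evidence from the long context, and a second call answers the question using only that evidence. This is a minimal illustrative sketch, not the paper's actual implementation; the prompt wording, the `retrieve_then_reason` function, and the `toy_llm` stand-in are all hypothetical.

```python
# Hedged sketch of a retrieve-then-reason pipeline. `llm` is any callable
# mapping a prompt string to a completion string (names are illustrative).

def retrieve_then_reason(question, context, llm):
    # Stage 1: retrieval — ask the model to quote relevant evidence.
    retrieve_prompt = (
        "Quote every sentence from the context that is relevant to the "
        f"question.\n\nContext:\n{context}\n\nQuestion: {question}\n"
    )
    evidence = llm(retrieve_prompt)
    # Stage 2: reasoning — answer using only the retrieved evidence,
    # so the model is not overwhelmed by the full long context.
    reason_prompt = (
        "Answer the question using only the evidence below.\n\n"
        f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer:"
    )
    return llm(reason_prompt), evidence

# Toy stand-in LLM for demonstration only: "retrieves" lines containing the
# question's final keyword, then "answers" with the last word of the evidence.
def toy_llm(prompt):
    if prompt.startswith("Quote"):
        ctx = prompt.split("Context:\n")[1].split("\n\nQuestion:")[0]
        key = prompt.rsplit("Question: ", 1)[1].strip().rstrip("?").split()[-1]
        return "\n".join(s for s in ctx.split("\n") if key in s)
    evidence = prompt.split("Evidence:\n")[1].split("\n\nQuestion:")[0]
    return evidence.strip().split()[-1].rstrip(".")

context = (
    "Paris hosts the Louvre.\n"
    "The capital of France is Paris.\n"
    "Berlin is in Germany."
)
answer, evidence = retrieve_then_reason(
    "What is the capital of France?", context, toy_llm
)
# answer is "Paris"; evidence contains only the one relevant sentence.
```

In practice, `llm` would wrap an actual model API call, and the retrieval stage is where the paper observes hallucinated "retrieved facts" that then corrupt the reasoning stage.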
Oct-4-2024