BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack
Neural Information Processing Systems
In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess how effectively models handle long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in extremely long documents. BABILong comprises a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. These tasks are challenging on their own, and even more demanding when the required facts are scattered across long natural text.
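The core construction the abstract describes, scattering task-relevant facts through long distractor text, can be sketched as follows. This is a minimal illustration of the idea, not the official BABILong generator; the function name and parameters are hypothetical.

```python
import random

def make_long_context_sample(facts, question, background_sentences, target_chars):
    """Build a BABILong-style sample (sketch): pad with distractor
    sentences up to roughly target_chars, then insert each task fact
    at a random position so the model must locate them to answer."""
    filler = []
    # Draw background sentences until the context reaches the target size.
    while sum(len(s) for s in filler) < target_chars:
        filler.append(random.choice(background_sentences))
    # Scatter the facts at random positions within the filler text.
    for fact in facts:
        filler.insert(random.randrange(len(filler) + 1), fact)
    return " ".join(filler), question

facts = ["Mary went to the kitchen.", "Mary picked up the apple."]
background = [
    "The weather was mild that year.",
    "Trade routes expanded slowly across the region.",
]
context, q = make_long_context_sample(facts, "Where is the apple?", background, 200)
```

Scaling `target_chars` up (to millions of characters) is what turns an ordinary bAbI-style task into a long-context stress test: the reasoning stays fixed while the retrieval difficulty grows.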
May-27-2025, 15:13:24 GMT