BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack
Neural Information Processing Systems
In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess how effectively models handle long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in extremely long documents. BABILong comprises a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. These tasks are challenging on their own, and even more demanding when the required facts are scattered across long natural text.
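The core construction the abstract describes, scattering task-relevant facts through long distractor text, can be sketched as follows. This is a minimal illustration of the idea, not the official BABILong generator; the function name and parameters are hypothetical.

```python
import random

def make_long_context_sample(facts, question, background_sentences, target_chars):
    """Build a BABILong-style sample (sketch): pad with distractor
    sentences up to roughly target_chars, then insert each task fact
    at a random position so the model must locate them to answer."""
    filler = []
    # Draw background sentences until the context reaches the target size.
    while sum(len(s) for s in filler) < target_chars:
        filler.append(random.choice(background_sentences))
    # Scatter the facts at random positions within the filler text.
    for fact in facts:
        filler.insert(random.randrange(len(filler) + 1), fact)
    return " ".join(filler), question

facts = ["Mary went to the kitchen.", "Mary picked up the apple."]
background = [
    "The weather was mild that year.",
    "Trade routes expanded slowly across the region.",
]
context, q = make_long_context_sample(facts, "Where is the apple?", background, 200)
```

Scaling `target_chars` up (to millions of characters) is what turns an ordinary bAbI-style task into a long-context stress test: the reasoning stays fixed while the retrieval difficulty grows.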
May-27-2025, 15:13:24 GMT