Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
Tavakoli, Mohammad, Salemi, Alireza, Ye, Carrie, Abdalla, Mohamed, Zamani, Hamed, Mitchell, J Ross
arXiv.org Artificial Intelligence
Evaluating large language models (LLMs) on tasks that require long-term memory, and thus long-context reasoning, for example in conversational settings, is hampered by existing benchmarks, which often lack narrative coherence, cover narrow domains, and test only simple recall-oriented tasks. This paper introduces a comprehensive solution to these challenges. First, we present a novel framework for automatically generating long (up to 10M tokens), coherent, and topically diverse conversations, accompanied by probing questions targeting a wide range of memory abilities. From this, we construct BEAM, a new benchmark comprising 100 conversations and 2,000 validated questions. Second, to enhance model performance, we propose LIGHT, a framework inspired by human cognition that equips LLMs with three complementary memory systems: a long-term episodic memory, a short-term working memory, and a scratchpad for accumulating salient facts. Our experiments on BEAM reveal that even LLMs with 1M-token context windows (with and without retrieval augmentation) struggle as dialogues lengthen. In contrast, LIGHT consistently improves performance across various models, achieving an average improvement of 3.5% to 12.69% over the strongest baselines, depending on the backbone LLM. An ablation study further confirms the contribution of each memory component.
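To make the three-memory idea concrete, the following is a minimal sketch of how an episodic store, a working-memory window, and a scratchpad could sit side by side. It is not the LIGHT implementation: the abstract does not say how retrieval or salience detection work, so the class name `ThreeTierMemory`, the word-overlap retriever, the `is_salient` rule, and the window size are all hypothetical stand-ins chosen for illustration.

```python
import re
from collections import deque


def _tokens(text):
    """Lowercased word set; a toy stand-in for a real retriever's representation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class ThreeTierMemory:
    """Illustrative only: three memory stores combined into one prompt context."""

    def __init__(self, working_size=2, is_salient=None):
        self.episodic = []                          # long-term: every turn seen so far
        self.working = deque(maxlen=working_size)   # short-term: most recent turns only
        self.scratchpad = []                        # accumulated salient facts
        # Hypothetical salience rule: keep turns that mention "my ...".
        self.is_salient = is_salient or (lambda turn: "my " in turn.lower())

    def add_turn(self, turn):
        self.episodic.append(turn)
        self.working.append(turn)   # the oldest turn drops out once the window is full
        if self.is_salient(turn):
            self.scratchpad.append(turn)

    def retrieve(self, question, k=1):
        """Return the k episodic turns sharing the most words with the question."""
        q = _tokens(question)
        ranked = sorted(self.episodic, key=lambda t: len(q & _tokens(t)), reverse=True)
        return ranked[:k]

    def build_context(self, question, k=1):
        """Assemble a prompt from all three stores, followed by the question."""
        parts = [
            "Retrieved: " + " | ".join(self.retrieve(question, k)),
            "Recent: " + " | ".join(self.working),
            "Notes: " + " | ".join(self.scratchpad),
            "Question: " + question,
        ]
        return "\n".join(parts)
```

In this sketch, a fact stated early in a conversation leaves the working window quickly but stays reachable through episodic retrieval and, if flagged as salient, through the scratchpad. That redundancy is the intuition behind combining complementary stores.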
Nov-3-2025
- Country:
  - Africa > Saint Helena, Ascension and Tristan da Cunha (0.04)
  - Asia
    - China > Beijing
      - Beijing (0.04)
    - Japan (0.04)
    - Middle East
      - Bahrain (0.04)
      - Jordan (0.04)
      - Republic of Türkiye (0.04)
    - Southeast Asia (0.04)
  - North America
    - Canada > Alberta (0.14)
    - United States > Massachusetts
      - Hampshire County > Amherst (0.04)
  - South America > Chile
- Genre:
  - Overview (1.00)
  - Research Report (1.00)
- Industry:
  - Banking & Finance > Real Estate (1.00)
  - Education (0.92)
  - Health & Medicine
    - Consumer Health (0.66)
    - Therapeutic Area > Psychiatry/Psychology
      - Mental Health (0.67)
  - Information Technology > Security & Privacy (0.67)
  - Leisure & Entertainment (1.00)
  - Media (0.92)
- Technology: