Assessing Episodic Memory in LLMs with Sequence Order Recall Tasks
Pink, Mathis, Vo, Vy A., Wu, Qinyuan, Mu, Jianing, Turek, Javier S., Hasson, Uri, Norman, Kenneth A., Michelmann, Sebastian, Huth, Alexander, Toneva, Mariya
arXiv.org Artificial Intelligence
Current LLM benchmarks focus on evaluating models' memory of facts and semantic relations, primarily assessing semantic aspects of long-term memory. However, in humans, long-term memory also includes episodic memory, which links memories to their contexts, such as the time and place they occurred. The ability to contextualize memories is crucial for many cognitive tasks and everyday functions. This form of memory has not been evaluated in LLMs with existing benchmarks. To address this gap in evaluating memory in LLMs, we introduce Sequence Order Recall Tasks (SORT), which we adapt from tasks used to study episodic memory in cognitive psychology. SORT requires LLMs to recall the correct order of text segments, and provides a general framework that is easily extendable and requires no additional annotations. We present an initial evaluation dataset, Book-SORT, comprising 36k pairs of segments extracted from 9 books recently added to the public domain. Based on a human experiment with 155 participants, we show that humans can recall sequence order based on long-term memory of a book. We find that models can perform the task with high accuracy when relevant text is given in-context during the SORT evaluation. However, when presented with the book text only during training, LLMs' performance on SORT falls short. By making it possible to evaluate more aspects of memory, we believe that SORT will aid in the emerging development of memory-augmented models.
Large language models (LLMs) have impressive performance on many benchmarks that test factual or semantic knowledge learned during training or in-context (Hendrycks et al., 2020; Ryo et al., 2023; Logan IV et al., 2019; Petroni et al., 2019; Yu et al., 2023; Sun et al., 2023).
While these advances are noteworthy, the type of long-term knowledge that these datasets test is only one of several types that naturally intelligent systems store, retrieve, and update continuously over time (Norris, 2017; Izquierdo et al., 1999; McClelland et al., 1995). Current evaluation tasks do not assess episodic memory, which is a form of long-term knowledge thought to be important for cognitive function in humans and animals. In contrast to semantic memory, episodic memory links memories to their contexts, such as the time and place they occurred.
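To make the task format concrete, the following is a minimal sketch of how a SORT-style sample could be built and scored. It is not the authors' implementation: the names `SortSample`, `make_sort_sample`, and `accuracy`, the word-based segmentation, the uniform sampling of segment positions, and the "A"/"B" answer labels are all assumptions made for illustration. The only property taken from the description above is that two segments from the same text are presented in a random order and the model must say which one occurred first.

```python
import random
from dataclasses import dataclass


@dataclass
class SortSample:
    first_shown: str   # segment presented first in the prompt
    second_shown: str  # segment presented second in the prompt
    answer: str        # "A" if first_shown occurs earlier in the source text, else "B"


def make_sort_sample(words, segment_len, rng):
    """Draw two non-overlapping segments and randomize their presentation order.

    Hypothetical helper: segment length is counted in whitespace-separated
    words, which is an assumption of this sketch.
    """
    if len(words) < 2 * segment_len:
        raise ValueError("text too short for two non-overlapping segments")
    # Pick the earlier segment so that room remains for the later one.
    start_a = rng.randrange(0, len(words) - 2 * segment_len + 1)
    start_b = rng.randrange(start_a + segment_len, len(words) - segment_len + 1)
    earlier = " ".join(words[start_a:start_a + segment_len])
    later = " ".join(words[start_b:start_b + segment_len])
    # Shuffle presentation order so position in the prompt carries no signal.
    if rng.random() < 0.5:
        return SortSample(earlier, later, "A")
    return SortSample(later, earlier, "B")


def accuracy(predictions, samples):
    """Fraction of samples where the predicted label matches the true order."""
    correct = sum(p == s.answer for p, s in zip(predictions, samples))
    return correct / len(samples)
```

Because the labels come directly from positions in the source text, a sample generator of this kind needs no human annotation, which matches the claim that the framework is easy to extend to new texts. Chance accuracy on such a two-alternative task is 0.5.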
Oct-10-2024
- Country:
  - North America > United States
    - Montana (0.04)
    - Washington > King County
      - Seattle (0.04)
    - Texas > Travis County
      - Austin (0.14)
    - Oregon > Washington County
      - Hillsboro (0.04)
    - New York
      - Richmond County > New York City (0.04)
      - Queens County > New York City (0.04)
      - New York County > New York City (0.04)
      - Kings County > New York City (0.04)
      - Bronx County > New York City (0.04)
    - New Jersey > Mercer County
      - Princeton (0.04)
  - Europe
    - United Kingdom > England (0.04)
    - Poland > Lesser Poland Province
      - Kraków (0.04)
    - Germany > Saarland
      - Saarbrücken (0.04)
  - Asia
    - Singapore (0.04)
    - Middle East > Jordan (0.04)
    - Indonesia > Bali (0.04)
- Genre:
  - Overview (1.00)
  - Research Report
    - New Finding (1.00)
    - Experimental Study (1.00)
- Industry:
  - Health & Medicine
    - Therapeutic Area > Neurology (1.00)
    - Consumer Health (1.00)
- Technology: