TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models
Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu, Haotian Wang, Ming Liu, Bing Qin
arXiv.org Artificial Intelligence
Understanding time is a pivotal aspect of human cognition, crucial to grasping the intricacies of the world. Previous studies typically focus on specific aspects of time and lack a comprehensive temporal reasoning benchmark. To address this issue, we propose TimeBench, a comprehensive hierarchical temporal reasoning benchmark that covers a broad spectrum of temporal reasoning phenomena and provides a thorough evaluation of the temporal reasoning capabilities of large language models. We conduct extensive experiments on popular LLMs, such as GPT-4, LLaMA2, and Mistral, incorporating chain-of-thought prompting. Our experimental results indicate a significant performance gap between state-of-the-art LLMs and humans, highlighting that there is still considerable distance to cover in temporal reasoning. We aspire for TimeBench to serve as a comprehensive benchmark, fostering research in temporal reasoning for LLMs. Our resource is available at https://github.com/zchuz/TimeBench
Nov-29-2023