Harnessing Temporal Databases for Systematic Evaluation of Factual Time-Sensitive Question-Answering in Large Language Models
Kim, Soyeon, Wang, Jindong, Xie, Xing, Whang, Steven Euijong
arXiv.org Artificial Intelligence
Facts evolve over time, making it essential for Large Language Models (LLMs) to handle time-sensitive factual knowledge accurately and reliably. While factual Time-Sensitive Question-Answering (TSQA) tasks have been widely studied, existing benchmarks often rely on manual curation or a small, fixed set of predefined templates, which restricts scalable and comprehensive TSQA evaluation. To address these challenges, we propose TDBench, a new benchmark that systematically constructs TSQA pairs by harnessing temporal databases and database techniques such as temporal SQL and functional dependencies. We also introduce a fine-grained evaluation metric called time accuracy, which assesses the validity of time references in model explanations alongside traditional answer accuracy to enable a more reliable TSQA evaluation. Extensive experiments on contemporary LLMs show how TDBench enables scalable and comprehensive TSQA evaluation while reducing the reliance on human labor, complementing existing Wikipedia/Wikidata-based TSQA evaluation approaches by enabling LLM evaluation on application-specific data and seamless multi-hop question generation. Code and data are publicly available at: https://github.com/ssoy0701/tdbench.git.
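As a rough illustration of the idea of constructing TSQA pairs from a temporal database, the sketch below runs a point-in-time SQL query over a table with validity intervals and turns each result row into a question-answer pair. It is not taken from the TDBench code base: the `leader` table, its columns, the question template, the `build_qa_pairs` helper, and the placeholder rows are all hypothetical. The sketch assumes half-open validity intervals and a functional dependency (country, role, time) -> name, so that each question has exactly one answer.

```python
import sqlite3


def build_qa_pairs(rows, query_date):
    """Return (question, answer) pairs for the rows valid on query_date.

    Hypothetical helper: assumes ISO-8601 date strings, which compare
    correctly as plain text, and half-open intervals [valid_from, valid_to).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE leader (country TEXT, role TEXT, name TEXT, "
        "valid_from TEXT, valid_to TEXT)"
    )
    conn.executemany("INSERT INTO leader VALUES (?, ?, ?, ?, ?)", rows)

    # Point-in-time selection: keep rows whose interval contains query_date.
    cur = conn.execute(
        "SELECT country, role, name FROM leader "
        "WHERE valid_from <= ? AND ? < valid_to "
        "ORDER BY country, role",
        (query_date, query_date),
    )
    # Assumed dependency (country, role, time) -> name gives one answer each.
    pairs = [
        (f"Who was the {role} of {country} on {query_date}?", name)
        for country, role, name in cur
    ]
    conn.close()
    return pairs


# Placeholder data, not real facts.
ROWS = [
    ("Country A", "president", "Person X", "2010-01-01", "2015-01-01"),
    ("Country A", "president", "Person Y", "2015-01-01", "9999-12-31"),
    ("Country B", "president", "Person Z", "2012-06-01", "9999-12-31"),
]

pairs = build_qa_pairs(ROWS, "2014-03-01")
for question, answer in pairs:
    print(question, "->", answer)
```

Because the validity interval is stored with each row, the same table can be queried at any date to produce new questions, and the stored interval gives a reference against which time mentions in a model's explanation could be checked.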
Aug-5-2025
- Country:
- Europe (1.00)
- Asia (1.00)
- North America > United States (0.93)
- Genre:
- Research Report > New Finding (0.67)
- Industry:
- Government > Regional Government (1.00)
- Media > Film (0.68)
- Leisure & Entertainment > Sports > Olympic Games (1.00)
- Technology: