DateLogicQA: Benchmarking Temporal Biases in Large Language Models
Bhatia, Gagan; Tang, MingZe; Mahanta, Cristina; Kazi, Madiha
arXiv.org Artificial Intelligence
This paper introduces DateLogicQA, a benchmark with 190 questions covering diverse date formats, temporal contexts, and reasoning types. We propose the Semantic Integrity Metric to assess tokenization quality and analyse two biases: Representation-Level Bias, affecting embeddings, and Logical-Level Bias, influencing reasoning outputs. Our findings provide a comprehensive evaluation of LLMs' capabilities and limitations in temporal reasoning, highlighting key challenges in handling temporal data accurately.
Dec-17-2024
- Country:
- Asia > British Indian Ocean Territory > Diego Garcia (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > Middle East > Saudi Arabia > Asir Province > Abha (0.04)
- Europe > Monaco (0.04)
- North America > United States > Virginia (0.04)
- Genre:
- Research Report (1.00)
- Technology: