temporal reasoning
A Benchmark Suite for Reasoning-Across-Time in Videos
Chen, Jr-Jen, Liao, Yu-Chien
This form of reasoning, which requires advanced understanding of cause-and-effect relationships across video segments, poses significant challenges even to frontier multimodal large language models. To facilitate this evaluation, we develop an automated pipeline for generating temporal reasoning question-answer pairs, significantly reducing the need for labor-intensive manual annotations. Our benchmark includes 921 carefully vetted validation samples and 2,143 test samples, each manually curated for accuracy and relevance. Evaluation results show that while frontier large language models outperform academic models, they still lag behind human performance by a 14.3% accuracy gap. Additionally, our pipeline creates a training dataset of 9,695 machine-generated samples without manual effort, which empirical studies suggest can enhance across-time reasoning via fine-tuning.
Large Language Models-guided Dynamic Adaptation for Temporal Knowledge Graph Reasoning
Temporal Knowledge Graph Reasoning (TKGR) is the process of utilizing temporal information to capture complex relations within a Temporal Knowledge Graph (TKG) to infer new knowledge. Conventional TKGR methods typically depend on deep learning algorithms or temporal logical rules. However, deep learning-based methods often lack interpretability, whereas rule-based methods struggle to effectively learn temporal rules that capture temporal patterns. Recently, Large Language Models (LLMs) have demonstrated extensive knowledge and remarkable proficiency in temporal reasoning. Consequently, the use of LLMs for TKGR has sparked increasing interest among researchers. Nonetheless, LLMs are known to function as black boxes, making it challenging to comprehend their reasoning process.
- Information Technology > Artificial Intelligence > Representation & Reasoning > Temporal Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Distilling Future Temporal Knowledge with Masked Feature Reconstruction for 3D Object Detection
Zheng, Haowen, Zhu, Hu, Deng, Lu, Gu, Weihao, Yang, Yang, Liang, Yanyan
Camera-based temporal 3D object detection has shown impressive results in autonomous driving, with offline models improving accuracy by using future frames. Knowledge distillation (KD) can be an appealing framework for transferring rich information from offline models to online models. However, existing KD methods overlook future frames, as they mainly focus on spatial feature distillation under strict frame alignment or on temporal relational distillation, thereby making it challenging for online models to effectively learn future knowledge. To this end, we propose a sparse query-based approach, Future Temporal Knowledge Distillation (FTKD), which effectively transfers future frame knowledge from an offline teacher model to an online student model. Specifically, we present a future-aware feature reconstruction strategy to encourage the student model to capture future features without strict frame alignment. In addition, we further introduce future-guided logit distillation to leverage the teacher's stable foreground and background context. FTKD is applied to two high-performing 3D object detection baselines, achieving up to 1.3 mAP and 1.3 NDS gains on the nuScenes dataset, as well as the most accurate velocity estimation, without increasing inference cost.
- Education (1.00)
- Transportation > Ground > Road (0.34)
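The FTKD abstract above mentions logit distillation from a teacher to a student. As a rough point of reference, the generic temperature-scaled distillation loss can be sketched as follows. This is the standard soft-target formulation, not the future-guided variant the paper proposes; the function names `softmax` and `kd_loss` and the default temperature are assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened class distributions.

    The T^2 factor is the usual rescaling that keeps gradient magnitudes
    comparable across temperatures. Hypothetical helper, not FTKD's loss.
    """
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


if __name__ == "__main__":
    teacher = [1.0, 2.0, 3.0]
    print(kd_loss(teacher, teacher))          # identical logits give zero loss
    print(kd_loss([3.0, 2.0, 1.0], teacher))  # mismatched logits give a positive loss
```

A detection model would apply a loss of this kind per query or per anchor; how FTKD selects and weights those terms is not specified in the abstract.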
NeSTR: A Neuro-Symbolic Abductive Framework for Temporal Reasoning in Large Language Models
Liang, Feng, Zeng, Weixin, Zhao, Runhao, Zhao, Xiang
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, temporal reasoning, particularly under complex temporal constraints, remains a major challenge. To this end, existing approaches have explored symbolic methods, which encode temporal structure explicitly, and reflective mechanisms, which revise reasoning errors through multi-step inference. Nonetheless, symbolic approaches often underutilize the reasoning capabilities of LLMs, while reflective methods typically lack structured temporal representations, which can result in inconsistent or hallucinated reasoning. As a result, even when the correct temporal context is available, LLMs may still misinterpret or misapply time-related information, leading to incomplete or inaccurate answers. To address these limitations, in this work, we propose Neuro-Symbolic Temporal Reasoning (NeSTR), a novel framework that integrates structured symbolic representations with hybrid reflective reasoning to enhance the temporal sensitivity of LLM inference. NeSTR preserves explicit temporal relations through symbolic encoding, enforces logical consistency via verification, and corrects flawed inferences using abductive reflection. Extensive experiments on diverse temporal question answering benchmarks demonstrate that NeSTR achieves superior zero-shot performance and consistently improves temporal reasoning without any fine-tuning, showcasing the advantage of neuro-symbolic integration in enhancing temporal understanding in large language models.
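To make the idea of symbolic encoding plus verification concrete, the sketch below stores temporal facts as explicit year intervals and checks a candidate answer against them. It is only an illustration of the general pattern described in the abstract, not NeSTR's implementation; the `Fact` record, the `verify` helper, and the anonymized entity names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """A temporal fact with an explicit validity interval (hypothetical schema)."""
    subject: str
    relation: str
    obj: str
    start: int  # first year the fact holds (inclusive)
    end: int    # last year the fact holds (inclusive)


def holds_at(fact: Fact, year: int) -> bool:
    """True if the fact's interval covers the given year."""
    return fact.start <= year <= fact.end


def verify(answer: str, facts: list[Fact], subject: str, relation: str, year: int) -> bool:
    """Accept a candidate answer only if some encoded fact supports it at `year`."""
    return any(
        f.subject == subject
        and f.relation == relation
        and f.obj == answer
        and holds_at(f, year)
        for f in facts
    )


facts = [
    Fact("person_1", "works_for", "org_A", 2001, 2005),
    Fact("person_1", "works_for", "org_B", 2006, 2010),
]

if __name__ == "__main__":
    # A model answering "org_A" for 2007 would be rejected and sent back for revision.
    print(verify("org_A", facts, "person_1", "works_for", 2007))
    print(verify("org_B", facts, "person_1", "works_for", 2007))
```

In a reflective loop, a rejected answer would trigger another reasoning pass; the abstract does not describe how NeSTR structures that step, so it is left out here.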
Evaluating Long-Term Memory for Long-Context Question Answering
Terranova, Alessandra, Ross, Björn, Birch, Alexandra
In order for large language models to achieve true conversational continuity and benefit from experiential learning, they need memory. While research has focused on the development of complex memory systems, it remains unclear which types of memory are most effective for long-context conversational tasks. We present a systematic evaluation of memory-augmented methods on long-context dialogues annotated for question-answering tasks that require diverse reasoning strategies. We analyse full-context prompting, semantic memory through retrieval-augmented generation and agentic memory, episodic memory through in-context learning, and procedural memory through prompt optimization. Our findings show that memory-augmented approaches reduce token usage by over 90% while maintaining competitive accuracy. Memory architecture complexity should scale with model capability, with foundation models benefitting most from RAG, and stronger instruction-tuned models gaining from episodic learning through reflections and more complex agentic semantic memory. In particular, episodic memory can help LLMs recognise the limits of their own knowledge.
On the Temporal Question-Answering Capabilities of Large Language Models Over Anonymized Data
Ruiz, Alfredo Garrachón, de la Rosa, Tomás, Borrajo, Daniel
The applicability of Large Language Models (LLMs) to temporal reasoning tasks over data not seen during training remains largely unexplored. In this paper we address this topic, focusing on structured and semi-structured anonymized data. We not only develop a direct LLM pipeline, but also compare various methodologies and conduct an in-depth analysis. We identified and examined seventeen common temporal reasoning tasks in natural language, focusing on their algorithmic components. To assess LLM performance, we created the Reasoning and Answering Temporal Ability (RATA) dataset, featuring semi-structured anonymized data to ensure reliance on reasoning rather than on prior knowledge. We compared several methodologies, involving SoTA techniques such as Tree-of-Thought, self-reflection and code execution, tuned specifically for this scenario. Our results suggest that achieving scalable and reliable solutions requires more than just standalone LLMs, highlighting the need for integrated approaches.
- Europe (0.93)
- North America > United States (0.68)
- Banking & Finance (1.00)
- Leisure & Entertainment > Sports (0.68)
- Government > Regional Government (0.68)
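The RATA abstract above lists code execution among the compared techniques. As a minimal sketch of what that can look like, the example below answers two temporal questions over anonymized records by running date arithmetic instead of relying on recalled facts. The record layout, the entity and event names, and both helper functions are assumptions for illustration and are not taken from the RATA dataset.

```python
from datetime import date

# Hypothetical anonymized, semi-structured records.
events = [
    {"entity": "entity_17", "event": "event_A", "date": date(2020, 3, 1)},
    {"entity": "entity_17", "event": "event_B", "date": date(2019, 11, 15)},
    {"entity": "entity_42", "event": "event_C", "date": date(2018, 6, 30)},
]


def first_event(records, entity):
    """Return the name of the earliest event for `entity`, or None if it has none."""
    own = [r for r in records if r["entity"] == entity]
    if not own:
        return None
    return min(own, key=lambda r: r["date"])["event"]


def days_between(records, event_x, event_y):
    """Return the absolute number of days separating two named events."""
    by_name = {r["event"]: r["date"] for r in records}
    return abs((by_name[event_y] - by_name[event_x]).days)


if __name__ == "__main__":
    print(first_event(events, "entity_17"))           # event_B
    print(days_between(events, "event_A", "event_B"))  # 107
```

Because the names carry no real-world meaning, a correct answer here can only come from the computation, which is the property the anonymization is meant to enforce.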
StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos
Lee, Daeun, Mukherjee, Subhojyoti, Kveton, Branislav, Rossi, Ryan A., Lai, Viet Dac, Yoon, Seunghyun, Bui, Trung, Dernoncourt, Franck, Bansal, Mohit
Streaming video understanding requires models not only to process temporally incoming frames, but also to anticipate user intention for realistic applications like AR glasses. While prior streaming benchmarks evaluate temporal reasoning, none measure whether MLLMs can interpret or leverage human gaze signals within a streaming setting. To fill this gap, we introduce StreamGaze, the first benchmark designed to evaluate how effectively MLLMs use gaze for temporal and proactive reasoning in streaming videos. StreamGaze introduces gaze-guided past, present, and proactive tasks that comprehensively evaluate streaming video understanding. These tasks assess whether models can use real-time gaze to follow shifting attention and infer user intentions from only past and currently observed frames. To build StreamGaze, we develop a gaze-video QA generation pipeline that aligns egocentric videos with raw gaze trajectories via fixation extraction, region-specific visual prompting, and scanpath construction. This pipeline produces spatio-temporally grounded QA pairs that closely reflect human perceptual dynamics. Across all StreamGaze tasks, we observe substantial performance gaps between state-of-the-art MLLMs and human performance, revealing fundamental limitations in gaze-based temporal reasoning, intention modeling, and proactive prediction. We further provide detailed analyses of gaze-prompting strategies, reasoning behaviors, and task-specific failure modes, offering deeper insight into why current MLLMs struggle and what capabilities future models must develop. All data and code will be publicly released to support continued research in gaze-guided streaming video understanding.
STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence
Liu, Zihan, Niu, Zhikang, Xiao, Qiuyang, Zheng, Zhisheng, Yuan, Ruoqi, Zang, Yuhang, Cao, Yuhang, Dong, Xiaoyi, Liang, Jianze, Chen, Xie, Sun, Leilei, Lin, Dahua, Wang, Jiaqi
Despite rapid progress in Multi-modal Large Language Models and Large Audio-Language Models, existing audio benchmarks largely test semantics that can be recovered from text captions, masking deficits in fine-grained perceptual reasoning. We formalize audio 4D intelligence that is defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting that includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. Our data curation pipeline uses two methods to ensure high-quality samples. For foundational tasks, we use procedurally synthesized and physics-simulated audio. For holistic data, we follow a four-stage process that includes human annotation and final selection based on human performance. Unlike prior benchmarks where caption-only answering reduces accuracy slightly, STAR-Bench induces far larger drops (-31.5% temporal, -35.2% spatial), evidencing its focus on linguistically hard-to-describe cues. Evaluating 19 models reveals substantial gaps compared with humans and a capability hierarchy: closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning. Our STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.
As a fundamental modality of human perception, audio serves a pivotal role in communication, aesthetic appreciation, and situational awareness, complementing the limitations of visual perception. With the rise of Multimodal Large Language Models (MLLMs) (Comanici et al., 2025; Achiam et al., 2023) and especially Large Audio-Language Models (LALMs) (Chu et al., 2024; Goel et al., 2025), these models have shown impressive capabilities in understanding audio, representing a crucial step toward diverse applications such as embodied intelligence (Paul et al., 2022). To drive progress, a series of audio benchmarks has been introduced (Yang et al., 2024; Sakshi et al., 2025), covering traditional tasks like Automatic Speech Recognition (ASR) and sound event classification.
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.86)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Temporal Reasoning (0.72)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)