Intention-Guided Cognitive Reasoning for Egocentric Long-Term Action Anticipation

Chu, Qiaohui, Zhang, Haoyu, Liu, Meng, Feng, Yisen, Shi, Haoxiang, Nie, Liqiang

arXiv.org Artificial Intelligence 

Long-term action anticipation from egocentric video is critical for applications such as human-computer interaction and assistive technologies, where anticipating user intent enables proactive and context-aware AI assistance. However, existing approaches suffer from three key limitations: 1) underutilization of fine-grained visual cues from hand-object interactions, 2) neglect of semantic dependencies between verbs and nouns, and 3) lack of explicit cognitive reasoning, which limits generalization and long-term forecasting ability. To overcome these challenges, we propose INSIGHT, a unified two-stage framework for egocentric action anticipation. In the first stage, INSIGHT extracts semantically rich features from hand-object interaction regions and enhances action representations using a verb-noun co-occurrence matrix. In the second stage, it introduces a reinforcement learning-based module that simulates explicit cognitive reasoning through a structured process: visual perception (think) → intention inference (reason) → action anticipation (answer). Extensive experiments on the Ego4D, EPIC-Kitchens-55, and EGTEA Gaze+ benchmarks show that INSIGHT achieves state-of-the-art performance, demonstrating its effectiveness and strong generalization capability.

Introduction

In real-world applications such as human-computer interaction (Azam and Desai 2024; Plizzari et al. 2024), augmented reality (Abreu et al. 2024; Xu et al. 2024), and assistive systems for visually impaired individuals (Lee et al. 2024; Xiao et al. 2025), AI agents must accurately interpret user intent and demonstrate effective long-term planning capabilities within egocentric vision scenarios.
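The verb-noun co-occurrence idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation; the counting scheme, the normalization, and the multiplicative re-weighting rule below are all assumptions chosen for illustration: count which verb-noun pairs co-occur in training annotations, normalize the counts into a prior P(noun | verb), and use that prior to down-weight implausible verb-noun combinations when scoring joint actions.

```python
import numpy as np

# Toy training annotations: (verb_id, noun_id) pairs (hypothetical data).
train_pairs = [(0, 1), (0, 1), (1, 2), (0, 2)]
num_verbs, num_nouns = 2, 3

# Build the verb-noun co-occurrence matrix from the training labels.
cooc = np.zeros((num_verbs, num_nouns))
for v, n in train_pairs:
    cooc[v, n] += 1

# Row-normalize to an empirical prior P(noun | verb); all-zero rows stay zero.
row_sums = cooc.sum(axis=1, keepdims=True)
prior = np.divide(cooc, row_sums, out=np.zeros_like(cooc), where=row_sums > 0)

# Combine independent verb and noun scores with the co-occurrence prior
# to score joint actions (an illustrative re-weighting, not the paper's rule).
verb_scores = np.array([0.7, 0.3])       # model's per-verb probabilities
noun_scores = np.array([0.2, 0.5, 0.3])  # model's per-noun probabilities
action_scores = verb_scores[:, None] * noun_scores[None, :] * prior

best_v, best_n = np.unravel_index(action_scores.argmax(), action_scores.shape)
print(best_v, best_n)  # → 0 1
```

The prior zeroes out verb-noun pairs never seen in training, so a high-scoring but semantically incoherent combination cannot be selected.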
