Glance and Focus: Memory Prompting for Multi-Event Video Question Answering

Dec-25-2025, 22:03:37 GMT–Neural Information Processing Systems

Video Question Answering (VideoQA) has emerged as a vital tool for evaluating agents' ability to understand everyday human behaviors. Despite the recent success of large vision-language models in many multi-modal tasks, complex situation reasoning over videos that involve multiple human-object interaction events remains challenging. In contrast, humans handle such reasoning easily by using a series of episode memories as anchors to quickly locate the question-related key moments. To mimic this effective reasoning strategy, we propose the Glance-Focus model. One simple way to obtain such memories is to apply an action detection model that predicts a set of actions as key memories.

glance and focus, memory prompting, name change

Technology:
- Information Technology > Artificial Intelligence > Natural Language (0.61)