Glance and Focus: Memory Prompting for Multi-Event Video Question Answering

Ziyi Bai

Neural Information Processing Systems 

Video Question Answering (VideoQA) has emerged as a vital tool for evaluating agents' ability to understand daily human behaviors. Despite the recent success of large vision-language models in many multi-modal tasks, complex situation reasoning over videos that involve multiple human-object interaction events remains challenging.
