Goto

Collaborating Authors

 videoqa



Video Flow as Time Series: Discovering Temporal Consistency and Variability for VideoQA

Song, Zijie, Hu, Zhenzhen, Ma, Yixiao, Li, Jia, Hong, Richang

arXiv.org Artificial Intelligence

--Video Question Answering (VideoQA) is a complex video-language task that demands a sophisticated understanding of both visual content and temporal dynamics. Traditional Transformer-style architectures, while effective in integrating multimodal data, often simplify temporal dynamics through positional encoding and fail to capture non-linear interactions within video sequences. In this paper, we introduce the T emporal Trio Transformer (T3T), a novel architecture that models time consistency and time variability. The TS module employs Brownian Bridge for capturing smooth, continuous temporal transitions, while the TD module identifies and encodes significant temporal variations and abrupt changes within the video content. The efficacy of the T3T is demonstrated through extensive testing on multiple VideoQA benchmark datasets. Our results underscore the importance of a nuanced approach to temporal modeling in improving the accuracy and depth of video-based question answering. In the realm of video-language tasks, Video Question Answering (VideoQA) stands out as one of the challenges that demand a high degree of temporal understanding where video and language are both sequential forms of information characterized by their temporality. This task requires models not only to process visual content but also to reason across the temporal sequence of events in a video in response to specific questions [1]-[4].




Moment Sampling in Video LLMs for Long-Form Video QA

Chasmai, Mustafa, Jagatap, Gauri, KV, Gouthaman, Van Horn, Grant, Maji, Subhransu, Fanelli, Andrea

arXiv.org Artificial Intelligence

Recent advancements in video large language models (Video LLMs) have significantly advanced the field of video question answering (VideoQA). While existing methods perform well on short videos, they often struggle with long-range reasoning in longer videos. To scale Video LLMs for longer video content, frame sub-sampling (selecting frames at regular intervals) is commonly used. However, this approach is suboptimal, often leading to the loss of crucial frames or the inclusion of redundant information from multiple similar frames. Missing key frames impairs the model's ability to answer questions accurately, while redundant frames lead the model to focus on irrelevant video segments and increase computational resource consumption. In this paper, we investigate the use of a general-purpose text-to-video moment retrieval model to guide the frame sampling process. We propose "moment sampling", a novel, model-agnostic approach that enables the model to select the most relevant frames according to the context of the question. Specifically, we employ a lightweight moment retrieval model to prioritize frame selection. By focusing on the frames most pertinent to the given question, our method enhances long-form VideoQA performance in Video LLMs. Through extensive experiments on four long-form VideoQA datasets, using four state-of-the-art Video LLMs, we demonstrate the effectiveness of the proposed approach.


MUPA: Towards Multi-Path Agentic Reasoning for Grounded Video Question Answering

Dang, Jisheng, Song, Huilin, Xiao, Junbin, Wang, Bimei, Peng, Han, Li, Haoxuan, Yang, Xun, Wang, Meng, Chua, Tat-Seng

arXiv.org Artificial Intelligence

Grounded Video Question Answering (Grounded VideoQA) requires aligning textual answers with explicit visual evidence. However, modern multimodal models often rely on linguistic priors and spurious correlations, resulting in poorly grounded predictions. In this work, we propose MUPA, a cooperative MUlti-Path Agentic approach that unifies video grounding, question answering, answer reflection and aggregation to tackle Grounded VideoQA. MUPA features three distinct reasoning paths on the interplay of grounding and QA agents in different chronological orders, along with a dedicated reflection agent to judge and aggregate the multi-path results to accomplish consistent QA and grounding. This design markedly improves grounding fidelity without sacrificing answer accuracy. Despite using only 2B parameters, our method outperforms all 7B-scale competitors. When scaled to 7B parameters, MUPA establishes new state-of-the-art results, with Acc@GQA of 30.3% and 47.4% on NExT-GQA and DeVE-QA respectively, demonstrating MUPA' effectiveness towards trustworthy video-language understanding. Our code is available in https://github.com/longmalongma/MUPA.


ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation

Montes, Tony, Lozano, Fernando

arXiv.org Artificial Intelligence

Recent advancements in Video Question Answering (VideoQA) have introduced LLM-based agents, modular frameworks, and procedural solutions, yielding promising results. These systems use dynamic agents and memory-based mechanisms to break down complex tasks and refine answers. However, significant improvements remain in tracking objects for grounding over time and decision-making based on reasoning to better align object references with language model outputs, as newer models get better at both tasks. This work presents an LLM-brained agent for zero-shot Video Question Answering (VideoQA) that combines a Chain-of-Thought framework with grounding reasoning alongside YOLO-World to enhance object tracking and alignment. This approach establishes a new state-of-the-art in VideoQA and Video Understanding, showing enhanced performance on NExT-QA, iVQA, and ActivityNet-QA benchmarks. Our framework also enables cross-checking of grounding timeframes, improving accuracy and providing valuable support for verification and increased output reliability across multiple video domains. The code is available at https://github.com/t-montes/viqagent.


Causality Model for Semantic Understanding on Videos

Yicong, Li

arXiv.org Artificial Intelligence

After a decade of prosperity, the development of video understanding has reached a critical juncture, where the sole reliance on massive data and complex architectures is no longer a one-size-fits-all solution to all situations. The presence of ubiquitous data imbalance hampers DNNs from effectively learning the underlying causal mechanisms, leading to significant performance drops when encountering distribution shifts, such as long-tail imbalances and perturbed imbalances. This realization has prompted researchers to seek alternative methodologies to capture causal patterns in video data. To tackle these challenges and increase the robustness of DNNs, causal modeling emerged as a principle to discover the true causal patterns behind the observed correlations. This thesis focuses on the domain of semantic video understanding and explores the potential of causal modeling to advance two fundamental tasks: Video Relation Detection (VidVRD) and Video Question Answering (VideoQA).


ReasVQA: Advancing VideoQA with Imperfect Reasoning Process

Liang, Jianxin, Meng, Xiaojun, Zhang, Huishuai, Wang, Yueqian, Wei, Jiansheng, Zhao, Dongyan

arXiv.org Artificial Intelligence

Video Question Answering (VideoQA) is a challenging task that requires understanding complex visual and temporal relationships within videos to answer questions accurately. In this work, we introduce \textbf{ReasVQA} (Reasoning-enhanced Video Question Answering), a novel approach that leverages reasoning processes generated by Multimodal Large Language Models (MLLMs) to improve the performance of VideoQA models. Our approach consists of three phases: reasoning generation, reasoning refinement, and learning from reasoning. First, we generate detailed reasoning processes using additional MLLMs, and second refine them via a filtering step to ensure data quality. Finally, we use the reasoning data, which might be in an imperfect form, to guide the VideoQA model via multi-task learning, on how to interpret and answer questions based on a given video. We evaluate ReasVQA on three popular benchmarks, and our results establish new state-of-the-art performance with significant improvements of +2.9 on NExT-QA, +7.3 on STAR, and +5.9 on IntentQA. Our findings demonstrate the supervising benefits of integrating reasoning processes into VideoQA. Further studies validate each component of our method, also with different backbones and MLLMs, and again highlight the advantages of this simple but effective method. We offer a new perspective on enhancing VideoQA performance by utilizing advanced reasoning techniques, setting a new benchmark in this research field.


Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering

Yu, Ting, Fu, Kunhao, Wang, Shuhui, Huang, Qingming, Yu, Jun

arXiv.org Artificial Intelligence

Video Question Answering (VideoQA) represents a crucial intersection between video understanding and language processing, requiring both discriminative unimodal comprehension and sophisticated cross-modal interaction for accurate inference. Despite advancements in multi-modal pre-trained models and video-language foundation models, these systems often struggle with domain-specific VideoQA due to their generalized pre-training objectives. Addressing this gap necessitates bridging the divide between broad cross-modal knowledge and the specific inference demands of VideoQA tasks. To this end, we introduce HeurVidQA, a framework that leverages domain-specific entity-action heuristics to refine pre-trained video-language foundation models. Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning. By delivering fine-grained heuristics, we improve the model's ability to identify and interpret key entities and actions, thereby enhancing its reasoning capabilities. Extensive evaluations across multiple VideoQA datasets demonstrate that our method significantly outperforms existing models, underscoring the importance of integrating domain-specific knowledge into video-language models for more accurate and context-aware VideoQA.