AITopics | videoqa


videoqa

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task

Neural Information Processing Systems | Jun-14-2026, 01:39:25 GMT

The Video Question Answering (VideoQA) task serves as a critical playground for evaluating whether foundation models can effectively perceive, understand, and reason about dynamic real-world scenarios. However, existing Multimodal Large Language Models (MLLMs) struggle to simultaneously model spatial relationships between video frames and understand the causal dynamics of temporal evolution on complex, reasoning-intensive VideoQA. In this work, we equip MLLMs with a comprehensive and extensible Video Toolkit to enhance their spatiotemporal reasoning capabilities and to ensure harmony between the quantity and diversity of tools. To better control the tool invocation sequence and avoid toolchain shortcut issues, we propose a Spatiotemporal Reasoning Framework (STAR) that strategically schedules temporal and spatial tools, thereby progressively localizing the key area in the video. Our STAR framework enhances GPT-4o using lightweight tools, achieving an 8.2% gain on VideoMME and a 4.6% gain on LongVideoBench. We believe that our proposed Video Toolkit and STAR framework are an important step towards building autonomous and intelligent video analysis assistants.

artificial intelligence, large language model, natural language, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.60)


00d1f03b87a401b1c7957e0cc785d0bc-Paper-Conference.pdf

Neural Information Processing Systems | Apr-24-2026, 07:36:03 GMT

[...] annotation of questions and answers for videos, [...]. To tackle this problem, recent methods [...] of visual question-answer [...]. In particular, a promising approach adapts frozen autoregressive [...] pretrained on Web-scale text-only data to multi-modal inputs. [...] here build on frozen bidirectional language models (BiLM) [...].

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Industry: Education (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.98)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)


Appendix 1 Perception Test at a glance

Neural Information Processing Systems | Feb-15-2026, 15:39:37 GMT

Performance is evaluated by measuring top-1 accuracy.
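
As a rough illustration of the metric named above, top-1 accuracy is the fraction of examples for which the highest-scoring answer option is the correct one. The sketch below assumes a multiple-choice setup with one score per option; the function name `top1_accuracy` and the sample scores are hypothetical and are not taken from the benchmark's own evaluation code.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Fraction of examples whose highest-scoring option is the correct one."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the option with the highest score (first one wins on ties).
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)


# Hypothetical scores for three questions with three answer options each.
example_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]
example_labels = [1, 1, 2]
accuracy = top1_accuracy(example_scores, example_labels)  # 2 of 3 correct
```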

artificial intelligence, machine learning, video, (19 more...)

Neural Information Processing Systems

Country:

North America > United States (0.14)
South America > Brazil (0.04)
North America > Mexico (0.04)
(10 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.48)


Video Flow as Time Series: Discovering Temporal Consistency and Variability for VideoQA

Song, Zijie, Hu, Zhenzhen, Ma, Yixiao, Li, Jia, Hong, Richang

arXiv.org Artificial Intelligence | Nov-4-2025

Video Question Answering (VideoQA) is a complex video-language task that demands a sophisticated understanding of both visual content and temporal dynamics. Traditional Transformer-style architectures, while effective in integrating multimodal data, often simplify temporal dynamics through positional encoding and fail to capture non-linear interactions within video sequences. In this paper, we introduce the Temporal Trio Transformer (T3T), a novel architecture that models time consistency and time variability. The TS module employs Brownian Bridge for capturing smooth, continuous temporal transitions, while the TD module identifies and encodes significant temporal variations and abrupt changes within the video content. The efficacy of the T3T is demonstrated through extensive testing on multiple VideoQA benchmark datasets. Our results underscore the importance of a nuanced approach to temporal modeling in improving the accuracy and depth of video-based question answering. In the realm of video-language tasks, Video Question Answering (VideoQA) stands out as one of the challenges that demand a high degree of temporal understanding where video and language are both sequential forms of information characterized by their temporality. This task requires models not only to process visual content but also to reason across the temporal sequence of events in a video in response to specific questions [1]-[4].

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ICME59968.2025.11209469

2504.05783

Country: Asia > China (0.29)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)


Moment Sampling in Video LLMs for Long-Form Video QA

Chasmai, Mustafa, Jagatap, Gauri, KV, Gouthaman, Van Horn, Grant, Maji, Subhransu, Fanelli, Andrea

arXiv.org Artificial Intelligence | Jul-2-2025

Recent progress in video large language models (Video LLMs) has significantly advanced the field of video question answering (VideoQA). While existing methods perform well on short videos, they often struggle with long-range reasoning in longer videos. To scale Video LLMs for longer video content, frame sub-sampling (selecting frames at regular intervals) is commonly used. However, this approach is suboptimal, often leading to the loss of crucial frames or the inclusion of redundant information from multiple similar frames. Missing key frames impairs the model's ability to answer questions accurately, while redundant frames lead the model to focus on irrelevant video segments and increase computational resource consumption. In this paper, we investigate the use of a general-purpose text-to-video moment retrieval model to guide the frame sampling process. We propose "moment sampling", a novel, model-agnostic approach that enables the model to select the most relevant frames according to the context of the question. Specifically, we employ a lightweight moment retrieval model to prioritize frame selection. By focusing on the frames most pertinent to the given question, our method enhances long-form VideoQA performance in Video LLMs. Through extensive experiments on four long-form VideoQA datasets, using four state-of-the-art Video LLMs, we demonstrate the effectiveness of the proposed approach.
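
The contrast between regular-interval sub-sampling and question-guided selection can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the function names are hypothetical, and the per-frame `relevance` scores are assumed to come from some moment retrieval model that is not shown here.

```python
from typing import List, Sequence


def uniform_sample(num_frames: int, budget: int) -> List[int]:
    """Pick `budget` frame indices at regular intervals (frame sub-sampling)."""
    if budget <= 0 or num_frames <= 0:
        return []
    if budget >= num_frames:
        return list(range(num_frames))
    step = num_frames / budget
    # Take the middle frame of each equal-length segment.
    return [int(step * i + step / 2) for i in range(budget)]


def relevance_sample(relevance: Sequence[float], budget: int) -> List[int]:
    """Keep the `budget` highest-scoring frames, returned in temporal order.

    `relevance` is a hypothetical per-frame score for the question at hand.
    """
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    return sorted(ranked[: max(budget, 0)])


# Hypothetical 10-frame video with a 5-frame budget.
regular = uniform_sample(10, 5)
# Hypothetical relevance scores: frames 1 and 3 matter most for the question.
guided = relevance_sample([0.1, 0.9, 0.2, 0.8, 0.05], 2)
```

Uniform sampling ignores the question entirely, so a short but decisive moment can fall between sampled indices; the guided variant spends the same frame budget where the scores say the question is most likely answered.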

large language model, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2507.00033

Country:

North America > United States (0.46)
Europe (0.28)

Genre: Research Report (0.84)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)


MUPA: Towards Multi-Path Agentic Reasoning for Grounded Video Question Answering

Dang, Jisheng, Song, Huilin, Xiao, Junbin, Wang, Bimei, Peng, Han, Li, Haoxuan, Yang, Xun, Wang, Meng, Chua, Tat-Seng

arXiv.org Artificial Intelligence | Jun-30-2025

Grounded Video Question Answering (Grounded VideoQA) requires aligning textual answers with explicit visual evidence. However, modern multimodal models often rely on linguistic priors and spurious correlations, resulting in poorly grounded predictions. In this work, we propose MUPA, a cooperative MUlti-Path Agentic approach that unifies video grounding, question answering, answer reflection and aggregation to tackle Grounded VideoQA. MUPA features three distinct reasoning paths on the interplay of grounding and QA agents in different chronological orders, along with a dedicated reflection agent that judges and aggregates the multi-path results to achieve consistent QA and grounding. This design markedly improves grounding fidelity without sacrificing answer accuracy. Despite using only 2B parameters, our method outperforms all 7B-scale competitors. When scaled to 7B parameters, MUPA establishes new state-of-the-art results, with Acc@GQA of 30.3% and 47.4% on NExT-GQA and DeVE-QA respectively, demonstrating MUPA's effectiveness towards trustworthy video-language understanding. Our code is available at https://github.com/longmalongma/MUPA.

large language model, machine learning, question answering, (16 more...)

arXiv.org Artificial Intelligence

2506.18071

Country: Asia > China (0.28)

Genre: Research Report (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)


ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation

Montes, Tony, Lozano, Fernando

arXiv.org Artificial Intelligence | May-23-2025

Recent advancements in Video Question Answering (VideoQA) have introduced LLM-based agents, modular frameworks, and procedural solutions, yielding promising results. These systems use dynamic agents and memory-based mechanisms to break down complex tasks and refine answers. However, significant room for improvement remains in tracking objects for grounding over time and in reasoning-based decision-making, so that object references align better with language model outputs as newer models get better at both tasks. This work presents an LLM-brained agent for zero-shot Video Question Answering (VideoQA) that combines a Chain-of-Thought framework with grounding reasoning alongside YOLO-World to enhance object tracking and alignment. This approach establishes a new state-of-the-art in VideoQA and Video Understanding, showing enhanced performance on NExT-QA, iVQA, and ActivityNet-QA benchmarks. Our framework also enables cross-checking of grounding timeframes, improving accuracy and providing valuable support for verification and increased output reliability across multiple video domains. The code is available at https://github.com/t-montes/viqagent.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2505.15928

Country: Asia > Thailand (0.14)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Sports > Hockey (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)


Causality Model for Semantic Understanding on Videos

Li, Yicong

arXiv.org Artificial Intelligence | Mar-16-2025

After a decade of prosperity, the development of video understanding has reached a critical juncture, where sole reliance on massive data and complex architectures is no longer a one-size-fits-all solution. Ubiquitous data imbalance prevents DNNs from effectively learning the underlying causal mechanisms, leading to significant performance drops under distribution shifts such as long-tail imbalances and perturbed imbalances. This realization has prompted researchers to seek alternative methodologies to capture causal patterns in video data. To tackle these challenges and increase the robustness of DNNs, causal modeling has emerged as a principle for discovering the true causal patterns behind the observed correlations. This thesis focuses on the domain of semantic video understanding and explores the potential of causal modeling to advance two fundamental tasks: Video Relation Detection (VidVRD) and Video Question Answering (VideoQA).

artificial intelligence, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2503.12447

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Singapore (0.04)

Genre:

Overview (1.00)
Research Report > New Finding (0.67)
Research Report > Promising Solution (0.67)

Industry: Information Technology (0.45)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(3 more...)


ReasVQA: Advancing VideoQA with Imperfect Reasoning Process

Liang, Jianxin, Meng, Xiaojun, Zhang, Huishuai, Wang, Yueqian, Wei, Jiansheng, Zhao, Dongyan

arXiv.org Artificial Intelligence | Jan-23-2025

Video Question Answering (VideoQA) is a challenging task that requires understanding complex visual and temporal relationships within videos to answer questions accurately. In this work, we introduce ReasVQA (Reasoning-enhanced Video Question Answering), a novel approach that leverages reasoning processes generated by Multimodal Large Language Models (MLLMs) to improve the performance of VideoQA models. Our approach consists of three phases: reasoning generation, reasoning refinement, and learning from reasoning. First, we generate detailed reasoning processes using additional MLLMs. Second, we refine them via a filtering step to ensure data quality. Finally, we use the reasoning data, which might be in an imperfect form, to guide the VideoQA model via multi-task learning on how to interpret and answer questions based on a given video. We evaluate ReasVQA on three popular benchmarks, and our results establish new state-of-the-art performance with significant improvements of +2.9 on NExT-QA, +7.3 on STAR, and +5.9 on IntentQA. Our findings demonstrate the supervising benefits of integrating reasoning processes into VideoQA. Further studies validate each component of our method, also with different backbones and MLLMs, and again highlight the advantages of this simple but effective method. We offer a new perspective on enhancing VideoQA performance by utilizing advanced reasoning techniques, setting a new benchmark in this research field.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2501.13536

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)

