Hierarchical Conditional Relation Networks for Multimodal Video Question Answering

Le, Thao Minh, Le, Vuong, Venkatesh, Svetha, Tran, Truyen

arXiv.org Artificial Intelligence 

Abstract

Video Question Answering (Video QA) challenges modelers on multiple fronts. Modeling video necessitates building not only spatiotemporal models for the dynamic visual channel but also multimodal structures for associated information channels such as subtitles or audio. Video QA adds at least two more layers of complexity - selecting relevant content for each channel in the context of the linguistic [...]. To address these requirements, we start with two insights: (a) content selection and relation construction can be jointly encapsulated into a conditional computational structure, and (b) video-length structures can be composed hierarchically. For (a), this paper introduces a general-purpose reusable neural unit dubbed Conditional Relation Network (CRN), which takes as input a set of tensorial objects and translates it into a new set of objects that encode relations of the inputs. The generic design of the CRN eases the usually complex model-building process of Video QA through simple block stacking and rearrangement, with flexibility in accommodating diverse input modalities and conditioning features across both visual and linguistic domains. We then realize insight (b) by introducing Hierarchical Conditional Relation Networks (HCRN) for Video QA. The HCRN primarily aims at exploiting intrinsic properties of the visual content of a video as well as its accompanying [...]. Our evaluations show consistent improvements over state-of-the-art methods on well-studied benchmarks, including large-scale real-world datasets such as TGIF-QA and TVQA, demonstrating the strong capabilities of our CRN unit and the HCRN for complex [...]. To the best of our knowledge, the HCRN is the very first method attempting to handle long and short-form multimodal Video QA at the same time.

Keywords [...] modules · Hierarchy

1 Introduction

Answering natural questions about a video is a powerful demonstration of cognitive capability. The task involves acquisition and manipulation of spatiotemporal visual, acoustic and linguistic representations from the video, guided by the compositional semantics of linguistic cues [1, 2, 3, 4, 5, 6]. As questions are potentially unconstrained, Video QA requires deep modeling capacity to encode and represent crucial multimodal video properties such as linguistic content, object permanence, motion profiles, prolonged actions, and varying-length temporal relations in a hierarchical manner. For Video QA, the visual and textual representations should ideally be question-specific and answer-ready.
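
The abstract describes the CRN only at the interface level: a set of input objects plus a conditioning feature goes in, and a new set of relation-encoding objects comes out, with units stacked hierarchically. The following is a minimal illustrative sketch of that interface in plain Python, not the authors' implementation. Every internal choice is an assumption made for illustration: the function name `crn_sketch`, the use of sampled subsets, mean pooling as the relation function, and element-wise multiplication as the conditioning step are all hypothetical stand-ins for learned sub-networks.

```python
import itertools
import random
from typing import List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def condition(obj: Vector, cond: Vector) -> Vector:
    """Hypothetical conditioning step: element-wise product with the
    conditioning vector (e.g. a question embedding). A real model would
    use a learned fusion here; this is an assumption for illustration."""
    return [o * c for o, c in zip(obj, cond)]


def crn_sketch(objects: Sequence[Vector], cond: Vector,
               max_subset_size: int, samples_per_size: int,
               rng: random.Random) -> List[Vector]:
    """Map a set of input objects to a new set of objects, one per subset
    size k, each summarising sampled k-subsets of the inputs under `cond`."""
    outputs = []
    for k in range(2, max_subset_size + 1):
        subsets = list(itertools.combinations(objects, k))
        chosen = rng.sample(subsets, min(samples_per_size, len(subsets)))
        # Relate the members of each subset (assumed: mean pooling),
        # then modulate the result by the conditioning feature.
        related = [condition(mean_pool(s), cond) for s in chosen]
        outputs.append(mean_pool(related))
    return outputs


# Hierarchical use: the same unit is applied to frames within clips,
# then again to the clip-level outputs.
rng = random.Random(0)
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
question = [1.0, 0.5]
clip_a = crn_sketch(frames[:3], question, 3, 2, rng)
clip_b = crn_sketch(frames[1:], question, 3, 2, rng)
video = crn_sketch(clip_a + clip_b, question, 2, 2, rng)
```

Because the output of one call has the same form as its input (a list of vectors), units can be stacked without adapters, which is the property the abstract refers to as block stacking.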
