Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering
Yang Liu, Guanbin Li, Liang Lin
arXiv.org Artificial Intelligence
Abstract: Existing visual question answering methods often suffer from cross-modal spurious correlations and oversimplified event-level reasoning processes that fail to capture event temporality, causality, and dynamics spanning the video. In this work, to address the task of event-level visual question answering, we propose a framework for cross-modal causal relational reasoning. In particular, a set of causal intervention operations is introduced to discover the underlying causal structures across visual and linguistic modalities. Our framework, named Cross-Modal Causal RelatIonal Reasoning (CMCIR), involves three modules: i) a Causality-aware Visual-Linguistic Reasoning (CVLR) module for collaboratively disentangling visual and linguistic spurious correlations via front-door and back-door causal interventions; ii) a Spatial-Temporal Transformer (STT) module for capturing fine-grained interactions between visual and linguistic semantics; iii) a Visual-Linguistic Feature Fusion (VLFF) module for adaptively learning global semantic-aware visual-linguistic representations. Extensive experiments on four event-level datasets demonstrate the superiority of our CMCIR in discovering visual-linguistic causal structures and achieving robust event-level visual question answering.

[Figure 1 example answer: "No, the road is not congested and the side-collision happened at the crossing."]

Understanding events in a multi-modal visual-linguistic context is a long-standing challenge. Existing visual question answering methods [10], [11], [12], [13] use recurrent neural networks (RNNs) [14], attention mechanisms [15], or Graph Convolutional Networks [16] for relation reasoning between visual and linguistic modalities. Although achieving promising results, these methods suffer from two common limitations. First, existing visual question answering methods usually focus on simple events that do not require a deep understanding of causality, temporal relations, and linguistic … In Figure 1, given a video and an associated question, a typical …
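The front-door and back-door causal interventions named in the abstract correspond, in Pearl's causal calculus, to standard adjustment formulas. As a hedged sketch of what such interventions compute (notation is ours, not necessarily the paper's): with confounder Z observed, the back-door adjustment estimates the effect of input X on answer Y; with an unobserved confounder but an observable mediator M, the front-door adjustment applies instead:

```latex
% Back-door adjustment (confounder Z observed):
P(Y \mid do(X)) = \sum_{z} P(Y \mid X, Z = z)\, P(Z = z)

% Front-door adjustment (mediator M, confounder unobserved):
P(Y \mid do(X)) = \sum_{m} P(M = m \mid X) \sum_{x'} P(Y \mid M = m, X = x')\, P(X = x')
```

Intuitively, both formulas replace the observational conditional P(Y | X) — which can absorb visual-linguistic spurious correlations — with an estimate of the interventional distribution.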
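The Spatial-Temporal Transformer module described in the abstract captures fine-grained visual-linguistic interactions; its core building block is plausibly cross-modal attention. The following is a minimal NumPy sketch of scaled dot-product cross-attention (linguistic queries attending over visual features), not the paper's actual implementation; all names and shapes are illustrative assumptions:

```python
import numpy as np

def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention: linguistic queries (Lq, d) attend
    over visual keys/values (Lk, d); returns fused features (Lq, d)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)            # (Lq, Lk) similarities
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over visual positions
    return weights @ values                           # attention-weighted visual features

# Toy example: 3 question-token features attending over 5 video-frame features.
rng = np.random.default_rng(0)
q = rng.standard_normal((3, 8))   # linguistic queries
k = rng.standard_normal((5, 8))   # visual keys
v = rng.standard_normal((5, 8))   # visual values
fused = cross_modal_attention(q, k, v)
print(fused.shape)  # (3, 8)
```

Each output row is a convex combination of visual features, so question tokens can pool evidence from the frames most relevant to them.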
Jun-7-2023