Interpretable Visual Understanding with Cognitive Attention Network
Xuejiao Tang, Wenbin Zhang, Yi Yu, Kea Turner, Tyler Derr, Mengyu Wang, Eirini Ntoutsi
arXiv.org Artificial Intelligence
While recognition-level image understanding has achieved remarkable advances, reliable visual scene understanding requires comprehensive image understanding not only at the recognition level but also at the cognition level, which calls for exploiting multi-source information as well as learning different levels of understanding and extensive commonsense knowledge. In this paper, we propose a novel Cognitive Attention Network (CAN) for visual commonsense reasoning to achieve interpretable visual understanding. Specifically, we first introduce an image-text fusion module to fuse information from images and text collectively. Second, a novel inference module is designed to encode commonsense among image, query and response. Extensive experiments on the large-scale Visual Commonsense Reasoning (VCR) benchmark dataset demonstrate the effectiveness of our approach. The implementation is publicly available at https://github.com/tanjatang/CAN
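The image-text fusion module described in the abstract can be illustrated with a minimal cross-attention sketch, in which text-token features attend over image-region features and the attended visual context is fused back into the text representation. This is a generic illustration under assumed shapes and randomly initialized projection matrices, not the authors' actual CAN implementation; all names (`cross_attention_fusion`, `Wq`, `Wk`, `Wv`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fusion(text_feats, image_feats, Wq, Wk, Wv):
    """Fuse modalities: each text token attends over image regions.

    text_feats:  (T, d) text-token features
    image_feats: (R, d) image-region features
    Returns fused text features of shape (T, d).
    """
    Q = text_feats @ Wq                        # queries from text, (T, d)
    K = image_feats @ Wk                       # keys from image,   (R, d)
    V = image_feats @ Wv                       # values from image, (R, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # scaled dot-product, (T, R)
    attn = softmax(scores, axis=-1)            # attention over regions
    return text_feats + attn @ V               # residual fusion, (T, d)

# Toy example with 5 text tokens, 3 image regions, feature dim 8.
rng = np.random.default_rng(0)
d = 8
text = rng.normal(size=(5, d))
image = rng.normal(size=(3, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
fused = cross_attention_fusion(text, image, Wq, Wk, Wv)
print(fused.shape)  # (5, 8)
```

The residual connection keeps the original text features intact while mixing in visual context, a common design choice in attention-based multimodal fusion.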
Aug-14-2021