AITopics | multimodal learning and reasoning

Multimodal Learning and Reasoning for Visual Question Answering

Neural Information Processing Systems, Nov-21-2025, 16:17:09 GMT

Reasoning about entities and their relationships from multimodal data is a key goal of Artificial General Intelligence. The visual question answering (VQA) problem is an excellent way to test such reasoning capabilities of an AI model and its multimodal representation learning. However, current VQA models are over-simplified deep neural networks, composed of a long short-term memory (LSTM) unit for question comprehension and a convolutional neural network (CNN) for learning a single image representation. We argue that a single visual representation contains only limited and general information about the image contents and thus limits the model's reasoning capabilities. In this work we introduce a modular neural network model that learns a multimodal and multifaceted representation of the image and the question. The proposed model learns to use the multimodal representation to reason about the image entities and achieves new state-of-the-art performance on both VQA benchmark datasets, VQA v1.0 and v2.0, by a wide margin.

multimodal learning and reasoning, name change, representation, (3 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Reviews: Multimodal Learning and Reasoning for Visual Question Answering

Neural Information Processing Systems, Oct-8-2024, 13:10:18 GMT

The paper introduces a novel modular neural network for multimodal tasks such as Visual Question Answering. It argues that a single visual representation is not sufficient for VQA and that task-specific visual features, such as those for scene classification or object detection, would result in a better VQA model. Following this motivation, the paper proposes a VQA model with modules tailored for specific tasks (scene classification, object detection/classification, face detection/analysis) and pushes the state-of-the-art performance.

Strengths:

- Since VQA spans many lower-level vision tasks such as object detection, scene classification, etc., it makes a lot of sense that the visual features tailored for these tasks should help for the task of VQA. To my knowledge, this is the first paper which explicitly uses this information in building their model, and shows the importance of visual features from each task in their ablation studies.
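The idea the review describes, combining features from several task-specific visual modules with a question representation, can be illustrated with a toy sketch. This is not the authors' model: the feature sizes, the answer vocabulary, the concatenation-plus-linear-classifier fusion, and the names `fuse`, `answer_distribution`, and `ablate` are all assumptions made for illustration, and random numbers stand in for real module outputs. Zeroing one module's features is used here as a simple stand-in for the kind of ablation the review mentions.

```python
import math
import random

# Hypothetical feature sizes; the source gives no dimensions.
MODULE_DIMS = {"scene": 4, "objects": 6, "faces": 3}
QUESTION_DIM = 5
ANSWERS = ["yes", "no", "two", "kitchen"]  # toy answer vocabulary (assumed)


def fuse(question_vec, module_feats):
    """Concatenate the question vector with each module's features in a fixed order."""
    fused = list(question_vec)
    for name in sorted(MODULE_DIMS):
        fused.extend(module_feats[name])
    return fused


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def answer_distribution(fused, weights):
    """Linear classifier over the fused vector: one weight row per candidate answer."""
    scores = [sum(w * x for w, x in zip(row, fused)) for row in weights]
    return softmax(scores)


def ablate(module_feats, name):
    """Return a copy of the features with one module zeroed out."""
    out = dict(module_feats)
    out[name] = [0.0] * MODULE_DIMS[name]
    return out


# Random stand-ins for a question embedding, module outputs, and untrained weights.
rng = random.Random(0)
question = [rng.uniform(-1, 1) for _ in range(QUESTION_DIM)]
features = {n: [rng.uniform(-1, 1) for _ in range(d)] for n, d in MODULE_DIMS.items()}
fused_dim = QUESTION_DIM + sum(MODULE_DIMS.values())
weights = [[rng.uniform(-1, 1) for _ in range(fused_dim)] for _ in ANSWERS]

probs = answer_distribution(fuse(question, features), weights)
probs_no_faces = answer_distribution(fuse(question, ablate(features, "faces")), weights)

print("full model:   ", [round(p, 3) for p in probs])
print("without faces:", [round(p, 3) for p in probs_no_faces])
```

Because the weights are random and untrained, the printed distributions carry no meaning; the sketch only shows where each module's features enter the fused representation and how removing one module changes the input to the answer classifier.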

face analysis module, module, multimodal learning and reasoning, (11 more...)

Genre: Research Report (0.52)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.62)

Multimodal Learning and Reasoning for Visual Question Answering

Ilievski, Ilija, Feng, Jiashi

Neural Information Processing Systems, Feb-14-2020, 05:56:21 GMT

multimodal learning and reasoning, reasoning capability, representation, (1 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
