Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang

Mar-13-2024, 05:31:07 GMT–Neural Information Processing Systems

In this paper, we present the mQA model, which is able to answer questions about the content of an image. The answer can be a sentence, a phrase or a single word. Our model contains four components: a Long Short-Term Memory (LSTM) to extract the question representation, a Convolutional Neural Network (CNN) to extract the visual representation, an LSTM for storing the linguistic context in an answer, and a fusing component to combine the information from the first three components and generate the answer. We construct a Freestyle Multilingual Image Question Answering (FM-IQA) dataset to train and evaluate our mQA model. It contains over 150,000 images and 310,000 freestyle Chinese question-answer pairs and their English translations. The quality of the generated answers of our mQA model on this dataset is evaluated by human judges through a Turing Test. Specifically, we mix the answers provided by humans and our model. The human judges need to distinguish our model from the human. They will also provide a score (i.e.
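The Turing Test protocol described in the abstract (mix human and model answers, then ask judges to tell them apart) can be illustrated with a small tally. This is only a sketch under stated assumptions, not the authors' evaluation code: the names `Judgment`, `mix_answers`, and `pass_rate` are hypothetical, the data is a toy example, and the additional per-answer score mentioned in the abstract is left out because its definition is cut off in the text above.

```python
import random
from dataclasses import dataclass


@dataclass
class Judgment:
    """One judge decision about one answer (hypothetical record layout)."""
    source: str         # true origin, hidden from the judge: "human" or "model"
    judged_human: bool  # True if the judge believed a human wrote the answer


def mix_answers(human_answers, model_answers, seed=0):
    """Pool human and model answers and shuffle them so order gives no hint."""
    pool = [("human", a) for a in human_answers]
    pool += [("model", a) for a in model_answers]
    random.Random(seed).shuffle(pool)
    return pool


def pass_rate(judgments, source):
    """Fraction of answers from `source` that judges took for human answers."""
    relevant = [j for j in judgments if j.source == source]
    if not relevant:
        raise ValueError(f"no judgments for source {source!r}")
    return sum(j.judged_human for j in relevant) / len(relevant)


# Toy judgments, invented purely for illustration.
toy_judgments = [
    Judgment("model", True),
    Judgment("model", False),
    Judgment("model", True),
    Judgment("model", False),
    Judgment("human", True),
    Judgment("human", True),
]

model_rate = pass_rate(toy_judgments, "model")
human_rate = pass_rate(toy_judgments, "human")
```

A model that is hard to tell apart from humans would have a `model_rate` close to `human_rate`; reporting both keeps the comparison honest when judges also misclassify some human answers.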

