Multimodal Differential Network for Visual Question Generation
Patro, Badri N., Kumar, Sandeep, Kurmi, Vinod K., Namboodiri, Vinay P.
arXiv.org Artificial Intelligence
Indian Institute of Technology, Kanpur
{badri,sandepkr,vinodkk,vinaypn}@iitk.ac.in

Abstract

Generating natural questions from an image is a semantic task that requires using the visual and language modalities to learn multimodal representations. Images can have multiple visual and language contexts that are relevant for generating questions, namely places, captions, and tags. In this paper, we propose the use of exemplars for obtaining the relevant context. We obtain this by using a Multimodal Differential Network to produce natural and engaging questions. The generated questions show a remarkable similarity to natural questions, as validated by a human study. Further, we observe that the proposed approach substantially improves over state-of-the-art benchmarks on the quantitative metrics (BLEU, METEOR, ROUGE, and CIDEr).

1 Introduction

To understand the progress towards multimedia vision and language understanding, a visual Turing test was proposed by (Geman et al., 2015) that was aimed at visual question answering (Antol et al., 2015). Visual Dialog (Das et al., 2017) is a natural extension of VQA. Current dialog systems, as evaluated in (Chattopadhyay et al., 2017), show improvement when trained between bots (AI-AI dialog), but that improvement does not translate to Human-AI dialog. This is because the questions generated by bots are not natural (human-like) and therefore do not lead to improved human dialog. Improving the quality of questions is therefore imperative if dialog agents are to perform well in human interactions. Further, (Ganju et al., 2017) show that unanswered questions can be used for improving VQA, image captioning, and object classification.

An interesting line of work in this respect is that of (Mostafazadeh et al., 2016), who proposed the challenging task of generating natural questions for an image.
One aspect that is central to a question is the context relevant to generating it. As can be seen in Figure 1, an image of a person on a skateboard would result in questions related to the event.
Oct-17-2019