WebQA: Multihop and Multimodal QA
Chang, Yingshan, Narang, Mridu, Suzuki, Hisami, Cao, Guihong, Gao, Jianfeng, Bisk, Yonatan
–arXiv.org Artificial Intelligence
Web search is fundamentally multimodal and multihop. Often, even before asking a question, we go directly to image search to find our answers. Moreover, we rarely find an answer in a single source; instead, we aggregate information and reason through its implications. Despite how common this everyday experience is, there is currently no unified question answering benchmark that requires a single model to answer long-form natural language questions from text and open-ended visual sources, as a human would. We propose to bridge this gap between the natural language and computer vision communities with WebQA. We show that (a) our multihop text queries are difficult for a large-scale transformer model, and (b) existing multimodal transformers and visual representations do not perform well on open-domain visual queries. Our challenge for the community is to create a unified multimodal reasoning model that seamlessly transitions and reasons regardless of the source modality.
Sep-21-2021
- Country:
  - North America
    - Canada (0.04)
    - United States
      - Michigan (0.04)
      - New York > New York County
        - New York City (0.04)
      - California > Los Angeles County
        - Long Beach (0.04)
  - Europe
  - Asia > Japan
    - Honshū
      - Tōhoku > Miyagi Prefecture
        - Sendai (0.04)
      - Kantō > Tokyo Metropolis Prefecture
        - Tokyo (0.04)
  - Africa > Middle East
    - Egypt (0.04)
- Genre:
  - Research Report (0.50)
- Technology:
  - Information Technology
    - Information Management > Search (1.00)
    - Artificial Intelligence
      - Vision (1.00)
      - Natural Language > Question Answering (0.70)
      - Machine Learning > Pattern Recognition (0.49)
      - Cognitive Science > Problem Solving (0.46)