EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA

Firoj Alam, Ali Ezzat Shahroor, Md. Arid Hasan, Zien Sheikh Ali, Hunzalah Hassan Bhatti, Mohamed Bayan Kmainasi, Shammur Absar Chowdhury, Basel Mousi, Fahim Dalvi, Nadir Durrani, Natasa Milic-Frayling

arXiv.org Artificial Intelligence 

Large-scale multimodal models achieve strong results on tasks like Visual Question Answering (VQA), but they often fail when queries require culturally grounded, everyday knowledge, particularly in low-resource and underrepresented languages. We introduce EverydayMMQA, a framework for creating large-scale, culturally grounded datasets for spoken and visual question answering (SVQA). Using this framework, we developed OASIS, a multimodal dataset integrating speech, images, and text. With over 0.92M images and 14.8M QA pairs, OASIS contains 3.7M spoken questions, enabling four unique input combinations: speech-only, text-only, speech+image, and text+image. Covering English and Arabic varieties across 18 countries, the dataset content is curated to reflect diverse, real-world situations. OASIS tests models on tasks beyond object recognition that involve pragmatic, commonsense, and culturally aware reasoning. Together, EverydayMMQA and OASIS provide a benchmark and training dataset for building multimodal LLMs for a comprehensive set of everyday tasks within cultural contexts. The framework and dataset will be made publicly available to the community.

Multi-sensory integration is fundamental to how humans understand their surroundings and communicate. As large language models (LLMs) evolve, it is important to train them on multiple modalities (speech, text, and images) to mimic human interaction. For instance, when asking about an object, we often point to it while asking a question. In this scenario, we expect an AI assistant to process a multimodal triplet: the visual information (what we are pointing at), the spoken information (our question), and the contextual knowledge required to provide a culturally appropriate response (see Figure 1). Crucially, this contextual knowledge is not universal: it is shaped by culture and language.
