cvqa
CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark
Visual Question Answering~(VQA) is an important task in multimodal AI, which requires models to understand and reason on knowledge present in visual and textual data. However, most of the current VQA datasets and models are primarily focused on English and a few major world languages, with images that are Western-centric. While recent efforts have tried to increase the number of languages covered on VQA datasets, they still lack diversity in low-resource languages. More importantly, some datasets extend the text to other languages, either via translation or some other approaches, but usually keep the same images, resulting in narrow cultural representation. To address these limitations, we create CVQA, a new Culturally-diverse Multilingual Visual Question Answering benchmark dataset, designed to cover a rich set of languages and regions, where we engage native speakers and cultural experts in the data collection process. CVQA includes culturally-driven images and questions from across 28 countries in four continents, covering 26 languages with 11 scripts, providing a total of 9k questions. We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and we show that the dataset is challenging for the current state-of-the-art models. This benchmark will serve as a probing evaluation suite for assessing the cultural bias of multimodal models and hopefully encourage more research efforts towards increasing cultural awareness and linguistic diversity in this field.
CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark
Visual Question Answering (VQA) is an important task in multimodal AI, which requires models to understand and reason on knowledge present in visual and textual data. However, most of the current VQA datasets and models are primarily focused on English and a few major world languages, with images that are Western-centric. While recent efforts have tried to increase the number of languages covered on VQA datasets, they still lack diversity in low-resource languages. More importantly, some datasets extend the text to other languages, either via translation or some other approaches, but usually keep the same images, resulting in narrow cultural representation. To address these limitations, we create CVQA, a new Culturally-diverse Multilingual Visual Question Answering benchmark dataset, designed to cover a rich set of languages and regions, where we engage native speakers and cultural experts in the data collection process.
Can Large Language Models Unveil the Mysteries? An Exploration of Their Ability to Unlock Information in Complex Scenarios
Wang, Chao, Zhang, Luning, Wang, Zheng, Zhou, Yang
Combining multiple perceptual inputs and performing combinatorial reasoning in complex scenarios is a sophisticated cognitive function in humans. With advancements in multi-modal large language models, recent benchmarks tend to evaluate visual understanding across multiple images. However, they often overlook the necessity of combinatorial reasoning across multiple perceptual information. To explore the ability of advanced models to integrate multiple perceptual inputs for combinatorial reasoning in complex scenarios, we introduce two benchmarks: Clue-Visual Question Answering (CVQA), with three task types to assess visual comprehension and synthesis, and Clue of Password-Visual Question Answering (CPVQA), with two task types focused on accurate interpretation and application of visual data. For our benchmarks, we present three plug-and-play approaches: utilizing model input for reasoning, enhancing reasoning through minimum margin decoding with randomness generation, and retrieving semantically relevant visual information for effective data integration. The combined results reveal current models' poor performance on combinatorial reasoning benchmarks, even the state-of-the-art (SOTA) closed-source model achieves only 33.04% accuracy on CVQA, and drops to 7.38% on CPVQA. Notably, our approach improves the performance of models on combinatorial reasoning, with a 22.17% boost on CVQA and 9.40% on CPVQA over the SOTA closed-source model, demonstrating its effectiveness in enhancing combinatorial reasoning with multiple perceptual inputs in complex scenarios. The code will be publicly available.
CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark
Romero, David, Lyu, Chenyang, Wibowo, Haryo Akbarianto, Lynn, Teresa, Hamed, Injy, Kishore, Aditya Nanda, Mandal, Aishik, Dragonetti, Alina, Abzaliev, Artem, Tonja, Atnafu Lambebo, Balcha, Bontu Fufa, Whitehouse, Chenxi, Salamea, Christian, Velasco, Dan John, Adelani, David Ifeoluwa, Meur, David Le, Villa-Cueva, Emilio, Koto, Fajri, Farooqui, Fauzan, Belcavello, Frederico, Batnasan, Ganzorig, Vallejo, Gisela, Caulfield, Grainne, Ivetta, Guido, Song, Haiyue, Ademtew, Henok Biadglign, Maina, Hernán, Lovenia, Holy, Azime, Israel Abebe, Cruz, Jan Christian Blaise, Gala, Jay, Geng, Jiahui, Ortiz-Barajas, Jesus-German, Baek, Jinheon, Dunstan, Jocelyn, Alemany, Laura Alonso, Nagasinghe, Kumaranage Ravindu Yasas, Benotti, Luciana, D'Haro, Luis Fernando, Viridiano, Marcelo, Estecha-Garitagoitia, Marcos, Cabrera, Maria Camila Buitrago, Rodríguez-Cantelar, Mario, Jouitteau, Mélanie, Mihaylov, Mihail, Imam, Mohamed Fazli Mohamed, Adilazuarda, Muhammad Farid, Gochoo, Munkhjargal, Otgonbold, Munkh-Erdene, Etori, Naome, Niyomugisha, Olivier, Silva, Paula Mónica, Chitale, Pranjal, Dabre, Raj, Chevi, Rendi, Zhang, Ruochen, Diandaru, Ryandito, Cahyawijaya, Samuel, Góngora, Santiago, Jeong, Soyeong, Purkayastha, Sukannya, Kuribayashi, Tatsuki, Jayakumar, Thanmay, Torrent, Tiago Timponi, Ehsan, Toqeer, Araujo, Vladimir, Kementchedjhieva, Yova, Burzo, Zara, Lim, Zheng Wei, Yong, Zheng Xin, Ignat, Oana, Nwatu, Joan, Mihalcea, Rada, Solorio, Thamar, Aji, Alham Fikri
Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used to test the ability of vision-language models to understand and reason on knowledge present in both visual and textual data. However, most of the current VQA models use datasets that are primarily focused on English and a few major world languages, with images that are typically Western-centric. While recent efforts have tried to increase the number of languages covered on VQA datasets, they still lack diversity in low-resource languages. More importantly, although these datasets often extend their linguistic range via translation or some other approaches, they usually keep images the same, resulting in narrow cultural representation. To address these limitations, we construct CVQA, a new Culturally-diverse multilingual Visual Question Answering benchmark, designed to cover a rich set of languages and cultures, where we engage native speakers and cultural experts in the data collection process. As a result, CVQA includes culturally-driven images and questions from across 28 countries on four continents, covering 26 languages with 11 scripts, providing a total of 9k questions. We then benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models. This benchmark can serve as a probing evaluation suite for assessing the cultural capability and bias of multimodal models and hopefully encourage more research efforts toward increasing cultural awareness and linguistic diversity in this field.