CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark

Romero, David, Lyu, Chenyang, Wibowo, Haryo Akbarianto, Lynn, Teresa, Hamed, Injy, Kishore, Aditya Nanda, Mandal, Aishik, Dragonetti, Alina, Abzaliev, Artem, Tonja, Atnafu Lambebo, Balcha, Bontu Fufa, Whitehouse, Chenxi, Salamea, Christian, Velasco, Dan John, Adelani, David Ifeoluwa, Meur, David Le, Villa-Cueva, Emilio, Koto, Fajri, Farooqui, Fauzan, Belcavello, Frederico, Batnasan, Ganzorig, Vallejo, Gisela, Caulfield, Grainne, Ivetta, Guido, Song, Haiyue, Ademtew, Henok Biadglign, Maina, Hernán, Lovenia, Holy, Azime, Israel Abebe, Cruz, Jan Christian Blaise, Gala, Jay, Geng, Jiahui, Ortiz-Barajas, Jesus-German, Baek, Jinheon, Dunstan, Jocelyn, Alemany, Laura Alonso, Nagasinghe, Kumaranage Ravindu Yasas, Benotti, Luciana, D'Haro, Luis Fernando, Viridiano, Marcelo, Estecha-Garitagoitia, Marcos, Cabrera, Maria Camila Buitrago, Rodríguez-Cantelar, Mario, Jouitteau, Mélanie, Mihaylov, Mihail, Imam, Mohamed Fazli Mohamed, Adilazuarda, Muhammad Farid, Gochoo, Munkhjargal, Otgonbold, Munkh-Erdene, Etori, Naome, Niyomugisha, Olivier, Silva, Paula Mónica, Chitale, Pranjal, Dabre, Raj, Chevi, Rendi, Zhang, Ruochen, Diandaru, Ryandito, Cahyawijaya, Samuel, Góngora, Santiago, Jeong, Soyeong, Purkayastha, Sukannya, Kuribayashi, Tatsuki, Jayakumar, Thanmay, Torrent, Tiago Timponi, Ehsan, Toqeer, Araujo, Vladimir, Kementchedjhieva, Yova, Burzo, Zara, Lim, Zheng Wei, Yong, Zheng Xin, Ignat, Oana, Nwatu, Joan, Mihalcea, Rada, Solorio, Thamar, Aji, Alham Fikri

Jun-9-2024–arXiv.org Artificial Intelligence

Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used to test the ability of vision-language models to understand and reason over knowledge present in both visual and textual data. However, most current VQA models use datasets that focus primarily on English and a few major world languages, with images that are typically Western-centric. While recent efforts have tried to increase the number of languages covered by VQA datasets, they still lack diversity in low-resource languages. More importantly, although these datasets often extend their linguistic range via translation or other approaches, they usually keep the images the same, resulting in narrow cultural representation. To address these limitations, we construct CVQA, a new Culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures, for which we engage native speakers and cultural experts in the data collection process. As a result, CVQA includes culturally-driven images and questions from 28 countries on four continents, covering 26 languages with 11 scripts, for a total of 9k questions. We then benchmark several Multimodal Large Language Models (MLLMs) on CVQA and show that the dataset is challenging for current state-of-the-art models. This benchmark can serve as a probing evaluation suite for assessing the cultural capability and bias of multimodal models, and we hope it encourages more research toward increasing cultural awareness and linguistic diversity in this field.

Country:
- South America
  - Brazil (0.04)
  - Uruguay (0.04)
  - Colombia (0.04)
  - Argentina (0.04)
  - Ecuador
    - Pichincha Province > Quito (0.04)
    - Cañar Province > Azogues (0.04)
  - Chile > Santiago Metropolitan Region
    - Santiago Province > Santiago (0.04)
- North America
  - Dominican Republic (0.04)
  - Central America (0.04)
  - Mexico > Mexico City
    - Mexico City (0.04)
  - Canada > Ontario
    - Toronto (0.04)
- Europe
  - France (0.04)
  - Spain (0.04)
  - Ireland (0.04)
  - Russia (0.04)
  - Romania (0.04)
  - Norway (0.04)
  - Bulgaria (0.04)
- Asia
  - India (0.04)
  - Philippines (0.04)
  - Mongolia (0.04)
  - Malaysia (0.04)
  - Singapore (0.04)
  - Southeast Asia (0.04)
  - Sri Lanka (0.04)
  - China (0.04)
  - Russia (0.04)
  - Japan (0.04)
  - East Asia (0.04)
  - Central Asia (0.04)
  - Middle East > Israel (0.04)
  - Pakistan (0.04)
  - South Korea > Seoul
    - Seoul (0.04)
  - Indonesia
    - Bali (0.04)
    - Sumatra
      - Bengkulu > Bengkulu (0.04)
      - West Sumatra (0.04)
    - Java
      - West Java (0.04)
      - Jakarta > Jakarta (0.04)
      - East Java > Surabaya (0.04)
- Africa
  - Nigeria (0.04)
  - Ethiopia (0.04)
  - Middle East > Egypt (0.04)

Genre:
- Research Report (1.00)

Industry:
- Education (0.47)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Question Answering (0.82)
  - Machine Learning > Neural Networks
    - Deep Learning (0.68)
