IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs
Ali Faraz, Akash, Shaharukh Khan, Raja Kolla, Akshat Patidar, Suranjan Goswami, Abhinav Ravi, Chandra Khatri, Shubham Agarwal
arXiv.org Artificial Intelligence
Vision-language models (VLMs) have demonstrated impressive generalization across multimodal tasks, yet most evaluation benchmarks remain Western-centric, leaving open questions about their performance in culturally diverse and multilingual settings. To address this gap, we introduce IndicVisionBench, the first large-scale benchmark centered on the Indian subcontinent. Our final benchmark consists of a total of 5K images and 37K+ QA pairs across 13 culturally grounded topics. In addition, we release a paired parallel corpus of annotations across 10 Indic languages, creating a unique resource for analyzing cultural and linguistic biases in VLMs. We evaluate a broad spectrum of 8 models, from proprietary closed-source systems to open-weights medium- and large-scale models. Our experiments reveal substantial performance gaps, underscoring the limitations of current VLMs in culturally diverse contexts. By centering cultural diversity and multilinguality, IndicVisionBench establishes a reproducible evaluation framework that paves the way for more inclusive multimodal research.

Vision-language models (VLMs) (Bai et al., 2023; Chen et al., 2024; Lu et al., 2024; Wang et al., 2024b; Laurençon et al., 2024; Tong et al., 2024; Xue et al., 2024) have demonstrated strong performance across a variety of multimodal tasks. However, existing benchmarks (Antol et al., 2015; Fu et al., 2023; Goyal et al., 2017) remain heavily Western-centric, limiting our understanding of how these models generalize to culturally diverse and multilingual settings. While some recent efforts partially cover this diversity (Romero et al., 2024; Nayak et al., 2024; Vayani et al., 2025), a systematic, large-scale benchmark capturing India-specific cultural concepts across multiple languages is still lacking. To address this gap, we introduce IndicVisionBench, a culturally grounded evaluation benchmark tailored for the Indian subcontinent.
To the best of our knowledge, this is the first large-scale benchmark explicitly designed to assess VLMs in the context of Indian culture and languages. We use states as a proxy for cultural groups following prior works (Adilazuarda et al., 2024; Nayak et al., 2024).
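The benchmark pairs each image with multiple-choice QA items replicated across languages, and models are scored per language to surface cultural and linguistic gaps. A minimal sketch of that evaluation shape is below; the record fields, key layout, and helper names are assumptions for illustration, not the benchmark's released schema or API.

```python
from dataclasses import dataclass

# Hypothetical layout for one IndicVisionBench QA item; the actual
# released schema may differ (all field names here are assumptions).
@dataclass
class QARecord:
    image_id: str
    topic: str            # one of the 13 culturally grounded topics
    language: str         # one of the 10 Indic languages
    question: str
    options: list[str]    # multiple-choice options
    answer_index: int     # index of the correct option

def accuracy_by_language(records, predictions):
    """Exact-match accuracy per language.

    `predictions` maps an (image_id, language) key to the option index
    a model chose; the key format is illustrative only.
    """
    correct, total = {}, {}
    for rec in records:
        key = (rec.image_id, rec.language)
        if key not in predictions:
            continue
        total[rec.language] = total.get(rec.language, 0) + 1
        if predictions[key] == rec.answer_index:
            correct[rec.language] = correct.get(rec.language, 0) + 1
    return {lang: correct.get(lang, 0) / n for lang, n in total.items()}

# Toy usage: the same question annotated in Hindi and Tamil
records = [
    QARecord("img1", "festivals", "hi", "Q?", ["a", "b"], 0),
    QARecord("img1", "festivals", "ta", "Q?", ["a", "b"], 0),
]
preds = {("img1", "hi"): 0, ("img1", "ta"): 1}
print(accuracy_by_language(records, preds))  # {'hi': 1.0, 'ta': 0.0}
```

Because the parallel corpus keeps the QA content fixed across languages, per-language accuracies computed this way isolate the linguistic component of a model's performance gap.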
Nov-10-2025
- Country:
- Asia
- China (0.04)
- India
- Madhya Pradesh (0.04)
- Chandigarh (0.04)
- Haryana (0.04)
- Gujarat (0.04)
- Jharkhand (0.04)
- Tripura (0.04)
- Tamil Nadu (0.04)
- Lakshadweep (0.04)
- Meghalaya (0.04)
- Himachal Pradesh (0.04)
- Maharashtra (0.04)
- Nagaland (0.04)
- Uttar Pradesh (0.04)
- Andhra Pradesh (0.04)
- Karnataka > Bengaluru (0.04)
- Chhattisgarh (0.04)
- Puducherry (0.04)
- Arunachal Pradesh (0.04)
- Uttarakhand (0.04)
- Mizoram (0.04)
- West Bengal (0.04)
- Rajasthan (0.04)
- Manipur (0.04)
- Telangana (0.04)
- Macao (0.04)
- Taiwan > Taiwan Province
- Taipei (0.04)
- Europe > Denmark
- Capital Region > Copenhagen (0.04)
- Genre:
- Research Report (1.00)
- Technology: