TaiwanVQA: Benchmarking and Enhancing Cultural Understanding in Vision-Language Models

Neural Information Processing Systems 

Vision-language models (VLMs) often struggle with culturally specific content, a challenge largely overlooked by existing benchmarks that focus on dominant languages and globalized datasets. We introduce TAIWANVQA, a visual question answering (VQA) benchmark designed for Taiwanese culture to evaluate recognition and reasoning in regional contexts. TAIWANVQA contains 2,736 images and 5,472 manually curated questions covering topics such as traditional foods, public signs, festivals, and landmarks. The official benchmark set comprises 1,000 images and 2,000 questions for systematic assessment; the remaining data serves as training material. Evaluations of state-of-the-art VLMs reveal strong visual recognition but notable weaknesses in cultural reasoning.
