TaiwanVQA: Benchmarking and Enhancing Cultural Understanding in Vision-Language Models
Neural Information Processing Systems
Vision-language models (VLMs) often struggle with culturally specific content, a challenge largely overlooked by existing benchmarks that focus on dominant languages and globalized datasets. We introduce TaiwanVQA, a visual question answering (VQA) benchmark designed for Taiwanese culture to evaluate recognition and reasoning in regional contexts. TaiwanVQA contains 2,736 images and 5,472 manually curated questions covering topics such as traditional foods, public signs, festivals, and landmarks. The official benchmark set includes 1,000 images and 2,000 questions for systematic assessment; the remaining 1,736 images and 3,472 questions serve as training material. Evaluations of state-of-the-art VLMs reveal strong visual recognition but notable weaknesses in cultural reasoning.
Jun-15-2026, 10:16:45 GMT
- Genre:
- Research Report
- Experimental Study (1.00)
- New Finding (0.67)
- Industry:
- Leisure & Entertainment (1.00)
- Information Technology > Security & Privacy (1.00)
- Media (0.92)
- Law (0.67)
- Technology: