Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation
Yuhui Zhang, Yuchang Su, Yiming Liu, Xiaohan Wang, James Burgess, Elaine Sui, Chenyu Wang, Josiah Aklilu, Alejandro Lozano, Anjiang Wei, Ludwig Schmidt, Serena Yeung-Levy
arXiv.org Artificial Intelligence
The rapid development of vision language models (VLMs) demands rigorous and reliable evaluation. However, current visual question answering (VQA) benchmarks often depend on open-ended questions, making accurate evaluation difficult due to the variability in natural language responses. To address this, we introduce AutoConverter, an agentic framework that automatically converts these open-ended questions into multiple-choice format, enabling objective evaluation while reducing the cost of question creation. Our experiments show that AutoConverter generates correct and challenging multiple-choice questions, with VLMs achieving consistently similar or lower accuracy on these questions than on human-created ones. Using AutoConverter, we construct VMCBench, a benchmark created by transforming 20 existing VQA datasets into a unified multiple-choice format, totaling 9,018 questions. We comprehensively evaluate 33 state-of-the-art VLMs on VMCBench, setting a new standard for scalable, consistent, and reproducible VLM evaluation.
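The abstract describes the conversion only at a high level. As a rough illustration, the sketch below shows one plausible shape such a conversion step could take: an LLM agent proposes distractors for an open-ended question-answer pair, and the options are shuffled into a multiple-choice item. Every name here (convert_to_multiple_choice, propose_distractors, MCQuestion) is a hypothetical assumption for illustration, not AutoConverter's actual interface; the paper's framework additionally uses agents to critique and refine distractors.

```python
# Minimal sketch of an open-ended-to-multiple-choice conversion step.
# Hypothetical names throughout; not the paper's actual implementation.
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MCQuestion:
    question: str
    choices: List[str]   # shuffled options, correct answer included
    answer_index: int    # position of the correct answer in `choices`


def convert_to_multiple_choice(
    question: str,
    correct_answer: str,
    propose_distractors: Callable[[str, str, int], List[str]],
    n_distractors: int = 3,
    seed: int = 0,
) -> MCQuestion:
    """Turn one open-ended QA pair into a multiple-choice item.

    `propose_distractors` stands in for an LLM agent that drafts
    plausible-but-wrong options.
    """
    candidates = propose_distractors(question, correct_answer, n_distractors)
    # Keep only distractors that do not collide with the correct answer.
    distractors = [
        c for c in candidates
        if c.strip().lower() != correct_answer.strip().lower()
    ][:n_distractors]
    choices = distractors + [correct_answer]
    random.Random(seed).shuffle(choices)
    return MCQuestion(question, choices, choices.index(correct_answer))


# Toy distractor proposer standing in for an LLM agent.
def toy_proposer(question: str, answer: str, n: int) -> List[str]:
    return ["a cat", "a truck", "a tree"][:n]


item = convert_to_multiple_choice("What is in the image?", "a dog", toy_proposer)
print(item.choices, item.choices[item.answer_index])
```

A seeded shuffle keeps the option order reproducible across evaluation runs, which matters for the consistent benchmarking the abstract emphasizes.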
Jan 6, 2025
- Country:
  - Asia > India (1.00)
  - North America > United States (0.92)
- Genre:
  - Research Report > New Finding (1.00)
- Industry:
  - Education (1.00)
  - Leisure & Entertainment (0.92)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning > Neural Networks
      - Deep Learning (0.96)
    - Natural Language
      - Chatbot (1.00)
      - Large Language Model (1.00)
    - Representation & Reasoning > Agents (0.68)
    - Vision (1.00)