Robustness assessment of large audio language models in multiple-choice evaluation
López, Fernando; Kesiraju, Santosh; Luque, Jordi
arXiv.org Artificial Intelligence
ABSTRACT Recent advances in large audio language models (LALMs) have primarily been assessed using a multiple-choice question answering (MCQA) framework. Existing MCQA frameworks do not account for variability in model responses under small changes to the input and report a single accuracy number per benchmark or category. We examine the MCQA evaluation framework and conduct a systematic study spanning three benchmarks (MMAU, MMAR and MMSU) and four models, including Audio Flamingo 2, Audio Flamingo 3, and Qwen2.5-Omni-7B-Instruct. Our findings indicate that models are sensitive not only to the ordering of choices, but also to the paraphrasing of the question and the choices. Finally, we propose a simpler evaluation protocol and metric that account for subtle variations and provide a more detailed evaluation report of LALMs within the MCQA framework.
Oct-7-2025
- Genre: Research Report > New Finding (0.88)
- Industry: Education (0.71)