Robustness assessment of large audio language models in multiple-choice evaluation

López, Fernando, Kesiraju, Santosh, Luque, Jordi

Oct-7-2025–arXiv.org Artificial Intelligence

ABSTRACT Recent advances in large audio language models (LALMs) have primarily been assessed using a multiple-choice question answering (MCQA) framework. Existing MCQA frameworks report a single accuracy number per benchmark or category and do not account for the variability in results that subtle changes to the input can cause. We examine the MCQA evaluation framework in a systematic study spanning three benchmarks (MMAU, MMAR and MMSU) and four models, including Audio Flamingo 2, Audio Flamingo 3 and Qwen2.5-Omni-7B-Instruct. Our findings indicate that models are sensitive not only to the ordering of choices, but also to the paraphrasing of the question and the choices. Finally, we propose a simpler evaluation protocol and metric that account for subtle variations and provide a more detailed evaluation report of LALMs within the MCQA framework.
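The sensitivity to choice ordering described in the abstract can be illustrated with a small sketch. This is not the protocol or metric proposed in the paper, which the abstract does not specify; it only shows one plausible way to report accuracy per ordering instead of a single number. The `Predictor` interface, the toy `always_first` model, and the two text-only items are hypothetical, and a real LALM would also receive audio.

```python
import itertools
import statistics
from typing import Callable, List, Sequence, Tuple

# Hypothetical model interface (an assumption for illustration): takes a
# question and an ordered list of choices, returns the index it selects.
Predictor = Callable[[str, Sequence[str]], int]
Item = Tuple[str, Sequence[str], str]  # (question, choices, correct choice text)


def accuracy_over_orderings(predict: Predictor, items: Sequence[Item]) -> List[float]:
    """Return one accuracy value per permutation of the choice order.

    All items are assumed to have the same number of choices.
    """
    n_choices = len(items[0][1])
    accuracies = []
    for perm in itertools.permutations(range(n_choices)):
        correct = 0
        for question, choices, answer in items:
            shuffled = [choices[i] for i in perm]
            picked = predict(question, shuffled)
            correct += shuffled[picked] == answer
        accuracies.append(correct / len(items))
    return accuracies


def always_first(question: str, choices: Sequence[str]) -> int:
    """Toy position-biased predictor: always selects the first option."""
    return 0


# Made-up items for illustration only; not drawn from MMAU, MMAR or MMSU.
items = [
    ("Which instrument is playing?", ["piano", "violin", "flute"], "piano"),
    ("What is the speaker's emotion?", ["happy", "sad", "angry"], "sad"),
]

accs = accuracy_over_orderings(always_first, items)
mean_acc = statistics.mean(accs)   # average accuracy across orderings
spread = statistics.pstdev(accs)   # variability hidden by a single number
print(f"per-ordering accuracy: {accs}")
print(f"mean={mean_acc:.3f} std={spread:.3f} min={min(accs)} max={max(accs)}")
```

With a position-biased predictor, accuracy depends entirely on where the correct answer happens to land, so reporting the mean together with the spread (or the minimum and maximum) exposes what a single accuracy figure hides. Enumerating every permutation is only feasible for a small number of choices; sampling a fixed set of orderings would be the usual substitute.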

artificial intelligence, benchmark, natural language


Country:
- North America > United States
  - New Mexico (0.14)
- Europe
  - Italy (0.28)
  - Austria (0.28)

Genre:
- Research Report > New Finding (0.88)

Industry:
- Education (0.71)

Technology:
- Information Technology > Artificial Intelligence > Natural Language (1.00)