Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge

Yang, Chao-Han Huck, Ghosh, Sreyan, Wang, Qing, Kim, Jaeyeon, Hong, Hengyi, Kumar, Sonal, Zhong, Guirui, Kong, Zhifeng, Sakshi, S, Lokegaonkar, Vaibhavi, Nieto, Oriol, Duraiswami, Ramani, Manocha, Dinesh, Kim, Gunhee, Du, Jun, Valle, Rafael, Catanzaro, Bryan

May-13-2025–arXiv.org Artificial Intelligence

We present Task 5 of the DCASE 2025 Challenge: an Audio Question Answering (AQA) benchmark spanning multiple domains of sound understanding. The task defines three QA subsets (Bioacoustics, Temporal Soundscapes, and Complex QA) to test audio-language models on interactive question answering over diverse acoustic scenes. We describe the dataset composition (from marine mammal calls to soundscapes and complex real-world clips), the evaluation protocol (top-1 accuracy with answer-shuffling robustness), and the baseline systems (Qwen2-Audio-7B, AudioFlamingo 2, Gemini-2-Flash). We compare preliminary results on the development set, which vary strongly across models and subsets. The challenge aims to advance the audio understanding and reasoning capabilities of audio-language models toward human-level acuity; these capabilities are crucial for enabling AI agents to perceive and interact with the world effectively.
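The evaluation protocol named above (top-1 accuracy with answer shuffling) can be illustrated with a minimal sketch. The abstract does not specify the exact procedure, so everything here is an assumption made for illustration: the function name `shuffled_top1_accuracy`, the `(question, choices, answer)` item layout, the `predict` callback signature, and the number of shuffles are hypothetical and are not taken from the challenge's scoring code.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical item layout: (question text, answer choices, correct choice text).
Item = Tuple[str, Sequence[str], str]


def shuffled_top1_accuracy(
    items: List[Item],
    predict: Callable[[str, List[str]], int],
    n_shuffles: int = 3,
    seed: int = 0,
) -> float:
    """Top-1 accuracy averaged over several random orderings of the choices.

    `predict(question, choices)` returns the index of the chosen answer.
    Re-ordering the choices penalizes a model that favors a fixed position
    rather than the content of the answer.
    """
    rng = random.Random(seed)
    correct = 0
    total = 0
    for question, choices, answer in items:
        for _ in range(n_shuffles):
            order = list(choices)
            rng.shuffle(order)  # new answer order for this trial
            picked = predict(question, order)
            correct += int(order[picked] == answer)
            total += 1
    return correct / total if total else 0.0


# Toy usage with made-up questions (not from the challenge data).
toy_items: List[Item] = [
    ("Which animal is calling?", ["whale", "dolphin", "seal", "bird"], "dolphin"),
    ("What happens first?", ["door slam", "speech", "siren", "rain"], "siren"),
]
answer_key = {q: a for q, _, a in toy_items}


def oracle(question: str, choices: List[str]) -> int:
    # Content-aware predictor: finds the right answer wherever it lands.
    return choices.index(answer_key[question])


def always_first(question: str, choices: List[str]) -> int:
    # Position-biased predictor: always picks the first option.
    return 0


oracle_acc = shuffled_top1_accuracy(toy_items, oracle)
biased_acc = shuffled_top1_accuracy(toy_items, always_first)
```

In this sketch a predictor that tracks answer content keeps its score under any ordering, while one that always picks the same position is correct only when the right answer happens to be shuffled into that position.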

audio-language model, large language model, machine learning



Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (0.71)
    - Question Answering (0.55)
  - Machine Learning > Neural Networks
    - Deep Learning (0.52)
