MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
–Neural Information Processing Systems
We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across massive multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error correction and quality checks. Unlike existing benchmarks that are limited to specific domains of sound, music, or speech, MMAR extends coverage to a broad spectrum of real-world audio scenarios, including mixed-modality combinations of sound, music, and speech. Each question in MMAR is hierarchically categorized across four reasoning layers: Signal, Perception, Semantic, and Cultural, with additional sub-categories within each layer to reflect task diversity and complexity. To foster further research in audio reasoning, we annotate every question with a Chain-of-Thought (CoT) rationale.
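As a rough illustration of the structure described above, one benchmark item and a per-layer accuracy tally might be sketched as follows. The class name `MMARItem`, its field names, the function `accuracy_by_layer`, and the case-insensitive exact-match scoring are all assumptions made for this sketch; the abstract does not specify the released data format or the evaluation protocol.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four reasoning layers named in the abstract.
LAYERS = ("Signal", "Perception", "Semantic", "Cultural")


@dataclass(frozen=True)
class MMARItem:
    """One audio-question-answer triplet (hypothetical field names)."""
    audio_path: str      # location of the audio clip
    question: str        # question about the clip
    answer: str          # reference answer
    layer: str           # one of LAYERS
    sub_category: str    # finer-grained label within the layer
    cot_rationale: str   # annotated Chain-of-Thought rationale


def accuracy_by_layer(items, predictions):
    """Return {layer: accuracy} for every layer that has at least one item.

    Scoring is case-insensitive exact match, an assumption for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item.layer] += 1
        if pred.strip().lower() == item.answer.strip().lower():
            correct[item.layer] += 1
    return {layer: correct[layer] / total[layer]
            for layer in LAYERS if total[layer]}
```

Grouping results by layer in this way would let a reader see whether a model's errors concentrate in low-level (Signal, Perception) or high-level (Semantic, Cultural) reasoning.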
Jun-17-2026, 17:56:35 GMT
- Country:
- Asia > China (0.28)
- North America > United States (0.28)
- Genre:
- Research Report
- New Finding (1.00)
- Experimental Study (1.00)
- Industry:
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Information Technology > Security & Privacy (0.92)
- Education (0.68)
- Technology:
- Information Technology
- Data Science (1.00)
- Artificial Intelligence
- Representation & Reasoning (1.00)
- Speech > Speech Recognition (0.93)
- Cognitive Science > Problem Solving (0.66)
- Natural Language
- Large Language Model (1.00)
- Chatbot (0.70)
- Text Processing (0.67)
- Machine Learning
- Neural Networks > Deep Learning (1.00)
- Performance Analysis > Accuracy (0.68)