SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models
Wanqi Yang, Yanda Li, Yunchao Wei, Meng Fang, Ling Chen
Large audio-language models (LALMs) have achieved near-human performance in sentence-level transcription and emotion recognition. However, existing evaluations focus mainly on surface-level perception, leaving models' capacity for contextual, inference-driven reasoning in speech-based scenarios largely unexamined. To address this gap, we introduce SpeechR, a unified benchmark for evaluating reasoning over speech in LALMs. SpeechR evaluates models along three key dimensions: factual retrieval, procedural inference, and normative judgment. It includes three distinct evaluation formats: the multiple-choice version measures answer-selection accuracy; the generative version assesses the coherence and logical consistency of reasoning chains; and the acoustic-feature version investigates whether variations in stress and emotion affect reasoning performance. Evaluations of eleven state-of-the-art LALMs reveal that high transcription accuracy does not translate into strong reasoning capability. SpeechR establishes a structured benchmark for evaluating reasoning in spoken language, enabling more targeted analysis of model capabilities across diverse dialogue-based tasks.
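To make the multiple-choice format concrete, here is a minimal scoring sketch. The item schema (`audio_path`, `question`, `choices`, `answer`) and the `predict` callback are illustrative assumptions, not the released SpeechR interface:

```python
# Hypothetical sketch of scoring the SpeechR multiple-choice format.
# Field names and the predict() signature are assumptions for illustration,
# not the official benchmark schema.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SpeechRItem:
    audio_path: str        # spoken scenario the model must reason over
    question: str          # reasoning question about the scenario
    choices: List[str]     # candidate answer labels, e.g. ["A", "B", "C", "D"]
    answer: str            # gold choice label


def multiple_choice_accuracy(
    items: List[SpeechRItem],
    predict: Callable[[str, str, List[str]], str],
) -> float:
    """Fraction of items where the model's selected choice matches the gold label."""
    if not items:
        return 0.0
    correct = sum(
        predict(item.audio_path, item.question, item.choices) == item.answer
        for item in items
    )
    return correct / len(items)
```

The generative and acoustic-feature tracks would need different scorers (e.g. judging reasoning-chain coherence, or comparing accuracy across prosodic variants of the same item), which this sketch does not cover.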
arXiv.org Artificial Intelligence
Aug-5-2025
- Genre:
  - Research Report (0.82)
- Industry:
  - Education (1.00)
- Technology:
  - Information Technology > Artificial Intelligence
  - Cognitive Science > Problem Solving (0.66)
  - Machine Learning > Neural Networks > Deep Learning (0.50)
  - Natural Language > Chatbot (0.72)
  - Natural Language > Large Language Model (1.00)
  - Speech > Speech Recognition (1.00)