Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs
Chowdhury, Sanjoy, Gani, Hanan, Anand, Nishit, Nag, Sayan, Gao, Ruohan, Elhoseiny, Mohamed, Khan, Salman, Manocha, Dinesh
arXiv.org Artificial Intelligence
Recent advancements in reasoning optimization have greatly enhanced the performance of large language models (LLMs). However, existing work fails to address the complexities of audio-visual scenarios, underscoring the need for further research. In this paper, we introduce AURELIA, a novel actor-critic based audio-visual (AV) reasoning framework that distills structured, step-by-step reasoning into AVLLMs at test time, improving their ability to process complex multi-modal inputs without additional training or fine-tuning. To further advance AVLLM reasoning skills, we present AVReasonBench, a challenging benchmark comprising 4500 audio-visual questions, each paired with detailed step-by-step reasoning. Our benchmark spans six distinct tasks, including AV-GeoIQ, which evaluates AV reasoning combined with geographical and cultural knowledge. Evaluating 18 AVLLMs on AVReasonBench reveals significant limitations in their multi-modal reasoning capabilities. Using AURELIA, we achieve up to a 100% relative improvement, demonstrating its effectiveness. This performance gain highlights the potential of reasoning-enhanced data generation for advancing AVLLMs in real-world applications. Our code and data will be publicly released at: https://github.com/schowdhury671/aurelia.
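The abstract describes AURELIA as an actor-critic loop that distills step-by-step reasoning into a frozen AVLLM at inference time, with no weight updates. The sketch below is a rough illustration of that idea, not the authors' implementation: it assumes a hypothetical `actor` that drafts reasoning steps from audio/video context, a `critic` that scores a chain, and a frozen `avllm` that consumes the accepted reasoning; all function names, prompts, and thresholds are illustrative.

```python
# Hypothetical sketch of a test-time actor-critic reasoning-distillation loop.
# All names and prompts are illustrative assumptions, not the AURELIA codebase.
from dataclasses import dataclass
from typing import Callable


@dataclass
class ReasoningTrace:
    steps: list[str]
    score: float


def distill_reasoning(
    question: str,
    audio_caption: str,
    video_caption: str,
    actor: Callable[[str], list[str]],          # proposes step-by-step reasoning
    critic: Callable[[str, list[str]], float],  # scores a reasoning chain in [0, 1]
    max_rounds: int = 3,
    accept_threshold: float = 0.8,
) -> ReasoningTrace:
    """Iteratively refine a reasoning chain at test time; no model fine-tuning."""
    context = (
        f"Question: {question}\n"
        f"Audio: {audio_caption}\n"
        f"Video: {video_caption}"
    )
    best = ReasoningTrace(steps=[], score=0.0)
    prompt = context
    for _ in range(max_rounds):
        steps = actor(prompt)               # actor drafts candidate reasoning steps
        score = critic(context, steps)      # critic judges quality/faithfulness
        if score > best.score:
            best = ReasoningTrace(steps, score)
        if score >= accept_threshold:
            break
        # Feed the critic's verdict back so the actor can revise next round.
        prompt = f"{context}\nPrevious attempt scored {score:.2f}; revise the reasoning."
    return best


def answer_with_distilled_reasoning(
    avllm: Callable[[str], str], question: str, trace: ReasoningTrace
) -> str:
    """Inject the accepted reasoning into the frozen AVLLM's prompt."""
    reasoning = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(trace.steps))
    return avllm(f"{question}\nUse this reasoning:\n{reasoning}\nAnswer:")
```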
Mar-29-2025