Multi-Modal Scene Graph with Kolmogorov-Arnold Experts for Audio-Visual Question Answering
Fu, Zijian, Lv, Changsheng, Qi, Mengshi, Ma, Huadong
arXiv.org Artificial Intelligence
In this paper, we propose a novel Multi-Modal Scene Graph with Kolmogorov-Arnold Expert Network for Audio-Visual Question Answering (SHRIKE). The task aims to mimic human reasoning by extracting and fusing information from audio-visual scenes, with the main challenge being the identification of question-relevant cues in complex audio-visual content. Existing methods fail to capture the structural information within video and suffer from insufficient fine-grained modeling of multi-modal features. To address these issues, we are the first to introduce a multi-modal scene graph that explicitly models objects and their relationships as a visually grounded, structured representation of the audio-visual scene. Furthermore, we design a Kolmogorov-Arnold Network (KAN)-based Mixture of Experts (MoE) to enhance the expressive power of the temporal integration stage. This enables more fine-grained modeling of cross-modal interactions within the question-aware fused audio-visual representation, capturing richer and more nuanced patterns and thereby improving temporal reasoning performance. We evaluate the model on the established MUSIC-AVQA and MUSIC-AVQA v2 benchmarks, where it achieves state-of-the-art performance. Code and model checkpoints will be publicly released.
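The abstract does not describe the internals of the KAN-based MoE, so the following is only a generic sketch of the idea, not the SHRIKE implementation. It assumes a dense softmax gate over experts, where each expert is a single KAN-style layer computing y_j = sum_i phi_ij(x_i). For brevity, each learnable univariate function phi_ij is a weighted sum of fixed Gaussian radial basis functions rather than the B-splines typically used in KANs. All class names, sizes, and initializations here are hypothetical.

```python
import numpy as np


class KANExpert:
    """One KAN-style layer: y_j = sum_i phi_ij(x_i), with phi_ij learnable 1-D functions."""

    def __init__(self, d_in, d_out, n_basis=8, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        # Fixed Gaussian RBF grid (an assumption; KANs usually use B-splines).
        self.centers = np.linspace(-2.0, 2.0, n_basis)
        self.width = self.centers[1] - self.centers[0]
        # coef[i, k, j]: weight of basis k on the edge from input i to output j.
        self.coef = rng.normal(0.0, 0.1, size=(d_in, n_basis, d_out))

    def __call__(self, x):
        # x: (batch, d_in) -> basis activations: (batch, d_in, n_basis)
        basis = np.exp(-(((x[..., None] - self.centers) / self.width) ** 2))
        # Sum the per-edge univariate functions over inputs and basis terms.
        return np.einsum("bik,ikj->bj", basis, self.coef)


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


class KANMoE:
    """Dense mixture of KAN-style experts with a linear softmax gate (hypothetical design)."""

    def __init__(self, d_in, d_out, n_experts=4, seed=0):
        rng = np.random.default_rng(seed)
        self.experts = [KANExpert(d_in, d_out, rng=rng) for _ in range(n_experts)]
        self.gate_w = rng.normal(0.0, 0.1, size=(d_in, n_experts))

    def gate(self, x):
        # Per-sample mixing weights over experts: (batch, n_experts)
        return softmax(x @ self.gate_w)

    def __call__(self, x):
        g = self.gate(x)
        outs = np.stack([expert(x) for expert in self.experts], axis=1)  # (batch, E, d_out)
        # Gate-weighted sum of expert outputs.
        return np.einsum("be,bej->bj", g, outs)


if __name__ == "__main__":
    # Stand-in for a batch of fused audio-visual features (random data, for shape checking only).
    fused = np.random.default_rng(1).normal(size=(5, 16))
    moe = KANMoE(d_in=16, d_out=16)
    print(moe(fused).shape)
```

The weights above are random and untrained; in practice the basis coefficients and gate would be learned end to end, and a sparse top-k gate could replace the dense one.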
Dec-1-2025
- Genre:
  - Research Report (1.00)
- Industry:
  - Leisure & Entertainment (0.68)
  - Media > Music (0.47)
- Technology:
  - Information Technology > Artificial Intelligence
    - Cognitive Science (1.00)
    - Machine Learning > Neural Networks (1.00)
    - Natural Language > Question Answering (0.72)
    - Representation & Reasoning (1.00)
    - Vision (1.00)