Multi-Modal Scene Graph with Kolmogorov-Arnold Experts for Audio-Visual Question Answering

Zijian Fu, Changsheng Lv, Mengshi Qi, Huadong Ma

arXiv.org Artificial Intelligence 

In this paper, we propose a novel Multi-Modal Scene Graph with Kolmogorov-Arnold Expert Network for Audio-Visual Question Answering (SHRIKE). The task aims to mimic human reasoning by extracting and fusing information from audio-visual scenes, with the main challenge being the identification of question-relevant cues in complex audio-visual content. Existing methods fail to capture the structural information within video and suffer from insufficient fine-grained modeling of multi-modal features. To address these issues, we are the first to introduce a new multi-modal scene graph that explicitly models objects and their relationships as a visually grounded, structured representation of the audio-visual scene. Furthermore, we design a Kolmogorov-Arnold Network (KAN)-based Mixture of Experts (MoE) to enhance the expressive power of the temporal integration stage. This enables more fine-grained modeling of cross-modal interactions within the question-aware fused audio-visual representation, capturing richer and more nuanced patterns and thereby improving temporal reasoning performance. We evaluate the model on the established MUSIC-AVQA and MUSIC-AVQA v2 benchmarks, where it achieves state-of-the-art performance. Code and model checkpoints will be publicly released.
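The abstract does not specify the architecture in detail, but the general idea of a KAN-based Mixture of Experts can be sketched. The following is a minimal NumPy illustration under stated assumptions: each "expert" is a toy KAN layer whose edges apply learned univariate functions (approximated here with a Gaussian RBF basis rather than the splines typically used in KANs), and a softmax gate mixes the expert outputs. All names, dimensions, and the basis choice are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def kan_edge(x, coeffs, centers, width=0.5):
    # Learned univariate function on one KAN edge: a weighted sum of
    # Gaussian RBF bases (a simple stand-in for learnable splines).
    basis = np.exp(-((x - centers) ** 2) / (2 * width ** 2))  # (n_basis,)
    return basis @ coeffs  # scalar

def kan_layer(x, coeffs, centers):
    # x: (d_in,); coeffs: (d_in, d_out, n_basis).
    # Each edge (i, j) has its own univariate function; outputs sum over i.
    d_in, d_out, _ = coeffs.shape
    out = np.zeros(d_out)
    for i in range(d_in):
        for j in range(d_out):
            out[j] += kan_edge(x[i], coeffs[i, j], centers)
    return out

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kan_moe(x, experts, gate_w, centers):
    # Mixture of KAN experts: a linear gate produces routing weights,
    # and the output is the convex combination of expert outputs.
    gates = softmax(gate_w @ x)                                # (n_experts,)
    outs = np.stack([kan_layer(x, c, centers) for c in experts])
    return gates @ outs                                        # (d_out,)

# Toy dimensions (hypothetical, for illustration only).
d_in, d_out, n_basis, n_experts = 4, 3, 8, 2
centers = np.linspace(-2.0, 2.0, n_basis)
experts = [rng.normal(size=(d_in, d_out, n_basis)) * 0.1
           for _ in range(n_experts)]
gate_w = rng.normal(size=(n_experts, d_in)) * 0.1

x = rng.normal(size=d_in)          # a fused audio-visual feature vector
y = kan_moe(x, experts, gate_w, centers)
print(y.shape)                     # (d_out,)
```

In a full model the input would be the question-aware fused audio-visual representation at each time step, and the expert parameters would be trained end to end; the sketch only shows the routing-and-mixing structure.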