Text-Queried Audio Source Separation via Hierarchical Modeling
Yin, Xinlei; Peng, Xiulian; Jiang, Xue; Xiong, Zhiwei; Lu, Yan
arXiv.org Artificial Intelligence
Abstract: Target audio source separation with natural language queries presents a promising paradigm for extracting arbitrary audio events through arbitrary text descriptions. Existing methods mainly face two challenges: the difficulty in jointly modeling acoustic-textual alignment and semantic-aware separation within a blindly-learned single-stage architecture, and the reliance on large-scale accurately-labeled training data to compensate for inefficient cross-modal learning and separation. To address these challenges, we propose a hierarchical decomposition framework, HSM-TSS, that decouples the task into global-local semantic-guided feature separation and structure-preserving acoustic reconstruction. Our approach introduces a dual-stage mechanism for semantic separation, operating on distinct global and local semantic feature spaces. We first perform global-semantic separation through a global semantic feature space aligned with text queries. A Q-Audio architecture is employed to align audio and text modalities, serving as pre-trained global-semantic encoders. Conditioned on the predicted global feature, we then perform the second-stage local-semantic separation on AudioMAE features that preserve time-frequency structures, followed by semantic-to-acoustic reconstruction. We also split text queries into structured operations, extraction or removal, coupled with audio descriptions, enabling bidirectional sound manipulation. Our method achieves state-of-the-art separation performance with data-efficient training while maintaining superior semantic consistency with queries in complex auditory scenes.
Real-world environmental sounds typically comprise diverse audio events from multiple sources. Target sound separation, which isolates specific sound components from mixtures across domains like speech [1], [2], [3], general audio [4], and music [5], conventionally relies on single-source training samples and focuses on separating predefined source types [6].
Recent advances in universal sound separation (USS) [7] have expanded this capability to arbitrary sound sources in real-world recordings.
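The abstract's idea of splitting a text query into a structured operation (extraction or removal) plus an audio description can be illustrated with a small sketch. This is not the authors' parser: the abstract does not say how queries are decomposed, so the keyword lists, the `StructuredQuery` type, and the `parse_query` function below are hypothetical names chosen only for illustration, and the leading-verb rule is an assumed simplification.

```python
from dataclasses import dataclass

# Hypothetical keyword lists; the source does not specify how queries are parsed.
REMOVE_VERBS = ("remove", "suppress", "eliminate", "delete")
EXTRACT_VERBS = ("extract", "isolate", "separate", "keep")


@dataclass(frozen=True)
class StructuredQuery:
    operation: str    # either "extract" or "remove"
    description: str  # free-text description of the audio event


def parse_query(query: str) -> StructuredQuery:
    """Split a query into (operation, description) based on its leading verb."""
    text = query.strip()
    first, _, rest = text.partition(" ")
    verb = first.lower()
    rest = rest.strip()
    if verb in REMOVE_VERBS and rest:
        return StructuredQuery("remove", rest)
    if verb in EXTRACT_VERBS and rest:
        return StructuredQuery("extract", rest)
    # No recognised leading verb: assume the whole query describes a target to extract.
    return StructuredQuery("extract", text)
```

Under these assumptions, `parse_query("Remove the dog barking")` yields a removal operation paired with the description "the dog barking", while a bare description such as "a man speaking" defaults to extraction. A learned or grammar-based parser could replace the keyword rule without changing the structured output.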
Dec-3-2025
- Country:
- Asia (0.46)
- Genre:
- Research Report > New Finding (0.46)
- Industry:
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Technology: