Text-Queried Audio Source Separation via Hierarchical Modeling

Yin, Xinlei, Peng, Xiulian, Jiang, Xue, Xiong, Zhiwei, Lu, Yan

Dec-3-2025–arXiv.org Artificial Intelligence

Abstract: Target audio source separation with natural language queries presents a promising paradigm for extracting arbitrary audio events through arbitrary text descriptions. Existing methods mainly face two challenges: the difficulty of jointly modeling acoustic-textual alignment and semantic-aware separation within a blindly-learned single-stage architecture, and the reliance on large-scale, accurately labeled training data to compensate for inefficient cross-modal learning and separation. To address these challenges, we propose a hierarchical decomposition framework, HSM-TSS, that decouples the task into global-local semantic-guided feature separation and structure-preserving acoustic reconstruction. Our approach introduces a dual-stage mechanism for semantic separation, operating on distinct global and local semantic feature spaces. We first perform global-semantic separation through a global semantic feature space aligned with text queries. A Q-Audio architecture is employed to align audio and text modalities, serving as pre-trained global-semantic encoders. Conditioned on the predicted global feature, we then perform the second-stage local-semantic separation on AudioMAE features that preserve time-frequency structures, followed by semantic-to-acoustic reconstruction. We also split text queries into structured operations, extraction or removal, coupled with audio descriptions, enabling bidirectional sound manipulation. Our method achieves state-of-the-art separation performance with data-efficient training while maintaining superior semantic consistency with queries in complex auditory scenes.

Real-world environmental sounds typically comprise diverse audio events from multiple sources. Target sound separation, which isolates specific sound components from mixtures across domains like speech [1], [2], [3], general audio [4], and music [5], conventionally relies on single-source training samples and focuses on separating predefined source types [6]. Recent advances in universal sound separation (USS) [7] have expanded this capability to arbitrary sound sources in real-world recordings.
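
The structured query format mentioned in the abstract (an operation, either extraction or removal, paired with an audio description) can be pictured with a small sketch. Everything below is an illustrative assumption: the names Operation, StructuredQuery, and parse_query, as well as the prefix-matching rule, are hypothetical, and the text above does not say how HSM-TSS actually splits queries.

```python
from dataclasses import dataclass
from enum import Enum


class Operation(Enum):
    """The two operations named in the abstract."""
    EXTRACT = "extract"
    REMOVE = "remove"


@dataclass(frozen=True)
class StructuredQuery:
    """Hypothetical container: an operation plus an audio description."""
    operation: Operation
    description: str


# Assumed trigger phrases for illustration only; not taken from the paper.
_REMOVE_PREFIXES = ("remove ", "suppress ", "filter out ")
_EXTRACT_PREFIXES = ("extract ", "separate ", "isolate ")


def parse_query(text: str) -> StructuredQuery:
    """Split a free-form query into (operation, description) by prefix matching."""
    cleaned = text.strip()
    lowered = cleaned.lower()
    for prefix in _REMOVE_PREFIXES:
        if lowered.startswith(prefix):
            return StructuredQuery(Operation.REMOVE, cleaned[len(prefix):].strip())
    for prefix in _EXTRACT_PREFIXES:
        if lowered.startswith(prefix):
            return StructuredQuery(Operation.EXTRACT, cleaned[len(prefix):].strip())
    # A bare description is treated as an extraction request (an assumption).
    return StructuredQuery(Operation.EXTRACT, cleaned)
```

Under these assumptions, parse_query("Remove the dog barking") yields a removal operation with the description "the dog barking", while a bare description such as "a man speaking" defaults to extraction. Representing both directions with one structure is what would let a single separation model serve extraction and removal requests.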

large language model, machine learning, natural language


