Typing to Listen at the Cocktail Party: Text-Guided Target Speaker Extraction

Hao, Xiang; Wu, Jibin; Yu, Jianwei; Xu, Chenglin; Tan, Kay Chen

Oct-14-2023–arXiv.org Artificial Intelligence

Humans possess an extraordinary ability to selectively focus on the sound source of interest amidst complex acoustic environments, commonly referred to as cocktail party scenarios. In an attempt to replicate this auditory attention capability in machines, target speaker extraction (TSE) models have been developed. These models leverage pre-registered cues of the target speaker to extract the sound source of interest. However, their effectiveness is hindered in real-world scenarios because pre-registered cues are often unreliable or absent altogether. To address this limitation, this study investigates the integration of natural language descriptions to enhance the feasibility, controllability, and performance of existing TSE models. Specifically, we propose a model named LLM-TSE, wherein a large language model (LLM) extracts useful semantic cues from the user's typed text input. These cues can serve as independent extraction cues, as task selectors that control the TSE process, or as a complement to the pre-registered cues. Our experimental results demonstrate competitive performance when only text-based cues are presented, the effectiveness of using input text as a task selector, and a new state-of-the-art when combining text-based cues with pre-registered cues. To our knowledge, this is the first study to successfully incorporate LLMs to guide target speaker extraction, which can be a cornerstone for cocktail party problem research. Demos are provided at https://github.com/haoxiangsnr/llm-tse.

[...] Colin, 1953) - a term coined to describe a scenario where multiple sound sources are engaged in simultaneous conversation, yet a listener can selectively concentrate on a single sound source. This scenario represents a complex challenge in auditory perception (Haykin & Chen, 2005; Mesgarani & Chang, 2012; Bizley & Cohen, 2013) and serves as a remarkable demonstration of the intricate sound processing that occurs within the human auditory system.
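The abstract names three roles for typed text: an independent extraction cue, a task selector, and a complement to a pre-registered cue. The toy sketch below illustrates how such roles could fit together. It is not the LLM-TSE implementation. Every name in it (`fuse_cues`, `apply_task`), the concatenation-based fusion, the zero-filling of a missing cue, and the "extract"/"remove" task labels are assumptions made purely for illustration.

```python
from typing import List, Optional, Sequence


def fuse_cues(text_cue: Optional[Sequence[float]],
              enrolled_cue: Optional[Sequence[float]],
              dim: int) -> List[float]:
    """Hypothetical fusion: concatenate the text cue and the pre-registered
    cue, zero-filling whichever one is missing (an assumption, not the
    paper's method)."""
    if text_cue is None and enrolled_cue is None:
        raise ValueError("at least one cue is required")
    zeros = [0.0] * dim
    text = list(text_cue) if text_cue is not None else zeros
    enrolled = list(enrolled_cue) if enrolled_cue is not None else zeros
    if len(text) != dim or len(enrolled) != dim:
        raise ValueError("every cue must have length dim")
    return text + enrolled


def apply_task(mixture: Sequence[float],
               target_mask: Sequence[float],
               task: str) -> List[float]:
    """Hypothetical task selector: keep the target ("extract") or suppress
    it ("remove") by applying a per-sample mask with values in [0, 1]."""
    if len(mixture) != len(target_mask):
        raise ValueError("mixture and mask must have the same length")
    if task == "extract":
        return [m * x for m, x in zip(target_mask, mixture)]
    if task == "remove":
        return [(1.0 - m) * x for m, x in zip(target_mask, mixture)]
    raise ValueError(f"unknown task: {task!r}")
```

With only a text cue, `fuse_cues([1.0, 2.0], None, dim=2)` still yields a fixed-size conditioning vector, which mirrors the case where no pre-registered cue is available. In a real system the mask would be predicted by a neural extractor conditioned on that vector, rather than supplied by hand.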

Country:
- Oceania > Australia
  - Queensland > Brisbane (0.04)
- North America
  - United States
    - Virginia (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
  - Canada > Quebec
    - Montreal (0.04)
- Europe > United Kingdom
  - England > East Sussex > Brighton (0.04)
- Asia
  - India (0.04)
  - China
    - Shanghai > Shanghai (0.04)
    - Hong Kong (0.04)

Genre:
- Research Report > New Finding (0.66)

Industry:
- Media (0.68)
- Leisure & Entertainment (0.68)
- Health & Medicine > Consumer Health (0.34)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
