DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models

Heng-Jui Chang, Hongyu Gong, Changhan Wang, James Glass, Yu-An Chung

arXiv.org Artificial Intelligence 

Spoken language models (SLMs) have gained increasing attention alongside advancements in text-based, decoder-only language models. This paper presents Double-Codebook Speaker-invariant Clustering (DC-Spin), which aims to improve speech tokenization by bridging audio signals and SLM tokens. We propose a chunk-wise approach that makes DC-Spin streamable without retraining or performance degradation. Comparisons of tokenization methods (self-supervised and neural audio codecs), model scalability, and downstream task proxies show that tokens that are easily modeled by an n-gram LM or well aligned with phonemes offer strong performance, providing insights for designing speech tokenizers for SLMs.

Spoken language models (SLMs) and related applications have gained growing interest with the advancements of large language models (LLMs) and audio tokenization techniques (Wu et al., 2024). These speech LMs resemble causal LMs in natural language processing, but SLMs take speech and, optionally, text as input and generate speech ...
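The abstract's n-gram proxy can be made concrete: if a tokenizer produces discrete sequences that a simple n-gram LM predicts well (low perplexity), the paper suggests those tokens are good candidates for SLM training. The sketch below is illustrative only; the function name, smoothing choice (add-one), and parameters are assumptions, not the paper's actual evaluation code.

```python
from collections import Counter
import math

def ngram_perplexity(sequences, n=2, vocab_size=500):
    """Estimate how predictable discrete speech-token sequences are
    with an add-one-smoothed n-gram LM (lower perplexity = tokens are
    easier to model, the property the paper correlates with SLM quality)."""
    context_counts = Counter()
    ngram_counts = Counter()
    # Count n-grams over all sequences, padding with a start symbol (-1).
    for seq in sequences:
        padded = [-1] * (n - 1) + list(seq)
        for i in range(len(seq)):
            ctx = tuple(padded[i:i + n - 1])
            ngram_counts[ctx + (padded[i + n - 1],)] += 1
            context_counts[ctx] += 1
    # Score the same data under the smoothed model.
    log_prob, count = 0.0, 0
    for seq in sequences:
        padded = [-1] * (n - 1) + list(seq)
        for i in range(len(seq)):
            ctx = tuple(padded[i:i + n - 1])
            gram = ctx + (padded[i + n - 1],)
            p = (ngram_counts[gram] + 1) / (context_counts[ctx] + vocab_size)
            log_prob += math.log(p)
            count += 1
    return math.exp(-log_prob / count)
```

As a sanity check, a highly repetitive token stream (e.g., alternating two units) scores far lower perplexity than uniformly random tokens over the same vocabulary, matching the intuition that phoneme-like, structured token sequences are "easily modeled by an n-gram LM."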