DUO-TOK: Dual-Track Semantic Music Tokenizer for Vocal-Accompaniment Generation

Lin, Rui, Wu, Zhiyue, Le, Jiahe, Wang, Kangdi, Chen, Weixiong, Dai, Junyu, Jiang, Tao

Nov-26-2025–arXiv.org Artificial Intelligence

Duo-Tok is a source-aware dual-codebook tokenizer for vocal-accompaniment music that targets the growing tension between reconstruction quality and language-model (LM) learnability in modern lyrics-to-song systems. Existing codecs either prioritize high-fidelity reconstruction with difficult-to-model acoustic tokens or compress aggressively into semantic tokens that are LM-friendly but lossy, and they rarely make the tokenizer itself aware of dual-track structure. Duo-Tok follows a four-stage, SSL-centered pipeline: we first pretrain a BEST-RQ-style encoder on large-scale audio, then stabilize and factorize the representation with Gaussian replacement noise and multi-task supervision, before freezing the encoder to learn SimVQ-based dual codebooks with hard routing for vocals and accompaniment, and finally training latent diffusion decoders on top of the discrete tokens. Duo-Tok at 0.75 kbps shifts the empirical reconstruction-generation Pareto frontier, achieving the best music-tagging AP and the lowest vocabulary-normalized LM perplexity among compared codecs while maintaining reconstruction quality comparable to state-of-the-art music tokenizers.

arxiv preprint arxiv, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

Nov-26-2025

arXiv.org PDF

Add feedback

Genre:
- Research Report (0.68)

Industry:
- Media > Music (0.94)
- Leisure & Entertainment (0.94)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (1.00)
  - Machine Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found