FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs
Ahasan, Md Mubtasim; Khan, Rafat Hasan; Mohiuddin, Tasnim; Chadha, Aman; Iqbal, Tariq; Amin, M Ashraful; Ali, Amin Ahsan; Islam, Md Mofijul; Rahman, A K M Mahbubur
–arXiv.org Artificial Intelligence
Speech tokenization enables discrete representation and facilitates speech language modeling. However, existing neural codecs capture low-level acoustic features, overlooking the semantic and contextual cues inherent to human speech. While recent efforts introduced semantic representations from self-supervised speech models or incorporated contextual representations from pre-trained language models, challenges remain in aligning and unifying the semantic and contextual representations. We introduce FuseCodec, which unifies acoustic, semantic, and contextual representations through strong cross-modal alignment and globally informed supervision. We propose three complementary techniques: (i) Latent Representation Fusion, which integrates semantic and contextual features directly into the encoder latent space for robust and unified representation learning; (ii) Global Semantic-Contextual Supervision, which supervises discrete tokens with globally pooled and broadcast representations to enhance temporal consistency and cross-modal alignment; and (iii) Temporally Aligned Contextual Supervision, which strengthens alignment by dynamically matching contextual and speech tokens within a local window for fine-grained token-level supervision. We further introduce FuseCodec-TTS, demonstrating our methodology's applicability to zero-shot speech synthesis. Empirically, FuseCodec achieves state-of-the-art performance on LibriSpeech, surpassing EnCodec, SpeechTokenizer, and DAC in transcription accuracy, perceptual quality, intelligibility, and speaker similarity. The results highlight the effectiveness of semantically and contextually guided tokenization for speech representation and downstream tasks.

Tokenization is a cornerstone of natural language processing (NLP), enabling language models to represent text in discrete units for efficient autoregressive modeling and scalable downstream applications (Schmidt et al., 2024).
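The fusion and global-supervision ideas from the abstract can be sketched in a few lines. The following is a minimal NumPy illustration only: the tensor shapes, the simple averaging used as a stand-in for the fusion module, and the cosine alignment loss are all assumptions made for clarity, not the paper's actual architecture or objectives.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: T speech frames, D latent dimensions (illustrative only).
T, D = 50, 8
acoustic = rng.standard_normal((T, D))    # codec encoder latents
semantic = rng.standard_normal((T, D))    # e.g. from a self-supervised speech model
contextual = rng.standard_normal((T, D))  # e.g. from a pre-trained language model

# (i) Latent Representation Fusion: merge the streams in the encoder latent
# space. A plain average stands in here for the paper's learned fusion.
fused = (acoustic + semantic + contextual) / 3.0

# (ii) Global Semantic-Contextual Supervision: pool a guidance stream over
# time and broadcast the global vector back to every frame as a target.
def global_broadcast(x: np.ndarray) -> np.ndarray:
    return np.broadcast_to(x.mean(axis=0, keepdims=True), x.shape)

def cosine_alignment_loss(pred: np.ndarray, target: np.ndarray) -> float:
    # Mean (1 - cosine similarity) across frames; lower means better aligned.
    num = (pred * target).sum(axis=1)
    den = np.linalg.norm(pred, axis=1) * np.linalg.norm(target, axis=1) + 1e-8
    return float((1.0 - num / den).mean())

loss_semantic = cosine_alignment_loss(fused, global_broadcast(semantic))
loss_contextual = cosine_alignment_loss(fused, global_broadcast(contextual))
print(loss_semantic, loss_contextual)
```

In a trainable codec these losses would be added to the reconstruction and quantization objectives; the temporally aligned variant (iii) would instead match each speech token against contextual tokens inside a local window rather than a single global vector.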
Inspired by this paradigm, the speech domain has increasingly adopted neural codecs, popularized by EnCodec (Défossez et al., 2022) and SoundStream (Zeghidour et al., 2022). However, learning discrete speech representations is more challenging than for text due to the continuous and multidimensional nature of speech (Ju et al., 2024). While neural codecs learn acoustic representations (waveform and low-level signal characteristics), they struggle to capture high-level semantics, requiring downstream models to adopt additional self-supervised masked language objectives to derive semantic representations (phonetic content and linguistic meaning) (Borsos et al., 2023). Yet another fundamental aspect of human speech remains missing: speech is inherently grounded in context and surrounding cues (Brown et al., 2022).
Sep-30-2025
- Country:
  - Asia > Bangladesh (0.04)
  - Middle East > Qatar (0.04)
  - Asia > Thailand > Bangkok (0.04)
  - North America > United States > Florida > Miami-Dade County > Miami (0.04)
  - North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
  - North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
  - North America > United States > Virginia (0.04)
- Genre:
  - Research Report > New Finding (0.93)