Scaling Speech-Text Pre-training with Synthetic Interleaved Data
Aohan Zeng, Zhengxiao Du, Mingdao Liu, Lei Zhang, Shengmin Jiang, Yuxiao Dong, Jie Tang
– arXiv.org Artificial Intelligence, Dec 2, 2024
Speech language models (SpeechLMs) accept speech input and produce speech output, allowing for more natural human-computer interaction than text-based large language models (LLMs). Traditional approaches to developing SpeechLMs are constrained by the limited availability of unsupervised speech data and parallel speech-text data, which are significantly less abundant than text pre-training data, thereby limiting their scalability as LLMs. We propose a novel approach to scaling speech-text pre-training by leveraging large-scale synthetic interleaved data derived from text corpora, eliminating the need for parallel speech-text datasets. Our method efficiently constructs speech-text interleaved data by sampling text spans from existing text corpora and synthesizing the corresponding speech spans with a text-to-token model, bypassing the need to generate actual speech. We also employ a supervised speech tokenizer derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. Starting from a pre-trained language model and scaling our pre-training to 1 trillion tokens (with 600B tokens of synthetic interleaved speech-text data), we achieve state-of-the-art performance in speech language modeling and spoken question answering, improving accuracy on spoken question answering from the previous SOTA of 13% (Moshi) to 31%. We further demonstrate that by fine-tuning the pre-trained model with speech dialogue data, we can develop an end-to-end spoken chatbot whose conversational abilities and speech quality are competitive with existing baselines, even when operating exclusively in the speech domain.

Figure 1: (Left) Spoken QA performance continuously improves as the amount of synthetic interleaved data increases, significantly surpassing the previous SOTA (Moshi).

Work was done when ML and LZ interned at Zhipu.AI.

Large language models (LLMs) have significantly advanced natural language processing, demonstrating capabilities beyond traditional language tasks. Trained on vast internet corpora, they exhibit emergent abilities such as instruction following (Ouyang et al., 2022), logical reasoning (Wei et al., 2022), and tool utilization (Schick et al., 2023). These advancements have enabled applications like interactive chatbots and personalized digital assistants. However, an ideal AI assistant should not rely solely on text; voice-based interaction offers a more natural and intuitive interface for human-AI interaction. Traditional voice-based systems combine automatic speech recognition (ASR), LLMs, and text-to-speech (TTS) models in a cascading manner. This approach, however, suffers from information loss during the ASR and TTS stages, limiting its ability to capture and express the rich nuances of speech.
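To make the interleaving idea from the abstract concrete, below is a minimal Python sketch of how text spans could be swapped for synthetic speech-token spans. The function name `build_interleaved_sequence`, the callable `text_to_speech_tokens`, and the `speech_ratio`/`span_len` knobs are illustrative assumptions, not the paper's exact sampling strategy or interface.

```python
# Minimal sketch of synthetic speech-text interleaving, assuming a
# hypothetical `text_to_speech_tokens` model that maps a text span directly
# to discrete speech-token IDs (no waveform is ever synthesized).
import random
from typing import Callable, List, Union

def build_interleaved_sequence(
    words: List[str],
    text_to_speech_tokens: Callable[[str], List[int]],
    speech_ratio: float = 0.5,   # fraction of spans rendered as speech tokens (assumed knob)
    span_len: int = 10,          # length of each sampled span, in words (assumed knob)
) -> List[Union[str, int]]:
    """Replace randomly sampled word spans with synthetic speech tokens.

    Returns a mixed sequence of text words and integer speech-token IDs that
    a speech-text LM can be pre-trained on.
    """
    sequence: List[Union[str, int]] = []
    i = 0
    while i < len(words):
        span = words[i : i + span_len]
        if random.random() < speech_ratio:
            # Synthesize discrete speech tokens for this span instead of audio.
            sequence.extend(text_to_speech_tokens(" ".join(span)))
        else:
            sequence.extend(span)
        i += span_len
    return sequence
```

With a mock tokenizer such as `text_to_speech_tokens = lambda s: [hash(w) % 16384 for w in s.split()]`, calling `build_interleaved_sequence(corpus_text.split(), text_to_speech_tokens)` yields a mixed list of words and token IDs; the actual system would use its trained text-to-token model instead of this stand-in.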
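The supervised speech tokenizer is described as an ASR model with a vector-quantized bottleneck inserted into its encoder. Below is a minimal PyTorch sketch of that idea; the codebook size, hidden dimension, commitment loss, and CTC-style ASR head are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of a vector-quantized bottleneck inside an ASR encoder
# (PyTorch). Sizes and losses are illustrative assumptions.
import torch
import torch.nn as nn

class VQBottleneck(nn.Module):
    def __init__(self, dim: int = 512, codebook_size: int = 4096):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, h: torch.Tensor):
        # h: (batch, time, dim) hidden states from the ASR encoder.
        flat = h.reshape(-1, h.size(-1))                       # (B*T, dim)
        dists = torch.cdist(flat, self.codebook.weight)        # (B*T, K)
        ids = dists.argmin(dim=-1).reshape(h.shape[:-1])       # discrete speech tokens, (B, T)
        codes = self.codebook(ids)                             # (B, T, dim)
        # Commitment loss pulls encoder outputs toward their codewords.
        commit_loss = torch.mean((h - codes.detach()) ** 2)
        # Straight-through estimator: gradients flow to the encoder as if
        # quantization were the identity.
        quantized = h + (codes - h).detach()
        return quantized, ids, commit_loss

class SpeechTokenizer(nn.Module):
    """ASR encoder -> VQ bottleneck -> ASR head, trained with the usual
    supervised ASR objective (e.g. CTC over the text vocabulary)."""
    def __init__(self, encoder: nn.Module, vocab_size: int, dim: int = 512):
        super().__init__()
        self.encoder = encoder
        self.vq = VQBottleneck(dim=dim)
        self.asr_head = nn.Linear(dim, vocab_size)

    def forward(self, features: torch.Tensor):
        h = self.encoder(features)                     # (B, T, dim)
        quantized, speech_token_ids, commit_loss = self.vq(h)
        logits = self.asr_head(quantized)              # fed to the ASR loss
        return logits, speech_token_ids, commit_loss
```

In this sketch the `encoder` would be the lower layers of a pre-trained ASR model; after supervised training with the bottleneck in place, the encoder plus codebook act as the speech tokenizer, and `speech_token_ids` are the discrete units the SpeechLM is pre-trained on.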