UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation
Liu, Alexander H.; Lee, Sang-gil; Yang, Chao-Han Huck; Gong, Yuan; Wang, Yu-Chiang Frank; Glass, James R.; Valle, Rafael; Catanzaro, Bryan
arXiv.org Artificial Intelligence
Pre-training and representation learning play an increasingly important role in modern speech processing. Nevertheless, different applications rely on different foundation models, since predominant pre-training techniques are designed for either discriminative or generative tasks. In this work, we make the first attempt at building a unified pre-training framework for both types of tasks in speech. We show that, with appropriate design choices for pre-training, one can jointly learn a representation encoder and a generative audio decoder that can be applied to both types of tasks. We propose UniWav, an encoder-decoder framework designed to unify pre-training for representation learning and generative tasks. On speech recognition, text-to-speech, and speech tokenization, UniWav achieves performance comparable to that of existing foundation models, each of which was trained for a specific task. Our findings suggest that a single general-purpose foundation model for speech can replace multiple task-specific foundation models, reducing the overhead and cost of pre-training.
Mar-2-2025