Huang, Zhaocheng
SpeechVerse: A Large-scale Generalizable Audio Language Model
Das, Nilaksh, Dingliwal, Saket, Ronanki, Srikanth, Paturi, Rohit, Huang, Zhaocheng, Mathur, Prashant, Yuan, Jie, Bekal, Dhanush, Niu, Xing, Jayanthi, Sai Muralidhar, Li, Xilai, Mundnich, Karel, Sunkara, Monica, Srinivasan, Sundararajan, Han, Kyu J, Kirchhoff, Katrin
Large language models (LLMs) have shown remarkable proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but the resulting models are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore develop SpeechVerse, a robust multi-task training and curriculum learning framework that combines pre-trained speech and text foundation models via a small set of learnable parameters, while keeping the pre-trained models frozen during training. The models are instruction fine-tuned using continuous latent representations extracted from the speech foundation model to achieve optimal zero-shot performance on a diverse range of speech processing tasks using natural language instructions. We perform extensive benchmarking, comparing our model's performance against traditional baselines across several datasets and tasks. Furthermore, we evaluate the model's capability for generalized instruction following by testing on out-of-domain datasets, novel prompts, and unseen tasks. Our empirical experiments reveal that the multi-task SpeechVerse model outperforms conventional task-specific baselines on 9 out of 11 tasks.
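To make the described architecture concrete, below is a minimal PyTorch sketch of the kind of trainable bridge the abstract mentions: continuous latents from a frozen speech encoder are downsampled and projected into a frozen LLM's embedding space, and only this bridge is trained. The adapter design (1-D convolution plus linear projection), module names, and dimensions are illustrative assumptions; the abstract only states that the two frozen foundation models are connected via a small set of learnable parameters.

```python
import torch
import torch.nn as nn

class SpeechToLLMAdapter(nn.Module):
    """Small trainable bridge between a frozen speech encoder and a frozen LLM.

    Hypothetical sketch: the actual SpeechVerse adapter design is not
    specified in the abstract.
    """

    def __init__(self, speech_dim: int = 1024, llm_dim: int = 4096, stride: int = 4):
        super().__init__()
        # Reduce the speech frame rate, then map into the LLM embedding space.
        self.downsample = nn.Conv1d(speech_dim, speech_dim, kernel_size=stride, stride=stride)
        self.project = nn.Linear(speech_dim, llm_dim)

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, frames, speech_dim) continuous latents from
        # the frozen speech foundation model.
        x = self.downsample(speech_feats.transpose(1, 2)).transpose(1, 2)
        return self.project(x)  # (batch, frames // stride, llm_dim)

# During instruction fine-tuning, only the adapter would be updated; the
# projected audio embeddings are concatenated with the embedded text
# instruction and fed to the frozen LLM.
adapter = SpeechToLLMAdapter()
audio_latents = torch.randn(2, 100, 1024)  # dummy encoder output
print(adapter(audio_latents).shape)        # torch.Size([2, 25, 4096])
```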
End-to-End Single-Channel Speaker-Turn Aware Conversational Speech Translation
Zuluaga-Gomez, Juan, Huang, Zhaocheng, Niu, Xing, Paturi, Rohit, Srinivasan, Sundararajan, Mathur, Prashant, Thompson, Brian, Federico, Marcello
Conventional speech-to-text translation (ST) systems are trained on single-speaker utterances and may not generalize to real-life scenarios where the audio contains conversations among multiple speakers. In this paper, we tackle single-channel multi-speaker conversational ST with an end-to-end, multi-task trained model, named Speaker-Turn Aware Conversational Speech Translation, that combines automatic speech recognition, speech translation, and speaker turn detection using special tokens in a serialized labeling format. We run experiments on the Fisher-CALLHOME corpus, which we adapted by merging the two single-speaker channels into one multi-speaker channel, thus representing the more realistic and challenging scenario with multi-speaker turns and cross-talk. Experimental results across single- and multi-speaker conditions, and against conventional ST systems, show that our model outperforms the reference systems in the multi-speaker condition while attaining comparable performance in the single-speaker condition. We release scripts for data processing and model training.
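As a concrete illustration of a serialized labeling format of this kind, the sketch below merges the transcript, translation, and speaker-turn targets of a two-speaker exchange into a single label sequence. The specific special tokens (<sot> marking a speaker turn, <asr> and <st> as task tags) and the Segment fields are assumptions for illustration; the abstract does not give the paper's exact token inventory.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: int        # speaker index within the conversation
    transcript: str     # source-language transcript (ASR target)
    translation: str    # target-language translation (ST target)

def serialize(segments: list[Segment]) -> str:
    """Emit one label sequence covering all speakers in temporal order.

    Hypothetical token scheme: <sot> marks a speaker turn; <asr>/<st>
    tag the recognition and translation portions of each segment.
    """
    parts = []
    prev_speaker = None
    for seg in segments:
        if seg.speaker != prev_speaker:
            parts.append("<sot>")  # speaker-turn token
            prev_speaker = seg.speaker
        parts.append(f"<asr> {seg.transcript} <st> {seg.translation}")
    return " ".join(parts)

print(serialize([
    Segment(0, "hola como estas", "hi how are you"),
    Segment(1, "bien gracias", "fine thanks"),
]))
# <sot> <asr> hola como estas <st> hi how are you <sot> <asr> bien gracias <st> fine thanks
```

Training a single decoder on such sequences lets one model handle recognition, translation, and turn detection jointly, which is the motivation the abstract gives for the serialized format.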