Mu$^{2}$SLAM: Multitask, Multilingual Speech and Language Models
Cheng, Yong, Zhang, Yu, Johnson, Melvin, Macherey, Wolfgang, Bapna, Ankur
arXiv.org Artificial Intelligence
We present Mu$^{2}$SLAM, a multilingual sequence-to-sequence model pre-trained jointly on unlabeled speech, unlabeled text and supervised data spanning Automatic Speech Recognition (ASR), Automatic Speech Translation (AST) and Machine Translation (MT), in over 100 languages. By leveraging a quantized representation of speech as a target, Mu$^{2}$SLAM trains the speech-text models with a sequence-to-sequence masked denoising objective similar to T5 on the decoder and a masked language modeling (MLM) objective on the encoder, for both unlabeled speech and text, while using the supervised tasks to improve cross-lingual and cross-modal representation alignment within the model. On CoVoST AST, Mu$^{2}$SLAM establishes a new state of the art for models trained on public datasets, improving on xx-en translation over the previous best by 1.9 BLEU points and on en-xx translation by 1.1 BLEU points. On VoxPopuli ASR, our model matches the performance of an mSLAM model fine-tuned with an RNN-T decoder, despite using a relatively weaker sequence-to-sequence architecture. On text understanding tasks, our model improves by more than 6\% over mSLAM on XNLI, getting closer to the performance of mT5 models of comparable capacity on XNLI and TyDi QA, paving the way towards a single model for all speech and text understanding tasks.
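To make the "masked denoising objective similar to T5" concrete, the following is a minimal sketch of T5-style span corruption on a sequence of discrete tokens. It is not the authors' implementation: the function name `span_corrupt`, the explicit `spans` argument, and the `<extra_id_N>` sentinel naming (borrowed from common T5 tokenizers) are assumptions made for illustration. The abstract does not give the masking rate, the span sampling scheme, or the speech quantizer, so none of them is modeled here. The tokens could be text subwords or, under the same assumption, quantized speech ids.

```python
from typing import List, Sequence, Tuple


def span_corrupt(
    tokens: Sequence[str], spans: Sequence[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Replace each (start, length) span with a sentinel, T5-style.

    Returns (corrupted_input, target). `spans` must be sorted,
    non-overlapping, and lie within `tokens`.
    """
    corrupted: List[str] = []
    target: List[str] = []
    cursor = 0
    for i, (start, length) in enumerate(spans):
        if start < cursor or length <= 0 or start + length > len(tokens):
            raise ValueError("spans must be sorted, non-overlapping and in range")
        sentinel = f"<extra_id_{i}>"  # hypothetical sentinel naming
        corrupted.extend(tokens[cursor:start])       # keep the unmasked tokens
        corrupted.append(sentinel)                   # one sentinel per masked span
        target.append(sentinel)
        target.extend(tokens[start:start + length])  # tokens the decoder must predict
        cursor = start + length
    corrupted.extend(tokens[cursor:])
    target.append(f"<extra_id_{len(spans)}>")        # closing sentinel, as in T5
    return corrupted, target


if __name__ == "__main__":
    words = "Thank you for inviting me to your party last week".split()
    model_input, model_target = span_corrupt(words, [(2, 2), (8, 1)])
    print(" ".join(model_input))
    print(" ".join(model_target))
```

In this sketch the encoder would read the corrupted input and the decoder would be trained to emit the target. An encoder-side MLM loss, as described in the abstract, would be computed separately over the masked positions.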
Jun-26-2023
- Country:
  - North America > United States
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
    - Hawaii > Honolulu County
      - Honolulu (0.04)
  - Europe
    - Belgium (0.04)
    - Portugal > Lisbon
      - Lisbon (0.04)
    - Italy > Tuscany
      - Florence (0.04)
    - Denmark > Capital Region
      - Copenhagen (0.04)
    - Bulgaria > Sofia City Province
      - Sofia (0.04)
- Genre:
  - Research Report (0.50)
- Technology: