seamlessm4t
It's Not a Walk in the Park! Challenges of Idiom Translation in Speech-to-text Systems
Zaitova, Iuliia, Abdullah, Badr M., Xue, Wei, Klakow, Dietrich, Möbius, Bernd, Avgustinova, Tania
Idioms are defined as a group of words with a figurative meaning not deducible from their individual components. Although modern machine translation systems have made remarkable progress, translating idioms remains a major challenge, especially for speech-to-text systems, where research on this topic is notably sparse. In this paper, we systematically evaluate idiom translation as compared to conventional news translation in both text-to-text machine translation (MT) and speech-to-text translation (SLT) systems across two language pairs (German to English, Russian to English). We compare state-of-the-art end-to-end SLT systems (SeamlessM4T SLT-to-text, Whisper Large v3) with MT systems (SeamlessM4T SLT-to-text, No Language Left Behind), Large Language Models (DeepSeek, LLaMA) and cascaded alternatives. Our results reveal that SLT systems experience a pronounced performance drop on idiomatic data, often reverting to literal translations even in higher layers, whereas MT systems and Large Language Models demonstrate better handling of idioms. These findings underscore the need for idiom-specific strategies and improved internal representations in SLT architectures.
AMPS: ASR with Multimodal Paraphrase Supervision
Parulekar, Amruta, Gupta, Abhishek, Chattopadhyay, Sameep, Jyothi, Preethi
Spontaneous or conversational multilingual speech presents many challenges for state-of-the-art automatic speech recognition (ASR) systems. In this work, we present a new technique AMPS that augments a multilingual multimodal ASR system with paraphrase-based supervision for improved conversational ASR in multiple languages, including Hindi, Marathi, Malayalam, Kannada, and Nyanja. We use paraphrases of the reference transcriptions as additional supervision while training the multimodal ASR model and selectively invoke this paraphrase objective for utterances with poor ASR performance. Using AMPS with a state-of-the-art multimodal model SeamlessM4T, we obtain significant relative reductions in word error rates (WERs) of up to 5%. We present detailed analyses of our system using both objective and human evaluation metrics.
Parameter-efficient Adaptation of Multilingual Multimodal Models for Low-resource ASR
Gupta, Abhishek, Parulekar, Amruta, Chattopadhyay, Sameep, Jyothi, Preethi
Automatic speech recognition (ASR) for low-resource languages remains a challenge due to the scarcity of labeled training data. Parameter-efficient fine-tuning and text-only adaptation are two popular methods that have been used to address such low-resource settings. In this work, we investigate how these techniques can be effectively combined using a multilingual multimodal model like SeamlessM4T. Multimodal models are able to leverage unlabeled text via text-only adaptation with further parameter-efficient ASR fine-tuning, thus boosting ASR performance. We also show cross-lingual transfer from a high-resource language, achieving up to a relative 17% WER reduction over a baseline in a zero-shot setting without any labeled speech.
Finetuning End-to-End Models for Estonian Conversational Spoken Language Translation
Sildam, Tiia, Velve, Andra, Alumäe, Tanel
This paper investigates the finetuning of end-to-end models for bidirectional Estonian-English and Estonian-Russian conversational speech-to-text translation. Due to the limited availability of speech translation data for Estonian, we created additional training data by web scraping and synthesizing data from speech recognition datasets using machine translation. We evaluated three publicly available end-to-end models: Whisper, OWSM 3.1, and SeamlessM4T. Our results indicate that fine-tuning with synthetic data enhances translation accuracy by a large margin, with SeamlessM4T matching or surpassing cascaded speech translation systems that use state-of-the-art speech recognition and machine translation models.
Voices Unheard: NLP Resources and Models for Yor\`ub\'a Regional Dialects
Ahia, Orevaoghene, Aremu, Anuoluwapo, Abagyan, Diana, Gonen, Hila, Adelani, David Ifeoluwa, Abolade, Daud, Smith, Noah A., Tsvetkov, Yulia
Yor\`ub\'a an African language with roughly 47 million speakers encompasses a continuum with several dialects. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects, resulting in disparities for dialects and varieties for which there are little to no resources or tools. We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus YOR\`ULECT across three domains and four regional Yor\`ub\'a dialects. To develop this corpus, we engaged native speakers, travelling to communities where these dialects are spoken, to collect text and speech data. Using our newly created corpus, we conducted extensive experiments on (text) machine translation, automatic speech recognition, and speech-to-text translation. Our results reveal substantial performance disparities between standard Yor\`ub\'a and the other dialects across all tasks. However, we also show that with dialect-adaptive finetuning, we are able to narrow this gap. We believe our dataset and experimental analysis will contribute greatly to developing NLP tools for Yor\`ub\'a and its dialects, and potentially for other African languages, by improving our understanding of existing challenges and offering a high-quality dataset for further development. We release YOR\`ULECT dataset and models publicly under an open license.
GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators
Hu, Yuchen, Chen, Chen, Yang, Chao-Han Huck, Li, Ruizhe, Zhang, Dong, Chen, Zhehuai, Chng, Eng Siong
Recent advances in large language models (LLMs) have stepped forward the development of multilingual speech and machine translation by its reduced representation errors and incorporated external knowledge. However, both translation tasks typically utilize beam search decoding and top-1 hypothesis selection for inference. These techniques struggle to fully exploit the rich information in the diverse N-best hypotheses, making them less optimal for translation tasks that require a single, high-quality output sequence. In this paper, we propose a new generative paradigm for translation tasks, namely "GenTranslate", which builds upon LLMs to generate better results from the diverse translation versions in N-best list. Leveraging the rich linguistic knowledge and strong reasoning abilities of LLMs, our new paradigm can integrate the rich information in N-best candidates to generate a higher-quality translation result. Furthermore, to support LLM finetuning, we build and release a HypoTranslate dataset that contains over 592K hypotheses-translation pairs in 11 languages. Experiments on various speech and machine translation benchmarks (e.g., FLEURS, CoVoST-2, WMT) demonstrate that our GenTranslate significantly outperforms the state-of-the-art model.
Investigating Zero-Shot Generalizability on Mandarin-English Code-Switched ASR and Speech-to-text Translation of Recent Foundation Models with Self-Supervision and Weak Supervision
Yang, Chih-Kai, Huang, Kuan-Po, Lu, Ke-Han, Kuan, Chun-Yi, Hsiao, Chi-Yuan, Lee, Hung-yi
This work evaluated several cutting-edge large-scale foundation models based on self-supervision or weak supervision, including SeamlessM4T, SeamlessM4T v2, and Whisper-large-v3, on three code-switched corpora. We found that self-supervised models can achieve performances close to the supervised model, indicating the effectiveness of multilingual self-supervised pre-training. We also observed that these models still have room for improvement as they kept making similar mistakes and had unsatisfactory performances on modeling intra-sentential code-switching. In addition, the validity of several variants of Whisper was explored, and we concluded that they remained effective in a code-switching scenario, and similar techniques for self-supervised models are worth studying to boost the performance of code-switched tasks.
SeamlessM4T: Massively Multilingual & Multimodal Machine Translation
Communication, Seamless, Barrault, Loïc, Chung, Yu-An, Meglioli, Mariano Cora, Dale, David, Dong, Ning, Duquenne, Paul-Ambroise, Elsahar, Hady, Gong, Hongyu, Heffernan, Kevin, Hoffman, John, Klaiber, Christopher, Li, Pengwei, Licht, Daniel, Maillard, Jean, Rakotoarison, Alice, Sadagopan, Kaushik Ram, Wenzek, Guillaume, Ye, Ethan, Akula, Bapi, Chen, Peng-Jen, Hachem, Naji El, Ellis, Brian, Gonzalez, Gabriel Mejia, Haaheim, Justin, Hansanti, Prangthip, Howes, Russ, Huang, Bernie, Hwang, Min-Jae, Inaguma, Hirofumi, Jain, Somya, Kalbassi, Elahe, Kallet, Amanda, Kulikov, Ilia, Lam, Janice, Li, Daniel, Ma, Xutai, Mavlyutov, Ruslan, Peloquin, Benjamin, Ramadan, Mohamed, Ramakrishnan, Abinesh, Sun, Anna, Tran, Kevin, Tran, Tuan, Tufanov, Igor, Vogeti, Vish, Wood, Carleigh, Yang, Yilin, Yu, Bokai, Andrews, Pierre, Balioglu, Can, Costa-jussà, Marta R., Celebi, Onur, Elbayad, Maha, Gao, Cynthia, Guzmán, Francisco, Kao, Justine, Lee, Ann, Mourachko, Alexandre, Pino, Juan, Popuri, Sravya, Ropers, Christophe, Saleem, Safiyyah, Schwenk, Holger, Tomasello, Paden, Wang, Changhan, Wang, Jeff, Wang, Skyler
What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks compared to the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication
Meta's new multimodal translator uses a single model to speak 100 languages
Though it's not quite ready to usher in the Doolittle future we've all been waiting for, modern AI translation methods are proving more than sufficient in accurately transforming humanity's roughly 6,500 spoken and written communication systems between one another. The problem is that each of these models tends to only do one or two tasks really well -- translate and convert text to speech, speech to text or between either of the two sets -- so you end up having to smash a bunch of models on top of each other to create the generalized performance seen in the likes of Google Translate or Facebook's myriad language services. That's a computationally intensive process, so Meta developed a single model that can do it all. SeamlessM4T is "a foundational multilingual and multitask model that seamlessly translates and transcribes across speech and text," Meta's blog from Tuesday reads. It can translate between any of nearly 100 languages for speech-to-text and text-to-text functions, speech-to-speech and text-to-speech supports those same languages as inputs and outputs them in any of 36 others tongues, including English.