AITopics | seamlessm4t

Collaborating Authors

seamlessm4t

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

It's Not a Walk in the Park! Challenges of Idiom Translation in Speech-to-text Systems

Zaitova, Iuliia, Abdullah, Badr M., Xue, Wei, Klakow, Dietrich, Möbius, Bernd, Avgustinova, Tania

arXiv.org Artificial IntelligenceJun-4-2025

Idioms are defined as a group of words with a figurative meaning not deducible from their individual components. Although modern machine translation systems have made remarkable progress, translating idioms remains a major challenge, especially for speech-to-text systems, where research on this topic is notably sparse. In this paper, we systematically evaluate idiom translation as compared to conventional news translation in both text-to-text machine translation (MT) and speech-to-text translation (SLT) systems across two language pairs (German to English, Russian to English). We compare state-of-the-art end-to-end SLT systems (SeamlessM4T SLT-to-text, Whisper Large v3) with MT systems (SeamlessM4T SLT-to-text, No Language Left Behind), Large Language Models (DeepSeek, LLaMA) and cascaded alternatives. Our results reveal that SLT systems experience a pronounced performance drop on idiomatic data, often reverting to literal translations even in higher layers, whereas MT systems and Large Language Models demonstrate better handling of idioms. These findings underscore the need for idiom-specific strategies and improved internal representations in SLT architectures.

large language model, machine learning, translation, (21 more...)

arXiv.org Artificial Intelligence

2506.02995

Country:

Europe (1.00)
North America > United States (0.93)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.92)

Add feedback

AMPS: ASR with Multimodal Paraphrase Supervision

Parulekar, Amruta, Gupta, Abhishek, Chattopadhyay, Sameep, Jyothi, Preethi

arXiv.org Artificial IntelligenceNov-27-2024

Spontaneous or conversational multilingual speech presents many challenges for state-of-the-art automatic speech recognition (ASR) systems. In this work, we present a new technique AMPS that augments a multilingual multimodal ASR system with paraphrase-based supervision for improved conversational ASR in multiple languages, including Hindi, Marathi, Malayalam, Kannada, and Nyanja. We use paraphrases of the reference transcriptions as additional supervision while training the multimodal ASR model and selectively invoke this paraphrase objective for utterances with poor ASR performance. Using AMPS with a state-of-the-art multimodal model SeamlessM4T, we obtain significant relative reductions in word error rates (WERs) of up to 5%. We present detailed analyses of our system using both objective and human evaluation metrics.

asr, speech, transcription, (16 more...)

arXiv.org Artificial Intelligence

2411.18368

Country:

Asia > India > Tripura (0.04)
North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
(7 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Parameter-efficient Adaptation of Multilingual Multimodal Models for Low-resource ASR

Gupta, Abhishek, Parulekar, Amruta, Chattopadhyay, Sameep, Jyothi, Preethi

arXiv.org Artificial IntelligenceOct-17-2024

Automatic speech recognition (ASR) for low-resource languages remains a challenge due to the scarcity of labeled training data. Parameter-efficient fine-tuning and text-only adaptation are two popular methods that have been used to address such low-resource settings. In this work, we investigate how these techniques can be effectively combined using a multilingual multimodal model like SeamlessM4T. Multimodal models are able to leverage unlabeled text via text-only adaptation with further parameter-efficient ASR fine-tuning, thus boosting ASR performance. We also show cross-lingual transfer from a high-resource language, achieving up to a relative 17% WER reduction over a baseline in a zero-shot setting without any labeled speech.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2410.13445

Country:

Asia > Thailand > Bangkok > Bangkok (0.04)
North America > United States > Texas > Dallas County > Dallas (0.04)
North America > Dominican Republic (0.04)
(7 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Finetuning End-to-End Models for Estonian Conversational Spoken Language Translation

Sildam, Tiia, Velve, Andra, Alumäe, Tanel

arXiv.org Artificial IntelligenceJul-4-2024

This paper investigates the finetuning of end-to-end models for bidirectional Estonian-English and Estonian-Russian conversational speech-to-text translation. Due to the limited availability of speech translation data for Estonian, we created additional training data by web scraping and synthesizing data from speech recognition datasets using machine translation. We evaluated three publicly available end-to-end models: Whisper, OWSM 3.1, and SeamlessM4T. Our results indicate that fine-tuning with synthetic data enhances translation accuracy by a large margin, with SeamlessM4T matching or surpassing cascaded speech translation systems that use state-of-the-art speech recognition and machine translation models.

speech, training data, translation, (15 more...)

arXiv.org Artificial Intelligence

2407.03809

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Estonia > Tartu County > Tartu (0.04)
Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.04)
(10 more...)

Genre: Research Report > New Finding (0.48)

Industry: Media (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Voices Unheard: NLP Resources and Models for Yor\`ub\'a Regional Dialects

Ahia, Orevaoghene, Aremu, Anuoluwapo, Abagyan, Diana, Gonen, Hila, Adelani, David Ifeoluwa, Abolade, Daud, Smith, Noah A., Tsvetkov, Yulia

arXiv.org Artificial IntelligenceJun-27-2024

Yor\`ub\'a an African language with roughly 47 million speakers encompasses a continuum with several dialects. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects, resulting in disparities for dialects and varieties for which there are little to no resources or tools. We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus YOR\`ULECT across three domains and four regional Yor\`ub\'a dialects. To develop this corpus, we engaged native speakers, travelling to communities where these dialects are spoken, to collect text and speech data. Using our newly created corpus, we conducted extensive experiments on (text) machine translation, automatic speech recognition, and speech-to-text translation. Our results reveal substantial performance disparities between standard Yor\`ub\'a and the other dialects across all tasks. However, we also show that with dialect-adaptive finetuning, we are able to narrow this gap. We believe our dataset and experimental analysis will contribute greatly to developing NLP tools for Yor\`ub\'a and its dialects, and potentially for other African languages, by improving our understanding of existing challenges and offering a high-quality dataset for further development. We release YOR\`ULECT dataset and models publicly under an open license.

computational linguistic, dialect, translation, (15 more...)

arXiv.org Artificial Intelligence

2406.19564

Country:

Asia > Singapore (0.04)
North America > Canada > Ontario > Toronto (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(29 more...)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators

Hu, Yuchen, Chen, Chen, Yang, Chao-Han Huck, Li, Ruizhe, Zhang, Dong, Chen, Zhehuai, Chng, Eng Siong

arXiv.org Artificial IntelligenceFeb-10-2024

Recent advances in large language models (LLMs) have stepped forward the development of multilingual speech and machine translation by its reduced representation errors and incorporated external knowledge. However, both translation tasks typically utilize beam search decoding and top-1 hypothesis selection for inference. These techniques struggle to fully exploit the rich information in the diverse N-best hypotheses, making them less optimal for translation tasks that require a single, high-quality output sequence. In this paper, we propose a new generative paradigm for translation tasks, namely "GenTranslate", which builds upon LLMs to generate better results from the diverse translation versions in N-best list. Leveraging the rich linguistic knowledge and strong reasoning abilities of LLMs, our new paradigm can integrate the rich information in N-best candidates to generate a higher-quality translation result. Furthermore, to support LLM finetuning, we build and release a HypoTranslate dataset that contains over 592K hypotheses-translation pairs in 11 languages. Experiments on various speech and machine translation benchmarks (e.g., FLEURS, CoVoST-2, WMT) demonstrate that our GenTranslate significantly outperforms the state-of-the-art model.

gentranslate, hypothesis, translation, (17 more...)

arXiv.org Artificial Intelligence

2402.06894

Country: Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Add feedback

Investigating Zero-Shot Generalizability on Mandarin-English Code-Switched ASR and Speech-to-text Translation of Recent Foundation Models with Self-Supervision and Weak Supervision

Yang, Chih-Kai, Huang, Kuan-Po, Lu, Ke-Han, Kuan, Chun-Yi, Hsiao, Chi-Yuan, Lee, Hung-yi

arXiv.org Artificial IntelligenceDec-30-2023

This work evaluated several cutting-edge large-scale foundation models based on self-supervision or weak supervision, including SeamlessM4T, SeamlessM4T v2, and Whisper-large-v3, on three code-switched corpora. We found that self-supervised models can achieve performances close to the supervised model, indicating the effectiveness of multilingual self-supervised pre-training. We also observed that these models still have room for improvement as they kept making similar mistakes and had unsatisfactory performances on modeling intra-sentential code-switching. In addition, the validity of several variants of Whisper was explored, and we concluded that they remained effective in a code-switching scenario, and similar techniques for self-supervised models are worth studying to boost the performance of code-switched tasks.

cs asr, transcription, whisper-large-v3, (16 more...)

arXiv.org Artificial Intelligence

2401.00273

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
Asia > Taiwan (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.84)

Add feedback

SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

Communication, Seamless, Barrault, Loïc, Chung, Yu-An, Meglioli, Mariano Cora, Dale, David, Dong, Ning, Duquenne, Paul-Ambroise, Elsahar, Hady, Gong, Hongyu, Heffernan, Kevin, Hoffman, John, Klaiber, Christopher, Li, Pengwei, Licht, Daniel, Maillard, Jean, Rakotoarison, Alice, Sadagopan, Kaushik Ram, Wenzek, Guillaume, Ye, Ethan, Akula, Bapi, Chen, Peng-Jen, Hachem, Naji El, Ellis, Brian, Gonzalez, Gabriel Mejia, Haaheim, Justin, Hansanti, Prangthip, Howes, Russ, Huang, Bernie, Hwang, Min-Jae, Inaguma, Hirofumi, Jain, Somya, Kalbassi, Elahe, Kallet, Amanda, Kulikov, Ilia, Lam, Janice, Li, Daniel, Ma, Xutai, Mavlyutov, Ruslan, Peloquin, Benjamin, Ramadan, Mohamed, Ramakrishnan, Abinesh, Sun, Anna, Tran, Kevin, Tran, Tuan, Tufanov, Igor, Vogeti, Vish, Wood, Carleigh, Yang, Yilin, Yu, Bokai, Andrews, Pierre, Balioglu, Can, Costa-jussà, Marta R., Celebi, Onur, Elbayad, Maha, Gao, Cynthia, Guzmán, Francisco, Kao, Justine, Lee, Ann, Mourachko, Alexandre, Pino, Juan, Popuri, Sravya, Ropers, Christophe, Saleem, Safiyyah, Schwenk, Holger, Tomasello, Paden, Wang, Changhan, Wang, Jeff, Wang, Skyler

arXiv.org Artificial IntelligenceOct-24-2023

What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks compared to the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication

massively multilingual, multilingual & multimodal machine translation, seamlessm4t

arXiv.org Artificial Intelligence

2308.11596

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Meta's new multimodal translator uses a single model to speak 100 languages

EngadgetAug-22-2023, 13:30:40 GMT

Though it's not quite ready to usher in the Doolittle future we've all been waiting for, modern AI translation methods are proving more than sufficient in accurately transforming humanity's roughly 6,500 spoken and written communication systems between one another. The problem is that each of these models tends to only do one or two tasks really well -- translate and convert text to speech, speech to text or between either of the two sets -- so you end up having to smash a bunch of models on top of each other to create the generalized performance seen in the likes of Google Translate or Facebook's myriad language services. That's a computationally intensive process, so Meta developed a single model that can do it all. SeamlessM4T is "a foundational multilingual and multitask model that seamlessly translates and transcribes across speech and text," Meta's blog from Tuesday reads. It can translate between any of nearly 100 languages for speech-to-text and text-to-text functions, speech-to-speech and text-to-speech supports those same languages as inputs and outputs them in any of 36 others tongues, including English.

meta, new multimodal translator use, seamlessm4t, (4 more...)

Engadget

Technology:

Information Technology > Communications > Social Media (0.96)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.78)

Add feedback