AITopics | Tomasello, Paden

Collaborating Authors

Tomasello, Paden

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

SSR: Alignment-Aware Modality Connector for Speech Language Models

Tan, Weiting, Inaguma, Hirofumi, Dong, Ning, Tomasello, Paden, Ma, Xutai

arXiv.org Artificial IntelligenceSep-30-2024

Fusing speech into pre-trained language model (SpeechLM) usually suffers from inefficient encoding of long-form speech and catastrophic forgetting of pre-trained text modality. Leveraging speech-text alignments, our approach segments and compresses speech features to match the granularity of text embeddings. Additionally, we introduce a two-stage training pipeline that includes the distillation and fine-tuning phases to mitigate catastrophic forgetting. In this work, we focus on integrating speech into pre-trained language models (SpeechLMs). A straightforward approach is to transcribe speech into text and use these transcriptions as prompts for large language models (Huang et al., 2023); however, such cascaded systems suffer from error propagation, higher latency, and cannot leverage raw speech information like emotion, speaker identity, and other paralinguistic cues (Faruqui & Hakkani-Tür, 2021; Lin et al., 2022; Kim et al., 2024). Speech representations can be integrated into pre-trained language models mainly through two approaches. The first method involves using connector modules that align speech representations with the language model's input space without modifying the model's existing vocabulary. These connector-based techniques typically incorporate a compression module to shorten the speech features, enhancing efficiency.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2410.00168

Country:

North America > United States (0.46)
Europe (0.46)

Genre: Research Report (1.00)

Industry: Education > Curriculum > Subject-Specific Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

Seamless: Multilingual Expressive and Streaming Speech Translation

Communication, Seamless, Barrault, Loïc, Chung, Yu-An, Meglioli, Mariano Coria, Dale, David, Dong, Ning, Duppenthaler, Mark, Duquenne, Paul-Ambroise, Ellis, Brian, Elsahar, Hady, Haaheim, Justin, Hoffman, John, Hwang, Min-Jae, Inaguma, Hirofumi, Klaiber, Christopher, Kulikov, Ilia, Li, Pengwei, Licht, Daniel, Maillard, Jean, Mavlyutov, Ruslan, Rakotoarison, Alice, Sadagopan, Kaushik Ram, Ramakrishnan, Abinesh, Tran, Tuan, Wenzek, Guillaume, Yang, Yilin, Ye, Ethan, Evtimov, Ivan, Fernandez, Pierre, Gao, Cynthia, Hansanti, Prangthip, Kalbassi, Elahe, Kallet, Amanda, Kozhevnikov, Artyom, Gonzalez, Gabriel Mejia, Roman, Robin San, Touret, Christophe, Wong, Corinne, Wood, Carleigh, Yu, Bokai, Andrews, Pierre, Balioglu, Can, Chen, Peng-Jen, Costa-jussà, Marta R., Elbayad, Maha, Gong, Hongyu, Guzmán, Francisco, Heffernan, Kevin, Jain, Somya, Kao, Justine, Lee, Ann, Ma, Xutai, Mourachko, Alex, Peloquin, Benjamin, Pino, Juan, Popuri, Sravya, Ropers, Christophe, Saleem, Safiyyah, Schwenk, Holger, Sun, Anna, Tomasello, Paden, Wang, Changhan, Wang, Jeff, Wang, Skyler, Williamson, Mary

arXiv.org Artificial IntelligenceDec-8-2023

Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model-SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. SeamlessM4T v2 provides the foundation on which our next two models are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one's voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. The contributions to this work are publicly released and accessible at https://github.com/facebookresearch/seamless_communication

artificial intelligence, machine learning, natural language, (23 more...)

arXiv.org Artificial Intelligence

2312.05187

Country:

North America > United States (1.00)
Europe (1.00)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.13)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)
Government (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.92)

Add feedback

Efficient Monotonic Multihead Attention

Ma, Xutai, Sun, Anna, Ouyang, Siqi, Inaguma, Hirofumi, Tomasello, Paden

arXiv.org Artificial IntelligenceDec-7-2023

We introduce the Efficient Monotonic Multihead Attention (EMMA), a state-of-the-art simultaneous translation model with numerically-stable and unbiased monotonic alignment estimation. In addition, we present improved training and inference strategies, including simultaneous fine-tuning from an offline translation model and reduction of monotonic alignment variance. The experimental results demonstrate that the proposed model attains state-of-the-art performance in simultaneous speech-to-text translation on the Spanish and English translation task.

computational linguistic, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2312.04515

Country:

Europe (1.00)
North America > United States > Louisiana (0.14)
North America > United States > California (0.14)
Asia > Middle East > Qatar (0.14)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

Add feedback

SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

Communication, Seamless, Barrault, Loïc, Chung, Yu-An, Meglioli, Mariano Cora, Dale, David, Dong, Ning, Duquenne, Paul-Ambroise, Elsahar, Hady, Gong, Hongyu, Heffernan, Kevin, Hoffman, John, Klaiber, Christopher, Li, Pengwei, Licht, Daniel, Maillard, Jean, Rakotoarison, Alice, Sadagopan, Kaushik Ram, Wenzek, Guillaume, Ye, Ethan, Akula, Bapi, Chen, Peng-Jen, Hachem, Naji El, Ellis, Brian, Gonzalez, Gabriel Mejia, Haaheim, Justin, Hansanti, Prangthip, Howes, Russ, Huang, Bernie, Hwang, Min-Jae, Inaguma, Hirofumi, Jain, Somya, Kalbassi, Elahe, Kallet, Amanda, Kulikov, Ilia, Lam, Janice, Li, Daniel, Ma, Xutai, Mavlyutov, Ruslan, Peloquin, Benjamin, Ramadan, Mohamed, Ramakrishnan, Abinesh, Sun, Anna, Tran, Kevin, Tran, Tuan, Tufanov, Igor, Vogeti, Vish, Wood, Carleigh, Yang, Yilin, Yu, Bokai, Andrews, Pierre, Balioglu, Can, Costa-jussà, Marta R., Celebi, Onur, Elbayad, Maha, Gao, Cynthia, Guzmán, Francisco, Kao, Justine, Lee, Ann, Mourachko, Alexandre, Pino, Juan, Popuri, Sravya, Ropers, Christophe, Saleem, Safiyyah, Schwenk, Holger, Tomasello, Paden, Wang, Changhan, Wang, Jeff, Wang, Skyler

arXiv.org Artificial IntelligenceOct-24-2023

What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks compared to the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication

multilingual & multimodal machine translation, natural language, speech recognition, (3 more...)

arXiv.org Artificial Intelligence

2308.11596

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Scaling Speech Technology to 1,000+ Languages

Pratap, Vineel, Tjandra, Andros, Shi, Bowen, Tomasello, Paden, Babu, Arun, Kundu, Sayani, Elkahky, Ali, Ni, Zhaoheng, Vyas, Apoorv, Fazel-Zarandi, Maryam, Baevski, Alexei, Adi, Yossi, Zhang, Xiaohui, Hsu, Wei-Ning, Conneau, Alexis, Auli, Michael

arXiv.org Artificial IntelligenceMay-22-2023

Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages which is a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task. The main ingredients are a new dataset based on readings of publicly available religious texts and effectively leveraging self-supervised learning. We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, as well as a language identification model for 4,017 languages. Experiments show that our multilingual speech recognition model more than halves the word error rate of Whisper on 54 languages of the FLEURS benchmark while being trained on a small fraction of the labeled data.

artificial intelligence, deep learning, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2305.13516

Country: Europe (0.67)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

Efficient Speech Representation Learning with Low-Bit Quantization

Yeh, Ching-Feng, Hsu, Wei-Ning, Tomasello, Paden, Mohamed, Abdelrahman

arXiv.org Artificial IntelligenceDec-14-2022

With the development of hardware for machine learning, newer models often come at the cost of both increased sizes and computational complexity. In effort to improve the efficiency for these models, we apply and investigate recent quantization techniques on speech representation learning models. The quantization techniques were evaluated on the SUPERB benchmark. On the ASR task, with aggressive quantization to 1 bit, we achieved 86.32% storage reduction (184.42 -> 25.23), 88% estimated runtime reduction (1.00 -> 0.12) with increased word error rate (7.06 -> 15.96). In comparison with DistillHuBERT which also aims for model compression, the 2-bit configuration yielded slightly smaller storage (35.84 vs. 46.98), better word error rate (12.68 vs. 13.37) and more efficient estimated runtime (0.15 vs. 0.73).

artificial intelligence, machine learning, quantization, (18 more...)

arXiv.org Artificial Intelligence

2301.00652

Country: North America > United States (0.28)

Genre: Research Report (0.84)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Continual Learning for On-Device Speech Recognition using Disentangled Conformers

Diwan, Anuj, Yeh, Ching-Feng, Hsu, Wei-Ning, Tomasello, Paden, Choi, Eunsol, Harwath, David, Mohamed, Abdelrahman

arXiv.org Artificial IntelligenceDec-13-2022

Automatic speech recognition research focuses on training and evaluating on static datasets. Yet, as speech models are increasingly deployed on personal devices, such models encounter user-specific distributional shifts. To simulate this real-world scenario, we introduce LibriContinual, a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks, with data corresponding to 118 individual speakers and 6 train splits per speaker of different sizes. Additionally, current speech recognition models and continual learning algorithms are not optimized to be compute-efficient. We adapt a general-purpose training algorithm NetAug for ASR and create a novel Conformer variant called the DisConformer (Disentangled Conformer). This algorithm produces ASR models consisting of a frozen 'core' network for general-purpose use and several tunable 'augment' networks for speaker-specific tuning. Using such models, we propose a novel compute-efficient continual learning algorithm called DisentangledCL. Our experiments show that the DisConformer models significantly outperform baselines on general ASR i.e. LibriSpeech (15.58% rel. WER on test-other). On speaker-specific LibriContinual they significantly outperform trainable-parameter-matched baselines (by 20.65% rel. WER on test) and even match fully finetuned baselines in some settings.

artificial intelligence, libricontinual, machine learning, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ICASSP49357.2023.10095484

2212.01393

Country: North America > United States (0.28)

Genre: Research Report (0.40)

Industry: Media (0.35)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Generative Spoken Dialogue Language Modeling

Nguyen, Tu Anh, Kharitonov, Eugene, Copet, Jade, Adi, Yossi, Hsu, Wei-Ning, Elkahky, Ali, Tomasello, Paden, Algayres, Robin, Sagot, Benoit, Mohamed, Abdelrahman, Dupoux, Emmanuel

arXiv.org Artificial IntelligenceNov-22-2022

We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter and other paralinguistic signals in the two channels simultaneously and reproduces more naturalistic and fluid turn-taking compared to a text-based cascaded model.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2203.16502

Country:

Europe (0.67)
North America > United States > Minnesota (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.84)

Add feedback

Speech-to-Speech Translation For A Real-world Unwritten Language

Chen, Peng-Jen, Tran, Kevin, Yang, Yilin, Du, Jingfei, Kao, Justine, Chung, Yu-An, Tomasello, Paden, Duquenne, Paul-Ambroise, Schwenk, Holger, Gong, Hongyu, Inaguma, Hirofumi, Popuri, Sravya, Wang, Changhan, Pino, Juan, Hsu, Wei-Ning, Lee, Ann

arXiv.org Artificial IntelligenceNov-11-2022

We study speech-to-speech translation (S2ST) that translates speech from one language into another language and focuses on building systems to support languages without standard text writing systems. We use English-Taiwanese Hokkien as a case study, and present an end-to-end solution from training data collection, modeling choices to benchmark dataset release. First, we present efforts on creating human annotated data, automatically mining data from large unlabeled speech datasets, and adopting pseudo-labeling to produce weakly supervised data. On the modeling, we take advantage of recent advances in applying self-supervised discrete representations as target for prediction in S2ST and show the effectiveness of leveraging additional text supervision from Mandarin, a language similar to Hokkien, in model training. Finally, we release an S2ST benchmark set to facilitate future research in this field. The demo can be found at https://huggingface.co/spaces/facebook/Hokkien_Translation .

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2211.06474

Country:

North America > United States > Minnesota (0.28)
Europe > Spain (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Flashlight: Enabling Innovation in Tools for Machine Learning

Kahn, Jacob, Pratap, Vineel, Likhomanenko, Tatiana, Xu, Qiantong, Hannun, Awni, Cai, Jeff, Tomasello, Paden, Lee, Ann, Grave, Edouard, Avidov, Gilad, Steiner, Benoit, Liptchinsky, Vitaliy, Synnaeve, Gabriel, Collobert, Ronan

arXiv.org Artificial IntelligenceJan-28-2022

As the computational requirements for machine learning systems and the size and complexity of machine learning frameworks increases, essential framework innovation has become challenging. While computational needs have driven recent compiler, networking, and hardware advancements, utilization of those advancements by machine learning tools is occurring at a slower pace. This is in part due to the difficulties involved in prototyping new computational paradigms with existing frameworks. Large frameworks prioritize machine learning researchers and practitioners as end users and pay comparatively little attention to systems researchers who can push frameworks forward -- we argue that both are equally important stakeholders. We introduce Flashlight, an open-source library built to spur innovation in machine learning tools and systems by prioritizing open, modular, customizable internals and state-of-the-art, research-ready models and training setups across a variety of domains. Flashlight allows systems researchers to rapidly prototype and experiment with novel ideas in machine learning computation and has low overhead, competing with and often outperforming other popular machine learning frameworks. We see Flashlight as a tool enabling research that can benefit widely used libraries downstream and bring machine learning and systems researchers closer together.

artificial intelligence, machine learning, neural network, (19 more...)

arXiv.org Artificial Intelligence

2201.12465

Country: North America > United States > California (0.15)

Genre: Research Report > Promising Solution (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback