Baevski, Alexei
Toward Joint Language Modeling for Speech Units and Text
Chou, Ju-Chieh, Chien, Chung-Ming, Hsu, Wei-Ning, Livescu, Karen, Babu, Arun, Conneau, Alexis, Baevski, Alexei, Auli, Michael
Speech and text are two major forms of human language. The research community has been focusing on mapping speech to text or vice versa for many years. However, in the field of language modeling, very little effort has been made to model them jointly. In light of this, we explore joint language modeling for speech units and text. Specifically, we compare different speech tokenizers to transform continuous speech signals into discrete units and use different methods to construct mixed speech-text data. We introduce automatic metrics to evaluate how well the joint LM mixes speech and text. We also fine-tune the LM on downstream spoken language understanding (SLU) tasks with different modalities (speech or text) and test its performance to assess the model's learning of shared representations. Our results show that by mixing speech units and text with our proposed mixing techniques, the joint LM improves over a speech-only baseline on SLU tasks and shows zero-shot cross-modal transferability.
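As a rough illustration of the mixing idea described above, the sketch below assumes speech has already been converted into discrete unit IDs by some tokenizer (for example, k-means over self-supervised features) and shows one simple way to fold units and text into a single LM vocabulary; the names and the mixing rule are hypothetical, not the paper's implementation.

```python
# Illustrative sketch (not the paper's code): fold text tokens and discrete
# speech units into one shared vocabulary and mix both renderings of an
# utterance during language-model training.
import random

TEXT_VOCAB = ["<bos>", "<eos>", "the", "cat", "sat"]   # toy text vocabulary
NUM_SPEECH_UNITS = 100                                 # e.g. k-means cluster IDs

# Speech units get their own symbols so they never collide with text tokens.
UNIT_TOKENS = [f"<unit_{i}>" for i in range(NUM_SPEECH_UNITS)]
VOCAB = {tok: idx for idx, tok in enumerate(TEXT_VOCAB + UNIT_TOKENS)}

def encode_text(words):
    return [VOCAB[w] for w in words]

def encode_units(unit_ids):
    return [VOCAB[f"<unit_{u}>"] for u in unit_ids]

def mix_example(words, unit_ids, p_speech=0.5):
    """Randomly present the speech-unit or text rendering of an utterance,
    so the LM sees both modalities over the course of training."""
    if random.random() < p_speech:
        return encode_units(unit_ids)
    return encode_text(words)

# One parallel utterance: its transcript and its speech-unit sequence.
print(mix_example(["the", "cat", "sat"], [17, 17, 42, 3, 99]))
```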
Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language
Baevski, Alexei, Babu, Arun, Hsu, Wei-Ning, Auli, Michael
Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources. To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes across several modalities. We do not encode masked tokens, use a fast convolutional decoder and amortize the effort to build teacher representations. data2vec 2.0 benefits from the rich contextualized target representations introduced in data2vec, which enable a fast self-supervised learner. Experiments on ImageNet-1K image classification show that data2vec 2.0 matches the accuracy of Masked Autoencoders in 16.4x lower pre-training time, on Librispeech speech recognition it performs as well as wav2vec 2.0 in 10.6x less time, and on GLUE natural language understanding it matches a retrained RoBERTa model in half the time. Trading some speed for accuracy results in ImageNet-1K top-1 accuracy of 86.8% with a ViT-L model trained for 150 epochs.
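A minimal sketch of the general objective described above, under simplifying assumptions (the EMA teacher update is omitted and masked time-steps are zeroed rather than dropped from the student encoder); it is not the released data2vec 2.0 code, but it shows how contextualized teacher targets can be computed once and reused across several masked versions of a sample.

```python
# Sketch under simplifying assumptions (not the released data2vec 2.0 code):
# the teacher encodes the unmasked input once, its outputs serve as regression
# targets, and the same targets are reused for several masked versions of the
# sample to amortize the cost of the teacher pass.
import torch
import torch.nn as nn

dim, seq_len, n_masked_versions = 64, 32, 4

student = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, 4, batch_first=True), 2)
teacher = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, 4, batch_first=True), 2)
teacher.load_state_dict(student.state_dict())   # in practice the teacher is an EMA of the student
for p in teacher.parameters():
    p.requires_grad_(False)

x = torch.randn(1, seq_len, dim)                 # one pre-extracted feature sequence

with torch.no_grad():
    targets = teacher(x)                         # contextualized targets, computed once

loss = 0.0
for _ in range(n_masked_versions):               # amortize targets over several masks
    mask = torch.rand(1, seq_len) < 0.5          # random time-step mask
    x_masked = x.clone()
    x_masked[mask] = 0.0                         # simplification: the paper drops masked steps instead
    pred = student(x_masked)
    loss = loss + ((pred[mask] - targets[mask]) ** 2).mean()

(loss / n_masked_versions).backward()
```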
Measuring the Impact of Individual Domain Factors in Self-Supervised Pre-Training
Sanabria, Ramon, Hsu, Wei-Ning, Baevski, Alexei, Auli, Michael
Human speech data comprises a rich set of domain factors such as accent, syntactic and semantic variety, or acoustic environment. Previous work explores the effect of domain mismatch between pre-training and fine-tuning in automatic speech recognition as a whole but does not dissect the contribution of individual factors. In this paper, we present a controlled study to better understand the effect of such factors on the performance of pre-trained representations on automatic speech recognition. To do so, we pre-train models on modified natural speech or synthesized audio, altering a single domain factor at a time, and then measure performance after fine-tuning. Results show that phonetic domain factors play an important role during pre-training while grammatical and syntactic factors are far less important. To our knowledge, this is the first study to better understand the domain characteristics of pre-training data in self-supervised pre-training for speech.
Scaling Speech Technology to 1,000+ Languages
Pratap, Vineel, Tjandra, Andros, Shi, Bowen, Tomasello, Paden, Babu, Arun, Kundu, Sayani, Elkahky, Ali, Ni, Zhaoheng, Vyas, Apoorv, Fazel-Zarandi, Maryam, Baevski, Alexei, Adi, Yossi, Zhang, Xiaohui, Hsu, Wei-Ning, Conneau, Alexis, Auli, Michael
Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages, a small fraction of the more than 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task. The main ingredients are a new dataset based on readings of publicly available religious texts and the effective use of self-supervised learning. We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, as well as a language identification model for 4,017 languages. Experiments show that our multilingual speech recognition model more than halves the word error rate of Whisper on 54 languages of the FLEURS benchmark while being trained on a small fraction of the labeled data.
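The released MMS checkpoints can be tried directly; the sketch below assumes the Hugging Face Transformers integration and the publicly announced facebook/mms-1b-all ASR checkpoint with per-language adapters (verify identifiers and adapter methods against your library version).

```python
# Sketch of transcribing 16 kHz audio with a released MMS ASR checkpoint via the
# Hugging Face Transformers integration (model ID and adapter API assumed from
# the public MMS release; check your library version).
import torch
from transformers import Wav2Vec2ForCTC, AutoProcessor

model_id = "facebook/mms-1b-all"
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# MMS uses per-language adapters; switch both tokenizer and model to the target language.
processor.tokenizer.set_target_lang("fra")
model.load_adapter("fra")

audio = torch.zeros(16000)  # placeholder: one second of 16 kHz audio samples
inputs = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
transcription = processor.decode(logits.argmax(dim=-1)[0])
print(transcription)
```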
OVRL-V2: A simple state-of-art baseline for ImageNav and ObjectNav
Yadav, Karmesh, Majumdar, Arjun, Ramrakhya, Ram, Yokoyama, Naoki, Baevski, Alexei, Kira, Zsolt, Maksymets, Oleksandr, Batra, Dhruv
We present a single neural network architecture composed of task-agnostic components (ViTs, convolutions, and LSTMs) that achieves state-of-art results on both the ImageNav ("go to the location in this picture") and ObjectNav ("find a chair") tasks without any task-specific modules such as object detection, segmentation, mapping, or planning.
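A schematic sketch of the kind of task-agnostic agent described above: a ViT visual encoder feeding an LSTM policy through a small projection. Layer sizes, the projection, and the action head are illustrative assumptions, not the paper's exact configuration.

```python
# Schematic sketch (illustrative assumptions, not the paper's exact architecture):
# a task-agnostic navigation agent that encodes observations with a ViT,
# projects them down, and keeps memory in an LSTM policy.
import torch
import torch.nn as nn
import torchvision

class NavAgent(nn.Module):
    def __init__(self, num_actions=4, hidden=512):
        super().__init__()
        self.vit = torchvision.models.vit_b_16(weights=None)   # visual encoder
        self.vit.heads = nn.Identity()                          # keep features, drop classifier
        self.compress = nn.Linear(768, hidden)                  # stand-in for a compression layer
        self.policy = nn.LSTM(hidden, hidden, batch_first=True)
        self.action_head = nn.Linear(hidden, num_actions)

    def forward(self, frames, state=None):
        # frames: (batch, time, 3, 224, 224) RGB observations
        b, t = frames.shape[:2]
        feats = self.vit(frames.flatten(0, 1)).view(b, t, -1)
        out, state = self.policy(self.compress(feats), state)
        return self.action_head(out), state

agent = NavAgent()
logits, state = agent(torch.randn(1, 2, 3, 224, 224))
print(logits.shape)  # (1, 2, num_actions)
```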
AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations
Lian, Jiachen, Baevski, Alexei, Hsu, Wei-Ning, Auli, Michael
Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems. However, existing methods are either not entirely end-to-end or do not train joint representations of both modalities. In this paper, we introduce AV-data2vec, which addresses these challenges and builds audio-visual representations by predicting contextualized representations, an approach that has been successful in the uni-modal case. The model uses a shared transformer encoder for both audio and video and can combine both modalities to improve speech recognition. Results on LRS3 show that AV-data2vec consistently outperforms existing methods under most settings.
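A minimal sketch of one way such a shared audio-visual encoder can be wired: both modalities are projected to a common dimension, fused, and passed through a single transformer. Dimensions and the additive fusion are assumptions for illustration, not the paper's implementation.

```python
# Sketch (assumptions, not the paper's implementation): audio and video frame
# features are projected to a common dimension, fused by addition, and encoded
# by a single shared transformer, so either modality (or both) can be used.
import torch
import torch.nn as nn

class SharedAVEncoder(nn.Module):
    def __init__(self, audio_dim=80, video_dim=512, dim=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, dim)
        self.video_proj = nn.Linear(video_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, audio=None, video=None):
        # Fuse whichever modalities are present (assumes aligned frame rates/lengths).
        assert audio is not None or video is not None
        fused = 0
        if audio is not None:
            fused = fused + self.audio_proj(audio)
        if video is not None:
            fused = fused + self.video_proj(video)
        return self.encoder(fused)

enc = SharedAVEncoder()
out = enc(audio=torch.randn(2, 50, 80), video=torch.randn(2, 50, 512))
print(out.shape)  # (2, 50, 256)
```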
Introducing Semantics into Speech Encoders
Xu, Derek, Dong, Shuyan, Wang, Changhan, Kim, Suyoun, Lin, Zhaojiang, Shrivastava, Akshat, Li, Shang-Wen, Tseng, Liang-Hsuan, Baevski, Alexei, Lin, Guan-Ting, Lee, Hung-yi, Sun, Yizhou, Wang, Wei
Recent studies find that existing self-supervised speech encoders contain primarily acoustic rather than semantic information. As a result, pipelined systems that feed supervised automatic speech recognition (ASR) output into a large language model (LLM) achieve state-of-the-art results on semantic spoken language tasks by exploiting the rich semantic representations of the LLM. These systems come at the cost of labeled audio transcriptions, which are expensive and time-consuming to obtain. We propose a task-agnostic, unsupervised way of incorporating semantic information from LLMs into self-supervised speech encoders without labeled audio transcriptions. By introducing semantics, we improve the spoken language understanding performance of existing speech encoders by over 10% on intent classification, with modest gains in named entity resolution and slot filling, and improve spoken question answering FF1 score by over 2%. Our unsupervised approach achieves performance similar to supervised methods trained on over 100 hours of labeled audio transcripts, demonstrating the feasibility of unsupervised semantic augmentation of existing speech encoders.
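One generic way such a pipeline can be wired, sketched under loose assumptions (pseudo-transcripts from an off-the-shelf or unsupervised recognizer, LLM sentence embeddings as distillation targets); this is illustrative and not necessarily the paper's exact recipe.

```python
# Illustrative sketch only (not necessarily the paper's exact recipe): distill
# LLM sentence embeddings of *pseudo*-transcripts into a speech encoder, so no
# human-labeled transcriptions are needed. Encoders here are stand-in modules.
import torch
import torch.nn as nn
import torch.nn.functional as F

speech_encoder = nn.GRU(input_size=80, hidden_size=256, batch_first=True)  # stand-in
project = nn.Linear(256, 768)                    # map speech space -> LLM embedding space

def distill_step(features, llm_embedding):
    """features: (1, frames, 80) acoustic features for one utterance.
    llm_embedding: (768,) LLM embedding of its pseudo-transcript (precomputed)."""
    frame_states, _ = speech_encoder(features)
    utterance_vec = project(frame_states.mean(dim=1)).squeeze(0)   # pool over time
    # Pull the pooled speech representation toward the LLM's semantic embedding.
    return 1 - F.cosine_similarity(utterance_vec, llm_embedding, dim=0)

loss = distill_step(torch.randn(1, 120, 80), torch.randn(768))
loss.backward()
print(float(loss))
```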
A Comparison of Discrete Latent Variable Models for Speech Representation Learning
Zhou, Henry, Baevski, Alexei, Auli, Michael
Neural latent variable models enable the discovery of interesting structure in speech audio data. This paper presents a comparison of two different approaches which are broadly based on predicting future time-steps or auto-encoding the input signal. Our study compares the representations learned by vq-vae and vq-wav2vec in terms of sub-word unit discovery and phoneme recognition performance. Results show that future time-step prediction with vq-wav2vec achieves better performance. The best system achieves an error rate of 13.22 on the ZeroSpeech 2019 ABX phoneme discrimination challenge.
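Both model families ultimately discretize continuous frame representations against a learned codebook; the sketch below shows that nearest-neighbor quantization step in isolation (illustrative only; training details such as Gumbel-softmax or the auto-encoding loss are omitted).

```python
# Sketch of the discretization step both model families share (illustrative only):
# each continuous frame representation is snapped to its nearest codebook entry,
# and the resulting indices serve as discrete "sub-word" speech units.
import torch

torch.manual_seed(0)
codebook = torch.randn(320, 64)        # 320 learnable codewords of dimension 64
frames = torch.randn(100, 64)          # 100 continuous frame representations

# Pairwise distances between frames and codewords, then nearest-neighbor lookup.
distances = torch.cdist(frames, codebook)          # (100, 320)
unit_ids = distances.argmin(dim=1)                 # discrete unit per frame
quantized = codebook[unit_ids]                     # quantized representation per frame

print(unit_ids[:10].tolist())
```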