The First High-Performance Self-Supervised Algorithm That Works For Speech, Vision, And Text - Liwaiwai

#artificialintelligence

But while people appear to learn in a similar way regardless of how they get information -- whether they use sight or sound, for example -- there are currently big differences in the way self-supervised learning algorithms learn from images, speech, text, and other modalities. This discrepancy has been a significant barrier to applying advances in self-supervised learning more broadly. Because a powerful algorithm designed for, say, understanding images can't be directly applied to another modality, such as text, it is difficult to push several modalities ahead at the same rate. This is why Meta AI developed and is excited to announce data2vec, the first high-performance self-supervised algorithm that works for multiple modalities. We applied data2vec separately to speech, images, and text: it outperformed the previous best single-purpose algorithms for computer vision and speech, and it is competitive on NLP tasks.
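For a concrete picture of the idea: data2vec trains a student network, which sees a partially masked input, to predict the latent representations that a teacher network (an exponential moving average of the student) produces for the full, unmasked input. The sketch below is a minimal illustration of that objective in PyTorch; the Encoder, masking scheme, loss choice, and hyperparameters are simplified assumptions, not Meta's released implementation (which, for instance, averages targets over the top K Transformer blocks and uses a Smooth L1 loss).

# Minimal sketch of a data2vec-style objective (illustrative, not Meta's code):
# a student regresses the teacher's latent targets for masked inputs, and the
# teacher is an exponential moving average (EMA) of the student.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Placeholder backbone; in practice a modality-specific Transformer encoder."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
    def forward(self, x):
        return self.net(x)

student = Encoder()
teacher = copy.deepcopy(student)              # teacher starts as a copy of the student
for p in teacher.parameters():
    p.requires_grad_(False)                   # teacher is updated by EMA, not by gradients

opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
ema_decay = 0.999

def training_step(x, mask):
    # x: (batch, dim) input features; mask: (batch, 1) with 1 marking masked positions
    with torch.no_grad():
        targets = teacher(x)                  # latent targets from the unmasked input
    preds = student(x * (1 - mask))           # student only sees the masked view
    loss = F.mse_loss(preds, targets)         # simplified regression loss
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                     # EMA update of the teacher weights
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(ema_decay).add_(ps, alpha=1 - ema_decay)
    return loss.item()

Because the prediction target is a learned latent representation rather than pixels, words, or waveform samples, the same recipe can be reused across modalities; only the encoder and the masking strategy change.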


Meta's 'data2vec' is the next step toward One Neural Network to Rule Them All

ZDNet

The race is on to create one neural network that can process multiple kinds of data -- a more-general artificial intelligence that doesn't discriminate among types of data but instead can crunch them all within the same basic structure. The genre of multi-modality, as these neural networks are called, is seeing a flurry of activity in which different data, such as images, text, and speech audio, are passed through the same algorithm to produce a score on different tests such as image recognition, natural language understanding, or speech detection. And these ambidextrous networks are racking up scores on benchmark tests of AI. The latest achievement is what's called "data2vec," developed by researchers at the AI division of Meta, parent of Facebook, Instagram, and WhatsApp. The point, as Meta scientists Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli write, is to approach something more like the general learning ability that the human mind seems to encompass.


Self-Supervised Representation Learning: Introduction, Advances and Challenges

arXiv.org Machine Learning

Self-supervised representation learning methods aim to provide powerful deep feature learning without the requirement of large annotated datasets, thus alleviating the annotation bottleneck that is one of the main barriers to the practical deployment of deep learning today. These methods have advanced rapidly in recent years, with their efficacy approaching and sometimes surpassing fully supervised pre-training alternatives across a variety of data modalities, including image, video, sound, text, and graphs. This article introduces this vibrant area, including key concepts, the four main families of approach and the associated state of the art, and how self-supervised methods are applied to diverse modalities of data. We further discuss practical considerations including workflows, representation transferability, and compute cost. Finally, we survey the major open challenges in the field that provide fertile ground for future work.
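As a concrete taste of one such pretext task, the sketch below shows transformation prediction in the style of rotation prediction (RotNet): images are rotated by a random multiple of 90 degrees and the network is trained, without any human labels, to classify which rotation was applied. It is a minimal PyTorch illustration with a toy backbone and assumed shapes, not code from the article.

# Rotation-prediction pretext task: the only "labels" are the rotations we applied.
import torch
import torch.nn as nn
import torch.nn.functional as F

def rotate_batch(x):
    """Rotate each (square) image by a random multiple of 90 degrees; return images and targets."""
    k = torch.randint(0, 4, (x.size(0),))                       # 0, 90, 180, 270 degrees
    rotated = torch.stack([torch.rot90(img, int(r), dims=(1, 2)) for img, r in zip(x, k)])
    return rotated, k

class RotationClassifier(nn.Module):
    """Toy backbone plus a 4-way head; any image encoder could take its place."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(16, 4)
    def forward(self, x):
        return self.head(self.features(x))

model = RotationClassifier()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def pretext_step(images):                # images: (batch, 3, H, H), no annotations needed
    rotated, targets = rotate_batch(images)
    loss = F.cross_entropy(model(rotated), targets)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

After pretext training, it is the encoder's features (here, model.features) that get transferred to downstream tasks; the rotation head is discarded.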


VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

arXiv.org Artificial Intelligence

We present a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks. We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance on the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval. Furthermore, we study a modality-agnostic, single-backbone Transformer by sharing weights among the three modalities. We show that the convolution-free VATT outperforms state-of-the-art ConvNet-based architectures on the downstream tasks. In particular, VATT's vision Transformer achieves top-1 accuracy of 82.1% on Kinetics-400, 83.6% on Kinetics-600, and 41.1% on Moments in Time, setting new records while avoiding supervised pre-training. Transferring to image classification leads to 78.7% top-1 accuracy on ImageNet, compared to 64.7% when training the same Transformer from scratch, showing the generalizability of our model despite the domain gap between videos and images. VATT's audio Transformer also sets a new record on waveform-based audio event recognition by achieving an mAP of 39.4% on AudioSet without any supervised pre-training.
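To make the "multimodal contrastive losses" concrete, here is a simplified PyTorch sketch of aligning two modalities in a common space so that matched pairs score higher than mismatched ones. VATT itself combines noise-contrastive estimation for video-audio pairs with a multiple-instance variant (MIL-NCE) for video-text; the projection heads, dimensions, and temperature below are illustrative assumptions, not the paper's exact setup.

# Simplified cross-modal alignment loss (illustrative; not VATT's exact objective).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceHead(nn.Module):
    """Projects a modality-specific embedding into the common comparison space."""
    def __init__(self, in_dim, out_dim=256):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

video_head = SharedSpaceHead(in_dim=1024)     # input dims are assumptions for the sketch
audio_head = SharedSpaceHead(in_dim=512)

def nce_alignment_loss(video_emb, audio_emb, temperature=0.07):
    """Symmetric contrastive loss: matched (video, audio) pairs lie on the diagonal."""
    v = video_head(video_emb)                 # (batch, 256) unit-norm projections
    a = audio_head(audio_emb)                 # (batch, 256) unit-norm projections
    logits = v @ a.t() / temperature          # pairwise similarities across the batch
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

The same pattern extends to the video-text pair; the modality-agnostic variant studied in the paper changes which backbone produces the embeddings, not the form of the alignment loss.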