Facebook AI Wav2Vec 2.0: Automatic Speech Recognition From 10 Minute Sample


Speech-to-text applications have never been so plentiful, popular or powerful, with researchers' pursuit of ever-better automatic speech recognition (ASR) system performance bearing fruit thanks to huge advances in machine learning technologies and the increasing availability of large speech datasets. Current speech recognition systems require thousands of hours of transcribed speech to reach acceptable performance. However, a lack of transcribed audio data for the less widely spoken of the world's 7,000 languages and dialects makes it difficult to train robust speech recognition systems for those languages. To help ASR development for such low-resource languages and dialects, Facebook AI researchers have open-sourced wav2vec 2.0, a new algorithm for self-supervised learning of speech representations. The paper Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations claims to "show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler." A Facebook AI tweet says the new algorithm can enable automatic speech recognition models with just 10 minutes of transcribed speech data.
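The self-supervised objective at the heart of wav2vec 2.0 masks spans of latent speech features and trains the model to identify the true latent for each masked frame among sampled distractors via a contrastive loss. The NumPy sketch below illustrates that idea only; the shapes, span length, masking probability and similarity scoring are illustrative stand-ins, not Facebook AI's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_spans(num_frames, span_len=10, mask_prob=0.065):
    """Sample random start frames and mask a contiguous span after each
    (span length and probability here are illustrative)."""
    mask = np.zeros(num_frames, dtype=bool)
    for s in np.flatnonzero(rng.random(num_frames) < mask_prob):
        mask[s:s + span_len] = True
    return mask

def contrastive_loss(context, targets, mask, num_distractors=5):
    """For each masked frame, score the true target against distractors
    drawn from other masked frames using cosine similarity, then take
    softmax cross-entropy with the true target at index 0."""
    idx = np.flatnonzero(mask)
    losses = []
    for i in idx:
        others = idx[idx != i]
        picks = rng.choice(others, size=min(num_distractors, len(others)),
                           replace=False)
        cand = np.vstack([targets[i], targets[picks]])
        sims = cand @ context[i] / (
            np.linalg.norm(cand, axis=1) * np.linalg.norm(context[i]) + 1e-9)
        logits = sims - sims.max()
        p = np.exp(logits) / np.exp(logits).sum()
        losses.append(-np.log(p[0] + 1e-9))
    return float(np.mean(losses))

T, D = 200, 16
latents = rng.standard_normal((T, D))           # stand-in quantized latents
context = latents + 0.1 * rng.standard_normal((T, D))  # stand-in Transformer output
mask = mask_spans(T)
loss = contrastive_loss(context, latents, mask)
```

Because the context vectors here are just noisy copies of the targets, the loss is small but positive; in the real model, minimizing this loss is what shapes the representations that are later fine-tuned on as little as 10 minutes of transcribed speech.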

Facebook's Dynabench 'Radically Rethinks AI Benchmarking'


In the ever-expanding world of computer hardware and software, benchmarks provide a robust method for comparing quality and performance across different system architectures. From MNIST to ImageNet to GLUE, benchmarks have also come to play a hugely important role in driving and measuring progress in AI research. When introducing any new benchmark, it's generally best not to make it so easy that it will quickly become outdated, or so hard that everyone will simply fail. When new models bury benchmarks, which is happening faster and faster in AI these days, researchers must engage in the time-consuming work of making new ones. Facebook believes that the increasing benchmark saturation in recent years -- especially in natural language processing (NLP) -- means it's time to "radically rethink the way AI researchers do benchmarking and to break free of the limitations of static benchmarks." Their solution is a new research platform for dynamic data collection and benchmarking called Dynabench, which they propose will offer a more accurate and sustainable way of evaluating progress in AI.
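The dynamic data collection Dynabench proposes can be pictured as a loop: humans write examples intended to fool the current model, and the examples that succeed become the next, harder benchmark round. The toy below sketches only that loop structure; the keyword "model" and the hand-written annotator attempts are hypothetical stand-ins, not the Dynabench platform.

```python
# Toy sketch of a dynamic-benchmarking loop in the spirit of Dynabench.
# The "model" and "annotator" below are illustrative stand-ins.

def model_predict(example):
    """Stand-in model: labels an example positive iff it contains 'good'."""
    return "positive" if "good" in example else "negative"

def annotator_attempts():
    """Stand-in human annotators proposing (example, gold_label) pairs,
    some crafted to fool the keyword model."""
    return [
        ("a good movie", "positive"),             # model gets this right
        ("not good at all", "negative"),          # fools the keyword model
        ("terrible film", "negative"),            # model gets this right
        ("good grief, what a mess", "negative"),  # fools the keyword model
    ]

def collect_round():
    """Keep only the examples the current model misclassifies;
    these form the next, harder benchmark round."""
    return [(ex, gold) for ex, gold in annotator_attempts()
            if model_predict(ex) != gold]

hard_examples = collect_round()
```

Retraining on each collected round and then repeating the loop is what makes the benchmark dynamic rather than static: the target moves as the models improve.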

Meet ByteDance AI's Xiaomingbot: World's First Multilingual and Multimodal AI News Agent


Continuous improvements in modern natural language generation in recent years have enabled bots that can perform automatic news reporting. This has practical applications, for example in minor league sports, where result data is available but it is not always cost-efficient to send human reporters to the contests. Most existing robot reporters, however, focus exclusively on text generation. ByteDance's Xiaomingbot, by contrast, contains four components: a news generator, a news translator, a cross-lingual newsreader and an animated avatar. Its input is a data table containing game and event records, and its output is an animated avatar reading a news article with a synthesized voice.
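The four-stage pipeline, from structured game records to a spoken article, can be sketched as a chain of functions. Every function and field name below is a hypothetical placeholder standing in for one of the four components, not ByteDance's API.

```python
# Hypothetical sketch of Xiaomingbot's pipeline:
# data table -> generated article -> translation -> spoken/animated output.
# All functions are illustrative stand-ins, not ByteDance's implementation.

def generate_news(game_records):
    """News generator: turn structured game records into article text."""
    home, away = game_records["home"], game_records["away"]
    hs, aws = game_records["home_score"], game_records["away_score"]
    winner, loser = (home, away) if hs > aws else (away, home)
    return f"{winner} won {max(hs, aws)}-{min(hs, aws)} against {loser}."

def translate(article, target_lang):
    """News translator: cross-lingual translation (stubbed here)."""
    return f"[{target_lang}] {article}"

def read_aloud(article):
    """Cross-lingual newsreader + animated avatar: synthesize speech for
    the avatar to read (stubbed as a transcript)."""
    return {"transcript": article, "voice": "synthesized"}

records = {"home": "Tigers", "away": "Sharks",
           "home_score": 3, "away_score": 1}
article = generate_news(records)
output = read_aloud(translate(article, "en"))
```

The design point the sketch captures is that each component consumes the previous one's output, so the text generator, translator and newsreader can be developed and improved independently.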

New ML Algorithm Tunes Quantum Devices Faster Than Human Experts


The machine learning community has high hopes for quantum computers -- devices that can store and process quantum data and are expected to perform many computational tasks exponentially faster than classical computers. Device-to-device variability, however, presents challenges for the scalability of semiconductor quantum devices. In a new Nature paper, researchers from the University of Oxford, DeepMind, University of Basel and Lancaster University propose a novel machine learning (ML) algorithm that can tune quantum devices to optimal performance in a median time of under 70 minutes, faster than a typical tuning process performed by human experts. The proposed algorithm is also approximately 180 times faster than an automated random search of the parameter space, and is capable of dealing with different material systems and device architectures. "Until this work, coarse tuning required manual input or was restricted to a small gate voltage subspace," the researchers explain.

Virginia Tech & Facebook Video Completion Algorithm Achieves SOTA Results


Video completion is a challenging computer vision task that involves filling a given space-time region with newly synthesized content -- in effect, revealing the unseen. Video completion is widely used in applications such as video restoration, editing and watermark/logo removal. Most advanced video completion methods are flow-based: synthesizing colour and flow jointly and propagating colour along flow trajectories to improve temporal coherence. Now, researchers from Virginia Tech and Facebook have introduced a novel flow-based video completion algorithm that compares favourably with the state of the art in the field. Existing flow-based video completion methods have three limitations: they are unable to synthesize sharp flow edges and so tend to produce over-smoothed results; flow vectors chained across adjacent frames can only form continuous temporal constraints, so colour cannot be constrained or propagated to many parts of a video; and they propagate colour values directly without accounting for factors such as lighting changes, shadows and so on. The researchers validated their proposed method on 150 video sequences from the DAVIS dataset, where visual and quantitative results show that the proposed method compares favourably with state-of-the-art algorithms.
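The core flow-based idea -- following per-pixel flow vectors out of a missing region into a neighbouring frame and copying colour back from there -- can be illustrated with a minimal NumPy sketch. The constant flow field, frame shapes and nearest-neighbour sampling below are toy assumptions for illustration, not the Virginia Tech/Facebook algorithm.

```python
import numpy as np

def propagate_colour(frame_next, flow, hole_mask):
    """Fill missing pixels by following each pixel's flow vector into the
    next frame and copying colour from the landing position.
    Nearest-neighbour sampling only; a toy stand-in for flow-guided
    colour propagation in flow-based video completion."""
    h, w, _ = frame_next.shape
    filled = np.zeros_like(frame_next)
    still_missing = hole_mask.copy()
    for y, x in zip(*np.nonzero(hole_mask)):
        dy, dx = flow[y, x]
        ty, tx = int(round(y + dy)), int(round(x + dx))
        if 0 <= ty < h and 0 <= tx < w:          # landing point is in frame
            filled[y, x] = frame_next[ty, tx]
            still_missing[y, x] = False          # pixel recovered via flow
    return filled, still_missing

# Toy example: a 4x4 frame, a 2x2 hole, constant flow of one row downward.
frame_next = np.arange(4 * 4 * 3).reshape(4, 4, 3).astype(float)
flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0                               # every pixel moves down one row
hole = np.zeros((4, 4), dtype=bool)
hole[1:3, 1:3] = True
filled, missing = propagate_colour(frame_next, flow, hole)
```

The sketch also makes the article's second limitation concrete: propagation only reaches pixels whose chained flow trajectories land on known content, and it copies colour values directly, which is exactly where lighting changes and shadows would break a naive copy.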