Collaborating Authors

Ouroboros: On Accelerating Training of Transformer-Based Language Models Machine Learning

Language models are essential for natural language processing (NLP) tasks, such as machine translation and text summarization. Remarkable performance has been demonstrated recently across many NLP domains via a Transformer-based language model with over a billion parameters, verifying the benefits of model size. Model parallelism is required if a model is too large to fit in a single computing device. Current methods for model parallelism either suffer from backward locking in backpropagation or are not applicable to language models. We propose the first model-parallel algorithm that speeds the training of Transformer-based language models. We also prove that our proposed algorithm is guaranteed to converge to critical points for non-convex problems. Extensive experiments on Transformer and Transformer-XL language models demonstrate that the proposed algorithm obtains a much faster speedup beyond data parallelism, with comparable or better accuracy. Code to reproduce experiments is to be found at \url{}.

The Open Source Technologies Behind One of the Biggest Language Models in History


Transformers and pre-trained models can be considered one of the most important developments in the recent years of deep learning. Beyond the research breakthroughts, Transformers have redefined the natural language understanding(NLU) space sparking a race between lead AI vendors to build bigger and more efficient neural networks. The Transformer architecture has been behind famous models such as Google's BERT, Facebook's RoBERTa or OpenAI's GPT-3. Is not surprising that many people believe that only big companies have the resources to tackle the implementation of Transformer models. Earlier this year, the deep learning community was astonished when Microsoft Research unveiled the Turing Natural Language Generation (T-NLG) model which, at the time, was considered the largest natural language processing(NLP) model in the history of artificial intelligence(AI) with 17 billion parameters.

Transformer-based Online CTC/attention End-to-End Speech Recognition Architecture Machine Learning

Recently, Transformer has gained success in automatic speech recognition (ASR) field. However, it is challenging to deploy a Transformer-based end-to-end (E2E) model for online speech recognition. In this paper, we propose the Transformer-based online CTC/attention E2E ASR architecture, which contains the chunk self-attention encoder (chunk-SAE) and the monotonic truncated attention (MTA) based self-attention decoder (SAD). Firstly, the chunk-SAE splits the speech into isolated chunks. To reduce the computational cost and improve the performance, we propose the state reuse chunk-SAE. Sencondly, the MTA based SAD truncates the speech features monotonically and performs attention on the truncated features. To support the online recognition, we integrate the state reuse chunk-SAE and the MTA based SAD into online CTC/attention architecture. We evaluate the proposed online models on the HKUST Mandarin ASR benchmark and achieve a 23.66% character error rate (CER) with a 320 ms latency. Our online model yields as little as $0.19\%$ absolute CER degradation compared with the offline baseline, and achieves significant improvement over our prior work on Long Short-Term Memory (LSTM) based online E2E models.

The Emergence Of Hardware As A Key Enabler For The Age Of Artificial Intelligence


Over the past few decades, software has been the engine of innovation for countless applications. From PCs to mobile phones, well-defined hardware platforms and instruction set architectures (ISA) have enabled many important advancements across vertical markets. The emergence of abundant-data computing is changing the software-hardware balance in a dramatic way. Diverse AI applications in facial recognition, virtual assistance, autonomous vehicles and more are sharing a common feature: They rely on hardware as the core enabler of innovation. Since 2017, the AI hardware market has grown 60-70% annually, and is projected to reach $65 billion by 2025.

MalBERT: Using Transformers for Cybersecurity and Malicious Software Detection Artificial Intelligence

In recent years we have witnessed an increase in cyber threats and malicious software attacks on different platforms with important consequences to persons and businesses. It has become critical to find automated machine learning techniques to proactively defend against malware. Transformers, a category of attention-based deep learning techniques, have recently shown impressive results in solving different tasks mainly related to the field of Natural Language Processing (NLP). In this paper, we propose the use of a Transformers' architecture to automatically detect malicious software. We propose a model based on BERT (Bidirectional Encoder Representations from Transformers) which performs a static analysis on the source code of Android applications using preprocessed features to characterize existing malware and classify it into different representative malware categories. The obtained results are promising and show the high performance obtained by Transformer-based models for malicious software detection.