Transformer's Self-Attention


In 2017, Vaswani et al. published a paper titled "Attention Is All You Need" at the NeurIPS conference. The Transformer architecture uses no recurrence or convolution; it relies solely on attention mechanisms.
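The core operation the paper builds on is scaled dot-product self-attention: every position in a sequence produces a query, key, and value, and its output is a weighted sum of all values. A minimal NumPy sketch (all names and dimensions here are illustrative, not from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise attention logits
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of values

rng = np.random.default_rng(0)
d_model, d_k, seq_len = 8, 4, 5
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 4)
```

Multi-head attention, as used in the Transformer, simply runs several such maps in parallel with separate projection matrices and concatenates the results.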

The Rise of the Transformers


Rise of the Transformers with Self-Attention Mechanism. This article continues answering the questions that my friends April Rudin, Tripp Braden, Danielle Guzman and Richard Foster-Fletcher asked about the future of AI. Furthermore, Irene Iyakovet's interview with me about how

A Tensorized Transformer for Language Modeling

Neural Information Processing Systems

The latest developments in neural models have connected the encoder and decoder through a self-attention mechanism. In particular, the Transformer, which is based solely on self-attention, has led to breakthroughs in Natural Language Processing (NLP) tasks. However, the multi-head attention mechanism, a key component of the Transformer, limits the model's effective deployment in resource-limited settings. In this paper, based on the ideas of tensor decomposition and parameter sharing, we propose a novel self-attention model (namely Multi-linear attention) with Block-Term Tensor Decomposition (BTD). We test and verify the proposed attention method on three language modeling tasks (i.e., PTB, WikiText-103 and One-Billion) and a neural machine translation task (i.e., WMT-2016 English-German).
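The Block-Term idea is to represent a 3-way attention tensor as a sum of small Tucker blocks, with the query, key, and value matrices serving as shared factor matrices. The toy sketch below shows only that sum-of-Tucker contraction; the function name, the averaging over blocks, and the random cores are illustrative assumptions, not the paper's exact parameterization (which also splits the tensor back into a sequence-by-feature output):

```python
import numpy as np

def multilinear_attention_tensor(Q, K, V, cores):
    """Sum-of-Tucker (Block-Term) reconstruction of a 3-way attention
    tensor from query/key/value factor matrices. Toy sketch only."""
    n, r = Q.shape
    T = np.zeros((n, n, n))
    for G in cores:                        # each core: (r, r, r)
        T += np.einsum('abc,ia,jb,kc->ijk', G, Q, K, V)
    return T / len(cores)                  # average the blocks

rng = np.random.default_rng(1)
n, r, P = 6, 3, 2                          # seq length, rank, number of blocks
Q, K, V = (rng.normal(size=(n, r)) for _ in range(3))
cores = [rng.normal(size=(r, r, r)) for _ in range(P)]
T = multilinear_attention_tensor(Q, K, V, cores)
print(T.shape)  # (6, 6, 6)
```

The parameter saving comes from storing only the small cores and the shared factors instead of full per-head projection matrices.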

Multi-View Self-Attention Based Transformer for Speaker Recognition

Artificial Intelligence

Initially developed for natural language processing (NLP), the Transformer model is now widely used for speech processing tasks such as speaker recognition, thanks to its powerful sequence modeling capabilities. However, conventional self-attention mechanisms were originally designed for modeling textual sequences, without considering the characteristics of speech and speaker modeling. Besides, different Transformer variants for speaker recognition have not been well studied. In this work, we propose a novel multi-view self-attention mechanism and present an empirical study of different Transformer variants, with or without the proposed attention mechanism, for speaker recognition. Specifically, to balance the capabilities of capturing global dependencies and modeling locality, we propose a multi-view self-attention mechanism for the speaker Transformer, in which different attention heads can attend to different ranges of the receptive field. Furthermore, we introduce and compare five Transformer variants with different network architectures, embedding locations, and pooling methods to learn speaker embeddings. Experimental results on the VoxCeleb1 and VoxCeleb2 datasets show that the proposed multi-view self-attention mechanism improves speaker recognition performance, and that the proposed speaker Transformer network attains excellent results compared with state-of-the-art models.
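One simple way to give different heads different receptive-field ranges is to mask each head's attention scores to a band of a different width, with one head left unrestricted for the global view. This is a minimal sketch of that idea; the window sizes, masking scheme, and function names are assumptions for illustration, not the paper's exact design:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def banded_mask(n, window):
    """Additive mask letting each position attend within +/- window steps;
    window=None means an unrestricted (global) view."""
    if window is None:
        return np.zeros((n, n))
    idx = np.arange(n)
    allowed = np.abs(idx[:, None] - idx[None, :]) <= window
    return np.where(allowed, 0.0, -np.inf)

def multi_view_attention(Q, K, V, windows):
    """One head per entry in `windows`; each head sees a different range."""
    n, d = Q.shape
    heads = []
    for w in windows:
        scores = Q @ K.T / np.sqrt(d) + banded_mask(n, w)
        heads.append(softmax(scores) @ V)
    return np.concatenate(heads, axis=-1)   # (n, d * num_heads)

rng = np.random.default_rng(2)
n, d = 10, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = multi_view_attention(Q, K, V, windows=[1, 3, None])  # local, mid, global
print(out.shape)  # (10, 12)
```

Concatenating the heads lets a downstream projection mix local and global evidence, which is the balance the abstract describes.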

Comprehensive Guide to Transformers


You have a piece of paper with text on it, and you want to build a model that can translate this text into another language. How do you approach this? The first problem is the variable size of the text: standard linear-algebra models cannot handle vectors of varying dimension. The default way of dealing with such problems is to use the bag-of-words model (1).
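A bag-of-words model solves the variable-size problem by mapping every text, however long, to a fixed-length vector of word counts over a shared vocabulary. A minimal sketch (the helper name and tokenization by whitespace are simplifying assumptions):

```python
from collections import Counter

def bag_of_words(texts):
    """Map variable-length texts to fixed-length count vectors over a shared vocabulary."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    counts_per_text = [Counter(t.lower().split()) for t in texts]
    vectors = [[c.get(w, 0) for w in vocab] for c in counts_per_text]
    return vocab, vectors

vocab, vecs = bag_of_words(["the cat sat", "the cat and the dog"])
print(vocab)    # ['and', 'cat', 'dog', 'sat', 'the']
print(vecs[1])  # [1, 1, 1, 0, 2]
```

The price of this fixed-size representation is that word order is discarded entirely, which is exactly the limitation that sequence models, and ultimately self-attention, were introduced to address.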