Reviews: Attention Is All You Need

Neural Information Processing Systems 

The paper presents a new encoder-decoder architecture for sequence-to-sequence modeling that is based solely on (multi-layered) attention networks combined with standard feed-forward networks, as opposed to the common scheme of using recurrent or convolutional neural networks. The paper claims two main advantages for this architecture: (1) reduced training time due to the reduced complexity of the architecture, and (2) new state-of-the-art results on standard WMT data sets, outperforming previous work by about 1 BLEU point.

Strengths:
- The paper argues well that (1) can be achieved by avoiding recurrent or convolutional layers, and the complexity analysis in Table 1 strengthens this argument.
- The main strengths of the paper are that it proposes an entirely novel architecture without recurrence or convolutions, and that it advances the state of the art.

Weaknesses:
- While the general architecture of the model is described well and is illustrated by figures, architectural details lack mathematical definitions, multi-head attention being one example (a sketch of the missing definitions is given below).
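For concreteness, the definitions that would address this weakness could be written as follows, using the paper's notation with $d_k$ the key dimension and $h$ the number of attention heads (a sketch reconstructed from the paper's description, not a quotation of its equations):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},
\qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V})
```

Here $W_i^{Q}, W_i^{K}, W_i^{V}$ and $W^{O}$ are learned projection matrices; each head attends in its own projected subspace before the results are concatenated and projected back.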
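To make the complexity argument of Table 1 concrete: the core scaled dot-product attention is a single pair of matrix products over the sequence, costing $O(n^2 \cdot d)$ per layer and requiring no sequential recurrence. A minimal NumPy sketch (function name and shapes are illustrative, not from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Illustrative single-head attention. Q, K: (n, d_k); V: (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) attention logits
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax over keys
    return weights @ V                              # (n, d_v) weighted sums of values

# Example: 5 positions, dimension 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)         # shape (5, 8)
```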