A Appendix

Neural Information Processing Systems 

For a detailed treatment, please refer to [1]. As mentioned in Section 3.1 of the main text, in its simplest form, self-attention is described as:

$y = \sigma(QK^\top)V$

[...] We have highlighted the same terms with the same color in Equations 2 and 3 to show that the results are indeed identical. Applying softmax to each row only brings scalar multipliers to each row, and the proof still holds.

As discussed in Section 3.2 of the main text, this formulation lets us convert an observation signal [...]

Table 1 in the main text contains the hyper-parameters used for each experiment. [...] Computing Engines (GCE) on an instance that has one V100 GPU.
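The remark that a row-wise softmax only contributes one scalar multiplier per row can be checked numerically. The sketch below is illustrative and is not taken from the paper's code: the helper names, the matrix sizes, and the random inputs are all hypothetical, and it uses only the Python standard library. It also checks that reordering the rows of K and V together leaves the output unchanged when Q is held fixed. Treating this as the property the proof is concerned with is an assumption, since the excerpt does not state it.

```python
import math
import random


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def row_softmax(m):
    """Softmax over each row, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(q, k, v):
    """Simplest-form self-attention: y = softmax(Q K^T) V, softmax taken per row."""
    return matmul(row_softmax(matmul(q, transpose(k))), v)


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two same-shaped matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes, chosen only for illustration.
n, m, d_k, d_v = 5, 3, 4, 2
Q = rand_matrix(m, d_k)  # held fixed (assumption: Q does not depend on input order)
K = rand_matrix(n, d_k)
V = rand_matrix(n, d_v)

# Check 1: row-wise softmax equals exp(scores) with each row rescaled by one scalar.
scores = matmul(Q, transpose(K))
exp_scores = [[math.exp(s) for s in row] for row in scores]
row_scale = [1.0 / sum(row) for row in exp_scores]
rescaled = [[c * e for e in row] for c, row in zip(row_scale, exp_scores)]
softmax_gap = max_abs_diff(row_softmax(scores), rescaled)

# Check 2 (assumed property): permuting the rows of K and V together leaves y unchanged.
perm = list(range(n))
rng.shuffle(perm)
K_perm = [K[i] for i in perm]
V_perm = [V[i] for i in perm]
y = attention(Q, K, V)
perm_gap = max_abs_diff(y, attention(Q, K_perm, V_perm))

print("softmax vs. rescaled exp, max abs difference:", softmax_gap)
print("original vs. permuted input, max abs difference:", perm_gap)
```

Both differences should be at the level of floating-point rounding. The per-row scalar in the first check is the reciprocal of that row's sum of exponentials, which is why the normalization does not change which terms appear in each row of the product.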