A Appendix

Neural Information Processing Systems 

We begin by formally defining multihead self-attention and Transformer. Our definition is equivalent to Vaswani et al. (2017) [68], except we omit layer normalization for simplicity as in [81, 23, 34]. Consequently, each equivalence class γ in Definition 3 is a distinct set of all order-l multi-indices having a specific equality pattern. Now, for each equivalence class, we define the corresponding basis tensor as follows: Definition 4. I. Given a set of features X R Proof of Lemma 1 (Section 3.3) To prove Lemma 1, we need to show that each basis tensor B Here, our key idea is to break down the inclusion test (i, j) µ into equivalent but simpler Boolean tests that can be implemented in self-attention (Eq. To achieve this, we show some supplementary Lemmas.