bert
5812f92450ccaf17275500841c70924a-Supplemental.pdf
We present a brief proof about the local optimality of one-hot encodings in the decision-theoretic framework presented in Section 3.2. We seek to prove that, under assumptions of an identity reward matrix, tokens constrained to a unit hypercube, and gaussian additive noise, one-hot tokens are an optimally robust communication strategy. We only seek to prove local optimality, as one many trivially generate multiple, equally optimal tokens by, for example, flipping all bits. The following derivation uses Karush-Kuhn-Tucker (KKT) conditions, a generalization of Lagrange multipliers [17]. We maximize the function, subject to constraints. T>j Ti Ti + ||Tj||2 Ti # ~µi + ~λi = ~0 (13) (14) We seek to show that one-hot vectors are an optimum, so we now show that one-hot vectors indeed respect the constraints and set the derivatives to zero.
Pay Attention to MLPs
Transformers [1] have become one of the most important architectural innovations in deep learning and have enabled many breakthroughs over the past few years. Here we propose a simple network architecture, gMLP, based on MLPs with gating, and show that it can perform as well as Transformers in key language and vision applications. Our comparisons show that self-attention is not critical for Vision Transformers, as gMLP can achieve the same accuracy. For BERT, our model achieves parity with Transformers on pretraining perplexity and is better on some downstream NLP tasks. On finetuning tasks where gMLP performs worse, making the gMLP model substantially larger can close the gap with Transformers. In general, our experiments show that gMLP can scale as well as Transformers over increased data and compute.
88dddaf430b5bc38ab8228902bb61821-Supplemental-Conference.pdf
Supplementary figure 1. Ablanullon study, each row represents the ablated layer and each column the module that is ablated from that layer, for example the first panel shows ablanullon of anullennullon - key in layer 5. Different layers in GPT2 - XL model were ablated and the consequence of ablanullon on curvature measured for 2000 sentences in UD corpus. Red bar shows the layer where ablanullon was applied. AB Supplementary figure 3. A. curvature values for sampled 2000 sentence in RWKV model ( RNN) for both trained an untrained version. B correlanullon between model generated surprisal and curvature in RWKV model. Diamonds: syntacnullc surprisal Supplementary figure 5: E ffect of different decoding strategies in GPT2 - XL sequence generanullon and its comparison to ground - truth(true) same as figure 4b in the main manuscript.