GLU Variants Improve Transformer
The Transformer [Vaswani et al., 2017] sequence-to-sequence model alternates between multi-head attention and what it calls "position-wise feed-forward networks" (FFN). The FFN takes a vector $x$ (the hidden representation at a particular position in the sequence) and passes it through two learned linear transformations (represented by the matrices $W_1$ and $W_2$ and bias vectors $b_1$ and $b_2$). A rectified-linear (ReLU) [Glorot et al., 2011] activation function is applied between the two linear transformations.

$$\mathrm{FFN}(x, W_1, W_2, b_1, b_2) = \max(0, xW_1 + b_1)W_2 + b_2 \tag{1}$$

Following the T5 codebase [Raffel et al., 2019], we use a version with no bias:

$$\mathrm{FFN}_{\mathrm{ReLU}}(x, W_1, W_2) = \max(xW_1, 0)W_2 \tag{2}$$

Subsequent work has proposed replacing the ReLU with other nonlinear activation functions such as Gaussian Error Linear Units, $\mathrm{GELU}(x) = x\Phi(x)$ [Hendrycks and Gimpel, 2016], and $\mathrm{Swish}_\beta(x) = x\sigma(\beta x)$ [Ramachandran et al., 2017].
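The bias-free FFN of equation (2), and the effect of swapping ReLU for GELU or Swish, can be sketched in plain Python. This is a minimal illustration and not the T5 implementation: the helper names `matvec` and `ffn`, the default `beta=1.0`, and the toy weights are all assumptions made for the example, and the code is not numerically hardened (for instance, `swish` can overflow for very large negative inputs).

```python
import math


def relu(z):
    # max(z, 0), as in equation (2)
    return max(z, 0.0)


def gelu(z):
    # GELU(z) = z * Phi(z), where Phi is the standard normal CDF,
    # written here in terms of the error function.
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def swish(z, beta=1.0):
    # Swish_beta(z) = z * sigmoid(beta * z); beta=1.0 is an assumed default.
    return z / (1.0 + math.exp(-beta * z))


def matvec(x, W):
    # Row vector x (length d_in) times matrix W (d_in rows, d_out columns).
    return [sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(W[0]))]


def ffn(x, W1, W2, activation=relu):
    # Bias-free position-wise FFN: activation(x W1) W2,
    # with the activation applied element-wise to the hidden vector.
    hidden = [activation(h) for h in matvec(x, W1)]
    return matvec(hidden, W2)


# Toy example (weights chosen by hand, purely illustrative).
x = [1.0, 2.0]
W1 = [[1.0, 0.0, -1.0],
      [0.5, 1.0, 0.0]]   # shape 2 x 3
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]        # shape 3 x 2

out_relu = ffn(x, W1, W2)                    # x W1 = [2, 2, -1] -> [2, 2, 0] -> [2.0, 2.0]
out_gelu = ffn(x, W1, W2, activation=gelu)
out_swish = ffn(x, W1, W2, activation=swish)
```

With ReLU the negative hidden unit is zeroed out exactly, whereas GELU and Swish pass a small negative value through, so their outputs differ slightly from the ReLU result on the same weights.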
Feb-12-2020