GLU Variants Improve Transformer

Feb-12-2020–arXiv.org Machine Learning

The Transformer [Vaswani et al., 2017] sequence-to-sequence model alternates between multi-head attention and what it calls "position-wise feed-forward networks" (FFN). The FFN takes a vector $x$ (the hidden representation at a particular position in the sequence) and passes it through two learned linear transformations, represented by the matrices $W_1$ and $W_2$ and bias vectors $b_1$ and $b_2$. A rectified-linear (ReLU) [Glorot et al., 2011] activation function is applied between the two linear transformations:

$$\mathrm{FFN}(x, W_1, W_2, b_1, b_2) = \max(0, xW_1 + b_1)W_2 + b_2 \tag{1}$$

Following the T5 codebase [Raffel et al., 2019], we use a version with no bias:

$$\mathrm{FFN}_{\mathrm{ReLU}}(x, W_1, W_2) = \max(xW_1, 0)W_2 \tag{2}$$

Subsequent work has proposed replacing the ReLU with other nonlinear activation functions such as Gaussian Error Linear Units, $\mathrm{GELU}(x) = x\Phi(x)$ [Hendrycks and Gimpel, 2016], and $\mathrm{Swish}_\beta(x) = x\sigma(\beta x)$ [Ramachandran et al., 2017].
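To make the bias-free FFN and the alternative activations concrete, the following is a minimal sketch in plain Python. It is an illustration only, not the T5 implementation: the function names (`relu`, `gelu`, `swish`, `vec_mat`, `ffn`) and the toy weights are assumptions chosen for this example, and vectors and matrices are represented as nested lists rather than tensors.

```python
import math


def relu(v):
    # ReLU(v) = max(v, 0)
    return max(v, 0.0)


def gelu(v):
    # GELU(v) = v * Phi(v), where Phi is the standard normal CDF.
    return v * 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))


def swish(v, beta=1.0):
    # Swish_beta(v) = v * sigmoid(beta * v)
    return v / (1.0 + math.exp(-beta * v))


def vec_mat(x, W):
    # Row vector x (length n) times matrix W (n rows, m columns).
    return [sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(W[0]))]


def ffn(x, W1, W2, activation=relu):
    # Bias-free FFN: activation(x W1) W2, with the activation applied
    # element-wise to the hidden vector.
    hidden = [activation(h) for h in vec_mat(x, W1)]
    return vec_mat(hidden, W2)


# Toy example (hypothetical weights): model width 2, hidden width 3.
x = [1.0, 2.0]
W1 = [[1.0, 0.0, -1.0],
      [0.5, 1.0, 0.0]]
W2 = [[1.0], [2.0], [3.0]]

print(ffn(x, W1, W2))                   # ReLU variant: [6.0]
print(ffn(x, W1, W2, activation=gelu))  # GELU in place of ReLU
print(ffn(x, W1, W2, activation=swish)) # Swish with beta = 1
```

With these toy weights, $xW_1 = [2, 2, -1]$, ReLU zeroes the negative entry, and the output is $2 \cdot 1 + 2 \cdot 2 + 0 \cdot 3 = 6$. Passing a different `activation` swaps the nonlinearity without touching the two linear maps, which is the substitution the subsequent work describes.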



