Adaptive Attention Span in Transformers

Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, Armand Joulin

arXiv.org Machine Learning 

Part of its success is due to its ability to capture long-term dependencies. This is achieved by taking long sequences as inputs and explicitly computing the relations between every token via a mechanism called the "self-attention" layer (Al-Rfou et al., 2019).

The architecture is a model called Sequential Transformer (Vaswani et al., 2017). A Transformer is made of a sequence of layers that are composed of a block of parallel self-attention layers followed by a feedforward network. We refer to Vaswani et al. (2017) for the details of the structure.
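To make the layer structure described above concrete, here is a minimal single-head sketch in PyTorch of one such layer: causal self-attention followed by a feedforward network, each with a residual connection. This is an illustrative example, not the authors' implementation; the class name `TransformerLayer` and the dimensions `dim` and `hidden` are assumptions, and the paper's actual layers use multiple parallel attention heads.

```python
# Minimal sketch of one Transformer layer: causal self-attention + feedforward.
# Single-head for brevity; names and sizes are illustrative, not from the paper.
import math
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, dim=64, hidden=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.ff = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):  # x: (batch, seq, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Attention scores between every pair of tokens in the sequence.
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))
        # Causal mask: each token attends only to itself and earlier tokens.
        mask = torch.tril(torch.ones(x.size(1), x.size(1), dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~mask, float("-inf"))
        attn = torch.softmax(scores, dim=-1) @ v
        x = self.norm1(x + attn)            # residual + self-attention block
        return self.norm2(x + self.ff(x))   # residual + feedforward block

x = torch.randn(2, 16, 64)
print(TransformerLayer()(x).shape)          # torch.Size([2, 16, 64])
```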
