Agglomerative Attention
Neural networks using transformer-based architectures have recently demonstrated great power and flexibility in modeling sequences of many types. One of the core components of transformer networks is the attention layer, which allows contextual information to be exchanged among sequence elements. While many of the prevalent network structures thus far have utilized full attention--which operates on all pairs of sequence elements--the quadratic scaling of this attention mechanism significantly constrains the size of models that can be trained. In this work, we present an attention model that has only linear requirements in memory and computation time. We show that, despite the simpler attention model, networks using this attention mechanism can attain comparable performance to full attention networks on language modeling tasks.

To enable efficient modeling of longer sequences with greater correlation lengths, recent research has focused on finding more scalable attention mechanisms [4, 5, 12-14]. In this work we present an attention mechanism that is linear in time and memory requirements. This "agglomerative attention"--loosely inspired by ideas from protein folding--works by defining a fixed number of classes. Target sequence elements assigned to each class receive a summary representation of all reference elements belonging to that class. We measure the impact of the agglomerative attention algorithm by replacing the full dot product self-attention layers of universal transformers [2] with agglomerative attention and measuring model performance on both character- and word-level language modeling tasks.
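
The class-based summarization described above can be illustrated with a minimal sketch. It is not the formulation used in this work: the hard argmax assignment, the mean as the summary representation, the non-causal treatment of the sequence, and all names (`agglomerative_attention`, `class_weights`) are assumptions made for illustration. The point is only that each element is visited a fixed number of times, so the cost grows linearly with sequence length rather than with the number of element pairs.

```python
def agglomerative_attention(queries, values, class_weights):
    """Illustrative sketch (assumed details, not the paper's exact method).

    queries:       n vectors of length d, used to score each element against the classes
    values:        n vectors of length d_v, the content to be summarized
    class_weights: k vectors of length d, one hypothetical scoring vector per class
    Returns (outputs, classes): one summary vector and one class index per element.
    """
    def assign(vec):
        # Assumption: hard assignment to the class with the highest dot-product score.
        scores = [sum(w_i * v_i for w_i, v_i in zip(w, vec)) for w in class_weights]
        return max(range(len(scores)), key=scores.__getitem__)

    k = len(class_weights)
    d_v = len(values[0])
    classes = [assign(q) for q in queries]

    # One pass over the sequence accumulates a running sum and count per class.
    sums = [[0.0] * d_v for _ in range(k)]
    counts = [0] * k
    for c, v in zip(classes, values):
        counts[c] += 1
        for j, v_j in enumerate(v):
            sums[c][j] += v_j

    # Assumption: the summary representation of a class is the mean of its members.
    summaries = [
        [s / counts[c] for s in sums[c]] if counts[c] else [0.0] * d_v
        for c in range(k)
    ]

    # Every target element receives the summary of the class it was assigned to.
    return [list(summaries[c]) for c in classes], classes


# Usage: four elements, two classes; in self-attention the reference and
# target elements are the same sequence.
queries = [[1, 0], [0, 1], [2, 0], [0, 3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
class_weights = [[1, 0], [0, 1]]
outputs, classes = agglomerative_attention(queries, values, class_weights)
print(classes)
print(outputs)
```

With n elements, k classes, and vectors of length d, this sketch performs on the order of n·k·d operations for the assignment and n·d for the summaries, compared with the n² pairwise scores needed by full attention.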
Jul-15-2019