Going Beyond Linear Transformers with Recurrent Fast Weight Programmers

Neural Information Processing Systems 

Transformers with linearised attention ("linear Transformers") have demonstrated the practical scalability and effectiveness of outer product-based Fast Weight Programmers (FWPs) from the '90s. However, the original FWP formulation is more general than that of linear Transformers: a slow neural network (NN) continually reprograms the weights of a fast NN with arbitrary architecture. In existing linear Transformers, both NNs are feedforward and consist of a single layer.
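
To make the FWP view of a linear Transformer concrete, here is a minimal NumPy sketch of a single step: slow projections produce key/value/query vectors, and the fast net is one linear layer whose weight matrix W is rewritten by an outer product at every time step. All names (`linear_transformer_step`, `elu_plus_one`, the dimensions) are illustrative assumptions, and the sketch deliberately omits the attention normalisation and any improved update rules; it is not the paper's exact formulation.

```python
import numpy as np

def elu_plus_one(x):
    # A common positive kernel feature map phi used in linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x) + 1.0)

def linear_transformer_step(W, k, v, q, phi=elu_plus_one):
    """One step of a linear Transformer viewed as a Fast Weight Programmer.

    The slow net (which produced k, v, q) programs the fast weight matrix W
    via an outer product; the fast net here is a single linear layer.
    """
    W = W + np.outer(v, phi(k))  # slow net writes into the fast weights
    y = W @ phi(q)               # fast net (one linear layer) reads them out
    return W, y

# Toy usage: project a short input sequence into keys/values/queries
# with fixed slow weights, then run the fast-weight update loop.
d_model, d_key = 8, 4
rng = np.random.default_rng(0)
Wk = rng.standard_normal((d_key, d_model)) * 0.1
Wv = rng.standard_normal((d_model, d_model)) * 0.1
Wq = rng.standard_normal((d_key, d_model)) * 0.1
W = np.zeros((d_model, d_key))  # fast weights start empty
for x in rng.standard_normal((5, d_model)):  # 5 token embeddings
    W, y = linear_transformer_step(W, Wk @ x, Wv @ x, Wq @ x)
```

In this single-layer feedforward setting, both the slow net (the fixed projections) and the fast net (the matrix W) are as simple as possible; the paper's generalisation replaces these with recurrent NNs on either side.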
