Limits to Depth-Efficiencies of Self-Attention

Neural Information Processing Systems 

Self-attention architectures, which are rapidly pushing the frontier in natural language processing, demonstrate a surprising depth-inefficient behavior: previous works indicate that increasing the internal representation dimension (network width) is just as useful as increasing the number of self-attention layers (network depth).
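To make the depth-versus-width trade-off concrete, here is a minimal sketch (not from the paper) comparing two self-attention stacks with matched parameter budgets, one deep and narrow, one shallow and wide. It assumes the standard approximation of 12·d_model² parameters per transformer layer (4·d² for the attention projections plus 8·d² for the feed-forward block); the `transformer_params` helper and the specific configurations are hypothetical illustrations:

```python
def transformer_params(depth: int, d_model: int) -> int:
    """Approximate parameter count of a self-attention stack.

    Per layer: 4*d^2 for the Q/K/V/output projections plus 8*d^2
    for a feed-forward block with a 4x hidden expansion.
    """
    per_layer = 12 * d_model ** 2
    return depth * per_layer

# Hypothetical configurations chosen so the budgets match exactly:
# halving the depth while doubling the width leaves depth*d^2 unchanged
# only if width grows by sqrt(depth ratio); here 6 x 1536 vs 24 x 768.
deep_narrow = transformer_params(depth=24, d_model=768)
shallow_wide = transformer_params(depth=6, d_model=1536)

print(f"deep-narrow  (24 x  768): {deep_narrow:>13,d} params")
print(f"shallow-wide ( 6 x 1536): {shallow_wide:>13,d} params")
# Both come to ~170M parameters, so any performance gap between the two
# isolates the effect of depth versus width at a fixed parameter count,
# which is the comparison the depth-(in)efficiency claim is about.
```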

