Limits to Depth-Efficiencies of Self-Attention

Neural Information Processing Systems

Self-attention architectures, which are rapidly pushing the frontier in natural language processing, demonstrate a surprising depth-inefficient behavior: previous works indicate that increasing the internal representation (network width) is just as useful as increasing the number of self-attention layers (network depth).
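To make the depth/width trade-off in the abstract concrete, here is a minimal sketch (not the paper's experimental setup; all dimensions are illustrative assumptions) comparing a deep-narrow and a shallow-wide Transformer encoder at roughly matched parameter budgets:

```python
# A minimal sketch: two Transformer encoders with roughly matched parameter
# budgets, one deep-and-narrow and one shallow-and-wide. The specific
# dimensions are illustrative assumptions, not the paper's configuration.
import torch.nn as nn

def make_encoder(d_model: int, num_layers: int, nhead: int = 8) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(
        d_model=d_model, nhead=nhead, dim_feedforward=4 * d_model
    )
    return nn.TransformerEncoder(layer, num_layers=num_layers)

def n_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

# Each layer costs roughly 12 * d_model**2 parameters (attention + feed-forward),
# so halving the width allows roughly quadrupling the depth at the same budget.
deep_narrow  = make_encoder(d_model=256, num_layers=12)  # depth-heavy
shallow_wide = make_encoder(d_model=512, num_layers=3)   # width-heavy

print(f"deep/narrow : {n_params(deep_narrow):,} params")
print(f"shallow/wide: {n_params(shallow_wide):,} params")
```

The paper's question is which of these two ways of spending the same budget yields more expressive power.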








Neural Information Processing Systems

As in the classical problem, weights are fixed by an adversary and elements appear in random order. In contrast to previous variants with predictions, our algorithm only has access to a much weaker piece of information: an additive gap c.
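For context, here is a minimal sketch of the classical secretary baseline the abstract contrasts with: adversarial weights, a uniformly random arrival order, and the standard 1/e stopping rule. The paper's gap-based algorithm is not reproduced here; the gap value in the example data is purely illustrative.

```python
# Classical secretary setting: weights chosen by an adversary, observed one
# at a time in random order, with an irrevocable accept/reject decision each
# step. This sketch shows only the classical 1/e rule, NOT the paper's
# algorithm that exploits the additive gap c.
import math
import random

def secretary_one_over_e(weights: list[float]) -> float:
    """Run the classical 1/e stopping rule on one random arrival order."""
    order = random.sample(weights, len(weights))  # random order, adversarial values
    cutoff = max(1, int(len(order) / math.e))     # observation-only prefix
    benchmark = max(order[:cutoff])
    for w in order[cutoff:]:
        if w > benchmark:                         # accept first element beating the prefix
            return w
    return order[-1]                              # otherwise forced to take the last one

# Adversarially chosen weights; here the best and second-best happen to differ
# by an additive gap c = 5.0, the kind of side information the paper assumes.
weights = [3.0, 8.0, 20.0, 15.0, 9.5, 1.0]
wins = sum(secretary_one_over_e(weights) == max(weights) for _ in range(10_000))
print(f"classical 1/e rule picks the max in ~{wins / 10_000:.2%} of runs")
```

The classical rule succeeds with probability about 1/e; the abstract's point is that even the weak extra signal of an additive gap c changes what is achievable.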