Limits to Depth-Efficiencies of Self-Attention
Neural Information Processing Systems
Self-attention architectures, which are rapidly pushing the frontier in natural language processing, demonstrate a surprising depth-inefficient behavior: previous works indicate that increasing the internal representation (network width) is just as useful as increasing the number of self-attention layers (network depth).
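As a rough, hypothetical illustration of the depth-width trade-off the abstract refers to (not code from the paper), the sketch below builds two PyTorch transformer encoders with near-identical parameter budgets: one deep and narrow, one shallow and wide. Since per-layer parameters grow roughly as 12·d_model², 12 layers at width 256 match 3 layers at width 512 (12·256² = 3·512²); all sizes here are illustrative choices, and the paper's question is when the deeper stack buys genuinely more expressive power.

```python
# A minimal sketch, assuming PyTorch >= 1.9 (for batch_first). It only
# demonstrates how parameter budget can be held fixed while trading
# depth for width; it does not reproduce any experiment from the paper.
import torch.nn as nn


def build_encoder(depth: int, width: int, heads: int) -> nn.TransformerEncoder:
    """Stack `depth` self-attention layers of hidden size `width`."""
    layer = nn.TransformerEncoderLayer(
        d_model=width,
        nhead=heads,
        dim_feedforward=4 * width,  # conventional 4x FFN expansion
        batch_first=True,
    )
    return nn.TransformerEncoder(layer, num_layers=depth)


def n_params(model: nn.Module) -> int:
    """Total number of trainable parameters."""
    return sum(p.numel() for p in model.parameters())


# Deep/narrow vs. shallow/wide at (approximately) equal parameter count.
deep_narrow = build_encoder(depth=12, width=256, heads=8)
shallow_wide = build_encoder(depth=3, width=512, heads=8)

print(f"deep/narrow : {n_params(deep_narrow):,} parameters")
print(f"shallow/wide: {n_params(shallow_wide):,} parameters")
```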