Sketching as a Tool for Understanding and Accelerating Self-attention for Long Sequences
Yifan Chen, Qi Zeng, Dilek Hakkani-Tur, Di Jin, Heng Ji, Yun Yang
Transformer-based models are inefficient at processing long sequences due to the quadratic space and time complexity of their self-attention modules. To address this limitation, Linformer and Informer reduce the quadratic complexity to linear (modulo logarithmic factors) via low-dimensional projection and row selection, respectively. These two models are intrinsically connected, and to understand their connection we introduce a theoretical framework of matrix sketching. Based on the theoretical analysis, we propose Skeinformer to accelerate self-attention and further improve the accuracy of the matrix approximation to self-attention, with three carefully designed components: column sampling, adaptive row normalization, and pilot sampling reutilization. Experiments on the Long Range Arena (LRA) benchmark demonstrate that our methods outperform alternatives with a consistently smaller time/space footprint.
Dec-10-2021
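To make the sketching idea concrete, here is a minimal, hypothetical NumPy sketch of column-sampling attention: it approximates softmax(QK^T/√d)V by sampling m of the n attention-matrix columns (i.e., m key/value rows) with probabilities proportional to squared key norms, then reweighting the sampled columns so both the numerator and the softmax row sums are estimated unbiasedly. The function name `sketched_attention`, the norm-based sampling distribution, and all parameter choices are illustrative assumptions, not the authors' Skeinformer implementation, which additionally employs adaptive row normalization and pilot sampling reutilization.

```python
# Hypothetical illustration of column-sampling (sketched) attention.
# Not the Skeinformer implementation; a generic Monte Carlo sketch.
import numpy as np

def sketched_attention(Q, K, V, m, rng=None):
    """Approximate softmax(Q K^T / sqrt(d)) V by sampling m columns.

    Q, K, V: (n, d) arrays; m: sketch size (number of sampled keys).
    Sampling probabilities are proportional to squared key norms, a
    common importance-sampling choice in matrix sketching (assumed here).
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = Q.shape
    # Importance-sampling distribution over attention-matrix columns.
    p = np.sum(K * K, axis=1) + 1e-12
    p = p / p.sum()
    idx = rng.choice(n, size=m, replace=True, p=p)
    # Unnormalized scores against the sampled keys only: shape (n, m).
    S = np.exp(Q @ K[idx].T / np.sqrt(d))
    # Reweight each sampled column by 1 / (m * p_j) for unbiasedness.
    W = S / (m * p[idx])
    # Estimated softmax denominators (row sums of exp(Q K^T / sqrt(d))).
    denom = W.sum(axis=1, keepdims=True)
    # Estimated numerator, restricted to the sampled value rows.
    return (W @ V[idx]) / denom

if __name__ == "__main__":
    # Sanity check against exact attention on a small random instance.
    rng = np.random.default_rng(0)
    n, d = 512, 64
    Q, K, V = (rng.standard_normal((n, d)) / d**0.25 for _ in range(3))
    A = np.exp(Q @ K.T / np.sqrt(d))
    exact = (A / A.sum(axis=1, keepdims=True)) @ V
    approx = sketched_attention(Q, K, V, m=128, rng=rng)
    print("relative error:",
          np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```

As the sketch size m grows toward n, the relative error in the toy check shrinks, reflecting the usual Monte Carlo trade-off between sketch size and approximation accuracy; each forward pass costs O(nmd) time and O(nm) space instead of the O(n^2 d) and O(n^2) of exact attention.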