Scaling White-Box Transformers for Vision

Jinrui Yang¹  Xianhang Li¹  Yuyin Zhou

Neural Information Processing Systems

Over the past several years, the Transformer architecture [42] has dominated deep representation learning for natural language processing (NLP), image processing, and visual computing [8, 2, 9, 5, 12]. However, the design of the Transformer architecture and its many variants remains largely empirical and lacks a rigorous mathematical interpretation. This absence of a principled foundation has hindered the development of new Transformer variants with improved efficiency or interpretability.