Sliced Recursive Transformer

Zhiqiang Shen, Zechun Liu, Eric Xing

arXiv.org Artificial Intelligence 

We present a neat yet effective recursive operation on vision transformers that can improve parameter utilization without introducing additional parameters. This is achieved by sharing weights across the depth of the transformer network. The proposed method obtains a substantial gain (∼2%) simply by using a naive recursive operation, requires no special or sophisticated knowledge of network design principles, and introduces minimal computational overhead to the training procedure. To reduce the additional computation caused by the recursive operation while maintaining the superior accuracy, we propose an approximation method through multiple sliced group self-attentions across recursive layers, which reduces the cost by 10∼30% with minimal performance loss. We call our model Sliced Recursive Transformer (SReT), which is compatible with a broad range of other designs for efficient vision transformers. Its flexible scalability shows great potential for scaling up and constructing extremely deep and large-dimensionality vision transformers.

Transformer architectures have recently achieved substantial breakthroughs in natural language processing (NLP) [Vaswani et al., 2017], computer vision (CV) [Dosovitskiy et al., 2021], and speech [Dong et al., 2018, Wang et al., 2021b]. In the vision area, Dosovitskiy et al. [Dosovitskiy et al., 2021] introduced the vision transformer (ViT), which splits a raw image into a patch sequence as input and directly applies the transformer model [Vaswani et al., 2017] to the image classification task. ViT achieved impressive results and has inspired many follow-up works. However, the benefits of a transformer often come with a large computational cost, and it remains a great challenge to achieve the optimal trade-off between accuracy and model complexity. In this work, we are motivated by the following question: How can we improve the parameter utilization of a vision transformer, i.e., its representation ability, without increasing the model size? We observe that the recursive operation shown in Figure 1 is a simple yet effective way to achieve this goal.
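The following is a minimal PyTorch sketch (not the authors' released code) of the two ideas described above: a transformer block whose weights are reused across depth via a naive recursive loop, and a sliced group self-attention that approximates full attention to offset the extra recursion cost. All class and parameter names (RecursiveBlock, SlicedGroupAttention, num_recursions, num_groups) are illustrative assumptions, and the exact slicing scheme in SReT may differ.

```python
import torch
import torch.nn as nn


class SlicedGroupAttention(nn.Module):
    """Approximate full self-attention by slicing the token sequence into
    groups and attending within each group, reducing the quadratic cost."""

    def __init__(self, dim, num_heads=4, num_groups=2):
        super().__init__()
        self.num_groups = num_groups
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, tokens, dim); tokens assumed divisible by num_groups here.
        chunks = x.chunk(self.num_groups, dim=1)
        out = [self.attn(c, c, c, need_weights=False)[0] for c in chunks]
        return torch.cat(out, dim=1)


class RecursiveBlock(nn.Module):
    """One pre-norm transformer block applied recursively: the same weights
    are reused num_recursions times (weight sharing across depth), so the
    effective depth grows without adding parameters."""

    def __init__(self, dim, num_heads=4, num_groups=2, num_recursions=2, mlp_ratio=4):
        super().__init__()
        self.num_recursions = num_recursions
        self.norm1 = nn.LayerNorm(dim)
        self.attn = SlicedGroupAttention(dim, num_heads, num_groups)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        for _ in range(self.num_recursions):  # naive recursive operation
            x = x + self.attn(self.norm1(x))
            x = x + self.mlp(self.norm2(x))
        return x


# Usage: 196 patch tokens of dimension 192 (DeiT-Tiny-like), batch of 2.
tokens = torch.randn(2, 196, 192)
block = RecursiveBlock(dim=192, num_heads=4, num_groups=2, num_recursions=2)
print(block(tokens).shape)  # torch.Size([2, 196, 192])
```

Under these assumptions, doubling num_recursions doubles the per-block compute while the parameter count stays fixed, which is why the grouped (sliced) attention is used to claw back part of that cost.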