Sub-Linear Memory: How to Make Performers SLiM

Neural Information Processing Systems 

Transformer architectures have become very popular yet the original implementation requires O(L2) in serial time and memory as functions of input length L. Recent works proposed various linear self-attention mechanisms, scaling only as O(L) for serial computation. We conduct a thorough complexity analysis of Performers, a class which includes most recent linear Transformer mechanisms. We note a remarkable computational flexibility: the gradient computation can be performed with no approximations using sublinear memory as a function of L (in addition to negligible storage for the input sequence), at a cost of greater time complexity in the parallel setting. In the extreme case, a Performer consumes only O(1) memory, and still requires O(L) time. Due to complete backwardcompatibility, this discovered time-memory tradeoff can be used for fine-tuning on low-memory devices in a decentralized fashion without any server computations.

Similar Docs  Excel Report  more

TitleSimilaritySource
None found