Review for NeurIPS paper: SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection

Neural Information Processing Systems 

Weaknesses: My main concern is the computational cost of the proposed method. The method requires running an LSTM for each token at every layer (or even every head), sequentially. Compared with the parallel processing of Transformers, I would expect this sequential computation to be quite slow. All of these factors (the LSTM itself, its application per token and per layer or per head, and its sequential nature) should negatively affect computation speed. Given that computational efficiency is the goal of the paper, the authors must discuss these costs in detail.
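
To make the concern concrete, here is a rough back-of-the-envelope sketch of the sequential depth involved. It is my own illustration, not taken from the paper: the function names, the `steps_per_token` parameter (how many LSTM steps are unrolled per token), and the `per_head` and `tokens_in_parallel` switches are all assumptions. It only counts operations that cannot be parallelized, not FLOPs or wall-clock time.

```python
def sequential_steps_dense(num_layers: int) -> int:
    """Dense self-attention: all tokens within a layer are processed in
    parallel, so the sequential depth grows only with the layer count."""
    return num_layers


def sequential_steps_lstm_predictor(
    num_layers: int,
    num_heads: int,
    num_tokens: int,
    steps_per_token: int,
    per_head: bool = False,
    tokens_in_parallel: bool = False,
) -> int:
    """Hypothetical count of sequential LSTM steps for an edge predictor.

    Assumptions (not confirmed by the paper):
      - the LSTM unrolls `steps_per_token` steps for each token,
      - it is run once per layer, or once per head if `per_head` is True,
      - tokens are handled one after another unless `tokens_in_parallel`.
    """
    head_factor = num_heads if per_head else 1
    token_factor = 1 if tokens_in_parallel else num_tokens
    return num_layers * head_factor * token_factor * steps_per_token


if __name__ == "__main__":
    # Illustrative sizes only; these are not the paper's settings.
    layers, heads, tokens, steps = 6, 8, 512, 4
    print("dense attention:", sequential_steps_dense(layers))
    print("LSTM, per layer:", sequential_steps_lstm_predictor(layers, heads, tokens, steps))
    print("LSTM, per head: ", sequential_steps_lstm_predictor(layers, heads, tokens, steps, per_head=True))
    print("LSTM, batched:  ", sequential_steps_lstm_predictor(layers, heads, tokens, steps, tokens_in_parallel=True))
```

Even if tokens can be batched, the LSTM unrolling itself remains sequential within each layer, so a measured wall-clock comparison against a standard Transformer would be more convincing than a count of attention edges alone.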