SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection

Neural Information Processing Systems

While the self-attention mechanism has been widely adopted across a variety of tasks, its cost grows quadratically with the input length, which makes long inputs difficult to handle. In this paper, we present a method for accelerating and structuring self-attention: Sparse Adaptive Connection (SAC). In SAC, we regard the input sequence as a graph, and attention operations are performed only between linked nodes. In contrast with previous self-attention models that use pre-defined structures (edges), the model learns to construct attention edges to improve task-specific performance. In this way, the model is able to select the most salient nodes and reduce the quadratic complexity regardless of the sequence length. Based on SAC, we show that previous variants of self-attention models are its special cases. Through extensive experiments on neural machine translation, language modeling, graph representation learning and image classification, we demonstrate that SAC is competitive with state-of-the-art models while significantly reducing memory cost.
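To make the "attention between linked nodes" idea concrete, here is a minimal NumPy sketch of attention restricted to an edge list. It is an illustration only, not the authors' implementation: the function name `edge_attention`, the single-head layout, and the projection matrices `w_q`, `w_k`, `w_v` are assumptions, and the edges are supplied by hand, whereas SAC learns to construct them (that edge predictor is not shown).

```python
import numpy as np

def edge_attention(x, edges, w_q, w_k, w_v):
    """Single-head attention restricted to a given edge list (hypothetical helper).

    x:     (n, d) token/node representations
    edges: (m, 2) integer array of (target, source) pairs;
           each target attends only to its linked sources
    """
    n = x.shape[0]
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    tgt, src = edges[:, 0], edges[:, 1]

    # One score per edge: m values instead of an n x n score matrix.
    scores = np.einsum("md,md->m", q[tgt], k[src]) / np.sqrt(q.shape[1])

    # Numerically stable softmax over the incoming edges of each target.
    max_per_tgt = np.full(n, -np.inf)
    np.maximum.at(max_per_tgt, tgt, scores)
    exp = np.exp(scores - max_per_tgt[tgt])
    denom = np.zeros(n)
    np.add.at(denom, tgt, exp)
    weights = exp / denom[tgt]

    # Weighted sum of source values, accumulated per target.
    out = np.zeros_like(v)
    np.add.at(out, tgt, weights[:, None] * v[src])
    return out, weights
```

With the complete edge set (every pair of positions linked), this reduces to ordinary dense softmax attention; with m edges it stores m scores rather than n², which is where a memory saving of this kind would come from. Targets with no incoming edge receive a zero output in this sketch.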


Supplementary Materials for SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection

1 Datasets

For the transductive setup, we used the three standard citation network benchmarks: Cora, Citeseer and Pubmed (Sen et al., 2008). We followed the transductive setup adopted in (Yang et al.). Cora contains 2708 nodes, 5429 edges, 7 classes and 1433 features per node. Citeseer contains 3327 nodes, 4732 edges, 6 classes and 3703 features per node. Critically, testing graphs remain completely unobserved during training. The average number of nodes per graph is 2372.


Review for NeurIPS paper: SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection

Weaknesses: My main concern is the computational cost of the proposed method. The method requires running an LSTM on each token at every layer (or even every head) sequentially. Compared to the parallel processing of Transformers, I would expect this sequential computation to be quite slow. All those factors should affect the computation speed negatively. Given that computational efficiency is the goal of the paper, the authors must discuss them in detail.


Review for NeurIPS paper: SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection

This paper addresses the quadratic bottleneck in the Transformer architecture. It proposes a Sparse Adaptive Connection (SAC) model that learns to predict sparse connections (attention links) between inputs, and attention is performed only on those predicted links. The proposed method is competitive with state-of-the-art models on WMT, LM and image classification tasks while significantly reducing memory cost. Overall, three of the four reviewers seem to have liked the paper, although they had some concerns (below), while one reviewer (R3) recommended a weak reject. A weakness pointed out by R2 and R3 is that only accuracy is reported and speed is not, which seems necessary to support the title "Accelerating Self-Attention". The authors promised to add more details about computational efficiency and memory cost in the final version, and I urge them to do so.

