
We thank the reviewers for the insightful remarks and comments that have helped us considerably improve our manuscript. We address the most important ones in detail below. Before doing so, we highlight a comment from R3 in order to make an important clarification about the scope of our contribution: "It is well known that an attention mechanism would reduce gradient vanishing. It feels trivial to me as there is a direct connection for gradients to pass."

We are in complete agreement and recognize that the very mechanism of (self-)attention was designed to improve gradient propagation over long sequences, and that sparsity is a good way to keep complexity costs low. Much like work from the '90s established formal results for exploding/vanishing gradients in deep/recurrent networks, we believe it is crucial to establish similar theoretical tools for attention mechanisms, as these methods are under intense development, where scalability and complexity are important issues.

The proposed relevancy mechanism and accompanying experiments, building on established work, are meant to illustrate how our theorems can be concretely exploited. We chose simple tasks for their ease of interpretation and their variety of computational demands (memorization, prediction, RL, etc.). As is clearly indicated in the text, it is not our goal to propose this method "as is" in a race for state-of-the-art results. We recognize that reviewers may have based their evaluation as they would have for a method paper, and we kindly invite them to reconsider the value of our experiments in the broader context of our theoretical contributions. We also thank the reviewers for their additional minor comments not explicitly addressed here, and we agree to implement them.

R1: Q: "The authors didn't spell out the relation between κ and d: higher κ tends to have smaller d."