Neural Information Processing Systems 

The purpose of this is to scale down the logits before the softmax is applied, a technique similar to the one used in Vaswani et al. (2017). The reason could be that the softmax is then more numerically stable in both the forward pass and backpropagation.

As discussed in Section 3.3 of the paper, the stick-breaking formulation was initially used to reflect the process that a

Thank you all for your detailed reviews and insightful comments.

This will be the direction in which we take our future work.

We have conducted an ablation test for the Gated Recursive Cell and Stick-breaking Attention.
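To illustrate the logit scaling mentioned in this response, here is a minimal sketch in plain Python. It assumes the scaling factor is the square root of the hidden dimension, as in Vaswani et al. (2017); the exact factor used in the paper is not stated in this response, and the names `softmax`, `scaled_softmax`, and `dim`, as well as the example logits, are hypothetical.

```python
import math


def softmax(logits):
    """Softmax over a list of floats, shifted by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_softmax(logits, dim):
    """Divide the logits by sqrt(dim) before applying softmax.

    The sqrt(dim) factor is an assumption borrowed from Vaswani et al.
    (2017), not a value confirmed by this response.
    """
    scale = math.sqrt(dim)
    return softmax([x / scale for x in logits])


# Illustrative logits only; not taken from the paper.
logits = [12.0, 4.0, -3.0]
plain = softmax(logits)
scaled = scaled_softmax(logits, dim=64)
```

With these illustrative values, the unscaled softmax puts almost all of its mass on the first entry, while the scaled version is less peaked. A less saturated softmax keeps its gradients away from zero, which is one plausible reading of the stability argument given here.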