A Appendix
Neural Information Processing Systems
Hyperparameter                                        Value
Number of encoder (decoder) layers                    6
Number of layers in the feed-forward network          2
Number of hidden units in the feed-forward network    128
Mask filter size                                      3
Mask number of filters                                16
Ratio of residual connection                          1.5
Dropout rate                                          0.1
Optimizer                                             Adam
Warm-up steps                                         4000
Learning rate                                         d^{-0.5} · min(t^{-0.5}, t · 4000^{-1.5})

Unless otherwise specified, the task performed in this section is selection sort (Section 4). Figure 6 shows the sorting performance of the transformers without mask supervision. Figure 7 shows sorting performance with different encoding schemes. In Figure 9, we show the strong-generalization performance of the different architectures. While some changes improve performance in this regime, performance ultimately drops steeply as the length of the test sequence increases. The symbol e represents the end token.
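The learning-rate entry is the standard inverse-square-root schedule with linear warmup from the original Transformer paper, with the 4000 warm-up steps listed above. A minimal sketch (the value d_model = 128 is an assumption for illustration; the table gives 128 only as the feed-forward hidden size):

```python
def transformer_lr(step: int, d_model: int = 128, warmup: int = 4000) -> float:
    """Inverse-square-root learning-rate schedule with linear warmup.

    Rate grows linearly for the first `warmup` steps, then decays as step^-0.5.
    `d_model` scales the whole schedule by d_model^-0.5 (an assumed value here).
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The two branches of the `min` intersect exactly at `step == warmup`, so the schedule peaks at the end of warm-up and decays thereafter.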