A Task Setups

Neural Information Processing Systems 

Table 4: Shared hyperparameters for all models, given for each task.

Table 4 lists the hyperparameter setups shared across our models for each task. Table 5 lists the hyperparameters tuned separately for each model, which were selected using validation performance. We also describe some aspects of the base models below.

Random Walk. We train 4-layer models with a hidden size of 256 and 4 attention heads.
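As an illustration, the Random Walk base-model settings stated above could be collected in a small configuration object. This is only a sketch: the class name `RandomWalkModelConfig` and its field names are hypothetical, and the per-head dimension assumes the hidden size is split evenly across attention heads, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RandomWalkModelConfig:
    """Hypothetical container for the Random Walk base-model settings."""

    num_layers: int = 4     # from the text: 4-layer models
    hidden_size: int = 256  # from the text: hidden size of 256
    num_heads: int = 4      # from the text: 4 attention heads

    def __post_init__(self) -> None:
        # Guard against settings that cannot be split evenly across heads.
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")

    @property
    def head_dim(self) -> int:
        # Assumption: the hidden size is divided evenly among the heads.
        return self.hidden_size // self.num_heads


config = RandomWalkModelConfig()
print(config.num_layers, config.hidden_size, config.num_heads, config.head_dim)
```

Under the even-split assumption, each attention head would operate on 256 / 4 = 64 dimensions. The remaining shared and tuned hyperparameters are those given in Tables 4 and 5 and are not reproduced here.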