A Additional Experiment Details
The experiments were performed on a cluster of 12 GPUs (2 with 24 GB, 2 with 12 GB, and 8 with 11 GB of memory). For Transformer models, the number of layers varied from 5 to 8, and the number of heads was fixed to 8. No hyperparameter search was performed for the Edge Transformer on COGS. Architecture hyperparameters for the Edge Transformer were matched to those of Ontanón et al. (2021), who tuned the number of layers; we therefore use three layers for the Edge Transformer. Default settings were used for optimizer hyperparameters.
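For concreteness, the sketch below collects the architecture settings described above into a single configuration summary. The dictionary names (`transformer_config`, `edge_transformer_config`) are illustrative placeholders, not identifiers from the released code.

```python
# Illustrative summary of the architecture hyperparameters reported above.
# Names are placeholders and do not correspond to the released code's API.

transformer_config = {
    "num_layers": [5, 6, 7, 8],  # range explored for standard Transformer models
    "num_heads": 8,              # fixed across all runs
}

edge_transformer_config = {
    "num_layers": 3,  # matched to Ontanón et al. (2021); no search performed on COGS
    "num_heads": 8,
}

# Optimizer hyperparameters were left at their default settings.
```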