Details and Ablation Studies for Language Modelling

Apr-25-2026, 14:23:28 GMT–Neural Information Processing Systems

A.1 Experimental Settings

All language models in Table 1 share the same Transformer configuration: a 16-layer model with a hidden size of 128 with 8 heads, and a feed-forward dimension of 2048. We use a dropout [75, 76, 77] rate of 0.1. The batch size is 96, and we train for about 120 epochs using the Adam optimiser [78] with an initial learning rate of 0.00025 and 2000 learning rate warm-up steps. All models are trained with a back-propagation span of 256 tokens.

During training, the resulting 256-token segments are treated independently, except in the + full context cases in Table 1, where the states (both recurrent states and fast weight states) from one segment are used as initialisation for the subsequent segment. The models in the + full context cases are also evaluated in the same way, by carrying the context over throughout the evaluation text with a batch size of one.

For all other cases, evaluation is done by going through the text with a sliding window of size 256 and a batch size of one. Transformer states are computed for all positions in each window, but only the last position is used to compute perplexity (except in the first segment, where all positions are used for evaluation) [2].
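The sliding-window evaluation described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sliding_window_perplexity`, the `score_fn` interface, and the toy `uniform_scorer` are all hypothetical, introduced only for illustration. The sketch assumes `score_fn(inputs, targets)` returns one natural-log probability per position, i.e. log p(targets[i] | inputs[:i+1]). It scores every position in the first window and only the last position in each later window, so each target token is counted exactly once.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface: maps (inputs, targets) to per-position log-probs.
ScoreFn = Callable[[Sequence[int], Sequence[int]], List[float]]


def sliding_window_perplexity(tokens: Sequence[int],
                              score_fn: ScoreFn,
                              window: int = 256) -> float:
    """Perplexity with a sliding window moved one token at a time.

    First window: all positions are scored.
    Later windows: only the last position is scored.
    """
    n_targets = len(tokens) - 1  # the first token has no preceding context
    if n_targets < 1:
        raise ValueError("need at least two tokens")

    # First segment: score every position.
    first_len = min(window, n_targets)
    log_probs = list(score_fn(tokens[:first_len], tokens[1:first_len + 1]))

    # Remaining segments: shift by one token, keep only the last prediction.
    for start in range(1, n_targets - window + 1):
        inputs = tokens[start:start + window]
        targets = tokens[start + 1:start + window + 1]
        log_probs.append(score_fn(inputs, targets)[-1])

    # Every target token has been scored exactly once.
    assert len(log_probs) == n_targets
    return math.exp(-sum(log_probs) / n_targets)


def uniform_scorer(vocab_size: int) -> ScoreFn:
    """Toy stand-in for a trained model (assumption): uniform predictions."""
    def score(inputs: Sequence[int], targets: Sequence[int]) -> List[float]:
        return [-math.log(vocab_size)] * len(targets)
    return score


if __name__ == "__main__":
    toy_text = list(range(10))
    print(sliding_window_perplexity(toy_text, uniform_scorer(4), window=3))
```

With a uniform toy scorer the perplexity equals the vocabulary size, which gives a quick sanity check of the bookkeeping. Moving the window one token at a time means one forward pass per evaluated token after the first window, so this style of evaluation costs far more than scoring non-overlapping segments.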



