
Neural Information Processing Systems 

We again observe that larger models memorize training data faster. This section shows how perplexity and memorization on the special batch evolve over training. In Figure 14 we see that perplexity continues to increase over training, while memorization flatlines. We show plots for the 1.3B model scale, although all of the experiments in Section 5 exhibit the same behavior. In Figure 16 we analyze the average memory unit length over training for two model sizes, and we notice that the larger 2.7B model has a longer average memory unit length. Exact training time varied depending on model scale and dataset size, but all models were trained for up to 140 hours.
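The two quantities tracked here, exact-match memorization of the special batch and the length of memorized spans, can be sketched as simple metrics over token sequences. The sketch below is illustrative only: the function names (`memorization_rate`, `avg_memorized_prefix_len`) and the representation of continuations as lists of token ids are our assumptions, not the paper's implementation.

```python
def memorization_rate(generated, references):
    """Fraction of special-batch examples whose model-generated
    continuation exactly matches the reference continuation."""
    assert len(generated) == len(references)
    matches = sum(g == r for g, r in zip(generated, references))
    return matches / len(references)


def avg_memorized_prefix_len(generated, references):
    """Average length (in tokens) of the longest shared prefix
    between each generated and reference continuation -- a simple
    proxy for the 'memory unit length' tracked over training."""
    def prefix_len(g, r):
        n = 0
        for a, b in zip(g, r):
            if a != b:
                break
            n += 1
        return n
    return sum(prefix_len(g, r) for g, r in zip(generated, references)) / len(references)


# Toy usage with hypothetical token-id sequences:
gen = [[1, 2, 3], [4, 5, 6]]
ref = [[1, 2, 3], [4, 5, 9]]
rate = memorization_rate(gen, ref)       # 1 of 2 exact matches -> 0.5
avg = avg_memorized_prefix_len(gen, ref)  # (3 + 2) / 2 -> 2.5
```

In practice the generated continuations would come from greedy decoding conditioned on each example's prompt; the metric itself is model-agnostic.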
