Memory Layers at Scale

Vincent-Pierre Berges, Barlas Oğuz, Daniel Haziza, Wen-tau Yih, Luke Zettlemoyer, Gargi Ghosh


[Figure: Scaling the memory size for a 1.3 billion parameter base model (zero memory parameters corresponds to a dense model), trained on 1 trillion tokens. Left: factual QA accuracy (exact match on NaturalQuestions and F1 on TriviaQA); right: task NLL (lower is better). Dashed lines show the performance of a 7B model trained on 2 trillion tokens with 10x more FLOPs.]

Pretrained language models encode vast amounts of information in their parameters (Roberts et al., 2020), and they can recall and use this information more accurately with increasing scale (Brown et al., 2020). For dense deep neural networks, which encode information primarily as the weights of linear matrix transforms, this scaling of parameter count is directly coupled to an increase in computational and energy requirements. It is unclear whether this is the most efficient solution for all of the information storage needs of language models. An important subset of the information that language models need to learn consists of simple associations.
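To make the contrast concrete, the sketch below compares a dense feed-forward block, where stored information is entangled in the weight matrices and every token pays the full matmul cost, with a minimal trainable key-value memory layer whose parameter count grows with the number of keys while each token only touches a few selected values. This is an illustrative sketch only: it uses a flat top-k key lookup, whereas the paper builds on product-key memories, and all module names and sizes here are assumptions rather than the paper's implementation.

```python
# Illustrative sketch: dense FFN vs. a sparse key-value memory layer.
# Flat top-k lookup for clarity; the paper's memory layers use product keys.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseFFN(nn.Module):
    """Dense block: information lives in the weight matrices, and every
    token incurs the full FLOP cost of both matrix multiplications."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.relu(self.up(x)))


class MemoryLayer(nn.Module):
    """Sparse key-value memory: parameters scale with num_keys, but each
    token reads only the top-k selected value rows, so per-token compute
    stays nearly flat as the memory grows."""
    def __init__(self, d_model: int, num_keys: int, k: int = 32):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_keys, d_model) / d_model ** 0.5)
        self.values = nn.Embedding(num_keys, d_model)  # the "memory parameters"
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model). Score the query against all keys, keep top-k.
        scores = x @ self.keys.t()                         # (batch, num_keys)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)           # (batch, k)
        selected = self.values(topk_idx)                   # (batch, k, d_model)
        return (weights.unsqueeze(-1) * selected).sum(dim=1)


if __name__ == "__main__":
    x = torch.randn(4, 256)
    print(DenseFFN(256, 1024)(x).shape)                # torch.Size([4, 256])
    print(MemoryLayer(256, num_keys=65536)(x).shape)   # torch.Size([4, 256])
```

In this toy setup, doubling num_keys doubles the memory parameters while the forward cost per token is still dominated by one (batch, d_model) x (d_model, num_keys) scoring step plus a k-row gather, which is the kind of decoupling between capacity and compute the figure above is probing.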