Exploring the Relationship Between Model Architecture and In-Context Learning Ability

Ivan Lee, Nan Jiang, Taylor Berg-Kirkpatrick

arXiv.org Artificial Intelligence 

What is the relationship between model architecture and the ability to perform in-context learning? In this empirical study, we take the first steps toward answering this question. We evaluate twelve model architectures capable of causal language modeling across a suite of synthetic in-context learning tasks. The selected architectures represent a broad range of paradigms, including recurrent and convolution-based neural networks, transformers, architectures inspired by state-space models, and other emerging attention alternatives. We find that all of the considered architectures can perform in-context learning under a wider range of conditions than previously documented. Additionally, we observe stark differences in statistical efficiency and consistency as context length and task difficulty vary. We also measure each architecture's predisposition toward in-context learning when presented with alternative routes for task resolution. Finally, and somewhat surprisingly, we find that several attention alternatives are more robust in-context learners than transformers. Because such approaches have constant-sized memory footprints at inference time, this result opens the possibility of scaling up in-context learning to accommodate vastly larger numbers of in-context examples.

In-context learning (ICL) refers to the ability to learn new tasks at inference time, using only input-output pair exemplars as guidance. Radford et al. (2019) demonstrated early signs of this ability in GPT-2, a causal transformer (Vaswani et al., 2017). ICL was further popularized by GPT-3 (Brown et al., 2020), a large language model with the same architectural foundation but greater capacity, trained on large-scale data. By simply adjusting a natural language prompt, GPT-3 was shown to adapt to new tasks, such as translation and question answering, without updating any of its parameters.
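To make the synthetic ICL setup concrete, the sketch below shows one common task formulation: in-context linear regression, in which each episode draws a fresh task vector and the model must infer it from the exemplars in the prompt alone, with no weight updates. The dimensions, Gaussian sampling, and function names here are illustrative assumptions, not this paper's exact protocol.

```python
import numpy as np

def sample_icl_episode(n_examples=16, dim=8, seed=0):
    """Build one synthetic in-context learning episode (illustrative sketch).

    A fresh linear task w is drawn per episode; a causal model sees the
    (x, y) exemplars interleaved in its context and must predict the
    query's y without any parameter updates.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(dim)                 # latent task vector, never shown to the model
    xs = rng.standard_normal((n_examples, dim))  # in-context inputs
    ys = xs @ w                                  # in-context targets
    query_x = rng.standard_normal(dim)           # final input to predict
    query_y = query_x @ w                        # held-out target, used only for evaluation
    return xs, ys, query_x, query_y

xs, ys, qx, qy = sample_icl_episode()
# The prompt is the sequence (x_1, y_1, ..., x_n, y_n, query_x); during
# training, the model is supervised to predict each y_i from the pairs
# that precede it, so longer contexts supply more evidence about w.
```

Varying `n_examples` probes statistical efficiency (how quickly predictions improve with more exemplars), while varying `dim` or the task family controls task difficulty, the two axes along which the abstract reports stark differences across architectures.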