Looped Transformers are Better at Learning Learning Algorithms

Yang, Liu, Lee, Kangwook, Nowak, Robert, Papailiopoulos, Dimitris

arXiv.org Artificial Intelligence

Transformers have demonstrated effectiveness at solving data-fitting problems from various (latent) models in-context, as reported by Garg et al. (2022). However, the absence of an inherent iterative structure in the transformer architecture presents a challenge in emulating the iterative algorithms commonly employed in traditional machine learning methods. To address this, we propose a looped transformer architecture and an associated training methodology that incorporate iterative characteristics into the transformer architecture. Experimental results suggest that the looped transformer achieves performance comparable to the standard transformer in solving various data-fitting problems, while using less than 10% of the parameter count.

Transformers (Vaswani et al., 2017; Brown et al., 2020; Devlin et al., 2019) have emerged as the preferred model in natural language processing (NLP) and other domains requiring sequence-to-sequence modeling. Beyond their state-of-the-art performance on natural language processing tasks, large language models (LLMs) such as GPT-3 (Brown et al., 2020) and PaLM (Chowdhery et al., 2022) also exhibit the ability to learn in-context: they can adapt to various downstream tasks based on a brief prompt, bypassing the need for additional model fine-tuning. This intriguing in-context learning ability has sparked interest in the research community, leading to numerous studies (Min et al., 2022; Olsson et al., 2022; Li et al., 2023). However, the underlying mechanisms that enable transformers to perform in-context learning remain unclear. In an effort to understand the in-context learning behavior of LLMs, Garg et al. (2022) investigated the performance of transformers, trained from scratch, in solving specific function-class learning problems in-context. Notably, transformers exhibited strong performance across all tasks, matching or even surpassing traditional solvers.
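The looped idea described above can be sketched as a single weight-tied block applied repeatedly: the same parameters are reused at every iteration, so depth grows without growing the parameter count. The toy residual map below is a hypothetical stand-in for a real transformer block (all names are illustrative, not from the paper's code):

```python
import numpy as np

def block(h, W):
    # One shared "block" (a toy residual nonlinear map standing in
    # for a transformer layer); the SAME weights W are reused each step.
    return np.tanh(h @ W) + h

def looped_forward(h0, W, n_loops):
    # Apply the weight-tied block n_loops times, emulating an
    # iterative algorithm with a fixed per-iteration parameter count.
    h = h0
    for _ in range(n_loops):
        h = block(h, W)
    return h

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(8, 8))   # shared parameters
h0 = rng.normal(size=(4, 8))             # toy hidden states
out = looped_forward(h0, W, n_loops=12)
```

Increasing `n_loops` deepens the effective computation at inference time while the parameter count stays that of a single block, which is the source of the "less than 10% of the parameters" comparison in the abstract.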
Building on this, Akyürek et al. (2022) explored the transformer-based model's capability to address the linear regression learning problem, interpreting it as an implicit form of established learning algorithms. Their study combined theoretical and empirical perspectives to understand how transformers learn these functions. Subsequently, von Oswald et al. (2022) demonstrated empirically that, when trained to predict the output of a linear function, a linear self-attention-only transformer inherently learns to perform a single step of gradient descent to solve the linear regression task in-context. While the approach and foundational theory presented by von Oswald et al. (2022) are promising, a significant gap remains between the simplified architecture they examined and the standard decoder transformer used in practice.
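The iterative algorithm that von Oswald et al. (2022) relate to linear self-attention is ordinary gradient descent on the squared regression loss, with update w ← w − η Xᵀ(Xw − y). The sketch below shows that plain update on a synthetic noiseless problem (it is not their attention construction, just the algorithm a single attention layer is argued to emulate one step of):

```python
import numpy as np

def gd_step(w, X, y, lr):
    # One gradient-descent step on the loss 0.5 * ||X w - y||^2,
    # the update a linear self-attention layer is argued to implement.
    grad = X.T @ (X @ w - y)
    return w - lr * grad

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))       # in-context inputs
w_true = rng.normal(size=5)        # latent linear function
y = X @ w_true                     # noiseless targets

w = np.zeros(5)
for _ in range(200):               # many steps = many stacked/looped layers
    w = gd_step(w, X, y, lr=0.01)
# w converges toward w_true on this noiseless problem
```

Under this view, stacking (or looping) L such layers corresponds to L gradient steps, which is one motivation for giving the architecture an explicit iterative structure.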


What Can Transformers Learn In-Context? A Case Study of Simple Function Classes

Garg, Shivam, Tsipras, Dimitris, Liang, Percy, Valiant, Gregory

arXiv.org Artificial Intelligence

In-context learning refers to the ability of a model to condition on a prompt sequence consisting of in-context examples (input-output pairs corresponding to some task) along with a new query input, and generate the corresponding output. Crucially, in-context learning happens only at inference time without any parameter updates to the model. While large language models such as GPT-3 exhibit some ability to perform in-context learning, it is unclear what the relationship is between tasks on which this succeeds and what is present in the training data. To make progress towards understanding in-context learning, we consider the well-defined problem of training a model to in-context learn a function class (e.g., linear functions): that is, given data derived from some functions in the class, can we train a model to in-context learn "most" functions from this class? We show empirically that standard Transformers can be trained from scratch to perform in-context learning of linear functions -- that is, the trained model is able to learn unseen linear functions from in-context examples with performance comparable to the optimal least squares estimator. In fact, in-context learning is possible even under two forms of distribution shift: (i) between the training data of the model and inference-time prompts, and (ii) between the in-context examples and the query input during inference. We also show that we can train Transformers to in-context learn more complex function classes -- namely sparse linear functions, two-layer neural networks, and decision trees -- with performance that matches or exceeds task-specific learning algorithms. Our code and models are available at https://github.com/dtsip/in-context-learning .
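The optimal least-squares baseline that the trained transformer is compared against can be reproduced directly: fit a linear map on the in-context (input, output) pairs, then predict at the query input. A minimal sketch with illustrative names and synthetic data (not the paper's evaluation code):

```python
import numpy as np

def least_squares_predict(xs, ys, x_query):
    # Fit w on the in-context examples by least squares,
    # then predict the output at the held-out query input.
    w_hat, *_ = np.linalg.lstsq(xs, ys, rcond=None)
    return x_query @ w_hat

rng = np.random.default_rng(2)
d, n = 5, 10                       # dimension, number of in-context examples
w = rng.normal(size=d)             # latent linear function for this prompt
xs = rng.normal(size=(n, d))       # in-context inputs
ys = xs @ w                        # in-context outputs
x_query = rng.normal(size=d)       # query input

pred = least_squares_predict(xs, ys, x_query)
# with n >= d noiseless examples, least squares recovers w exactly
```

Matching this estimator's error curve as a function of the number of in-context examples is the benchmark the trained transformers meet in the linear-function experiments.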