Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent

Zhang, Chenyang; Cao, Yuan

arXiv.org Machine Learning 

One widely recognized explanation for the empirical success of transformers is their ability to perform in-context learning (ICL): pretrained transformers can perform previously unseen tasks based on demonstrations and examples in the prompt, without any additional task-specific fine-tuning (Brown et al., 2020). A line of recent work interprets this ICL capability from an algorithmic perspective, viewing transformers as models that can implicitly execute certain learning algorithms on the context examples. Specifically, Garg et al. (2022) propose a theoretical framework for ICL in terms of learning a hypothesis class, and show empirically that transformers can in-context learn the class of linear functions. Motivated by this empirical finding, several recent works attempt to study theoretically how transformers perform in-context learning on linear regression tasks. Akyürek et al. (2022) and Von Oswald et al. (2023) construct multi-layer transformers with linear attention that can execute gradient descent on an "in-context loss" defined on the context data, thereby enabling in-context learning of linear regression.
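To make the link between gradient descent on an in-context loss and linear attention concrete, the following minimal sketch compares two ways of computing the same prediction. It is an illustration under simplifying assumptions rather than the construction of any cited paper: the in-context loss is taken to be the least-squares loss L(w) = 1/(2n) * sum_i (<w, x_i> - y_i)^2, gradient descent takes a single step from w = 0, and the "attention" is a single softmax-free head with keys x_i, values y_i, and the query input as the query. The function names, the step size `lr`, and the synthetic data are all hypothetical.

```python
import random


def gd_step_prediction(xs, ys, x_query, lr):
    """Prediction for x_query after one gradient-descent step from w = 0
    on the assumed loss L(w) = 1/(2n) * sum_i (<w, x_i> - y_i)^2."""
    n, d = len(xs), len(x_query)
    # The gradient at w = 0 is -(1/n) * sum_i y_i * x_i,
    # so the updated weight is w_1 = (lr / n) * sum_i y_i * x_i.
    w1 = [lr / n * sum(y * x[j] for x, y in zip(xs, ys)) for j in range(d)]
    return sum(wj * qj for wj, qj in zip(w1, x_query))


def linear_attention_prediction(xs, ys, x_query, lr):
    """The same quantity written as linear (softmax-free) attention:
    scores are inner products <x_i, x_query>, values are the labels y_i."""
    n = len(xs)
    scores = [sum(a * b for a, b in zip(x, x_query)) for x in xs]
    return lr / n * sum(y * s for y, s in zip(ys, scores))


# Hypothetical synthetic linear-regression prompt: n context pairs in d dimensions.
rng = random.Random(0)
d, n = 3, 8
w_star = [rng.gauss(0, 1) for _ in range(d)]
xs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
ys = [sum(w * xi for w, xi in zip(w_star, x)) for x in xs]
x_query = [rng.gauss(0, 1) for _ in range(d)]

pred_gd = gd_step_prediction(xs, ys, x_query, lr=0.5)
pred_attn = linear_attention_prediction(xs, ys, x_query, lr=0.5)
print(abs(pred_gd - pred_attn) < 1e-9)
```

The two functions agree up to floating-point error because both evaluate (lr / n) * sum_i y_i <x_i, x_query>. This is the sense in which a linear attention layer can emulate a gradient step on the context data. Multi-step or multi-layer versions would need additional bookkeeping that this sketch omits.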