Transformers Implement Functional Gradient Descent to Learn Non-Linear Functions In Context

Open in new window