Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning?
