Softmax $\geq$ Linear: Transformers may learn to classify in-context by kernel gradient descent
