Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent
