Stabilizing Transformer Training by Preventing Attention Entropy Collapse
