Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention
