Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation
