Small-scale proxies for large-scale Transformer training instabilities
