Small-scale proxies for large-scale Transformer training instabilities