Stable and low-precision training for large-scale vision-language models