Dynamic Stashing Quantization for Efficient Transformer Training