Adam Accumulation to Reduce Memory Footprints of both Activations and Gradients for Large-scale DNN Training
