Time Matters in Regularizing Deep Networks: Weight Decay and Data Augmentation Affect Early Learning Dynamics, Matter Little Near Convergence

Neural Information Processing Systems 

Regularization is typically understood as improving generalization by altering the landscape of local extrema to which the model eventually converges.
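As a concrete illustration of this conventional view (a minimal sketch, not code from the paper), weight decay adds an L2 penalty (λ/2)‖w‖² to the loss, so every gradient step is shifted by λw and the point of convergence itself moves; the function names and the toy quadratic objective below are illustrative assumptions:

```python
import numpy as np

def sgd_step(w, grad, lr=0.1, weight_decay=0.0):
    """One SGD step; weight_decay (lambda) adds the L2 penalty's gradient lambda * w."""
    return w - lr * (grad + weight_decay * w)

# Toy objective f(w) = 0.5 * (w - 2)^2, so grad f = w - 2.
w_plain = np.array([0.0])
w_decay = np.array([0.0])
for _ in range(200):
    w_plain = sgd_step(w_plain, w_plain - 2.0)
    w_decay = sgd_step(w_decay, w_decay - 2.0, weight_decay=0.5)

# Without decay SGD converges to the unregularized minimizer w = 2.
# With decay it converges to argmin 0.5*(w-2)^2 + 0.25*w^2 = 2 / (1 + 0.5) = 4/3,
# i.e. the regularizer has altered which extremum the iterates reach.
```

Under the view the paper questions, this shift of the final solution is the whole story; the paper's thesis is instead that the regularizer's effect on *early* training dynamics is what matters.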