Reviews: Train longer, generalize better: closing the generalization gap in large batch training of neural networks
–Neural Information Processing Systems
I think the paper provides some clarity on a topic that has seen a bit of attention lately, namely the role of noise in optimization and, in particular, the sharp minima/flat minima hypothesis. From this perspective I think this datapoint is important for our collective understanding of training deep networks. I don't think the observations made by the authors will come as a surprise to anyone with experience with these models; however, the final conclusion might. We know that with large minibatches the gradient estimate has lower variance, and hence we should use a larger learning rate, etc. I think one practical issue that people have gotten stuck on in the past is that with larger minibatches the computational cost of any given gradient increases.
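To make the variance point concrete, here is a minimal sketch under a toy assumption: each per-example "gradient" is an independent standard normal scalar, so the minibatch gradient is just the mean of `batch_size` draws. The function name `minibatch_mean_variance` and its parameters are hypothetical and chosen for illustration; this is not the paper's experimental setup. Under that assumption the variance of the minibatch estimate should shrink roughly as 1/B, which is the usual motivation for raising the learning rate with batch size.

```python
import random
import statistics


def minibatch_mean_variance(batch_size, trials=2000, seed=0):
    """Empirical variance of the mean of `batch_size` i.i.d. N(0, 1) draws.

    Toy stand-in for the variance of a minibatch gradient estimate;
    the i.i.d. Gaussian assumption is for illustration only.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    means = [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(batch_size))
        for _ in range(trials)
    ]
    return statistics.pvariance(means)


if __name__ == "__main__":
    # Variance times batch size should stay close to 1 if variance ~ 1/B.
    for b in (1, 4, 16, 64):
        v = minibatch_mean_variance(b)
        print(f"B={b:3d}  variance={v:.4f}  variance*B={v * b:.3f}")
```

The sketch only illustrates the noise reduction; how far the learning rate should be raised to compensate is a separate question that the toy model does not answer.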
Oct-8-2024, 05:41:51 GMT