Neural nets - learning with total gradient rather than stochastic gradients?
The gradient estimated from just a mini-batch is usually good enough to point you in a descent direction, so it rarely makes sense to pay for a full pass over the data to get a marginally better estimate. Plus, the noise introduced by the mini-batch approximation can act as a regularizer. Here is an interesting paper that performs statistical tests during optimization: whenever the gradient estimate is not statistically significant, more samples are added to the mini-batch.
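For a rough sense of how such a scheme works, here is a minimal sketch on a toy least-squares problem. It uses the standard "norm test" idea (grow the batch when the sampling variance of the mini-batch gradient is large relative to its squared norm), which may not be the exact test in the paper; the threshold `theta` and all names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: least squares, loss_i(w) = 0.5 * (x_i @ w - y_i)^2
n, d = 10_000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def per_sample_grads(w, idx):
    """Gradient of each sample's loss: grad_i = (x_i @ w - y_i) * x_i."""
    r = X[idx] @ w - y[idx]          # residuals, shape (b,)
    return r[:, None] * X[idx]       # shape (b, d)

w = np.zeros(d)
lr, batch = 0.05, 32
theta = 1.0                          # significance threshold (assumed value)

for step in range(200):
    idx = rng.choice(n, size=batch, replace=False)
    G = per_sample_grads(w, idx)
    g = G.mean(axis=0)               # mini-batch gradient estimate

    # Norm test: estimate the variance of the mean gradient. If the
    # noise swamps the signal, the descent direction is not trustworthy,
    # so add more samples to the mini-batch instead of taking a step.
    var_of_mean = G.var(axis=0, ddof=1).sum() / batch
    if var_of_mean > theta**2 * np.dot(g, g) and batch < n:
        batch = min(2 * batch, n)
        continue

    w -= lr * g

print("final batch size:", batch)
print("distance to w_true:", np.linalg.norm(w - w_true))
```

Note how the batch naturally grows as optimization converges: near a minimum the true gradient shrinks toward zero, so a small sample can no longer distinguish it from noise and the test demands more data.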
Sep-12-2016, 20:15:51 GMT