Reviews: Momentum-Based Variance Reduction in Non-Convex SGD

Neural Information Processing Systems 

I agree with R3 that you did a poor job of relating your work to existing methods, in particular SARAH. Please also make sure that you carefully address the question of optimality. I also realized that your method in fact has nothing to do with momentum. Consider, for instance, a deterministic objective, f(x, \xi) = f(x). If one has a tight estimate, i.e. d_{t-1} = \nabla f(x_{t-1}), then from your update rules it follows that d_t = \nabla f(x_t), i.e. the method becomes gradient descent with no momentum! Your title is thus very confusing, and I highly encourage you to change it.
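
To spell out the deterministic case, here is a short derivation. It is a sketch that assumes the estimator update has the recursive form in the first line below; the update rule is not restated in this review, and the symbol a_t for the momentum weight is an assumption introduced here for illustration.

```latex
\begin{align*}
% Assumed form of the estimator update (a_t is a hypothetical name for the momentum weight):
d_t &= \nabla f(x_t, \xi_t) + (1 - a_t)\bigl(d_{t-1} - \nabla f(x_{t-1}, \xi_t)\bigr) \\
% Deterministic objective, f(x, \xi) = f(x):
    &= \nabla f(x_t) + (1 - a_t)\bigl(d_{t-1} - \nabla f(x_{t-1})\bigr) \\
% Tight estimate, d_{t-1} = \nabla f(x_{t-1}), so the correction term vanishes:
    &= \nabla f(x_t).
\end{align*}
```

Under this assumed form, the conclusion holds for every value of a_t, so the weight plays no role once the estimate is exact, and the iterates follow plain gradient descent.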