Reviews: Fast Convergence of Natural Gradient Descent for Over-Parameterized Neural Networks
Neural Information Processing Systems
After rebuttal: I have carefully read the comments from the other reviewers and the authors' feedback. My main concern was the generalization ability of NGD, and the experiments in the feedback are a bit confusing to me: in the MNIST regression experiment, GD does not seem to reach zero training loss, while NGD converges to zero very quickly. I would suggest the authors provide more details about that experimental setting, e.g., how the hyperparameters were selected. I therefore keep my score unchanged.

The proof framework follows the recent line of work on over-parameterization, e.g., the papers by Du et al., Li and Liang, and Allen-Zhu et al., the core of which is the analysis of the Gram matrix.
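For context on the quantity this line of work centers on, here is a minimal NumPy sketch (my own illustration, not the authors' code or exact setup): it forms the Gram matrix G_ij = <grad f(x_i), grad f(x_j)> for a one-hidden-layer ReLU network with fixed output weights and only the hidden layer trained, then takes a damped Gauss-Newton step, which coincides with NGD under squared loss. The width m, step size eta, damping lam, and the random toy data are all arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 8, 5, 2048                  # samples, input dim, width (over-parameterized)
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.normal(size=n)
W = rng.normal(size=(m, d))           # hidden weights (trained)
a = rng.choice([-1.0, 1.0], size=m)   # output weights (fixed)

def forward(W):
    # f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x)
    return (np.maximum(X @ W.T, 0.0) @ a) / np.sqrt(m)

def jacobian(W):
    # Row i is d f(x_i) / d vec(W); relu'(z) is an indicator.
    act = (X @ W.T > 0).astype(float)            # n x m
    J = (act * a)[:, :, None] * X[:, None, :]    # n x m x d
    return J.reshape(n, m * d) / np.sqrt(m)

eta, lam = 1.0, 1e-8
for step in range(5):
    u = forward(W)
    J = jacobian(W)
    # Gram (NTK) matrix: the object the over-parameterization proofs control.
    G = J @ J.T
    # Damped NGD / Gauss-Newton step: theta <- theta - eta * J^T (G + lam I)^{-1} (u - y).
    delta = J.T @ np.linalg.solve(G + lam * np.eye(n), u - y)
    W -= eta * delta.reshape(m, d)
    print(step, np.linalg.norm(forward(W) - y))
```

In the linearized (wide-network) regime this step contracts the residual u - y by a factor of roughly (1 - eta) per iteration regardless of the conditioning of G, which is the intuition behind the fast-convergence claim, whereas plain GD contracts at a rate governed by G's smallest eigenvalue.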