approximate fisher information
Understanding Approximate Fisher Information for Fast Convergence of Natural Gradient Descent in Wide Neural Networks
Natural Gradient Descent (NGD) helps to accelerate the convergence of gradient descent dynamics, but it requires approximations in large-scale deep neural networks because of its high computational cost. Empirical studies have confirmed that some NGD methods with approximate Fisher information converge sufficiently fast in practice. Nevertheless, it remains unclear from a theoretical perspective why and under what conditions such heuristic approximations work well. In this work, we reveal that, under specific conditions, NGD with approximate Fisher information achieves the same fast convergence to global minima as exact NGD. We consider deep neural networks in the infinite-width limit, and analyze the asymptotic training dynamics of NGD in function space via the neural tangent kernel.
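For intuition, the following is a minimal sketch of the function-space picture the abstract refers to, under the standard NTK linearization with a squared-error loss and up to normalization constants; the notation (f_t, y, \Theta, F, \eta) is illustrative and not taken verbatim from the paper.

% Illustrative notation (assumed, not the paper's): f_t = network outputs on the
% training inputs, y = targets, \Theta = NTK Gram matrix, F = Fisher information
% matrix, \eta = learning rate.
%
% Gradient descent in function space (NTK linearization, squared-error loss):
\[ f_{t+1} - y \;\approx\; \bigl(I - \eta\,\Theta\bigr)\,(f_t - y) \]
% Exact NGD, \theta_{t+1} = \theta_t - \eta\,F^{+}\,\nabla_\theta \mathcal{L}(\theta_t),
% preconditions away the kernel, so every residual mode decays at the same rate:
\[ f_{t+1} - y \;\approx\; (1 - \eta)\,(f_t - y) \]

The paper's claim is that this isotropic, fast contraction is preserved in the infinite-width limit when the exact Fisher is replaced by suitable approximations.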
Review for NeurIPS paper: Understanding Approximate Fisher Information for Fast Convergence of Natural Gradient Descent in Wide Neural Networks
Additional Feedback:
Line 1: Change to "Natural Gradient Descent...".
Lines 10, 11: "the function space" should just be "function space".
Line 15: It might be worth pointing out here and/or in the intro that a special kind of data preprocessing (the "Forster transform") is required to get this result for K-FAC in general.
Lines 16, 46: "under some assumptions"/"under specific conditions" should perhaps be replaced with "under some approximating assumptions". AFAIK the "gradient independence assumption" doesn't have any rigorous justification and might not even be true in practice.
Line 69: "New insights and perspectives on the natural gradient method" also argues that the empirical Fisher is a poor substitute for the "true" one.
Line 71: The first quotation mark is backwards.
Line 79: Delete "firing" here.
Line 88: "We normalize each sample by" should be "We normalize each sample so that".
Line 90: "we overview" should be "we give an overview of".
Line 116: Although the use of damping in the context of NTK theory can be explained this way, damping plays a larger role in second-order optimization in general (where NTK theory doesn't necessarily apply). As written, it sounds as though its use is fully explained by this theory; I would suggest rephrasing.
Review for NeurIPS paper: Understanding Approximate Fisher Information for Fast Convergence of Natural Gradient Descent in Wide Neural Networks
This is a compelling paper which covers a lot of ground while keeping the presentation accessible and engaging for the reader. Interestingly, it finds that the K-FAC approximations match the exact NGD trajectory in function space but not weight space. The paper answers quite a lot of questions which are natural to ask, and (having worked a lot in this area) I found the answers interesting and novel. The reviewers seem to have checked it over pretty carefully and didn't spot any problems. The paper is well written, and the authors have clearly paid a lot of attention to the presentation of the ideas.