the tangent kernel cannot be explained from the point of view of "lazy training": when the last layer is non-linear, the
–Neural Information Processing Systems
We thank all reviewers for their insightful and encouraging comments. Theorem 3.2 and the results in Appendix G have been proved previously (e.g., [1]). Our Hessian analysis results, including Theorem 3.2 and Theorem 3.1, are new; we recognize this overlap may cause confusion. The paper focuses mainly on the squared loss, whereas widely deployed NNs use the softmax cross-entropy loss.