the tangent kernel cannot be explained from the point of view of "lazy training": when the last layer is non-linear, the

Neural Information Processing Systems 

We thank all reviewers for their insightful and encouraging comments. Theorem 3.2 and the results in Appendix G have been proved previously (e.g., [1]). Our Hessian analysis results, including Theorem 3.2 and Theorem 3.1, are new. We acknowledge that this may cause confusion. The paper focuses mostly on the squared loss, whereas widely deployed NNs use the softmax cross-entropy loss.