Reviews: Convergence Analysis of Two-layer Neural Networks with ReLU Activation

Neural Information Processing Systems 

This paper proves the convergence of a stochastic gradient descent algorithm from a suitable starting point to the global minimizer of a nonconvex energy representing the loss of a two-layer feedforward network with rectified linear unit activation. In particular, the algorithm is shown to converge in two phases: phase 1 drives the iterates into a one-point convex region, and phase 2 then yields the actual convergence. The findings, the analysis, and particularly the two-phase methodology for proving convergence are very interesting and definitely deserve to be published. The entire proof is extremely long (including a flowchart of 15 lemmas and theorems that together establish the main theorem, over 25 pages of proofs), and I have to admit that I did not check this part.

I have some questions on the paper and suggestions to further improve the manuscript:

- Figure 1 points out the different structure of the considered networks, and even the abstract already refers to the special structure of "identity mappings". However, optimizing for W (vanilla network) is equivalent to optimizing for (W + I), such that the difference seems to lie in different starting points only.
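The reparametrization point in the last comment can be checked numerically. The following is a minimal NumPy sketch, not the authors' code, and it rests on several assumptions: the comparison is read as W versus W + I, the network is taken to be f(x) = sum_j ReLU(v_j^T x), the loss is a squared error against a hypothetical teacher network `V_star`, and full-batch gradient descent stands in for SGD. All function names are illustrative. Under these assumptions, running gradient descent on the residual parameter W (with weights I + W) from W_0 = 0 produces the same weights as running it on the vanilla parameter V from V_0 = I.

```python
import numpy as np


def relu_net(V, X):
    # Assumed model: f(x) = sum_j ReLU(v_j^T x), with v_j the columns of V.
    # X holds one sample per row.
    return np.maximum(X @ V, 0.0).sum(axis=1)


def grad(V, X, y):
    # Gradient of the mean of 0.5 * (f(x) - y)^2 with respect to V.
    pre = X @ V                                  # pre-activations, shape (n, d)
    resid = np.maximum(pre, 0.0).sum(axis=1) - y  # residuals, shape (n,)
    active = (pre > 0).astype(float)             # ReLU derivative, shape (n, d)
    return X.T @ (resid[:, None] * active) / X.shape[0]


def run_both(steps=20, lr=0.05, d=4, n=256, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    # Hypothetical teacher weights close to the identity.
    V_star = np.eye(d) + 0.1 * rng.standard_normal((d, d))
    y = relu_net(V_star, X)

    W = np.zeros((d, d))  # residual parametrization, weights are I + W
    V = np.eye(d)         # vanilla parametrization, started at I + W_0
    for _ in range(steps):
        # d/dW of loss(I + W) equals d/dV of loss(V) evaluated at V = I + W.
        W = W - lr * grad(np.eye(d) + W, X, y)
        V = V - lr * grad(V, X, y)
    return W, V


W, V = run_both()
print(np.allclose(np.eye(4) + W, V))
```

In this sketch the two trajectories agree up to floating-point rounding, which is consistent with the comment that the identity mapping acts as a choice of initialization rather than a structural change. The sketch does not address any argument in the paper that depends on the identity term beyond initialization.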