Appendix to: Optimization and Generalization of Shallow Neural Networks with Quadratic Activation Functions


A.1 Proof of Lemma 2.1

Inserting $A(t)$ as defined in (4) into (3), we arrive at
$$ \dot A(t) = -\big(X\,A(t) + A(t)\,X\big), \qquad (A.1) $$
where we used the definition of $X$ given in (5). This proves that $\dot A(t)$ is equal to the rightmost expression in (5). Moreover, since
$$ X = \nabla_A E_n\big(A(t)\big), \qquad (A.2) $$
equation (A.1) can be written as $\dot A = -\big(A\,\nabla_A E_n(A) + \nabla_A E_n(A)\,A\big)$.

A.2 Proof of Lemma 2.2

Equations (8) and (9) can be derived from (5) and (6) by taking their expectation over the data $x_\nu$, owing to the fact that the data is Gaussian and using Wick's theorem, which asserts that
$$ \mathbb{E}\,[x_i x_j x_k x_l] = \delta_{ij}\delta_{kl} + \delta_{ik}\delta_{jl} + \delta_{il}\delta_{jk} $$
for $x \sim \mathcal N(0, I_d)$; the resulting Gaussian moment identities are collected at the end of this appendix. Note that this derivation can be generalized to non-Gaussian data; see Ref. [1] for details.

A.3 Proximal scheme

We note that (5) (and similarly (8), if we use the population loss in (9) instead of the empirical loss in (6)) can be viewed as the time-continuous limit of a simple proximal scheme involving the Cholesky decomposition of $A$ and the standard Frobenius norm as Bregman distance; a sketch of one such scheme is given after Proposition A.1. We state this result as:

Proposition A.1. The iterates $A_p$ of this proximal scheme with step size $\tau$ and initial condition $A_0$ satisfy
$$ A_p \to A(t) \quad \text{as } \tau \to 0,\; p \to \infty \text{ with } p\tau \to t, \qquad (A.5) $$
where $A(t)$ solves (5) for the initial condition $A(0) = A_0$.
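For concreteness, one way such a proximal scheme can be realized is sketched below; the factor $B_p$, the step size $\tau$, and the specific form of the iteration are illustrative choices made here rather than definitions taken from the main text, and $E_n$ denotes the empirical loss of (6). Writing $A_p = B_p B_p^\top$ with $B_p$ a Cholesky-type factor, a proximal step with the Frobenius norm as distance on the factor reads
$$ B_{p+1} = \operatorname*{arg\,min}_{B}\Big[\, E_n\big(B B^\top\big) + \frac{1}{2\tau}\,\big\|B - B_p\big\|_F^2 \,\Big], \qquad A_{p+1} = B_{p+1} B_{p+1}^\top . $$
Expanding the optimality condition of this minimization and letting $\tau \to 0$ formally recovers, up to constant factors, an anticommutator flow of the form (A.1), consistent with the limit stated in (A.5).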
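As an illustration of the averaging step in A.2 above, the Wick identity implies the following moment formulas, stated here for a generic symmetric $d \times d$ matrix $M$ (a placeholder introduced only for this remark) and $x \sim \mathcal N(0, I_d)$:
$$ \mathbb{E}\big[(x^\top M x)\,x x^\top\big] = 2M + \operatorname{tr}(M)\,I_d, \qquad \mathbb{E}\big[(x^\top M x)^2\big] = 2\,\|M\|_F^2 + \big(\operatorname{tr} M\big)^2 . $$
Identities of this type are what turn the empirical, data-dependent expressions in (5) and (6) into their population counterparts (8) and (9) once the expectation over the Gaussian data is taken.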