Dynamics of Stochastic Momentum Methods on Large-scale, Quadratic Models Supplementary material

Neural Information Processing Systems 

The appendix is organized into five sections as follows: 1. Appendix A derives the Volterra equation and proves the main result for the homogenized SGD (Theorem 1). 2. We show in Appendix B a heuristic derivation of the homogenized SGD approximation to the SDA class of algorithms on the least squares problem and we show that SGD and homogenized SGD are close under orthogonal invariance (Theorem 2). 3. We give in Appendix C a general overview of the analysis of a convolution Volterra equation of the type that arises in the SDA class. Unless otherwise stated, all the results hold under Assumptions 1 and 2. We include all statements from the previous sections for clarity. The results presented in this paper concern the analysis of existing methods and a new method that is a variant of an existing method. The results are theoretical and we do not anticipate any direct ethical and societal issues. We believe the results will be used by machine learning practitioners and we encourage them to use it to build a more just, prosperous world. A.1 Homogenized SGD We recall that the diffusion model is given by dXt = 2 dZt 1 To connect these diffusions to SGD on the least squares problem (2.1) f(x)= 1 2 kAx bk2, we will use the singular value decomposition of U VT of A. We order the singular values 1 2 3 in decreasing order. We then let t = VT(Xt ex), where we recall that b = Aex+ . We may do a similar computation with N and conclude that: J(1) = 2 2 2jJ 2 1 '(t) '(s)d s,j In summary, we may express J in terms of N by J(1) = 2 2 2jJ 1 '2(t) N(1) + 22 dh t,jiwith J(0) = EH When (k,n)= k+n and thus '(t)=(1+ t) with (t)= 1+t, the corresponding ODE is precisely bJ(3) The other case is when (k,n)= n, or '(t)=exp( t). We call this the general SDAHB; one recovers SDAHB when 1 =, 2 =0, and = .